
# BPX-Challenge Documentation

## Table of Contents

1. Environment Setup
2. Data Preprocessing
3. Model Training and Evaluation
4. Inference

## Environment Setup

This project uses a Python virtual environment. Run the commands below in a terminal to install all requirements:

```sh
chmod +x setup.sh
sh setup.sh
source jtk/bin/activate
```

Note: The setup has been tested successfully on Intel-based Macs, Ubuntu, and Windows, on both CPU and GPU. It does not work well on Apple Silicon Macs; metrics computed there are notably lower.

For GPUs with CUDA versions older than 12.0, you can install PyTorch with `pip install torch==2.2.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html`.
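Once the environment is active, a quick sanity check can confirm that the virtual environment and (optionally) the GPU stack are in place. This is a minimal sketch using a hypothetical `env_report` helper (not part of the repo); it does not assume PyTorch is already installed:

```python
# Sanity check for the project environment (hypothetical helper, not in the repo).
import sys

def env_report():
    # A virtual environment is active when sys.prefix differs from base_prefix.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    try:
        import torch  # installed by setup.sh
        torch_version = torch.__version__
        cuda = torch.cuda.is_available()
    except ImportError:
        torch_version, cuda = None, False
    return {
        "python": sys.version_info[:2],
        "venv_active": in_venv,
        "torch": torch_version,
        "cuda": cuda,
    }

if __name__ == "__main__":
    print(env_report())
```

If `torch` is `None` or `cuda` is `False` where you expected a GPU, re-run `setup.sh` or install the CUDA-specific wheel noted above.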

## Preprocessing the Dataset

Steps were taken to ensure that new datasets can easily be generated for different prediction durations. Major folder names and the prediction fail window can be modified in the environment file `./.env`. You can specify a `SLIDE_N` value in the environment file to generate binary labels for a chosen number of days before an ESP test fails. To execute any of the scripts, the virtual environment `jtk` must be active.

1. Generate the labels for the training data contained in the folder `Train`:

   ```sh
   python scripts/preprocess_labels.py --folderpath "Train"
   ```

   Note: This creates a folder `./Processed_14`, where 14 is the `SLIDE_N` value specified in the environment file.

2. Remove local outliers and crop the data into daily windows for training and prediction:

   ```sh
   python scripts/crop_processed_data.py -c=True
   ```

   Note: This creates a folder `./Cropped_14`. The script runs across multiple processors because of its numerous for loops, and it takes a while to complete (roughly 1.5 hours on an old MacBook with 4 processors).

Both scripts need to be run only once.
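To make the `SLIDE_N` idea concrete, the sliding-window labeling can be sketched as below. This is a hypothetical simplification (`label_days` is not part of the repo); the actual logic lives in `scripts/preprocess_labels.py`:

```python
# Simplified sketch of SLIDE_N binary labeling (hypothetical, not the repo's code).
from datetime import date, timedelta

def label_days(dates, fail_date, slide_n=14):
    """Label a day 1 if it falls within slide_n days before (and including)
    the failure date, else 0."""
    window_start = fail_date - timedelta(days=slide_n)
    return [1 if window_start <= d <= fail_date else 0 for d in dates]

# Example: 20 daily records with a failure on 2022-09-18 and SLIDE_N = 14.
days = [date(2022, 9, 1) + timedelta(days=i) for i in range(20)]
labels = label_days(days, fail_date=date(2022, 9, 18), slide_n=14)
```

With these assumptions, the 15 days from 2022-09-04 through the failure date are labeled 1 and the rest 0.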

## Training the Model

1. Arguments to the train script can be specified. By default, the data folder is read from the env file. Additional arguments include:

   - `split` - train-test split percentage
     - validation data defaults to 10% of the training data
   - `batch_size` - batch size of the training dataset
   - `learning_rate` - learning rate used to update the model weights
   - `num_epochs` - number of epochs to train the model for
   - `dropout` - dropout percentage during training
   - `num_layers` - number of layers in the model

   ```sh
   python scripts/train.py
   ```

   Training progress can be visualized in the Jupyter notebook `tensorboard_viz.ipynb`.
2. The training process saves the best checkpoint for each training run. To evaluate a model checkpoint on the test dataset:

   ```sh
   python scripts/evaluate.py --chkpt="checkpoints/best-chckpt-v21.ckpt"
   ```

   ```
   ────────────────────────────────────────
   Test metric             DataLoader 0
   ────────────────────────────────────────
   test_acc                 0.915
   test_fbeta               0.923
   test_loss                0.275
   ────────────────────────────────────────
   ```
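For reference on the `test_fbeta` metric reported above, a plain-Python F-beta score can be sketched as follows. This is an illustrative `fbeta` helper, not the repo's implementation, and the beta value actually used by `evaluate.py` is not stated here:

```python
# Illustrative F-beta computation on binary labels (not the repo's metric code).
def fbeta(y_true, y_pred, beta=1.0):
    """F-beta = (1 + b^2) * P * R / (b^2 * P + R) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With `beta > 1` the score weights recall more heavily than precision, which is a common choice for failure-prediction tasks where missed failures are costly.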

## Inference

1. Inference assumes each well is saved using the naming convention `<well_api>.csv`. Predictions can be generated by specifying the following parameters:

   - `chkpt` - path to the best model checkpoint; specify in the env file for a default
   - `api` - well API number; it has no default value
   - `train_folder` - name of the training folder where all wells are saved; defaults to `"train"`
   - `prob` - probability cutoff for each prediction; defaults to 0.8

   ```sh
   python scripts/inference.py --api="z8j7xj31j7" --train_folder=train --prob=0.8
   ```

   The script calls a function that generates a dataframe formatted according to the submission outline.

   The function can also be used in a Jupyter notebook, where it returns a dataframe:

   ```python
   from scripts.inference import ESP_Predictor

   model = ESP_Predictor(checkpoint_path="best-chckpt-v21.ckpt",
                         api="z8j7xj31j7",
                         csv_folder_path="Train",
                         prob=0.8)

   pred = model.predict(save_csv=False)
   print(pred)
   ```

   ```
   API          Date         Fail  Probability
   "z8j7xj31j7" "2022-09-12" 0     0.17373599
   "z8j7xj31j7" "2022-09-13" 0     0.39669976
   "z8j7xj31j7" "2022-09-14" 0     0.43361512
   ```
2. To reproduce the submission files, run `submission.py`, which generates predictions for all the wells in the specified folder `Train`:

   ```sh
   python scripts/submission.py --folder 'Train'
   ```
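For context on the `prob` parameter used throughout inference, converting per-day failure probabilities into binary `Fail` flags can be sketched as below (a hypothetical `apply_cutoff` helper, not the repo's code):

```python
# Sketch of the probability-cutoff step behind the --prob argument
# (hypothetical helper, not the repo's implementation).
def apply_cutoff(probabilities, prob=0.8):
    """Flag a day as a predicted failure when its probability meets the cutoff."""
    return [1 if p >= prob else 0 for p in probabilities]

apply_cutoff([0.17, 0.40, 0.85])  # -> [0, 0, 1]
```

Lowering the cutoff trades precision for recall: more days are flagged as failures, catching more true failures at the cost of more false alarms.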