Python virtual environments are used for this project. Run the commands below in a terminal to install all requirements:

```shell
~$ chmod +x setup.sh
~$ sh setup.sh
~$ source jtk/bin/activate
```
Note: The setup has been tested successfully on Intel-based Macs, Ubuntu, and Windows, on both CPU and GPU. It does not work well on Apple Silicon Macs, and the metrics computed there are notably lower.
For GPUs with CUDA versions older than 12.0, you can install PyTorch with:

```shell
~$ pip install torch==2.2.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
```
Steps were taken to ensure that new datasets can easily be generated for different prediction durations. Major folder names and the prediction fail window can be modified in the environment file `/.env`. You can specify a `SLIDE_N` value in the environment file to generate binary labels for the number of days before an ESP test fails. To execute any of the scripts, the virtual environment `jtk` must be active.
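As a sketch of how such binary labels could be derived, consider the snippet below. The real logic lives in `scripts/preprocess_labels.py`; the helper name, the 14-day default, and the date handling here are illustrative assumptions.

```python
from datetime import date, timedelta

def label_days_before_failure(dates, failure_date, slide_n=14):
    """Label each date 1 if it falls within slide_n days before the failure, else 0.

    Illustrative sketch only; the project's preprocess_labels.py holds the real logic.
    """
    window_start = failure_date - timedelta(days=slide_n)
    return [1 if window_start <= d <= failure_date else 0 for d in dates]

# Thirty consecutive days, with an (assumed) failure on 2022-09-20
dates = [date(2022, 9, 1) + timedelta(days=i) for i in range(30)]
labels = label_days_before_failure(dates, failure_date=date(2022, 9, 20), slide_n=14)
```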
- Generate the labels for the training data contained in the folder "Train":

  ```shell
  ~$ python scripts/preprocess_labels.py --folderpath "Train"
  ```

  Note: This creates a folder `./Processed_14`, where 14 is the `SLIDE_N` specified in the environment file.

- Remove local outliers and crop the data to daily windows for training and prediction:

  ```shell
  ~$ python scripts/crop_processed_data.py -c=True
  ```

  Note: This creates a folder `./Cropped_14`. The script runs across multiple processors because of its numerous for loops, and it takes a while to complete (roughly 1.5 hours on an old MacBook with 4 processors).

Both scripts only need to be run once.
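The cropping step could be sketched roughly as below. This is a simplified, single-process illustration; the function name and row format are assumptions, and the real script additionally removes local outliers and parallelizes the work.

```python
from collections import defaultdict
from datetime import datetime

def crop_to_daily_windows(rows):
    """Group (timestamp, value) rows into per-day windows keyed by calendar date.

    Simplified sketch; the project's crop_processed_data.py holds the real logic.
    """
    windows = defaultdict(list)
    for ts, value in rows:
        windows[ts.date()].append(value)
    return dict(windows)

# Illustrative sensor readings spanning two days
rows = [
    (datetime(2022, 9, 12, 0, 0), 1.0),
    (datetime(2022, 9, 12, 12, 0), 2.0),
    (datetime(2022, 9, 13, 6, 0), 3.0),
]
windows = crop_to_daily_windows(rows)
```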
- Arguments to the train script can be specified. By default, the data folder is obtained from the env file. Additional arguments:
  - `split` - train-test split percentage. Validation data defaults to 10% of the training data.
  - `batch_size` - batch size of the training dataset.
  - `learning_rate` - learning rate used to update the model weights.
  - `num_epochs` - number of epochs to train the model for.
  - `dropout` - percentage of dropout during training.
  - `num_layers` - number of layers you want the model to have.

  ```shell
  ~$ python scripts/train.py
  ```
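A sketch of how these arguments might be parsed with `argparse` follows. The default values shown are illustrative assumptions, not the script's actual values (the data folder, for instance, really comes from the env file).

```python
import argparse

def build_train_parser():
    """Argument parser mirroring the documented train.py options.

    Defaults here are assumptions for illustration only.
    """
    parser = argparse.ArgumentParser(description="Train the ESP failure model")
    parser.add_argument("--split", type=float, default=0.8,
                        help="train-test split percentage")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="batch size of the training dataset")
    parser.add_argument("--learning_rate", type=float, default=1e-3,
                        help="learning rate used to update the model weights")
    parser.add_argument("--num_epochs", type=int, default=20,
                        help="number of epochs to train the model for")
    parser.add_argument("--dropout", type=float, default=0.1,
                        help="percentage of dropout during training")
    parser.add_argument("--num_layers", type=int, default=2,
                        help="number of layers in the model")
    return parser

# Parse an example command line rather than sys.argv
args = build_train_parser().parse_args(["--batch_size", "64", "--num_epochs", "5"])
```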
- Training progress can be visualized in the Jupyter notebook `tensorboard_viz.ipynb`.
- The training process saves the best checkpoint for each training run. To evaluate a model's checkpoint on the test dataset:

  ```shell
  ~$ python scripts/evaluate.py --chkpt="checkpoints/best-chckpt-v21.ckpt"
  ```

  | Test metric | DataLoader 0 |
  | ----------- | ------------ |
  | test_acc    | 0.915        |
  | test_fbeta  | 0.923        |
  | test_loss   | 0.275        |
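The `test_fbeta` metric is an F-beta score. As a minimal sketch of how it can be computed from raw prediction counts (the beta value the project uses is not stated here, so `beta=1.0` below is an assumption):

```python
def fbeta_score(tp, fp, fn, beta=1.0):
    """F-beta from raw counts: a weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta=1 gives the standard F1 score.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With precision = recall = 0.9, F1 is also 0.9
score = fbeta_score(tp=90, fp=10, fn=10, beta=1.0)
```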
- Inference assumes each well is saved using the naming nomenclature `<well_api>.csv`. Predictions can be generated by specifying the following parameters:
  - `chkpt` - path to the best model checkpoint. Specify in the env file for a default.
  - `api` - well API number. It has no default value.
  - `train_folder` - name of the training folder where all wells are saved. Defaults to "train".
  - `prob` - probability cutoff for each prediction. Defaults to 0.8.

  ```shell
  ~$ python scripts/inference.py --api="z8j7xj31j7" --train_folder=train --prob=0.8
  ```
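The `prob` cutoff could be applied roughly as follows. This is a hedged sketch; the helper name and the `>=` comparison are assumptions, not the script's actual implementation.

```python
def apply_cutoff(probabilities, prob=0.8):
    """Turn per-day failure probabilities into binary Fail flags via a cutoff.

    Illustrative only: probabilities at or above the cutoff become 1, else 0.
    """
    return [1 if p >= prob else 0 for p in probabilities]

# Illustrative probabilities; only the last one clears the 0.8 cutoff
flags = apply_cutoff([0.17, 0.39, 0.85], prob=0.8)
```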
The script calls a function that generates a dataframe formatted according to the submission outlines. The function can also be accessed in a Jupyter notebook, where it returns a dataframe:

```python
from scripts.inference import ESP_Predictor

model = ESP_Predictor(
    checkpoint_path="best-chckpt-v21.ckpt",
    api="z8j7xj31j7",
    csv_folder_path="Train",
    prob=0.8,
)
pred = model.predict(save_csv=False)
print(pred)
```

| API          | Date         | Fail | Probability |
| ------------ | ------------ | ---- | ----------- |
| "z8jfojo31x" | "2022-09-12" | 0    | 0.17373599  |
| "z8jfojo31x" | "2022-09-13" | 0    | 0.39669976  |
| "z8jfojo31x" | "2022-09-14" | 0    | 0.43361512  |
- To reproduce the submission files, run `submission.py`, which generates predictions for all the wells in the specified folder "Train":

  ```shell
  ~$ python scripts/submission.py --folder 'Train'
  ```
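Since each well is stored as `<well_api>.csv`, collecting the API numbers to iterate over could look like the sketch below. The helper is illustrative; the real script's folder handling may differ.

```python
import os
import tempfile

def list_well_apis(folder):
    """Collect well API numbers from <well_api>.csv filenames in a folder.

    Illustrative sketch: the API number is assumed to be the filename stem.
    """
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.endswith(".csv")
    )

# Demo on a throwaway folder; the filenames are made up for illustration
demo = tempfile.mkdtemp()
for name in ("z8j7xj31j7.csv", "a1b2c3.csv", "notes.txt"):
    open(os.path.join(demo, name), "w").close()
apis = list_well_apis(demo)
```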