
# BPX-Challenge Documentation

## Table of Contents

1. Environment Setup
2. Data Preprocessing
3. Model Training and Evaluation
4. Inference

## Environment Setup

This project uses a Python virtual environment. Run the commands below in a terminal to install all requirements:

```sh
chmod +x setup.sh
sh setup.sh
source jtk/bin/activate
```

Note: The setup has been tested successfully on Intel-based Macs, Ubuntu, and Windows, on both CPU and GPU. It does not work well on Apple Silicon Macs; metrics computed there are notably lower.

For GPUs with CUDA versions older than 12.0, you can install PyTorch with `pip install torch==2.2.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html`.
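Once the environment is active, a quick sanity check can confirm that the virtual environment and (optionally) the GPU stack are in place. This is a minimal sketch using a hypothetical `env_report` helper (not part of the repo); it does not assume PyTorch is already installed:

```python
# Sanity check for the project environment (hypothetical helper, not in the repo).
import sys

def env_report():
    # A virtual environment is active when sys.prefix differs from base_prefix.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    try:
        import torch  # installed by setup.sh
        torch_version = torch.__version__
        cuda = torch.cuda.is_available()
    except ImportError:
        torch_version, cuda = None, False
    return {
        "python": sys.version_info[:2],
        "venv_active": in_venv,
        "torch": torch_version,
        "cuda": cuda,
    }

if __name__ == "__main__":
    print(env_report())
```

If `torch` is `None` or `cuda` is `False` where you expected a GPU, re-run `setup.sh` or install the CUDA-specific wheel noted above.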

## Preprocessing the Dataset

Steps were taken to ensure that new datasets can easily be generated for different prediction durations. Major folder names and the prediction fail window can be modified in the environment file `./.env`. You can specify a `SLIDE_N` value in the environment file to generate binary labels for a chosen number of days before an ESP test fails. To execute any of the scripts, the virtual environment `jtk` must be active.

1. Generate the labels for the training data contained in the folder `Train`:

   ```sh
   python scripts/preprocess_labels.py --folderpath "Train"
   ```

   Note: This creates a folder `./Processed_14`, where 14 is the `SLIDE_N` value specified in the environment file.

2. Remove local outliers and crop the data into daily windows for training and prediction:

   ```sh
   python scripts/crop_processed_data.py -c=True
   ```

   Note: This creates a folder `./Cropped_14`. The script runs across multiple processors because of its numerous for loops, and it takes a while to complete (roughly 1.5 hours on an old MacBook with 4 processors).

Both scripts need to be run only once.
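To make the `SLIDE_N` idea concrete, the sliding-window labeling can be sketched as below. This is a hypothetical simplification (`label_days` is not part of the repo); the actual logic lives in `scripts/preprocess_labels.py`:

```python
# Simplified sketch of SLIDE_N binary labeling (hypothetical, not the repo's code).
from datetime import date, timedelta

def label_days(dates, fail_date, slide_n=14):
    """Label a day 1 if it falls within slide_n days before (and including)
    the failure date, else 0."""
    window_start = fail_date - timedelta(days=slide_n)
    return [1 if window_start <= d <= fail_date else 0 for d in dates]

# Example: 20 daily records with a failure on 2022-09-18 and SLIDE_N = 14.
days = [date(2022, 9, 1) + timedelta(days=i) for i in range(20)]
labels = label_days(days, fail_date=date(2022, 9, 18), slide_n=14)
```

With these assumptions, the 15 days from 2022-09-04 through the failure date are labeled 1 and the rest 0.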

## Training the Model

1. Arguments to the train script can be specified. By default, the data folder is read from the env file. Additional arguments include:

   - `split` - train-test split percentage
     - validation data defaults to 10% of the training data
   - `batch_size` - batch size of the training dataset
   - `learning_rate` - learning rate used to update the model weights
   - `num_epochs` - number of epochs to train the model for
   - `dropout` - dropout percentage during training
   - `num_layers` - number of layers in the model

   ```sh
   python scripts/train.py
   ```

   Training progress can be visualized in the Jupyter notebook `tensorboard_viz.ipynb`.
2. The training process saves the best checkpoint for each training run. To evaluate a model checkpoint on the test dataset:

   ```sh
   python scripts/evaluate.py --chkpt="checkpoints/best-chckpt-v21.ckpt"
   ```

   ```
   ────────────────────────────────────────
   Test metric             DataLoader 0
   ────────────────────────────────────────
   test_acc                 0.915
   test_fbeta               0.923
   test_loss                0.275
   ────────────────────────────────────────
   ```
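For reference on the `test_fbeta` metric reported above, a plain-Python F-beta score can be sketched as follows. This is an illustrative `fbeta` helper, not the repo's implementation, and the beta value actually used by `evaluate.py` is not stated here:

```python
# Illustrative F-beta computation on binary labels (not the repo's metric code).
def fbeta(y_true, y_pred, beta=1.0):
    """F-beta = (1 + b^2) * P * R / (b^2 * P + R) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With `beta > 1` the score weights recall more heavily than precision, which is a common choice for failure-prediction tasks where missed failures are costly.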

## Inference

1. Inference assumes each well is saved using the naming convention `<well_api>.csv`. Predictions can be generated by specifying the following parameters:

   - `chkpt` - path to the best model checkpoint; specify in the env file for a default
   - `api` - well API number; it has no default value
   - `train_folder` - name of the training folder where all wells are saved; defaults to `"train"`
   - `prob` - probability cutoff for each prediction; defaults to 0.8

   ```sh
   python scripts/inference.py --api="z8j7xj31j7" --train_folder=train --prob=0.8
   ```

   The script calls a function that generates a dataframe formatted according to the submission outline.

   The function can also be used in a Jupyter notebook, where it returns a dataframe:

   ```python
   from scripts.inference import ESP_Predictor

   model = ESP_Predictor(checkpoint_path="best-chckpt-v21.ckpt",
                         api="z8j7xj31j7",
                         csv_folder_path="Train",
                         prob=0.8)

   pred = model.predict(save_csv=False)
   print(pred)
   ```

   ```
   API          Date         Fail  Probability
   "z8j7xj31j7" "2022-09-12" 0     0.17373599
   "z8j7xj31j7" "2022-09-13" 0     0.39669976
   "z8j7xj31j7" "2022-09-14" 0     0.43361512
   ```
2. To reproduce the submission files, run `submission.py`, which generates predictions for all the wells in the specified folder `Train`:

   ```sh
   python scripts/submission.py --folder 'Train'
   ```
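For context on the `prob` parameter used throughout inference, converting per-day failure probabilities into binary `Fail` flags can be sketched as below (a hypothetical `apply_cutoff` helper, not the repo's code):

```python
# Sketch of the probability-cutoff step behind the --prob argument
# (hypothetical helper, not the repo's implementation).
def apply_cutoff(probabilities, prob=0.8):
    """Flag a day as a predicted failure when its probability meets the cutoff."""
    return [1 if p >= prob else 0 for p in probabilities]

apply_cutoff([0.17, 0.40, 0.85])  # -> [0, 0, 1]
```

Lowering the cutoff trades precision for recall: more days are flagged as failures, catching more true failures at the cost of more false alarms.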