Experiment 2 (#9)
* Updated paths

* Implemented conversion of Burned Area from m2 to hectares before the fuel load is calculated in fuelload.py. This addresses issue #10.

* Added new notebook with more concise pre-processing. It includes the conversion to a dataframe (the model input) and solves issues #10 and #12. It also defines a new threshold for BA (50 hectares, see #11; the threshold was set by FDG, a fire expert), although a reliable reference source is still not available.

* Notebook notebooks/preprocess_all_in_one.ipynb reformatted using black

* Formatted src/utils/fuelload.py using black

* Minor changes to notebooks/preprocess_all_in_one.ipynb

* renamed notebook and finalised concise version of the data preparation step

* Complete re-write of the data pre-processing step to avoid resampling. This addresses issue #13.

* Added the following amongst predictors: GFED4 basis regions (as categorical variable) and area of grid cell at point (as continuous variable).

* Load formula changed to BA*CC*AGB/AREA

* added log-transformed variables

* updated notebooks with latest run

* model 6h, MAE

* experiments as in ESA-D1 report

* Updated README files to clarify there are two sets of experiments (by wikilimo and ecmwf).

* Update README.md
cvitolo authored Jun 21, 2021
1 parent 879a0e8 commit 7fe88a5
Showing 8 changed files with 16,989 additions and 20 deletions.
27 changes: 19 additions & 8 deletions README.md
@@ -28,7 +28,14 @@ pip install -U pip
pip install -r requirements.txt
```

This includes all the packages required for running the code in the repository, except for the notebooks in the folder `notebooks/ecmwf` (see `notebooks/ecmwf/README.md` for the additional dependencies to install).

The content of this repository is split into two sets of experiments:

1. the target is fuel load = burned area * above-ground biomass
2. the target is dry matter = burned area * above-ground biomass * combustion coefficient / grid cell area
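As an illustrative sketch (the function and variable names here are invented for this example, not taken from the repository, and the units are placeholders), the two targets can be computed per grid cell as:

```python
def fuel_load(burned_area, agb):
    """Experiment 1 target: burned area * above-ground biomass."""
    return burned_area * agb

def dry_matter(burned_area, agb, combustion_coeff, cell_area):
    """Experiment 2 target: BA * CC * AGB / AREA (see the load formula in the commit notes)."""
    return burned_area * combustion_coeff * agb / cell_area

# Made-up numbers: 120 ha burned, AGB of 5, combustion coefficient 0.4,
# grid cell area of 75,000 ha.
print(fuel_load(120, 5))                  # 600
print(dry_matter(120, 5, 0.4, 75_000))    # 0.0032
```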

## Experiment 1

### Data Description
Seven years of global historical data, from 2010 to 2016, will be used for developing the machine learning models. All data used in this project is proprietary and NOT meant for public release. Xarray, NumPy and netCDF libraries are used for working with the multi-dimensional geospatial data.
@@ -82,7 +89,7 @@ Args description:
Where root_path is the root save path provided for pre-processing.py
```
### Training
Entry-point for training is [src/train.py](src/train.py)
```
@@ -92,7 +99,7 @@ Args description:
* `--exp_name`: Name of the training experiment used for logging.
```
### Inference
Entry-point for inference is [src/test.py](src/test.py)
```
@@ -103,37 +110,41 @@ Args description:
* `--results_path`: Directory where the result inference .csv files and .html visualizations are going to be stored.
```
#### Pre-trained models
Pre-trained models are available at:
- [LightGBM.joblib](src/pre-trained_models/LightGBM.joblib)
- [CatBoost.joblib](src/pre-trained_models/CatBoost.joblib)
#### Demo Notebooks
Notebooks for training and inference:
- [LightGBM_training.ipynb](notebooks/LightGBM_training.ipynb)
- [LightGBM_inference.ipynb](notebooks/LightGBM_inference.ipynb)
- [CatBoost_training.ipynb](notebooks/CatBoost_training.ipynb)
- [CatBoost_inference.ipynb](notebooks/CatBoost_inference.ipynb)
### Fuel Load Prediction Visualizations:
- CatBoost for Mid-Latitudes
<img width="1025" alt="midlats-prediction-july16" src="https://user-images.githubusercontent.com/7680686/113362982-4d263500-936d-11eb-922e-5a0609e7a67e.png">
- LightGBM for Tropics
<img width="1025" alt="tropics-prediction-july16" src="https://user-images.githubusercontent.com/7680686/113362967-45ff2700-936d-11eb-93a3-5ad380393f03.png">
### Adding New Features:
- Make sure the new dataset to be added is a single file in `.nc` format, containing data from 2010 to 2016 on a 0.25 x 0.25 grid cell resolution.
- Match the features of the new dataset with the existing features. This can be done by going through `notebooks/EDA_pre-processed_data.ipynb`.
- Add the feature path as a variable to `src/utils/data_paths.py`. The path variable must then be added to either the time-dependent or time-independent list (depending on which category it belongs to) inside `export_feature_paths()`.
- The model will now also be trained on the added feature when running `src/train.py`!
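A minimal sketch of the pattern the steps above describe (the variable names, file paths and list layout here are hypothetical; check `src/utils/data_paths.py` for the actual ones):

```python
# Hypothetical layout of src/utils/data_paths.py.
FWI_PATH = "data/fwi_2010-2016.nc"          # an existing feature path (illustrative)
NEW_FEATURE_PATH = "data/new_feature.nc"    # the new .nc file being added

def export_feature_paths():
    # Time-dependent features vary along the 2010-2016 time axis;
    # time-independent ones (e.g. static land cover) do not.
    time_dependent = [FWI_PATH, NEW_FEATURE_PATH]
    time_independent = []
    return time_dependent, time_independent
```

Once the path is in the appropriate list, the training entry-point picks it up automatically.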
### Documentation
Documentation is available at: [https://ml-fuel.readthedocs.io/en/latest/index.html](https://ml-fuel.readthedocs.io/en/latest/index.html).
## Experiment 2
Please refer to `notebooks/ecmwf/README.md` for a description of this experiment, instructions to install the additional dependencies, and the notebooks with the steps to perform the experiment.
## Info
This repository was developed by Anurag Saha Roy (@lazyoracle) and Roshni Biswas (@roshni-b) for the ESA-SMOS-2020 project. Contact email: `[email protected]`. The repository is now maintained by the Wildfire Danger Forecasting team at the European Centre for Medium-Range Weather Forecasts.
33 changes: 33 additions & 0 deletions notebooks/ecmwf/README.md
@@ -0,0 +1,33 @@
# Notebooks for pre-processing and modelling

Notebooks in this folder contain all the steps for data exploration, pre-processing, modelling and explainability using the H2O.ai framework.

## Install dependencies

```bash
conda install cartopy
conda install -c h2oai h2o
```

## Notebooks

### 1. Data preparation

This notebook takes the raw/downloaded data and pre-processes it into a data frame. The data is then split into train and test sets using a stratified sampling strategy, to make sure both sets contain the same proportion of each biome.
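The stratified split described above can be sketched as follows (a simplified stand-in for the notebook's actual code, assuming each row carries a `biome` label used as the stratum):

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_frac=0.2, seed=42):
    """Split rows into train/test sets, preserving the proportion of each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)
    train, test = [], []
    for stratum_rows in by_stratum.values():
        rng.shuffle(stratum_rows)
        n_test = round(len(stratum_rows) * test_frac)
        test.extend(stratum_rows[:n_test])
        train.extend(stratum_rows[n_test:])
    return train, test

rows = [{"biome": b, "x": i} for i, b in enumerate(["forest"] * 80 + ["savanna"] * 20)]
train, test = stratified_split(rows, key=lambda r: r["biome"])
# Both splits keep the 80/20 forest/savanna proportion of the full dataset.
```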

### 2. Exploratory data analysis

This notebook explores the data assembled in notebook `data_preparation.ipynb`. It looks at the probability distributions of the outcome and predictors and identifies possible data transformations, as well as correlations and redundancies amongst variables.

### 3. Model benchmark tests

This notebook uses the H2O.ai AutoML framework to benchmark various possible data transformations for the outcome and predictors. It also compares model results when all features are used versus only the non-redundant ones. The final result is a pre-processed dataset that will be used for the final modelling step in `model_definition.ipynb`.

### 4. Model definition and evaluation

This notebook uses the H2O.ai AutoML framework to model the transformed outcome and predictors. It visualises averaged results over a map and uses the H2O.ai explainability module to identify model limitations and possible avenues for future improvement.


## Info

These notebooks were developed by the Wildfire Danger Forecasting team at the European Centre for Medium-Range Weather Forecasts for the ESA-SMOS-2020 project. For any queries, please contact ECMWF via the support portal: https://confluence.ecmwf.int/site/support.
11,918 changes: 11,918 additions & 0 deletions notebooks/ecmwf/data_preparation.ipynb


