- Introduction
- Motivation
- Installation
- Data Collection
- [UPDATED!] Machine Learning Models
- Training Cycles and DataLoader Pipelines
- UNDER CONSTRUCTION
This project aims to generate predictions of the percentage of vegetation cover in Antioquia based on the Sentinel-2 and MOD44B.006 Terra Vegetation datasets. The architecture is designed to accept any number of bands (num_bands), but defaults to three bands corresponding to the RGB channels (B4, B3, B2) of Sentinel-2. The Sentinel-2 bands serve as the input to our model, while the target variable (total vegetation cover) is obtained by summing the Percent_Tree_Cover and Percent_NonTree_Vegetation bands to get Percent_VegetationCover.
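For clarity, the target construction is just a per-pixel sum of those two MOD44B bands. The snippet below is a minimal numpy sketch with illustrative values; array shapes and the band-extraction step are simplified:

```python
import numpy as np

# Illustrative per-pixel percentages taken from the two MOD44B bands.
percent_tree_cover = np.array([[12.0, 40.5],
                               [63.0,  7.5]])
percent_nontree_vegetation = np.array([[30.0, 22.0],
                                       [18.5, 50.0]])

# Target variable: total vegetation cover per pixel.
percent_vegetation_cover = percent_tree_cover + percent_nontree_vegetation
print(percent_vegetation_cover)
```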
WE STRONGLY SUGGEST THAT YOU READ THESE NOTEBOOKS:
- 00_data_extraction_process.MD
- 01_data_exploration.ipynb
- 02_training_process.ipynb
- 03_model_application_to_new_data.ipynb
The motivation behind this project is to provide government and environmental organizations with tools to predict vegetation coverage based on satellite images. This allows for better environmental monitoring and decision-making.
We have a Dockerfile available to serve the model with a FastAPI endpoint. You can use it by:
- Install the image with:

  ```
  docker build -t udeaiforestimage:latest .
  ```

- Run the container. Be sure to mount a volume for the data on which you are going to run inferences; it must be in the same directory as the Dockerfile:

  ```
  docker run -p 8000:8000 -v $(pwd)/data:/app/data -v $(pwd)/app/html_maps:/app/html_maps udeaiforestimage:latest
  ```

  This will create a volume for your data and another one for your HTML maps.
- Now, you can go to http://localhost:8000/form to generate a map with a specific model and a data directory, like this:
Be ready to wait 15-20 minutes for the model to predict and build the map for each new dataset. You can see a progress bar in the console where you launched the Docker container.
- You can view any map using the `def` of its dataset with the `view-map` endpoint, like this:
  `http://localhost:8000/view-map/sentinel2rgbmedian2020.py`
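If you prefer to fetch a generated map programmatically instead of opening it in the browser, a small script like the sketch below works. This is a hypothetical example using the `requests` library; it assumes the container is running locally and that the map for this dataset definition has already been generated:

```python
import requests

# Download the HTML map served by the view-map endpoint and save it locally.
url = "http://localhost:8000/view-map/sentinel2rgbmedian2020.py"
response = requests.get(url, timeout=60)
response.raise_for_status()

with open("sentinel2rgbmedian2020_map.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```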
- Officially tested on Ubuntu 22.04 LTS.
- Other operating systems and distributions may work but are not officially supported.
- Windows, as of October 2023, does not have the necessary dependencies to train the models on AMD GPUs.
To run the project, you'll need:
- Python 3.x
- pip
- A modified version of the `geetiles` tool, which can be installed using the following command:

```
pip install git+https://github.com/Felipe-RA/geetiles
```

BEWARE! BEFORE RUNNING THE DOWNLOAD COMMAND:
We use a Notebook Authenticator for the project. Google will likely issue a warning advising you not to run the code unless you are using a notebook and understand the code being executed. The code responsible for these operations is located in the defs directory. This code defines the instruments (ImageCollection) and bands to be used, among other post-processing steps. While our defs files do not access your private data, malicious applications could misuse token access to compromise your information. Make sure to read and understand the code before executing any commands.
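For reference, the interactive authentication step generally looks like the sketch below. This is a minimal example using the `earthengine-api` package, not the project's own code; the project id is a placeholder you must replace with your own cloud project:

```python
import ee

# Opens the Notebook Authenticator flow and stores a credential token locally.
ee.Authenticate()

# Initialize the client against your own Google Cloud project (placeholder id).
ee.Initialize(project="your-cloud-project-id")

# Quick sanity check: count the images in the MOD44B.006 collection used by the defs.
print(ee.ImageCollection("MODIS/006/MOD44B").size().getInfo())
```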
Use the geet tool to generate a grid over your Area of Interest (AOI) with the following command:
```
geet grid --aoi_wkt_file data/antioquia.wkt --chip_size_meters 1000 --aoi_name antioquia --dest_dir .
```

- `--aoi_wkt_file`: Path to the `.wkt` file defining your AOI.
- `--chip_size_meters`: Size in meters of each chip.
- `--aoi_name`: Name for the `.geojson` file to be created.
- `--dest_dir`: Directory where the `.geojson` file will be saved.
Important: You need a Google Earth Engine account and a cloud project to use the Earth Engine data pipeline.
- Authorization: We use a Notebook Authenticator. Be cautious when following Google's warnings. The code that performs these operations is in the `defs` directory.
- Download Duration: Depending on various factors, the download can take from minutes to days. For the `antioquia.wkt` AOI with 1000-meter chips, expect around 3 to 5 hours.
Execute the following command to start the download:
```
geet download --tiles_file path/to/your_grid_partition_aschips.geojson --dataset_def defs/CHOOSE_DEF_FILE.py --pixels_lonlat [100,100] --skip_if_exists
```

Replace `CHOOSE_DEF_FILE.py` with your definition file, such as `treecover2020.py` or `sentinel2-rgb-median-2020`.
- Install the CUDA Toolkit drivers for NVIDIA GPUs.
- Install the ROCm Open Source Platform for AMD GPUs with SUPPORTED ARCHITECTURES.
Be careful: GPU-accelerated tool implementations are extremely dependent on the OS on which they are installed. All of the console commands listed below assume that you are working on Linux. If you are using another OS, please refer to the Official PyTorch Installation Matrix.
To install the required dependencies for running the machine learning models, execute the following commands:
```
pip install -r requirements.txt
```

PyTorch with CUDA 11.8 support (NVIDIA GPUs):

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

PyTorch default build:

```
pip3 install torch torchvision torchaudio
```

NEW! AMD support for PyTorch using ROCm (supported GPUs are changing on a regular basis, check here):

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
```
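After installing one of the builds above, you can quickly check whether PyTorch actually sees your accelerator. This is a minimal sketch; ROCm builds also report their devices through the `torch.cuda` namespace:

```python
import torch

# True if a CUDA (NVIDIA) or ROCm (AMD) device is visible to this PyTorch build.
if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected, falling back to CPU.")
```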
The models are compatible with PyTorch's `nn` module. They are inspired by the VGG architecture and are hence named VGGUdeA.
The specific implementation of the models can be found in the src/classes directory.
- MultipleRegressionModel.py
- ParzenWindowModel.py
- VGGUdeaSpectral.py
- SpectralAttentionMode.py
These classes were implemented to be compatible with the base classes of sklearn (for MultipleRegressionModel.py and ParzenWindowModel.py) and PyTorch (for VGGUdeaSpectral.py).
The class SpectralAttentionMode.py provides the attention mechanism that the neural network uses to weight the spectral bands during training.
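To illustrate the idea (this is not the repository's actual implementation, which lives in `src/classes/VGGUdeaSpectral.py`), the sketch below shows a minimal VGG-style regressor that accepts an arbitrary `num_bands` and applies a simple per-band attention gate before the convolutional blocks. All layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Per-band attention gate: learns a weight in [0, 1] for each input band."""
    def __init__(self, num_bands: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_bands, num_bands),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, bands, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))   # global average pool per band
        return x * weights[:, :, None, None]    # re-weight each band

class TinyVGGRegressor(nn.Module):
    """Small VGG-style conv blocks followed by a single regression output."""
    def __init__(self, num_bands: int = 3):
        super().__init__()
        self.attention = SpectralAttention(num_bands)
        self.features = nn.Sequential(
            nn.Conv2d(num_bands, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.head(self.features(self.attention(x)))

# Example forward pass on a batch of 100x100 chips with 3 bands (B4, B3, B2).
model = TinyVGGRegressor(num_bands=3)
chips = torch.randn(4, 3, 100, 100)
print(model(chips).shape)   # torch.Size([4, 1]) -> predicted vegetation cover per chip
```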
The training cycles are available at src/generate_models. The application is designed to support multiple models coexisting with each other; a future TODO is to allow any specific model to be called from within the dockerized app (alongside the possibility of training and adding a new trained model!).
To better understand the challenges involved in the data acquisition and alignment, we strongly suggest that you read the support document attached at this link here or in the file /00_data_extraction_process.MD.
Nonetheless, we prepared a link to download the losslessly compressed dataset ready for training.
We recommend that you use these serialized resources if your aim is to reproduce the results of this notebook and to train the models with the original data, since the data size was reduced from ~30GB down to ~1.9GB.
```
git clone https://github.com/Felipe-RA/udeai_forest
```

```
pip install -r requirements.txt
```

```
cd path/where/you/have/the/repo
```

Example: if your path looks like `username@userhost:~/somepath/udeai_forest`, you are at the root of the repo and can proceed.
We understand that downloading the raw data may not be feasible for everyone (it can take ~4-8 hours).
Because of that, we prepared a link to download the losslessly compressed dataset ready for training. The highly efficient structure of the TorchTensors and other compression techniques allowed us to reduce the size of our X,y from 30GB down to just ~1.9GB.
Just put those two files X_tensor.pth and y_tensor.pth under the root of your local clone of the repo.
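Once both files are in place, you can sanity-check them with a quick load. This is a minimal sketch; the exact shapes noted in the comments are assumptions based on the chip size and band count described above:

```python
import torch

# Load the serialized dataset from the repo root.
X = torch.load("X_tensor.pth")   # Sentinel-2 chips, e.g. (N, num_bands, H, W)
y = torch.load("y_tensor.pth")   # Percent_VegetationCover targets, e.g. (N,) or (N, 1)

print("X:", X.shape, X.dtype)
print("y:", y.shape, y.dtype)
```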
OPTIONAL

- The script that makes this conversion possible is available under `src/dataset_serialization/serialize_sentinel_vegetation_data.py`. As long as you have the datasets downloaded in your `/data` folder (by following the instructions in DATA COLLECTION), you can run it to generate your own serialized tensors. Remember to run from the root of the cloned repo:
```
python3 -m src.dataset_serialization.serialize_sentinel_vegetation_data
```
You will get a progress bar that shows the progress of the serialization (implemented in-house using the `tqdm` progress-bar wrapper). It looks like the image below. This process runs on the CPU, so the estimated time may differ based on your CPU specs and especially the available RAM:
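Once the tensors exist, wrapping them in a PyTorch `DataLoader` for the training cycles is straightforward. The sketch below uses `torch.utils.data` with an illustrative 80/20 split; the actual scripts in `src/generate_models` handle their own data loading:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.load("X_tensor.pth")
y = torch.load("y_tensor.pth")

dataset = TensorDataset(X, y)

# Illustrative 80/20 train/validation split and batched loaders.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

for batch_X, batch_y in train_loader:
    print(batch_X.shape, batch_y.shape)   # one batch of chips and targets
    break
```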
- To train the MultipleRegressionModel.py implementation:

  ```
  python3 -m src.generate_models.train_MultipleRegression
  ```

- To train the ParzenWindowModel.py implementation:

  ```
  python3 -m src.generate_models.train_ParzenWindow
  ```

- To train the VGGUdeaSpectral.py implementation:

  ```
  python3 -m src.generate_models.train_VGGUdeaSpectral
  ```

These scripts will show in the terminal:
- The number of `folds` and `candidates`, with the total number of `fits` to do (see the sketch after this list).
- The individual training results for each fold (the `verbosity` is set to `3`).
- The report of the training results.
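That terminal output (folds, candidates, total fits, per-fold results at verbosity 3) is the standard log of a scikit-learn cross-validated hyperparameter search. The sketch below is not the repository's configuration, only a minimal example of the kind of search that produces such output:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge  # stand-in estimator for illustration

# Toy data standing in for the flattened chips and vegetation-cover targets.
X = np.random.rand(200, 30)
y = np.random.rand(200)

param_grid = {"alpha": [0.1, 1.0, 10.0]}    # 3 candidates
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=KFold(n_splits=5),                   # 5 folds -> 15 fits in total
    verbose=3,                               # prints each individual fit
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```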
Also, these scripts generate a folder with all the obtained artifacts from the model generation. This folder is available at src/trained_models and contains the following artifacts:
- A subfolder named `[ModelName][ModelID]`.
- A JSON logging all the hyperparameters and scores obtained from the optimization, named `[ModelName]_hyperparameter_optimizacion_log[ModelID].json`.
- A JSON that contains the best hyperparameters, named `[ModelName]_hyperparameters[ModelID].json`.
- A `.joblib` that serializes the model to be used as a predictive tool, named `[ModelName]_model[ModelID].joblib`.
- A `.txt` that contains a report of the scores, hyperparameters, and training time obtained for the best model, named `[ModelName]_report[ModelID]`.
A visual example of this structure can be seen in the folder structure below:
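Once training has finished, the serialized `.joblib` artifact can be loaded back for inference. This is a hypothetical sketch: the path and input shape are placeholders, and the real file name follows the `[ModelName]_model[ModelID].joblib` pattern described above:

```python
import numpy as np
from joblib import load

# Placeholder path: substitute the actual [ModelName] and [ModelID] of your run.
model = load("src/trained_models/SomeModel_123/SomeModel_model123.joblib")

# Placeholder input: the shape must match whatever the chosen model was trained on.
X_new = np.random.rand(5, 3 * 100 * 100)
print(model.predict(X_new))
```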
Various test configurations are available under src/synthetic_test_training_cycles. These tests use:
- GPU, if available and configured (more on GPU configurations later). CPU is used otherwise.
- Synthetic data generated specifically for testing the architecture without having to load and process the large `.tif` images discussed in the Data Download section.
- Multiple `test_*.py` files, each implementing different model optimization techniques.
Be aware that running these tests on a CPU is considerably slower. What might take seconds on a GPU could take minutes on a CPU.







