This repository accompanies our GIScience publication "Benchmarking regression models under spatial heterogeneity" (see reference below). In the code base, we provide 1) the script for reproducing our experiments on synthetic data, 2) the script for reproducing our benchmarking experiments on several real datasets and 3) an open-source Python implementation of spatial Random Forests. Each part is described in the following.
The required packages and our sprf package can be installed via pip in editable mode in a virtual environment with the following commands:
git clone https://github.com/mie-lab/spatial_rf_python.git
cd spatial_rf_python
python -m venv env
source env/bin/activate
pip install -e .
To reproduce our analysis on synthetic data, run:
python scripts/synthetic_tests.py
All results will be saved in a single csv file named synthetic_data_results.csv.
We use five public datasets to validate our results and to benchmark different algorithms. The datasets are provided as csv files in the data folder. They include:
- A plants dataset
- A deforestation dataset
- A mortality rate dataset
Please cite these sources if reusing their data.
Our benchmarking code is provided both as a notebook and as a script. To reproduce the experiments from the paper, run:
python scripts/benchmarks.py
The results will be saved as csv files in a folder named outputs.
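Once the benchmark has finished, the result files can be inspected with a few lines of standard-library Python. The following is a minimal sketch: the helper name `load_result_tables` is hypothetical, and no assumption is made about the file names or column names that the benchmark script writes, so the loop only reports whatever it finds.

```python
import csv
from pathlib import Path


def load_result_tables(folder="outputs"):
    """Read every csv file in `folder` into a list of row dictionaries."""
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            # DictReader uses the first line of each file as column names.
            tables[path.name] = list(csv.DictReader(handle))
    return tables


# Print a short summary per result file (prints nothing if the folder is empty).
for name, rows in load_result_tables().items():
    columns = list(rows[0].keys()) if rows else []
    print(f"{name}: {len(rows)} rows, columns: {columns}")
```

All values are read as strings, so numeric columns need to be converted before further analysis.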
This repository further provides Python implementations of Spatial Random Forests. Different approaches have been proposed in the literature; here, we focus on the one by Georganos et al., termed Geographical Random Forests. We implement their approach, but since training one random forest per sample is computationally very expensive, we additionally implement a more efficient variant (which we simply call Spatial Random Forests): instead of training one random forest per sample, we train a fixed number of random forests on spatially distinct sets of points. The prediction is then a weighted average of the tree-wise predictions, weighted by the distance of the test sample from the center of each tree (see figure below).
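The distance-weighted aggregation can be illustrated with a small standalone sketch. It is not the sprf implementation: the inverse-distance weighting and the names `idw_weights` and `aggregate_predictions` are assumptions made for this example, since the exact weighting function is not specified here.

```python
import math


def idw_weights(test_coord, centers, eps=1e-9):
    """Normalized inverse-distance weights of one test point to each center.

    Assumption: weights decay as 1 / distance; the weighting used in sprf
    may differ.
    """
    raw = [1.0 / (math.dist(test_coord, c) + eps) for c in centers]
    total = sum(raw)
    return [w / total for w in raw]


def aggregate_predictions(test_coord, centers, local_predictions):
    """Weighted average of the predictions of spatially local models."""
    weights = idw_weights(test_coord, centers)
    return sum(w * p for w, p in zip(weights, local_predictions))


# Hypothetical example: two local models centered at (0, 0) and (10, 0).
centers = [(0.0, 0.0), (10.0, 0.0)]
local_predictions = [1.0, 3.0]  # prediction of each local model for the test sample

# A test point halfway between the centers weights both models equally.
print(aggregate_predictions((5.0, 0.0), centers, local_predictions))
# A test point close to the first center is dominated by the first model.
print(aggregate_predictions((1.0, 0.0), centers, local_predictions))
```

The small constant `eps` only guards against division by zero when a test sample coincides with a center.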
The demonstration notebook shows how to use the Spatial Random Forests.
The usage is analogous to other scikit-learn models, except that the coordinates must also be given as input.
from sprf import SpatialRandomForest
spatial_rf = SpatialRandomForest()
spatial_rf.fit(train_x, train_y, train_coords)
test_pred = spatial_rf.predict(test_x, test_coords)
To train a Geographical Random Forest as proposed by Georganos et al., use the corresponding class, which works in the same way:
from sprf import GeographicalRandomForest
geo_rf = GeographicalRandomForest()
geo_rf.fit(train_x, train_y, train_coords)
test_pred = geo_rf.predict(test_x, test_coords)
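Because the coordinates are passed separately from the features, `train_x`, `train_y` and `train_coords` (and likewise the test variables) must stay aligned row by row. One way to achieve this is to split all three with the same shuffled indices. The helper `split_with_coords` below is a hypothetical illustration and not part of sprf; the container types expected by sprf are not specified here, so plain lists are used only for demonstration.

```python
import random


def split_with_coords(x, y, coords, test_fraction=0.2, seed=0):
    """Randomly split features, targets and coordinates with one shared row order."""
    if not (len(x) == len(y) == len(coords)):
        raise ValueError("x, y and coords must have the same number of rows")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(test_fraction * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(rows, idx):
        return [rows[i] for i in idx]

    return (
        take(x, train_idx), take(y, train_idx), take(coords, train_idx),
        take(x, test_idx), take(y, test_idx), take(coords, test_idx),
    )


# Toy data: the target equals the single feature; each row has a coordinate pair.
x = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
coords = [(float(i), float(-i)) for i in range(10)]
train_x, train_y, train_coords, test_x, test_y, test_coords = split_with_coords(x, y, coords)
print(len(train_x), len(test_x))
```

Note that a purely random split ignores spatial autocorrelation between nearby training and test samples; whether that is appropriate depends on the evaluation goal.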
If you use our work, please cite our paper with the following BibTeX entry:
@inproceedings{wiedemann2023benchmarking,
title={Benchmarking regression models under spatial heterogeneity},
author={Wiedemann, Nina and Martin, Henry and Westerholt, René},
booktitle={12th International Conference on Geographic Information Science (GIScience 2023)},
year={2023},
}