MAESTRO: Masked Autoencoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Official implementation for Masked Autoencoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Abstract | Datasets | Results | Getting Started
-
[2026-01] 🚀 Pre-trained MAESTRO models released on HuggingFace
- MAESTRO_FLAIR-HUB_base — pre-trained on FLAIR-HUB
- MAESTRO_S2-NAIP-urban_base — pre-trained on S2-NAIP-urban
-
[2025-11] 🎉 MAESTRO accepted at WACV 2026
MAESTRO is a tailored adaptation of the Masked Autoencoder (MAE) that effectively orchestrates the use of multimodal, multitemporal, and multispectral Earth Observation (EO) data. Evaluated on four EO datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining competitive on tasks dominated by a single monotemporal modality.
MAESTRO's contributions are as follows:
- Extensive benchmarking of multimodal and multitemporal SSL: Impact evaluation of various fusion strategies for multimodal and multitemporal SSL.
- Patch-group-wise normalization: Novel normalization scheme that normalizes reconstruction targets patch-wise within groups of highly correlated spectral bands.
- MAESTRO: Novel adaptation of the MAE that combines optimized fusion strategies with patch-group-wise normalization.
Our implementation already supports 5 datasets.
TreeSatAI-TS
Tree species identification in Germany, with 15 multi-label classes.
- Extent: 50,381 tiles of 60 × 60 m, covering 181 km².
- Modalities: aerial imagery RGB + NIR (0.2 m), Sentinel-1 time series, Sentinel-2 time series.
- Dataset split
PASTIS-HD
Agricultural crop segmentation in France, with 19 semantic classes.
- Extent: 433 tiles of 1280 × 1280 m, covering 709 km².
- Modalities: VHR satellite imagery SPOT 6-7 (1 m), Sentinel-1 time series, Sentinel-2 time series.
- Dataset split
FLAIR#2
Land cover segmentation in France, with 12 semantic classes. Note that the FLAIR#2 version used here is not the original release, but a regenerated version obtained by refiltering FLAIR-HUB.
- Extent: 77,762 tiles of 102.4 × 102.4 m, covering 815 km².
- Modalities: Aerial and elevation imagery RGB + NIR + DEM + DSM (0.2 m), Sentinel-2 time series.
- Dataset split
FLAIR-HUB
Land cover segmentation in France, with 15 semantic classes.
- Extent: 241,100 tiles of 102.4 × 102.4 m, covering 2,528 km².
- Modalities: Aerial and elevation imagery RGB + NIR + DEM + DSM (0.2 m), Sentinel-1 time series, Sentinel-2 time series.
- Dataset split
S2-NAIP urban
Super-resolution in urban areas of the United States. To construct this urban subset, the original S2-NAIP footprints are intersected with the urban set defined in Zooming-in zooming-out, which is limited to locations within a 5 km radius of U.S. cities with populations of at least 50,000.
- Extent: 167,397 tiles of size 640 m × 640 m, covering 68,565 km².
- Modalities: Aerial NAIP imagery RGB + NIR (1.25 m), Sentinel-1 time series, Sentinel-2 time series.
- Dataset split
Table 1 — Intra-dataset comparison of MAESTRO, supervised ViTs, and previous SOTA.
We report wF1 (%) on TreeSatAI-TS and mIoU (%) on PASTIS-HD, FLAIR#2, and FLAIR-HUB.
MAESTRO† models are pre-trained for twice the number of epochs.
| Model | TreeSatAI-TS | PASTIS-HD | FLAIR#2 | FLAIR-HUB |
|---|---|---|---|---|
| MAESTRO (ours) | 78.8 | 68.6 | 62.6 | 65.9 🟢↑1.6 |
| MAESTRO† (ours) | 79.4 🟢↑3.8 | 69.0 🟢↑2.5 | 63.3 🔴↓0.8 | 65.8 |
| ViT | 75.6 | 64.5 | 58.2 | 62.1 |
| Previous SOTA | 75.1 | 66.5 | 64.1 | 64.3 |
Table 2 — Cross-dataset comparison of MAESTRO and adapted baseline FMs. The baseline FMs are adapted to handle an arbitrary number of dates and/or to retain all modality bands.
We report wF1 (%) on TreeSatAI-TS and mIoU (%) on PASTIS-HD, FLAIR#2, and FLAIR-HUB.
| Model | Pre-training dataset | TreeSatAI-TS | PASTIS-HD | FLAIR#2 | FLAIR-HUB |
|---|---|---|---|---|---|
| MAESTRO (ours) | FLAIR-HUB | 79.6 | 68.0 | - | - |
| MAESTRO (ours) | S2-NAIP urban | 78.8 | 67.4 | 62.6 | 64.6 |
| DINO-v2 | LVD-142M | 76.7 | 64.4 | 64.2 | 66.0 |
| DINO-v2 sat. | Maxar Vivid2 | 76.3 | 64.0 | 63.5 | 66.0 |
| DOFA | DOFA MM | 76.0 | 62.9 | 62.3 | 65.1 |
| CROMA | SSL4EO | 70.5 | 65.0 | 39.0 | 44.3 |
| Prithvi-EO-2.0 | HLS | 75.6 | 66.2 | 41.8 | 44.9 |
| SatMAE | fMoW RGB+S | 76.9 | 66.6 | 42.5 | 45.0 |
The module is set up with Poetry:
# 1. Change directory
cd MAESTRO
# 2. Install dependencies with Poetry
poetry installPrerequisites:
- Fetch Dataset splits and move them to each dataset directory
- Fetch pre-trained weights and configuration as detailed in the models cards for MAESTRO_FLAIR-HUB_base and MAESTRO_S2-NAIP-urban_base
Then you can start from the example commands provided in the models cards for MAESTRO_FLAIR-HUB_base and MAESTRO_S2-NAIP-urban_base.
Prerequisites:
- Fetch Dataset splits and move them to each dataset directory
Then you can start from the following minimal examples.
Intra-dataset MAESTRO on TreeSatAI-TS:
# pre-train, probe and finetune on TreeSatAI-TS
poetry run python main.py \
model.model=mae model.model_size=medium \
opt_pretrain.epochs=100 opt_probe.epochs=10 opt_finetune.epochs=50 \
datasets.name_dataset=treesatai_ts \
datasets.root_dir=/path/to/dataset/dir datasets.treesatai_ts.rel_dir=TreeSatAI-TS \
run.exp_dir=/path/to/experiments/dir run.exp_name=MAESTRO_TSAI-TS_baseIntra-dataset MAESTRO on PASTIS-HD:
# pre-train, probe and finetune on PASTIS-HD
poetry run python main.py \
model.model=mae model.model_size=medium \
opt_pretrain.epochs=100 opt_probe.epochs=10 opt_finetune.epochs=50 \
datasets.name_dataset=pastis_hd \
datasets.root_dir=/path/to/dataset/dir datasets.pastis_hd.rel_dir=PASTIS-HD \
run.exp_dir=/path/to/experiments/dir run.exp_name=MAESTRO_PASTIS-HD_baseIntra-dataset MAESTRO on FLAIR-HUB:
# pre-train, probe and finetune on FLAIR-HUB
poetry run python main.py \
model.model=mae model.model_size=medium \
opt_pretrain.epochs=100 opt_probe.epochs=15 opt_finetune.epochs=100 \
datasets.name_dataset=flair \
datasets.root_dir=/path/to/dataset/dir datasets.flair.rel_dir=FLAIR-HUB \
run.exp_dir=/path/to/experiments/dir run.exp_name=MAESTRO_FLAIR-HUB_baseCross-dataset MAESTRO from S2-NAIP urban to TreeSatAI-TS:
# pre-train on S2-NAIP urban
poetry run python main.py \
model.model=mae model.model_size=medium \
opt_pretrain.epochs=15 opt_probe.epochs=0 opt_finetune.epochs=0 \
datasets.name_dataset=s2_naip \
datasets.root_dir=/path/to/dataset/dir datasets.s2_naip.rel_dir=s2-naip-urban \
run.exp_dir=/path/to/experiments/dir run.exp_name=MAESTRO_S2-NAIP-urban_base && \
# probe and finetune on TreeSatAI-TS
poetry run python main.py \
model.model=mae model.model_size=medium \
opt_pretrain.epochs=0 opt_probe.epochs=10 opt_finetune.epochs=50 \
datasets.name_dataset=treesatai_ts \
datasets.treesatai_ts.aerial.image_size=240 datasets.treesatai_ts.aerial.patch_size.mae=16 \
datasets.treesatai_ts.s1_asc.name_embed=s1 datasets.treesatai_ts.s1_des.name_embed=s1 \
datasets.root_dir=/path/to/dataset/dir datasets.treesatai_ts.rel_dir=TreeSatAI-TS \
run.exp_dir=/path/to/experiments/dir run.load_name=MAESTRO_S2-NAIP-urban_base run.exp_name=MAESTRO_S2-NAIP-urban-x-TSAI-TS_base
On 2025-08-19 the repository history was cleaned to remove large files. If you previously cloned the repo, please reclone to avoid conflicts.
If you use this code, please cite:
@article{labatie2025maestro,
title={MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data},
author={Labatie, Antoine and Vaccaro, Michael and Lardiere, Nina and Garioud, Anatol and Gonthier, Nicolas},
journal={arXiv preprint arXiv:2508.10894},
year={2025}
}The experiments in the paper were conducted using HPC/AI resources from GENCI-IDRIS (allocations A0181013803, A0161013803, AD010114597R1, and AD011014690R1).
