🌟 A collection of papers, datasets, benchmarks, code, and pre-trained weights for Remote Sensing Foundation Models (RSFMs).
🔥🔥🔥 Last Updated on 2024.08.19 🔥🔥🔥
- 2024.08.19: Update SpectralEarth.
- 2024.08.08: Update a survey paper.
- 2024.08.06: Update MA3E.
- Models
- Datasets & Benchmarks
- Others
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
GeoKR | Geographical Knowledge-Driven Representation Learning for Remote Sensing Images | TGRS2021 | GeoKR | link |
- | Self-Supervised Learning of Remote Sensing Scene Representations Using Contrastive Multiview Coding | CVPRW2021 | Paper | link |
GASSL | Geography-Aware Self-Supervised Learning | ICCV2021 | GASSL | link |
SeCo | Seasonal Contrast: Unsupervised Pre-Training From Uncurated Remote Sensing Data | ICCV2021 | SeCo | link |
DINO-MM | Self-supervised Vision Transformers for Joint SAR-optical Representation Learning | IGARSS2022 | DINO-MM | link |
SatMAE | SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery | NeurIPS2022 | SatMAE | link |
RS-BYOL | Self-Supervised Learning for Invariant Representations From Multi-Spectral and SAR Images | JSTARS2022 | RS-BYOL | null |
GeCo | Geographical Supervision Correction for Remote Sensing Representation Learning | TGRS2022 | GeCo | null |
RingMo | RingMo: A remote sensing foundation model with masked image modeling | TGRS2022 | RingMo | Code |
RVSA | Advancing plain vision transformer toward remote sensing foundation model | TGRS2022 | RVSA | link |
RSP | An Empirical Study of Remote Sensing Pretraining | TGRS2022 | RSP | link |
MATTER | Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks | CVPR2022 | MATTER | null |
CSPT | Consecutive Pre-Training: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain | RS2022 | CSPT | link |
- | Self-supervised Vision Transformers for Land-cover Segmentation and Classification | CVPRW2022 | Paper | link |
BFM | A billion-scale foundation model for remote sensing images | Arxiv2023 | BFM | null |
TOV | TOV: The original vision model for optical remote sensing image understanding via self-supervised learning | JSTARS2023 | TOV | link |
CMID | CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding | TGRS2023 | CMID | link |
RingMo-Sense | RingMo-Sense: Remote Sensing Foundation Model for Spatiotemporal Prediction via Spatiotemporal Evolution Disentangling | TGRS2023 | RingMo-Sense | null |
IaI-SimCLR | Multi-Modal Multi-Objective Contrastive Learning for Sentinel-1/2 Imagery | CVPRW2023 | IaI-SimCLR | null |
CACo | Change-Aware Sampling and Contrastive Learning for Satellite Images | CVPR2023 | CACo | link |
SatLas | SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding | ICCV2023 | SatLas | link |
GFM | Towards Geospatial Foundation Models via Continual Pretraining | ICCV2023 | GFM | link |
Scale-MAE | Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning | ICCV2023 | Scale-MAE | link |
DINO-MC | DINO-MC: Self-supervised Contrastive Learning for Remote Sensing Imagery with Multi-sized Local Crops | Arxiv2023 | DINO-MC | link |
CROMA | CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders | NeurIPS2023 | CROMA | link |
Cross-Scale MAE | Cross-Scale MAE: A Tale of Multiscale Exploitation in Remote Sensing | NeurIPS2023 | Cross-Scale MAE | link |
DeCUR | DeCUR: decoupling common & unique representations for multimodal self-supervision | Arxiv2023 | DeCUR | link |
Presto | Lightweight, Pre-trained Transformers for Remote Sensing Timeseries | Arxiv2023 | Presto | link |
CtxMIM | CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding | Arxiv2023 | CtxMIM | null |
FG-MAE | Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing | Arxiv2023 | FG-MAE | link |
Prithvi | Foundation Models for Generalist Geospatial Artificial Intelligence | Arxiv2023 | Prithvi | link |
RingMo-lite | RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework | Arxiv2023 | RingMo-lite | null |
- | A Self-Supervised Cross-Modal Remote Sensing Foundation Model with Multi-Domain Representation and Cross-Domain Fusion | IGARSS2023 | Paper | null |
EarthPT | EarthPT: a foundation model for Earth Observation | NeurIPS2023 CCAI workshop | EarthPT | link |
USat | USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery | Arxiv2023 | USat | link |
FoMo-Bench | FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models | Arxiv2023 | FoMo-Bench | link |
AIEarth | Analytical Insight of Earth: A Cloud-Platform of Intelligent Computing for Geospatial Big Data | Arxiv2023 | AIEarth | link |
- | Self-Supervised Learning for SAR ATR with a Knowledge-Guided Predictive Architecture | Arxiv2023 | Paper | link |
Clay | Clay Foundation Model | - | null | link |
Hydro | Hydro--A Foundation Model for Water in Satellite Imagery | - | null | link |
U-BARN | Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series | JSTARS2024 | Paper | link |
GeRSP | Generic Knowledge Boosted Pre-training For Remote Sensing Images | Arxiv2024 | GeRSP | GeRSP |
SwiMDiff | SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image | Arxiv2024 | SwiMDiff | null |
OFA-Net | One for All: Toward Unified Foundation Models for Earth Vision | Arxiv2024 | OFA-Net | null |
SMLFR | Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation | TGRS2024 | SMLFR | link |
SpectralGPT | SpectralGPT: Spectral Foundation Model | TPAMI2024 | SpectralGPT | link |
S2MAE | S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data | CVPR2024 | S2MAE | null |
SatMAE++ | Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery | CVPR2024 | SatMAE++ | link |
msGFM | Bridging Remote Sensors with Multisensor Geospatial Foundation Models | CVPR2024 | msGFM | link |
SkySense | SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | CVPR2024 | SkySense | Coming soon |
MTP | MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining | Arxiv2024 | MTP | link |
DOFA | Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities | Arxiv2024 | DOFA | link |
PIS | Pretrain A Remote Sensing Foundation Model by Promoting Intra-instance Similarity | - | null | link |
MMEarth | MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning | Arxiv2024 | MMEarth | link |
SARATR-X | SARATR-X: A Foundation Model for Synthetic Aperture Radar Images Target Recognition | Arxiv2024 | SARATR-X | link |
LeMeViT | LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation | IJCAI2024 | LeMeViT | link |
SoftCon | Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining | Arxiv2024 | SoftCon | link |
RS-DFM | RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks | Arxiv2024 | RS-DFM | null |
A2-MAE | A2-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder | Arxiv2024 | A2-MAE | null |
HyperSIGMA | HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model | Arxiv2024 | HyperSIGMA | link |
SelectiveMAE | Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset | Arxiv2024 | SelectiveMAE | link |
OmniSat | OmniSat: Self-Supervised Modality Fusion for Earth Observation | ECCV2024 | OmniSat | link |
MM-VSF | Towards a Knowledge guided Multimodal Foundation Model for Spatio-Temporal Remote Sensing Applications | Arxiv2024 | MM-VSF | null |
MA3E | Masked Angle-Aware Autoencoder for Remote Sensing Images | ECCV2024 | MA3E | link |
SpectralEarth | SpectralEarth: Training Hyperspectral Foundation Models at Scale | Arxiv2024 | SpectralEarth | null |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
RSGPT | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Arxiv2023 | RSGPT | link |
RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Arxiv2023 | RemoteCLIP | link |
GeoRSCLIP | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv2023 | GeoRSCLIP | link |
GRAFT | Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | ICLR2024 | GRAFT | null |
- | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Arxiv2023 | Paper | link |
- | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | Arxiv2024 | Paper | link |
SkyEyeGPT | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Arxiv2024 | Paper | link |
EarthGPT | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Arxiv2024 | Paper | null |
SkyCLIP | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | AAAI2024 | SkyCLIP | link |
GeoChat | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | CVPR2024 | GeoChat | link |
LHRS-Bot | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Arxiv2024 | Paper | link |
H2RSVLM | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Arxiv2024 | Paper | link |
RS-LLaVA | RS-LLaVA: Large Vision Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | RS2024 | Paper | link |
SkySenseGPT | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Arxiv2024 | Paper | link |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
Seg2Sat | Seg2Sat - Segmentation to aerial view using pretrained diffuser models | Github | null | link |
- | Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps | NeurIPSW2023 | Paper | link |
GeoRSSD | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv2023 | Paper | link |
DiffusionSat | DiffusionSat: A Generative Foundation Model for Satellite Imagery | ICLR2024 | DiffusionSat | link |
CRS-Diff | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | Arxiv2024 | Paper | null |
MetaEarth | MetaEarth: A Generative Foundation Model for Global-Scale Remote Sensing Image Generation | Arxiv2024 | Paper | link |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
CSP | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | ICML2023 | CSP | link |
GeoCLIP | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | NeurIPS2023 | GeoCLIP | link |
SatCLIP | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Arxiv2023 | SatCLIP | link |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
- | Self-supervised audiovisual representation learning for remote sensing data | JAG2022 | Paper | link |
Abbreviation | Title | Publication | Paper | Code & Weights | Task |
---|---|---|---|---|---|
SS-MAE | SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification | TGRS2023 | Paper | link | Image Classification |
TTP | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Arxiv2023 | Paper | link | Change Detection |
CSMAE | Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing | Arxiv2024 | Paper | link | Image Retrieval |
RSPrompter | RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model | TGRS2024 | Paper | link | Instance Segmentation |
BAN | A New Learning Paradigm for Foundation Model-based Remote Sensing Change Detection | TGRS2024 | Paper | link | Change Detection |
- | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | Arxiv2024 | Paper | null | Change Detection (Optical & OSM data) |
AnyChange | Segment Any Change | Arxiv2024 | Paper | null | Zero-shot Change Detection |
RS-CapRet | Large Language Models for Captioning and Retrieving Remote Sensing Images | Arxiv2024 | Paper | null | Image Caption & Text-image Retrieval |
- | Task Specific Pretraining with Noisy Labels for Remote sensing Image Segmentation | Arxiv2024 | Paper | null | Image Segmentation (Noisy labels) |
RSBuilding | RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model | Arxiv2024 | Paper | link | Building Extraction and Change Detection |
SAM-Road | Segment Anything Model for Road Network Graph Extraction | Arxiv2024 | Paper | link | Road Extraction |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
GeoLLM-QA | Evaluating Tool-Augmented Agents in Remote Sensing Platforms | ICLR 2024 ML4RS Workshop | Paper | null |
RS-Agent | RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents | Arxiv2024 | Paper | null |
Abbreviation | Title | Publication | Paper | Link | Downstream Tasks |
---|---|---|---|---|---|
- | Revisiting pre-trained remote sensing model benchmarks: resizing and normalization matters | Arxiv2023 | Paper | link | Classification |
GEO-Bench | GEO-Bench: Toward Foundation Models for Earth Monitoring | Arxiv2023 | Paper | link | Classification & Segmentation |
FoMo-Bench | FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models | Arxiv2023 | FoMo-Bench | Coming soon | Classification & Segmentation & Detection for forest monitoring |
PhilEO | PhilEO Bench: Evaluating Geo-Spatial Foundation Models | Arxiv2024 | Paper | link | Segmentation & Regression estimation |
SkySense | SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | CVPR2024 | SkySense | Coming soon | Classification & Segmentation & Detection & Change detection & Multi-Modal Segmentation: Time-insensitive LandCover Mapping & Multi-Modal Segmentation: Time-sensitive Crop Mapping & Multi-Modal Scene Classification |
VLEO-Bench | Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Arxiv2024 | VLEO-bench | link | Location Recognition & Captioning & Scene Classification & Counting & Detection & Change detection |
VRSBench | VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Arxiv2024 | VRSBench | link | Image Captioning & Object Referring & Visual Question Answering |
Abbreviation | Title | Publication | Paper | Attribute | Link |
---|---|---|---|---|---|
fMoW | Functional Map of the World | CVPR2018 | fMoW | Vision | link |
SEN12MS | SEN12MS -- A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion | - | SEN12MS | Vision | link |
BEN-MM | BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval | GRSM2021 | BEN-MM | Vision | link |
MillionAID | On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances, and Million-AID | JSTARS2021 | MillionAID | Vision | link |
SeCo | Seasonal Contrast: Unsupervised Pre-Training From Uncurated Remote Sensing Data | ICCV2021 | SeCo | Vision | link |
fMoW-S2 | SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery | NeurIPS2022 | fMoW-S2 | Vision | link |
TOV-RS-Balanced | TOV: The original vision model for optical remote sensing image understanding via self-supervised learning | JSTARS2023 | TOV | Vision | link |
SSL4EO-S12 | SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation | GRSM2023 | SSL4EO-S12 | Vision | link |
SSL4EO-L | SSL4EO-L: Datasets and Foundation Models for Landsat Imagery | Arxiv2023 | SSL4EO-L | Vision | link |
SatlasPretrain | SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding | ICCV2023 | SatlasPretrain | Vision (Supervised) | link |
CACo | Change-Aware Sampling and Contrastive Learning for Satellite Images | CVPR2023 | CACo | Vision | Coming soon |
SAMRS | SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model | NeurIPS2023 | SAMRS | Vision | link |
RSVG | RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | TGRS2023 | RSVG | Vision-Language | link |
RS5M | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv2023 | RS5M | Vision-Language | link |
GEO-Bench | GEO-Bench: Toward Foundation Models for Earth Monitoring | Arxiv2023 | GEO-Bench | Vision (Evaluation) | link |
RSICap & RSIEval | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Arxiv2023 | RSGPT | Vision-Language | Coming soon |
Clay | Clay Foundation Model | - | null | Vision | link |
SATIN | SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models | ICCVW2023 | SATIN | Vision-Language | link |
SkyScript | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | AAAI2024 | SkyScript | Vision-Language | link |
ChatEarthNet | ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing | Arxiv2024 | ChatEarthNet | Vision-Language | link |
LuoJiaHOG | LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval | Arxiv2024 | LuoJiaHOG | Vision-Language | null |
MMEarth | MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning | Arxiv2024 | MMEarth | Vision | link |
SeeFar | SeeFar: Satellite Agnostic Multi-Resolution Dataset for Geospatial Foundation Models | Arxiv2024 | SeeFar | Vision | link |
FIT-RS | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Arxiv2024 | Paper | Vision-Language | link |
RS-GPT4V | RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Arxiv2024 | Paper | Vision-Language | link |
RS-4M | Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset | Arxiv2024 | RS-4M | Vision | link |
Major TOM | Major TOM: Expandable Datasets for Earth Observation | Arxiv2024 | Major TOM | Vision | link |
VRSBench | VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Arxiv2024 | VRSBench | Vision-Language | link |
(TODO. This section is dedicated to recommending more relevant and impactful projects, with the hope of promoting the development of the RS community. 😄 🚀)
Title | Link | Brief Introduction |
---|---|---|
RSFMs (Remote Sensing Foundation Models) Playground | link | An open-source playground to streamline the evaluation and fine-tuning of RSFMs on various datasets. |
Title | Publication | Paper | Attribute |
---|---|---|---|
Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | TGRS2023 | Paper | Vision & Vision-Language |
The Potential of Visual ChatGPT For Remote Sensing | Arxiv2023 | Paper | Vision-Language |
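Remote Sensing Large Models: Progress and Prospects (遥感大模型:进展与前瞻) | Geomatics and Information Science of Wuhan University 2023 | Paper | Vision & Vision-Language |
Geospatial AI Samples: Models, Quality, and Services (地理人工智能样本:模型、质量与服务) | Geomatics and Information Science of Wuhan University 2023 | Paper | - |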
Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | JSTARS2023 | Paper | Vision & Vision-Language |
Revisiting pre-trained remote sensing model benchmarks: resizing and normalization matters | Arxiv2023 | Paper | Vision |
An Agenda for Multimodal Foundation Models for Earth Observation | IGARSS2023 | Paper | Vision |
Transfer learning in environmental remote sensing | RSE2024 | Paper | Transfer learning |
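Remote Sensing Foundation Models: A Review of Development and Future Prospects (遥感基础模型发展综述与未来设想) | National Remote Sensing Bulletin 2023 | Paper | - |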
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | Arxiv2023 | Paper | Vision-Language |
Vision-Language Models in Remote Sensing: Current Progress and Future Trends | IEEE GRSM2024 | Paper | Vision-Language |
On the Foundations of Earth and Climate Foundation Models | Arxiv2024 | Paper | Vision & Vision-Language |
Towards Vision-Language Geo-Foundation Model: A Survey | Arxiv2024 | Paper | Vision-Language |
AI Foundation Models in Remote Sensing: A Survey | Arxiv2024 | Paper | Vision |
If you find this repository useful, please consider giving it a star ⭐ and citing:
@inproceedings{guo2024skysense,
title={Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery},
author={Guo, Xin and Lao, Jiangwei and Dang, Bo and Zhang, Yingying and Yu, Lei and Ru, Lixiang and Zhong, Liheng and Huang, Ziyuan and Wu, Kang and Hu, Dingxiang and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={27672--27683},
year={2024}
}