OpenSentinelMap

The OpenSentinelMap dataset contains Sentinel-2 imagery and per-pixel semantic label masks derived from OpenStreetMap. It is described in this paper.

this is an overview image

Data Access

Azure Blob Storage (Free)

The dataset may be freely downloaded from Azure Blob Storage:

Spatial Cell Metadata

OSM Label Categories

OSM Rasterized Label Images

OSM Sentinel Imagery 2017

OSM Sentinel Imagery 2018

OSM Sentinel Imagery 2019

OSM Sentinel Imagery 2020

EuroSAT Sentinel L2A Resamples

AWS (Paid)

As a backup option, or for faster download speeds, the dataset is also available on Amazon S3. The following command downloads it, but be aware that Amazon will charge your AWS account about $40 in data transfer fees (about 9 cents per GB for 445 GB in total). NOTE: This option will be deprecated soon in favor of Azure Blob Storage.

aws s3 cp s3://vsi-open-sentinel-map/ ./open-sentinel-map --recursive --request-payer

Data Format

Imagery

Image data is separated by year from 2017 to 2020. Each year's worth of Sentinel imagery is compressed into an osm_sentinel_imagery_{YEAR}.tgz file. These files can be untarred using the following command:

tar -xvzf osm_sentinel_imagery_{YEAR}.tgz

The untarred folders of sentinel imagery will have the format

MGRS_TILE/
    SPATIAL_CELL/
        {ID}_{YEAR}.npz

where each .npz file is a compressed NumPy archive containing the 32-bit float Bottom-of-Atmosphere imagery data. The file can be loaded from Python using the numpy.load function, and the bands accessed via their keys. The bands are grouped by spatial resolution and are accessible using the key "gsd_{RESOLUTION}" (i.e. "gsd_10", "gsd_20", "gsd_60").
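
The loading step described above can be sketched as follows, assuming NumPy is installed. The file written here is a synthetic stand-in with made-up array shapes so that the snippet runs without the dataset; the helper name `load_gsd_arrays` is hypothetical, and the array layout of the real files is not specified here, so inspect the printed shapes before indexing.

```python
import os
import tempfile

import numpy as np

GSD_KEYS = ("gsd_10", "gsd_20", "gsd_60")


def load_gsd_arrays(path):
    """Return a dict of the resolution-grouped arrays stored in one .npz file."""
    with np.load(path) as data:
        # Indexing the NpzFile reads each array from disk.
        return {key: data[key] for key in GSD_KEYS if key in data.files}


# Synthetic stand-in for a real {ID}_{YEAR}.npz file (shapes are made up).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_2019.npz")
np.savez_compressed(
    demo_path,
    gsd_10=np.zeros((4, 8, 8), dtype=np.float32),
    gsd_20=np.zeros((6, 4, 4), dtype=np.float32),
    gsd_60=np.zeros((2, 2, 2), dtype=np.float32),
)

arrays = load_gsd_arrays(demo_path)
for key, array in arrays.items():
    print(key, array.dtype, array.shape)
```

Replacing `demo_path` with the path to an extracted image file should list the dtype and shape of each resolution group in that file.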

Note: The original Sentinel-2 data is stored as unsigned 16-bit integers. Our dataset converts these to 32-bit floats and applies the Sentinel-2 scaling factor (division by 10,000) to retrieve surface reflectance values. Although this should result in values from 0 to 1, some values will exceed 1 due to small errors in the data. We decided to keep these values greater than 1 for training robustness.
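
For illustration, the conversion described in this note can be sketched as below. This is not the dataset's actual preprocessing code: the function name `to_reflectance` and the sample values are made up, and the clipping step is an optional choice for users who need inputs bounded to [0, 1], not something the dataset applies.

```python
import numpy as np

SCALE = 10_000.0  # Sentinel-2 scaling factor quoted in the note above


def to_reflectance(digital_numbers):
    """Convert raw unsigned 16-bit values to 32-bit float reflectance."""
    return digital_numbers.astype(np.float32) / np.float32(SCALE)


raw = np.array([0, 2_500, 10_000, 12_000], dtype=np.uint16)  # made-up values
reflectance = to_reflectance(raw)

# Optional: clamp to [0, 1] if a downstream model expects bounded inputs.
clipped = np.clip(reflectance, 0.0, 1.0)
print(reflectance)
print(clipped)
```

The last sample value (12,000) maps to a reflectance above 1, which is the kind of value the dataset keeps as-is.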

The "gsd_10" array bands have the order blue, green, red, and then NIR. The "gsd_20" bands have 4 vegetation red edge bands, followed by two SWIR bands. The "gsd_60" array consists of the coastal aerosol and water vapour bands. The exact corresponding bands from the Sentinel-2 platform are listed in the table below. Find more information about these spectral bands here.

| Data Key | Sentinel-2 Bands |
| --- | --- |
| gsd_10 | B02, B03, B04, B08 |
| gsd_20 | B05, B06, B07, B8A, B11, B12 |
| gsd_60 | B01, B09 |
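
The table above can be turned into a small lookup that maps a Sentinel-2 band name to its data key and its position within that group. This is a sketch: `BANDS_BY_KEY` and `locate_band` are hypothetical names, and which array axis holds the bands is not stated here, so check the array shape before applying the returned index.

```python
BANDS_BY_KEY = {
    "gsd_10": ("B02", "B03", "B04", "B08"),
    "gsd_20": ("B05", "B06", "B07", "B8A", "B11", "B12"),
    "gsd_60": ("B01", "B09"),
}


def locate_band(band_name):
    """Return (data key, position within that group) for a Sentinel-2 band name."""
    for key, bands in BANDS_BY_KEY.items():
        if band_name in bands:
            return key, bands.index(band_name)
    raise KeyError(f"{band_name} is not stored in the dataset")


print(locate_band("B08"))  # NIR, last band of the 10 m group
print(locate_band("B11"))  # first SWIR band of the 20 m group
```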

In addition to the "gsd_*" bands, the image files contain an "scl" band. The "scl" band contains the Scene Classification Layer values, which indicate the quality of each pixel at 20 m resolution. These values are described in the table below.

| Label | Classification |
| --- | --- |
| 0 | NO_DATA |
| 1 | SATURATED_OR_DEFECTIVE |
| 2 | CAST_SHADOWS |
| 3 | CLOUD_SHADOWS |
| 4 | VEGETATION |
| 5 | NOT_VEGETATED |
| 6 | WATER |
| 7 | UNCLASSIFIED |
| 8 | CLOUD_MEDIUM_PROBABILITY |
| 9 | CLOUD_HIGH_PROBABILITY |
| 10 | THIN_CIRRUS |
| 11 | SNOW or ICE |
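
One possible use of the "scl" values is a per-pixel mask that hides unwanted classes. The sketch below assumes NumPy. The set of excluded classes is an illustrative choice, not necessarily the definition of bad data used to build the dataset, and the sample array is made up.

```python
import numpy as np

# Illustrative choice of classes to exclude; not necessarily the authors' definition.
EXCLUDED_SCL = (0, 1, 3, 8, 9, 10)


def usable_pixel_mask(scl, excluded=EXCLUDED_SCL):
    """Return a boolean array that is True where the SCL class is not excluded."""
    return ~np.isin(scl, excluded)


scl = np.array([[4, 4, 9], [6, 0, 5]], dtype=np.uint8)  # made-up SCL values
mask = usable_pixel_mask(scl)
print(mask)
print("usable fraction:", mask.mean())
```

Because the "scl" band is at 20 m resolution, the mask lines up directly with the "gsd_20" arrays only; applying it to the other groups would require resampling.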

The image files also contain a "bad_percent" key, which is a float value between 0 and 1 giving the fraction of pixels within the "scl" band that we've determined to be bad data. Currently we filter out images in which more than 25% of the pixels are bad data. You can use this key to filter the dataset using a different threshold.
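
Filtering with a stricter threshold might look like the sketch below, assuming NumPy. The helper name `keep_image`, the 0.10 threshold, and the synthetic files are all illustrative; the snippet also assumes that "bad_percent" is stored as a single scalar value, which should be verified against a real file.

```python
import os
import tempfile

import numpy as np


def keep_image(path, max_bad=0.10):
    """Return True if the file's "bad_percent" value is at most max_bad."""
    with np.load(path) as data:
        # .item() converts a 0-d or single-element array to a Python float.
        bad = np.asarray(data["bad_percent"]).item()
    return bad <= max_bad


# Synthetic stand-ins for real image files (values are made up).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, bad in [("a_2019.npz", 0.02), ("b_2019.npz", 0.20)]:
    path = os.path.join(tmp_dir, name)
    np.savez_compressed(path, bad_percent=np.float32(bad))
    paths.append(path)

kept = [os.path.basename(p) for p in paths if keep_image(p)]
print(kept)
```

Since images above 25% have already been removed from the dataset, only thresholds below 0.25 change which files are kept.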

Annotations

The label images can be untarred using the command

tar -xvzf osm_label_images.tgz

These images are in PNG format, with label values as described in the osm_categories.json file.

Auxiliary Data

The spatial_cell_info CSV file contains metadata for each spatial cell: the lon/lat bounds, the MGRS tile it is within, and the training split it belongs to. Note that the current data split was performed at the MGRS tile level to prevent data leakage. Use caution if performing your own train/test split.
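
If you do create your own split, one way to keep it at the MGRS tile level is to assign whole tiles, rather than individual cells, to each split. The sketch below uses only the Python standard library and relies on the MGRS_TILE/SPATIAL_CELL/{ID}_{YEAR}.npz folder layout. The hash-based assignment is an illustrative technique, not the method used to produce the provided split, and the tile, cell, and file names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def tile_of(relative_path):
    """Return the MGRS tile, the first component of MGRS_TILE/SPATIAL_CELL/file."""
    return PurePosixPath(relative_path).parts[0]


def split_by_tile(relative_paths, test_fraction=0.2):
    """Assign every file of a tile to the same split using a stable hash."""
    splits = defaultdict(list)
    for path in relative_paths:
        digest = hashlib.sha256(tile_of(path).encode("utf-8")).digest()
        # Map the first hash byte to [0, 1) so assignment is deterministic.
        bucket = digest[0] / 256.0
        splits["test" if bucket < test_fraction else "train"].append(path)
    return dict(splits)


# Hypothetical tile, cell, and file names that follow the folder layout.
paths = [
    "10SEG/101/101_2019.npz",
    "10SEG/102/102_2019.npz",
    "33UUP/205/205_2019.npz",
]
splits = split_by_tile(paths)
print(splits)
```

Hashing the tile name keeps the assignment stable across runs and guarantees that two cells from the same tile never land in different splits.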

The osm_categories JSON file details the exact mapping from OpenStreetMap tags to OpenSentinelMap labels.

Licenses

This dataset is made available under the MIT license, freely available for both academic and commercial use.

Access to Sentinel data is free, full and open for the broad Regional, National, European and International user community. View Terms and Conditions.

OpenStreetMap® is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF).

Contact

email

How to Cite

bibtex:

@InProceedings{Johnson_2022_CVPR,
    author    = {Johnson, Noah and Treible, Wayne and Crispell, Daniel},
    title     = {OpenSentinelMap: A Large-Scale Land Use Dataset Using OpenStreetMap and Sentinel-2 Imagery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {1333-1341}
}

Acknowledgements

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2021-2011000004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.