A repository of publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety government organizations, researchers, businesses, and individuals.
When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. This repository exists to help people promote and discover datasets that are available via AWS resources.
Each dataset in this repository is described with metadata saved in a YAML file in the /datasets directory. We use these YAML files to provide three services:
- A Registry of Open Data on AWS browser.
- A hosted YAML file listing all of the dataset entries.
- Hosted YAML files for each dataset.
The YAML files use this structure:
Name:
Description:
Contact:
ManagedBy:
UpdateFrequency:
Tags:
-
License:
Resources:
- Description:
ARN:
Region:
Type:
DataAtWork:
- Title:
URL:
AuthorName:
AuthorURL:
The metadata required for each dataset entry is as follows:
Field | Type | Description |
---|---|---|
Name | String | The public facing name of the dataset |
Description | String | A high-level description of the dataset |
Documentation | URL | A link to documentation of the dataset |
Contact | String | May be an email address, a link to contact form, a link to GitHub issues page, or any other instructions to contact the producer of the dataset |
ManagedBy | String | The name of the organization who is responsible for the data ingest process |
UpdateFrequency | String | An explanation of how frequently the dataset is updated |
Tags | List of strings | Tags that topically describe the dataset. A list of supported tags is maintained in the tags.yaml file in this repo. If you want to recommend a tag that is not included in tags.yaml, please submit a pull request to add it to that file. |
License | String | An explanation of the dataset license and/or a URL to more information about data terms of use of the dataset |
Resources | List of lists | A list of AWS resources that users can use to consume the data. Each resource entry requires the metadata below: |
Resources > Description | String | A technical description of the data available within the AWS resource, including information about file formats and scope. |
Resources > ARN | String | Amazon Resource Name for resource, e.g. arn:aws:s3:::commoncrawl |
Resources > Region | String | AWS region unique identifier, e.g. us-east-1 |
Resources > Type | String | Can be CloudFront Distribution, DB Snapshot, S3 Bucket, or SNS Topic. A list of supported resources is maintained in the resources.yaml file in this repo. If you want to recommend a resource that is not included in resources.yaml, please submit a pull request to add it to that file. |
Resources > RequesterPays (Optional) | Boolean | Only appropriate for Amazon S3 buckets, indicates whether the bucket has Requester Pays enabled or not. |
DataAtWork (Optional) | List of lists | A list of links to example usages of the data. Example usages can be code samples, tutorials, demos, or applications. |
DataAtWork > Title | String | The title of the example usage of the data. |
DataAtWork > URL | URL | A link to the example. |
DataAtWork > AuthorName | String | Name of person or entity that created the example. |
DataAtWork > AuthorURL | String | (Optional) URL for person or entity that created the example. |
Note also that we use the name of each YAML file as the URL slug for each dataset on the Registry of Open Data on AWS website. E.g. the metadata from 1000-genomes.yaml
is listed at https://registry.opendata.aws/1000-genomes/
Here is an example of the metadata behind this dataset registration: https://registry.opendata.aws/gdelt/
Name: Global Database of Events, Language and Tone (GDELT)
Description: |
This project Project monitors the world's broadcast, print,
and web news from nearly every corner of every country in
over 100 languages and identifies the people, locations,
organizations, counts, themes, sources, emotions, counts,
quotes, images and events driving our global society every
second of every day.
Documentation: http://www.gdeltproject.org/
Contact: http://www.gdeltproject.org/about.html#contact
UpdateFrequency: Daily
Tags:
- events
License: http://www.gdeltproject.org/about.html#termsofuse
Resources:
- Description: Project data files
ARN: arn:aws:s3:::gdelt-open-data
Region: us-east-1
Type: S3 Bucket
- Description: Notifications for new data
ARN: arn:aws:sns:us-east-1:928094251383:gdelt-csv
Region: us-east-1
Type: SNS Topic
DataAtWork:
- Title: Exploring GDELT with Athena
URL: http://blog.julien.org/2017/03/exploring-gdelt-data-set-with-amazon.html
AuthorName: Julien Simon
AuthorURL: https://twitter.com/julsimon
- Title: Running R on Amazon Athena
URL: https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
AuthorName: Gopal Wunnava
AuthorURL: https://www.linkedin.com/in/gopal-wunnava-b11a77/
- Title: Bootstrapping GeoMesa HBase on AWS S3
URL: http://www.geomesa.org/documentation/tutorials/geomesa-hbase-s3-on-aws.html
AuthorName: Commonwealth Computer Research, Inc.
AuthorURL: https://www.ccri.com
- Title: Creating PySpark DataFrame from CSV in AWS S3 in EMR
URL: https://gist.github.com/jakechen/6955f2de51212163312b6430555b8e0b
AuthorName: Jake Chen
AuthorURL: https://github.com/jakechen
You are welcome to contribute dataset entries or usage examples to the Registry of Open Data on AWS. Please review our contribution guidelines.