Merge branch 'trs/reference/data-files'
tsibley committed Jan 31, 2023
2 parents 327e490 + 2186f69 commit e79ead4
Showing 2 changed files with 103 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/index.rst
@@ -91,6 +91,7 @@ team and other Nextstrain users provide assistance. For private inquiries,

reference/glossary
reference/data-formats
reference/data-files
FAQ <reference/faq>
reference/style
reference/governance
102 changes: 102 additions & 0 deletions src/reference/data-files.rst
@@ -0,0 +1,102 @@
==========
Data files
==========

.. This document started in <https://docs.google.com/document/d/118zKcgUNESszsIfXw08qbubxpA6lJF1RhaMBktY11x0> as the capturing of our decisions around and plans for publishing data files.
   Discussions and other context are linked there.
     -trs, 30 Jan 2023

We publish continually-updated data files associated with our pathogen analyses.
The files show exactly what data is going into and out of our analyses published on `nextstrain.org <https://nextstrain.org>`__.
For others repeating our analyses or doing their own, whether with Nextstrain or not, these data files are useful as ready-made starting points.

This document describes the conventions of the data files we publish: their names, organization, contents, etc.

.. XXX TODO: Discuss expectations/social contract around these files as an
   interface.  -trs, 30 Jan 2023

.. note::
    Publishing our :doc:`SARS-CoV-2 (ncov) workflow's data files <ncov:reference/remote_inputs>` led us to the goal of doing the same for our other pathogen workflows.
    This work is still in progress, and not all of the examples given below exist yet.
    Furthermore, some of the specific files available for SARS-CoV-2 do not conform to the organization below because they predate it.

At a broad level, we think about two kinds of data files:

Workflow files
    Files which correspond to several :term:`builds <build>` visible on nextstrain.org, e.g. all of the builds under <nextstrain.org/ncov/open/…>.
    These often include the full metadata table, sequences FASTA, titer matrix, etc.

    We often call these "inputs" colloquially because they're typically the top-level inputs to a :term:`workflow`, but some of the files are actually workflow-level outputs.
    (Such outputs can still be used as time-saving inputs in later workflow runs.)

Build files
    Files which correspond to a single specific :term:`build` visible on nextstrain.org, e.g. <`nextstrain.org/ncov/open/global/6m <https://nextstrain.org/ncov/open/global/6m>`__>.
    These often include the subsampled metadata table, sequences FASTA, and Newick tree as well as the final :term:`dataset` JSONs.

    We often call these "outputs" colloquially because they're produced by running a :term:`workflow`, but some of the files are actually the subsampled inputs that went into that specific build.

Workflow and build files for public data are available from:

- https\://data.nextstrain.org
- s3://nextstrain-data
- gs://nextstrain-data

using the following path structures (with `URL Template <https://datatracker.ietf.org/doc/html/rfc6570>`__-style placeholders emphasized):

.. parsed-literal::

    /files
      /workflows
        **{/workflow-repo}**    (matching github.com/nextstrain{/workflow-repo})
          **{/arbitrary-structure*}**
            /metadata.tsv.zst
            /sequences.fasta.zst
      /datasets
        **{/dataset*}**    (matching nextstrain.org{/dataset*})
          /metadata.tsv.gz
          /sequences.fasta.zst
          /tree.nwk.gz    (hypothetical)
          /clade-frequencies.tsv.gz    (hypothetical)
          /**{_dataset*}**.json    (e.g. flu_seasonal_h3n2_ha_2y.json)
          /…

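
As a rough illustration of the path structure above, the following Python sketch expands the placeholders into concrete URLs. The function names are hypothetical helpers invented for this example, not part of any Nextstrain tool, and the sketch assumes the ``{_dataset*}`` placeholder is simply the dataset path with slashes replaced by underscores (as the ``flu_seasonal_h3n2_ha_2y.json`` example suggests). The resulting URLs follow the convention only; as noted above, not every file exists yet.

```python
# One of the three published locations; the s3:// and gs:// roots share the
# same path structure.
BASE_URL = "https://data.nextstrain.org"


def workflow_file_url(workflow_repo: str, path: str, base: str = BASE_URL) -> str:
    """Build a URL under /files/workflows{/workflow-repo}{/arbitrary-structure*}.

    `path` is whatever structure the workflow itself has chosen.
    """
    return f"{base}/files/workflows/{workflow_repo.strip('/')}/{path.lstrip('/')}"


def dataset_file_url(dataset: str, filename: str, base: str = BASE_URL) -> str:
    """Build a URL under /files/datasets{/dataset*}/ for one file."""
    return f"{base}/files/datasets/{dataset.strip('/')}/{filename}"


def dataset_json_name(dataset: str) -> str:
    """Assumed expansion of {_dataset*}: slashes become underscores."""
    return dataset.strip("/").replace("/", "_") + ".json"


flu_dataset = "flu/seasonal/h3n2/ha/2y"
print(workflow_file_url("seasonal-flu", "h3n2_ha_sequences.fasta.zst"))
print(dataset_file_url(flu_dataset, "metadata.tsv.gz"))
print(dataset_file_url(flu_dataset, dataset_json_name(flu_dataset)))
```

Keeping the workflow path as a free-form argument mirrors the ``{/arbitrary-structure*}`` placeholder: only the dataset side has a fixed shape.
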
Within each :file:`/files/workflows{\{/workflow-repo\}}/…` prefix, each workflow is responsible for the organization and structure of its own files.
Naming conventions and common patterns are used when possible, but different workflows will have different requirements.
For example, the `ncov <https://github.com/nextstrain/ncov>`__ and `seasonal-flu <https://github.com/nextstrain/seasonal-flu>`__ workflows may organize their sequence inputs differently because of the different nature of analyzing SARS-CoV-2 vs. influenza:

.. parsed-literal::

    /files/workflows/ncov/**open/sequences**.fasta.zst
    /files/workflows/seasonal-flu/**h3n2_ha_sequences**.fasta.zst
    /files/workflows/seasonal-flu/**h3n2_na_sequences**.fasta.zst

Within each :file:`/files/datasets{\{/dataset*\}}/…` prefix, we intend to provide a common base set of files, e.g. :file:`metadata.tsv.gz` and :file:`sequences.fasta.zst`, across pathogens and workflows:

.. parsed-literal::

    /files
      /datasets
        /ncov/open/global/6m
          **/metadata.tsv.gz**
          /mutation-summary.tsv.gz
        /flu/seasonal/h3n2/ha/2y
          **/metadata.tsv.gz**
          /titers.tsv
        /dengue/denv2
          **/metadata.tsv.gz**

Extra files beyond the common set are OK and expected.
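
Because the common files share a name and format across datasets, the same small reader can be reused for any of them. The sketch below reads a gzip-compressed TSV using only the Python standard library. It runs against an in-memory stand-in rather than a real download, and the ``strain`` and ``date`` column names are assumptions made for illustration; this document does not specify the columns of :file:`metadata.tsv.gz`.

```python
import csv
import gzip
import io


def read_metadata(fileobj):
    """Yield one dict per row from a gzip-compressed TSV.

    `fileobj` is a binary file-like object, e.g. an open local file or a
    downloaded response body wrapped in io.BytesIO.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text, delimiter="\t")


# In-memory stand-in for a metadata.tsv.gz file (hypothetical columns).
sample = gzip.compress(b"strain\tdate\nA/1\t2023-01-30\nA/2\t2023-01-31\n")
rows = list(read_metadata(io.BytesIO(sample)))
print(rows)
```

The :file:`.zst` files would need a Zstandard decompressor instead, which is not part of the standard library in most Python versions.
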

Although we strive to use fully `open data <https://opendatahandbook.org/guide/en/what-is-open-data/>`__ whenever possible, we cannot always redistribute the data we use.
Files containing private or otherwise restricted data are stored in access-restricted locations with the same structure as above, e.g.:

.. parsed-literal::

    s3://nextstrain-data/files/workflows/ncov/**open**/metadata.tsv.gz
    s3://nextstrain-data/files/datasets/ncov/**open**/global/6m/metadata.tsv.gz
    s3://nextstrain-data-**private**/files/workflows/ncov/**gisaid**/metadata.tsv.gz
    s3://nextstrain-data-**private**/files/datasets/ncov/**gisaid**/global/6m/metadata.tsv.gz
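
Since the public and restricted locations share one structure, code that locates a file only needs to switch the bucket. A minimal sketch, assuming the two bucket names shown in the examples above; the ``s3_url`` helper and its ``restricted`` flag are hypothetical names for this illustration:

```python
PUBLIC_BUCKET = "nextstrain-data"
PRIVATE_BUCKET = "nextstrain-data-private"


def s3_url(relative_path: str, restricted: bool = False) -> str:
    """Return the s3:// URL for a path below /files.

    `relative_path` starts at "workflows/…" or "datasets/…"; the structure
    is identical in both buckets, so only the bucket name changes.
    """
    bucket = PRIVATE_BUCKET if restricted else PUBLIC_BUCKET
    return f"s3://{bucket}/files/{relative_path.lstrip('/')}"


print(s3_url("workflows/ncov/open/metadata.tsv.gz"))
print(s3_url("workflows/ncov/gisaid/metadata.tsv.gz", restricted=True))
```

Reading from the restricted bucket still requires appropriate credentials; the helper only builds the location.
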
