Use new augur clades functionality #660

Closed
wants to merge 16 commits into from

jameshadfield
Member

This commit is a WIP commit to test the new functionality
being introduced in augur PR 728 [1]. This allows us to
simplify the nCoV workflow as we can explicitly define the
attribute names used for clade membership and branch
labelling.

The release of 728 will come with a new major version of
augur, and this workflow's requirements should be updated
accordingly.

These changes have only been tested for the "open" build,
which itself is a WIP.

[1] nextstrain/augur#728

tsibley and others added 16 commits May 27, 2021 11:33
All callers were removed in "Update syntax for multiple inputs and allow
downloading" (40ae575).
…hmark files

This rule covers multiple potential "origins" (including GISAID).
This both documents the accepted values and lets Snakemake validate the
config automatically.
…nfig keys

The patterns parallel the wildcard constraint declarations.

Superseded by changes in "Update clade definitions for emerging clades"
(184e25c).  I believe this line was missed for removal during a merge.

Workflow inputs (metadata + sequences) will soon be provisioned by
ncov-ingest under data.nextstrain.org/files/ncov/open/ and downloaded
from there by this profile config.  Input data is currently sourced just
from GenBank/INSDC, but in time will grow to include other open data
sources, such as COG-UK.

Based on the nextstrain-genbank profile, renamed to nextstrain-open to
reflect the broader scope.

We will start with major regional builds and leave state-level builds to
other groups.

Removes params that are already defaults in the workflow and do not need
to be in this config.

As this is specific to this profile, intended for internal use,
documenting within builds.yaml felt appropriate.

Any defined build sizes will create separate builds with modified
names. Parameters for `augur traits` are defined per build name,
and thus we wish to duplicate these so that they match the builds
created for each build size.
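
The duplication described in this commit message can be sketched roughly as follows. This is only an illustration, not the workflow's actual code: the function name, the `"<build>_<size>"` naming scheme, and the example parameter values are all assumptions made for the sketch.

```python
from copy import deepcopy


def expand_traits_for_build_sizes(traits, build_sizes):
    """Copy each build's traits settings to every size-specific build name.

    The "<build>_<size>" naming scheme is hypothetical; the workflow's real
    modified names may differ.
    """
    expanded = {}
    for build_name, params in traits.items():
        for size in build_sizes:
            # deepcopy so that later edits to one build do not leak into another
            expanded[f"{build_name}_{size}"] = deepcopy(params)
    return expanded


# Hypothetical per-build parameters for `augur traits`.
traits = {"global": {"sampling_bias_correction": 2.5, "columns": ["region"]}}
expanded = expand_traits_for_build_sizes(traits, ["standard", "mini"])
```

Each size-specific build ends up with its own independent copy of the parameters, so one build's settings can later be tuned without affecting the others.
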
For open builds, the `{trait}_exposure` metadata is identical to the
`{trait}` value. Thus we can skip the travel history adjustment
rule. This necessitates updates to which values we use for DTA.

Namespaces the Auspice JSONs from just ncov_* into ncov_gisaid_* and
ncov_open_*, which will result in URL changes from, e.g. /ncov/global to
/ncov/gisaid/global and /ncov/open/global.

The results for trial builds are also slightly renamed to include this
namespacing *before* the "trial_${trial_name}" prefix.
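
As a rough illustration of the renaming above, the sketch below builds a namespaced dataset filename and derives its URL path. It assumes that each underscore in the dataset filename becomes a slash in the URL, which is consistent with the examples given in the commit message; the helper names are hypothetical, and trial builds and sidecar files are deliberately not handled.

```python
def namespaced_filename(origin, build_name):
    """Return the Auspice JSON filename for a build under an origin namespace.

    `origin` is expected to be "gisaid" or "open" (names taken from the
    commit message); the helper itself is hypothetical.
    """
    return f"ncov_{origin}_{build_name}.json"


def dataset_url(json_filename):
    """Map a dataset filename to a URL path by turning '_' into '/'."""
    stem = json_filename
    if stem.endswith(".json"):
        stem = stem[: -len(".json")]
    return "/" + stem.replace("_", "/")


old_url = dataset_url("ncov_global.json")
new_urls = [dataset_url(namespaced_filename(o, "global")) for o in ("gisaid", "open")]
```
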
Internal nextstrain workflows typically generate many datasets. Currently we tend to use a single auspice config JSON for each dataset, despite these configs being essentially identical. Furthermore, the "nextstrain-open" profile was using the config files from the main (GISAID) profile, which were not well suited to the metadata available for the open builds. (Note that the config file deleted in this commit was never being used.)

Here we move to generating the auspice configs via a rule, which has a number of advantages. It is now easy to see which config fields are shared and which differ across the builds in a profile; comments (which are allowed since it's JavaScript) also help with understanding. A rule allows us to easily have different settings for different builds, and generation may become dynamic in the future. Finally, it helps prevent the config files for different builds from diverging unintentionally.

Currently this is implemented for the nextstrain-open profile, but future work will extend this to the (GISAID) nextstrain profile. There will be a common "base" config which can be imported by both in this case.

For users running few builds, it's preferable to avoid this complexity and stick with the config-files approach we currently describe in the tutorials. Workflows with many targets may wish to add their own rules similar to that done here.
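
The general idea of generating per-build configs from a shared base can be sketched as below. This is a minimal Python illustration rather than the rule added in this commit: `BASE_CONFIG`, `BUILD_OVERRIDES`, both function names, and all field values are assumptions chosen for the example.

```python
import copy
import json

# Hypothetical settings shared by every build in a profile.
BASE_CONFIG = {
    "title": "Example SARS-CoV-2 build",
    "colorings": [{"key": "region", "title": "Region", "type": "categorical"}],
    "filters": ["region", "country"],
}

# Hypothetical per-build differences, keyed by build name.
BUILD_OVERRIDES = {
    "global": {},
    "africa": {"title": "Example SARS-CoV-2 build (Africa-focused)"},
}


def auspice_config(build_name):
    """Return the base config with any overrides for this build applied."""
    config = copy.deepcopy(BASE_CONFIG)  # never mutate the shared base
    config.update(copy.deepcopy(BUILD_OVERRIDES.get(build_name, {})))
    return config


def render_auspice_config(build_name):
    """Serialize the build's config as the JSON text a rule could write out."""
    return json.dumps(auspice_config(build_name), indent=2)
```

Keeping the overrides in one mapping makes the differences between builds visible at a glance, and anything not overridden stays identical across builds by construction.
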
The GitHub Actions UI rolls up each step's output, so this will make it
easier to quickly see the build info by avoiding the need to scroll past
all the build launching output first.

Separate workflow jobs so that they can be independently managed in the
GitHub Actions UI.

Copied and pasted the "nextstrain build" invocations (instead of using,
e.g., a shared script or YAML anchor) so they can be independently tweaked
as needed in the future.  For example, I'm starting with the same resources
for each, but that's probably unnecessary right now and we may want to
tune it sooner rather than later.
@rneher
Member

rneher commented Apr 7, 2023

Superseded by #1000

@rneher rneher closed this Apr 7, 2023
4 participants