Download data from S3 to start workflow #22
Conversation
Replaces workflow logic for downloading data from fauna with 1) a custom workflow that downloads from fauna, parses sequences and metadata, and uploads to S3 and 2) new main workflow logic to download parsed sequences/metadata from S3 and filter to the requested subtype before continuing the rest of the workflow. This approach keeps a separate metadata file per segment to simplify replacement of fauna download logic in the original workflow and allow existing rules that expect segment-specific metadata (e.g., add segment counts, etc.) to work without additional changes.
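To make the second half concrete, here is a minimal sketch of what the new main-workflow logic could look like. Rule names, the S3 key prefix, and the `.gz` compression are illustrative assumptions, not the exact rules in this PR; only the bucket name appears later in this description.

```
# Sketch only: pull parsed per-segment files from S3, then filter to the
# requested subtype. Paths and rule names are illustrative.
rule download_segment:
    output:
        sequences="data/sequences_{segment}.fasta",
        metadata="data/metadata_{segment}.tsv",
    shell:
        """
        aws s3 cp s3://nextstrain-data-private/files/workflows/avian-flu/sequences_{wildcards.segment}.fasta.gz - | gunzip -c > {output.sequences}
        aws s3 cp s3://nextstrain-data-private/files/workflows/avian-flu/metadata_{wildcards.segment}.tsv.gz - | gunzip -c > {output.metadata}
        """

rule filter_by_subtype:
    input:
        sequences="data/sequences_{segment}.fasta",
        metadata="data/metadata_{segment}.tsv",
    output:
        sequences="data/{subtype}/sequences_{segment}.fasta",
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --query "subtype == '{wildcards.subtype}'" \
            --output-sequences {output.sequences}
        """
```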
This doesn't have to be part of this PR, but a nicer interface to aim towards would be using a single metadata file and adding the segment counts to that file. That would simplify the Snakemake workflow a bit. I'm not sure whether metadata fields would have to be joined across the inputs (i.e., is there metadata that's only supplied for some segments and not others?).
@jameshadfield Good call. The first commit was my attempt to get S3-based data working without breaking any downstream steps in the workflow. But @trvrb had the same request for a single metadata file, so I'll try this out for this PR. Maybe we can chat tomorrow about specifics, though? In the meantime, I'll also fix the paths to input data for the CI builds.
Replaces unparsed sequences (with metadata in headers) with parsed sequences and metadata as separate files. This change allows the CI workflow to copy example data into the data directory and run the workflow from these subtype- and segment-specific files, bypassing the new download and filter-by-subtype rules. One side effect of this change is that the subtype- and segment-specific sequences and metadata now live in the `data/` directory instead of the `results/` directory, which makes this workflow more consistent with other Nextstrain workflows like Zika.
I think we can plan to merge this PR, once we're happy with it, to include a single metadata file on S3. Then, in a separate PR, we can update the workflow to use the S3 files and switch to using a single metadata file.
Updates the "upload" workflow to create a single metadata file from the 8 individual metadata files by moving the "add segment counts" rule from the main phylogenetic workflow to the upload workflow. As a result, all subtypes have segment counts in their metadata regardless of whether the "same strains" path through the phylogenetic workflow is used or not. This commit updates the phylogenetic workflow to use a single metadata file for all segments and retains the conditional input logic for adding H5 clades for specific subtypes.
upload.smk
sequences = "upload/results/sequences_{segment}.fasta", | ||
metadata = "upload/results/metadata_{segment}.tsv", | ||
params: | ||
fasta_fields = "strain virus isolate_id date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade", |
The metadata entry for `h5_clade` is also very incomplete. It's not used as a coloring and instead we're using either GISAID clade or LABEL clade. This would seem to just add confusion. How about dropping this as well?
Happy to leave this change to someone who knows the data better. I think we could merge this PR first and refine metadata in future commits/PRs, though.
Adds an initial README and moves the upload workflow Snakefile into the standard structure for an ingest workflow.
Updates the upload workflow to work as a top-level ingest workflow through a standard Snakefile entry point. As part of this standardization, this commit moves the script for adding segment counts into the ingest directory and updates the README to reflect constraints on how we need to run this workflow with the Nextstrain CLI (i.e., with the Docker runtime).
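For orientation, the resulting layout is roughly as below. Only the README, the Snakefile, and the segment-counts script are named in this PR; the script's file name and placement here are hypothetical.

```
ingest/
├── README.md                # documents the Docker-runtime requirement
├── Snakefile                # standard entry point for `nextstrain build`
└── add_segment_counts.py    # hypothetical name for the segment-counts script
```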
Avoid a situation where a user tries to run the ingest workflow with a different Nextstrain runtime that doesn't have fauna installed.
Usage shifted to ingest workflow in <#22>
Benchmarks are newly added as of <#22>
This section wasn't updated with <#22>. References to fauna are removed as they are now covered in the ingest's README
With PR #22 merged there is a single metadata TSV under data/ that can be used in the genome workflow rather than relying on the HA metadata.
Data source paths changed in #22
Description of proposed changes
Replaces workflow logic for downloading data from fauna with 1) a custom workflow that downloads from fauna, parses sequences and metadata, and uploads to S3 and 2) new main workflow logic to download parsed sequences/metadata from S3 and filter to the requested subtype before continuing the rest of the workflow.
One major change in this implementation is the replacement of one metadata file per subtype and segment with a single metadata file across all segments. The metadata file includes an `n_segments` column with the number of segment sequences available for each metadata record, which allows the original "same strains" path through the phylogenetic workflow to work.

To run the upload to S3:
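Something like the following should work (a sketch: the default target comes from the ingest Snakefile, and `--docker` is needed because fauna's tooling is only available in the Docker runtime):

```sh
# Run the ingest/upload workflow from the repository root with the Docker
# runtime; AWS credentials must be available in the environment.
nextstrain build --docker ingest
```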
See the ingest README for more details.
After the upload, S3 will have one metadata file for all subtypes and segments and one sequences file per gene segment across all subtypes, like:
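(The key prefix and compression below are illustrative; only the bucket name and the one-metadata-file, eight-sequences-files shape come from this PR.)

```
s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_pb2.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_pb1.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_pa.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_ha.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_np.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_na.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_mp.fasta.gz
s3://nextstrain-data-private/files/workflows/avian-flu/sequences_ns.fasta.gz
```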
What this means for users
The changes in this PR will be breaking changes for some users, including people who currently have credentials to access fauna but do not have AWS credentials to access the private bucket above. We will need to issue these users AWS credentials that provide at least read access to `nextstrain-data-private`, and they will need to learn how to pass those credentials to tools like the Nextstrain CLI (e.g., through the envdir argument, as sketched below).

Users who want to run the upload workflow will need read/write access to the private bucket. Ideally, we could limit the number of users who need these permissions by building the GitHub Action described in the next steps below.
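For example (a sketch; the `env.d/` directory name is an assumption — an envdir-style directory holds one file per variable, named after the variable and containing its value):

```sh
# Pass AWS credentials to the phylogenetic build via an envdir directory
# containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY files.
nextstrain build --docker --envdir env.d .
```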
Next steps
One immediate improvement to the user experience of running the "upload" workflow would be to expose it through a GitHub Action in this repository, so that running the workflow only requires an authorized GitHub user clicking a "Run" button. Once this Action is in place, it could easily be expanded to automatically trigger new phylogenetic builds when the upload completes, just as we do in the seasonal-flu workflow.
Related issue(s)
Checklist