Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to nextclade v3 & update default dataset tags #375

Merged
merged 24 commits into from
Apr 4, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Mar 5, 2024

Still have lots of work to do, but wanted to at least start a draft PR so folks can see progress.

This PR closes #349

🗑️ This dev branch should be deleted after merging to main. (subject to change)

🧠 Aim, Context and Functionality

This PR updates all TheiaCov workflows to use nextclade v3.

A new task was added for nextclade v3 specifically.

CI has undergone major updates to account for these changes AND to re-enable the theiacov_illumina_pe and theiacov_illumina_se CI workflows from running (they were disabled back in July 2023)

🛠️ Impacted Workflows/Tasks & Changes Being Made

All TheiaCov workflows are impacted across all organisms (sars-cov-2, Mpox, Flu, RSV-A, RSV-B) that we support and run through nextclade

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes*

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : Yes

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: upgraded to us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1 which was originally copied from Nextstrain's image on dockerhub

Databases or database versions changed: All nextclade_dataset_tags have been updated for all organisms

Data processing/commands changed: Details on changes from v2 ➡️ v3 can be found here: https://docs.nextstrain.org/projects/nextclade/en/stable/user/migration-v3.html#dataset-file-format-and-dataset-names-have-changed

The dataset files qc.jsonprimers.csv and virus_properties.json are now merged into a new file pathogen.json.

Dataset names have changed. There is no longer a separation to namereference and other attributes. The datasets are now uniquely identified by a path-like name, which corresponds to the path of the dataset in the [data repo](https://github.com/nextstrain/nextclade_data/tree/master/data).

File processing changed: The nextclade_output_parsing has not changed. The output TSV of nextclade v3 was readly parsed by the existing WDL task

Compute resources changed: No

➡️ Inputs

  • String? nextclade_dataset_reference has been removed from all TheiaCov workflows (and the organism_param subwf) as it is not longer required
  • String sc2_nextclade_ds_name = "nextstrain/sars-cov-2/wuhan-hu-1/orfs" is now a path-like name instead of the names like "sars-cov-2". Those names are still kept as shortcuts, but for futureproofing sake - I have changed these to the new path-like names for all organisms

⬅️ Outputs

  • outputs of all workflows have not changed, just reference from the nextclade_v3 task outputs instead of the old nextclade task.

🧪 Testing

Test Dataset

I am testing in Terra using our validation datasets that we used to validate new versions of PHB.

Commandline Testing with MiniWDL or Cromwell (optional)

Not going to show cmdline testing, everything will be tested in Terra

Terra Testing

Suggested Scenarios for Reviewer to Test

Need to test all organisms to ensure functionality exists for all organisms.

It seems a bit extreme, but I'd like to test all impacted workflows since I'd rather catch bugs now, instead of the 11th hour before the v2.0.0 release. SO I've created a terra workspace specifically for this testing https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history

Theiagen Version Release Testing (optional)

  • Will changes require functional or validation testing (checking outputs etc) during the release? yes, functional
  • Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. No
  • Are there any output files that should be checked after running the version release testing? nextclade output columns need to be reviewed

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes Does not required changing the diagrams

@kapsakcj
Copy link
Contributor Author

I created a clone of the PHB_Validation_TEMPLATE workspace specifically for testing these changes: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history

@kapsakcj kapsakcj marked this pull request as ready for review March 26, 2024 13:41
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Mar 26, 2024

Marking as ready for review to get the review process started.

I'm working on pulling a couple RSV-A and RSV-B genomes for testing (since the 2 I tested with didn't so well - failed to align well with the nextclade references) so I will test these and post a workflow link when it's completed.

2 RSV-A and 2 RSV-B samples tested via TheiaCoV_FASTA here: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history/1da2324d-436e-4b60-a79e-04bceb62a6f7

Other things to note about this PR

  • CI has been updated, including the enabling of TheiaCoV_Illumina_PE and TheiaCoV_IlluminaSE testing via miniwdl/pytest matrix. These were disabled previously because they were broken and we couldn't figure out how to fix them at the time. So I've fixed them and re-enabled them.
  • .gitignore has been updated to ignore common outputs from miniwdl 2024* and _LAST as well as cromwell* which ignores all cromwell run outputs

@sage-wright
Copy link
Member

sage-wright commented Mar 28, 2024

Conflicts need resolution, but proceeding with tests anyways:

  • Validation datasets:
  • TheiaCoV_FASTA_PHB here
  • TheiaCoV_ClearLabs_PHB here
  • TheiaCoV_Illumina_PE_PHB here
  • TheiaCoV_Illumina_SE_PHB here
  • TheiaCoV_ONT_PHB here

Failures happening because data got moved 😭

~{"--input-tree " + auspice_reference_tree_json} \
~{"--input-pathogen-json " + nextclade_pathogen_json} \
~{"--input-annotation " + gene_annotations_gff} \
~{"--input-pcr-primers " + pcr_primers_csv} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks as if --input-pcr-primers was also removed as an input flag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the --input-root-seq flag (and associated root_sequencetask input) were removed from the task, but looks as if these were just renamed in nextclade v3 ^^ info in same docs linked above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thank you. I will remove the --input-pcr-primers and add the --input-root-seq to --input-ref as specified in their docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in 2b7470a

reference_tree_json = reference_tree_json,
qc_config_json = qc_config_json,
nextclade_pathogen_json = nextclade_pathogen_json,
gene_annotations_gff = gene_annotations_gff,
pcr_primers_csv = pcr_primers_csv,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment above RE input removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in 0800fa0

@kevinlibuit
Copy link
Contributor

Details for days! That was a beast to review. Thanks for all of the documentation and specific workspace for validation runs. I've combed through all of your runs and feel confident in these code changes; also launched a few runs myself within the same workspace for sanity's sake.

Just two points to address:

  1. Made a few comments regarding input name changes/removals that seem a bit out of sync with respect to nextclade v3 migration guidance -- and because this will only impact optional parameters, I think we can verify things in running a single workflow rather than needing to re-run the whole testing dataset.
  2. I don't see a functional test for workflows/phylogenetics/wf_nextclade_addToRefTree.wdl. It's the same underlying changes being made so I don't anticipate any functional difference, but I would like to see a successful fun before merging this PR.

Once these are resolved we can move forward with a merge!

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Apr 4, 2024

Thanks for the thorough review, kevin.

Made a few comments regarding input name changes/removals that seem a bit out of sync with respect to nextclade v3 migration guidance -- and because this will only impact optional parameters, I think we can verify things in running a single workflow rather than needing to re-run the whole testing dataset.
I don't see a functional test for workflows/phylogenetics/wf_nextclade_addToRefTree.wdl. It's the same underlying changes being made so I don't anticipate any functional difference, but I would like to see a successful fun before merging this PR.

I'll update CI first, then...

I'll launch a function test of the nextclade_addToRefTree workflow as well as TheiaCoV_FASTA_PHB on the various organisms in our validation dataset to ensure everything runs as expected.

EDIT:

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Apr 4, 2024

OK everything should be resolved, I think we are ready for approval and merge

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Apr 4, 2024

So FYI the nextclade run --input-pcr-primers option is available in v3.1.0 and above: nextstrain/nextclade#1432 (comment)

Should we add it back to the WDL task or leave as is? I don't know of any of our users that utilize this option, so I'm tempted to leave it out.

@kevinlibuit
Copy link
Contributor

So FYI the nextclade run --input-pcr-primers option is available in v3.1.0 and above: nextstrain/nextclade#1432 (comment)

Should we add it back to the WDL task or leave as is? I don't know of any of our users that utilize this option, so I'm tempted to leave it out.

That's fair. Let's get this closed and avoid scope creep; changes made in this PR address #349 for PHB users. Can you open a new issue for us to add that option back and we can discuss internally when to prioritize things

@kevinlibuit kevinlibuit merged commit 51274c7 into main Apr 4, 2024
21 checks passed
@kapsakcj kapsakcj deleted the cjk-nextclade-v3 branch April 17, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrade TheiaCov workflows to use nextclade v3.x
3 participants