Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant include strains #843

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Jan 25, 2022

Description of proposed changes

Removes older references from the list of strains to force include in
analyses. We use the single Wuhan/Hu-1/2019 strains as the reference for
alignment and time tree rooting, so the other two strains are not
required. Importantly, these additional strains get used in proximity
calculations with priority-based subsampling leading to the unexpected
inclusion of contextual strains that look like these redundant root
sequences. We try to avoid this problem in the proximity calculations by
defining a list of strains to ignore
, but this list only includes
the strain used to root the time tree. Rather than updating this list to
include more strains, we can just remove the strains from the include
file.

Testing

  • Tested by CI
  • Tested by Cassia

Removes older references from the list of strains to force include in
analyses. We use the single Wuhan/Hu-1/2019 strains as the reference for
alignment and time tree rooting, so the other two strains are not
required. Importantly, these additional strains get used in proximity
calculations with priority-based subsampling leading to the unexpected
inclusion of contextual strains that look like these redundant root
sequences. We try to avoid this problem in the proximity calculations by
defining a list of strains to ignore [1], but this list only includes
the strain used to root the time tree. Rather than updating this list to
include more strains, we can just remove the strains from the include
file.

[1] https://github.com/nextstrain/ncov/blob/51ac8143761057c14de3841e46b82e08f7fef4b6/workflow/snakemake_rules/main_workflow.smk#L372
@huddlej huddlej added the bug Something isn't working label Jan 25, 2022
@huddlej huddlej self-assigned this Jan 25, 2022
Defines a list of strains to ignore, including the two reference
sequences we use for GISAID and GenBank builds and passes this list to
the `ignore_seqs` parameter of the proximity calculations instead of the
current build's root sequence alone.
Copy link
Collaborator

@cassiawag cassiawag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out on ncov-kenya, and the proximity sequences & priority scores were as expected. Thanks @huddlej!

@huddlej huddlej requested a review from rneher January 26, 2022 00:22
@huddlej
Copy link
Contributor Author

huddlej commented Jan 26, 2022

@rneher Does this look ok to you, too? This bit of the workflow is just complicated enough that I can imagine I'm missing something with this change...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

2 participants