Skip to content

fix: don't count mutations at masked sites towards branch length#1779

Merged
rneher merged 3 commits intomasterfrom
fix/mutation-at-masked-sites
Mar 26, 2025
Merged

fix: don't count mutations at masked sites towards branch length#1779
rneher merged 3 commits intomasterfrom
fix/mutation-at-masked-sites

Conversation

@rneher
Copy link
Member

@rneher rneher commented Mar 21, 2025

augur refine has the mode --divergence-units mutations in which case branch length are given as number of mutations in an ancestral reconstruction. We ignore gaps and ambiguous sites in this calculation as we should, but in the edge case when there is no information at all (site is N everywhere), we count a large number of spurious mutations. This happens when the root state is picked by from the probability distribution at random, and other states are set to the most likely state. @corneliusroemer discovered this in the case of mpox where about 10k sites are masked. The probability distribution for these 10k states in [0.2, 0.2, 0.2, 0.2, 0.2]. The root was picked at random, other states via np.argmax([0.2, 0.2, 0.2, 0.2, 0.2]). If these differed, this generated 10k unwanted mutations.

This PR excludes those sites from the branch length calculation. A list of all these sites is anyway available within treetime.

@codecov
Copy link

codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.23%. Comparing base (bbadca3) to head (0d4eed4).
Report is 161 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1779      +/-   ##
==========================================
+ Coverage   73.21%   73.23%   +0.02%     
==========================================
  Files          79       79              
  Lines        8395     8402       +7     
  Branches     1713     1715       +2     
==========================================
+ Hits         6146     6153       +7     
  Misses       1960     1960              
  Partials      289      289              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

huddlej added 2 commits March 24, 2025 11:13
Adds a functional test for augur refine's behavior when one or more
columns in the alignment are all Ns. In this case, the columns should
get ignored in the final branch length calculations as ambiguous
positions. This test adds an alternate version of the Zika alignment and
the expected branch lengths JSON where the third column of the alignment
has been changed to all Ns. Since this column originally had a variant
in one sample that would have changed the overall branch lengths, this
change produces different a node data JSON output with a shorter branch
length.
Adds a comment to clarify why we can break from the loop through all
positions in TreeTime's compressed alignment data structure. For people
who are not familiar with that data structure, the appearance of a break
in the for loop could appear to be a bug.
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, @rneher. I pushed commits to add a test and comment the logic related to the compressed alignment data structure which should help future developers who aren't as familiar with TreeTime internals.

@rneher
Copy link
Member Author

rneher commented Mar 26, 2025

Thanks, @huddlej

@rneher rneher merged commit 04d1a00 into master Mar 26, 2025
36 checks passed
@rneher rneher deleted the fix/mutation-at-masked-sites branch March 26, 2025 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants