feat: allow reconstruction of amino acid sequences #1258
Conversation
Pathogen workflows are currently failing because the node data JSONs this produces don't contain the reconstructed sequences. I'll re-enable that.
Tests now fail with
which I consider a problem with the tests...
anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']),
                                   'strand': '+', 'type': 'source'}}
What's the purpose of `type: source`?
This change is the reason for the failing tests. The test can be easily changed to pass again, but I want to understand the reasoning first.
`source` is the name of the sequence with respect to which annotations in GenBank files are made. If you look at this file, for example, you see a `source` field at the beginning of the annotation block. In GFF these are called landmarks. In augur translate we also assign the feature type `source`:
https://github.com/nextstrain/augur/blob/master/augur/translate.py#L395-L400
I added this here just in analogy to `translate`; I don't think Auspice requires it.
While Auspice doesn't use this, the `"type"` key (with a string value) is specified in our node-data JSON annotations schema and in the exported (Auspice) JSON schema, so tests shouldn't fail.
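For reference, a minimal sketch of how the annotations block from the snippet above ends up in the node-data JSON with the string-valued "type" key; the reference sequence here is only a placeholder, and the structure simply mirrors the snippet and schema discussed here.

# Minimal sketch of the node-data "annotations" block; placeholder reference,
# only the structure matters.
import json

anc_seqs = {"reference": {"nuc": "ACGT" * 800}}  # placeholder reference sequence
anc_seqs["annotations"] = {
    "nuc": {
        "start": 1,                                # 1-based, inclusive
        "end": len(anc_seqs["reference"]["nuc"]),
        "strand": "+",
        "type": "source",                          # landmark/source feature type
    }
}
print(json.dumps(anc_seqs["annotations"], indent=2))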
if not is_vcf and args.genes:
    from .utils import load_features
    ## load features; only requested features if genes given
    features = load_features(args.annotation, args.genes)
I'm not sure whether it's fair to lump it into this PR, but our annotations parsing needs a lot of work.
Testing this PR (for HepB) with NC_003977.gb (straight from GenBank):
File "***/augur/augur/ancestral.py", line 272, in run
feat = features[gene]
KeyError: 'X'
Testing with the nextclade GFF for the same genome:
Couldn't find gene X in GFF or GenBank file
Couldn't find gene pre-capsid in GFF or GenBank file
Couldn't find gene capsid in GFF or GenBank file
Couldn't find gene pol in GFF or GenBank file
Couldn't find gene envL in GFF or GenBank file
Couldn't find gene envM in GFF or GenBank file
Couldn't find gene envS in GFF or GenBank file
Read in 0 features from reference sequence file
Processing gene: X
Traceback (most recent call last):
File "/Users/enigma/projects/nextstrain/augur/augur/__init__.py", line 66, in run
return args.__command__.run(args)
File "/Users/enigma/projects/nextstrain/augur/augur/ancestral.py", line 272, in run
feat = features[gene]
KeyError: 'X'
The following GFF file works. However, as this is intended to be run in conjunction with Nextclade, we should really sort the parsing out!
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_003977.2 1 3182
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10407
NC_003977.2 RefSeq region 1 3182 . + . ID=NC_003977.2:1..3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw
NC_003977.2 RefSeq gene 1376 1840 . + 0 gene=X;
NC_003977.2 RefSeq gene 1816 2454 . + 0 gene=pre-capsid;
NC_003977.2 RefSeq gene 1903 2454 . + 0 gene=capsid;
NC_003977.2 RefSeq gene 2309 4807 . + 0 gene=pol;
NC_003977.2 RefSeq gene 2850 4019 . + 0 gene=envL;
NC_003977.2 RefSeq gene 3174 4019 . + 0 gene=envM;
NC_003977.2 RefSeq gene 157 837 . + 0 gene=envS;
Yeah, this is a bit tricky. Our parsing first limits itself to `source` and `gene`, and then ignores "subfeatures". The GFF parser parses CDS that are children of a gene into `feat.subfeatures`. So yes, this is something we need to tackle.
What we want `load_features` to do is to return a source/landmark feature and then all the bits that need translating, probably best as `type='CDS'` rather than `type='gene'`.
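A rough sketch of that desired behaviour, assuming a GenBank reference; this is not the current augur.utils.load_features implementation, and load_cds_features is a hypothetical name used only for illustration.

# Hypothetical helper: return the source/landmark feature plus all CDS
# features keyed by gene name, instead of stopping at `gene` features.
from Bio import SeqIO

def load_cds_features(reference_path, genes=None):
    record = SeqIO.read(reference_path, "genbank")
    features = {}
    for feature in record.features:
        if feature.type == "source":
            features["nuc"] = feature  # landmark covering the whole sequence
        elif feature.type == "CDS":
            names = feature.qualifiers.get("gene", feature.qualifiers.get("locus_tag", []))
            if names and (genes is None or names[0] in genes):
                features[names[0]] = feature
    return features

# e.g. features = load_cds_features("NC_003977.gb", genes=["X", "pol"])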
I added a simpler parser for the Nextclade JSON here: #1263
Thank you for implementing this, @rneher! Below are some initial thoughts from playing with the new interface. I have only tested the new functionality in an ad hoc way from the command line. I'll spend a little more time tomorrow playing with its integration into a workflow. I can mock up functional tests once we settle the interface.
Big picture
I notice that the interface here allows users to reconstruct nucleotide and amino acid sequences in the same command. I had originally been thinking of this new functionality as a drop-in replacement for the standalone scripts like translations_aamut.py, so our workflows would continue to have an `nt_muts.json` and an `aa_muts.json`. The interface in this PR makes more sense, I think, although we'll have to be careful in our documentation and tutorials to clarify when you need two rules with `augur ancestral` and `translate` (e.g., Zika) vs. when you need one rule with `augur ancestral`. When we only use one call to `ancestral`, should we standardize the output filename to just `muts.json`? I guess we'll eventually have most pathogens using Nextalign, so maybe there won't be much confusion in that future.
Interface
For the command line options, I wish the `--translations` argument accepted a list of specific files instead of a template string. Even if it is a bit of extra typing for users, it is clearer to list the specific files, and it ties into workflow managers better as a dynamic list of inputs. I guess the list becomes a pain for SARS-CoV-2 and other pathogens with a lot of genes, but only when manually running commands from the terminal.
I also wish that the new interface included an argument for `--output-translations` akin to `augur translate`'s `--alignment-output` argument, so we could have reconstructed translations for internal nodes in FASTA format. We use those outputs in multiple places in the seasonal flu workflow, for example. The corresponding argument in `translate` accepts a template string like `'my_alignment_%GENE.fasta'`, but I slightly prefer the way the ncov workflow explicitly lists the output files with internal node translations even though the explicit_translation script creates those outputs implicitly. This approach allows you to explicitly reference output FASTA files from other rules. An example command from the seasonal flu repo directory for H3N2 HA would be:
augur ancestral \
--tree tree.nwk \
--alignment aligned.fasta \
--annotation ../../../config/h3n2/ha/genemap.gff \
--genes SigPep HA1 HA2 \
--translations nextalign/masked.gene.SigPep.fasta \
nextalign/masked.gene.HA1.fasta \
nextalign/masked.gene.HA2.fasta \
--output-node-data muts.json \
--output-sequences nt_muts.fasta \ # Not currently used in flu workflow
--output-translations aa_muts.SigPep.fasta \
aa_muts.HA1.fasta \
aa_muts.HA2.fasta
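If the hypothetical --output-translations option were added, the FASTA-writing side could be as simple as the following sketch; this is not the PR's implementation, and write_translations plus the gene-to-sequences mapping are made up for illustration, with filenames mirroring the command above.

# Sketch: write reconstructed per-gene amino acid sequences for internal
# nodes to one FASTA file per gene.
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def write_translations(aa_sequences_by_gene, output_template="aa_muts.%GENE.fasta"):
    for gene, sequences_by_node in aa_sequences_by_gene.items():
        records = [
            SeqRecord(Seq(sequence), id=node_name, description="")
            for node_name, sequence in sequences_by_node.items()
        ]
        SeqIO.write(records, output_template.replace("%GENE", gene), "fasta")

# write_translations({"HA1": {"NODE_0000001": "MKTIIALSYIL"}})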
The new ability to annotate mutations on the branch leading to the inferred root of the tree using the `--root-sequence` argument is really cool! Does this argument refer to "root" as in the root of the tree? It seems more like the "reference" of the alignment, which could also be the root of the tree but doesn't have to be.
Minor notes
I’d like to add some functional tests for the new interface, once we agree on what that should be (based on the discussion above).
We’ll need to update the docstring of the `ancestral.py` module to refer to amino acid sequences in addition to nucleotide sequences.
As noted above, arguments need grouping into different sections and checks for mutually required arguments.
As a very minor note, I wish that the internal functions `run_ancestral` and `collect_mutations_and_sequences` didn't need to pass around the state of the input being VCF or not (`is_vcf`). That the input is VCF feels like a technical detail of the command line interface that affects downstream steps in different ways that need their own flags. For example, I found the original interface for `collect_mutations_and_sequences` with the `mask_ambiguous` argument much clearer to read. The calling function(s) could determine whether being a VCF means masking ambiguous characters, and that would decouple the internal functions from the details of the input file format. The `run_ancestral` function could accept an argument for `mask_ambiguous` that it passes on to the lower-level function, and it could also accept a boolean argument like `use_root_as_reference` that would fill the place of `is_vcf` in the `ref=root_sequence if is_vcf else None` expression. (Whew! The least important note required the most text to describe…)
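To make that suggestion concrete, here is a signature-only sketch of the decoupling; the two stubs stand in for the existing augur internals, and the flag names are the hypothetical ones from the note above, not code from this PR.

# Stubs standing in for the existing augur.ancestral internals, so only the
# proposed signatures are shown.
def ancestral_sequence_inference(tree, aln, ref=None):
    ...

def collect_mutations_and_sequences(tt, mask_ambiguous=False):
    ...

def run_ancestral(tree, alignment, root_sequence=None,
                  mask_ambiguous=False, use_root_as_reference=False):
    # The internals only see behavioural flags, not the input file format.
    ref = root_sequence if use_root_as_reference else None
    tt = ancestral_sequence_inference(tree=tree, aln=alignment, ref=ref)
    return collect_mutations_and_sequences(tt, mask_ambiguous=mask_ambiguous)

# The CLI layer would then map the file format onto behaviour once, e.g.:
# run_ancestral(T, aln, root_sequence=root_seq,
#               mask_ambiguous=is_vcf, use_root_as_reference=is_vcf)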
Thanks @huddlej. A few follow-up comments:
my idea here was to pick up the Nextalign output where both alignments are available.
your wish might be granted :) easy change. I had just copied the corresponding (output) parts in
easy to add -- mostly a question about the interface. I was thinking of removing the full ancestral sequences from the node data JSON.
I agree that "reference" is more natural in this context, but I worried about it being ubiquitously used across workflows for various purposes, most of which don't have anything to do with "mutations". It has, on the other hand, exactly the same purpose as
the previous version was conflating
I cleaned up the functional tests for augur ancestral (and fixed them so they pass with the new output format for annotations) and then added one for the amino acid sequence reconstruction. In doing so for the Zika example data, I realized how painful it was to list 12 genes with 12 input files, even without listing 12 output files for the reconstructed translations. Based on that experience, I'm more sold on the template string input/output approach that you had originally implemented, @rneher!
I also found a new (to me) error with TreeTime when I tried to run augur ancestral with just the 2K Zika gene's amino acid sequences, which I've included below. My guess is that the issue is that all of the input amino acid sequences are identical. I'll create a separate TreeTime issue for this.
augur ancestral is using TreeTime version 0.11.0
Read in 2 features from reference sequence file
Processing gene: 2K
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: divide by zero encountered in divide
W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: invalid value encountered in divide
W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:10: RuntimeWarning: invalid value encountered in scalar subtract
return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
Traceback (most recent call last):
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/__init__.py", line 66, in run
return args.__command__.run(args)
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 274, in run
aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 131, in run_ancestral
tt = ancestral_sequence_inference(tree=T, aln=aln, ref=root_sequence if is_vcf else None, marginal=marginal,
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 70, in ancestral_sequence_inference
tt.infer_ancestral_sequences(infer_gtr=infer_gtr, marginal=bool_marginal,
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 526, in infer_ancestral_sequences
self.infer_gtr(marginal=marginal, **kwargs)
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 1497, in infer_gtr
self._gtr = GTR.infer(n_ij, T_ia.sum(axis=-1), root_state, fixed_pi=fixed_pi, pc=pc,
File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py", line 550, in infer
while LA.norm(pi_old-pi) > dp and count < Nit:
File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/scipy/linalg/_misc.py", line 146, in norm
a = np.asarray_chkfinite(a)
File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/numpy/lib/function_base.py", line 628, in asarray_chkfinite
raise ValueError(
ValueError: array must not contain infs or NaNs
Thanks! We can revert back to the input template -- I don't have strong feelings. The check that the length of the annotation is consistent with the file is still useful, I think. The TreeTime error is indeed from the identical sequences: there is nothing to estimate the GTR model from. I thought this had been regularized by pseudo-counts.
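For what it's worth, a quick way to spot this degenerate case up front is to check whether the amino acid alignment has any variable columns before reconstruction is attempted; this is not part of augur, and the file path is purely illustrative.

# Check whether an amino acid alignment has any variable columns; if not,
# TreeTime's GTR inference has nothing to estimate, as described above.
from Bio import AlignIO

alignment = AlignIO.read("aligned.gene.2K.fasta", "fasta")  # illustrative path
has_variation = any(
    len({str(record.seq[column]) for record in alignment}) > 1
    for column in range(alignment.get_alignment_length())
)
if not has_variation:
    print("No variable sites: skip GTR inference or use a fixed model for this gene.")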
Thanks for reverting that, @rneher. I will add the option to write out translation FASTAs and update the functional test.
Tested the new annotation and root-sequence features and they worked for me, save for the documentation issue with single vs. double `%` in the template definition.
augur/ancestral.py
parser.add_argument('--genes', nargs='+', help="genes to translate (list or file containing list)")
parser.add_argument('--translations', type=str, help="translated alignments for each CDS/Gene. "
                    "Currently only supported for fasta-input. Specify the file name via a "
                    "template like 'my_alignment_%%GENE.fasta' where %%GENE will be replaced "
@rneher When I tried to use the template with double `%%` I got an error:
ERROR: SequenceData: loading alignment failed... builds/21L/translations/aligned.gene.%ORF1a.fasta
The correct template string might currently require only a single percent sign; I'm not sure whether this is an error in the documentation or the implementation. When I used a single `%`, things worked fine.
Percent signs need to be escaped with another percent sign; `augur ancestral -h` shows it with only a single percent sign, as intended.
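A small self-contained illustration of why the doubled `%%` appears in the source but not in the help output: argparse runs help strings through %-style formatting (e.g. for `%(default)s`), so a literal percent sign must be written as `%%`, while the user still types a single `%`. The template expansion itself is presumably just a string replacement along these lines.

import argparse

# Literal percent signs must be doubled in argparse help strings, because
# argparse %-formats them; the rendered help shows a single %.
parser = argparse.ArgumentParser()
parser.add_argument("--translations", type=str,
                    help="template like 'my_alignment_%%GENE.fasta' where %%GENE "
                         "is replaced by the gene name")
args = parser.parse_args(["--translations", "nextalign/masked.gene.%GENE.fasta"])

# Presumed expansion of the template into per-gene file names:
for gene in ["SigPep", "HA1", "HA2"]:
    print(args.translations.replace("%GENE", gene))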
I see, I've never used pre-f-string Python, and f-strings make the use of `%` for formatting unnecessary. Adding an `f` or `r` before the string would allow us to not need to escape. But this is a matter of taste and not blocking.
Super useful for making Nextclade datasets; I'm already using it through `pip install -e .` until we've got this released.
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1258      +/-   ##
==========================================
+ Coverage   69.69%   69.85%   +0.15%
==========================================
  Files          67       67
  Lines        7099     7155      +56
  Branches     1726     1740      +14
==========================================
+ Hits         4948     4998      +50
- Misses       1846     1851       +5
- Partials      305      306       +1

☔ View full report in Codecov by Sentry.
Force-pushed from 0b1c44a to f740b14.
I split 9bde475...73e3cdd into #1275 to fix CI for other PRs. I force-pushed to rebase on the merged changes and drop those commits from here.
augur ancestral so far only dealt with nucleotide sequences, while translations were done codon by codon from the nucleotide sequences in later steps. If translations for tips are available, the ancestral amino acid sequences can be determined analogously via ancestral reconstruction. This is implemented in this commit. For each node, the node data structure will contain a `muts` and an `aa_muts` field. The latter is a dict with gene/CDS names as keys and `<ancestral><pos><derived>` mutations (same as in augur translate). In addition, an annotation and the reference sequences are derived and stored in the node-data JSON.
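A minimal sketch of the per-node structure this describes; the node name, genes, and mutations here are made up for illustration.

# Illustrative per-node entry in the node-data JSON produced by this change:
# "muts" holds nucleotide mutations, "aa_muts" maps gene/CDS name to
# <ancestral><pos><derived> amino acid mutations, as in augur translate.
example_node_data = {
    "nodes": {
        "NODE_0000001": {
            "muts": ["A123G", "C456T"],
            "aa_muts": {
                "HA1": ["K160T"],
                "HA2": [],
            },
        }
    }
}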
…notation and translation
Reorganize functional tests for augur ancestral to follow the standard pattern with "cram" and "data" subdirectories and separate cram files for individual functional tests. Paves the way for adding another functional test for amino acid sequence reconstruction.
Adds amino acid sequences for the corresponding nucleotide alignment in the ancestral data directory (created from augur translate but mimicking output from Nextalign) and adds a functional test for the new interface that allows reconstruction of amino acid sequences for internal nodes from the given tip sequences per gene.
This should fail.
Adds logic to save ancestral amino acid sequences to one FASTA file per gene and updates the functional test to reflect the expected number of internal nodes in these outputs.
Fixes a bug with VCF inputs for augur ancestral where the reference sequence is already a string, so it could not be converted to a string from a BioPython sequence as the existing code expected. This commit converts the BioPython Seq instance into a string after loading the root sequence, so we always work with a string. It also updates the docstring for `ancestral_sequence_inference` to include the "ref" argument.
Adds a functional test for the root sequence input of the new ancestral interface and fixes a bug revealed by this test where the reference sequence needs to be converted to a BioPython SeqIO.Seq instance prior to calling the feature extraction method on it. The feature extraction method accepts Seq and str inputs, but it will return an output of the same type. When we previously passed a string, the returned string didn't have the expected `translate` method.
Reorganizes arguments for ancestral subcommand into logical argument groups and adds validation and a functional test for new amino acid sequence inputs that must all be provided when any are.
Ignores the path to GFF/GenBank annotations in the JSON output for an augur ancestral functional test. This path will vary across platforms based on the working directory used.
Include link to PR, PR authors, and note about new translations output.
See also: c4d07c7
Force-pushed from c4c2dec to 7cfbd9b.
Force-pushed to fix the merge conflict. This should be good to merge.