-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: New clades subcommand that works like traits, using labeled tips rather than clades.tsv #1329
base: master
Are you sure you want to change the base?
Conversation
Currently, to assign clades at internal nodes, we require a `clades.tsv` which isn't a natural way to define clades in most cases Rather, one often wants to infer clades from a set of labeled tips This PR adds a new subcommand `clades2` to infer clades at all tips and internal nodes from a subset of labeled tips Currently, the implementation is a stripped down version of `traits` with confidence/entropy/model output removed But this is an implementation detail that might change in the future Some limitations of the current approach: - clades can be non-monophyletic - hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of the other three) In theory, hierarchy could be taken into account but this may not be required for this command to be useful already. The name is provisional, I couldn't immediately think of a good one so I went for `clades2`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Starting a thread for command placement/naming)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a user of clades so I don't think my opinion matters much, but my inclination would be to make it a part of augur clades
because the output file is the same. I don't think --mode
is necessary since it can be inferred by the inputs. The argparse parser can be written to require either --mutations
+--clades
or --metadata
+--clade-column
. Example:
# "old" clades
augur clades \
--tree tree.nwk \
--mutations aa_muts.json nt_muts_small.json \
--clades clades.tsv \
--output-node-data clades.json
# "new" clades
augur clades \
--tree tree.nwk \
--metadata metadata.tsv \
--clade-column clade \
--output-node-data clades.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable, @victorlin. We'd probably want separate argument groups to clearly show which arguments to use for which "mode".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch that no mode is necessary and one can infer from input files. One disadvantage of lumping the two modes together in one command is that the command becomes more complicated to understand for end users. There stilll are two modes, they just share a few arguments and capabilities but not all and which are shared and which aren't is not obvious.
I'm not sure what the rationale is for lumping if there isn't much shared code/logic. What's the advantage of having it be under one command? I'm open either way, just not fully convinced yet of lumping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think shared code/logic is necessary to bundle under one command. In trying to imagine myself as a user, my thinking was that this would be an expansion of augur clades
's objective from
Assign clades to nodes in a tree based on amino-acid or nucleotide signatures
to
Assign clades to nodes in a tree based on either amino-acid/nucleotide signatures or clades from metadata
In other words, there is a shared objective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point on few shared arguments might matter more if there are clades
-specific customizations that are available in one and not the other. This doesn't include args like --metadata-id-columns
, which is unrelated to clades
. As far as I can tell, besides the input arguments, I don't think there are any clades
-specific arguments to either command.
Bundling also make it more likely that any feature additions added to one are added to both.
@corneliusroemer Can you say a little more about why the current
Maybe another question is how important it is for the proposed new interface to exactly match the functionality provided by |
Thanks for the good questions @huddlej! It's true that one could use traits to do what the command does in the current state of the PR, but we will eventually want to use a different treetime function under the hood (ancestral reconstruction rather than mugration) and that means we can't use traits anymore - unless one were to make clades inference essentially a separate traits mode altogether. Adding branch label functionality makes sense, but it's not essential for the main use cases I've thought of. Same for confidence, if possible nice to have but depending on the algorithm used for inference, we might not get confidence (e.g. Fitch/parsimony won't give you confidence). |
Thanks, @corneliusroemer. I see now how different this logic needs to be from I agree with @victorlin's assessment in the comments above that placing this new functionality in Do you think @victorlin's example interface would meet your needs as a user, @corneliusroemer? Is there anything you'd change from the UI perspective? We could chat about this synchronously any time you'd like, too, if that's easier than GitHub comments... |
Currently, to assign clades at internal nodes, we require a
clades.tsv
listing defining mutations for each clade which isn't a natural way to define clades in many cases: often people define clades through representative strains.This draft PR is an attempt to offer an alternative clades command that does clade assignment more like trait inference/ancestral reconstruction: from a set of labeled tips to internal nodes and unlabeled tips.
Such a command will be particularly useful for bootstrapping reference trees for Nextclade datasets. However, there are many more use cases, e.g. getting lineages onto internal nodes in ncov.
Currently, the implementation is a stripped down version of
traits
with confidence/entropy/model output removed. However this is an implementation detail that will probably change - so it's best not to focus on that part.It would be great to get feedback in general, but in particular on the following points: How should this command be included? Should it be a new subcommand - the way it's done right now with the place holder name
clades2
eventually replaced by a better name, or should we put the functionality insideaugur clades
and gate it behind a--mode
?My gut preference is to make a new subcommand as the input files to the command are quite different: taking a metadata.tsv and a metadata column name instead of a nuc_mutations.json and a clades.tsv - but we might also want to avoid proliferation of new subcommands.
Some limitations of the current implementation: