-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow users to specify arbitrary branch & clade labels #728
Conversation
This comment has been minimized.
This comment has been minimized.
081628c
to
32fe34e
Compare
augur/clades.py
Outdated
def create_node_data_structure(basal_clade_nodes, clade_membership, args): | ||
node_data = {} | ||
|
||
if (not args.label_name and not args.trait_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to allow workflows to continue without needing changes. There were 2 ways I could think of allowing this:
augur clades
without these 2 arguments used the old behaviour & exported bothclade_membership
andclade_annotation
as node traits. These would be picked up byaugur export v2
and the latter turned into the branch labelclade
. The downside is that the file structure foraugur clades
is different if you don't provide arguments than if you do.augur clades
without these 2 arguments now stores clade membership as before but stores branch labels in the newbranch_labels
structure under a keyclade
. This structure needs no special interpretation byaugur export v2
, and we will end up with identical Auspice JSONs as previously. The downside is that the format of the file produced byaugur clades
differs.
I went with option 2, but am open to other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make clade
and clade_membership
the default values for the label and attribute names and store their values in the new structure? Is there a reason to make these required arguments in the future?
As a user, I would be surprised to find I need to define these values when I've never needed to before and I'd probably just use the defaults anyway.
If we allow these arguments to have default values, then we only need five lines of this function and those can be moved into run
.
Edit (james) - got confused with GitHub's inlining, hid this comment, and now can't unhide it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! My reason for requiring users to specify is to allow users to call augur clades
multiple times in a single workflow, sometimes with branch labels, sometimes without (and perhaps sometimes without trait names etc). If these had defaults, then the defaults will end up exported in the auspice JSON which may be undesired and potentially confusing as it wouldn't be clear which invocation of augur clades
produced it.
Concrete examples for discussion:
augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur export ...
This is going to end up with "clade" branch labels representing pangolin clades, which wasn't the desired intention of the workflow.
augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur clades ... --branch-label emerging_lineage # no trait labels
augur clades ... --trait-label WHO
augur export ...
We're going to get branch labels "clade", which I think will be WHO clades (this is an implementation detail of augur export
as to which one is picked - worst case they may be a mixture!). We're also going to get a colouring clade_membership
, which actually refers to emerging lineage, but it isn't obvious why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Those examples helped! If I understand correctly, there are two separate issues that we’re trying to address by requiring these new arguments:
-
Users can run
augur clades
multiple times with the same (default) attribute/label names andaugur export
will happily consume these and prefer one over the other in a surprising or unpredictable order (at least for the user). -
Users can run
augur clades
with one or both of the new arguments depending on what they want to annotate (clade attributes only, branch labels only, or both). Using default values would produce unwanted outcomes when users choose only attributes or only branch labels and also gets an annotation for the other possible representation.
Is that summary generally correct?
I can see how requiring these arguments tries to protect against conflicting node/branch attributes in augur export
. This is similar to why augur distance
requires --attribute-name
. But this seems to be a general problem with the export logic where we don't check (I think?) for collisions in attribute names from different data sources. So, even though we require the user to specify attribute names, there is no reason they couldn’t specify the same names in separate commands and still get a surprise collision. Another way to address issue 1 would be to check for these types of collisions in augur export
and either warn the user or throw an error. In addition to addressing Issue 1 here, this solution would also address other cases in the real world where people accidentally define the same attribute in separate runs of other augur commands. If issue 1 was the only issue, I'd still prefer to set sane defaults and not expect the user to change their behavior.
Issue 2 is one I missed on my initial read through the code (that you can define attributes or labels and not both). Still, I wonder about how bad it would be for users to get branch labels when they only request attributes. If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature? Is the worst case scenario here that the user is annoyed to get an annotation they don’t expect? We have already been providing these dual annotations, so would they actually be surprised? The main issue seems to be when the default names for the other representation conflict across multiple runs of the same command.
This example also makes me wonder about the value of using different names for attributes and branch labels. The name we use describes the data source of the clade annotations and not how Auspice represents clade annotations. That a clade appears as a color-by or branch label in Auspice is a separate technical consideration.
I also don’t see the harm in annotating both node and branch attributes with the same name by default. What if use the same name for both attributes and keep a sane default value (e.g., “clade”)? This approach allows the user who only runs clades once in a workflow to change nothing and run:
# Annotate both node and branch attributes. The user gets
# output that differs in its JSON structure but appears the
# same way in Auspice as it always has. Augur export knows
# how to handle the new JSON structure in this same release
# of Augur, so we don't need any special checks for backward
# compatibility.
augur clades \
--clades clades.tsv \
--output clades.json
Then, the user who wants to run multiple instances of clades in a single workflow can run the following commands to be more explicit about their attribute names:
# Provide explicit node/branch attribute names.
augur clades \
--clades clades.tsv \
--attribute-name nextstrain_clade \
--output clades.json
augur clades \
--clades pango.tsv \
--attribute-name pango \
--output pango.json
If users specify the same attribute name in separate data sources, augur export
should complain loudly:
# Use the default attribute name. Annotate both node and
# branch attributes.
augur clades \
--clades clades.tsv \
--output clades.json
# Accidentally reuse the same default attribute name.
augur clades \
--clades pango.tsv \
--output pango.json
# Validate attribute names from distinct data sources.
augur export v2 \
--node-data clades.json pango.json
...
ERROR: Multiple node data files ("clades.json", "pango.json") provide the same attribute name ("clade"). Resolve conflicting attribute names (e.g., by specifying `--attribute-name`) for these data files and try again.
Allowing default values makes this a backward-compatible change where most users do not have to do anything. Using the same name for node and branch attributes allows the user to know which augur clades
invocation produced those attributes and not have to think about how the clade annotation is represented in Auspice. Checking for collisions in node/branch attribute names in augur export
alerts the user when they accidentally reuse the same attribute names in separate invocations and tells them how to correct the problem (and fixes a more general issue with augur export
). I think this approach also simplifies the code in this PR a bit by reusing the same attribute name.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the scope of augur clades
, I agree with these points (branch labels and node attrs stored under the same attribute name, both exported, one optional `--attribute-name" arg with a default of "clade"), and am happy to make the changes to that code.
Where it gets tricky is in augur export
, because that is when we combine various pieces of data into a desired visualisation. To cleanly demarcate data generation vs visualisation, I do think these complexities are the remit of augur export
. How we determine what's exported has always been somewhat poorly documented and without looking at the code I can't remember what happens in many cases:
- What happens if pieces of (meta)data differ in the metadata TSV and a node-data file?
- Is a node-data attribute always exported as a colouring, even if we provide a list of desired colourings in an auspice config JSON which doesn't specify it?
- what about if we provide a list of colourings on the command line?
- What about if we do both?
- Are there special cases? (Yes, at least 18 and probably more.)
This relates to this PR as I think we want to have answers to the following questions:
- Previously, clade colourings were exported as "clade_membership", and this was always set as a colouring if a node-data file provided it. This is easy to update to "clade" if we we want to keep this behaviour.
- If an auspice config JSON specified a colouring for
key="clade_membership"
, which many do, but such an attribute is no longer provided in any node-data JSONs, what do we do? - Is there a way to limit the exported colourings from node-data produced by
augur clades
? i.e. is specifying a list of colourings in the config JSON able to prevent the export of such a node-data attribute? - Currently there's no general way to export branch labels (that's part of this PR). Do we extend the auspice config PR to allow these to be specified? Does this act the same way as
colorings
?
P.S.
If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature?
It's a bug. The dataset for visualisation should be completely customisable - if you believe such a colouring / branch label is scientifically not valid, you should be able to prevent it appearing in Auspice. I realise there's many cases in augur export
where things like these happen; they're bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, @jameshadfield! It's going to make the ncov workflow simpler, but it also paves the way for us to do cool things with custom branch labels or alternate clade annotations in other projects.
My main request below is that we provide default values for the new attribute/label variables, so users do not have to provide values if they don't want to.
As with the schema update PR, we could merge this as is, or we could pair-program some doctests. Whatever works best for you...
augur/clades.py
Outdated
def create_node_data_structure(basal_clade_nodes, clade_membership, args): | ||
node_data = {} | ||
|
||
if (not args.label_name and not args.trait_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make clade
and clade_membership
the default values for the label and attribute names and store their values in the new structure? Is there a reason to make these required arguments in the future?
As a user, I would be surprised to find I need to define these values when I've never needed to before and I'd probably just use the defaults anyway.
If we allow these arguments to have default values, then we only need five lines of this function and those can be moved into run
.
Edit (james) - got confused with GitHub's inlining, hid this comment, and now can't unhide it...
4d05cab
to
16280dd
Compare
Thanks for the great review @huddlej. Following on from #728 (comment), I've updated I started some overly simple functional tests of While creating these tests I noticed a bunch of little things which are all out of scope for this PR... should I create issues for these?
|
This looks pretty good to me. A few questions: We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly. I think the current (and probably previous) behavior Lines 152 to 158 in 16280dd
is fine. But might be good to stick in a comment. I am wondering whether we should instead of Lines 212 to 215 in 16280dd
use a structure like this
the The structure would be a bit more symmetrical in branches and nodes and might be more future proof bc we could add additional branch attributes without cluttering the top level. In other commands, this is used for auxillary info like version numbers, etc.... |
Thanks @rneher Current root node behaviour hasn't changed, but I'm not exactly sure what you mean. Are you saying that if a clade should be defined at the root, re: updated |
This commit is a WIP commit to test the new functionality being introduced in augur PR 728 [1]. This allows us to simplify the nCoV workflow as we can explicitly define the attribute names used for clade membership and branch labelling. These changes have only been tested for the "open" build, which itself is a WIP. [1] nextstrain/augur#728
16280dd
to
8a0c06e
Compare
From @rneher's review:
@jameshadfield, I understood this to mean that Augur is not guaranteed to assign the root node to a clade (the root sequence might not have any of the defined mutations), so its clade membership is implicitly undefined. We could add a check for the root node in the clades dict and then explicitly assign it a value, to make this logic clearer. From @rneher's review:
I like this symmetry, too, as a flexible way to annotate anything we like about branches and mostly for the parallel naming of "physical" objects. From @jameshadfield:
When you and I talked about this on Zoom, @jameshadfield, I think this was why we didn't use Even if this is the case, it seems that the node data JSON format is a kind of generic interface that could be decoupled from how the final output of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the edits to the main interface, @jameshadfield. This looks really good. The only bit to resolve before we merge is the branch_labels
vs. branches
naming question.
augur/clades.py
Outdated
|
||
# third pass to propagate 'clade_membership' | ||
# propagate 'clade_membership' to children nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we could check for the root node's clade membership and assign it something like "undefined", if we wanted to handle this case explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a bit unsure about all this.
Nodes not part of clades ("undefined") aren't part of the output of augur clades
, so explicitly annotating the root node as such would be strange.
When the inputs define the sequence for the root node, then the root can be annotated with a clade - see the nCoV workflow where the root node+branch are assigned clade 19A. My understanding is that this needs the entire root sequence as an input, we don't infer this from the observed mutations, but this is something we should test / document.
8a0c06e
to
866c968
Compare
Rebased this onto master now that #737 is merged & updated the issues @huddlej pointed out. My observation about the node-data structure doesn't involve auspice, rather the difference this would cause in nodes & branches structure within a single node-data JSON, with branches having a second level of hierarchy, e.g. {
"nodes": {
"ARG/Cordoba-12873-61/2020": {
"clade_membership": "20A"
}
},
"branches": {
"labels": {
"NODE_0000000": {
"clade_membership": "19A"
}
}
}
} As long as we are aware of this, I'm happy to shift to this structure. |
Ah, I see. That example clears it up. I think what @rneher is recommending looks like this instead: {
"nodes": {
"ARG/Cordoba-12873-61/2020": {
"clade_membership": "20A"
}
},
"branches": {
"NODE_0000000": {
"labels": {
"clade": "19A"
}
}
}
} How do you feel about this approach? |
yes, this is what I meant. I hope this is more generic and future proof (things like support values could live on branches). On the other hand, we do assign a bunch of things to
But I would prefer a top-level |
866c968
to
f7cc5ab
Compare
Updated this PR to use the new structure from @huddlej / @rneher above:
And added some more functional tests. I think it'd be worth running nextstrain/ncov#660 with this (updated) PR as a final round of tests before merge. I'll start this run now. |
f7cc5ab
to
422084d
Compare
15b29a8
to
007cb47
Compare
After being on the agenda forever I'm finally going to get this merged. The overall summary is as per this comment above.
Good call - I've modified
I think this is the better direction - as per John & Richard's comments above. I do see the fear that mutations never get moved across, but I hope they do!
I'm still wrapping my head around this and trying to construct a test to really understand what's going on here (and to understand if providing a reference changes things). I'll do that separately to this PR however as the behavior is unchanged here |
Our current implementation of read_node_data requires that every node in the tree is specified in the (merged) node_data files. For mutations this is overkill -- many nodes don't have mutations and it's overkill to require node_data JSONs to specify things like `"node_name": {"muts": []}`. This may well be the general behaviour we want, but i didn't want to modify the read_node_data function which sees extensive use. A welcome side effect of these changes is that we no longer have to supply both nuc and aa_muts.
See comments in tests/functional/clades.t Also adds / updates comments and docstrings which were noticed as I worked through the code relating to these tests.
Workflows may be using this so I elected to hide it rather than remove it (and warn people it's a no-op if they do happen to be using it)
This function had a few subtle bugs in it which are fixed here, as well as improving the warning message to explain how this may affect clade inference. Note that the presence of sequences on nodes other than the root is not considered by augur clades.
We could check all of these up-front instead of exiting upon the first error, and such a check should be part of validation within augur clades, but this commit is a simple solution to fix a reported bug. Closes #965
A fatal error is raised if no clades are defined, but if a clade is not found on the tree it's only a warning. Suggested in #735
Multiple mutations at the same position on a single branch are now a fatal error. Previous behaviour was to overwrite such mutations when parsing. Suggested by #735.
Multiple improvements to augur clades
@jameshadfield It would be good to add to the Changelog how the internal representation of node data has changed. I couldn't find the info at a glance and this PR has many comments. See e.g. this failure where a script-created node-data-json is no longe accepted by export: nextstrain/conda-base#27 (comment)
Also, I think this should be reclassified as a breaking change, given that we use a lot of custom scripts in our workflows. |
In PR #728, extra node data validation was introduced. In particular, files without information for either `nodes` or `branches` caused erroring. This is problematic for test scripts that may produce empty node data in test cases. This PR removes the eager validation. In the future we could reintroduce it as a warning. And possibly an error but with opt-out.
Resolves #1215 Warn instead error when no nodes in a node data json, fixing issue introduced recently in PR #728 In PR #728, extra node data validation was introduced. In particular, files without information for either `nodes` or `branches` caused erroring. This is problematic for test scripts that may produce empty node data in test cases. This PR removes the eager validation. In the future we could reintroduce it as a warning. And possibly an error but with opt-out. This type of node data json was previously errored on by augur export, it is now accepted again: ```json { "nodes": {}, "rbd_level_details": {} } ``` <!-- Start typing the name of a related issue and GitHub will auto-suggest the issue number for you. --> Fixes the ncov pathogen-CI issue: nextstrain/conda-base#27 (comment) What steps should be taken to test the changes you've proposed? If you added or changed behavior in the codebase, did you update the tests, or do you need help with this? - [x] nextstrain/conda-base#27 (comment) is fixed, export now accepts empty nodes dicts again
This updates the workflow to use the new clades interface from augur v22 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used, but not removed here, as it contains logic to export spike mutations as branch labels which may be useful at some point. If we do use this, it would be better to produce an intermediate node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.
This updates the workflow to use the new clades interface from augur v22 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used, but not removed here, as it contains logic to export spike mutations as branch labels which may be useful at some point. If we do use this, it would be better to produce an intermediate node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.
The intention of the coloring logic is that if an auspice-config provides the clade_membership key then it is exported at that position in the colorings list. If clade_membership is not explicitly set in the config (but is present in a node-data file) then we have (for a very long time) added it as the very first entry in the colorings list. PR #728 (augur v22.0.0) erroneously modified the behavior of the second case described above, which has now been restored by this commit.
The intention of the coloring logic is that if an auspice-config provides the clade_membership key then it is exported at that position in the colorings list. If clade_membership is not explicitly set in the config (but is present in a node-data file) then we have (for a very long time) added it as the very first entry in the colorings list. PR #728 (augur v22.0.0) erroneously modified the behavior of the second case described above, which has now been restored by this commit.
This updates the workflow to use the new clades interface from augur v22.0.1 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used, but not removed here, as it contains logic to export spike mutations as branch labels which may be useful at some point. If we do use this, it would be better to produce an intermediate node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.
This updates the workflow to use the new clades interface from augur v22 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. The minimum augur version is bumped to 22.0.1, as that includes a couple of important bug-fixes. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used and thus removed here (as recommended in code review: #1000 (comment)) Note that it contained unused functionality to export spike mutations; if we reinstate this in the future we should update the output format to produce a node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.
The JSON output from `augur clades` was updated to separate `nodes` and `branches` in nextstrain/augur#728 so now the `assign_rbd_levels` script needs to parse the `branches` in order to find the basal node.
The JSON output from `augur clades` was updated to separate `nodes` and `branches` in nextstrain/augur#728 so now the `assign_rbd_levels` script needs to parse the `branches` in order to find the basal node.
The JSON output from `augur clades` was updated to separate `nodes` and `branches` in nextstrain/augur#728 so now the `assign_rbd_levels` script needs to parse the `branches` in order to find the basal node.
This PR consists of a pair of commits (see messages for details of each). These will allow us to specify the names of clades produced by
augur clades
and haveaugur export v2
export arbitrary branch labels without the need of ad-hoc scripts. This PR closes #720.As an example for testing, the following patch shows how our ncov workflow can be simplified, as we can run multiple
augur clades
rules and pass their output directly toaugur export
, which removes the need for two extra rules.I've tested this in a variety of settings, but more is needed. Unit tests (or similar) would be useful here, but it's been a while since I've written these for augur (anyone want to pair program these?).