Allow users to specify arbitrary branch & clade labels #728

jameshadfield · 2021-05-27T23:46:53Z

This PR consists of a pair of commits (see messages for details of each). These will allow us to specify the names of clades produced by augur clades and have augur export v2 export arbitrary branch labels without the need for ad-hoc scripts. This PR closes #720.

As an example for testing, the following patch shows how our ncov workflow can be simplified, as we can run multiple augur clades rules and pass their output directly to augur export, which removes the need for two extra rules.

diff --git a/workflow/snakemake_rules/main_workflow.smk b/workflow/snakemake_rules/main_workflow.smk
index f9cacf08..c978d321 100644
--- a/workflow/snakemake_rules/main_workflow.smk
+++ b/workflow/snakemake_rules/main_workflow.smk
@@ -915,6 +915,9 @@ rule clades:
         clades = rules.clade_files.output
     output:
         clade_data = "results/{build_name}/clades.json"
+    params:
+        trait_name = "clade_membership",
+        label_name = "clade"
     log:
         "logs/clades_{build_name}.txt"
     benchmark:
@@ -928,6 +931,7 @@ rule clades:
         augur clades --tree {input.tree} \
             --mutations {input.nuc_muts} {input.aa_muts} \
             --clades {input.clades} \
+            --trait-name {params.trait_name} --label-name {params.label_name} \
             --output-node-data {output.clade_data} 2>&1 | tee {log}
         """
 
@@ -940,7 +944,10 @@ rule emerging_lineages:
         emerging_lineages = config["files"]["emerging_lineages"],
         clades = config["files"]["clades"]
     output:
-        clade_data = "results/{build_name}/temp_emerging_lineages.json"
+        clade_data = "results/{build_name}/emerging_lineages.json"
+    params:
+        trait_name = "emerging_lineage",
+        label_name = "emerging_lineage"
     log:
         "logs/emerging_lineages_{build_name}.txt"
     benchmark:
@@ -954,28 +961,10 @@ rule emerging_lineages:
         augur clades --tree {input.tree} \
             --mutations {input.nuc_muts} {input.aa_muts} \
             --clades {input.emerging_lineages} \
+            --trait-name {params.trait_name} --label-name {params.label_name} \
             --output-node-data {output.clade_data} 2>&1 | tee {log}
         """
 
-rule rename_emerging_lineages:
-    input:
-        node_data = rules.emerging_lineages.output.clade_data
-    output:
-        clade_data = "results/{build_name}/emerging_lineages.json"
-    benchmark:
-        "benchmarks/rename_emerging_lineages_{build_name}.txt"
-    run:
-        import json
-        with open(input.node_data, 'r', encoding='utf-8') as fh:
-            d = json.load(fh)
-            new_data = {}
-            for k,v in d['nodes'].items():
-                if "clade_membership" in v:
-                    new_data[k] = {"emerging_lineage": v["clade_membership"]}
-        with open(output.clade_data, "w") as fh:
-            json.dump({"nodes": new_data}, fh, indent=2)
-
-
 rule colors:
     message: "Constructing colors file"
     input:
@@ -1124,7 +1113,7 @@ def _get_node_data_by_wildcards(wildcards):
         rules.refine.output.node_data,
         rules.ancestral.output.node_data,
         rules.translate.output.node_data,
-        rules.rename_emerging_lineages.output.clade_data,
+        rules.emerging_lineages.output.clade_data,
         rules.clades.output.clade_data,
         rules.recency.output.node_data,
         rules.traits.output.node_data,
@@ -1180,28 +1169,10 @@ rule export:
             --output {output.auspice_json} 2>&1 | tee {log}
         """
 
-rule add_branch_labels:
-    message: "Adding custom branch labels to the Auspice JSON"
-    input:
-        auspice_json = rules.export.output.auspice_json,
-        emerging_clades = rules.emerging_lineages.output.clade_data
-    output:
-        auspice_json = "results/{build_name}/ncov_with_branch_labels.json"
-    log:
-        "logs/add_branch_labels{build_name}.txt"
-    conda: config["conda_environment"]
-    shell:
-        """
-        python3 ./scripts/add_branch_labels.py \
-            --input {input.auspice_json} \
-            --emerging-clades {input.emerging_clades} \
-            --output {output.auspice_json}
-        """
-
 rule incorporate_travel_history:
     message: "Adjusting main auspice JSON to take into account travel history"
     input:
-        auspice_json = rules.add_branch_labels.output.auspice_json,
+        auspice_json = rules.export.output.auspice_json,
         colors = lambda w: config["builds"][w.build_name]["colors"] if "colors" in config["builds"][w.build_name] else ( config["files"]["colors"] if "colors" in config["files"] else rules.colors.output.colors.format(**w) ),
         lat_longs = config["files"]["lat_longs"]
     params:
@@ -1228,7 +1199,7 @@ rule incorporate_travel_history:
 rule finalize:
     message: "Remove extraneous colorings for main build and move frequencies"
     input:
-        auspice_json = lambda w: rules.add_branch_labels.output.auspice_json if config.get("skip_travel_history_adjustment", False) else rules.incorporate_travel_history.output.auspice_json,
+        auspice_json = lambda w: rules.export.output.auspice_json if config.get("skip_travel_history_adjustment", False) else rules.incorporate_travel_history.output.auspice_json,
         frequencies = rules.tip_frequencies.output.tip_frequencies_json,
         root_sequence_json = rules.export.output.root_sequence_json
     output:

I've tested this in a variety of settings, but more is needed. Unit tests (or similar) would be useful here, but it's been a while since I've written these for augur (anyone want to pair program these?).

jameshadfield · 2021-05-28T05:15:22Z

augur/clades.py

+def create_node_data_structure(basal_clade_nodes, clade_membership, args):
+    node_data = {}
+
+    if (not args.label_name and not args.trait_name):


I wanted to allow workflows to continue without needing changes. There were 2 ways I could think of allowing this:

1. augur clades without these 2 arguments used the old behaviour & exported both clade_membership and clade_annotation as node traits. These would be picked up by augur export v2 and the latter turned into the branch label clade. The downside is that the file structure for augur clades is different if you don't provide arguments than if you do.

2. augur clades without these 2 arguments now stores clade membership as before but stores branch labels in the new branch_labels structure under a key clade. This structure needs no special interpretation by augur export v2, and we will end up with identical Auspice JSONs as previously. The downside is that the format of the file produced by augur clades differs.

I went with option 2, but am open to other suggestions.

Can we make clade and clade_membership the default values for the label and attribute names and store their values in the new structure? Is there a reason to make these required arguments in the future?

As a user, I would be surprised to find I need to define these values when I've never needed to before and I'd probably just use the defaults anyway.

If we allow these arguments to have default values, then we only need five lines of this function and those can be moved into run.

Edit (james) - got confused with GitHub's inlining, hid this comment, and now can't unhide it...

Good question! My reason for requiring users to specify is to allow users to call augur clades multiple times in a single workflow, sometimes with branch labels, sometimes without (and perhaps sometimes without trait names etc). If these had defaults, then the defaults will end up exported in the auspice JSON which may be undesired and potentially confusing as it wouldn't be clear which invocation of augur clades produced it.

Concrete examples for discussion:

augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur export ...

This is going to end up with "clade" branch labels representing pangolin clades, which wasn't the desired intention of the workflow.

augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur clades ... --branch-label emerging_lineage # no trait labels
augur clades ... --trait-label WHO
augur export ...

We're going to get branch labels "clade", which I think will be WHO clades (this is an implementation detail of augur export as to which one is picked - worst case they may be a mixture!). We're also going to get a colouring clade_membership, which actually refers to emerging lineage, but it isn't obvious why.

Got it. Those examples helped! If I understand correctly, there are two separate issues that we’re trying to address by requiring these new arguments:

1. Users can run augur clades multiple times with the same (default) attribute/label names and augur export will happily consume these and prefer one over the other in a surprising or unpredictable order (at least for the user).

2. Users can run augur clades with one or both of the new arguments depending on what they want to annotate (clade attributes only, branch labels only, or both). Using default values would produce unwanted outcomes when users choose only attributes or only branch labels and also get an annotation for the other possible representation.

Is that summary generally correct?

I can see how requiring these arguments tries to protect against conflicting node/branch attributes in augur export. This is similar to why augur distance requires --attribute-name. But this seems to be a general problem with the export logic where we don't check (I think?) for collisions in attribute names from different data sources. So, even though we require the user to specify attribute names, there is no reason they couldn’t specify the same names in separate commands and still get a surprise collision. Another way to address issue 1 would be to check for these types of collisions in augur export and either warn the user or throw an error. In addition to addressing Issue 1 here, this solution would also address other cases in the real world where people accidentally define the same attribute in separate runs of other augur commands. If issue 1 was the only issue, I'd still prefer to set sane defaults and not expect the user to change their behavior.

Issue 2 is one I missed on my initial read through the code (that you can define attributes or labels and not both). Still, I wonder about how bad it would be for users to get branch labels when they only request attributes. If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature? Is the worst case scenario here that the user is annoyed to get an annotation they don’t expect? We have already been providing these dual annotations, so would they actually be surprised? The main issue seems to be when the default names for the other representation conflict across multiple runs of the same command.

This example also makes me wonder about the value of using different names for attributes and branch labels. The name we use describes the data source of the clade annotations and not how Auspice represents clade annotations. That a clade appears as a color-by or branch label in Auspice is a separate technical consideration.

I also don’t see the harm in annotating both node and branch attributes with the same name by default. What if we use the same name for both attributes and keep a sane default value (e.g., “clade”)? This approach allows the user who only runs clades once in a workflow to change nothing and run:

# Annotate both node and branch attributes. The user gets
# output that differs in its JSON structure but appears the
# same way in Auspice as it always has. Augur export knows
# how to handle the new JSON structure in this same release
# of Augur, so we don't need any special checks for backward
# compatibility.
augur clades \
  --clades clades.tsv \
  --output clades.json

Then, the user who wants to run multiple instances of clades in a single workflow can run the following commands to be more explicit about their attribute names:

# Provide explicit node/branch attribute names.
augur clades \
  --clades clades.tsv \
  --attribute-name nextstrain_clade \
  --output clades.json

augur clades \
  --clades pango.tsv \
  --attribute-name pango \
  --output pango.json

If users specify the same attribute name in separate data sources, augur export should complain loudly:

# Use the default attribute name. Annotate both node and
# branch attributes.
augur clades \
  --clades clades.tsv \
  --output clades.json

# Accidentally reuse the same default attribute name.
augur clades \
  --clades pango.tsv \
  --output pango.json

# Validate attribute names from distinct data sources.
augur export v2 \
  --node-data clades.json pango.json ...

ERROR: Multiple node data files ("clades.json", "pango.json") provide the same attribute name ("clade"). Resolve conflicting attribute names (e.g., by specifying `--attribute-name`) for these data files and try again.

Allowing default values makes this a backward-compatible change where most users do not have to do anything. Using the same name for node and branch attributes allows the user to know which augur clades invocation produced those attributes and not have to think about how the clade annotation is represented in Auspice. Checking for collisions in node/branch attribute names in augur export alerts the user when they accidentally reuse the same attribute names in separate invocations and tells them how to correct the problem (and fixes a more general issue with augur export). I think this approach also simplifies the code in this PR a bit by reusing the same attribute name.
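A minimal sketch of the collision check proposed above, in Python. The helper name `find_attribute_collisions` and the in-memory inputs are hypothetical; this is not how augur export is implemented, only an illustration of comparing attribute names across node-data files before merging them.

```python
from collections import defaultdict


def find_attribute_collisions(node_data_by_file):
    """Return {attribute_name: [files...]} for names provided by more than one file.

    `node_data_by_file` maps a file name to its parsed node-data JSON (a dict
    with an optional "nodes" key). Hypothetical helper, not part of augur.
    """
    providers = defaultdict(set)
    for filename, node_data in node_data_by_file.items():
        for attrs in node_data.get("nodes", {}).values():
            for attr_name in attrs:
                providers[attr_name].add(filename)
    return {
        name: sorted(files)
        for name, files in providers.items()
        if len(files) > 1
    }


# Two files that both used the default attribute name "clade".
clades = {"nodes": {"NODE_1": {"clade": "20A"}}}
pango = {"nodes": {"NODE_1": {"clade": "B.1.1.7"}}}

collisions = find_attribute_collisions({"clades.json": clades, "pango.json": pango})
for name, files in collisions.items():
    print(f"ERROR: Multiple node data files {files} provide the same attribute name ({name!r}).")
```

Whether a collision should be a warning or a fatal error is left open here, as it is in the discussion.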

What do you think?

Considering the scope of augur clades, I agree with these points (branch labels and node attrs stored under the same attribute name, both exported, one optional `--attribute-name` arg with a default of "clade"), and am happy to make the changes to that code.

Where it gets tricky is in augur export, because that is when we combine various pieces of data into a desired visualisation. To cleanly demarcate data generation vs visualisation, I do think these complexities are the remit of augur export. How we determine what's exported has always been somewhat poorly documented and without looking at the code I can't remember what happens in many cases:

What happens if pieces of (meta)data differ in the metadata TSV and a node-data file?

Is a node-data attribute always exported as a colouring, even if we provide a list of desired colourings in an auspice config JSON which doesn't specify it?

What about if we provide a list of colourings on the command line?

What about if we do both?

Are there special cases? (Yes, at least 18 and probably more.)

This relates to this PR as I think we want to have answers to the following questions:

Previously, clade colourings were exported as "clade_membership", and this was always set as a colouring if a node-data file provided it. This is easy to update to "clade" if we want to keep this behaviour.

If an auspice config JSON specified a colouring for key="clade_membership", which many do, but such an attribute is no longer provided in any node-data JSONs, what do we do?

Is there a way to limit the exported colourings from node-data produced by augur clades? i.e. is specifying a list of colourings in the config JSON able to prevent the export of such a node-data attribute?

Currently there's no general way to export branch labels (that's part of this PR). Do we extend the auspice config PR to allow these to be specified? Does this act the same way as colorings?

P.S.

If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature?

It's a bug. The dataset for visualisation should be completely customisable - if you believe such a colouring / branch label is scientifically not valid, you should be able to prevent it appearing in Auspice. I realise there's many cases in augur export where things like these happen; they're bugs.

huddlej

This is awesome, @jameshadfield! It's going to make the ncov workflow simpler, but it also paves the way for us to do cool things with custom branch labels or alternate clade annotations in other projects.

My main request below is that we provide default values for the new attribute/label variables, so users do not have to provide values if they don't want to.

As with the schema update PR, we could merge this as is, or we could pair-program some doctests. Whatever works best for you...

huddlej · 2021-06-03T23:33:23Z

augur/clades.py

+def create_node_data_structure(basal_clade_nodes, clade_membership, args):
+    node_data = {}
+
+    if (not args.label_name and not args.trait_name):


Can we make clade and clade_membership the default values for the label and attribute names and store their values in the new structure? Is there a reason to make these required arguments in the future?

As a user, I would be surprised to find I need to define these values when I've never needed to before and I'd probably just use the defaults anyway.

If we allow these arguments to have default values, then we only need five lines of this function and those can be moved into run.

Edit (james) - got confused with GitHub's inlining, hid this comment, and now can't unhide it...

augur/clades.py

augur/export_v2.py

jameshadfield · 2021-06-05T04:44:40Z

Thanks for the great review @huddlej. Following on from #728 (comment), I've updated augur clades with your suggestions in 16280dd (I'll squash this with its parent before merge), but I haven't updated augur export v2 yet.

I started some overly simple functional tests of augur clades using a small tree and a few mutations:

While creating these tests I noticed a bunch of little things which are all out of scope for this PR... should I create issues for these?

Despite the help indicating that nucleotide and/or amino-acid mutations are required, the node-data JSONs, when combined, must contain muts and aa_muts keys for each node because the augur clades code assumes their existence.
Every node in the tree must have a corresponding entry in a node-data JSON, even if it has no mutations (this is asserted in NodeReader)
A single branch can define multiple mutations at the same position without an error being thrown, but each mutation overrides the previous and the results are unexpected. We should probably exit in this case.
#-prefixed lines in the clades TSV work as comments, but they're actually read as potentially valid clade definitions! I suggest we add comment='#' to pd.read_csv here.
The behaviour of augur clades means that if there are multiple nodes containing clade-defining mutations (i.e. the clade is polyphyletic), then we only annotate clades on the biggest monophyly. We should warn when situations like this arise, or allow this to be relaxed. I expect it'll become common to want to define "clades" via a small set of constellation nCoV mutations, and expect polyphyletic colourings in Auspice.
Relatedly, how we calculate "biggest" took a bit of time for me to understand. As far as I can tell (it may be different for VCF inputs), we count the number of descendant nodes which have not mutated away from the clade-defining set of mutations, but don't require these nodes to actually be in the clade (e.g. tipE counted as within cladeDEF for this purpose, but in the output it is (correctly) annotated as cladeE).
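To illustrate the comment-handling point in the list above, here is a small sketch that skips `#`-prefixed lines before parsing a clades TSV. It uses the standard-library `csv` module as a stand-in for the `pd.read_csv(..., comment='#')` suggestion, and unlike pandas it only skips whole-line comments. The function name and the column names (`clade`, `gene`, `site`, `alt`) are assumptions for the example.

```python
import csv
import io


def read_clade_definitions(handle):
    """Parse a tab-separated clades file, ignoring blank and '#'-prefixed lines.

    Hypothetical helper: only whole-line comments are skipped, so a '#'
    appearing mid-line is kept as data.
    """
    rows = (
        line for line in handle
        if line.strip() and not line.lstrip().startswith("#")
    )
    return list(csv.DictReader(rows, delimiter="\t"))


example_tsv = io.StringIO(
    "clade\tgene\tsite\talt\n"
    "# this line should not become a clade definition\n"
    "20A\tnuc\t3037\tT\n"
)
definitions = read_clade_definitions(example_tsv)
print(definitions)
```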

tests/functional/clades.t

rneher · 2021-06-09T19:52:41Z

This looks pretty good to me. A few questions:

We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly. I think the current (and probably previous) behavior

augur/augur/clades.py

Lines 152 to 158 in 16280dd

    
    # propagate 'clade_membership' to children nodes
    # don't propagate if encountering 'clade_annotation'
    for node in tree.find_clades(order = 'preorder'):
        for child in node:
            # if the child doesn't define the start of its own clade, but the parent belongs to a clade, then inherit that membership
            if child.name not in basal_clade_nodes and node.name in clade_membership:
                clade_membership[child.name] = clade_membership[node.name]

is fine. But might be good to stick in a comment.
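The root-node case can be seen in a self-contained sketch of the same propagation. It is a simplification, not the augur code: a plain dict of child lists stands in for the Bio.Phylo tree, and the function name is hypothetical. When no clade is defined at or above a node, that node simply never appears in the result.

```python
def propagate_clade_membership(children, root, basal_clade_nodes):
    """Assign each node the clade of its nearest annotated ancestor.

    `children` maps a node name to a list of child names (a stand-in for the
    tree object); `basal_clade_nodes` maps the first node of each clade to the
    clade name. Nodes above every basal node get no entry at all.
    """
    clade_membership = dict(basal_clade_nodes)
    stack = [root]  # parents are always processed before their children
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            # Inherit only if the child does not start its own clade
            # and the parent already belongs to one.
            if child not in basal_clade_nodes and node in clade_membership:
                clade_membership[child] = clade_membership[node]
            stack.append(child)
    return clade_membership


children = {"root": ["A", "B"], "A": ["tipA1", "tipA2"], "B": ["tipB"]}
membership = propagate_clade_membership(children, "root", {"A": "cladeA"})
print(membership)
```

Here `root`, `B` and `tipB` are left without any clade membership, which is the implicit "undefined" behaviour being discussed.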

I am wondering whether we should, instead of the branch_labels used here

augur/augur/clades.py

Lines 212 to 215 in 16280dd

    
    node_data = {
        'nodes': {node: {args.attribute_name: clade} for node,clade in clade_membership.items()},
        'branch_labels': {node: {args.attribute_name: clade} for node,clade in basal_clade_nodes.items()}
    }

use a structure like this

{
  nodes: { node1: { key: value}...},
  branches: { branch1: {key:value}...}
}

the key: value in this case could be labels: { pango: B.1.1.7}.

The structure would be a bit more symmetrical in branches and nodes and might be more future proof because we could add additional branch attributes without cluttering the top level. In other commands, the top level is used for auxiliary info like version numbers, etc.

jameshadfield · 2021-06-09T22:23:09Z

Thanks @rneher

Current root node behaviour hasn't changed, but I'm not exactly sure what you mean. Are you saying that if a clade should be defined at the root, augur clades wouldn't do this? I would have expected it to do so if you provided a reference sequence.

re: updated nodes & branches structure, you're essentially proposing that the node-data structure for branches start to converge on the auspice dataset structure for branch_attrs. Would it be strange to have different structure for nodes & branches in node-data JSONs? cc @huddlej

This commit is a WIP commit to test the new functionality being introduced in augur PR 728 [1]. This allows us to simplify the nCoV workflow as we can explicitly define the attribute names used for clade membership and branch labelling. These changes have only been tested for the "open" build, which itself is a WIP. [1] nextstrain/augur#728

huddlej · 2021-06-14T22:52:45Z

From @rneher's review:

We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly.

@jameshadfield, I understood this to mean that Augur is not guaranteed to assign the root node to a clade (the root sequence might not have any of the defined mutations), so its clade membership is implicitly undefined. We could add a check for the root node in the clades dict and then explicitly assign it a value, to make this logic clearer.

From @rneher's review:

The structure would be a bit more symmetrical in branches and nodes and might be more future proof...[snip]

I like this symmetry, too, as a flexible way to annotate anything we like about branches and mostly for the parallel naming of "physical" objects.

From @jameshadfield:

Would it be strange to have different structure for nodes & branches in node-data JSONs?

When you and I talked about this on Zoom, @jameshadfield, I think this was why we didn't use branches instead of branch_labels, but now I don't fully understand the issue. Is the issue that the final Auspice JSONs produce node_attrs and branch_attrs that don't have the same structure, so it might be misleading to use node data JSON inputs that suggest those inputs will be structured similarly?

Even if this is the case, it seems that the node data JSON format is a kind of generic interface that could be decoupled from how the final output of augur export handles the data. Not knowing nearly as much about Auspice as you and Richard, I wouldn't be surprised if augur export transformed my node data into something Auspice-specific...

huddlej

Thank you for the edits to the main interface, @jameshadfield. This looks really good. The only bit to resolve before we merge is the branch_labels vs. branches naming question.

huddlej · 2021-06-14T22:56:28Z

augur/clades.py


-    # third pass to propagate 'clade_membership'
+    # propagate 'clade_membership' to children nodes


This is where we could check for the root node's clade membership and assign it something like "undefined", if we wanted to handle this case explicitly.

I'm still a bit unsure about all this.

Nodes not part of clades ("undefined") aren't part of the output of augur clades, so explicitly annotating the root node as such would be strange.

When the inputs define the sequence for the root node, then the root can be annotated with a clade - see the nCoV workflow where the root node+branch are assigned clade 19A. My understanding is that this needs the entire root sequence as an input; we don't infer this from the observed mutations, but this is something we should test / document.

tests/functional/clades.t

jameshadfield · 2021-06-15T02:26:20Z

Rebased this onto master now that #737 is merged & updated the issues @huddlej pointed out.

My observation about the node-data structure doesn't involve auspice, rather the difference this would cause in nodes & branches structure within a single node-data JSON, with branches having a second level of hierarchy, e.g.

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "labels": {
      "NODE_0000000": {
        "clade_membership": "19A"
      }
    }
  }
}

As long as we are aware of this, I'm happy to shift to this structure.

huddlej · 2021-06-15T19:33:54Z

Ah, I see. That example clears it up. I think what @rneher is recommending looks like this instead:

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "NODE_0000000": {
      "labels": {
        "clade": "19A"
      }
    }
  }
}

How do you feel about this approach?

rneher · 2021-06-15T21:16:25Z

yes, this is what I meant. I hope this is more generic and future proof (things like support values could live on branches). On the other hand, we do assign a bunch of things to nodes that really should be branch properties (like mutations or branch lengths). So I guess we could stick the branch label to the node structure as

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A",
      "branch_labels": {"clade": "20A"}
    }
  }
}

But I would prefer a top-level branches to a top-level branch_labels.

jameshadfield · 2021-06-15T23:14:52Z

Updated this PR to use the new structure from @huddlej / @rneher above:

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "NODE_0000000": {
      "labels": {
        "clade": "19A"
      }
    }
  }
}

And added some more functional tests. I think it'd be worth running nextstrain/ncov#660 with this (updated) PR as a final round of tests before merge. I'll start this run now.
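For reference, a short sketch of how a node-data dict with this shape could be assembled from a clade-membership mapping and a mapping of basal clade nodes. The function name and its keyword arguments are hypothetical; only the `nodes` / `branches` / `labels` layout is taken from the structure shown above.

```python
import json


def build_node_data(clade_membership, basal_clade_nodes,
                    membership_name="clade_membership", label_name="clade"):
    """Build a node-data dict with top-level `nodes` and `branches` keys.

    Every node in `clade_membership` gets a node attribute; only the basal
    node of each clade gets a branch label. Names here are illustrative.
    """
    return {
        "nodes": {
            node: {membership_name: clade}
            for node, clade in clade_membership.items()
        },
        "branches": {
            node: {"labels": {label_name: clade}}
            for node, clade in basal_clade_nodes.items()
        },
    }


node_data = build_node_data(
    clade_membership={"NODE_0000000": "19A", "ARG/Cordoba-12873-61/2020": "20A"},
    basal_clade_nodes={"NODE_0000000": "19A"},
)
print(json.dumps(node_data, indent=2))
```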

tests/functional/clades/expected-output-default.json

jameshadfield · 2023-04-11T02:18:34Z

After being on the agenda forever I'm finally going to get this merged. The overall summary is as per this comment above.

[@joverlee521] Based on conversation in Auspice, we should check that any arbitrary label_key is not "none" so that they don't clash with ?branchLabel=none to hide branch labels.

Good call - I've modified augur export and added this to a test to ensure such a key will not be exported.
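A rough sketch of the kind of guard described here, assuming an exact match on the key "none" and assuming the offending label is skipped rather than raising an error. The helper name is hypothetical and the real augur export logic may differ.

```python
def drop_reserved_branch_labels(labels, reserved=("none",)):
    """Split branch labels into those safe to export and those skipped.

    A label keyed "none" would clash with Auspice's ?branchLabel=none query,
    which hides branch labels, so it is held back from export.
    """
    kept = {}
    skipped = []
    for key, value in labels.items():
        if key in reserved:
            skipped.append(key)
        else:
            kept[key] = value
    return kept, skipped


kept, skipped = drop_reserved_branch_labels({"clade": "19A", "none": "hidden"})
print(kept, skipped)
```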

[@trvrb] Start with labels as the only thing in branches and plan to migrate mutations etc... down the line.

I think this is the better direction - as per John & Richard's comments above. I do see the fear that mutations never get moved across, but I hope they do!

[@rneher] We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly.

I'm still wrapping my head around this and trying to construct a test to really understand what's going on here (and to understand if providing a reference changes things). I'll do that separately to this PR however as the behavior is unchanged here

This PR will close #720
This PR will close #1027

Our current implementation of read_node_data requires that every node in the tree is specified in the (merged) node_data files. For mutations this is overkill: many nodes don't have mutations, and node_data JSONs shouldn't have to specify things like `"node_name": {"muts": []}`. This may well be the general behaviour we want, but I didn't want to modify the read_node_data function, which sees extensive use. A welcome side effect of these changes is that we no longer have to supply both nuc and aa_muts.
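The relaxed behaviour can be illustrated with a small lookup that treats a missing node, or a missing `muts` / `aa_muts` key, as "no mutations". This is a sketch under assumptions: the helper name is hypothetical, and `aa_muts` is assumed to be a mapping from gene name to a list of mutations.

```python
def mutations_for_node(node_data, node_name):
    """Return the mutations recorded for a node, defaulting to empty containers.

    Nodes absent from the node data, or lacking either key, are treated as
    having no mutations instead of raising an error.
    """
    node = node_data.get("nodes", {}).get(node_name, {})
    return {
        "muts": list(node.get("muts", [])),
        "aa_muts": dict(node.get("aa_muts", {})),
    }


example_node_data = {"nodes": {"NODE_1": {"muts": ["C3037T"]}}}
print(mutations_for_node(example_node_data, "NODE_1"))
print(mutations_for_node(example_node_data, "NODE_2"))
```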

See comments in tests/functional/clades.t. Also adds / updates comments and docstrings which were noticed as I worked through the code relating to these tests.

Workflows may be using this so I elected to hide it rather than remove it (and warn people it's a no-op if they do happen to be using it)

This function had a few subtle bugs in it which are fixed here, as well as improving the warning message to explain how this may affect clade inference. Note that the presence of sequences on nodes other than the root is not considered by augur clades.

We could check all of these up-front instead of exiting upon the first error, and such a check should be part of validation within augur clades, but this commit is a simple solution to fix a reported bug. Closes #965

Closes #1153

A fatal error is raised if no clades are defined, but if a clade is not found on the tree it's only a warning. Suggested in #735

Multiple mutations at the same position on a single branch are now a fatal error. Previous behaviour was to overwrite such mutations when parsing. Suggested by #735.
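As an illustration of this check, the sketch below parses mutation strings of the form `A123T` and raises if one branch lists two mutations at the same position. The function name, the regular expression and the error text are assumptions, not the code added in this PR.

```python
import re
from collections import Counter

# Reference state, 1-based position, derived state (e.g. "C3037T").
MUTATION_PATTERN = re.compile(r"^([A-Za-z*-])(\d+)([A-Za-z*-])$")


def check_unique_positions(node_name, mutations):
    """Return the positions mutated on a branch, raising on repeated positions."""
    positions = []
    for mutation in mutations:
        match = MUTATION_PATTERN.match(mutation)
        if match is None:
            raise ValueError(f"Cannot parse mutation {mutation!r} on node {node_name!r}")
        positions.append(int(match.group(2)))
    duplicated = sorted(pos for pos, count in Counter(positions).items() if count > 1)
    if duplicated:
        raise ValueError(
            f"Node {node_name!r} has multiple mutations at position(s) {duplicated}"
        )
    return positions


print(check_unique_positions("NODE_1", ["A1T", "C5G"]))
```

Failing early avoids the previous behaviour, where a later mutation silently overwrote an earlier one at the same position.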

Multiple improvements to augur clades

corneliusroemer · 2023-05-15T16:03:34Z

@jameshadfield It would be good to add to the Changelog how the internal representation of node data has changed. I couldn't find the info at a glance and this PR has many comments. See e.g. this failure where a script-created node-data-json is no longer accepted by export: nextstrain/conda-base#27 (comment)

ERROR: results/europe/rbd_levels.json did not contain either `nodes` or `branches`. Please check the formatting of this JSON!

Also, I think this should be reclassified as a breaking change, given that we use a lot of custom scripts in our workflows.

In PR #728, extra node data validation was introduced. In particular, files without information for either `nodes` or `branches` caused erroring. This is problematic for test scripts that may produce empty node data in test cases. This PR removes the eager validation. In the future we could reintroduce it as a warning. And possibly an error but with opt-out.

Resolves #1215

Warn instead of error when no nodes in a node data JSON, fixing an issue introduced recently in PR #728

This type of node data JSON was previously errored on by augur export; it is now accepted again:

```json
{
  "nodes": {},
  "rbd_level_details": {}
}
```

Fixes the ncov pathogen-CI issue: nextstrain/conda-base#27 (comment)

- [x] nextstrain/conda-base#27 (comment) is fixed, export now accepts empty nodes dicts again
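A sketch of the "warn rather than exit" variant mentioned above, using Python's `warnings` module. The function name and the message wording are hypothetical, and whether augur emits a warning at all in this situation is not confirmed by the text.

```python
import warnings


def check_node_data(filename, node_data):
    """Warn (rather than exit) when a node-data dict has no usable content.

    Returns True if `nodes` or `branches` holds at least one entry.
    """
    has_content = bool(node_data.get("nodes")) or bool(node_data.get("branches"))
    if not has_content:
        warnings.warn(
            f"{filename} has no entries under `nodes` or `branches`; "
            "nothing from this file will be exported."
        )
    return has_content


# The empty node data from the failing CI run is accepted with a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_node_data(
        "results/europe/rbd_levels.json",
        {"nodes": {}, "rbd_level_details": {}},
    )
print(ok, [str(w.message) for w in caught])
```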

This updates the workflow to use the new clades interface from augur v22 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used, but not removed here, as it contains logic to export spike mutations as branch labels which may be useful at some point. If we do use this, it would be better to produce an intermediate node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.

The intention of the coloring logic is that if an auspice-config provides the clade_membership key then it is exported at that position in the colorings list. If clade_membership is not explicitly set in the config (but is present in a node-data file) then we have (for a very long time) added it as the very first entry in the colorings list. PR #728 (augur v22.0.0) erroneously modified the behavior of the second case described above, which has now been restored by this commit.
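The two cases described above can be sketched as follows. This is not augur's implementation: `order_colorings` and its parameters are hypothetical names used only to make the intended ordering concrete.

```python
def order_colorings(config_keys, node_data_keys, key="clade_membership"):
    """Return coloring keys in the order they would be exported.

    Hypothetical sketch of the intended behaviour:
    - if `key` is listed in the auspice config, it keeps its configured position;
    - if it is absent from the config but present in node data, it goes first.
    """
    ordered = list(config_keys)
    if key not in ordered and key in node_data_keys:
        ordered.insert(0, key)
    return ordered


# Case 1: explicitly configured, so the configured position is kept.
print(order_colorings(["region", "clade_membership", "country"], {"clade_membership"}))
# Case 2: not configured but present in node data, so it becomes the first entry.
print(order_colorings(["region", "country"], {"clade_membership"}))
```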

This updates the workflow to use the new clades interface from augur v22.0.1 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used, but not removed here, as it contains logic to export spike mutations as branch labels which may be useful at some point. If we do use this, it would be better to produce an intermediate node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.

This updates the workflow to use the new clades interface from augur v22 (see nextstrain/augur#728). In the process we can remove two rules from the workflow. The minimum augur version is bumped to 22.0.1, as that includes a couple of important bug fixes. If this workflow is run with augur prior to v22, the emerging_lineages rule will error due to unknown arguments. The script add_branch_labels.py is no longer used and thus removed here (as recommended in code review: #1000 (comment)). Note that it contained unused functionality to export spike mutations; if we reinstate this in the future we should update the output format to produce a node-data JSON with a custom branch label to avoid modifying the auspice JSON after export.

The JSON output from `augur clades` was updated to separate `nodes` and `branches` in nextstrain/augur#728 so now the `assign_rbd_levels` script needs to parse the `branches` in order to find the basal node.
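A rough sketch of what parsing `branches` to locate basal nodes might look like is given below. It is not the actual `assign_rbd_levels` code. The function name `basal_nodes` is hypothetical, and the layout assumed here (each entry in `branches` holding a `labels` dict keyed by the label name) is an assumption about the node data format.

```python
def basal_nodes(node_data, label_name="clade"):
    """Map each branch label value to the node whose branch carries it.

    Assumed (not confirmed here) layout:
    {"branches": {node_name: {"labels": {label_name: value}}}}
    """
    basal = {}
    for node_name, branch in node_data.get("branches", {}).items():
        value = branch.get("labels", {}).get(label_name)
        if value is not None:
            basal[value] = node_name
    return basal


# Illustrative data only; node and clade names are made up.
example = {
    "nodes": {
        "NODE_1": {"clade_membership": "21L"},
        "tip_A": {"clade_membership": "21L"},
    },
    "branches": {
        "NODE_1": {"labels": {"clade": "21L"}},
    },
}
print(basal_nodes(example))
```

Reading the label from `branches` rather than inferring it from `nodes` avoids having to walk the tree to find where a clade membership first appears.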

jameshadfield requested review from huddlej and a team May 27, 2021 23:46

jameshadfield commented May 27, 2021

augur/clades.py (outdated, resolved)

jameshadfield force-pushed the branch-labels branch from 081628c to 32fe34e Compare May 28, 2021 05:09

jameshadfield commented May 28, 2021

jameshadfield mentioned this pull request May 30, 2021

Use (new) augur clades functionality to simplify workflow nextstrain/ncov#647

Open

huddlej requested changes Jun 3, 2021

jameshadfield force-pushed the branch-labels branch from 4d05cab to 16280dd Compare June 5, 2021 04:11

jameshadfield commented Jun 7, 2021

tests/functional/clades.t (outdated, resolved)

huddlej mentioned this pull request Jun 7, 2021

Document "node data" JSON format nextstrain/docs.nextstrain.org#58

Open

jameshadfield mentioned this pull request Jun 9, 2021

Grab bag of improvements to augur clades #735

Open

6 tasks

jameshadfield force-pushed the branch-labels branch from 16280dd to 8a0c06e Compare June 10, 2021 00:14

jameshadfield mentioned this pull request Jun 10, 2021

Use new augur clades functionality nextstrain/ncov#660

Closed

huddlej added this to the Feature release 12.1.0 milestone Jun 14, 2021

huddlej reviewed Jun 14, 2021

jameshadfield force-pushed the branch-labels branch from 8a0c06e to 866c968 Compare June 15, 2021 02:12

jameshadfield force-pushed the branch-labels branch from 866c968 to f7cc5ab Compare June 15, 2021 23:12

huddlej approved these changes Jun 15, 2021

trvrb reviewed Jun 15, 2021

tests/functional/clades/expected-output-default.json (outdated, resolved)

jameshadfield force-pushed the branch-labels branch from f7cc5ab to 422084d Compare June 15, 2021 23:52

jameshadfield force-pushed the branch-labels branch from 15b29a8 to 007cb47 Compare April 11, 2023 02:13

jameshadfield mentioned this pull request Apr 11, 2023

ENH: Allow specification of node_data key that augur clades outputs #1027

Open

jameshadfield added 8 commits April 11, 2023 21:03

[clades] tests for clades set at the root node

22e2444

See comments in tests/functional/clades.t. Also adds / updates comments and docstrings which were noticed as I worked through the code relating to these tests.

[clades] supress unused --references arg

0cb841d

Workflows may be using this so I elected to hide it rather than remove it (and warn people it's a no-op if they do happen to be using it)

[clades] catch error where pos is beyond ref length

a356a9e

We could check all of these up-front instead of exiting upon the first error, and such a check should be part of validation within augur clades, but this commit is a simple solution to fix a reported bug. Closes #965

[clades] require required arguments

2c6b662

Closes #1153

[clades] warnings for unfound clades

40e549d

A fatal error is raised if no clades are defined, but if a clade is not found on the tree it's only a warning. Suggested in #735

[clades] check for multiple mutations at same pos

e5cfc3a

Multiple mutations at the same position on a single branch are now a fatal error. Previous behaviour was to overwrite such mutations when parsing. Suggested by #735.

jameshadfield mentioned this pull request Apr 11, 2023

Multiple improvements to augur clades #1199

Merged

1 task

Merge pull request #1199 from nextstrain/clade-fixes

dd318ba

Multiple improvements to augur clades

jameshadfield merged commit 631feb6 into master May 4, 2023

jameshadfield deleted the branch-labels branch May 4, 2023 03:50

This was referenced May 15, 2023

fix: don't error if node data file is empty #1214

Merged

BUG: Export complains if node data json contains only empty dicts for nodes and branches #1215

Closed

joverlee521 mentioned this pull request May 13, 2024

Use gene reference files to generate E gene trees nextstrain/dengue#48

Merged

2 tasks

		# third pass to propagate 'clade_membership'
		# propagate 'clade_membership' to children nodes
