Minimal export (no node-data files needed) #1299

jameshadfield · 2023-08-31T03:51:03Z

Allows a minimal augur export using only a (newick) tree as input,
functionality that we've wanted for over 4 years! To facilitate this we
parse branch lengths¹ from the newick file if such data wasn't available
in the node-data inputs (e.g. because there are none!).

The code for deciding where to read divergence from has been refactored
and in the process improved: the (rare? never encountered?) case where
divergence was sometimes read from node-data keys 'mutation_length' and
sometimes from 'branch_length' can no longer happen.

If data is provided which doesn't define divergence or num_date
(regardless of whether node-data files were provided as inputs), then
the resulting dataset will fail validation.

Closes #273

¹ I suppose these might represent time in certain cases, but I haven't
seen such data in Newick files.

codecov · 2023-08-31T04:18:09Z

Codecov Report

Patch coverage is 76.00% of modified lines.

❗ Current head 57bdf28 differs from pull request most recent head 9afa278. Consider uploading reports for the commit 9afa278 to get more accurate results

| Files Changed | Coverage |
|---|---|
| augur/validate.py | `ø` |
| augur/export_v2.py | `76.00%` |

huddlej

So cool, @jameshadfield! I left some comments for consideration, but none are blocking.

huddlej · 2023-08-31T20:06:46Z

tests/functional/export_v2/cram/minimal.t

+The above minimal.json takes divergence from the newick file. This converts newick divergences of (e.g.) '1' to `1.0`
+because BioPython uses floats (which is perfectly reasonable). Remove the decimal to diff the JSON.
+(Note that Auspice won't behave any differently)
+  $ sed 's/\.0//' minimal.json > minimal.no-decimal.json


Could you use --significant-digits 0 in the call to diff_jsons.py below to achieve this same effect without additional copies of the JSONs?

Oh cool, I forgot this was an option!

Unfortunately in the case of 0 significant digits we are still left with the associated int vs float type difference (maybe? It doesn't look like it's removing the .0 at all, but I presume deepdiff is showing the unmodified input values here)

{'old_type': <class 'int'>, 'new_type': <class 'float'>, 'old_value': 1, 'new_value': 1.0},

Ah, I'm sorry that wasn't just a one-line fix, @jameshadfield! DeepDiff supports ignoring type changes for this scenario. If you're open to it, I can push a commit that adds a flag for the diff JSONs script to --ignore-numeric-type-changes. I bet this would be handy for other tests, too...

I added this functionality in 57bdf28.

Although this reminded me that there is a deep diff CLI that we could eventually switch to... 🤦🏻
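For context on why the `sed` step was worth replacing: `s/\.0//` is purely textual, so it would also rewrite a value such as `1.05` to `15`. Below is a minimal sketch of a type-insensitive normalization in plain Python. `normalize_numbers` is a hypothetical helper for illustration only; it is not part of `diff_jsons.py`, which relies on DeepDiff for this.

```python
import json


def normalize_numbers(value):
    """Recursively convert integral floats (e.g. 1.0) to ints; leave other values unchanged."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so pass it through untouched
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, dict):
        return {key: normalize_numbers(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_numbers(item) for item in value]
    return value


# Hypothetical fragments: the expected JSON stores an int, the export stores a float
expected = json.loads('{"node_attrs": {"div": 1}}')
exported = json.loads('{"node_attrs": {"div": 1.0}}')
normalized = normalize_numbers(exported)
```

Python's `==` already treats `1` and `1.0` as equal, so the difference only surfaces in a type-aware diff. After normalization the two fragments also agree on type, and a non-integral value such as `1.05` is left alone.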

huddlej · 2023-08-31T20:10:08Z

augur/export_v2.py

+        return lambda node, metadata: metadata['branch_length']
+    if T.root.branch_length is not None:
+        return lambda node, metadata: node.branch_length
+    return None


Should we throw an exception when no branch length data are available instead of returning None? If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error, but we could raise an exception that gets nicely printed by Augur's generic exception handling logic instead.

I think the scenario where this function gets called and its output sent to convert_tree_to_json_structure is different from when we pass None to convert_tree_to_json_structure, since it suggests we expect to find divergence information.

If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error

I think you're referring to this line? That would actually only print an error if div¹ values were set on the root node but missing on some other node on the tree. In other words, trees without any div values were just fine. I wouldn't be surprised if the error was never printed! We now avoid this situation by requiring that such information is present for every node in the tree. Although I guess we don't technically enforce that for newick trees (root branch with div, but a later branch missing it), perhaps that's worth covering?

the scenario where this function [node_div()] gets called ... suggests we expect to find divergence information.

There'll be situations where we don't have divergence info, e.g. BEAST trees or any dataset with a newick tree without branch lengths. Not something we commonly run ourselves, but they do exist. If temporal information is not provided either then the dataset will fail validation. Open to improvements here tho!

¹ Here this would be mutation_length or branch_length in the node-data

There'll be situations where we don't have divergence info

Whoa...I didn't realize this happened. I see why you'd want to be more lenient in the branch length handling, then. Thank you for this additional context!
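To make the fallback order discussed in this thread concrete, here is a rough sketch under stated assumptions. It is not the code in `augur/export_v2.py`: `Node`, `iter_nodes` and `choose_div_source` are hypothetical names, `Node` is a stand-in for a Bio.Phylo clade, and the newick check is deliberately stricter than the quoted diff (every non-root node must have a branch length), which is one possible way to cover the "root has it, a later branch doesn't" case raised above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree clade: a name, a branch length, children."""
    name: str
    branch_length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def iter_nodes(root):
    """Yield every node in the tree, parents before children."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def choose_div_source(root, node_data):
    """Return a function mapping a node to its branch divergence, or None."""
    names = [node.name for node in iter_nodes(root)]
    # Prefer node-data, but only if one key is present on *every* node,
    # so the two keys can never be mixed within a single tree.
    for key in ("mutation_length", "branch_length"):
        if all(key in node_data.get(name, {}) for name in names):
            return lambda node, key=key: node_data[node.name][key]
    # Otherwise fall back to the tree's own branch lengths (assumption:
    # the root may lack a length, which is then treated as 0.0).
    non_root = [node for node in iter_nodes(root) if node is not root]
    if non_root and all(node.branch_length is not None for node in non_root):
        return lambda node: node.branch_length or 0.0
    # No usable source: the caller exports without divergence.
    return None
```

Returning `None` rather than raising keeps trees without divergence (e.g. time-only trees) exportable, which matches the more lenient handling argued for above; whether the dataset is then valid depends on temporal information being present.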

huddlej · 2023-08-31T20:12:49Z

augur/data/schema-export-v2.json

@@ -130,7 +130,7 @@
            "type" : "object",
            "$comment": "The phylogeny in a nested JSON structure",
            "additionalProperties": false,
-            "required": ["name"],
+            "required": ["name", "node_attrs"],


Will this change and the two below to required attributes require a major version release for Augur? I can't remember how we decided to treat schema updates, but since more external software now consumes Auspice JSONs, we might want to announce these changes more loudly/broadly.

In the context of Auspice a tree without node_attrs was never valid, in the sense that it wouldn't render correctly (because this implies no distance metric). In the context of Augur perhaps such datasets existed? I can't think of any situations where it would be used, but maybe?

Would be good to add to change log, but probably no major version needed?

Added changelog entry mentioning this

tests/functional/export_v2/cram/minimal.t

corneliusroemer

Perfect. I think it's just missing a change log entry, unless I'm blind

corneliusroemer · 2023-09-19T13:09:54Z

augur/export_v2.py

@@ -111,44 +111,63 @@ def order_nodes(node):
        order_nodes(od['tree'])
    return od

-def convert_tree_to_json_structure(node, metadata, div=0):
+
+def node_div(T, node_attrs):


Type annotations could be helpful here, but this is optional of course

Yeah! It's been on my to-do list to start using them in Python code, but not for this PR 😬

A number of parts of the auspice-config have identical (or almost-identical) shape to those in the resulting dataset JSON, although the actual data may be modified as it passes through `augur export v2`. Rather than referencing the entire auspice-config schema and pruning down properties, which I don't actually think is possible in jsonschema, I chose to use $refs at a more fine grained level which I find easier to read. The actual schema definitions should be unchanged by this commit, although comments / descriptions have been improved.

These work fine in Auspice. While the 'colorings' property is optional, `augur export v2` will always set a (possibly empty) array. I also chose to allow the auspice config file to have an empty colorings definition, which in practice behaves the same as leaving it out. Addresses comment in #273 <#273 (comment)>

Trees could currently be produced with neither "div" nor "num_date" information. Arguably Auspice could interpret these as cladograms but as it stands these datasets aren't rendered by Auspice. Datasets without this information are easy to create but rare in practice: If the node-data files don't define "mutation_length" or "branch_length" then there's no "div" and if they don't define "num_date" then that's not there either. Note that what we really want to require is that "div" is present on all nodes and/or "num_date" is present on all nodes, but the schema doesn't let us do this.
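The stronger requirement described here ("div" on all nodes and/or "num_date" on all nodes) is straightforward to express outside the schema. The following is a minimal sketch, assuming the nested `children` / `node_attrs` layout of the dataset JSON; `iter_tree` and `has_distance_metric` are hypothetical names and are not part of Augur's validation code.

```python
def iter_tree(node):
    """Yield a node and all of its descendants from the nested JSON structure."""
    yield node
    for child in node.get("children", []):
        yield from iter_tree(child)


def has_distance_metric(tree):
    """True if 'div' is on every node, or 'num_date' is on every node."""
    nodes = list(iter_tree(tree))
    return any(
        all(key in node.get("node_attrs", {}) for node in nodes)
        for key in ("div", "num_date")
    )


# Hypothetical trees: 'mixed' gives every node one of the two keys,
# yet neither key is present on every node.
consistent = {
    "name": "root",
    "node_attrs": {"div": 0},
    "children": [{"name": "A", "node_attrs": {"div": 1}}],
}
mixed = {
    "name": "root",
    "node_attrs": {"div": 0},
    "children": [{"name": "A", "node_attrs": {"num_date": {"value": 2020.5}}}],
}
```

Here `has_distance_metric(consistent)` holds while `has_distance_metric(mixed)` does not, even though each node in `mixed` individually carries one of the two attributes.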

Adds a flag to the diff JSONs script to ignore numeric type changes when running DeepDiff [1]. Updates the export v2 minimal export test to use this new flag instead of creating an intermediate file with sed. [1] https://zepworks.com/deepdiff/current/ignore_types_or_values.html

jameshadfield · 2023-09-20T04:35:37Z

Rebased onto master, changelog entries added, and PR base changed to master.

corneliusroemer · 2023-09-27T17:24:04Z

augur/data/schema-auspice-config-v2.json

@@ -181,7 +184,13 @@
                    "type": "string",
                    "pattern": "^(none|[a-zA-Z0-9]+)$"
                },
+                "tip_label": {


The inclusion of tip_label is a bug - as Auspice doesn't yet support this tip label. Unless nextstrain/auspice#1692 gets merged soon we should revert this from the schema.
I'm sorry I missed this in review, I'll remember to look more carefully at schema changes.
@jameshadfield

I've made an issue for better support of different auspice schema versions here: #1326

corneliusroemer · 2023-09-27T17:26:02Z

CHANGES.md

+* A number of schema updates and improvements [#1299][] (@jameshadfield)
+    * We now require all nodes to have `node_attrs` on them with one of `div` or `num_date` present
+    * Some never-used properties are removed from the schemas, including a pattern for defining nucleotide INDELs which was never used by augur or auspice.
+    * Tip label defaults are now settable within the auspice-config JSON


Allowed per validation in auspice-config but not yet supported by current auspice 🙃

I think my preference would be to first allow something in Auspice, then add it to the schema, rather than the other way round - but I guess as long as Auspice still works and just lacks a minor default feature, it's not a big deal either way.

Yeah totally, I forgot that auspice PR hadn't been merged + released. I'll do that today

Auspice 2.49.0 released which parses this display_default. It'll appear on nextstrain.org / auspice.us etc over the coming day or two

jameshadfield changed the title ~~Export sans node data~~ Minimal export (no node-data files needed) Aug 31, 2023

jameshadfield force-pushed the export-sans-node-data branch 2 times, most recently from 18281c0 to 4991b9a Compare August 31, 2023 04:01

jameshadfield force-pushed the export-sans-node-data branch from 4991b9a to 416e2a2 Compare August 31, 2023 04:43

huddlej reviewed Aug 31, 2023

corneliusroemer reviewed Sep 19, 2023

huddlej approved these changes Sep 19, 2023

jameshadfield and others added 9 commits September 20, 2023 16:22

Remove unused properties from schema

fbbc319

These properties were never actually used (neither exported from augur export v2 nor consumed by auspice) Closes #867 <#867>

[schema] Allow default tip labels

6ce1bf7

Auspice can already set the tip label via URL state (`?tl=...`) and will shortly be able to parse the display_default added here. Closes #1115 <#1115>

[schema] improve mutations

8209dad

Removes previously valid string patterns which were never used within augur and would result in unexpected behaviour in auspice. Also updates the patternProperties of CDSs to match that used in the genome_annotations (schema-annotations.json)

changelog

9afa278

jameshadfield force-pushed the export-sans-node-data branch from 57bdf28 to 9afa278 Compare September 20, 2023 04:34

jameshadfield changed the base branch from schema-updates to master September 20, 2023 04:35

jameshadfield mentioned this pull request Sep 20, 2023

Schema updates #1298

Closed

jameshadfield merged commit e19b5e5 into master Sep 20, 2023
1 check passed

jameshadfield deleted the export-sans-node-data branch September 20, 2023 04:38

corneliusroemer mentioned this pull request Sep 27, 2023

ENH: allow export to validate schema for a specific version of auspice #1326

Open

corneliusroemer reviewed Sep 27, 2023
