-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minimal export (no node-data files needed) #1299
Conversation
18281c0
to
4991b9a
Compare
Codecov ReportPatch coverage is
📢 Thoughts on this report? Let us know!. |
4991b9a
to
416e2a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So cool, @jameshadfield! I left some comments for consideration, but none are blocking.
The above minimal.json takes divergence from the newick file. This converts newick divergences of (e.g.) '1' to `1.0` | ||
because BioPython uses floats (which is perfectly reasonable). Remove the decimal to diff the JSON. | ||
(Note that Auspice won't behave any differently) | ||
$ sed 's/\.0//' minimal.json > minimal.no-decimal.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use --significant-digits 0
in the call to diff_jsons.py
below to achieve this same effect without additional copies of the JSONs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh cool, I forgot this was an option!
Unfortunately in the case of 0 significant digits we are still left with the associated int vs float type difference (maybe? It doesn't look like it's removing the .0
at all, but I presume deepdiff is showing the unmodified input values here)
{'old_type': <class 'int'>, 'new_type': <class 'float'>, 'old_value': 1, 'new_value': 1.0},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm sorry that wasn't just a one-line fix, @jameshadfield! DeepDiff supports ignoring type changes for this scenario. If you're open to it, I can push a commit that adds a flag for the diff JSONs script to --ignore-numeric-type-changes
. I bet this would be handy for other tests, too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this functionality in 57bdf28.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this reminded me that there is a deep diff CLI that we could eventually switch to... 🤦🏻
return lambda node, metadata: metadata['branch_length'] | ||
if T.root.branch_length is not None: | ||
return lambda node, metadata: node.branch_length | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw an exception when no branch length data are available instead of returning None? If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error, but we could raise an exception that gets nicely printed by Augur's generic exception handling logic instead.
I think the scenario where this function gets called and its output sent to convert_tree_to_json_structure
is different from when we pass None
to convert_tree_to_json_structure
, since it suggests we expect to find divergence information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error
I think you're referring to this line? That would actually only print an error if div¹ values were set on the root node but missing on some other node on the tree. In other words, trees without any div values were just fine. I wouldn't be surprised if the error was never printed! We now avoid this situation by requiring that such information is present for every node in the tree. Although I guess we don't technically enforce that for newick trees (root branch with div, but a later branch missing it), perhaps that's worth covering?
the scenario where this function [
node_div()
] gets called ... suggests we expect to find divergence information.
There'll be situations where we don't have divergence info, e.g. BEAST trees or any dataset with a newick tree without branch lengths. Not something we commonly run ourselves, but they do exist. If temporal information is not provided either then the dataset will fail validation. Open to improvements here tho!
¹ Here this would be mutation_length
or branch_length
in the node-data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There'll be situations where we don't have divergence info
Whoa...I didn't realize this happened. I see why you'd want to be more lenient in the branch length handling, then. Thank you for this additional context!
@@ -130,7 +130,7 @@ | |||
"type" : "object", | |||
"$comment": "The phylogeny in a nested JSON structure", | |||
"additionalProperties": false, | |||
"required": ["name"], | |||
"required": ["name", "node_attrs"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this change and the two below to required attributes require a major version release for Augur? I can't remember how we decided to treat schema updates, but since more external software now consumes Auspice JSONs, we might want to announce these changes more loudly/broadly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of Auspice a tree without node_attrs
was never valid, in the sense that it wouldn't render correctly (because this implies no distance metric). In the context of Augur perhaps such datasets existed? I can't think of any situations where it would be used, but maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add to change log, but probably no major version needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added changelog entry mentioning this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. I think it's just missing a change log entry, unless I'm blind
@@ -111,44 +111,63 @@ def order_nodes(node): | |||
order_nodes(od['tree']) | |||
return od | |||
|
|||
def convert_tree_to_json_structure(node, metadata, div=0): | |||
|
|||
def node_div(T, node_attrs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotations could be helpful here, but this is optional of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! It's been on my to-do list to start using them in Python code, but not for this PR 😬
A number of parts of the auspice-config have identical (or almost-identical) shape to those in the resulting dataset JSON, although the actual data may be modified as it passes through `augur export v2`. Rather than referencing the entire auspice-config schema and pruning down properties, which I don't actually think is possible in jsonschema, I chose to use $refs at a more fine grained level which I find easier to read. The actual schema definitions should be unchanged by this commit, although comments / descriptions have been improved.
These work fine in Auspice. While the 'colorings' property is optional, `augur export v2` will always set a (possibly empty) array. I also chose to allow the auspice config file to have an empty colorings definition, which in practice behaves the same as leaving it out. Addresses comment in #273 <#273 (comment)>
Removes previously valid string patterns which were never used within augur and would result in unexpected behaviour in auspice. Also updates the patternProperties of CDSs to match that used in the genome_annotations (schema-annotations.json)
Trees could currently be produced with neither "div" nor "num_date" information. Arguably Auspice could interpret these as cladograms but as it stands these datasets aren't rendered by Auspice. Datasets without this information are easy to create but rare in practice: If the node-data files don't define "mutation_length" or "branch_length" then there's no "div" and if they don't define "num_date" then that's not there either. Note that what we really want to require is that "div" is present on all nodes and/or "num_date" is present on all nodes, but the schema doesn't let us do this.
Allows a minimal `augur export` using only a (newick) tree as input, functionality that we've wanted for over 4 years! To facilitate this we parse branch lengths¹ from the newick file if such data wasn't available in the node-data inputs (e.g. because there are none!). The code for deciding where to read divergence from has been refactored and in the process improved: the (rare? never encountered?) case where divergence was sometimes read from node-data keys 'mutation_length' and sometimes from 'branch_length' can non longer happen. If data is provided which doesn't define divergence or num_date (irregardless of whether node-data files were provided as inputs), then the resulting dataset will fail validation. Closes #273 <#273> ¹ I suppose these might represent time in certain cases, but I haven't seen such data in Newick files.
Adds a flag to the diff JSONs script to ignore numeric type changes when running DeepDiff [1]. Updates the export v2 minimal export test to use this new flag instead of creating an intermediate file with sed. [1] https://zepworks.com/deepdiff/current/ignore_types_or_values.html
57bdf28
to
9afa278
Compare
Rebased onto master, changelog entries added, and PR base changed to master. |
@@ -181,7 +184,13 @@ | |||
"type": "string", | |||
"pattern": "^(none|[a-zA-Z0-9]+)$" | |||
}, | |||
"tip_label": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inclusion of tip_label
is a bug - as Auspice doesn't yet support this tip label. Unless nextstrain/auspice#1692 gets merged soon we should revert this from the schema.
I'm sorry I missed this in review, I'll remember to look more carefully at schema changes.
@jameshadfield
I've made an issue for better support of different auspice schema versions here: #1326
* A number of schema updates and improvements [#1299][] (@jameshadfield) | ||
* We now require all nodes to have `node_attrs` on them with one of `div` or `num_date` present | ||
* Some never-used properties are removed from the schemas, including a pattern for defining nucleotide INDELs which was never used by augur or auspice. | ||
* Tip label defaults are now settable within the auspice-config JSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowed per validation in auspice-config but not yet supported by current auspice 🙃
I think my preference would be to first allow something in Auspice, then add it to schema, rather than other way round - but I guess as long as auspice still works and just lacks a minor default feature, it's no a big deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah totally, I forgot that auspice PR hadn't been merged + released. I'll do that today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auspice 2.49.0 released which parses this display_default. It'll appear on nextstrain.org / auspice.us etc over the coming day or two
Allows a minimal
augur export
using only a (newick) tree as input,functionality that we've wanted for over 4 years! To facilitate this we
parse branch lengths¹ from the newick file if such data wasn't available
in the node-data inputs (e.g. because there are none!).
The code for deciding where to read divergence from has been refactored
and in the process improved: the (rare? never encountered?) case where
divergence was sometimes read from node-data keys 'mutation_length' and
sometimes from 'branch_length' can non longer happen.
If data is provided which doesn't define divergence or num_date
(irregardless of whether node-data files were provided as inputs), then
the resulting dataset will fail validation.
Closes #273
¹ I suppose these might represent time in certain cases, but I haven't
seen such data in Newick files.