-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Remove attempt at extracting variables from --query
#1278
filter: Remove attempt at extracting variables from --query
#1278
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1278 +/- ##
==========================================
- Coverage 69.80% 69.74% -0.06%
==========================================
Files 67 67
Lines 7160 7146 -14
Branches 1742 1742
==========================================
- Hits 4998 4984 -14
Misses 1855 1855
Partials 307 307
☔ View full report in Codecov by Sentry. |
for column in metadata.columns: | ||
metadata[column] = pd.to_numeric(metadata[column], errors='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is converting to numeric in place? Could that potentially affect downstream filters?
I'm specifically thinking of the date
column being all year values, so they would convert to numeric. Then they would cause errors in any date filtering queries that expect them to be strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this errors only when using both --query
and --exclude-ambiguous-dates-by
. I wrote an example test for this which passes locally on master but fails when rebased onto this branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely. This partially reverts commit 2ead5b3 and related changes. I chose to keep the definition of PandasUndefinedVariableError at the top-level¹ because it keeps external references next to each other, and typing_extensions in the dependency list² because it is bound to be useful at some point, and updating dependencies on Augur's Bioconda recipe is a hassle. ¹ 602e3d5 ² 2658659
Shows the interaction between the query and exclude-ambiguous-dates-by options. This test currently does not work as intended and will be fixed in the following commit.
This fixes the unexpected behavior described in the previous commit.
9f9bc4f
to
58e24a3
Compare
I have confidence in these changes after @joverlee521's review, so merging without approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-merge approval 👍
I'm not a huge fan of the blanket conversion to numeric, but it should cover a majority of the use cases for augur filter
. There are probably weird edge cases but they're probably unavoidable until we allow user defined types.
Some that immediately come to mind are numeric strain names or clade names. In a current project, I just needed to write a query like this in a standalone Python notebook, where my clade names are SHA256 hashes that can be fully numeric: large_frequencies.query(
"(future_timepoint == '2018-01-01') & (clade_membership in ['8000939', 'd9f0744']) & (horizon == 12)"
) If I tried to do a query like that with this PR's |
I agree the blanket conversion is not ideal. This was a problem introduced with I can't think of any harm as long as it's scoped to usage in this function (by not modifying values in-place). There are 2 possible scenarios:
(1) will silently fail numeric conversion. If a numerical query is applied to that column, it will expose the internal error to the user (e.g. (2) is already a problem with pandas's automatic type inference. This impacts queries such as the example by @huddlej above in the case that all clade names happen to be numeric in the chunk of metadata being processed. I think the proper solution is user-defined types. |
I think we could potentially sidestep further Pandas-specific query parsing and instead rely on the fact that Pandas' query grammar is a subset of Python's, and uses the |
Concretely, something like: columns_used = [
node.id
for node in ast.walk(ast.parse(query))
if isinstance(node, ast.Name) ] |
Description of proposed changes
The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely.
This partially reverts commit 2ead5b3 and related changes.
I chose to keep the definition of PandasUndefinedVariableError at the top-level¹ because it keeps external references next to each other, and typing_extensions in the dependency list² because it is bound to be useful at some point, and updating dependencies on Augur's Bioconda recipe is a hassle.
¹ 602e3d5
² 2658659
Related issue(s)
Fixes #1277
Testing
Checklist