filter: Try converting all queried columns to numerical type #1268

victorlin · 2023-07-28T18:26:52Z

Description of proposed changes

The dtype inference in augur.io.read_metadata does not support numerical columns with empty values (because it calls pandas.read_csv with na_filter=False¹). This gets around that limitation by converting columns before querying.

I also considered infer_objects and convert_dtypes, but those are not useful here since they only support soft (not hard) conversions².

¹ a1bfce4
² https://stackoverflow.com/a/60278450
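
The limitation described above can be reproduced outside of augur. The following is a minimal sketch, not the augur implementation: the TSV content and the `coverage` column are made up for illustration, and it assumes only that the metadata is read with `na_filter=False`, as the description states.

```python
import io

import pandas as pd

# Hypothetical metadata: the numerical "coverage" column has one empty value.
TSV = "strain\tcoverage\nA\t10\nB\t\nC\t30\n"

# With na_filter=False, empty fields stay as empty strings,
# so the column cannot be inferred as a numerical type.
metadata = pd.read_csv(io.StringIO(TSV), sep="\t", na_filter=False)
raw_is_numeric = pd.api.types.is_numeric_dtype(metadata["coverage"])

# Soft conversion: picks a better dtype for the existing values,
# but does not parse strings as numbers.
soft = metadata["coverage"].convert_dtypes()
soft_is_numeric = pd.api.types.is_numeric_dtype(soft)

# Hard conversion: parses the strings; the empty string becomes NaN.
hard = pd.to_numeric(metadata["coverage"])

# After the hard conversion, a numerical comparison works in a query.
metadata["coverage"] = hard
matches = metadata.query("coverage >= 20")["strain"].tolist()
```

Here only the `to_numeric` result is a numerical column, which is why the soft conversions were ruled out.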

Related issue(s)

Fixes #1269

Addresses #1252 (comment)

Prompted by in-lab discussion with a user.

Testing

Test added and updated to show change in behavior
Checks pass

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

codecov · 2023-07-28T19:39:42Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.31% 🎉

Comparison is base (4f5559a) 69.36% compared to head (ce756c3) 69.67%.
Report is 6 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1268      +/-   ##
==========================================
+ Coverage   69.36%   69.67%   +0.31%     
==========================================
  Files          66       67       +1     
  Lines        7024     7104      +80     
  Branches     1708     1727      +19     
==========================================
+ Hits         4872     4950      +78     
- Misses       1847     1848       +1     
- Partials      305      306       +1

Files Changed	Coverage Δ
augur/filter/include_exclude_rules.py	`97.93% <100.00%> (+0.19%)`	⬆️

... and 2 files with indirect coverage changes



Ultimately, to_numeric only needs to be called on the columns used for numerical comparison. Converting just the columns that appear in the query gets us halfway there, since in most cases only a small subset of metadata columns is used in the query. This approach is hacky, but it is more computationally efficient than converting every metadata column.


victorlin · 2023-07-28T23:03:39Z

augur/filter/include_exclude_rules.py

+    # Try converting all queried columns to numeric.
+    for column in extract_variables(query).intersection(metadata.columns):
+        metadata[column] = pd.to_numeric(metadata[column], errors='ignore')


In the force-pushes above, I split 77ef4a5 into b325b97 + 2ead5b3.
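
For readers who want to try the idea in the diff above in isolation, here is a standalone sketch under stated assumptions. `convert_queried_columns` and its `queried_columns` argument are hypothetical names that stand in for the column names the real code extracts from the query. The sketch also uses `try`/`except` instead of `errors='ignore'`, since that option has been deprecated in more recent pandas releases.

```python
import pandas as pd


def convert_queried_columns(metadata, queried_columns):
    """Return a copy of `metadata` with queried columns made numeric where possible.

    `queried_columns` is a hypothetical stand-in for the variable names
    that the real code extracts from the query string.
    """
    converted = metadata.copy()
    # Only touch names that are actual columns of the metadata.
    for column in set(queried_columns).intersection(converted.columns):
        try:
            # Hard conversion: parse strings as numbers; empty strings become NaN.
            converted[column] = pd.to_numeric(converted[column])
        except (ValueError, TypeError):
            # Not a numerical column (e.g. free text): leave it unchanged.
            pass
    return converted


# Hypothetical metadata as it would look after reading with na_filter=False.
metadata = pd.DataFrame({
    "strain": ["A", "B", "C"],
    "coverage": ["10", "", "30"],
    "region": ["Asia", "", "Europe"],
})

converted = convert_queried_columns(metadata, ["coverage", "region", "not_a_column"])
filtered = converted.query("coverage >= 20")
```

In this sketch `coverage` becomes numerical, `region` fails to parse and is left as text, and the unknown name is skipped, so only queried columns pay the conversion cost.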

victorlin · 2023-07-31T18:08:21Z

Merging pre-review since this is a functional improvement, as noted on Slack.

Add test to show existing behavior

e914cab

victorlin self-assigned this Jul 28, 2023

victorlin force-pushed the victorlin/filter-query-numerical branch from 23f409c to cdeb40d Compare July 28, 2023 19:36

victorlin marked this pull request as ready for review July 28, 2023 19:42

victorlin requested a review from a team July 28, 2023 19:42

victorlin force-pushed the victorlin/filter-query-numerical branch from 4a99511 to 5c8f6b8 Compare July 28, 2023 22:59

victorlin added 4 commits July 28, 2023 16:02

Import pandas UndefinedVariableError as a global

602e3d5

Since this is now being used in multiple places of the file.

Update changelog

ce756c3

victorlin force-pushed the victorlin/filter-query-numerical branch from 5c8f6b8 to ce756c3 Compare July 28, 2023 23:02


victorlin mentioned this pull request Jul 31, 2023

filter: --query fails when numerical comparisons are used on columns with missing values #1269

Closed

victorlin merged commit a35f7a6 into master Jul 31, 2023
26 checks passed

victorlin deleted the victorlin/filter-query-numerical branch July 31, 2023 18:08

victorlin mentioned this pull request Aug 16, 2023

filter: Remove attempt at extracting variables from --query #1278

Merged

3 tasks
