-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Try converting all queried columns to numerical type #1268
Conversation
23f409c
to
cdeb40d
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1268 +/- ##
==========================================
+ Coverage 69.36% 69.67% +0.31%
==========================================
Files 66 67 +1
Lines 7024 7104 +80
Branches 1708 1727 +19
==========================================
+ Hits 4872 4950 +78
- Misses 1847 1848 +1
- Partials 305 306 +1
☔ View full report in Codecov by Sentry. |
4a99511
to
5c8f6b8
Compare
The dtype inference in augur.io.read_metadata does not support numerical columns with empty values (because it calls pandas.read_csv with na_filter=False¹). This gets around that limitation by converting columns before querying. I also considered infer_objects and convert_dtypes, but those are not useful here since they only support soft (not hard) conversions². ¹ a1bfce4 ² https://stackoverflow.com/a/60278450
In the end, it's only worth calling to_numeric on the columns used for numerical comparison. This gets us halfway there, since in most cases, only a small subset of metadata columns are used in the query. This is a hacky approach, but it is more computationally efficient.
Since this is now being used in multiple places of the file.
5c8f6b8
to
ce756c3
Compare
# Try converting all queried columns to numeric. | ||
for column in extract_variables(query).intersection(metadata.columns): | ||
metadata[column] = pd.to_numeric(metadata[column], errors='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging pre-review since this is a functional improvement, as noted on Slack. |
Description of proposed changes
The dtype inference in augur.io.read_metadata does not support numerical columns with empty values (because it calls pandas.read_csv with na_filter=False¹). This gets around that limitation by converting columns before querying.
I also considered infer_objects and convert_dtypes, but those are not useful here since they only support soft (not hard) conversions².
¹ a1bfce4
² https://stackoverflow.com/a/60278450
Related issue(s)
Fixes #1269
Addresses #1252 (comment)
Prompted by in-lab discussion with a user.
Testing
Checklist