Read a subset of metadata columns #1294
Conversation
Force-pushed a4a7892 to 7f9e462
Real-world testing
I ran this against the metadata file produced by ncov-ingest (s3://nextstrain-data/files/ncov/open/metadata.tsv.zst), which has 8.5 million rows × 58 columns. I used the following command to sample 10 random sequences:
```sh
augur filter \
  --metadata metadata.tsv \
  --subsample-max-sequences 10 \
  --output-metadata out.tsv
```
This took 8m21s to run on master, and 6m21s with changes from this PR. Here are profiling results before and after, which I visualized in Snakeviz. A summary:
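Profiles like these can be captured with Python's built-in `cProfile` and inspected with `pstats`; dumping the stats to a `.prof` file is what lets Snakeviz visualize them. A minimal sketch (the `workload` function is a stand-in, not augur code):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the profiled augur filter run.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# profiler.dump_stats("filter.prof") would write a file viewable with
# `snakeviz filter.prof`; here we just format the top entries in memory.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```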
- Time spent accessing in-memory DataFrames dropped from 494s to 258s.
- Without these changes, `to_csv` takes just a fraction of a second because the metadata for the 10 sequences is already loaded into memory.
- With these changes, it takes 109s to run through the metadata file to find the lines for the 10 wanted sequences.
The example command sees a net improvement in run time: although writing time increased due to 5173cb7, reading time decreased even more due to ac23e80.
This was a "best case scenario" for these changes, though, since no metadata columns were used, only the strain name. I should probably test with `--group-by`, `--min-date`, and other options that load additional columns to get a better picture.
I did not do any memory profiling. Memory usage is not an issue without these changes, and should be less of an issue with the changes.
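The second pass described above, streaming the original file and keeping only the wanted rows, can be sketched like this (a simplified illustration, not augur's actual implementation):

```python
import csv
import io

def write_filtered_metadata(in_file, out_file, wanted, id_column="strain", delimiter="\t"):
    """Stream metadata row by row, writing only rows whose ID is in `wanted`.

    Memory use stays constant regardless of file size, at the cost of a
    second pass over the input.
    """
    reader = csv.DictReader(in_file, delimiter=delimiter)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[id_column] in wanted:
            writer.writerow(row)

metadata = "strain\tdate\tcountry\nA\t2021-01-01\tChina\nB\t2021-02-01\tIndia\nC\t2021-03-01\tChina\n"
out = io.StringIO()
write_filtered_metadata(io.StringIO(metadata), out, wanted={"A", "C"})
result = out.getvalue()
```

Because output rows are emitted as they are read, row order always follows the original file.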
Did you get around to doing more testing/profiling with `--group-by`, `--min-date`, and so on?
No, not yet. Still planning to do so before merging.
Tested using the ncov 100k subsample as input to an `augur filter` command I grabbed from an ncov build. Run time was 16s on master and 7.37s with these changes (cProfile files).
Summary:
- Metadata reading dropped from 4.3s to 2.15s (pandas `readers.py:read`)
- Accessing in-memory DataFrames dropped from 8.63s to 1.127s (pandas `indexing.py`)
- Output writing increased from ~0s to 1.35s (`write_metadata_based_outputs`)
```sh
augur filter \
  --metadata ~/tmp/augur-filter/metadata.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 6M \
  --exclude-where 'region!=Asia' country=China country=India \
  --group-by country year month \
  --subsample-max-sequences 200 \
  --output-strains ~/tmp/augur-filter/strains.txt
```
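Most of the reading speedup comes from loading only the columns a subcommand actually needs. In pandas terms this is the `usecols` parameter of `read_csv` (small illustrative data, not the real metadata file):

```python
import io
import pandas as pd

metadata = "strain\tdate\tcountry\tclade\nA\t2021-01-01\tChina\t20A\n"

# Parse only the columns that will actually be used; unused columns are
# skipped entirely, saving both parse time and memory.
df = pd.read_csv(io.StringIO(metadata), sep="\t", usecols=["strain", "date"])
columns = list(df.columns)
```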
I triggered an ncov GISAID trial run using a Docker image including these changes; it completed successfully in 6h 36m 55s. This is pretty much the same as another trial run 2 days before at 6h 40m 27s. I don't know how much variance there is between run times, and I don't want to compare against non-trial runs or older runs (those have additional Slack notifications and different input metadata sizes). So by this comparison alone, there doesn't seem to be a significant performance benefit for the ncov workflow with the GISAID configuration.
Hmm. ISTM that last time I looked at ncov's execution profile, by far the slowest step was TreeTime, so it's not altogether surprising that filter's speed isn't a big factor in the context of a full build.
I grabbed the `benchmarks/subsample_*` files from those runs to get a little more granular insight into differences in wall clock time and max RSS for each subsample rule invocation.
`avg(after - before)` for wall clock time was -112s, so it shaved roughly 2 minutes off each subsample step on average. The equivalent for max RSS was -276 (MB, I believe).
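That comparison can be computed by joining the per-rule benchmark tables and averaging the differences. The numbers below are hypothetical stand-ins chosen to reproduce the averages quoted above; the `s` (wall clock seconds) and `max_rss` column names follow Snakemake's benchmark file format:

```python
import pandas as pd

# Hypothetical per-rule measurements from two sets of benchmark TSVs.
before = pd.DataFrame({"rule": ["r1", "r2"], "s": [500.0, 700.0], "max_rss": [2000.0, 2400.0]})
after = pd.DataFrame({"rule": ["r1", "r2"], "s": [400.0, 576.0], "max_rss": [1800.0, 2048.0]})

# Join on the rule name so each invocation is compared with itself.
merged = before.merge(after, on="rule", suffixes=("_before", "_after"))
avg_wall_clock_diff = (merged["s_after"] - merged["s_before"]).mean()
avg_max_rss_diff = (merged["max_rss_after"] - merged["max_rss_before"]).mean()
```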
Looks pretty good overall! Several small comments and one significant behaviour change to consider.
#1252 changes the same parts of the code and is likely to be merged soon, so I'm going to rebase this PR on top of that.
Force-pushed 7f9e462 to 96e891c
Force-pushed 0e35137 to 0038a4f
Force-pushed changes:
Will mark this PR as draft until conversations above are addressed.
Force-pushed 0038a4f to b884236
Force-pushed 2a90aab to 7e81765
Force-pushed 64a5474 to 679519e
Force-pushed 679519e to 1a2dae9
Force-pushed 1a2dae9 to f0d0e47
Force-pushed 013fa01 to 3335625
Force-pushed ff98de7 to 5be2639
Force-pushed 3335625 to 0924b31
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## master #1294 +/- ##
==========================================
+ Coverage 66.73% 67.13% +0.39%
==========================================
Files 69 69
Lines 7321 7446 +125
Branches 1798 1831 +33
==========================================
+ Hits 4886 4999 +113
- Misses 2168 2176 +8
- Partials 267 271 +4
Force-pushed 0924b31 to 1d52bc5
Force-pushed 5be2639 to e13834d
This LGTM. I haven't gone over the new stuff with as fine-toothed a comb as my first pass, but I did skim through it.
Metadata is defined by its delimiter, columns, and rows. These properties are inferred and returned in read_metadata, but that function is pandas-focused. This commit is part of a PR that aims to move away from using pandas to represent and access metadata, at least in some parts of the code. Alternatives will need a way to access these properties, and this new class interface serves that purpose. It will be used in subsequent commits.
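A rough illustration of such an interface (a sketch of the idea, not the actual class added in this PR):

```python
import csv
import io

class Metadata:
    """Expose a metadata file's delimiter, columns, and rows without pandas."""

    def __init__(self, file, delimiters=(",", "\t")):
        self.file = file
        header = self.file.readline()
        # Infer the delimiter as whichever candidate splits the header most.
        self.delimiter = max(delimiters, key=header.count)
        self.columns = header.rstrip("\n").split(self.delimiter)

    def rows(self):
        """Yield each row as a dict, streaming instead of loading everything."""
        yield from csv.DictReader(self.file, fieldnames=self.columns,
                                  delimiter=self.delimiter)

m = Metadata(io.StringIO("strain\tdate\nA\t2021-01-01\n"))
```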
Previously, the output metadata and strain files were constructed as multi-stage exports of an in-memory DataFrame representation of the original metadata. This worked fine, except:
1. The order of rows was not guaranteed to follow that of the original metadata.
2. This relies on perfect round-tripping of file→DataFrame→file contents.
3. This relies on having all columns of the metadata available in the in-memory DataFrame representation.
(1) is not a big deal, but is addressed by this change. (2) can be prone to issues, though nothing blocking has been brought to our attention. (3) is the main motivator given recent ideas to optimize I/O operations in augur filter. With this change, there is no longer a need to have all columns of the metadata stored in memory, which opens the door to further optimizations.
Previously added¹ then removed² due to being insufficient. Adding it back with a new approach and explicit limitations. ¹ 2ead5b3 ² e8322a5 This will be used in a later commit. Co-authored-by: Thomas Sibley <[email protected]>
Converting queried columns should be much better than converting all columns: in most cases, only a small subset of metadata columns are used in the query. A stretch goal would be to convert only the columns used for numerical comparison. However, detecting which are used *as numeric* is a harder problem, and benefits are marginal for common queries that don't use many columns. Note that the column extraction function isn't perfect and it returns all or nothing. If nothing is returned, then fall back to the previous behavior of converting all columns.
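One way to implement such all-or-nothing extraction is to parse the query as a Python expression and collect the name nodes, bailing out whenever an unrecognized name or unparseable pandas-only syntax appears. This is a sketch of the idea, not augur's actual implementation:

```python
import ast

def extract_query_columns(query, known_columns):
    """Return the set of known columns referenced by a pandas-style query,
    or None (all or nothing) when extraction can't be trusted."""
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError:
        # Queries using pandas-only syntax (e.g. backticked names) won't parse.
        return None
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    columns = names & set(known_columns)
    # Any unresolved name might still be a column, so fall back to
    # converting all columns rather than guessing.
    if names - columns:
        return None
    return columns

simple = extract_query_columns('country == "China" and year > 2020',
                               ["country", "year", "month"])
```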
Force-pushed 1d52bc5 to 5d1c045
This serves as an "escape hatch" for when automatic type inference does not work as expected for whatever reason.
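An analogous escape hatch exists at the pandas level via explicit dtypes; the example below shows the kind of inference failure it guards against (illustrative data only):

```python
import io
import pandas as pd

tsv = "strain\tdate\tcluster\nA\t2021-01-01\t01\n"

# Automatic inference reads "01" as the integer 1, losing the leading zero.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t")

# An explicit dtype overrides inference and preserves the original text.
forced = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"cluster": "string"})
```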
The motivation here is that it's common for subcommands to use just a fraction of the columns. This should improve performance when the metadata file is large and most columns are unused. Subsequent commits will update subcommands to use this new feature.
Before these changes, all columns were read into memory even if they were not used for actual filtering. The only reason was that metadata-based outputs were created from the in-memory representation of metadata. An earlier commit switched to creating those outputs by doing another pass over the original metadata, which means there is no longer a need to have all columns in memory.
Before these changes, all columns were read into memory even though only a few are used. This reads in the minimum necessary columns, which brings performance improvements.
Before these changes, all columns were read into memory even though only the ID and date columns are used. This reads in just the two columns, which brings performance improvements.
Force-pushed 5d1c045 to b56f699
Description of proposed changes
Reading a subset of metadata columns is straightforward for most subcommands, but requires some big changes in `augur filter`:
- `--output-metadata` relied on an in-memory pandas DataFrame containing all columns. This PR re-implements that option and similar options using a custom Metadata class to write output metadata based on a streamed representation of the input file.
- `augur filter` requires knowing which columns are used by `augur filter --query`. Automatic detection is implemented here, but `--query` is generic enough that I'm not confident the detection will always work. In cases where it doesn't, an error is raised. `--query-columns` allows the user to specify what columns are used in `--query`, along with the expected data types.
See commits for details.
Related issues
This PR builds on ideas from discussion on Slack and experimentation in #1242.
Checklist
- `--query-columns`