
Read a subset of metadata columns #1294

Merged 10 commits from victorlin/improved-filter-io into master on Feb 8, 2024

Conversation


@victorlin victorlin (Member) commented Aug 25, 2023

Description of proposed changes

Reading a subset of metadata columns is straightforward for most subcommands, but requires some big changes in augur filter:

  • Previously, --output-metadata relied on an in-memory pandas DataFrame containing all columns. This PR re-implements that option and similar options using a custom Metadata class to write output metadata based on a streamed representation of the input file.
  • Reading a subset of columns for augur filter requires knowing which columns are used by augur filter --query. Automatic detection is implemented here, but --query is generic enough that I'm not confident the detection will always work. In cases where it doesn't, an error is raised.
  • A new option --query-columns allows the user to specify what columns are used in --query along with the expected data types.

See commits for details.
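As a rough illustration of the detection idea, bare names can be pulled out of a query expression with Python's ast module. This is a minimal sketch under the assumption that the query parses as a Python expression; augur's actual detection logic may differ, and the helper name is made up:

```python
import ast

def extract_potential_columns(query):
    """Collect bare names from a pandas-style query expression.

    Returns None if the expression cannot be parsed, signalling that
    the caller should fall back to reading all columns.
    """
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError:
        return None
    # Every bare Name node is a candidate column reference.
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

print(sorted(extract_potential_columns("region == 'Asia' & coverage > 0.95")))
# ['coverage', 'region']
```

Because --query is generic, an all-or-nothing result like this makes it easy to detect the failure case and raise an error (or fall back) rather than silently drop a needed column.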

Related issues

This PR builds on ideas from discussion on Slack and experimentation in #1242.

Checklist

  • Tests modified for changes in behavior
  • Tests added for --query-columns
  • Checks pass
  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.
  • Try running on large datasets (thread)

@victorlin victorlin self-assigned this Aug 25, 2023
@victorlin victorlin requested a review from a team August 25, 2023 18:33
@victorlin victorlin marked this pull request as ready for review August 25, 2023 18:33
@victorlin victorlin mentioned this pull request Aug 25, 2023
victorlin (Member Author) commented:

Real-world testing

I ran this against the metadata file produced by ncov-ingest (s3://nextstrain-data/files/ncov/open/metadata.tsv.zst), which has 8.5 million rows × 58 columns. I used the following command to sample 10 random sequences:

augur filter \
  --metadata metadata.tsv \
  --subsample-max-sequences 10 \
  --output-metadata out.tsv

This took 8m21s to run on master and 6m21s with the changes from this PR. Here are profiling results from before and after, which I visualized in Snakeviz. A summary:

  • Time spent accessing in-memory DataFrames dropped from 494s to 258s.
  • Without these changes, to_csv takes just a fraction of a second because the metadata for the 10 sequences is already loaded into memory.
  • With these changes, it takes 109s to run through the metadata file to find the lines for the 10 sequences that are wanted.

The example command sees a net improvement in run time: although writing time increased due to 5173cb7, reading time decreased by more due to ac23e80.

This was a "best case scenario" for these changes, though, since no metadata columns were used, only the strain name. I should probably test with --group-by, --min-date, and other options that load additional columns to get a better picture.

I did not do any memory profiling. Memory usage is not an issue without these changes, and should be less of an issue with the changes.

Member commented:

Did you get round to doing more testing/profiling with --group-by and --min-date and what not?

victorlin (Member Author) commented:

No, not yet. Still planning to do so before merging.

@victorlin victorlin (Member Author) commented Feb 1, 2024

Tested using the ncov 100k subsample as input to an augur filter command I grabbed from an ncov build. Run time was 16s on master and 7.37s with these changes (cProfile files).

Summary:

  • Metadata reading dropped from 4.3s to 2.15s (pandas readers.py:read)
  • Accessing in-memory DataFrames dropped from 8.63s to 1.127s (pandas indexing.py)
  • Output writing increased from ~0s to 1.35s (write_metadata_based_outputs)
augur filter \
    --metadata ~/tmp/augur-filter/metadata.tsv.xz \
    --include defaults/include.txt \
    --exclude defaults/exclude.txt \
    --max-date 6M \
    --exclude-where 'region!=Asia' country=China country=India \
    --group-by country year month \
    --subsample-max-sequences 200 \
    --output-strains ~/tmp/augur-filter/strains.txt

@victorlin victorlin (Member Author) commented Feb 1, 2024

I triggered an ncov GISAID trial run using a Docker image that includes these changes; it completed successfully in 6h 36m 55s. This is pretty much the same as another trial run 2 days before, at 6h 40m 27s. I don't know how much variance there is between run times, and I don't want to compare against non-trial runs or older runs (those have additional Slack notifications and different input metadata sizes). So by this comparison alone, there doesn't seem to be a significant performance benefit for the ncov workflow with the GISAID configuration.

Member commented:

Hmm.

ISTM that the last time I looked at ncov's execution profile, by far the slowest step was TreeTime, so it's not altogether surprising that filter's speed doesn't have a big impact in the context of a full build.

I grabbed the benchmarks/subsample_* files from those runs to get a little more granular insight into differences in wall clock time and max RSS for each subsample rule invocation.

avg(after - before) for wall clock time was -112s, so it shaved roughly 2 min off each subsample step on average. Equivalent for max RSS is -276 (MB, I believe).
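The avg(after − before) comparison can be reproduced with a small pandas sketch. Snakemake benchmark files are TSVs with an "s" (wall clock seconds) and "max_rss" (MB) column per rule invocation; the numbers below are made up to match the reported averages, not the actual benchmark data:

```python
import pandas as pd

# Stand-ins for the benchmarks/subsample_* files from the two runs
# (hypothetical values chosen to illustrate the calculation).
before = pd.DataFrame({"s": [300.0, 420.0], "max_rss": [900.0, 1100.0]})
after = pd.DataFrame({"s": [200.0, 296.0], "max_rss": [650.0, 798.0]})

# Row-wise difference, then the mean over all subsample invocations.
diff = (after - before).mean()
print(diff["s"], diff["max_rss"])  # -112.0 -276.0
```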

Review thread: augur/io/metadata.py (resolved)
@tsibley tsibley (Member) left a comment:

Looks pretty good overall! Several small comments and one significant behaviour change to consider.

Review threads (resolved) on: augur/io/metadata.py, augur/filter/_run.py, augur/filter/include_exclude_rules.py, augur/filter/__init__.py, augur/filter/io.py, CHANGES.md
@victorlin (Member Author) commented:

#1252 changes the same parts of the code and is likely to be merged soon, so I'm going to rebase this PR on top of that.

@victorlin victorlin changed the base branch from master to victorlin/read-metadata-dtypes January 19, 2024 23:45
@victorlin victorlin force-pushed the victorlin/improved-filter-io branch 3 times, most recently from 0e35137 to 0038a4f Compare January 22, 2024 19:56
@victorlin (Member Author) commented:

Force-pushed changes:

Will mark this PR as draft until conversations above are addressed.

@victorlin victorlin marked this pull request as draft January 22, 2024 19:56
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a90aab to 7e81765 Compare January 24, 2024 23:40
Base automatically changed from victorlin/read-metadata-dtypes to master January 24, 2024 23:43
@victorlin victorlin force-pushed the victorlin/improved-filter-io branch 2 times, most recently from 64a5474 to 679519e Compare January 31, 2024 19:48
@victorlin victorlin mentioned this pull request Jan 31, 2024
@victorlin victorlin changed the base branch from master to victorlin/miscellaneous January 31, 2024 19:52
@victorlin victorlin changed the title filter: Improve I/O operations and query handling Improve metadata I/O operations and filter query handling Feb 1, 2024
@victorlin victorlin marked this pull request as ready for review February 1, 2024 20:40
@victorlin victorlin force-pushed the victorlin/improved-filter-io branch 2 times, most recently from 013fa01 to 3335625 Compare February 3, 2024 00:45

codecov bot commented Feb 3, 2024

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (9b31ad8) 66.73% compared to head (b56f699) 67.13%.

Files Patch % Lines
augur/filter/include_exclude_rules.py 78.37% 5 Missing and 3 partials ⚠️
augur/filter/io.py 97.05% 1 Missing and 1 partial ⚠️
augur/io/metadata.py 95.55% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1294      +/-   ##
==========================================
+ Coverage   66.73%   67.13%   +0.39%     
==========================================
  Files          69       69              
  Lines        7321     7446     +125     
  Branches     1798     1831      +33     
==========================================
+ Hits         4886     4999     +113     
- Misses       2168     2176       +8     
- Partials      267      271       +4     


@victorlin victorlin changed the title Improve metadata I/O operations and filter query handling Read a subset of metadata columns Feb 6, 2024
Base automatically changed from victorlin/miscellaneous to master February 7, 2024 23:32
@tsibley tsibley (Member) left a comment:

This LGTM. I haven't gone over the new stuff with as fine-toothed a comb as on my first pass, but I did skim through it.

victorlin and others added 4 commits February 7, 2024 17:43
Metadata is defined by its delimiter, columns, and rows. These
properties are inferred and returned in read_metadata, but that function
is pandas-focused. This commit is part of a PR that aims to move away
from using pandas to represent and access metadata at least in some
parts of the code. Alternatives will need a way to access these
properties, and this new class interface serves that purpose.

It will be used in subsequent commits.
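The class interface described in this commit message could look roughly like the following. This is a minimal sketch of the idea, not the actual class in augur/io/metadata.py, whose naming and behavior may differ:

```python
import csv
from typing import Iterator, List

class Metadata:
    """Lightweight, pandas-free view of a delimited metadata file."""

    def __init__(self, path: str, delimiter: str = "\t"):
        self.path = path
        self.delimiter = delimiter
        # Only the header line is read up front; rows stay on disk.
        with open(path, newline="") as f:
            self.columns: List[str] = next(csv.reader(f, delimiter=delimiter))

    def rows(self) -> Iterator[dict]:
        # Stream rows one at a time instead of loading the whole file.
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f, delimiter=self.delimiter)
```

Exposing delimiter, columns, and a row iterator as properties lets callers inspect metadata without committing to a DataFrame representation.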
Previously, the output metadata and strain files were constructed as
multi-stage exports of an in-memory DataFrame representation of the
original metadata. This worked fine, except:

1. The order of rows was not guaranteed to follow that of the original
   metadata.
2. This relies on perfect round-tripping of file→DataFrame→file
   contents.
3. This relies on having all columns of the metadata available in the
   in-memory DataFrame representation.

(1) is not a big deal, but is addressed by this change.
(2) can be prone to issues, though nothing blocking has been brought to
our attention.
(3) is the main motivator given recent ideas to optimize I/O operations
in augur filter. With this change, there is no longer a need to have all
columns of the metadata stored in memory, which opens the door to
further optimizations.
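The streamed approach described above can be sketched as a second pass over the original file that copies only the kept rows. The function name and signature here are hypothetical, illustrating the idea rather than augur's actual write_metadata_based_outputs:

```python
import csv

def write_filtered_metadata(input_path, output_path, keep_ids,
                            id_column="strain", delimiter="\t"):
    """Stream the input metadata and write only the kept rows.

    Rows come out in input order and all columns round-trip untouched,
    since lines are re-emitted from the file rather than re-exported
    from an in-memory DataFrame.
    """
    with open(input_path, newline="") as infile, \
         open(output_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile, delimiter=delimiter)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames,
                                delimiter=delimiter, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if row[id_column] in keep_ids:
                writer.writerow(row)
```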
Previously added¹ then removed² due to being insufficient. Adding it
back with a new approach and explicit limitations.

¹ 2ead5b3
² e8322a5

This will be used in a later commit.

Co-authored-by: Thomas Sibley <[email protected]>
Converting queried columns should be much better than converting
all columns: in most cases, only a small subset of metadata
columns are used in the query.

A stretch goal would be to convert only the columns used for numerical
comparison. However, detecting which are used *as numeric* is a harder
problem, and benefits are marginal for common queries that don't use
many columns.

Note that the column extraction function isn't perfect and it returns
all or nothing. If nothing is returned, then fall back to the previous
behavior of converting all columns.
This serves as an "escape hatch" for when automatic type inference does
not work as expected for whatever reason.
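A "column:type" style spec for such an option could be parsed along these lines. Both the helper and the type vocabulary here are assumptions for illustration; the real --query-columns syntax may differ:

```python
def parse_query_column(spec):
    """Parse a hypothetical 'column:type' spec into (name, type)."""
    name, _, dtype = spec.rpartition(":")
    if not name or dtype not in {"str", "numerical"}:
        raise ValueError(f"Invalid column spec: {spec!r}")
    return name, dtype

print(parse_query_column("coverage:numerical"))  # ('coverage', 'numerical')
```

Using rpartition keeps column names containing colons working, since only the final colon separates the type.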
The motivation here is that it's common for subcommands to use just a
fraction of the columns. This should have performance improvements when
the metadata file is large and most columns are unused.

Subsequent commits will update subcommands to use this new feature.
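With a pandas-based reader, reading just a fraction of the columns amounts to passing usecols, so unused columns are never materialized in memory. A small illustration with made-up column names:

```python
import io
import pandas as pd

metadata = io.StringIO(
    "strain\tdate\tregion\tcoverage\n"
    "A\t2020-01-01\tAsia\t0.99\n"
)

# Only the requested columns are parsed and kept in memory.
df = pd.read_csv(metadata, sep="\t", usecols=["strain", "date"])
print(list(df.columns))  # ['strain', 'date']
```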
Before these changes, all columns were read into memory even if they
were not used for actual filtering. The only reason was because
metadata-based outputs were created from the in-memory representation of
metadata. An earlier commit switched to creating those outputs by doing
another pass of the original metadata. That means there is no longer a
need to have all columns in memory.
Before these changes, all columns were read into memory even though only
a few are used. This reads in the minimum necessary columns, which
brings performance improvements.
Before these changes, all columns were read into memory even though only
the ID and date columns are used. This reads in just the two columns,
which brings performance improvements.
@victorlin victorlin merged commit 8678ae9 into master Feb 8, 2024
20 checks passed
@victorlin victorlin deleted the victorlin/improved-filter-io branch February 8, 2024 02:57