Optimizing KMCP with HumGut #38
Hi Eric, thanks for your interest. Firstly, about the database: I'm wondering why you split it over 10 indices. How many reference genomes do you use? Previously, I created one database with 30,691 genomes, with a total size of 21.6 GB. That would fit into the memory of your HPC node (100 GB). I mean you could simply create one KMCP database and search one sample against it on an HPC node with all CPUs (64 cores), so the merging step would be unnecessary.
But according to #36, it seems you did not split the genome set.
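To illustrate that suggestion, here is a rough sketch of the single-database route for one sample, with no merging step. The database directory, sample names, and taxdump path are placeholders, and the flag names are recalled from the kmcp documentation rather than taken from this thread, so please verify them against `kmcp search --help` and `kmcp profile --help`.

```bash
DB=humgut.kmcp        # placeholder: one combined database built with kmcp compute + kmcp index
SAMPLE=sample         # placeholder sample prefix

# search the sample against the single database using all 64 cores
kmcp search -d $DB/ -j 64 -o ${SAMPLE}.kmcp.tsv.gz \
    --log ${SAMPLE}.kmcp.log \
    ${SAMPLE}_1.fq.gz ${SAMPLE}_2.fq.gz

# no kmcp merge needed; profile the single search result directly
kmcp profile -j 64 -X taxdump/ -T taxid.map \
    -o ${SAMPLE}.k.profile ${SAMPLE}.kmcp.tsv.gz
```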
To be honest, there's little room to significantly improve the speed of KMCP :(. I have to say, 10,000+ is a very large number of samples. Unless you have a large number of HPC nodes, that would take a long time with KMCP. Maybe you can use a portion of the samples for benchmarking purposes.
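For example, assuming a hypothetical samples.txt with one sample identifier per line, a random subset for benchmarking could be drawn with:

```bash
# keep 500 randomly chosen samples for the benchmark (file name and count are placeholders)
shuf -n 500 samples.txt > benchmark_samples.txt
```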
Oh? I followed the instructions in the second part of this to build the split database and did get multiple DBs:

# number of databases
N_DB=10
CURRENT=$SLURM_ARRAY_TASK_ID
# split -n r/$N_DB -d ${DB_NAME}.files.txt ${DB_NAME}.n$N_DB-
f=$(sed -n "${CURRENT}p" subsets.txt)
# for f in ${DB_NAME}.n$N_DB-*; do
echo $f;
kmcp compute -i $f -O $f-k21-n10 -k 21 -n 10 -l 150 -B plasmid \
--log $f-k21-n10.log --force -j 32
kmcp index -j 32 -I $f-k21-n10 -O $f.kmcp -n 1 -f 0.3 \
--log $f.kmcp.log --force
# cp taxid and name mapping file to database directory
cp taxid.map name.map $f.kmcp/
And I'm using a list of the … I did this because, from your results, it seemed to be a lot faster to work with a split database.
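For readers without the gist at hand: the snippet above is written to run inside a SLURM array job, one task per database chunk, selected via $SLURM_ARRAY_TASK_ID. A hedged sketch of what such a submission script could look like is below; the job name, resource numbers, and wrapper script are guesses, not the actual submission.

```bash
#!/bin/bash
#SBATCH --job-name=kmcp-build    # hypothetical job name
#SBATCH --array=1-10             # one array task per database chunk (N_DB=10)
#SBATCH --cpus-per-task=32
#SBATCH --mem=100G
#SBATCH --time=24:00:00

# build_chunk.sh is a hypothetical wrapper around the kmcp compute / kmcp index
# snippet above; it selects its genome list with:
#   f=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subsets.txt)
bash build_chunk.sh
```

Submitted once with `sbatch`, SLURM then runs the ten index builds as independent tasks, each with 32 CPUs.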
Unfortunate to hear it can't go faster yet, though it won't be much of an issue for most users (it's not every day you're running this many samples ;-) ).
Yes, for one sample. Not for multiple samples. You can also test with the single database. Using split databases is mainly for two scenarios:
I'm thrilled you used the pprof tool to analyze the code. Previously, I used it a lot to improve the performance.
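(A side note for anyone reading along: once a CPU profile has been written, for example with Go's runtime/pprof package, the standard Go tooling can reproduce the kind of report attached later in this thread. The binary path and profile file name below are just placeholders.)

```bash
# interactive view (flame graph, top functions) in the browser -- placeholder file names
go tool pprof -http=:8080 ./kmcp cpu.pprof

# or a quick text summary of the hottest functions
go tool pprof -top ./kmcp cpu.pprof
```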
Yes, parsing floats (fpr and qcov) is slow. I've tried some methods, but there's no significant improvement.
The input of the … But I think it's a good way to use a binary format for fast downstream parsing, accompanied by a …
Thanks, it's fixed. Please use the new binaries:
Pleasure to look at!
Ah yes, I'm focused on the profiling for my use case, but that's a valid point. Could potentially also be inferred from the outFile parameters of the
Thank you for the quick fix!
I checked the pprof output again. Besides float parsing, gzip reading and writing and column value splitting are also performance bottlenecks (HTML: [kmcp.cpu.zip](https://github.com/shenwei356/kmcp/files/12505199/kmcp.cpu.zip)). As for buffering search results in memory, I don't think it's feasible, because some gzip-compressed search result files are quite large (>10 GB) in my analysis, which would occupy a huge amount of memory even with compact data structures. Using serialized binary files should be the best way to enable quick downstream parsing.
Hello Shenwei,
Thank you for making this new metagenomic tool! I'm interested in benchmarking its performance, and for that I want to perform classification on a large number (10,000+) of samples. There are a number of minor things I've come across, but I have an end-to-end sample running now. My main objective for opening this issue is to discuss potential optimizations to what I'm doing, to reduce the total time for running this.
Currently, end to end, this single sample took 9 hours (08:58) with 32 cores and 23 GB of RAM (100 GB was the max):

- kmcp search, run through parallel with 12 threads per job, took 2 hours per search
- kmcp merge took 15 mins with 64 threads
- kmcp profile took 4h52 mins with 64 threads; adding the --no-amb-corr flag would reduce this by 1h53 mins

From this, I could potentially reduce the time to ~6 hours by allocating more cores, increasing the number of searches running in parallel, and adding the --no-amb-corr flag. Are there more ways in which I am missing optimizations?
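To make that concrete, below is a rough sketch of running the per-database searches concurrently and then merging and profiling with --no-amb-corr. The database names assume DB_NAME=HumGut from the split command above, the sample name is a placeholder, and apart from --no-amb-corr, -j and --log (which appear in this thread) the flag names are recalled from the kmcp documentation, so check them against the --help output.

```bash
SAMPLE=sample    # placeholder sample prefix

# launch one 12-thread search per split database as background jobs;
# note this oversubscribes a 64-core node if all ten run at once --
# in practice limit the concurrency (e.g. with GNU parallel) or use separate SLURM tasks
for db in HumGut.n10-*.kmcp/; do
    name=$(basename "$db")
    kmcp search -d "$db" -j 12 -o "${SAMPLE}.${name}.tsv.gz" \
        --log "${SAMPLE}.${name}.log" \
        "${SAMPLE}_1.fq.gz" "${SAMPLE}_2.fq.gz" &
done
wait    # block until all searches have finished

# combine the per-database results, then profile with ambiguity correction disabled
kmcp merge -j 64 -o "${SAMPLE}.kmcp.tsv.gz" "${SAMPLE}".HumGut.n10-*.kmcp.tsv.gz
kmcp profile -j 64 --no-amb-corr -X taxdump/ -T taxid.map \
    -o "${SAMPLE}.k.profile" "${SAMPLE}.kmcp.tsv.gz"
```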
I've created a gist with the actual SLURM submission; I can also upload the logs of the jobs if that's helpful.