Optimizing KMCP with HumGut #38
Hi Eric, thanks for your interest. Firstly, about the database: I'm wondering why you split it over 10 indices. How many reference genomes do you use? Previously, I created one database with 30,691 genomes, with a total size of 21.6 GB. That would fit into the memory of your HPC node (100 GB). I mean you could simply create one KMCP database and search one sample against it on an HPC node with all CPUs (64 cores), so the merging step would be unnecessary.
But according to #36, it seems you did not split the genome set.
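To illustrate that suggestion, here is a rough sketch of the single-database route for one sample, with no merging step. The database directory, sample names, and taxdump path are placeholders, and the flag names are recalled from the kmcp documentation rather than taken from this thread, so please verify them against `kmcp search --help` and `kmcp profile --help`.

```bash
DB=humgut.kmcp        # placeholder: one combined database built with kmcp compute + kmcp index
SAMPLE=sample         # placeholder sample prefix

# search the sample against the single database using all 64 cores
kmcp search -d $DB/ -j 64 -o ${SAMPLE}.kmcp.tsv.gz \
    --log ${SAMPLE}.kmcp.log \
    ${SAMPLE}_1.fq.gz ${SAMPLE}_2.fq.gz

# no kmcp merge needed; profile the single search result directly
kmcp profile -j 64 -X taxdump/ -T taxid.map \
    -o ${SAMPLE}.k.profile ${SAMPLE}.kmcp.tsv.gz
```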
To be honest, there's little room to significantly improve the speed of KMCP :(. I have to say, 10,000+ is a very large number of samples. Unless you have a large number of HPC nodes, that would take a long time with KMCP. Maybe you can use a portion of the samples for benchmarking purposes.
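For example, assuming a hypothetical samples.txt with one sample identifier per line, a random subset for benchmarking could be drawn with:

```bash
# keep 500 randomly chosen samples for the benchmark (file name and count are placeholders)
shuf -n 500 samples.txt > benchmark_samples.txt
```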
Oh? I followed the instructions in the second part of this to build the split database and did get multiple DBs:

# number of databases
N_DB=10
CURRENT=$SLURM_ARRAY_TASK_ID
# split -n r/$N_DB -d ${DB_NAME}.files.txt ${DB_NAME}.n$N_DB-
f=$(sed -n "${CURRENT}p" subsets.txt)
# for f in ${DB_NAME}.n$N_DB-*; do
echo $f;
kmcp compute -i $f -O $f-k21-n10 -k 21 -n 10 -l 150 -B plasmid \
--log $f-k21-n10.log --force -j 32
kmcp index -j 32 -I $f-k21-n10 -O $f.kmcp -n 1 -f 0.3 \
--log $f.kmcp.log --force
# cp taxid and name mapping file to database directory
cp taxid.map name.map $f.kmcp/
And I'm using a list of the … I did this because, from your results, it seemed to be a lot faster to work with a split database.
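For readers without the gist at hand: the snippet above is written to run inside a SLURM array job, one task per database chunk, selected via $SLURM_ARRAY_TASK_ID. A hedged sketch of what such a submission script could look like is below; the job name, resource numbers, and wrapper script are guesses, not the actual submission.

```bash
#!/bin/bash
#SBATCH --job-name=kmcp-build    # hypothetical job name
#SBATCH --array=1-10             # one array task per database chunk (N_DB=10)
#SBATCH --cpus-per-task=32
#SBATCH --mem=100G
#SBATCH --time=24:00:00

# build_chunk.sh is a hypothetical wrapper around the kmcp compute / kmcp index
# snippet above; it selects its genome list with:
#   f=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subsets.txt)
bash build_chunk.sh
```

Submitted once with `sbatch`, SLURM then runs the ten index builds as independent tasks, each with 32 CPUs.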
Unfortunate to hear it can't go faster yet, though it won't be much of an issue for most users (it's not every day you're running this many samples ;-) ).
Yes, for one sample. Not for multiple samples. You can also test with the single database. Using split databases is mainly for two scenarios:
I'm thrilled you used the pprof tool to analyze the code. Previously, I used it a lot to improve the performance.
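(A side note for anyone reading along: once a CPU profile has been written, for example with Go's runtime/pprof package, the standard Go tooling can reproduce the kind of report attached later in this thread. The binary path and profile file name below are just placeholders.)

```bash
# interactive view (flame graph, top functions) in the browser -- placeholder file names
go tool pprof -http=:8080 ./kmcp cpu.pprof

# or a quick text summary of the hottest functions
go tool pprof -top ./kmcp cpu.pprof
```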
Yes, parsing floats (fpr and qcov) is slow. I've tried some methods, but there's no significant improvement.
The input of the … But I think it's a good way to use a binary format for fast downstream parsing, accompanied by a …
Thanks, it's fixed. Please use the new binaries:
Pleasure to look at!
Ah yes, I'm focused on the profiling for my use case, but that's a valid point. Could potentially also be inferred from the outFile parameters of the
Thank you for the quick fix!
I checked the pprof output again. Besides float parsing, gzip reading and writing and column value splitting are also performance bottlenecks (HTML: [kmcp.cpu.zip](https://github.com/shenwei356/kmcp/files/12505199/kmcp.cpu.zip)). As for buffering search results in memory, I don't think it's feasible, because some gzip-compressed search result files are quite large (>10 GB) in my analysis, which would occupy a huge amount of memory even with compact data structures. Using serialized binary files should be the best way to enable quick downstream parsing.
Hello Shenwei,
Thank you for making this new metagenomic tool! I'm interested in benchmarking its performance, and for that I want to perform classification on a large number (10,000+) of samples. There are a number of minor things I've come across, but I have an end-to-end sample running now. My main objective for opening this issue is to discuss potential optimizations to what I'm doing, to reduce the total time for running this.
Currently, end to end, this single sample took 9 hours (08:58) with 32 cores and 23 GB of RAM (100 GB was the max):

- kmcp search, run through parallel with 12 threads per job, took 2 hours per search
- kmcp merge took 15 mins with 64 threads
- kmcp profile took 4h52 mins with 64 threads; adding the --no-amb-corr flag would reduce this by 1h53 mins

From this, I could potentially reduce the time to ~6 hours by allocating more cores, increasing the number of searches running in parallel, and adding the --no-amb-corr flag. Are there more ways in which I am missing optimizations?
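To make that concrete, below is a rough sketch of running the per-database searches concurrently and then merging and profiling with --no-amb-corr. The database names assume DB_NAME=HumGut from the split command above, the sample name is a placeholder, and apart from --no-amb-corr, -j and --log (which appear in this thread) the flag names are recalled from the kmcp documentation, so check them against the --help output.

```bash
SAMPLE=sample    # placeholder sample prefix

# launch one 12-thread search per split database as background jobs;
# note this oversubscribes a 64-core node if all ten run at once --
# in practice limit the concurrency (e.g. with GNU parallel) or use separate SLURM tasks
for db in HumGut.n10-*.kmcp/; do
    name=$(basename "$db")
    kmcp search -d "$db" -j 12 -o "${SAMPLE}.${name}.tsv.gz" \
        --log "${SAMPLE}.${name}.log" \
        "${SAMPLE}_1.fq.gz" "${SAMPLE}_2.fq.gz" &
done
wait    # block until all searches have finished

# combine the per-database results, then profile with ambiguity correction disabled
kmcp merge -j 64 -o "${SAMPLE}.kmcp.tsv.gz" "${SAMPLE}".HumGut.n10-*.kmcp.tsv.gz
kmcp profile -j 64 --no-amb-corr -X taxdump/ -T taxid.map \
    -o "${SAMPLE}.k.profile" "${SAMPLE}.kmcp.tsv.gz"
```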
I've created a gist with the actual SLURM submission; I can also upload the logs of the jobs if that's helpful.