Vcfdist requires too much memory when using VCF with SVs included #34
Comments
Thanks for starting a discussion on this. At the moment, I would not recommend running The reason for this is that unlike other tools, vcfdist performs full alignment of all variants, and can guarantee that equivalent variants will always be found. With a 100Kb variant, vcfdist will have to do ~300Kb by ~300Kb alignment (since the variant can shift in repetitive regions). The alignment algorithm used in vcfdist v2 uses O(n*n) memory. With 4-byte integers, that's a 360GB dynamic programming matrix. I'm currently working on vcfdist v3, which will use a graph wavefront alignment algorithm that requires O(n*s) memory, where I apologize for the inconvenience, but vcfdist won't be able to support variants that large for a while.
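The 360GB figure above can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate described in the comment, not vcfdist code: the function name and the `flank_factor` parameter are hypothetical, and the 3x factor simply encodes the "~300Kb for a 100Kb variant" statement.

```python
def dp_matrix_bytes(variant_len_bp, flank_factor=3, bytes_per_cell=4):
    """Estimate the size of a full O(n*n) dynamic programming matrix.

    Assumption: the aligned region is roughly flank_factor times the variant
    length (a 100Kb variant leads to a ~300Kb by ~300Kb alignment).
    """
    n = flank_factor * variant_len_bp
    return n * n * bytes_per_cell


# Memory grows quadratically with variant length.
for size in (1_000, 10_000, 100_000):
    gb = dp_matrix_bytes(size) / 1e9
    print(f"{size:>7} bp variant -> {gb:,.2f} GB")
```

Because the cost is quadratic, a 10x larger variant needs about 100x more memory, which is why SNPs and small INDELs run comfortably while large SVs do not.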
Hi @TimD1, thanks for the reply. I tried running the same file against itself with
vcfdist was basically designed to achieve better accuracy at the expense of increased memory/computation. The alignment is working correctly, it just requires a lot of memory. I would try things in the following order:
Note that cheaper clustering algorithms will impact the results of SV benchmarking more than SNPs or INDELs (because equivalent SVs can generally be located farther apart on the reference). Supplementary Table 3 of our manuscript has a comparison of a few of these clustering options. The wiki has more info on vcfdist's clustering parameters as well.
Hi Tim, I've tried using your three suggestions, and still end up running out of memory even at 64GB. Here's an excerpt of the logs before the process gets killed:
I'm wondering if similar to the
I did a bit more digging into this and made some observations. I tried splitting by chromosome, and found that some chromosomes required much more memory than others, to the point of some shards failing while others succeeded. The common thread was the
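For anyone wanting to repeat the per-chromosome experiment, a minimal sketch of the sharding step is shown here. It is not the commenter's actual method (which is not given in the thread); the function name is hypothetical, it works on already-decompressed VCF lines, and it uses only the standard library.

```python
from collections import defaultdict


def split_by_chrom(vcf_lines):
    """Group VCF lines into one shard per chromosome.

    Every shard receives a copy of the full header so that it remains a
    valid VCF on its own. Returns a dict mapping CHROM -> list of lines.
    """
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            # CHROM is the first tab-separated column of a record line.
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + recs for chrom, recs in records.items()}
```

Each shard can then be written to its own file and benchmarked separately, which makes it easier to see which chromosomes drive the memory usage.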
I am also unable to run vcfdist with SVs, running into all the issues described above.
Just wanted to give an update to say I'm actively working on this. I'm currently making the clustering process more efficient for larger variants, and will then shift the precision/recall alignment/calculation from Dijkstra-based to graph WFA-based. I'm also adding in a
Hi, I've been trying to use vcfdist to compare a SNP + SV callset against a truth set containing both, and it seems to require a prohibitive amount of memory when the SVs are included (it runs relatively quickly when subsetting to just the SNPs + small INDELs). Here is the full command used:
In this example I took a NIST-Q100 fully phased truth set here (e.g. file GRCh38_HG2-T2TQ100-V1.1.vcf.gz) and ran it against itself as both truth and query VCF. With 64GB of RAM and 8 CPUs it crashes, and with 256GB or so and many cores it runs for a prohibitively long time. It seems like, given how fast it is with the SNPs, there might be something unoptimized when comparing SVs, either in the algorithm or the implementation. I wonder if running a profiler could help show whether there's a memory leak or an unreasonable combinatorial explosion in this case that can be managed better.
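The SNP + small INDEL subset mentioned above could be produced along these lines. This is a rough sketch, not the command actually used in this report: the function names are hypothetical, and the 50bp cutoff is an assumed threshold (a common convention for separating small variants from SVs), not a value taken from this thread.

```python
def is_small_variant(vcf_line, max_len=50):
    """Return True if REF and every ALT allele are at most max_len bases.

    Symbolic alleles (e.g. <DEL>) and breakend notation are treated as SVs,
    since their length cannot be read from the allele string.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    for allele in [ref] + alts:
        if allele.startswith("<") or "[" in allele or "]" in allele:
            return False
        if len(allele) > max_len:
            return False
    return True


def filter_small(vcf_lines, max_len=50):
    """Keep header lines and records that pass is_small_variant."""
    return [
        line for line in vcf_lines
        if line.startswith("#") or is_small_variant(line, max_len)
    ]
```

Comparing runtime and memory on the filtered file versus the full file gives a quick way to confirm that the large variants, rather than the variant count, are responsible for the blow-up.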
Here are the full logs for the command: