Vcfdist requires too much memory when using VCF with SVs included #34

rickymagner · 2024-11-01T16:10:53Z

Hi, I've been trying to use Vcfdist to compare a SNP + SV callset against truth with both, and it seems to require a prohibitive amount of memory when the SVs are included (it runs relatively quickly with just subsetting to the SNPs + small INDELs). Here is the full command used:

vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6

In this example I took a NIST-Q100 fully phased truth set here (e.g. file GRCh38_HG2-T2TQ100-V1.1.vcf.gz) and ran it against itself as both truth and query VCF. With 64GB of RAM and 8 cpus it crashes, and running with 256GB or so with many cores runs for a prohibitively long time.

It seems like given how fast it is with the SNPs that there might be something unoptimized when comparing SVs, either in the algorithm or the implementation. I wonder if running a profiler can help see if there's a memory leak or an unreasonable combinatorial explosion in this case that can be managed better.

Here is the full logs for the command:

[INFO vcfdist 22:27:29] Command: 'vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6'
[INFO vcfdist 22:27:29]
[INFO vcfdist 22:27:29] [0/8] Loading reference FASTA '/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta'
[INFO vcfdist 22:27:45]
[INFO vcfdist 22:27:45] [Q 0/8] Parsing QUERY VCF 'comp-svs.vcf.gz'
[WARN vcfdist 22:27:45] 'PS' tag not defined in QUERY VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in QUERY VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in QUERY VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] QUERY VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [T 0/8] Parsing TRUTH VCF '/cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz'
[WARN vcfdist 22:27:46] 'PS' tag not defined in TRUTH VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in TRUTH VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in TRUTH VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] TRUTH VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Checking contigs:
[INFO vcfdist 22:27:46] All contig checks passed!
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [Q 3/8] Wavefront clustering QUERY VCF 'comp-svs.vcf.gz'
/cromwell_root/script: line 46: 17 Killed vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6

The text was updated successfully, but these errors were encountered:

TimD1 · 2024-11-02T12:48:45Z

Thanks for starting a discussion on this. At the moment, I would not recommend running vcfdist on datasets with variants larger than 10Kb.

The reason for this is that unlike other tools, vcfdist performs full alignment of all variants, and can guarantee that equivalent variants will always be found. With a 100Kb variant, vcfdist will have to do ~300Kb by ~300Kb alignment (since the variant can shift in repetitive regions). The alignment algorithm used in vcfdist v2 uses O(n*n) memory. With 4-byte integers, that's a 360GB dynamic programming matrix.

I'm currently working on vcfdist v3, which will use a graph wavefront alignment algorithm that requires O(n*s) memory, where s is the alignment score. This will scale better, but will still not be a magic bullet. I will likely have to limit the maximum alignment score, and provide special behavior when this limit is hit (which should only occur for FNs). I'll have to mark the truth variant that caused the largest score increase as a FN, and try aligning again without it.

I apologize for the inconvenience, but vcfdist won't be able to support variants that large for a while.

rickymagner · 2024-11-04T20:54:23Z

Hi @TimD1, thanks for the reply. I tried running the same file against itself with --largest-variant capped at 10,000 but I still ran into a memory error. Is it possible there is something else causing excessive memory use, or even if the flag isn't properly implemented to filter out longer variants? I'm still trying to debug a bit with the file but thought I'd ask to see if you have some parameters you could use to get the above command to complete.

TimD1 · 2024-11-05T01:41:28Z

vcfdist was basically designed to achieve better accuracy at the expense of increased memory/computation. The alignment is working correctly, it just requires a lot of memory.

I would try things in the following order:

Make sure your benchmarking BED includes only the regions you're interested in. If you need to use a particular BED, then there's nothing you can do here, but it's more efficient to filter out variants by region before evaluation than after.
Reduce the maximum variant size with --largest-variant. You've already gone down from 100Kb to 10Kb, so you probably won't want to do this further. Just note that vcfdist memory usage and runtime is O(n*n) in regards to this parameter, so running with --largest-variant 1000 will be at least 100x faster and use 100x less memory, so you can run this very quickly to get some initial results.
The default clustering algorithm is --cluster biwfa, which dynamically calculates the left and rightmost reaches of variants (which quickly gets expensive for SVs). This means that a 10Kb variant in a highly repetitive region may be able to shift left or right 50Kb and lead to an 100Kb by 100Kb alignment. Note that all changes from the default clustering may result in undetected equivalent variants between truth and query VCFs (e.g. false FPs), but this should happen at a very low rate (and only to equivalent variants that are located far away from one another on the reference) with large runtime/memory usage improvements. This trade-off is most likely worth it.
1. My first suggestion for changing clustering is to use --cluster biwfa --max-iterations 1 (which doesn't allow re-calculation of the reaches if e.g. the reaches of two 10Kb variants overlap). This should impact your results very minimally.
2. If that's not enough, my second suggestion is to use --cluster size 100, which will use a simpler clustering algorithm. Basically, the evaluation will assume that query variants are within 100 reference bases or the size of the variant (whichever is larger) from its corresponding counterpart in the truth VCF.
3. The last option is to use --cluster gap 100, which basically assumes that evaluated variants are within 100 bases of corresponding variants in the truth VCF.

Note that cheaper clustering algorithms will impact the results of SV benchmarking more than SNPs or INDELs (because equivalent SVs can generally be located farther apart on the reference. Supplementary Table 3 of our manuscript has a comparison of a few of these clustering options. The wiki has more info on vcfdist's clustering parameters as well.

rickymagner · 2024-11-08T05:48:27Z

Hi Tim, I've tried using your three suggestions, and still end up running out of memory even at 64GB. Here's an excerpt of the logs before the process gets killed:

...
[INFO vcfdist 19:29:41]
[INFO vcfdist 19:29:41] [4/8] Superclustering TRUTH and QUERY variants
[INFO vcfdist 19:29:42] Total superclusters: 2663733
[INFO vcfdist 19:29:42] Largest supercluster (bases): 30616
[INFO vcfdist 19:29:42] Largest supercluster (vars): 2182
[INFO vcfdist 19:29:42] Average supercluster (bases): 74.261
[INFO vcfdist 19:29:42] Average supercluster (vars): 4.758
[WARN vcfdist 19:29:42] 52 total superclusters contain variants from multiple phase sets (PS)
[INFO vcfdist 19:29:42] QUERY phase sets: 2217
[INFO vcfdist 19:29:42] QUERY phase block NG50: 2503804
[INFO vcfdist 19:29:42] TRUTH phase sets: 24
[INFO vcfdist 19:29:42] TRUTH phase block NG50: 156040895
[INFO vcfdist 19:29:42]
[INFO vcfdist 19:29:42] Sorting superclusters by size
[INFO vcfdist 19:29:42]
[INFO vcfdist 19:29:42] [5/8] Calculating precision and recall
[INFO vcfdist 19:29:42] Superclusters using 0.000 to 5.250 GB RAM each ( 12 threads): 2663544
[INFO vcfdist 19:29:42] Superclusters using 5.250 to 10.500 GB RAM each ( 6 threads): 150
[INFO vcfdist 19:29:42] Superclusters using 10.500 to 21.000 GB RAM each ( 3 threads): 36
[INFO vcfdist 19:29:42] Superclusters using 21.000 to 63.000 GB RAM each ( 1 threads): 3

I'm wondering if similar to the --max-variant flag there could be a --max-supercluster-size parameter that would essentially skip these 3 superclusters grabbing for way too much memory, assuming that's really the issue. I'm ok throwing away a small handful of variants just to be able to finish the compute. Do you think something like that might help or just a red herring? In that case it would be nice to still have a log of which variants/superclusters were skipped somewhere.

rickymagner · 2024-11-14T01:49:43Z

I did a bit more digging into this and made some observations. I tried splitting by chromosome, and found some required much more memory than others, to the point of some shards failing while other succeed. The common thread was the Largest supercluster (vars) were higher for the failed shards, and lower for the successful ones. This seems to suggest that this is the limiting factor, and a flag like --max-supercluster-size with the option to write to something like a skipped_sites.vcf.gz might "solve" this issue entirely.

kcleal · 2024-12-12T10:25:40Z

I am also unable to run vcfdist with SVs, running into all the issues described above.

TimD1 · 2024-12-13T19:05:41Z

Just wanted to give an update to say I'm actively working on this. I'm currently making the clustering process more efficient for larger variants, and will then shift the precision/recall alignment/calculation from Dijkstra-based to graph WFA-based. I'm also adding in a --max-dist flag that will terminate large FN alignments early (which are currently causing the computation/memory usage to explode).

kcleal · 2025-02-05T11:30:25Z

Hi @TimD1,
Just wondering if you are still working on this?

JosephLalli · 2025-02-12T17:39:46Z

Also want to reup this. I love this tool, but I am finding that when working with T2T variant calls, the algorithm as written is choking on some of the more complex regions. I am benchmarking 1KGP calls against their pangenomic vcf representation. Roughly half of the samples I'm benchmarking are fine when working with chr22, but the other half encounter issues like below:

--- 1.5 hours and 250GB of normalizing truth ---
[INFO vcfdist 17:18:36] Sorting superclusters by size
[WARN vcfdist 17:18:36] Max (64.000GB) RAM exceeded (603.267GB req) for supercluster chr22:6731977-6905653, running anyways
[WARN vcfdist 17:18:36] Max (64.000GB) RAM exceeded (1584.086GB req) for supercluster chr22:8489971-8776590, running anyways
[WARN vcfdist 17:18:36] Max (64.000GB) RAM exceeded (1095.820GB req) for supercluster chr22:9949928-10185723, running anyways

And then, after another few hours, vcfdist silently crashes.

I agree that it would be helpful to be able to set a hard cap on supercluster size, beyond which we can report the region as a problem to align. Maybe report as a bedfile, so that these regions can be excluded in other runs.

At an absolute minimum, I would love to optionally be able to exclude these superclusters from analysis.

Is this an easy to implement feature @TimD1 ?

Thanks,
Joe

TimD1 · 2025-02-14T19:35:56Z

Hey, thanks for both bringing this up, @JosephLalli @kcleal !

Previously, I had decided to focus my efforts on the next major release, which should both be more flexible and greatly reduce memory usage, to the point that a max supercluster size may no longer be needed. I've made substantial progress on this, but I'm busy and development has been slower than I initially expected.

In hindsight, I should have first added a --max-supercluster-size into the current vcfdist release. I think the best implementation is for superclusters larger than this threshold to be broken into two (or more) superclusters in an intelligent manner (e.g. at the largest gaps). This will break guarantees about discovering all equivalent variants, but should impact results minimally in practice and I can print warnings for each supercluster where this occurs.

I'm sorry that this memory issue has been impacting your analyses, and will try to push out this feature (--max-supercluster-size) as soon as I can.

JosephLalli · 2025-02-14T19:39:21Z

I've been there man, no worries. I just wanted to be sure this was on your radar. If the problem was urgent/extreme, I would have tried to put together a pull request.

Thank you so much for this tool!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vcfdist requires too much memory when using VCF with SVs included #34

Vcfdist requires too much memory when using VCF with SVs included #34

rickymagner commented Nov 1, 2024

TimD1 commented Nov 2, 2024 •

edited

Loading

rickymagner commented Nov 4, 2024

TimD1 commented Nov 5, 2024

rickymagner commented Nov 8, 2024

rickymagner commented Nov 14, 2024

kcleal commented Dec 12, 2024

TimD1 commented Dec 13, 2024

kcleal commented Feb 5, 2025

JosephLalli commented Feb 12, 2025 •

edited

Loading

TimD1 commented Feb 14, 2025

JosephLalli commented Feb 14, 2025

Vcfdist requires too much memory when using VCF with SVs included #34

Vcfdist requires too much memory when using VCF with SVs included #34

Comments

rickymagner commented Nov 1, 2024

TimD1 commented Nov 2, 2024 • edited Loading

rickymagner commented Nov 4, 2024

TimD1 commented Nov 5, 2024

rickymagner commented Nov 8, 2024

rickymagner commented Nov 14, 2024

kcleal commented Dec 12, 2024

TimD1 commented Dec 13, 2024

kcleal commented Feb 5, 2025

JosephLalli commented Feb 12, 2025 • edited Loading

TimD1 commented Feb 14, 2025

JosephLalli commented Feb 14, 2025

TimD1 commented Nov 2, 2024 •

edited

Loading

JosephLalli commented Feb 12, 2025 •

edited

Loading