Inconsistent HGNC:ID results between single-thread and multi-thread in vep #1759

Open
karlestira opened this issue Sep 26, 2024 · 6 comments
@karlestira

Describe the issue

VEP gives different results when run with multiple threads (--fork).

Problem:
Some genes (e.g. ENSG00000169047 or ENSG00000168769) lose the HGNC ID on their RefSeq transcripts (the field next to EntrezGene) when --fork is used, whereas the ID is present in the single-thread result.
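
To check this per transcript rather than by eye, the HGNC ID can be pulled out of each CSQ entry. The following is a minimal sketch, not part of VEP: the function names are hypothetical, and it assumes the CSQ header lists the fields `Feature` and `HGNC_ID` (as VEP does with --symbol) and that commas inside values are escaped, so splitting on `,` separates entries.

```python
import re
from typing import Dict, List


def csq_field_names(header_line: str) -> List[str]:
    """Extract the CSQ field names from the '##INFO=<ID=CSQ,...Format: A|B|C">' header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no CSQ format found in header line")
    return match.group(1).split("|")


def hgnc_by_feature(info: str, fields: List[str]) -> Dict[str, str]:
    """Map each transcript ID (Feature) to its HGNC_ID for one record's INFO column.

    Simplification: if a transcript appears for several alleles, the last entry wins.
    """
    feature_idx = fields.index("Feature")
    hgnc_idx = fields.index("HGNC_ID")
    result: Dict[str, str] = {}
    for part in info.split(";"):
        if not part.startswith("CSQ="):
            continue
        for entry in part[len("CSQ="):].split(","):
            values = entry.split("|")
            result[values[feature_idx]] = values[hgnc_idx]
    return result
```

Running `hgnc_by_feature` on the same record from the single-thread and the forked output and comparing the two dictionaries would show which transcripts end up with an empty HGNC ID.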

Additional information

The inconsistency depends on the thread setting: the same number of threads gives the same results across runs, but different thread settings lead to different results.

I believe this is a multi-threading inconsistency bug, and I think it occurs widely: any WES VCF with the VEP merged cache can reproduce the problem; no specific input is needed.

System

  • VEP version: 112.0 (conda build: pl5321h2a3209d_0, conda channel: anaconda/cloud/bioconda)
  • VEP Cache version: homo_sapiens_merged/112_GRCh37
  • Perl version: 5.32.1
  • OS: Debian GNU/Linux 10 (buster)
  • tabix installed ? tabix 1.20 from conda

Full VEP command line

vep --input_file test.vcf --output_file test.vep.10.vcf --format vcf --vcf --symbol --biotype --hgvs --fasta ucsc.hg19.fa --offline --cache --dir_cache /opt/vep/database --no_stats --merged --fork 10 --buffer_size 10000

info line in output vcf:
##VEP="v112.0" API="v112" time="2024-09-26 17:00:31" cache="/opt/vep/database/homo_sapiens_merged/112_GRCh37" ensembl=112.7104005 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356 1000genomes="phase3" COSMIC="98" ClinVar="202306" HGMD-PUBLIC="20204" assembly="GRCh37.p13" dbSNP="156" gencode="GENCODE 19" genebuild="2011-04" gnomADe="r2.1" polyphen="2.2.2" refseq="105.20220307 - GCF_000001405.25_GRCh37.p13_genomic.gff" regbuild="1.0" sift="sift5.2.2"

Full error message

No error message.

@likhitha-surapaneni likhitha-surapaneni self-assigned this Sep 27, 2024
@likhitha-surapaneni
Contributor

Hi @karlestira,
Unfortunately I am not able to reproduce this issue on my end. Is there a specific test input file that can be shared to help us debug this?

Kind regards,
Likhitha

@karlestira
Author

karlestira commented Oct 8, 2024

The VCF is from VarDict, and some pre-processing has been applied.

cmd:
5 threads:
vep --input_file NA12878L1.vardict.head10000.vcf --output_file NA12878L1.vep.5.head10000.vcf --format vcf --vcf --symbol --biotype --hgvs --fasta ucsc.hg19.fa --offline --cache --dir_cache vep_db --no_stats --merged --fork 5 --buffer_size 10000
10 threads:
vep --input_file NA12878L1.vardict.head10000.vcf --output_file NA12878L1.vep.10.head10000.vcf --format vcf --vcf --symbol --biotype --hgvs --fasta ucsc.hg19.fa --offline --cache --dir_cache vep_db --no_stats --merged --fork 10 --buffer_size 10000

Using the VEP cache downloaded from the FTP site (sorry, I forgot the URL), named homo_sapiens_merged_vep_112_GRCh37.tar.gz

then:
diff NA12878L1.vep.5.head10000.vcf NA12878L1.vep.10.head10000.vcf

On my system, the diff shows line 62 (the command line, which is expected to differ) and lines 4892, 4893, 4987, 4988 (these 4 lines differ in HGNC ID where the transcript is from RefSeq).
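
Since the header always differs (command line, timestamp), a comparison that skips header lines and reports variants instead of line numbers may be easier to share. This is a rough sketch with hypothetical function names, not a VEP tool; it assumes standard tab-separated VCF columns and compares only the INFO column.

```python
from typing import Dict, Iterable, List, Tuple

# (CHROM, POS, REF, ALT) identifies a record in this sketch.
Key = Tuple[str, str, str, str]


def index_records(lines: Iterable[str]) -> Dict[Key, str]:
    """Map (CHROM, POS, REF, ALT) to the INFO column, skipping header and blank lines.

    Simplification: if the same key occurs twice, the last record wins.
    """
    records: Dict[Key, str] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        records[(cols[0], cols[1], cols[3], cols[4])] = cols[7]
    return records


def differing_records(a_lines: Iterable[str], b_lines: Iterable[str]) -> List[Key]:
    """Return the sorted keys whose INFO differs, or that exist in only one input."""
    a = index_records(a_lines)
    b = index_records(b_lines)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))
```

Passing the two open output files (for example the --fork 5 and --fork 10 results above) to `differing_records` should list only the affected variants, without the command-line header difference.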

NA12878L1.vardict.head10000.vcf.gz

@TimD1

TimD1 commented Oct 23, 2024

I have encountered the same issue, using VEP version 111.0.

@christopher-hardy

Hi @likhitha-surapaneni, thanks for your help :)

I am seeing this issue as well w/ VEP version 111.0. Were you able to replicate with the data from @karlestira, or would it be helpful to provide another example?

Even before a patch is applied, it would be awesome if you could comment on what might be causing this issue so that we know if it's specific to the HGNC annotations (which I'm not super concerned about) or a more general issue with forking that may result in other more serious discrepancies?

@TimD1

TimD1 commented Nov 4, 2024

@likhitha-surapaneni , thanks for looking into this!

After a more thorough comparison between VEP runs with and without forking, I am starting to notice more serious issues than just dropped HGNC identifiers (which affect roughly 1 out of 10,000 SNPs).

In particular, about 1 out of 4 structural variants (SVs) is annotated differently between single-thread and multi-thread VEP. The vast majority of these differences (>90%) are instances where multi-thread VEP drops one of several entire CSQ annotations for a variant. Here's an example:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT">
single-thread VEP: CSQ=allele1|consequence1|impact1,allele2|consequence2|impact2,allele3|consequence3|impact3

multi-thread VEP: CSQ=allele1|consequence1|impact1,allele3|consequence3|impact3
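
A small helper along these lines can list which CSQ entries went missing for a given variant. It is only an illustrative sketch with hypothetical function names; it assumes entries are separated by `,` within the `CSQ=` field of the INFO column and treats each entry as an opaque string.

```python
from collections import Counter
from typing import List


def csq_entries(info: str) -> List[str]:
    """Return the comma-separated CSQ entries from a VCF INFO string (empty if no CSQ)."""
    for field in info.split(";"):
        if field.startswith("CSQ="):
            return field[len("CSQ="):].split(",")
    return []


def dropped_entries(single_info: str, multi_info: str) -> List[str]:
    """Entries present in the single-thread CSQ but absent from the multi-thread CSQ.

    Counter subtraction keeps duplicates honest: an entry that appears twice in
    one run and once in the other is reported once.
    """
    missing = Counter(csq_entries(single_info)) - Counter(csq_entries(multi_info))
    return list(missing.elements())
```

Applied to the example above, `dropped_entries` would be called with the INFO column of the same variant from each run and would return the entry that the forked run omitted.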

@likhitha-surapaneni
Contributor

Hi @TimD1 , @christopher-hardy, @karlestira
We are still investigating this. Thank you for providing the details, we will get back to you with an update.

Kind regards,
Likhitha
