-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
advise on best practice for VCF file input #5
Comments
I think it's breaking at the step where it needs to form this genotype dictionary. We expect the format of the variant to be: This function is likely the culprit. I would check to see how the VCF is formatting those genotypes for those SNPs, b/c if I recall correctly it should be able to handle INDELs def genotype_dictionary(var):
"""
A function to build the genotype dictionary that
converts the raw vcf version of a genotype into a
numeric representation
args
----
var - variant e.g. 0/0, 0|1
returns
-------
int - int representation of genotype
"""
if var[1] == '/':
split_gt = var.strip().split("/")
else:
split_gt = var.strip().split("|")
if split_gt[0] == '0' and split_gt[1] == '0':
return 0
elif split_gt[0] == '.' and split_gt[1] == '.':
return 0
elif split_gt[0] != split_gt[1]:
return 1
elif split_gt[0] == split_gt[1]:
return 2``` |
But I see no harm in just filtering for SNPs; it might work as well. |
I checked the VCF file for the first SNP in the error log. ##bcftools_viewVersion=1.13+htslib-1.17 ##bcftools_viewCommand=view -r chr10:47228 FBA-0001.fb.splitmulti.ucsc.sort.vcf.gz; Date=Tue Feb 27 10:10:04 2024 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FBA-0001-2 FBA-0001 FBA-0001-1 The SNP is not in the VCF file. |
@mxiong2 But my best guess is this: Unknown genotype: .:.. So that message is saying to me that when the code goes to parse genotypes it extracts this ".:.." which means either there is an encoding system for genotypes we did not account for, or we're extracting the incorrect information. |
@andreirajkovic |
@mxiong2 |
@andreirajkovic |
I try with a different VCF file to see if I could reproduced the error. I still get the same error message. To see if I get a different error message, I filtered the VCF to only include chromosome 1. I still get the same error: Creating snvstory_ancestry_run ... done Input VCF success. File size: 21676774 Expected input: /data/vcf_files/FBA-0002.fb.ucsc.sort.chr1.vcf.gz is a file Multisample: True and sample name: FBA-0002-2 Gzipped status: True gnomAD_continental gnomAD_eur gnomAD_eas 1kGP_amr Unknown genotype: .:.. list index out of range Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]" Cannot coerce data. Check sparse matrix: None Traceback (most recent call last): File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
File "/opt/Ancestry/igm_churchill_ancestry/main.py", line 4, in
File "/opt/Ancestry/igm_churchill_ancestry/cli.py", line 182, in run_ancestry
File "/opt/Ancestry/igm_churchill_ancestry/pipelines/ancestry_prediction.py", line 137, in run_ancestry_pipeline
TypeError: cannot unpack non-iterable NoneType object ERROR: 1 |
@mxiong2 Interesting. I'll take a look at this tonight to see if I can reproduce this on my end |
@mxiong2 |
@mxiong2 could you also show me what output you get when you run this command on your own vcf: bcftools query -f '[%GT\t]\n' your_sample.vcf.gz | tr '\t' '\n' | sort | uniq You should see something like: 0|0
0|1
1|0
1|1 |
@andreirajkovic . |
I was successful at running a sample of ours that was genotype again without doing any post-processing of the VCF file just to test what may be causing the issues with the actual VCF files that we plan to process with snvstory. |
From discussions with @mxiong2 locally, it does look like the issue was related to the presence of multialleic site calls (especially for indels). For people who might run into this error in the future, they can try running |
Thank you @bpow! That's something we should probably add to our README.md. If the issue is resolve feel free to close! |
Actually, the multiallelic sites may have been a red herring (or may be something different). It looks like another problem might be that we use a variant calling pipeline that actually calls sites outside the PAR on X as hemizygous (i.e., just genotypes of '0' or '1') in individuals with 46,XY chromosome complement. These rows would cause a problem in your As an aside, the code as written assumes that there would never be a multiallelic site with >10 genotypes (since it just looks for the delimiter in the second position-- '10/11' would not be recognized). This may not be true for particularly complex indels/haplotypes. You could instead use something like: def genotype_dictionary(var):
"""
A function to build the genotype dictionary that
converts the raw vcf version of a genotype into a
numeric representation
args
----
var - variant e.g. 0/0, 0|1
returns
-------
int - int representation of genotype
"""
if '/' in var:
split_gt = var.strip().split('/')
elif '|' in var:
split_gt = var.split('|')
else:
# no delimiter: hemizygous site
if var.strip() in ('0', '.'):
return 0
else:
return 1 # or should this be something else?
if split_gt[0] == '0' and split_gt[1] == '0':
return 0
elif split_gt[0] == '.' and split_gt[1] == '.':
return 0
elif split_gt[0] != split_gt[1]:
return 1 # possible unexpected behavior: genotype '1/2' is interpreted same as '0/1'
elif split_gt[0] == split_gt[1]:
return 2 Note that I'm not sure how hemizygous sites should be represented (should they be coded as '1' since there is 1 alternate allele, or '2' because, like homalt sites, there is no reference allele present? Do sites on X/Y even get used as informative markers in your analysis, if not, then we could just exclude these chromosomes on input. Also note that this function as written still has some other ambiguities to how things might be mapped with multiallelic sites. As mentioned in the comment, '0/1' and '1/2' end up being both coded as 1, despite the former having a reference allele and the latter not. Correspondingly, '1/1' and '2/2' would fall to the last elif and be coded the same as homalt even though they represent different alt alleles. |
Hey @bpow, Great catch on the limitation of the multiallelic sites; we will need to put a fix in for that. The ML algos don't use the sex chromosomes to compute ancestry. I would recommend to omit them from the analysis; in a future release we should filter this out from the start. On the hemizygous point for autosomal alleles, I assume these would be missed and so a fix like the one you're suggesting might work. Also thank you for the code you provided, I'll review it and see how best to incorporate it. |
Can you advise on how best to filter a raw VCF filter for input? Can the VCF file contain a mixture of SNPs/INDELs or only VCF files that have been filter for SNPs only?
I get the following error:
Creating snvstory_ancestry_run ... done
Input VCF success. File size: 226571
Expected input: /data/vcf_files/output.snp.vcf.gz is a file
Multisample: True and sample name: FBA-0001-2
Gzipped status: True
gnomAD_continental
gnomAD_eur
gnomAD_eas
1kGP_amr
Unknown genotype: .:.. list index out of range
Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]"
The text was updated successfully, but these errors were encountered: