advise on best practice for VCF file input #5

mxiong2 · 2024-02-26T20:11:50Z

Can you advise on how best to filter a raw VCF filter for input? Can the VCF file contain a mixture of SNPs/INDELs or only VCF files that have been filter for SNPs only?

I get the following error:

Creating snvstory_ancestry_run ... done

Input VCF success. File size: 226571

Expected input: /data/vcf_files/output.snp.vcf.gz is a file

Multisample: True and sample name: FBA-0001-2

Gzipped status: True

gnomAD_continental

gnomAD_eur

gnomAD_eas

1kGP_amr

Unknown genotype: .:.. list index out of range

Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]"

andreirajkovic · 2024-02-26T21:45:23Z

I think it's breaking at the step where it needs to form this genotype dictionary. We expect the format of the variant to be:
var - variant e.g. 0/0, 0|1

This function is likely the culprit. I would check to see how the VCF is formatting those genotypes for those SNPs, b/c if I recall correctly it should be able to handle INDELs

def genotype_dictionary(var):
    """
    A function to build the genotype dictionary that
    converts the raw vcf version of a genotype into a
    numeric representation

    args
    ----
    var - variant e.g. 0/0, 0|1

    returns
    -------
    int - int representation of genotype
    """
    if var[1] == '/':
        split_gt = var.strip().split("/")
    else:
        split_gt = var.strip().split("|")
    if split_gt[0] == '0' and split_gt[1] == '0':
        return 0
    elif split_gt[0] == '.' and split_gt[1] == '.':
        return 0
    elif split_gt[0] != split_gt[1]:
        return 1
    elif split_gt[0] == split_gt[1]:
        return 2```

andreirajkovic · 2024-02-26T21:57:40Z

But I see no harm in just filtering for SNPs; it might work as well.

mxiong2 · 2024-02-27T15:15:49Z

@andreirajkovic

I checked the VCF file for the first SNP in the error log.

##bcftools_viewVersion=1.13+htslib-1.17

##bcftools_viewCommand=view -r chr10:47228 FBA-0001.fb.splitmulti.ucsc.sort.vcf.gz; Date=Tue Feb 27 10:10:04 2024

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FBA-0001-2 FBA-0001 FBA-0001-1

The SNP is not in the VCF file.

andreirajkovic · 2024-02-27T16:20:09Z

@mxiong2
Ohhh okay. I think those SNPs in the error log are actually the ones we use for the assignment of the ancestry and since the parser failed it's saying that it can't find any of those SNPs i.e. that error message is a bit of a red herring. If you don't mind attaching your VCF or just a part of it up here I can step through the debugger to see what's going wrong.

But my best guess is this:

Unknown genotype: .:..

So that message is saying to me that when the code goes to parse genotypes it extracts this ".:.." which means either there is an encoding system for genotypes we did not account for, or we're extracting the incorrect information.

mxiong2 · 2024-02-27T16:27:57Z

@andreirajkovic
Unfortunately, I cannot share parts of the VCF file. I will try to go back to the original VCF file that has not been post process to determine if any additional VCF filtering is needed. As always, part of the challenge is getting the input right. How much missing data can snvstory accept? Thanks for all your work.

andreirajkovic · 2024-02-27T17:02:18Z

@mxiong2
It should be pretty flexible; I don't think we ran any benchmarks on that, but we typically need only a few hundred to make a good estimation of the ancestry. Let me see if we can post an example VCF up here to help out.

mxiong2 · 2024-02-27T17:07:35Z

@andreirajkovic
Thanks!

mxiong2 · 2024-02-27T19:13:04Z

@andreirajkovic

I try with a different VCF file to see if I could reproduced the error. I still get the same error message. To see if I get a different error message, I filtered the VCF to only include chromosome 1. I still get the same error:

Creating snvstory_ancestry_run ... done

Input VCF success. File size: 21676774

Expected input: /data/vcf_files/FBA-0002.fb.ucsc.sort.chr1.vcf.gz is a file

Multisample: True and sample name: FBA-0002-2

Gzipped status: True

gnomAD_continental

gnomAD_eur

gnomAD_eas

1kGP_amr

Unknown genotype: .:.. list index out of range

Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]"

Cannot coerce data. Check sparse matrix: None

Traceback (most recent call last):

File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main

"__main__", mod_spec)

File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code

exec(code, run_globals)

File "/opt/Ancestry/igm_churchill_ancestry/main.py", line 4, in

run_ancestry()

File "/opt/Ancestry/igm_churchill_ancestry/cli.py", line 182, in run_ancestry

ofn=ofn)

File "/opt/Ancestry/igm_churchill_ancestry/pipelines/ancestry_prediction.py", line 137, in run_ancestry_pipeline

yprob, ylabel = predict_ancestry(s_matrix, ml_dir=ml_dir, n_classes=n_classes, model_type='svm')

TypeError: cannot unpack non-iterable NoneType object

ERROR: 1

andreirajkovic · 2024-02-27T19:51:48Z

@mxiong2 Interesting. I'll take a look at this tonight to see if I can reproduce this on my end

andreirajkovic · 2024-02-28T22:45:29Z

HG00096_extracted.vcf.gz

@mxiong2
See if you can run the attached vcf from the 1KGP

andreirajkovic · 2024-02-28T22:51:41Z

@mxiong2 could you also show me what output you get when you run this command on your own vcf:

bcftools query -f '[%GT\t]\n' your_sample.vcf.gz | tr '\t' '\n' | sort | uniq

You should see something like:

0|0
0|1
1|0
1|1

mxiong2 · 2024-02-29T15:17:20Z

@andreirajkovic
I get something like this:

.
0
0/0
0/1
0/10
0/2
0/3
0/4
0/5
0/6
0/7
0/8
1
1/1
1/2
1/3
1/4
1/5
1/6
1/7
2/2
2/3
2/4
2/5
3/3
3/4
3/5
3/6
4
4/4
4/5
6/6
8/9

mxiong2 · 2024-02-29T15:19:38Z

HG00096_extracted.vcf.gz

@mxiong2 See if you can run the attached vcf from the 1KGP

I was successful at running a sample of ours that was genotype again without doing any post-processing of the VCF file just to test what may be causing the issues with the actual VCF files that we plan to process with snvstory.

bpow · 2024-02-29T18:44:03Z

From discussions with @mxiong2 locally, it does look like the issue was related to the presence of multialleic site calls (especially for indels). For people who might run into this error in the future, they can try running bcftools norm -m (or similar tool) to split these multiallelic rows as preprocessing of the vcf before running snvstory.

andreirajkovic · 2024-02-29T19:22:20Z

Thank you @bpow! That's something we should probably add to our README.md. If the issue is resolve feel free to close!

bpow · 2024-03-02T19:31:30Z

Actually, the multiallelic sites may have been a red herring (or may be something different). It looks like another problem might be that we use a variant calling pipeline that actually calls sites outside the PAR on X as hemizygous (i.e., just genotypes of '0' or '1') in individuals with 46,XY chromosome complement.

These rows would cause a problem in your genotype_dictionary function because there is neither an '/' nor a '|' to split on, so no split_gt[1]...

As an aside, the code as written assumes that there would never be a multiallelic site with >10 genotypes (since it just looks for the delimiter in the second position-- '10/11' would not be recognized). This may not be true for particularly complex indels/haplotypes. You could instead use something like:

def genotype_dictionary(var):
    """
    A function to build the genotype dictionary that
    converts the raw vcf version of a genotype into a
    numeric representation

    args
    ----
    var - variant e.g. 0/0, 0|1

    returns
    -------
    int - int representation of genotype
    """
    if '/' in var:
        split_gt = var.strip().split('/')
    elif '|' in var:
        split_gt = var.split('|')
    else:
        # no delimiter: hemizygous site
        if var.strip() in ('0', '.'):
            return 0
        else:
            return 1 # or should this be something else?
    if split_gt[0] == '0' and split_gt[1] == '0':
        return 0
    elif split_gt[0] == '.' and split_gt[1] == '.':
        return 0
    elif split_gt[0] != split_gt[1]:
        return 1 # possible unexpected behavior: genotype '1/2' is interpreted same as '0/1'
    elif split_gt[0] == split_gt[1]:
        return 2

Note that I'm not sure how hemizygous sites should be represented (should they be coded as '1' since there is 1 alternate allele, or '2' because, like homalt sites, there is no reference allele present? Do sites on X/Y even get used as informative markers in your analysis, if not, then we could just exclude these chromosomes on input.

Also note that this function as written still has some other ambiguities to how things might be mapped with multiallelic sites. As mentioned in the comment, '0/1' and '1/2' end up being both coded as 1, despite the former having a reference allele and the latter not. Correspondingly, '1/1' and '2/2' would fall to the last elif and be coded the same as homalt even though they represent different alt alleles.

andreirajkovic · 2024-03-04T16:57:36Z

Hey @bpow,

Great catch on the limitation of the multiallelic sites; we will need to put a fix in for that. The ML algos don't use the sex chromosomes to compute ancestry. I would recommend to omit them from the analysis; in a future release we should filter this out from the start. On the hemizygous point for autosomal alleles, I assume these would be missed and so a fix like the one you're suggesting might work.

Also thank you for the code you provided, I'll review it and see how best to incorporate it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

advise on best practice for VCF file input #5

advise on best practice for VCF file input #5

mxiong2 commented Feb 26, 2024

andreirajkovic commented Feb 26, 2024 •

edited

Loading

andreirajkovic commented Feb 26, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

andreirajkovic commented Feb 28, 2024

andreirajkovic commented Feb 28, 2024

mxiong2 commented Feb 29, 2024

mxiong2 commented Feb 29, 2024

bpow commented Feb 29, 2024

andreirajkovic commented Feb 29, 2024

bpow commented Mar 2, 2024

andreirajkovic commented Mar 4, 2024

advise on best practice for VCF file input #5

advise on best practice for VCF file input #5

Comments

mxiong2 commented Feb 26, 2024

andreirajkovic commented Feb 26, 2024 • edited Loading

andreirajkovic commented Feb 26, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

mxiong2 commented Feb 27, 2024

andreirajkovic commented Feb 27, 2024

andreirajkovic commented Feb 28, 2024

andreirajkovic commented Feb 28, 2024

mxiong2 commented Feb 29, 2024

mxiong2 commented Feb 29, 2024

bpow commented Feb 29, 2024

andreirajkovic commented Feb 29, 2024

bpow commented Mar 2, 2024

andreirajkovic commented Mar 4, 2024

andreirajkovic commented Feb 26, 2024 •

edited

Loading