Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

advise on best practice for VCF file input #5

Open
mxiong2 opened this issue Feb 26, 2024 · 17 comments
Open

advise on best practice for VCF file input #5

mxiong2 opened this issue Feb 26, 2024 · 17 comments

Comments

@mxiong2
Copy link

mxiong2 commented Feb 26, 2024

Can you advise on how best to filter a raw VCF filter for input? Can the VCF file contain a mixture of SNPs/INDELs or only VCF files that have been filter for SNPs only?

I get the following error:

Creating snvstory_ancestry_run ... done

Input VCF success. File size: 226571

Expected input: /data/vcf_files/output.snp.vcf.gz is a file

Multisample: True and sample name: FBA-0001-2

Gzipped status: True

gnomAD_continental

gnomAD_eur

gnomAD_eas

1kGP_amr

Unknown genotype: .:.. list index out of range

Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]"

@andreirajkovic
Copy link
Collaborator

andreirajkovic commented Feb 26, 2024

I think it's breaking at the step where it needs to form this genotype dictionary. We expect the format of the variant to be:
var - variant e.g. 0/0, 0|1

This function is likely the culprit. I would check to see how the VCF is formatting those genotypes for those SNPs, b/c if I recall correctly it should be able to handle INDELs

def genotype_dictionary(var):
    """
    A function to build the genotype dictionary that
    converts the raw vcf version of a genotype into a
    numeric representation

    args
    ----
    var - variant e.g. 0/0, 0|1

    returns
    -------
    int - int representation of genotype
    """
    if var[1] == '/':
        split_gt = var.strip().split("/")
    else:
        split_gt = var.strip().split("|")
    if split_gt[0] == '0' and split_gt[1] == '0':
        return 0
    elif split_gt[0] == '.' and split_gt[1] == '.':
        return 0
    elif split_gt[0] != split_gt[1]:
        return 1
    elif split_gt[0] == split_gt[1]:
        return 2```

@andreirajkovic
Copy link
Collaborator

But I see no harm in just filtering for SNPs; it might work as well.

@mxiong2
Copy link
Author

mxiong2 commented Feb 27, 2024

@andreirajkovic

I checked the VCF file for the first SNP in the error log.

##bcftools_viewVersion=1.13+htslib-1.17

##bcftools_viewCommand=view -r chr10:47228 FBA-0001.fb.splitmulti.ucsc.sort.vcf.gz; Date=Tue Feb 27 10:10:04 2024

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FBA-0001-2 FBA-0001 FBA-0001-1

The SNP is not in the VCF file.

@andreirajkovic
Copy link
Collaborator

@mxiong2
Ohhh okay. I think those SNPs in the error log are actually the ones we use for the assignment of the ancestry and since the parser failed it's saying that it can't find any of those SNPs i.e. that error message is a bit of a red herring. If you don't mind attaching your VCF or just a part of it up here I can step through the debugger to see what's going wrong.

But my best guess is this:

Unknown genotype: .:..

So that message is saying to me that when the code goes to parse genotypes it extracts this ".:.." which means either there is an encoding system for genotypes we did not account for, or we're extracting the incorrect information.

@mxiong2
Copy link
Author

mxiong2 commented Feb 27, 2024

@andreirajkovic
Unfortunately, I cannot share parts of the VCF file. I will try to go back to the original VCF file that has not been post process to determine if any additional VCF filtering is needed. As always, part of the challenge is getting the input right. How much missing data can snvstory accept? Thanks for all your work.

@andreirajkovic
Copy link
Collaborator

@mxiong2
It should be pretty flexible; I don't think we ran any benchmarks on that, but we typically need only a few hundred to make a good estimation of the ancestry. Let me see if we can post an example VCF up here to help out.

@mxiong2
Copy link
Author

mxiong2 commented Feb 27, 2024

@andreirajkovic
Thanks!

@mxiong2
Copy link
Author

mxiong2 commented Feb 27, 2024

@andreirajkovic

I try with a different VCF file to see if I could reproduced the error. I still get the same error message. To see if I get a different error message, I filtered the VCF to only include chromosome 1. I still get the same error:

Creating snvstory_ancestry_run ... done

Input VCF success. File size: 21676774

Expected input: /data/vcf_files/FBA-0002.fb.ucsc.sort.chr1.vcf.gz is a file

Multisample: True and sample name: FBA-0002-2

Gzipped status: True

gnomAD_continental

gnomAD_eur

gnomAD_eas

1kGP_amr

Unknown genotype: .:.. list index out of range

Check ordered length of SNPs. "None of [Index(['chr10_47140_G_A', 'chr10_47228_CATTGCTGTAAACTGCTCTGAG_C',\n 'chr10_47254_GCT_G', 'chr10_47278_T_C', 'chr10_47317_G_A',\n 'chr10_47347_C_T', 'chr10_47375_A_G', 'chr10_47389_C_T',\n 'chr10_47405_T_C', 'chr10_47447_G_A',\n ...\n 'chr9_138221069_C_G', 'chr9_138221079_C_T', 'chr9_138221103_G_A',\n 'chr9_138221150_C_G', 'chr9_138221191_C_T', 'chr9_138221192_C_T',\n 'chr9_138221260_TCTC_T', 'chr9_138231024_G_C', 'chr9_138231051_T_A',\n 'chr9_138233800_G_C'],\n dtype='object', length=281093)] are in the [columns]"

Cannot coerce data. Check sparse matrix: None

Traceback (most recent call last):

File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main

"__main__", mod_spec)

File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code

exec(code, run_globals)

File "/opt/Ancestry/igm_churchill_ancestry/main.py", line 4, in

run_ancestry()

File "/opt/Ancestry/igm_churchill_ancestry/cli.py", line 182, in run_ancestry

ofn=ofn)

File "/opt/Ancestry/igm_churchill_ancestry/pipelines/ancestry_prediction.py", line 137, in run_ancestry_pipeline

yprob, ylabel = predict_ancestry(s_matrix, ml_dir=ml_dir, n_classes=n_classes, model_type='svm')

TypeError: cannot unpack non-iterable NoneType object

ERROR: 1

@andreirajkovic
Copy link
Collaborator

@mxiong2 Interesting. I'll take a look at this tonight to see if I can reproduce this on my end

@andreirajkovic
Copy link
Collaborator

HG00096_extracted.vcf.gz

@mxiong2
See if you can run the attached vcf from the 1KGP

@andreirajkovic
Copy link
Collaborator

@mxiong2 could you also show me what output you get when you run this command on your own vcf:

bcftools query -f '[%GT\t]\n' your_sample.vcf.gz | tr '\t' '\n' | sort | uniq

You should see something like:

0|0
0|1
1|0
1|1

@mxiong2
Copy link
Author

mxiong2 commented Feb 29, 2024

@andreirajkovic
I get something like this:

.
0
0/0
0/1
0/10
0/2
0/3
0/4
0/5
0/6
0/7
0/8
1
1/1
1/2
1/3
1/4
1/5
1/6
1/7
2/2
2/3
2/4
2/5
3/3
3/4
3/5
3/6
4
4/4
4/5
6/6
8/9

@mxiong2
Copy link
Author

mxiong2 commented Feb 29, 2024

HG00096_extracted.vcf.gz

@mxiong2 See if you can run the attached vcf from the 1KGP

I was successful at running a sample of ours that was genotype again without doing any post-processing of the VCF file just to test what may be causing the issues with the actual VCF files that we plan to process with snvstory.

@bpow
Copy link
Contributor

bpow commented Feb 29, 2024

From discussions with @mxiong2 locally, it does look like the issue was related to the presence of multialleic site calls (especially for indels). For people who might run into this error in the future, they can try running bcftools norm -m (or similar tool) to split these multiallelic rows as preprocessing of the vcf before running snvstory.

@andreirajkovic
Copy link
Collaborator

Thank you @bpow! That's something we should probably add to our README.md. If the issue is resolve feel free to close!

@bpow
Copy link
Contributor

bpow commented Mar 2, 2024

Actually, the multiallelic sites may have been a red herring (or may be something different). It looks like another problem might be that we use a variant calling pipeline that actually calls sites outside the PAR on X as hemizygous (i.e., just genotypes of '0' or '1') in individuals with 46,XY chromosome complement.

These rows would cause a problem in your genotype_dictionary function because there is neither an '/' nor a '|' to split on, so no split_gt[1]...

As an aside, the code as written assumes that there would never be a multiallelic site with >10 genotypes (since it just looks for the delimiter in the second position-- '10/11' would not be recognized). This may not be true for particularly complex indels/haplotypes. You could instead use something like:

def genotype_dictionary(var):
    """
    A function to build the genotype dictionary that
    converts the raw vcf version of a genotype into a
    numeric representation

    args
    ----
    var - variant e.g. 0/0, 0|1

    returns
    -------
    int - int representation of genotype
    """
    if '/' in var:
        split_gt = var.strip().split('/')
    elif '|' in var:
        split_gt = var.split('|')
    else:
        # no delimiter: hemizygous site
        if var.strip() in ('0', '.'):
            return 0
        else:
            return 1 # or should this be something else?
    if split_gt[0] == '0' and split_gt[1] == '0':
        return 0
    elif split_gt[0] == '.' and split_gt[1] == '.':
        return 0
    elif split_gt[0] != split_gt[1]:
        return 1 # possible unexpected behavior: genotype '1/2' is interpreted same as '0/1'
    elif split_gt[0] == split_gt[1]:
        return 2

Note that I'm not sure how hemizygous sites should be represented (should they be coded as '1' since there is 1 alternate allele, or '2' because, like homalt sites, there is no reference allele present? Do sites on X/Y even get used as informative markers in your analysis, if not, then we could just exclude these chromosomes on input.

Also note that this function as written still has some other ambiguities to how things might be mapped with multiallelic sites. As mentioned in the comment, '0/1' and '1/2' end up being both coded as 1, despite the former having a reference allele and the latter not. Correspondingly, '1/1' and '2/2' would fall to the last elif and be coded the same as homalt even though they represent different alt alleles.

@andreirajkovic
Copy link
Collaborator

Hey @bpow,

Great catch on the limitation of the multiallelic sites; we will need to put a fix in for that. The ML algos don't use the sex chromosomes to compute ancestry. I would recommend to omit them from the analysis; in a future release we should filter this out from the start. On the hemizygous point for autosomal alleles, I assume these would be missed and so a fix like the one you're suggesting might work.

Also thank you for the code you provided, I'll review it and see how best to incorporate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants