Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't have maternal and paternal sequence data, how to build hap-kmer #111

Open
socialhang opened this issue Sep 3, 2023 · 1 comment
Open

Comments

@socialhang
Copy link

Hi,

Due to the nature of our system we are unable to acquire the parents of our sequenced individual, we sequencing a lot of children data.
What should I do to build hap-kmer using these children NGS data?

best wishes!
hang

@arangrhie
Copy link
Contributor

Hello,

This is an interesting question, not so easy to answer though.

  1. Using imputed phased variants
    If you have a collapsed assembly from the child (or children), you could potentially re-map the reads to the assembly (assuming it's nicely polished as well, and near-T2T), and phase the variants called from short-reads using long-reads (HiFi or ONT) or long-range reads (Strand-Seq or HiC). There are several tools that help doing the phasing part.
    Once the het variants are phased, you could make two haplotype consensuses using bcftools consensus.

Then, you could collect kmers from each haplotype assembly, and use the $MERQURY/trio/hapmer.sh with -no-filter option to get pseudo-haplotype specific kmers and use that to evaluate the original collapsed assembly.

It's not guaranteed that a 'hapmer' is consistent across all chromosomes though, and likely be inconsistent between chromosomal arms if lacking long-range information. Lack of HiC coverage could miss a few true heterozygous variants, especially those between blocks in long stretches of homozygosity.
Also, there are so many assumptions made here. There may be regions where it is not easy to map with short-reads, therefore the variant calling accuracy will be poor in those region. I'd only use the phase block information and switch error rate results form Merqury, and not put too much on the blob-plots.

If you already have a diploid assembly, it will be counter intuitive to re-use the assembly specific kmers for evaluation, so unfortunately no, I don't have a good answer.

  1. Using population specific kmers
    If the parents come from different populations (i.e. subsepcies), and there are some genome sequencing reads available for those, you could try using those population data kmers as the 'parental' kmer db. Run $MERQURY/trio/hapmers.sh with each kmers from the population merged with meryl union-sum.

  2. Sex chromosome limited kmers
    If your genome follows XY or WZ system, you could use the homogametic (e.g. XX) vs. heterogametic (e.g. XY) samples to extract kmers that are only present in the XY samples. This is not perfect, nor ideal to evaluate the whole genome, but should allow to get an estimate of the Y (or W) specific sequences.

Also linking #6

Best,
Arang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants