Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get haplotype-specific reads? #52

Open
DayTimeMouse opened this issue Sep 26, 2024 · 5 comments
Open

How to get haplotype-specific reads? #52

DayTimeMouse opened this issue Sep 26, 2024 · 5 comments

Comments

@DayTimeMouse
Copy link

DayTimeMouse commented Sep 26, 2024

Hi,

Thanks for developing this awesome tools. I'd like to get haplotype-specific HiFi reads via haplotype-specific kmers.

I run script to get haplotype-specific kmers, meryl version is 1.3:
meryl count k=21 hap1.fa output hap1.meryl
meryl count k=21 hap2.fa output hap2.meryl
meryl difference hap1.meryl hap2.meryl output hap1.only.meryl
meryl difference hap2.meryl hap1.meryl output hap2.only.meryl
meryl intersect hap1.meryl hap2.meryl output hap1_hap2_intersect.meryl

get read-meryl:
meryl count k=21 hifi_reads.fq.gz output hifi.meryl

I want to partition HiFi reads(each read) into two haplotypes. Now, I do not know how to do next? meryl can do this? Or, do you have some ideas for this?

Thanks in advance.

@arangrhie
Copy link
Contributor

Hi @DayTimeMouse,

One fundamental question is, how confident you are with your hap1.fa and hap2.fa.
Looks like your hap1.meryl and hap2.meryl are reflecting what's unique in each assembly, not the biologically inherited parental specific kmers.

Anyways, you could try meryl-lookup -existence -sequence hifi_reads.fq.gz -mers hap1.meryl hap2.meryl > hifi_reads.hapmer.count.
This will provide count of kmers shared (found) in each read sequence.

Small explanation from the help message:

...
     The output file has format:
         <sequence_name> <tab> <kmers_in_sequence> <tab> <kmers_in_db> <tab> <kmers_shared> [...]

     With one input database:
         sequence1 <tab> 8415 <tab> 12856825 <tab> 8145

     With two input databases:
         sequence1 <tab> 8415 <tab> 12856825 <tab> 8145 <tab> 575757256 <tab> 8354

You could bin the reads based on the hap1 and hap2 mers found in each fastq sequence.
Usually, it's expected to find kmers from one haplotype and a few from the other haplotype, due to possible noises in the reads. There may be reads with no hapmers if the two haplotypes share a large homozygous block.
If there are any switch errors in the assembly, you'd see reads with substantial kmers with both hap kmers.

Alternate option is to use trio-binning implemented in Canu, but this might fit better if you are actually using hapmers inferred from the parental genomes as it normalizes based on the kmers found in the dbs.

/path/to/canu/build/bin/splitHaplotype \
 -cl 1000 -memory 48 -threads 24 \
 -H ../hap1.meryl 1 hap1.fasta.gz \
 -H ../hap2.meryl 1 hap2.fasta.gz \
 -A ./unk.fasta.gz \
 hifi_reads.fq.gz

-cl 1000 will filter out reads < 1kb.

@DayTimeMouse
Copy link
Author

But I do not have real parental genomes, I used hifiasm(HiFi + ONT +HiC) to get hap1.fa and hap2.fa.
In this case, do you still recommand me to use Canu(trio-binning splitHaplotype)? Canu(trio-binning) and count hapmers to partition reads via meryl-lookup, which method is fit better for me?

@arangrhie
Copy link
Contributor

Then using meryl-lookup might work the best. Use meryl v1.4.1 by the way, there were some improvements / fixes made particularly in meryl-lookup.

@DayTimeMouse
Copy link
Author

Many thanks!

@DayTimeMouse
Copy link
Author

Hi,

I have another question. How to get haplotype-specific HiC reads(paired-end)? Using haplotype-specific kmers?

Best regards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants