question: increase visibility of SNPs at wider views? #1551

Open
kevfengler227 opened this issue Aug 28, 2024 · 24 comments

Comments

@kevfengler227

Is there a way to increase the visibility of SNPs at wider views? In the example below, I can see SNP differences between alignments in a 56 kb window, but not in a 140 kb window, which encompasses the entire region I want to display.

image

image

@kevfengler227
Author

I should add that the coverage track has the desired SNP visibility, just not the alignments.

image

@jrobinso
Contributor

Probably not at the moment, but I will look into it. When zoomed in all mismatches are shown, not just those deemed significant for the coverage track. My vague recollection is we stop doing this at some resolution as it becomes too cluttered, but this should be revisited.

@jrobinso
Contributor

Just a note here -- this only seems to happen with long read (3rd gen) data.

@kevfengler227
Author

Indeed. These are actually 140 kb genomic segments aligned as HiFi reads. But I am trying to show the SNP variation and haplotypes in each genome. This is one way to turn IGV into a pangenome viewer!

@jrobinso
Contributor

Interesting. I'll have this fixed soon. If the dataset you are using or creating is public let me know, it would be an interesting test case to add.

@jrobinso
Contributor

One issue that arises as you zoom out is that many bases land on the same pixel. At 100 kb there are approximately 100 bases per pixel. For typical reads this means that nearly every pixel of the alignment will have a mismatch, often multiple mismatches. We might need some user options for how to handle this.
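
To make the arithmetic above concrete, here is a small Python sketch. The 1000-pixel track width and the 1% mismatch rate are illustrative assumptions, not values taken from IGV or from this thread, and the function names are hypothetical.

```python
def bases_per_pixel(window_bp: int, track_width_px: int) -> float:
    """Resolution of the view: how many reference bases share one pixel."""
    return window_bp / track_width_px


def expected_mismatches_per_pixel(
    window_bp: int, track_width_px: int, mismatch_rate: float
) -> float:
    """Average number of mismatching bases drawn on a single pixel.

    mismatch_rate is the fraction of aligned bases that differ from the
    reference (an assumed input, e.g. read error plus true variation).
    """
    return bases_per_pixel(window_bp, track_width_px) * mismatch_rate


# Assumed 1000 px wide alignment track showing a 100 kb window.
print(bases_per_pixel(100_000, 1000))  # 100.0

# Assumed 1% mismatch rate: about one mismatch on every pixel.
print(expected_mismatches_per_pixel(100_000, 1000, 0.01))

# Low-diversity case from this thread: ~13 SNPs in a 140 kb window.
print(expected_mismatches_per_pixel(140_000, 1000, 13 / 140_000))
```

Under these assumptions the noisy case puts roughly one mismatch on every pixel, while the low-diversity case leaves almost every pixel clean, which is why drawing mismatches stays readable there.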

@jrobinso
Contributor

As an illustration, here is a 247 kb window of PacBio alignments with mismatches drawn. It's not usable, and rendering is extremely slow. So some preferences or a special mode is needed here.
Screenshot 2024-08-28 at 11 01 21 PM

@kevfengler227
Author

yes, I did not intend to use this capability for PacBio reads or ONT reads, but rather genomes with relatively few differences. So a "genome" mode would be ideal. I can send you an example public dataset.

Of course, the user needs to do some upfront work to create the ideal input data, but reducing each genome to 1x is extremely powerful, rather than 30x PacBio, and the only practical way to view a large pangenome.

@jrobinso
Contributor

A public dataset would be helpful.

@jrobinso
Contributor

There will still be limits on zoom out as at a minimum the sequence for the entire region needs to be loaded, not to mention the read sequence in every alignment. We could not view an entire chromosome with read sequences for example.

@kevfengler227
Author

Admittedly, this will probably only work well for low-diversity applications like my initial request. In that case there are only ~13 SNPs in a handful of genomes in a 140 kb range, which was just outside the visibility limit, so I was hoping for a way to crank up the SNP visibility. That wouldn't make sense if there was a ton of variation, which is often the case for plant pangenomes.

It seems that 114 kb is the visibility max for SNPs, but INDELs are visible at much wider ranges.

image

image

This real-world example from the maize pangenome probably has too many SNPs to display nicely at wider ranges, but in some specific cases it would still be useful.

@kevfengler227
Author

here genomes were aligned in 100 kb consecutive chunks

@kevfengler227
Author

Here is a test dataset of mock data, with a few SNPs over 245 kb

image
10genomes.fasta.gz

test.fasta.gz

@kevfengler227
Author

kevfengler227 commented Aug 29, 2024

minimap2 -ax map-hifi -t4 test.fasta 10genomes.fasta | samtools view -b -1 - | samtools sort --write-index -o 10genomes.bam

@kevfengler227
Author

So basically trying to use IGV as a haplotype viewer

@jrobinso
Contributor

I've never used minimap2 but that's o.k. I think the simplest resolution of this issue would be to just make the max window for showing mismatches user settable, probably as a preference. A new display mode is a bigger topic that deserves its own issue, and would be longer term and prioritized vs other bigger topics.

I will also make SNP display subject to the limit. BTW, currently the limit is not on the genomic window, which can vary by display size, but on the resolution in bp / pixel.
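
The distinction between a window-size limit and a resolution limit can be sketched as follows. This is not IGV's code: the function, the `max_bp_per_pixel` parameter name, and the value 100 are all hypothetical and only illustrate why the same genomic window can fall on either side of the limit depending on display width.

```python
def mismatches_visible(
    window_bp: int, track_width_px: int, max_bp_per_pixel: float
) -> bool:
    """Return True if the view resolution is fine enough to draw mismatches.

    The limit applies to resolution (bp per pixel), not to the window size,
    so a wider display can show mismatches over a larger genomic window.
    max_bp_per_pixel is a hypothetical user-settable threshold.
    """
    return window_bp / track_width_px <= max_bp_per_pixel


LIMIT = 100  # hypothetical threshold in bp per pixel

# The same 140 kb window on two assumed display widths.
print(mismatches_visible(140_000, 1000, LIMIT))  # False (140 bp/pixel)
print(mismatches_visible(140_000, 2000, LIMIT))  # True  (70 bp/pixel)
```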

@kevfengler227
Author

sounds great. thanks!

@jrobinso jrobinso added this to the 2.19.0 milestone Aug 30, 2024
@baozg

baozg commented Sep 5, 2024

Related question: when loading IGV with more than 100 genomes (whole-genome alignment by minimap2 -x asm20), the speed is very slow. Is there any way to speed it up?

@kevfengler227
Author

Rather than performing whole-genome alignments, I typically align genomes chunked into consecutive 10 kb pieces, which is faster to align, and the alignments can be toggled by mapping quality and alignment score. If you add the genome name to the read group when running minimap2 and merge the resulting BAM files, 100 genomes is essentially the same as 100x Illumina coverage and is quite rapid to view in IGV.
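
As a rough sketch of the per-genome read group and merge steps described above, the Python below only builds the command lines; it does not run them. The helper names and file names are hypothetical. The minimap2 and `samtools sort` options mirror the command posted earlier in this thread, while the use of minimap2's `-R` read-group option and `samtools merge -o` is my assumption about how this could be wired together.

```python
import shlex
from typing import List


def chunk_alignment_command(
    reference: str, chunks_fasta: str, genome_name: str, out_bam: str, threads: int = 4
) -> str:
    """Build a shell pipeline that aligns one chunked genome and tags it.

    The genome name goes into the read group (ID and SM) so that IGV can
    color and group alignments by read group after merging.
    """
    # minimap2 expects the literal two-character sequence "\t" in the -R value.
    read_group = f"@RG\\tID:{genome_name}\\tSM:{genome_name}"
    return (
        f"minimap2 -ax map-hifi -t{threads} -R {shlex.quote(read_group)} "
        f"{shlex.quote(reference)} {shlex.quote(chunks_fasta)} "
        f"| samtools sort --write-index -o {shlex.quote(out_bam)} -"
    )


def merge_command(out_bam: str, per_genome_bams: List[str]) -> str:
    """Build the command that merges the per-genome BAM files into one."""
    inputs = " ".join(shlex.quote(bam) for bam in per_genome_bams)
    return f"samtools merge -o {shlex.quote(out_bam)} {inputs}"


# Hypothetical genome and file names, for illustration only.
genomes = ["genomeA", "genomeB"]
for name in genomes:
    print(chunk_alignment_command("ref.fasta", f"{name}.chunks.fasta", name, f"{name}.bam"))
print(merge_command("pangenome.bam", [f"{name}.bam" for name in genomes]))
```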

image

@kevfengler227
Author

kevfengler227 commented Sep 5, 2024

If you zoom out you can see the PAV in the genomes well, just not the SNPs

image

@kevfengler227
Author

coloring and grouping by read group is key

@kevfengler227
Author

kevfengler227 commented Sep 5, 2024

finally, if you number the chunks consecutively you know exactly where each one came from in the query, which is much better than using k-mers or other methods where coordinates are lost. Then you know you are looking at syntenic alignments when you mouse over a chunk and see that its chunk number (position) is similar to the reference
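
A minimal Python sketch of consecutive chunk numbering is shown below. The naming scheme (`<genome>_chunk<number>`) and the helper names are hypothetical, not necessarily what was used for the screenshots in this thread; the point is only that a consecutive number lets you recover the query coordinate.

```python
from typing import Iterator, Tuple


def chunk_sequence(name: str, sequence: str, chunk_size: int) -> Iterator[Tuple[str, str]]:
    """Yield (chunk_name, chunk_sequence) pairs in consecutive, 1-based order."""
    for number, start in enumerate(range(0, len(sequence), chunk_size), start=1):
        yield f"{name}_chunk{number:06d}", sequence[start:start + chunk_size]


def chunk_start(number: int, chunk_size: int) -> int:
    """Recover the 0-based query start coordinate from the chunk number."""
    return (number - 1) * chunk_size


def to_fasta(name: str, sequence: str, chunk_size: int) -> str:
    """Format all chunks of one sequence as FASTA records."""
    return "".join(f">{chunk_name}\n{chunk}\n"
                   for chunk_name, chunk in chunk_sequence(name, sequence, chunk_size))


# Toy example: a 10 bp "genome" cut into 4 bp chunks.
print(to_fasta("g1", "ACGTACGTAC", 4), end="")
print(chunk_start(3, 4))  # 8
```

With real data the chunk size would be something like 10,000 rather than 4, and the last chunk is simply shorter than the rest.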

@baozg

baozg commented Sep 5, 2024

Thanks for sharing! Chunking could be a good idea, but it also loses the ability to detect variation longer than the chunk length, or introduces ambiguous alignments (TEs). It is more like chaining by yourself, since you know the coordinates. I think it would be better if IGV used chunks in the browser but with more contiguous alignments. Actually, I use AnchorWave and wfmash more often, which produce nearly end-to-end alignments in A. thaliana (easier than maize). As an alternative approach to IGV, I use https://github.com/cmdcolin/jbrowse-plugin-mafviewer to convert my PAF to pseudo-MAF (which can only present SNPs or DELs)

image

@kevfengler227
Author

kevfengler227 commented Sep 5, 2024

you can use whatever chunk size you want for a given application depending on the level of similarity in the pangenome, typically 1-100 kb (aligned with map-hifi). With that you can see quite large INDELs. Again, you can control what is displayed by changing the visualization parameters in IGV, more so than with whole-genome alignments. Also, the directionality of chunks is indicative of inversions. For major differences, the lack of an aligned chunk is also informative.

so the real beauty of the chunked alignment approach is that it is highly parallelizable and rapid. One can do an all-by-all comparison in minutes, so that any or all references can be viewed in IGV with all queries on a whim. If you want to get fancy you can group your queries into various sub-groups, rather than 1 big one.
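
The earlier remark that chunk direction hints at inversions, and that a missing chunk is informative too, could be checked programmatically along these lines. This is only a sketch: the `(chunk_number, strand)` input format and the function name are hypothetical, and a reverse-strand chunk is a candidate to inspect, not proof of an inversion.

```python
from typing import Dict, Iterable, List, Set, Tuple


def summarize_chunks(
    aligned: Iterable[Tuple[int, str]], total_chunks: int
) -> Tuple[List[int], List[int]]:
    """Return (reverse_strand_chunks, missing_chunks) for one query genome.

    aligned holds one (chunk_number, strand) pair per alignment record,
    with strand given as '+' or '-'. Chunk numbers are 1-based and run
    from 1 to total_chunks.
    """
    strands_seen: Dict[int, Set[str]] = {}
    for number, strand in aligned:
        strands_seen.setdefault(number, set()).add(strand)

    # Chunks with at least one reverse-strand alignment: inversion candidates.
    reverse = sorted(n for n, strands in strands_seen.items() if "-" in strands)
    # Chunks that never aligned: possible absence or highly divergent sequence.
    missing = [n for n in range(1, total_chunks + 1) if n not in strands_seen]
    return reverse, missing


# Toy example: 5 chunks, chunk 2 aligned in reverse, chunks 3 and 5 unaligned.
print(summarize_chunks([(1, "+"), (2, "-"), (4, "+")], total_chunks=5))
# ([2], [3, 5])
```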

3 participants