Tutorial: Collapse redundant isoforms without genome
Last Updated: 2017/10/23
If you already have a genome, you can map your Iso-Seq output (HQ isoform fasta/fastq file) to a genome and run the collapse script directly as described in the tutorial here.
If you do not have a genome, you have two ways to collapse redundant isoforms:
- Use CD-HIT to group and collapse similar sequences
- Use Cogent to identify gene families and reconstruct coding contigs
The pros and cons of using CD-HIT:
- (pro) Fast turnaround, only need to install and run CD-HIT
- (pro) Easy to understand results, as it is simply clustering by sequence similarity
- (con) Does not provide exon-level information, so you still would not know which sequences are isoforms of the same gene
- (con) Does not provide additional filtering for quality of input sequence
The pros and cons of using Cogent:
- (con) Slow turnaround, requires installation of Cogent
- (con) Harder to understand results, since Cogent first partitions sequences into gene families, reconstructs contigs, then collapses them
- (pro) Provides isoform vs gene information; each partition is roughly a gene family
- (pro) Provides additional filtering for quality of input sequence
If no reference genome is available, one possibility is to use sequence clustering tools like CD-HIT. However, one must realize the difference between genome-based and sequence-based collapse, in that the latter:
- Does not provide exon-level information since no genomic mapping is available
- Does not provide additional filtering on the quality of input sequences (with genome-based collapse, one can additionally filter on alignment coverage and identity)
Thus in many cases it will be difficult to find CD-HIT parameters that exactly reproduce the collapse result a reference genome would give. We recommend reading the CD-HIT manual carefully to understand the different parameter choices and how they may affect the collapse result.
A few parameters that are most relevant are:
-c <sequence identity>
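In general, redundant transcripts will be highly similar, so using an identity threshold of 0.99 should work well.
-G, -aL, -AL, -aS, -AS: -G switches between global (1) and local (0) sequence identity, while -aL/-AL and -aS/-AS define the minimum alignment coverage for the longer and shorter of the two compared sequences, respectively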
Setting the minimum coverage higher makes merging stricter, so more redundancy may remain in the output. For example, we might use the following parameter set:
cd-hit-est -i <input> -o <output> -c 0.99 -T 6 -G 0 -aL 0.90 -AL 100 -aS 0.99 -AS 30
which says that we will merge sequences that are at least 99% identical, where the shorter sequence must align almost completely to the longer sequence (at least 99% coverage with no more than 30 bp unaligned) and the longer sequence must be at least 90% aligned with no more than 100 bp unaligned. Since we do not wish to collapse heterozygous sequences, we would always set the shorter-sequence criteria (-aS, -AS) to stringent thresholds, while using -aL and -AL to control how many additional nucleotides we allow in the longer sequence.
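To make the thresholds concrete, here is a minimal Python sketch of the pairwise test implied by the example parameters. It is not CD-HIT's actual algorithm (CD-HIT clusters greedily and computes its own alignments); it only checks whether a given pair would satisfy the identity and coverage cutoffs. The names `Thresholds` and `would_merge` are hypothetical, and the aligned lengths and identity are assumed to come from some alignment you already have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container mirroring the example cd-hit-est options."""
    identity: float = 0.99        # -c
    min_cov_long: float = 0.90    # -aL
    max_unaln_long: int = 100     # -AL
    min_cov_short: float = 0.99   # -aS
    max_unaln_short: int = 30     # -AS


def would_merge(len_a, len_b, aligned_a, aligned_b, identity, t=Thresholds()):
    """Return True if the pair satisfies every cutoff in `t`.

    len_*     : sequence lengths in bp
    aligned_* : number of bases of each sequence covered by the alignment
    identity  : fraction of identical bases within the alignment
    """
    # Order the pair so that "long" really is the longer sequence.
    if len_a >= len_b:
        long_len, long_aln, short_len, short_aln = len_a, aligned_a, len_b, aligned_b
    else:
        long_len, long_aln, short_len, short_aln = len_b, aligned_b, len_a, aligned_a

    return (
        identity >= t.identity
        and long_aln / long_len >= t.min_cov_long
        and long_len - long_aln <= t.max_unaln_long
        and short_aln / short_len >= t.min_cov_short
        and short_len - short_aln <= t.max_unaln_short
    )


if __name__ == "__main__":
    # A 1950 bp sequence almost fully contained in a 2000 bp one.
    print(would_merge(2000, 1950, 1945, 1945, 0.995))
    # A 1500 bp sequence leaves too much of the 2000 bp one unaligned.
    print(would_merge(2000, 1500, 1495, 1495, 0.995))
```

In this sketch, loosening only `min_cov_long` and `max_unaln_long` lets longer sequences absorb shorter ones that they fully contain, while the strict shorter-sequence cutoffs stay in place.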
We encourage users to experiment with different parameters and find the ones that best suit their needs.
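When comparing parameter sets, it can help to summarize how many clusters and singletons each run produced. The sketch below parses the cluster report that CD-HIT writes next to the output file (`<output>.clstr`). It assumes the usual layout, in which each cluster starts with a `>Cluster N` line and each member line contains `>name...`; the function names `parse_clstr` and `summarize` are hypothetical, and the format should be checked against your CD-HIT version.

```python
import re

# Member lines are assumed to look like: "0<TAB>2799nt, >seqA... *"
MEMBER_NAME = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Group member sequence names by cluster header."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = MEMBER_NAME.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def summarize(clusters):
    """Count clusters, singleton clusters, and total sequences."""
    sizes = [len(members) for members in clusters.values()]
    return {
        "clusters": len(sizes),
        "singletons": sum(1 for size in sizes if size == 1),
        "sequences": sum(sizes),
    }


if __name__ == "__main__":
    example = [
        ">Cluster 0",
        "0\t2799nt, >seqA... *",
        "1\t2790nt, >seqB... at +/99.60%",
        ">Cluster 1",
        "0\t1500nt, >seqC... *",
    ]
    print(summarize(parse_clstr(example)))
```

To use it on a real run, pass an open file handle, for example `parse_clstr(open("output.clstr"))`, where `output.clstr` stands for whatever name your run produced.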
Briefly, you would have to install and run Cogent, which is a separate GitHub repository.
You would use Cogent in the following manner:
1. Install and set up Cogent
2. Run Cogent to identify gene families
3. For each gene family, reconstruct coding "contigs"
4. Use the coding "contigs" from (3) as a "fake genome" to align and collapse redundant isoforms (tutorial)