IsoPhase: Haplotyping using Iso Seq data

Last Updated: 08/06/2021

NOTE: This is IsoPhase for phasing single-species whole transcriptome Iso-Seq samples. For metagenomics phasing (MagPhase), please go to the MagPhase GitHub instead

Prerequisite
Install IsoPhase
What you will need for phasing
Running IsoPhase
Summarizing IsoPhase output
Tagging BAM files with phasing information

Prerequisite

The following tools are required to run IsoPhase.

minimap2
samtools
pyvcf (install via BioConda using conda install -c conda-forge pyvcf)
pysam

Install IsoPhase

Follow the install steps for installing Cupcake. I will be using the Anaconda environment for the remainder of this tutorial.

# activate anaconda environment, here I assume it's called anaCogent
$ source activate anaCogent
(anaCogent)-bash-4.1$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(anaCogent)-bash-4.1$ cd cDNA_Cupcake
(anaCogent)-bash-4.1$ python setup.py build
(anaCogent)-bash-4.1$ python setup.py install

What you will need for phasing

Reference genome, in fasta format
Full-length (CCS/HiFi) reads, in fastq format (best if filtered for accuracy!, see below)
Annotation GFF file using Iso-Seq data, where each isoform has ID format PB.X.Y
Association file linking individual full-length reads to isoforms PB.X.Y

Reference Genome

This is the reference genome that is associated with the gene/isoform annotations. It must be in fasta format.

Full-Length Reads

This is the full-length (CCS/HiFi) reads. It must be in fastq format.

When you run Iso-Seq 3 through either Bioconda or SMRT Link, it will generate the full-length, non-concatemer read file, usually with the filename unpolished.flnc.bam. This BAM file can be converted to FASTQ using:

bamtools convert -format fastq -in unpolished.flnc.bam > flnc.fastq

Annotation GFF file

The GFF files should have a field transcript_id where the ID is in PB.X.Y format where X is the gene locus and Y is the isoform index.

For example:

chr1     PacBio  transcript      226419  242943  .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1     PacBio  exon    226419  229394  .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1     PacBio  exon    232706  232874  .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1     PacBio  exon    235860  236071  .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1     PacBio  exon    242773  242943  .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1     PacBio  transcript      229163  232874  .       -       .       gene_id "PB.1"; transcript_id "PB.1.2";
chr1     PacBio  exon    229163  229394  .       -       .       gene_id "PB.1"; transcript_id "PB.1.2";
chr1     PacBio  exon    232706  232874  .       -       .       gene_id "PB.1"; transcript_id "PB.1.2";

Here, PB.1.1 and PB.1.2 are two isoforms of the same gene (PB.1). It is important to know which isoforms belong to the same genes since IsoPhase combines all FL reads from the same gene for SNP calling and phasing.

Association file linking FL reads to isoforms

The association file must have at least two fields, id and pbid where id is the full-length (CCS/HiFi) read ID and pbid corresponds to the unique isoforms from the annotation GFF file.

For example:

id      length  is_fl   stat    pbid
m54033_171101_101105/52102116/ccs       4369    Y       unique  PB.1.1
m54033_171101_101105/15073981/ccs       4468    Y       unique  PB.1.1
m54033_171101_035054/42533716/ccs       4392    Y       unique  PB.1.1
m54033_171101_101105/21889610/ccs       4374    Y       unique  PB.1.2
m54033_171101_101105/6422907/ccs        4366    Y       unique  PB.1.2
m54033_171101_035054/15860654/ccs       4426    Y       unique  PB.1.2

Running IsoPhase

Selecting gene loci with sufficient coverage for phasing

NOTE: IsoPhase currently only calls substitution SNPs. It also will only call SNPs at positions with > 40-fold coverage.

First we will generate a by_loci directory where each by_loci/PB.X_sizeN contains information about the locus PB.X and N is the number of full-length reads associated with transcript isoforms PB.X.Y.

usage: select_loci_to_phase.py [-h] [-c COVERAGE]
                               genome_fasta flnc_filename gff_filename
                               stat_filename

positional arguments:
  genome_fasta          Reference genome fasta
  flnc_filename         FLNC fastq file
  gff_filename          GFF file of transcripts, IDs must be PB.X.Y
  stat_filename         Tab-delimited read stat file linking FLNC to PB.X.Y
  -c COVERAGE, --coverage COVERAGE
                        Minimum FLNC coverage required (default: 40)

For example, we will run

python <path_to>/cDNA_Cupcake/phasing/utils/select_loci_to_phase.py \
   genome.fasta isoseq_flnc.fastq \
   hq.fastq.no5merge.collapsed.filtered.gff \
   hq.fastq.no5merge.collapsed.read_stat.renamed.txt \
   -c 40

Generate and run IsoPhase commands

We first modify the template bash script run_phasing_in_dir.sh by making a local copy and editing it.

cp <path_to>/cDNA_Cupcake/phasing/utils/run_phasing_in_dir.sh .

Change PLOIDY to suit your data. PLODIY=2 means diploid and PLOIDY=4 means tetraploid, for example. MIN_BQ is the minimum QV used by IsoPhase for SNP calling.

PLOIDY=2
MIN_BQ=13

Optionally, you can uncomment and modify the following section to run GMAP (or minimap2) against the refernce genome. This will align the FL reads for each locus to the reference genome and generate an aligned, sorted BAM file.

#fq2fa.py ccs.fastq  # this can be removed in later
#gmap -D ~/share/gmap_db_new/ -d maize4 -f samse -n 0 -t 4 ccs.fasta > ccs.maize4.sam
#samtools view -bS ccs.maize4.sam > ccs.maize4.bam
#samtools sort ccs.maize4.bam > ccs.maize4.sorted.bam
#samtools index ccs.maize4.sorted.bam

Next, we will run the command to generate a per-locus bash script. After the command below, there should be a run.sh in each by_loci/PB.X_sizeN subdirectory.

ls -1d by_loci/*size*|xargs -n1 -i echo "bash run_phasing_in_dir.sh {}" > cmd
bash cmd

We can now run each of the by_loci/PB.X_sizeN/run.sh. Below is an example command to go through each subdirectory and generate the command used for it.

ls -1d by_loci/*size*|xargs -n1 -i echo "cd {}; bash run.sh; cd ../../" > cmd2
bash cmd2

You can split up cmd2 into chunks to run each of them in parallel.

Summarizing IsoPhase output

The two VCF output files in each subdirectory are:

phased.nopartial.cleaned.vcf   # phasing using reads that cover every SNP base
phased.partial.cleaned.vcf     # phasing using reads that cover at least one SNP base

With phased.nopartial.cleaned.vcf, the number of FL reads that cover every SNP base may be significantly reduced (since alternative splicing might mean some reads skip certain exons that contain SNPs) and the result is there may be fewer SNPs called and phasing may not be successful.

With phased.partial.cleaned.vcf , reads that do not have every SNP base called have to be inferenced. When there are very few SNPs, this may result in some reads being assigned to the wrong haplotypes. Users are advised to look at both VCF output to determine the quality of the results.

You can summarize the phasing results using:

python ~/GitHub/cDNA_Cupcake/phasing/utils/summarize_byloci_results.py

This generates a summary file called summarized.isophase_results.txt which shows the number of SNPs called (num_snp), the number of alleles inferenced with only reads that cover every SNP (num_hap_nopartial) and all reads (num_hap_withpartial).

locus   size    num_snp num_hap_nopartial   num_hap_withpartial
by_loci/PB.12426_size82 82  5   2   2
by_loci/PB.21892_size59 59  0   0   0
by_loci/PB.21890_size531    531 2   2   2

You can also collect all the VCFs from each separate directory into a single file using:

python ~/GitHub/cDNA_Cupcake/phasing/utils/collect_all_vcf.py

which generates a summarized SNP VCF file IsoSeq_IsoPhase.vcf. NOTE however, that this file only contains SNP information - but not isoform information, since that is specific to each gene locus.

You can further decide which directories (loci) to tally and which VCF output file to use:

usage: collect_all_vcf.py [-h] [-d DIR] [-o OUTPUT]
                          [--vcf {phased.partial.vcf,phased.partial.cleaned.vcf,phased.nopartial.vcf,phased.nopartial.cleaned.vcf}]
                          [-s SELECT_DIRS]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory containing the subdirs of IsoPhase (default:
                        by_loci/)
  -o OUTPUT, --output OUTPUT
                        Output VCF filename (default: IsoSeq_IsoPhase.vcf)
  --vcf {phased.partial.vcf,phased.partial.cleaned.vcf,phased.nopartial.vcf,phased.nopartial.cleaned.vcf}
                        VCF to use per directory (default:
                        phased.partial.cleaned.vcf)
  -s SELECT_DIRS, --select_dirs SELECT_DIRS
                        Comma separate list of directories to tally - if this
                        is used, <dir> is ignored

Tagging BAM files with phasing information

You can tag the aligned read BAM file with phasing information in the RG tag so you can visualize it in IGV.

$ python <path_to_Cupcake>/phasing/utils/tag_bam_post_phasing.py --help
usage: Tagging BAM files with phasing info [-h] read_bam hap_info output_bam

positional arguments:
  read_bam    Aligned BAM file that be tagged
  hap_info    Comma-delimited hap info CSV, must have column 'id' and
              'hap_postclean'
  output_bam  Output tagged BAM filename

Below is an example for tagging one particular locus:

cd by_loci/PB.5798_size209
python <path_to_Cupcake>/phasing/utils/tag_bam_post_phasing.py ccs.fasta.sorted.bam phased.nopartial.cleaned.hap_info.txt ccs.nopartial.tagged.bam
python <path_to_Cupcake>/phasing/utils/tag_bam_post_phasing.py ccs.fasta.sorted.bam phased.partial.cleaned.hap_info.txt ccs.partial.tagged.bam

This will add a RG:Z:<hap_info> tag to the BAM file which you can then color in IGV.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly