-
Notifications
You must be signed in to change notification settings - Fork 104
Iso Seq Single Cell Analysis: Recommended Analysis Guidelines
Last Updated: 08/23/2022
MAJOR UPDATE!!! As of 08.23.2022, the official Iso-Seq single-cell tools at isoseq.how represents the latest and recommended single-cell Iso-Seq analysis pipeline. Please refer to isoseq.how for the latest guidelines.
- Generate CCS Reads
- Read segmentation with Skera
- Remove cDNA primers, polyA tails, and extract UMI/BCs
- Barcode correction and UMI deduplication
- Align to Genome
- Collapse into Unique Transcripts
- Compare Against Annotation
- Filter Artifacts
- Process into CSV Report and (optional) UMI/BC Error Correction
- Producing a Seurat-compatible input from SQANTI3 isoforms
- Advanced: Linking molecule info to cell type, running IsoPhase, tagging BAMs
- Advanced: Calling variants using bcftools
- BioPython
- bamtools
- PacBio's isoseq3, which must be v3.7.0 or later!
- cDNA_Cupcake (this repository)
If you have SMRT Analysis installed, you already isoseq3 (need v3.7.0 and up). Otherwise, I highly recommend using AnaConda to install isoseq3 through pbbioconda.
The Cupcake scripts for single cell are all in cDNA_cupcake/singlecell subdirectory and do not require installation. You can execute the scripts directly.
Following is an example of installing BioPython and isoseq3 BioConda, then downloading cDNA_Cupcake:
# First, launch Conda environment
$ source activate anaCogent
(anaCogent)-bash-4.1$
# Install BioPython
(anaCogent)-bash-4.1$ conda install biopython
# Install bamtools and isoseq3
(anaCogent)-bash-4.1$ conda install -c bioconda bamtools
(anaCogent)-bash-4.1$ conda install -c bioconda isoseq3
# Download cDNA_Cupcake from GitHub
(anaCogent)-bash-4.1$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
# Installation of cDNA_Cupcake is optional,
# only if you are doing further processing after mapping to genome
(anaCogent)-bash-4.1$ cd cDNA_Cupcake
(anaCogent)-bash-4.1$ python setup.py build
(anaCogent)-bash-4.1$ python setup.py install
Please first read the official isoseq3 UMI/BC deduplication documentation.
For this tutorial, we assume that the UMIs and cell barcodes (BCs) are on the 3' end, between the polyA tail and 3' primer. The UMIs and BCs can be of any length.
Generate CCS reads. You can use either SMRT Link, SMRT Analysis, or CCS through pbbioconda. Refer to the ccs for isoseq3 section for the latest recommended ccs parameters for Iso-Seq.
The output from running CCS is either a ccs.bam
or a ConsensusReadSet ccs.consensusreadset.xml
. Multiple CCS BAMs can be combined into a single ConsensusReadSet dataset using dataset create
.
Refer to skera.how for deconcatenating MAS-Seq reads into segmented reads (S-reads).
skera split <movie>.hifi_reads.bam adapters.fasta <movie>.skera.bam
After skera, we are essentially left with just a regular CCS read as if it did not get concatenated!
We now run lima
, then isoseq3 tag
and isoseq3 refine
.
# Remove 5' and 3' cDNA primer
$ lima --isoseq --dump-clips [-j cpus] <movie>.skera.bam primers.fasta output.bam
# Clip UMI and cell barcode
$ isoseq3 tag [-j cpus] output.5p--3p.bam output.5p--3p.tagged.bam --design T-8U-12B
# Remove poly(A) tails and concatemer
$ isoseq3 refine [-j cpus] \
--require-polya \
output.5p--3p.tagged.bam primers.fasta output.5p--3p.tagged.refined.bam
lima must be run with --isoseq
mode to proper identify and 5' and 3' cDNA primers (if you're using 10x Chromium, you'd supply the 10x cDNA primers).
isoseq3 tag
is then called to extract the UMIs and single cell cell barcodes. Refer to isoseq.how/umi on how to designate the length and relative position of the UMI and single cell BC. In this example, T-8U-12B
designates that there's a 8bp UMI and a 12bp BC after the polyA tail. The UMI is stored in the XM
BAM tag and BC in the XC
BAM tag (see isoseq BAM tag page)
isoseq3 refine
removes the polyA tail (use the --require-polya
option).
Use isoseq3 correct
with a single cell barcode list (ex: see 10x single cell barcode list). I've made a compressed gzip file for two of the most common barcode sets, see here, note that for the 3' kit, the barcodes had to be reverse-complemented.
# barcode correction based on single cell barcode list
$ isoseq3 correct [-j cpus] --barcodes <barcode_list> \
output.5p--3p.tagged.refined.bam \
output.5p--3p.tagged.refined.corrected.bam
The corrected barcodes will be stored in CB
in the BAM tag.
Before running UMI deduplication, we need to sort the BAM file by the CB
tag first.
$ samtools sort -t CB \
output.5p--3p.tagged.refined.corrected.bam \
output.5p--3p.tagged.refined.corrected.sorted.bam
$ isoseq3 groupdedup \
output.5p--3p.tagged.refined.corrected.sorted.bam \
output.5p--3p.tagged.refined.corrected.sorted.dedup.bam
The output is xxxx.dedup.[bam|fasta]
which can be mapped to the genome for further analysis.
Either use pbmm2
(PacBio's minimap2) wrapper:
pbmm2 align [-j CPUS] \
--sort --preset ISOSEQ \
hg38.fasta \
dedup.bam \
mapped.bam
Or run minimap2 and convert to sorted BAM later:
minimap2 -t 30 -ax splice -uf --secondary=no -C5 \
hg38.fasta dedup.fasta \
> mapped.sam
samtools view -bS mapped.sam|samtools sort > mapped.bam
We will now be using isoseq3 collapse
as described in isoseq.how
isoseq3 collapse mapped.bam collapsed.gff
pigeon sort collapsed.gff -o collapsed.sorted.gff
We will use SQANTI3 to annotate each unique transcript against an annotation.
To run SQANTI3, you must provide a GTF annotation (such as GENCODE) and a reference genome fasta.
python sqanti3_qc.py --gtf dedup.5merge.collapsed.gff \
gencode.annotation.gtf hg38.fa \
[--fl_count dedup.5merge.collapsed.abundance.txt]
The output from SQANTI3 are dedup.5merge.collapsed_classification.txt
, dedup.5merge.collapsed_junctions.txt
, and a PDF file dedup.5merge.collapsed_sqanti_report.pdf
.
This step is optional, but I recommend using SQANTI3 filter to remove library artifacts. These are artifacts that can only be detected after mapping back to the genome, including intrapriming (accidental priming off genomic 'A's), RT switching artifacts, and mismapping to non-canonical junctions.
python sqanti3_RulesFilter.py \
dedup.5merge.collapsed_classification.txt \
dedup.5merge.collapsed_corrected.fasta \
dedup.5merge.collapsed.gff
The output filtered fasta/gff file is dedup.5merge.collapsed_classification.filtered_lite.[fasta|gff]
.
We then use the Cupcake/singlecell scripts to generate a collated CSV file that links each mapped FLNC read to its classified genes and transcripts based on SQANTI3's output:
gzip dedup.info.csv.gz
python <path_to>/cDNA_Cupcake/singlecell/collate_FLNC_gene_info.py \
dedup.5merge.collapsed.group.txt \
dedup.info.csv.gz \
dedup.5merge.collapsed_classification.filtered_lite_classification.txt \
dedup.annotated.csv
Following is an example output dedup.annotated.csv
:
id | pbid | length | transcript | gene | category | ontarget | ORFgroup | UMI | UMIrev | BC | BCrev |
---|---|---|---|---|---|---|---|---|---|---|---|
molecule/1359076 | PB.72530.1 | 879 | novel | WASH3P | novel_not_in_catalog | NA | NA | GGTATTCTAAGA | TCTTAGAATACC | TGCCGGATGTATTCCT | AGGAATACATCCGGCA |
molecule/3250406 | PB.10.8 | 436 | ENST00000624431.2 | FO538757.2 | incomplete-splice_match | NA | NA | AGTGAACATTGC | GCAATGTTCACT | TATCGAGTGTTCTATC | GATAGAACACTCGATA |
molecule/1830928 | PB.11.2 | 580 | ENST00000488147.1 | WASH7P | incomplete-splice_match | NA | NA | TTCCATTTAGGT | ACCTAAATGGAA | TATGGAGCTCACAGCC | GGCTGTGAGCTCCATA |
molecule/1869216 | PB.11.2 | 580 | ENST00000488147.1 | WASH7P | incomplete-splice_match | NA | NA | GTCTAGACGGAG | CTCCGTCTAGAC | CCACCTTGATGTACGT | ACGTACATCAAGGTGG |
We no longer recommend correcting error for UMI and BC (script UMI_BC_error_correct.py
) since isoseq3 dedup
already can handle mismatches and indels with --max-tag-mismatches
and --max-tag-shift
.
At this point, the output dedup.annotated.csv
can be imported into tools like R for manipulation. Happy analyzing!
After producing an annotated csv file (from either collate_FLNC_gene_info.py
or UMI_BC_error_correct.py
) and filtered gtf file from SQANTI3, you can use make_seurat_input.py
to generate a Seurat-compatible input. This script will make a sparse matrix based on the isoform counts per cell. By default, it will filter out RPS
, RPL
, MT-
, and novel
genes. These filters can be disabled by using the --keep_RiboMito
and --keep_novel
options.
python <path_to>/cDNA_Cupcake/singlecell/make_seurat_input.py \
-i dedup.annotated.(corrected.)csv(.gz) \
-a dedup.5merge.collapsed_classification.filtered_lite.gtf \
-o <path_to>/output/
make_seurat_input.py
will generate an output like this:
path/to/output
└── isoforms_seurat
├── barcodes.tsv
├── genes.tsv
└── matrix.mtx
This output can then be loaded into Seurat:
pbmc.data = Read10X(data.dir="<path_to>/output/isoforms_seurat")
You may follow the standard scRNA-seq processing tutorial by Seurat from this point on for clustering and marker genes discovery. The major difference comparing to traditional scRNA-seq is that we are working with isoforms instead of genes, so the standard QC metrics may need to be modified according to your goal. For example, isoforms tend to be more sparse in terms of counts compared to genes, and you may not want to apply a UMI count threshold that is too stringent and risk removing important isoforms.
In addition, a new tool called kana (Preprint available here) allows you to load the matrix directly in a browser to analyze the scIso-Seq data. It is extremely fast (~12k cells in a few mins in our experience, using a 13 inch MacBook 2019) and does most of the usual processing step including clustering and annotation.
If you already have dedup.annotated.csv.gz
from the previous step and you also have cell barcode-to-cell type information (say, from matching short read data), which must contain the BCrev
and Celltype
columns below (extra columns are ok and will be ignored):
# bc_info.csv example
BCrev,Celltype
AAACCCAAGACCAGAC,NA
AAACCCAAGGATCACG,CD4 Memory
AAACCCACAAACACCT,CD16 Mono
AAACCCACAACACTAC,CD4 Memory
Then you can add the Celltype column to your annotated CSV file using:
python <path_to>/singlecell/link_molecule_to_celltype.py \
bc_info.csv \
dedup.annotated.csv.gz \
dedup.annotated.celltype.csv
Note that dedup.annotated.celltype.csv
will be comma-delimited instead of tab-delimited (since cell types can have spaces, using comma is less prone to processing error later).
id | pbid | length | transcript | gene | category | ontarget | ORFgroup | UMI | UMIrev | BC | BCrev | Celltype |
---|---|---|---|---|---|---|---|---|---|---|---|---|
molecule/1359076 | PB.72530.1 | 879 | novel | WASH3P | novel_not_in_catalog | NA | NA | GGTATTCTAAGA | TCTTAGAATACC | TGCCGGATGTATTCCT | AGGAATACATCCGGCA | CD4 Naive |
molecule/3250406 | PB.10.8 | 436 | ENST00000624431.2 | FO538757.2 | incomplete-splice_match | NA | NA | AGTGAACATTGC | GCAATGTTCACT | TATCGAGTGTTCTATC | GATAGAACACTCGATA | NA |
molecule/1830928 | PB.11.2 | 580 | ENST00000488147.1 | WASH7P | incomplete-splice_match | NA | NA | TTCCATTTAGGT | ACCTAAATGGAA | TATGGAGCTCACAGCC | GGCTGTGAGCTCCATA | CD14 Mono |
molecule/1869216 | PB.11.2 | 580 | ENST00000488147.1 | WASH7P | incomplete-splice_match | NA | NA | GTCTAGACGGAG | CTCCGTCTAGAC | CCACCTTGATGTACGT | ACGTACATCAAGGTGG | CD4 Memory |
This celltype-annotated CSV file can be used to tag BAM files later once we run IsoPhase.
We will first run IsoPhase using the dedup reads and the collapsed results. We will be essentially following the same IsoPhase steps as used for bulk Iso-Seq data, except that we do not have a .read_stat.txt
file that we must convert from the collapsed .group.txt
from step 7.
python <path_to>/phasing/utils/convert_group_to_read_stat_file.py \
dedup.5merge.collapsed.group.txt \
dedup.5merge.collapsed.read_stat.txt
Now we have a .read_stat.txt
that we can input to IsoPhase. Follow the IsoPhase tutorial all the way through, which will include commands like below:
python <path_to>/phasing/utils/select_loci_to_phase.py \
hg38.fa \
dedup.fasta \
dedup.5merge.collapsed_classification.filtered_lite.gtf \
dedup.5merge.collapsed.read_stat.txt \
-c 40
where dedup.5merge.collapsed_classification.filtered_lite.gtf
is the result of running SQANTI3 filtering from step 9.
Then move on to generating IsoPhase commands for each locus and run them.
Suppose we found a locus that we want to tag both the haplotype information and the cell type information. We would first align the input reads (they're from dedup.fasta
, but in each folder they would be called ccs.fasta
) using GMAP or minimap2, then tag the aligned BAM file with information from IsoPhase and the celltype-annotated CSV file from step (i).
# going to a locus of interest
cd by_loci/PB.211762_size130
# align the input (dedup reads) to genome to get a BAM file; use GMAP or minimap2
minimap2 -ax splice -uf --secondary=no -C5 hg38.fa ccs.fasta|samtools view -bS|samtools sort > ccs.hg38.sorted.bam
# tag the BAM file
python <path_to>/phasing/utils/tag_bam_post_phasing.py \
ccs.hg38.sorted.bam \
phased.partial.cleaned.hap_info.txt \
ccs.hg38.sorted.tagged.bam \
--celltype dedup.annotated.celltype.csv
samtools index ccs.hg38.sorted.tagged.bam
Now the output ccs.hg38.sorted.tagged.bam
will contain cell type information using the XC
tag and the haplotype information using the RG
tag.
You can use IGV to visualize by first choosing "Color Alignments by" --> "tag" --> type in "RG". Then "Sort Alignments by" --> "tag" --> type in "XC".
The IGV screenshot will look something like this:
HiFi reads are very accurate. While IsoPhase will generate a VCF file, it contains only heterozygous variants and does not call variant with respect to a reference genome. For traditional variant calling, we can use bcftools
to directly call variants from Iso-Seq reads. We will start with dedup.mapped.bam
generated from mapping the de-duplicated Iso-Seq reads (After isoseq3 dedup
) to a reference genome. bcftools
contains a preset designed for PacBio HiFi reads. The option --indel-size 500
allows bcftools to be more sensitive to larger indel in our experience. Example of using hg38 to call variants:
bcftools mpileup --threads 32 -f hg38.fasta dedup.mapped.bam \
-X pacbio-ccs --indel-size 500 --annotate FORMAT/AD,FORMAT/DP | \
bcftools call -f GQ --threads 32 -vm -Ov | \
bcftools norm --threads 32 -c s -f hg38.fasta - | \
bcftools filter --threads 32 -i 'QUAL > 20 && INFO/DP > 4' > bcftools_variants.vcf
In the above, we applied a simple filter of QUAL
> 20 and coverage depth of 4 to remove many low quality variants. It is important to note that variant calling in RNA-seq or Iso-Seq is generally challenging due to the fluctuation in expression level across genes/isoforms (i.e. lower sensitivity in lowly-expressed isoforms). To complicate things further, isoforms are sparse in single-cells sequencing resulting in allelic drop-out in cells.
With a set of paediatric AML MAS-Iso-Seq data consisting of cells from a HCT donor and recipient, we showed the accuracy of the variants (called with bcftools) by demultiplexing the donor and recipient using cellsnp-lite
and vireo
.
First we can filter to just the heterozygous variants and set an even more stringent QUAL
score (This is set arbitrarily) to ensure we have the highest quality variants for genotype demultiplexing:
bcftools filter -i 'GT="0/1" && QUAL > 51' bcftools_variants.vcf > bcftools_variants.hq.vcf
In single-cells sequencing, there are usually many "empty" cell barcodes that has low or no transcripts associated. The file "dedup.annotated.csv" allow us to count the number of UMIs associated with cell barcodes directly (You can also use R/Python to do this):
cut -f11 dedup.annotated.csv | \
datamash -g 1 --sort count 1 | \
sort -rnk2 > barcodes_UMI_count.tsv
awk '$2>100{print $1}' barcodes_UMI_count.tsv > barcodes_100UMI.tsv
Next, we "pile-up" the alleles in each single cell with cellsnp-lite
and demultiplex the donor and recipient using vireo
. Note that the isoseq
pipeline currently "tag" the cell barcodes as the XC
tag and the UMI as XM
tag in the BAM file, so we need to set that in the parameters of cellsnp-lite
.
cellsnp-lite --cellTAG XC --UMItag XM -s dedup.mapped.bam \
-b barcodes_100UMI.tsv -R bcftools_variants.hq.vcf \
--minMAF 0.1 --minCount 20 --gzip \
-O cellsnp-lite_bcftools_het
vireo -c cellsnp-lite_bcftools_het -N 2 \
-o vireo_bcftools_het \
--callAmbientRNAs
vireo
will generate a few output files. The file donor_ids.tsv
will assign a genotype to each of the cell barcodes assigned using the alleles called in the cell.