Skip to content

Latest commit

 

History

History
356 lines (280 loc) · 15.7 KB

README.md

File metadata and controls

356 lines (280 loc) · 15.7 KB

PipeRNAseq

A comprehensive pipeline for RNAseq data analysis.

This pipeline has been used in the following publications:

  1. Li, J., Byrne, K.T., Yan, F., ..., Sun, Y.H., ..., 2018. Tumor cell-intrinsic factors underlie heterogeneity of immune cell infiltration and response to immunotherapy. Immunity, 49(1), pp.178-193.

  2. Markosyan, N., Li, J., Sun, Y.H., ..., 2019. Tumor cell–intrinsic EPHA2 suppresses antitumor immunity by regulating PTGS2 (COX-2). The Journal of clinical investigation, 129(9), pp.3594-3609.

  3. Ye, X., Yang, Y., ..., Sun, Y.H., ..., 2022. Genomic signatures associated with maintenance of genome stability and venom turnover in two parasitoid wasps. Nat Commun, 13, 6417

  4. Zhu, J.#, Chen, K.#, Sun, Y.H.#, ..., 2023. LSM1-mediated Major Satellite RNA decay is required for nonequilibrium histone H3. 3 incorporation into parental pronuclei. Nat Commun, 14, 957 (#co-first authors)

  5. Liu, Z., Zhang, X., ..., Sun, Y.H.$ and Huo, J.$ 2023. Long- and short-read RNA sequencing from five reproductive organs of boar. Sci Data, 10, 678 ($co-corresponding authors)

Software prerequisites

This pipeline is designed to run on Linux servers (but can also be used on Mac OS), and requires the following software:

They need to be installed and added to the $PATH before using the pipeline.

bowtie2
STAR
bedtools
samtools
salmon
featureCounts (from Subread)
fastqc (optional)
cufflinks (optional)
sra-tools (optional)

The above software can also be installed using conda, as below:

#Create pipernaseq environment
conda create --name pipernaseq
conda install -n pipernaseq -c bioconda bowtie2
conda install -n pipernaseq -c bioconda star
conda install -n pipernaseq -c bioconda bedtools
conda install -n pipernaseq -c bioconda samtools
conda install -n pipernaseq -c bioconda subread
conda install -n pipernaseq -c bioconda fastqc
conda install -n pipernaseq -c bioconda git
#We ignore cufflinks since we usually don't need it.

#This environment does not seem compatible with salmon. Even the conda install may work, you may get errors when using salmon.
#We can download salmon and install it separately (example for Linux):
wget "https://github.com/COMBINE-lab/salmon/releases/download/v1.9.0/salmon-1.9.0_linux_x86_64.tar.gz"
tar -xvzf salmon-1.9.0_linux_x86_64.tar.gz
#Finally, add the salmon-1.9.0_linux_x86_64/bin directory to PATH

#Optional: add sra-tools to the env if you plan to download public data from GEO using fastq-dump
conda install -n pipernaseq -c bioconda sra-tools

#Create another env for multiqc, due to the conflict with pipernaseq:
conda create --name multiqc_env
conda install -n multiqc_env -c bioconda multiqc

The main pipeline script is PipeRNAseq.sh, and the dependencies are in the ./bin folder.

One UCSC tool (from http://hgdownload.soe.ucsc.edu/admin/exe/) is used: bedGraphToBigWig. If the binary files under the ./bin folder are not working (Execute ./bin/bedGraphToBigWig but get errors), please re-download them by choosing the correct platform (e.g. linux.x86_64, or macosx.x86_64).

For Mac OS users, the bedGraphToBigWig compiled on Mac OS can be downloaded here: http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64. It needs to be saved in ./bin folder to replace the current one.

Also, for Mac OS, set the pipeline home directory at PipeRNAseq.sh line 59 manually to your local directory (and comment out line 57): HomeDir="/Users/yusun/Downloads/PipelineHomeDir"

Pipeline setup

Here is an example of mm10 genome setup. If you have installed PipeRiboseq or PipeSmRNAseq pipeline before, then some folders such as the bin, genome folders (mm10, hg38, etc.) are compatible and many files can be shared.

0, Files

For mm10, hg38, rn6, sacCer3 and susScr11, some annotation files have been prepared in the repository.

For other species, follow the pipeline component of mm10 to generate dependent files.

1, Download scripts from GitHub to the Linux server:

#Activate conda env if you installed software through conda:
conda activate pipernaseq

git clone https://github.com/sunyumail93/PipeRNAseq.git
mv PipeRNAseq PipelineHomeDir

#Also, add PipelineHomeDir to PATH so PipeRNAseq.sh can be recognized:
PATH=$PATH:/path/to/PipelineHomeDir

2, Set up index files for genome mapping

2a, Download whole genome fasta sequences and chromosome sizes from UCSC goldenpath:

cd PipelineHomeDir/mm10/Sequence
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip *.gz
wget "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes" -O mm10.ChromInfo.txt

2b, Extract RNA sequences

#Go to PipelineHomeDir/mm10/Annotation
cd ../Annotation
gunzip *.gz
cd ../Sequence
bedtools getfasta -s -split -name -fi mm10.fa -bed ../Annotation/mm10.RefSeq.reduced.bed12 -fo mm10.RefSeq.reduced.bed12.fa.t
#Genome FASTA index .fai file will also be generated after running the above command
cat mm10.RefSeq.reduced.bed12.fa.t|sed 's/::.*//' > mm10.RefSeq.reduced.bed12.fa
rm -rf mm10.RefSeq.reduced.bed12.fa.t

2c, Set up Index files:

#Create directory: PipelineHomeDir/mm10/Index
mkdir ../Index
cd ../Index

#STAR index generation. Better to run it in the background as it may take > 1h:
mkdir STARIndex
STAR --runMode genomeGenerate --genomeDir STARIndex --genomeFastaFiles ../Sequence/mm10.fa --sjdbGTFfile ../Annotation/mm10.RefSeq.reduced.bed12.geneid.gtf --sjdbOverhang 100

#salmon index (SalmonIndex directory will be created automatically)
#It will take ~2 min to build mm10 transcriptome index
salmon index -t ../Sequence/mm10.RefSeq.reduced.bed12.fa -i SalmonIndex
#Optional: if you would like to include transposon consensus sequences, you can also add the following index:
cat ../Sequence/mm10.RefSeq.reduced.bed12.fa ../Sequence/mm10.TE.fa > ../Sequence/mm10.RefSeq.reduced.bed12.WithTE.fa
#The index name has to be SalmonIndexWithTE for the pipeline to recognize
salmon index -t ../Sequence/mm10.RefSeq.reduced.bed12.WithTE.fa -i SalmonIndexWithTE --type puff -k 31 --keepDuplicates
#This uniqMatchingWithTE.txt file will be used in DESeq2 analysis when importing salmon results
cat ../Annotation/mm10.uniqMatching.txt ../Annotation/TE.list > ../Annotation/mm10.uniqMatchingWithTE.txt   

#rRNA bowtie2 index:
mkdir rRNAIndex
bowtie2-build ../Sequence/mm10.rRNA.fa ./rRNAIndex/rRNAIndex

4, Add executable permissions

#Go back to PipelineHomeDir
cd ../../

chmod +x PipeRNAseq.sh
chmod +x ./bin/bedGraphToBigWig

Pipeline components

PipelineHomeDir/
    ├── PipeRNAseq.sh
    ├── bin/
    └── mm10/
      └── Annotation/
        ├── mm10.RefSeq.reduced.bed12
        ├── mm10.RefSeq.reduced.mRNA.bed12
        ├── mm10.RefSeq.reduced.bed12.geneid.gtf
        └── mm10.uniqMatching.txt
      └── Index/
        ├── rRNAIndex/
        ├── SalmonIndex/
        └── STARIndex/
      └── Sequence/
        ├── mm10.fa
        ├── mm10.fai
        ├── mm10.ChromInfo.txt
        ├── mm10.rRNA.fa
        └── mm10.RefSeq.reduced.bed12.fa
    └── hg38/
       ...

Notes:

1, For the Annotation folder, download GTF file from UCSC table browser. reduced: Only one location was chosen when one gene duplicates at multiple genomic loci. For more details about preprocessing the genome annotation files, see the Preprocessing tutorial.

2, The uniqMatching.txt file contains one-to-one matching from transcript to gene name.

3, For the Index folder, indexes are not included in this Github repository, but need to be created during set up.

4, For the Sequence folder, RefSeq.reduced.bed12.fa was converted from RefSeq.reduced.bed12 file using bedtools, and removed two rRNA genes (rRNA affects salmon quantification TPM). genome.fa and ChromInfo.txt files need to be downloaded from UCSC goldenpath. The fai index file is not required now sine it will be generated by samtools when needed.

Usage

Type the pipeline name, then you will see the manual page:

PipeRNAseq.sh

Manual page:

Examples

A regular run using mostly default parameters:

#Single-end RNAseq
PipeRNAseq.sh -i Data.fastq.gz -g mm10
#Paired-end RNAseq
PipeRNAseq.sh -l Data.R1.fastq.gz -r Data.R2.fastq.gz -g mm10

More parameters used, not run fastqc, run featureCounts using unique mapping reads (to pair with a Riboseq data), and generate bigWig tracks:

PipeRNAseq.sh -l Data.R1.fastq.gz -r Data.R2.fastq.gz -g mm10 -noqc -p 4 -pairrpf -bigWig

Run real data to test the pipeline

1, Download data

Use a public dataset: GEO SRA: SRR10446759

fastq-dump is part of NCBI SRA Toolkit.

If you included sra-tools in your conda environment pipernaseq, you can use fastq-dump directly:

#For single-end data
fastq-dump --split-3 SRR10446770
gzip SRR10446770.fastq

#For paired-end data
fastq-dump --split-3 SRR12990746
mv SRR12990746_1.fastq SRR12990746.R1.fastq
mv SRR12990746_2.fastq SRR12990746.R2.fastq
gzip SRR12990746*

2, Run PipeRNAseq.sh pipeline:

PipeRNAseq.sh -i SRR10446770.fastq.gz -g mm10 -p 8
PipeRNAseq.sh -l SRR12990746.R1.fastq.gz -r SRR12990746.R2.fastq.gz -g mm10 -p 8

3, Loop over multiple datasets

If you have multiple RNAseq data to process, first write a loop to put each data into a separate folder, then submit PipeRNAseq.sh within each folder.

#Original data hierarchy
RNAseq/
    ├── Data1.RNAseq.R1.fastq.gz
    ├── Data1.RNAseq.R2.fastq.gz
    ├── Data2.RNAseq.R1.fastq.gz
    └── Data2.RNAseq.R2.fastq.gz
for i in `for name in *R?.fastq.gz*;do echo ${name%.R*};done|uniq`;do echo $i;mkdir $i;done
for name in *RNAseq;do echo $name;mv $name.R?.fastq.gz $name/;done

#After running the above two commands:
RNAseq/
    ├── Data1.RNAseq
        ├── Data1.RNAseq.R1.fastq.gz
        └── Data1.RNAseq.R2.fastq.gz
    └── Data2.RNAseq
        ├── Data1.RNAseq.R1.fastq.gz
        └── Data1.RNAseq.R2.fastq.gz
        
#Finally submit the pipeline:
#If you are running this on an HPC cluster, it's better to create a sbatch script using this loop, then submit them to run parallelly.
for name in *Data;do echo $name;cd $name;PipeRNAseq.sh -l $name.R1.fastq.gz -r $name.R2.fastq.gz -g mm10 -p 8 -bigWig;cd ../;done

#After the pipeline finishes, you will get a list of outputs:
RNAseq/
    ├── Data1.RNAseq/
        ├── fastqc/
            ├── Data1.RNAseq.fastqc.log
            ├── Data1.RNAseq.R1_fastqc.html
            ├── Data1.RNAseq.R1_fastqc.zip
            ├── Data1.RNAseq.R2_fastqc.html
            └── Data1.RNAseq.R2_fastqc.zip
        ├── feature_counts/
            ├── Data1.RNAseq.mm10.featureCounts.FullTable.txt
            ├── Data1.RNAseq.mm10.featureCounts.FullTable.txt.summary          #FullTable.txt is the original output generated by featureCounts
            ├── Data1.RNAseq.mm10.featureCounts.gene.txt                       #This output file, containing gene length and raw counts, is sufficient for regular gene expression analysis
            ├── Data1.RNAseq.mm10.featureCounts.log
            ├── Data1.RNAseq.mm10.featureCounts.unique.All.CDS.gene.txt        #Only use unique mapping reads to count, only on mRNA CDS and lncRNA full length regions. 
            ├                                                                  #This is to pair with a Riboseq data
            ├── Data1.RNAseq.mm10.featureCounts.unique.FullTable.txt
            ├── Data1.RNAseq.mm10.featureCounts.unique.FullTable.txt.summary
            ├── Data1.RNAseq.mm10.featureCounts.unique.gene.txt
            ├── Data1.RNAseq.mm10.featureCounts.unique.lncRNA.gene.txt
            ├── Data1.RNAseq.mm10.featureCounts.unique.log
            ├── Data1.RNAseq.mm10.featureCounts.unique.mRNA.CDS.FullTable.txt
            ├── Data1.RNAseq.mm10.featureCounts.unique.mRNA.CDS.FullTable.txt.summary
            ├── Data1.RNAseq.mm10.featureCounts.unique.mRNA.CDS.gene.txt
            └── Data1.RNAseq.mm10.featureCounts.unique.mRNA.CDS.log
        ├── genome_mapping/
            ├── Data1.RNAseq.mm10.Log.final.out
            ├── Data1.RNAseq.mm10.Log.out
            ├── Data1.RNAseq.mm10.Log.progress.out
            ├── Data1.RNAseq.mm10.SJ.out.tab
            ├── Data1.RNAseq.mm10.sorted.bam                     #bam alignment file
            ├── Data1.RNAseq.mm10.sorted.bam.bai                       
            ├── Data1.RNAseq.mm10.sorted.unique.bam              #Unique mapping reads only
            ├── Data1.RNAseq.mm10.sorted.unique.bam.bai
            ├── Data1.RNAseq.rRNA.log
            └── Data1.RNAseq.STAR.log
        ├── Data1.RNAseq.R1.fastq.gz
        ├── Data1.RNAseq.R2.fastq.gz
        ├── Data1.RNAseq.summary
        ├── salmon_results/
            ├── Data1.RNAseq.mm10.quant.sf                        #salmon quantification result. Can be imported to DESeq2 using tximport
            ├── Data1.RNAseq.mm10.salmon.log
            └── Data1.RNAseq.salmon/
        ├── salmonWithTE_results
            ├── Data1.RNAseq.mm10.salmonWithTE.log
            ├── Data1.RNAseq.mm10.WithTE.quant.sf                 #salmon quantification including TE
            └── Data1.RNAseq.salmonWithTE/
        └── tracks
            ├── Data1.RNAseq.mm10.sorted.bedGraph.bw              #bigWig track for un-stranded RNAseq
            ├── Data1.RNAseq.mm10.sorted.minus.bedGraph.bw        #bigWig track files for stranded RNAseq
            └── Data1.RNAseq.mm10.sorted.plus.bedGraph.bw
    └── Data2.RNAseq/
        ...

#You can also run MultiQC under the project directory (outside all Data.RNAseq folders) to summarize the results
#See more details: https://multiqc.info/
#Deactivate the current env:
conda deactivate

#Activate multiqc env and run it:
conda activate multiqc_env
multiqc .

Important results

1, Gene and transcript quantification results

Data.mm10.featureCounts.gene.txt                                   #All genes
Data.mm10.featureCounts.unique.All.CDS.gene.txt                    #All genes, mRNA CDS region counts (full length for lncRNA)
Data.mm10.quant.sf                                                 #Salmon quantification of all transcirpts
Data.mm10.WithTE.quant.sf                                          #Salmon quantification of all transcirpts + transposons

2, bigWig Tracks

Data.mm10.sorted.bedGraph.bw
Data.mm10.sorted.plus.bedGraph.bw
Data.mm10.sorted.minus.bedGraph.bw

Downstream analysis

The output quantification results are fully compatible with easyRNAseqDE, which can perform downstream differential expression (DE) analysis.

After running easyRNAseqDE, we suggest further visualization using Quickomics.

The bigWig track files can be uploaded to the UCSC Genome Browser or opened in R/Python for figure generation.