In this module we will analyze Illumina gene expression data generated from a FASTQ file (wt_mRNA_100K_reads.fq
). This data file contains a sampling of reads from RNA sequencing of wild-type seedlings. For your reference, please access the article that describes the age of these seedlings, their genotype, how they were treated, as well as for that of the pifq mutant. This information should be recorded in the Materials and Methods section of your lab report
We will use BWA to align this data to Arabidopsis genes. BWA is a fast and efficient short read aligner geared towards quickly aligning large sets of short DNA/RNA sequences (reads) to large genomes/transcriptomes. BWA takes an index and a set of reads as input and outputs a list of alignments in a SAM file SAM stands for 'Sequence Alignment/Map'.
We will use Samtools to access and manipulate the alignments and a Perl script to count the number of reads mapping to each gene.
You will need the following files. They are available on GitHub at https://github.com/mfcovington/BIS180L-RNAseq.
-
cDNA and CDS (coding DNA sequence) FASTA reference files from TAIR (The Arabidopsis Information Resource)
TAIR10_cdna_20110103_representative_gene_model_updated.fa
TAIR10_cds_20110103_representative_gene_model_updated.fa
-
Sample FASTQ file to map to the FASTA files:
wt_mRNA_100K_reads.fq
-
A Perl script to count the # of reads mapping to each gene:
bam2counts.pl
-
A markdown version of these instructions is also available in the git repository:
README.md
Documentation for tools used for mapping reads and visualizing alignments can be found online:
- BWA: http://bio-bwa.sourceforge.net/bwa.shtml
- Samtools: http://samtools.sourceforge.net/samtools.shtml
First, we will index the FASTA reference file. This is required before we can map using bwa
. Mapping requires a few steps. The first mapping step uses bwa aln
to create a sequence alignment index. This .sai
file is converted into a SAM file using bwa samse
or bwa sampe
depending on whether your FASTQ file contains single read or paired end data. SAM files can be further converted to BAM files (described below) using samtools
.
You'll notice that we are assigning long file names and frequently used ID values to variables. This makes our code more readable, reusable, and typo-resistant. Be careful though. Variables will not be defined in other terminal windows. To confirm that the variable contains what you expect, you can check its contents with: echo $VARIABLE_NAME
.
Build index of FASTA reference file before mapping:
FA=TAIR10_cdna_20110103_representative_gene_model_updated.fa
bwa index $FA
Map reads in FASTQ file to FASTA reference:
ID=wt_mRNA_100K_reads
bwa aln $FA $ID.fq > $ID.sai
bwa samse $FA $ID.sai $ID.fq > $ID.sam
samtools view -Sb $ID.sam > $ID.bam
SAM files are plain-text and can be visualized with commands such as head
and less
. BAM files contain all the information that the corresponding SAM file contains, but is significantly smaller in size. Since they are binary format, BAM files require a tool like samtools
for visualization.
The following are equivalent (note: -S
turns off word wrap for less
):
less -S $ID.sam
samtools view -h $ID.bam | less -S
These views start by listing every sequence ID from the FASTA reference file. Samtools lets you easily ignore this header (by leaving off the -h
flag) and get right to the actual alignments.
samtools view $ID.bam | less -S
You'll see something like this:
SRR477075.1 4 * 0 0 * * 0 0 NGAAACTTCTGATCGTCATGGAAGCTACTTCACAAC #1:BDDEFDHFDFIGFHIIIIIIIBEHBEHBEHGGI
SRR477075.2 16 AT1G52300.1 114 25 36M * 0 0 TTACCCCGCCGCTCGCAAGAGGACTTACAACTGGAN ####FIHEIIHHBIIIIIIIIIHHHGHHFFFDD=1# XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:3T31G0
SRR477075.3 4 * 0 0 * * 0 0 CCTCGTTCAAGTCAAGTTGTTGGATGGCCTCCTATA @@@DFDDDHHHFHIGGHDHIGGIE<G?EH+?GHEHE
SRR477075.4 16 AT2G38530.1 205 37 36M * 0 0 GACCGTCAGCAAGCTTGCCGTTGCCTTCAATCTGCC D@FF?<CJIGIGHBBIGHGIIJIHHHHHFFFDD@@@ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
...
There is a lot of information here. The SAM format is described in full in the Samtools documentation at http://samtools.sourceforge.net/samtools.shtml#5; however, a few key facts about the SAM format will get you started:
- Each read from the FASTQ file is represented by one line in the SAM/BAM file (The read ID is the 1st column)
- The 3rd column indicates the reference sequence that the read maps (if the read fails to map, you will see a
*
) - The 4th column is the left-most position of the mapped read
- The 5th column is the mapping quality (MAPQ; the higher the better)
In order to do differential gene expression analysis, we first need to know how many reads map to each gene.
Count the number of reads mapping to each gene:
perl bam2counts.pl counts.tsv $ID.bam
Let's see what the output looks like:
head counts.tsv
wt_mRNA_100K_reads
AT1G01010.1 2
AT1G01020.1 4
AT1G01030.1 0
AT1G01040.2 7
AT1G01050.1 5
AT1G01060.1 2
AT1G01070.1 0
AT1G01073.1 0
AT1G01080.2 2
In order to be used to their full potential, BAM files need to be sorted and indexed. This allows the alignment information to be accessed much quicker and many downstream tools require an index (e.g., Integrated Genomics Viewer, IGV).
Sort & Index BAM file:
samtools sort $ID.bam $ID.sorted
samtools index $ID.sorted.bam
Let's see how the sorted BAM file looks (compare this to how the unsorted BAM file looked above)
samtools view $ID.sorted.bam | less -S
SRR477075.7939 16 AT1G50920.1 50 37 36M * 0 0 AGTTCGTTGACATCATCCTTTCACGGACTCAGCGTC FEIGBJJIJJGGCFEGJHGGEGBHFHHGFFFFF@@C XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
SRR477075.42057 16 AT1G50920.1 96 37 36M * 0 0 TGTTGTCCACAAGGGTTACAAGATTAACCGTCTCCG 3;D3C:1:113CA?A:,A??<EC48C0DDDD:B?;? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
SRR477075.51923 16 AT1G50920.1 130 37 36M * 0 0 CGTCAGTTCTCCATGAGAAAGGTTAAGTACACTCAG EE@IEFEFC++B<FIA+E<<@BF>FB?DA22?A@?? XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10A25
SRR477075.53014 0 AT1G50920.1 742 37 36M * 0 0 CGAGCTGCTGTTTTGTTCTTTCTCGACATTTCTGGA CCBFFFFFHHHHHJJIIGJJJHIIHJJIIIJIIIII XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
...
What are three pros/cons of SAM vs BAM format?
How does the TAIR10 CDS reference differ from the TAIR10 cDNA reference? How does the bam2counts.pl
output for AT1G20620.1
differ for CDS vs cDNA and why?
Mapping multiple FASTQ files and subsequent read counting can be automated as in this example:
FA=TAIR10_cdna_20110103_representative_gene_model_updated.fa
for ID in sample1 sample2 sample3 sample4; do
bwa index $FA
bwa aln $FA $ID.fq > $ID.sai
bwa samse $FA $ID.sai $ID.fq > $ID.sam
samtools view -Sb $ID.sam > $ID.bam
done
perl bam2counts.pl counts.tsv *.bam
What are at least two major advantages to such automation?
There are many different parameters/options available tools like bwa
and samtools
. How much does filtering out reads with lower mapping qualities (MAPQ) affect the # of reads mapped?
The command to remove mapped reads with MAPQ less than 20 is:
samtools view -Sb -q20 $ID.sam > $ID.q20.bam