-
Notifications
You must be signed in to change notification settings - Fork 204
CBW 2024 Advanced Module 1: Introduction to metagenomics and read‐based profiling
This tutorial is part of the 2024 Canadian Bioinformatics Workshops Advanced Microbiome Analysis (St John's, NL, May 29-30).
Author: Ben Fisher
The goal of this tutorial is to familiarize students with processes by which we analyze metagenomic data. Shotgun Metagenomic Sequencing, sometimes called MGS or WGS, is capable of capturing any DNA extracted from a given sample (however, this does not necessarily mean it captures ALL of the DNA). With MGS reads, we must consider that there will be significant host contamination, so we must filter against host sequences in our pipeline. There are many tools that can work to produce similar results, and this is not a one-size-fits-all solution. This is instead a foray into popular tools and processes, and how to appropriately use them for our analyses.
Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them.
- Setup, Importing and Formatting Data
- Remove Rare Taxa and Rarefy
- Alpha Diversity
- Beta Diversity
- Visualization with Stacked Bar Charts
- Visualization with Heatmaps
Copy the files to your working directory.
cp -r ~/CourseData/MIC_data/AMB_data/raw_data/ .
Sometimes in bioinformatics, the number of tasks you have to complete can get VERY large. Fortunately, there are several tools that can help us with this. One such tool is GNU Parallel. This tool can simplify the way in which we approach large tasks, and as the name suggests, it can iterate though many tasks in parallel, i.e. concurrently. We can use a simple command to demonstrate this.
parallel 'echo {}' ::: a b c
With the command above, the program contained within the quotation marks ' '
is echo
. This program is run 3 times, as there are 3 inputs listed after the :::
characters. What happens if there are multiple lists of inputs? Try the following:
parallel 'echo {}' ::: a b c ::: 1 2 3
Here, we have demonstrated how parallel
treats multiple inputs. It uses all combinations of one of each from a b c
and 1 2 3
. But, what if we wanted to use 2 inputs that were sorted in a specific order? This is where the --link
flag becomes particularly useful. Try the following:
parallel --link 'echo {}' ::: a b c ::: 1 2 3
In this case, the inputs are "linked", such that only one of each is used. If the lists are different lengths, parallel
will go back to the beginning of the shortest list and continue to use it until the longest list is completed. You do not have to run the following command, as the output is provided to demonstrate this.
$ parallel --link 'echo {}' ::: light dark ::: red blue green
> light red
> dark blue
> light green
Notice how light
appears a second time (on the third line of the output) to satisfy the length of the second list.
Another useful feature is specifying which inputs we give parallel are to go where. This can be done intuitively by using multiple brackets { }
containing numbers corresponding to the list we are interested in. Again, you do not have to run the following command, as the output is provided to demonstrate this.
$ parallel --link 'echo {1} {3}; echo {2} {3}' ::: one red ::: two blue ::: fish
> one fish
> two fish
> red fish
> blue fish
Finally, a handy feature is that parallel
accepts files as inputs. This is done slightly differently than before, as we need to use four colon characters ::::
instead of three. Parallel will then read each line of the file and treat its contents as a list. You can also mix this with the three-colon character lists :::
you are already familiar with. Using the following code, create a test file and use parallel
to run the echo
program:
echo -e "A\nB\nC" > test.txt
parallel --link 'echo {2} {1}' :::: test.txt ::: 1 2 3
And with that, you're ready to use parallel for all of your bioinformatic needs! We will continue to use it throughout this tutorial and show some additional features along the way. There is also a cheat-sheet here for quick reference.
Quality control is an important step in any pipeline. For this tutorial, we will use FastQC to inspect the quality of our samples.
First, make your desired output directory (if it doesn't already exist). Then, run FastQC as follows:
fastqc -t 4 raw_data/*fastq.gz -o fastqc_out
Go to http://##.uhn-hpc.ca/
(substituting ## for your student number) and navigate to your FastQC output directory. Click on the .html files to view the results for each sample. Let's look at the Per base sequence quality tab. This shows a boxplot representing the quality of the base call for each position of the read. In general, due to the inherent degradation of quality with increased sequence length, the quality scores will trend downward as the read gets longer. However, you may notice that for our samples this is not the case! This is because for the purpose of this tutorial, your raw data has already been trimmed.
More often, per base sequence quality will look like the following. The FastQC documentation provides examples of "good" and "bad" data. These examples are also shown below:
**Question 1: **Which of the graphs does your data resemble more closely? **Question 2: **What can we do if data fails the Per Base Sequence Quality module?
Now, you may have also noticed that most of the samples fail the "Per Base Sequence Content" module of FastQC. Let's look at our visualization:
This specific module plots out the proportion of each base position in a file, and raises a warning/error if there are large discrepancies in base proportions. In a given sequence, the lines for each base should run in parallel, indicating that the base calling represents proper nucleotide pairing. Additionally, the A and T lines may appear separate from the G and C lines, which is a consequence of the GC content of the sample. The relative amount of each base reflects the overall amount of the bases in the genome, and should not be imbalanced. When this is not the case, the sequence composition is biased. A common cause of this is the use of primers, which throws off our sequence content at the beginning of the read. Fortunately, although the module error stems from this bias, according to the FastQC documentation for this module it will not largely impact our downstream analysis.
The other modules are explained in depth in the FastQC Module Help Documentation
KneadData is a tool which "wraps" several programs to create a pipeline that can be executed with one command. Remember that these reads have already been trimmed for you - this is in fact one of KneadData's functionalities. For this tutorial though, we will use KneadData to filter our reads for contaminant sequences against a human database.
Kneaddata outputs many files for each sample we provide it. These include:
- paired sequences which match our database;
- singletons which match our database;
- paired sequences that do not match our database;
- singletons that do not match our database;
- and some others.
First, we want to activate our conda
environment where KneadData and our other tools for this tutorial are installed:
conda activate taxonomic
Then, run KneadData, using parallel, with the following command:
parallel -j 1 --eta --link 'kneaddata -i1 {1} -i2 {2} -o kneaddata_out -db ~/CourseData/MIC_data/tools/GRCh38_PhiX --bypass-trim --remove-intermediate-output' ::: raw_data/*R1_subsampled.fastq.gz ::: raw_data/*R2_subsampled.fastq.gz
While kneaddata
is running, consider the following:
The GRCh38_PhiX database is made up of the human genome.
Question 3: Of the four output files specified above, which should we choose for analyzing the microbiome?
You can check out all of the files that kneaddata
has produced by listing the contents of the output directory (there is a lot!). Take note of how the files are differentiated from one another, and try to identify some of the files we are interested in. Once kneaddata
is complete, we want to stitch our reads together into a single file. This is accomplished with a Perl script from our very own Microbiome Helper. For your convenience, it is already on your student instance.
with the Perl script, concatenate the reads into a single file:
perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_contam*.fastq
The script finds paired reads that match a given regex and outputs the combined files.
- We first specify that our program is to be run with Perl, and then provide the path to the program.
- The
-p
flag specifies how many processes to run in parallel. The default is to do one process at a time, so using-p 4
speeds things up. - The
--no_R_match
option tells the script that our read pairs are differentiated by*_1.fastq
instead of*_R1.fastq
. - The
-o
flag specifies the directory where we want the concatenated files to go. - Our regex matches the paired reads that do not align to the human database from the KneadData output. This is because the reads that aren't "contaminants" actually align to the human genome, so what we are left with could contain microbial reads.
- Consider that our files of interest are named something like
MSMB4LXW_R1_subsampled_kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq
. If we want to match all of our paired contaminant files with a regex, we can specify the string unique to those filenames_paired_contam
, and use wildcards*
to fill the parts of the filename that will change between samples.
- Consider that our files of interest are named something like
If the above does not work, you may need to install Perl:
conda install conda-forge::perl
If it still does not work or you already have Perl installed, you may get an error saying you require Parallel::ForkManager. Fix by executing the following inside your conda environment:
conda install bioconda::perl-parallel-forkmanager
First, we should see how many reads are in our concatenated samples. Since .fastq
files have 4 lines per read, we can divide the number of lines in the file by 4 to count the reads. Use the following command to check number of lines in the output files:
wc -l cat_reads/*
Woah! There's almost nothing left in most of these files! One even has zero reads! From this, we can infer that KneadData found the majority of our reads aligned to the human genome, leaving us with very few sequences to look for microbial reads. This is not entirely uncommon, however the fact that our input reads are actually subsets of much larger samples exacerbated this effect.
For the purpose of this tutorial, we will instead continue with the raw data instead of our filtered data. We are only doing this to demonstrate the tools for the purposes of this tutorial. This is NOT standard practice. In practice, you SHOULD NOT use unfiltered metagenomic data for taxonomic annotation.
To continue, we will concatenate the raw data, then unzip it (the ;
lets you enter multiple command lines that will execute in series when you press enter).
perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 -o cat_reads_full raw_data/*.fastq.gz;
gunzip cat_reads_full/*.gz
Now let's check how many reads we are dealing with:
wc -l cat_reads_full/*
Question 4: How many reads are in each sample?
Now that we have our reads of interest, we want to understand what these reads are. To accomplish this, we use tools which annotate the reads based on different methods and databases. There are many tools which are capable of this, with varying degrees of speed and precision. For this tutorial, we will be using Kraken2 for fast exact k-mer matching against a database.
Our lab has also investigated which parameters impact tool performance in this Microbial Genomics paper. One of the most important factors is the contents of the database, which should include as many taxa as possible to avoid the reads being assigned an incorrect taxonomic label. Generally, the bigger and more inclusive database, the better. However, due to the constraints of our cloud instance, we will be using a "Standard 8GB" index provided by the Kraken2 developers. For your convenience, the database is already available on your instance.
First, you must create the appropriate output directories, or Kraken2 will not write any files. Using parallel
, we will then run Kraken2 for our concatenated reads as shown below:
parallel -j 1 --eta 'kraken2 --db ~/CourseData/MIC_data/tools/k2_standard_08gb --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport --confidence 0 {}' ::: cat_reads_full/*.fastq
This process can take some time. While this runs, let's learn about what our command is doing!
- We first specify our options for parallel, where:
- the
-j 1
option specifies that we want to run two jobs concurrently; - the
--eta
option will count down the jobs are they are completed; - after the program contained in quotation marks, we specify our input files with
:::
, and use a regex to match all of the concatenated, unzipped.fastq
files.
- the
- We then describe how we want
kraken
to run:- by first specifying the location of the database with the
--db
option; - then specifying the output directory for the raw kraken annotated reads;
- notice that we use a special form of the brackets here,
{/.}
, this is a special function ofparallel
that will remove both the file path and extension when substituting the input into ourkraken
command. This is useful when files are going into different directories, and when we want to change the extension.
- notice that we use a special form of the brackets here,
- similarly, we also specify the output of our "report" files with the
--report
option; - the
-confidence
option allows us to filter annotations below a certain threshold (a float between 0 and 1) into the unclassified node. We are using0
because our samples are already subset, however this should generally be higher. See our paper for more information. - and finally, we use the empty brackets
{}
forparallel
to tellkraken
what our desired input file is.
- by first specifying the location of the database with the
With Kraken2, we have annotated the reads in our sample with taxonomy information. If we want to use this to investigate diversity metrics, we need to find the abundances of taxa in our samples. This is done with Kraken2's companion tool, Bracken (Bayesian Reestimation of Abundance with KrakEN).
Let's run Bracken on our Kraken2 outputs!
parallel -j 2 --eta 'bracken -d ~/CourseData/MIC_data/tools/kraken2_standard_08gb -i {} -o bracken_out{/.}.species.bracken -r 100 -l S -t 1' ::: kraken2_kreport/*.kreport
Some notes about this command:
-
-d
specifies the database we want to use. It should be the same database we used when we ran Kraken2; -
-i
is our input file(s); -
-o
is where and what we want the output files to be; -
-r
is the read length to get all classifications for, the default is 100; -
-l
is the taxonomic level at which we want to estimate abundances; -
-t
is the number of reads required prior to abundance estimation to perform re-estimation
Another tool that is commonly used for taxonomic annotation of metagenomic sequences is MetaPhlAn. This tool is different from Kraken2 in that it uses a database of marker genes, instead of a collection of genomes. We will use MetaPhlAn 3 for this tutorial, but a newer version (4) utilizing a larger database is available. First, let's make an output folder. Then, we can run the following:
parallel -j 2 --eta 'metaphlan --input_type fastq --bowtie2db ~/CourseData/MIC_data/tools/CHOCOPhlAn_201901 --no_map -o metaphlan_out/{/.}.mpa {}' ::: cat_reads/*.fastq
Similar to Kraken2, MetaPhlAn will output individual files for each sample. We can use a utility script from MetaPhlAn to merge our outputs into one table.
python ~/CourseData/MIC_data/AMB_data/scripts/merge_metaphlan_tables.py metaphlan_out/*.mpa > metaphlan_out/mpa_merged_table.txt
If we view this new combined table, we will see three key things:
- First, the output format is different to that of Kraken2, where the full taxonomic lineages are expanded line by line.
- Second, MetaPhlAn only outputs relative abundances for each taxonomic node, whereas Kraken2 (before re-analysis with Bracken) will output absolute numbers of reads assigned to each taxonomic node.
- Third, the number of taxa that MetaPhlAn finds is much smaller than Kraken2. This is partially due to us using a low confidence threshold with Kraken2, but this discrepancy between the two tools tends to hold true at higher confidence thresholds as well. See our paper for more info about how these tools perform compared to each other.
After all of this, we are almost ready to create some profiles from our samples! The last step is to put everything into a format that R can easily handle. One standard format is the .biom
format, which is a recognized standard for the Earth Microbiome Project and is supported by the Genomics Standards Consortium.
To transform our data, we will use kraken-biom, a tool which takes the .kreport
files output by bracken
and creates a new, combined file in the .biom
format. This tool can also incorporate our sample metadata, and it is easier to merge this information now versus later.
First, we will have to copy the metadata file to our working directory:
cp ~/CourseData/MIC_data/AMB_data/amb_module1/mgs_metadata.tsv .
Let's have a look at this file, try reading it with cat
Now that we have our metadata, let's run kraken-biom
and merge our samples to one organized output file:
kraken-biom kraken2_kreport/*bracken_species.kreport -m mgs_metadata.tsv -o mgs.biom --fmt json
- The inputs are a positional argument, and
kraken-biom
will accept lists. We can match all of our desired bracken-corrected.kreport
files with the regexkraken2_kreport/*bracken_species.kreport
- the
-m
option is for specifying our metadata file. Note thatkraken-biom
is picky about the order of the metadata, so the entries in the list should be in the same order that your files are found. Typically this is in lexicographic order. - the
-o
option specifies our output file. - the
--fmt
option specifies that we want the output to be injson
format, which is not the default behavior of the program. JSON is a text-based version of the BIOM format, as opposed to HDF5 which is a binary file format.
Now, with all of that, we should have our final mgs.biom
file ready to go! You can check out what this file looks like (it's a single line so head
will not work here). It can be cumbersome to look at, but there are patterns. Fortunately, R is much better at reading these files than we are!
It would be best to create a new R markdown document for this section. You can then paste the following lines of code into the first chunk and clink "Run". If you are not using a markdown document, omit the first line.
knitr::opts_chunk$set(echo = TRUE)
library(phyloseq)
library(vegan)
library(ggplot2)
library(taxonomizr)
#setwd("~/workspace/")
#data<-import_biom("~/CourseData/MIC_data/AMB_data/amb_module1/mgs.biom", header = TRUE)
data<-import_biom("ben/mgs_correction.biom", header = TRUE)
We can see what the phyloseq object looks like by using:
View(data)
With this, we can see that it is a special object that should contain otu_table
, tax_table
, and sam_data
objects within it. We can additionally inspect each of these elements with the View()
function, specifying what we are looking for inside the phyloseq object. This command takes the form: View(data@sam_data)
, where the string following the @
sign is what we want to see.
Question 5: What do each of these objects within data
represent?
Next, we will do some housekeeping with the phyloseq object we have created with the import-biom()
function. If we look at the tax_table
with the following:
View(data@tax_table)
We get something like this:
Rank1 | Rank2 | Rank3 | Rank4 | Rank5 | Rank6 | Rank7 | |
---|---|---|---|---|---|---|---|
820 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__uniformis |
2755405 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__sp. CACC 737 |
2528203 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__sp. A1C1 |
... | ... | ... | ... | ... | ... | ... | ... |
We can see that the table contains all of our information, but the labels need some work. So, we can modify the contents of the tax_table
as follows. Let's create a new chunk in our document and start pasting this code to run:
#Rename the taxa names in the taxa table.
data@tax_table@.Data <- substring(data@tax_table@.Data, 4)
#Rename columns of taxa table.
colnames(data@tax_table@.Data)<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
bacteria<-subset_taxa(data, Kingdom == "Bacteria")
So, what's this doing to our data?
- The first command strips all the taxonomic names of their first 3 characters, which removes those pesky leading
k__
strings. We are directly modifying the vectors within the data frame with thesubstr()
command. The parameters of this command tell R we only want the strings starting at the 4th character, which skips the letter and two underscores (i.e. `k__). - The second command creates a character vector of the taxonomic levels to replace the "Ranks" in the table, and replaces those column names.
- The third command subsets all of the taxa in our data to only bacteria, by selecting only taxa with
"Bacteria"
as theirKingdom
column entry in thetax_table
.
Now if you run View(data@tax_table)
, the resulting output should look much cleaner!
And, a quick way to check if this worked is to execute the following command in your console. This looks for unique values in the "Kingdom" row of the taxonomy table, after coercing it to a data frame:
unique(as.data.frame(bacteria@tax_table@.Data)$Kingdom)
Which should only return 1 unique string:
[1] "Bacteria"
An important step in analyzing sequencing data is rarefaction. Rarefaction involves randomly subsetting samples so that all samples have even sequencing depth
A quick note on rarefying and rarefaction: (click to expand)
It is worth knowing that the practice of rarefaction has been called into question in the past. However, the current thinking is that "rarefaction is the most robust approach to control for uneven sequencing effort when considered across a variety of alpha and beta diversity metrics."
Generally, it is a good idea to start by manually removing rare taxa. It is common to remove taxa that have less than 20 reads across all samples in our dataset. To do this, we will use the prune_taxa()
command from phyloseq.
Create a new chunk. If we view the otu_table
of Bacteria
, we will see as we scroll through that there are many taxa that appear sparsely across the different samples.
View(bacteria@otu_table)
We are not particularly interested in these rare taxa, so a quick way to deal with them is to "prune" taxa from our samples that have less than "n" reads across all samples. We can first look at the taxa sums in the form of a histogram.
#View the histogram of taxa sums in our "Bacteria" dataset.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums before Pruning")
With this, we see that most frequently, the sum of all reads for a given taxa is in the 0-20 bin. This means that there are lots of taxa with low abundances (less than 20 reads total) in our dataset. So, we can prune these rare taxa with a built-in phyloseq command called prune_taxa
:
#Prune rare taxa from the dataset. This removes taxa that have less than 20 occurances across all samples.
bacteria<-prune_taxa(taxa_sums(bacteria)>=20,bacteria)
Some notes about this command:
- The first argument we give to
prune_taxa()
is the condition we want met for the taxa to 'pass' this filtering. - We are looking for the sum of the taxa (found by
taxa_sums
) in our phyloseq objectbacteria
to be greater than 20. - The second argument is the phyloseq object we are pruning, which will still be
bacteria
. - The result of this command is overwrites our previous
bacteria
R object.
With that, we can re-visit our histogram and see what this pruning has done:
#View the histogram of taxa sums after we remove the rare taxa.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums after Pruning")
From the scale of the y-axes, we can see that this has pruned many of the rare taxa. This can be verified by viewing the otu_table
with View(bacteria@otu_table)
.
Create a new chunk. To rarefy our dataset, we must first visualize the rarefaction curve of our samples using the vegan package. To do this, we need to create a dataframe that vegan can work with from our Phyloseq object bacteria
.
#Visualize the rarefaction curve of our data.
rarecurve(as.data.frame(t(otu_table(bacteria))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))
Some notes about this command:
- We apply a number of transformations to
bacteria
beforerarecurve
works on it:- the otu_table of
bacteria
is returned by theotu_table(bacteria)
command; - the otu_table is then tranposed by
t()
such that the rows and columns are switched, because this is the formatrarecurve
expects; - this transposed otu_table is then coerced to a dataframe by
as.data.frame()
so thatrarecurve
can read it.
- the otu_table of
- The remainder of the
rarecurve
parameters control how the output is displayed.
Your rarefaction curve should look similar to the following:
Looking at the rarefaction curve, we can see that the number of species for all of the samples eventually begins to plateau, which is a good sign! This tells us that we have reached a sequencing depth where more reads does not improve the number of taxa we find in our sample. However, with the labels, it can be difficult to see exactly what these sample sizes are, so the following code will print it out for us:
print(c("Minimum sample size:",min(sample_sums(bacteria)), "Maximum sample size:", max(sample_sums(bacteria))))
So, we know that at a minimum we must rarefy to the smaller number. However, considering some samples plateau much before this number, we should choose a smaller cutoff. There is not a strict method to choosing a cutoff, but we should remember we want to include abundant taxa and exclude rare taxa. With this in mind, it is acceptable to rarefy our samples to a size of 10000
reads. Use the following lines of code to achieve this:
#Set seed for reproducibility. Rarefy to a sample size equal to or smaller than the minimum sample sum.
set.seed(711)
rarefied<-rarefy_even_depth(bacteria, rngseed = FALSE, sample.size = 10000, trimOTUs = TRUE)
#See that the samples are all the same size now.
rarecurve(as.data.frame(t(otu_table(rarefied))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))
Some notes about these commands:
- We first set the seed to some number, in this case
711
. This is for reproducibility, since otherwise,rarefy_even_depth()
would use a random seed by default. This means that if you ran the same code twice without setting the seed, you would get two different rarefied subsets. - The
rarefy_even_depth()
command needs to know to not use a random seed (rngseed = FALSE
), that our cutoff is 10000 (sample.size = 10000
), and that we are rarefying thebacteria_pruned
phyloseq object. - The
trimOTUs = TRUE
parameter ofrarefy_even_depth()
means that if a taxa is subsampled to an abundance of 0 across all samples, that taxa is removed from the OTU table. Having taxa with 0 reads can mess things up later in the analysis.
Question 6: Why do we prune rare taxa before rarefying?
Excellent! Now that our data is imported, formatted, and rarefied, we can finally look at some diversity metrics!
Alpha diversity is a metric that evaluates the different types of taxa in a given sample. Different alpha diversity methods typically use calculations based on different components of the sample, which are:
- Richness: the number of taxa in a sample.
- Evenness: the distribution of abundances of the taxa in a sample (i.e. similarities/differences in read quantity per taxa).
Some methods use only one of these components, some will use a combination of both, and others use neither. There are many indices to choose from, but some common ones are:
- Observed taxa: The number of different taxa (richness)
- Chao1 index: another measure of richness, more sensitive to rare taxa
- Shannon index: a combination of richness and evenness, with more weight to the richness component.
- Simpson index: a combination of richness and evenness, with more weight to the evenness component.
- Fisher's diversity index / Fisher's alpha was one of the first methods to calculate the relationship between the number of different taxa and the abundance of individual taxa
First, make a new chunk. Then, try running the following lines of code:
#Plot alpha diversity with several metrics.
plot_richness(physeq = rarefied, x="diagnosis", color = "diagnosis", measures = c("Observed", "Shannon", "Fisher")) + geom_boxplot() +theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Your plots will look something like the following:
We can see that the groups are not identical, and that the different indices yield different plots. As well, some indices are similar to each other (like observed taxa and Fisher's alpha). Since we see differences in both the Shannon and Simpson plots, we can say that there are differences in both richness and evenness between our CD and non-IDB sample groups.
Try adding or changing the measures to see how they compare to one another. Also, try changing value of "x" to different (categorical) metadata variables.
Question 7: How can you use the View()
command to see what metadata you can choose from?
Beta diversity is another useful metric we use to compare microbiome compositions. Specifically, beta diversity is used to evaluate the differences between entire samples in a dataset. You can imagine that each sample is compared to one another, and the "distance" between these samples is calculated, such that we get a symmetric matrix of pairwise comparisons.
Sample A | Sample B | Sample C | |
---|---|---|---|
Sample A | 1 | A vs B | A vs C |
Sample B | B vs A | 1 | B vs C |
Sample C | C vs A | C vs B | 1 |
To capture beta diversity, we use some metric to calculate these pairwise distances between samples. There are several to choose from, but the most common include:
- Bray-Curtis dissimilarity: takes into account presence/absence and abundance of taxa
- Jaccard distance: takes into account only presence/absence of taxa
- Weighted UniFrac: accounts for phylogenetic distances between present and absent taxa, weighted by taxa abundance
- Unweighted UniFrac: accounts for phylogenetic distances between present and absent taxa
Once we decide on the metric(s) to use, we then have to consider how to ordinate our data. Ordination is the method by which we take our data points in multidimensional space and project them them to lower dimensional, i.e. 2D space. This allows us to more intuitively visualize our data and the differences between groups.
Common ordination methods include:
- Principal Coordinate Analysis (PCoA) uses eigenvalue decomposition of the distance matrix for normally distributed data
- Non-metric Multidimensional Scaling (NMDS) uses iterative rank-order calculations of the distance matrix for non-normally distributed data
Any combinations of the above metrics and methods can be used, depending on the type of data to be analyzed. For our dataset, we will be using the Bray-Curtis dissimilarity and PCoA method. Fortunately, all of these can easily be implemented in R.
Create a new chunk. First, we will create a relative abundance table from our OTU table. This can be done by transforming our sample counts to a percentage (float) between 0 and 1 using the transform_sample_counts()
function from phyloseq:
percentages <- transform_sample_counts(rarefied, function(x) x*100 / sum(x))
Now, we have a new phyloseq object percentages
in which the abundances are relative. With this object we can create an ordination by selecting our method
and distance
metric. Fortunately, the ordinate
function from phyloseq can do this transformation in one step:
ordination<-ordinate(physeq = percentages, method = "PCoA", distance = "bray")
We are ready to plot our ordination! In this step, we have to specify what data we are using and what our ordination object is. Additionally, we can select our metadata group of interest with the color
parameter.
plot_ordination(physeq = percentages, ordination = ordination, color="diagnosis") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
Your plot should look something like the following:
Now, try substituting the color
parameter with some different categories from our metadata. Recall how to view the metadata. What does this do to the plot?
As well, you can change the parameters of the ordination like distance
or method
, then run the plot_ordination()
function again. How does this change the plot?
Another useful way to visualize the microbial composition of our samples is through the use of stacked bar charts. These charts break down the relative abundances of different taxa in our samples, informing us more about who is making up the communities. Additionally, we can compare our samples side-by-side to see if there are differences in the taxa abundances between samples. We will start by using the percentages
object that we created in the beta diversity step.
Create a new chunk. We can visualize any taxonomic level with stacked bar charts by collapsing the taxa to their respective nodes at that level. This is done by the tax_glom()
phyloseq function, which "agglomerates" all of the taxa to the level we specify. Let's start by looking at the different families in our samples.
percentages_glom <- tax_glom(percentages, taxrank = 'Family')
We can verify that this worked by viewing the tax_table
.
Question 8: What happened to the data "below" the family level?
Next, we want to "melt" our phyloseq object percentages_glom
into a new dataframe for constructing our plot with ggplot2. We can do this with the psmelt()
function and create a new data frame like so:
percentages_df <- psmelt(percentages_glom)
If we were to keep the data as is, we would be including every Family in our visualization. However, many of the families in our dataframe appear at low abundances across our dataset. This will make our bars look very cluttered, which is contrary to our objective of creating intuitive visualizations. So, it is useful to further collapse all of the families that occur below a certain abundance threshold. Let's try this with an arbitrary value of 1 percent. We can do this by checking all of the Abundance
values in percentages_df
, and if the value is less than 1 (i.e. 1 percent) the Family
value will be replaced with the character string Family < 1% abundance
.
percentages_df$Family[percentages_df$Abundance < 1] <- "Family < 1% abundance"
Our Family
data is categorical, so we have essentially just created a new "Family" category for simplicity. In fact, nearly all of the data in our percentages_df
dataframe is categorical, which presents a problem. Ggplot2 will want to separate our data by factor levels, so we have to coerce our metadata column of interest to the factor class. We can do this easily with the as.factor()
function:
percentages_df$Family <- as.factor(percentages_df$Family)
Additionally, we can choose the colours of our chart. By default, it will be grayscale, and that's no fun! The RColorBrewer package provides many palettes to choose from for different types of data. We can use the "Spectral" palette, but have a problem: "Spectral" only has 11 colours, but we have more than 11 families! We can fix this by using the colorRampPalette()
function from the "grDevices" package, which interpolates new colours with a given palette.
This command works by first using brewer.pal(11,"Spectral")
, which returns a character vector of 11 colours in hexadecimal form. Then, we specify how many colours we want to "ramp" to by finding the number of unique factor levels, or Families, in our dataframe with length(levels(percentages_df$Family))
which returns a numerical value. Finally, colorRampPalette()(n)
creates a function that returns a new character vector of n
hexadecimal colour codes, which we store in a new vector colours
. Funny enough, we have exactly 11 factor levels within Family
. However this method is useful when we want to look at more or less factors, like if we agglomerated the data to the Fhylum or Genus levels instead.
colors<- colorRampPalette(brewer.pal(11,"Spectral"))(length(levels(percentages_df$Family)))
Additionally, it will be helpful to see which samples come from each metadata group. Let's consider the Diagnosis
column of our dataframe. We can differentiate between them by creating another character vector containing colours assigned to each group. To do this quickly, we can use an ifelse()
function that checks whether the value of each cell in the Diagnosis
column is equal to CD
. If this is true, we set the colour to red. If it is false, we know that it must be equal to nonIDB
instead, and set the colour to blue. R natively supports hundreds of named colours too, which you can view with colours()
; feel free to pick your favourites! This works for any category that contains two different values.
c <- ifelse(percentages@sam_data$diagnosis == "CD", "red", "blue")
Finally, we can plot our stacked bar chart! Using the ggplot()
function, we will combine all of the data and objects we have prepared. We must specify our dataframe with the data
parameter. As well, we must tell ggplot how we want the graph to look with aes()
, or the aesthetic map. First, we have to specify what our axes are. Then, we must specify how the bars are to be "filled", or separated. And, we want to add a theme()
which colours our X-axis labels by the Diagnosis
group, using the character vector we made above.
relative_plot <- ggplot(data=percentages_df, aes(x=ID, y=Abundance, fill=Family))+
ggtitle("Relative Abundance", subtitle = "Family Level")+
xlab("Sample ID")+
ylab("Abundance (%)")+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = colors)+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, colour = c))
relative_plot
Your plot will look similar to the following:
Try agglomerating to different taxonomic levels. Doing so, you will have to change a few lines of code and run them before building and viewing the relative_plot
.
Try using the high_dysbiosis
metadata group. Question 9: How do you have to change the ifelse()
function to make this work?
So far, we have looked at how the groups differ from each other at different levels by diversity between Diagnosis
groups, by comparing the distances between samples, and looking at how they differ taxonomically. However, we have yet to comprehensively look at the taxa inside of our samples. Heatmaps are a great way to visualize this information, and you will see that it allows us to easily directly compare the taxa making up the samples.
With taxonomic annotation tools, it can sometimes be difficult to accurately resolve the taxonomy at species and strain levels. For our demonstration, we will again agglomerate the taxa in our dataset, but to the Genus level instead of Family. We will create a new object using use the same tax_glom()
function as before, but this time start with our rarefied
phyloseq object. This is because it has not been transformed to relative abundances.
Create a new chunk, then paste and run the following line.
hm <- tax_glom(rarefied, taxrank = "Genus")
This still leaves us with lots of genera, which we can count with the following command: length(get_taxa_unique(rarefied, taxonomic.rank = "Genus"))
. This returns 64! We should now consider how we want our heatmap to look. With that many taxa, it will become more difficult to interpret. So, let's represent only the top n
taxa in our dataset.
The following command will use the taxa_sums()
function to first sum up how many reads we have for each genus. Then, the sort()
function will sort the taxa - to do this we have to specify TRUE
for the decreasing
parameter to indicate we want to go from largest to smallest. This results in a named numeric vector, from which we want the names of these top 20 taxa. This is extracted with the names()
function, which returns a character vector of taxa names. Finally, the prune_taxa()
function takes this character vector and takes those taxa out of hm
, and overwrites our original hm
object.
hm<-prune_taxa(names(sort(taxa_sums(hm),decreasing=TRUE)[1:20]), hm)
Let's also change the column names of our matrix. Currently, they are the names of our bracken output files, but we can swap them with the values from ID
stored in the phyloseq object.
#change column names to sample IDs
colnames(hm)<-rarefied@sam_data$ID
Now, we should format our reduced dataset into a matrix, so our Heatmap()
function will be able to handle it. We only need the otu_table
from hm
, so we can extract it and then coerce it to the matrix class. Doing this converts all of the numbers in the dataframe to the numeric class.
hm<-as.matrix(otu_table(hm))
If we inspect our matrix with View()
, we will see that the row names are not actually taxonomic names, but the taxID's. This is not particularly useful for us because taking the time to look up what each of these taxIDs are on the NCBI taxonomy browser is very inefficient. So, we should interchange these taxIDs with the corresponding genus name.
A quick note, the tax_glom()
function works somewhat counterintuitively with the taxIDs. If several species get collapsed into one genera, the taxID of the first species in that genera gets assigned. So, the taxIDs we have do not correspond to a genus node in NCBI, but one of the original species. This is ok though, because we can still get the genus name from this taxID using the taxonomizr package from NCBI.
The first time you use this package, you will build and store a SQL database that the functions will rely on. In our case, however, the database is already stored on your virtual instance, so this command will just tell R where to look for it.
prepareDatabase(sqlFile = 'accessionTaxa.sql', tmpDir = "~/CourseData/MIC_data/tools/", getAccessions = FALSE)
Now, using the taxa_names()
function we can get the 20 top taxIDs from our hm
object. This provides a character vector to the getTaxonomy()
function, which will return the genus name for each taxID, and store them in a new matrix genera
like so:
genera<-getTaxonomy(ids = taxa_names(hm), sqlFile = "accessionTaxa.sql", desiredTaxa = "genus")
With this, we can replace the taxIDs with their corresponding genera names. This works because we obtained the genera in the same order that the IDs are in the matrix.
rownames(hm)<-genera
Finally, we are going to transform our data by taking the base 2 logarithm of all of the abundance values in our matrix. This will allow us to discern more easily between the values that fall in the middle of the range of abundances.
hm<-log2(hm)
Now if we inspect hm
with View()
, there are some new values in our matrix. As well, you may notice that -Inf
has appeared in several places. This is because any logarithm of 0 is undefined, so R substituted this value when doing the transformation. We can fix this by simply replacing -Inf
with 0
in our matrix.
hm[hm=="-Inf"]<-0
We are nearly ready to build the heatmap, however we should prepare some additional information to include. If we want to show which samples belong to particular metadata groups, we need to make an annotation. The ComplexHeatmap package has extensive documentation and examples to help us add elements to our visualization, but fortunately, adding a simple annotation to our heatmap is a relatively straightforward process.
The annotation will be a bar that sits on top of the heatmap, and indicates which diagnosis
group the sample belongs to. This should make it apparent if these samples are clustering together in our plot! To do this, we will use the HeatmapAnnotation()
function. The first parameter we provide is what our categorical variable is and where to find it. You may notice that we are going back to the sam_data
table of rarefied
, because rarefied
contains our sample metadata while the hm
matrix does not. This works though, as we generated hm
from rarefied
above. The function will also assign random colours to indicate the groups, and will include a legend in our final heatmap because we specify the show_legend
parameter.
ha <- HeatmapAnnotation(Diagnosis = rarefied@sam_data$diagnosis, show_legend = TRUE)
Another useful step in preparation of the heatmap is sorting the dendrogram that accompanies it. Heatmap()
will do hierarchical clustering by default and sort the nodes of the resulting dendrogram. However, we can elect to use a different sorting algorithm, such as dendsort()
. To do this, we have to generate our own hierarchical cluster of the data with hclust()
, then apply dendsort()
to the product and save it as an object.
col_dend = dendsort(hclust(dist(t(hm))))
One last preparation step! We should pick what colour gradient we want the heatmap to use. Again, we can use the RColorBrewer package, but this time using a palette that is good for continuous data, called a sequential palette. We will then ramp the colours to a larger number so that we create a nice gradient. I like "RdPu" as a palette, but you could choose your own as well!
colours<- colorRampPalette(brewer.pal(9,"RdPu"))(256)
We're finally there! now we can put it all together with the Heatmap()
function from ComplexHeatmap. In this command, we will use the parameters to specify our matrix containing the data, our custom dendrogram, and our annotation bar. You can also elect whether to show the row dendrogram by changing/removing the show_row_dend
parameter. Let's plot it!
Heatmap(hm, cluster_columns = col_dend,
show_row_dend = FALSE, col = colours,
top_annotation = ha,
row_dend_width = unit(30,"mm"),
heatmap_legend_param = list(title = "Log(2)\nAbundance"))
And your resulting heatmap should look something like the following:
Try changing your heatmap annotation to be another metadata variable. Does this cluster better or worse than diagnosis
?
Try changing the command at near the beginning of this module to create a different heatmap. Maybe look at the least abundant taxa? Or the top 30?
Question 1: Which of the graphs does your data resemble more closely? The FastQC plots for our samples should look more like the "good" Illumina data. The quality scores do not decrease rapidly like in the "bad" example.
Question 2: What can we do if data fails the Per Base Sequence Quality module? See this documentation from FastQC for an explanation. One possibility would be trimming the adapter sequence and the low quality part of the read.
Question 3: Of the four output files specified above, which should we choose for analyzing the microbiome?
_We should use the _*paired_contam*.fastq
output files. This is because they contain the reads that 1) do not align to the human genome; and 2) are not singletons. If the reads do not align to the human reference, there is a chance they are microbial in origin, and we can use them for the next step.
Question 4: How many reads are in each sample?
_Using the _wc
_program, we can see that there are 20000 reads in each of the concatenated .fastq files. Since we know that .fastq files have one read for every four lines, we divide 20000 by 4 to get 5000 reads.
Question 5: What do each of these objects within data
represent?
otu_table()
contains the counts for each taxa in our sample.
tax_table()
contains the full taxonomy, as given by Kraken2, for each taxa sample.
otu_table()
contains the metadata for our samples.
Question 6: Why do we prune rare taxa before rarefying? We prune the rare taxa first because rarefying randomly subsamples the data to our specified read depth. If these rare taxa are not removed beforehand, they may be included in the rarefied dataset. Why is this bad? Well, rarefying changes the absolute abundances of our samples. When we rarefy, we also round to an integer. So, if a rare taxa with 10 reads in the original sample is rarefied to 0.01 and gets rounded to 1, its abundance is inflated by 100x. We can prune to an arbitrary cutoff, or pick one that is 1/1000th of the mean read depth.
Question 7: How can you use the View()
command to see what metadata you can choose from?
_Try _View(rarefied@sam_data)
to view the metadata table
Question 8: What happened to the data "below" the family level?
_It gets collapsed, or "agglomerated", up to the family level, so if family A has genera B, C, and D, the sum of the B, C, and D abundances becomes the abundance of A. So it is not removed, instead it is combined. Interestingly, taxa that have _NA
at the family level (or above) are actually removed.
Question 9: How do you have to change the ifelse()
function to make this work?
Try c <- ifelse(percentages@sam_data$high_dysbiosis=="FALSE", "red", "blue")
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to morgan.langille@dal.ca.