
CBW 2024 Advanced Module 1: Introduction to metagenomics and read-based profiling



Working on the Command Line

Copy the files to your working directory.

cp -r ~/CourseData/MIC_data/AMB_data/raw_data/ .

A Crash Course in GNU Parallel

Sometimes in bioinformatics, the number of tasks you have to complete can get VERY large. Fortunately, there are several tools that can help us with this. One such tool is GNU Parallel. This tool can simplify the way in which we approach large tasks and, as the name suggests, it can iterate through many tasks in parallel, i.e. concurrently. We can use a simple command to demonstrate this.

parallel 'echo {}' ::: a b c

With the command above, the program contained within the quotation marks ' ' is echo. This program is run 3 times, as there are 3 inputs listed after the ::: characters. What happens if there are multiple lists of inputs? Try the following:

parallel 'echo {}' ::: a b c ::: 1 2 3
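
If you run this, you should see all nine combinations, along the lines of the following (ordering may vary slightly):

> a 1
> a 2
> a 3
> b 1
> b 2
> b 3
> c 1
> c 2
> c 3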

Here, we have demonstrated how parallel treats multiple input lists: it uses all combinations, pairing each of a b c with each of 1 2 3. But what if we wanted our two lists to be paired in a specific order? This is where the --link flag becomes particularly useful. Try the following:

parallel --link 'echo {}' ::: a b c ::: 1 2 3

In this case, the inputs are "linked", such that the lists are consumed in lockstep: the first item of each list together, then the second, and so on. If the lists are different lengths, parallel will go back to the beginning of the shorter list and reuse it until the longest list is completed. You do not have to run the following command, as the output is provided to demonstrate this.

$ parallel --link 'echo {}' ::: light dark ::: red blue green
> light red
> dark blue
> light green

Notice how light appears a second time (on the third line of the output) to satisfy the length of the second list.

Another useful feature is controlling where each input goes. This can be done intuitively by using multiple brackets { } containing numbers corresponding to the list we are interested in, such as {1} for the first list and {2} for the second. Again, you do not have to run the following command, as the output is provided to demonstrate this.

$ parallel --link 'echo {1} {3}; echo {2} {3}' ::: one red ::: two blue ::: fish
> one fish
> two fish
> red fish
> blue fish

Finally, a handy feature is that parallel accepts files as inputs. This is done slightly differently than before, as we need to use four colon characters :::: instead of three. Parallel will then read each line of the file and treat its contents as a list. You can also mix this with the three-colon character lists ::: you are already familiar with. Using the following code, create a test file and use parallel to run the echo program:

echo -e "A\nB\nC" > test.txt
parallel --link 'echo {2} {1}' :::: test.txt ::: 1 2 3
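
If everything worked, the file's lines should be paired with the numbers, like so:

> 1 A
> 2 B
> 3 C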

And with that, you're ready to use parallel for all of your bioinformatic needs! We will continue to use it throughout this tutorial and show some additional features along the way. There is also a cheat-sheet here for quick reference.

Quality Control

Visualization with FastQC

First, make your desired output directory (if it doesn't already exist).
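
For example, assuming you keep the directory name passed to -o in the FastQC command below:

mkdir -p fastqc_out

Then, run FastQC as follows: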

fastqc -t 4 raw_data/*fastq.gz -o fastqc_out

Go to http://##.uhn-hpc.ca/ (substituting ## for your student number) and navigate to your FastQC output directory. Click on the .html files to view the results for each sample. Let's look at the Per base sequence quality tab. This shows a boxplot representing the quality of the base call for each position of the read. In general, due to the inherent degradation of quality with increased sequence length, the quality scores will trend downward as the read gets longer. However, you may notice that for our samples this is not the case! This is because for the purpose of this tutorial, your raw data has already been trimmed.

More often, per base sequence quality will look like the following. The FastQC documentation provides examples of "good" and "bad" data. These examples are also shown below:

[Figure: example "good" and "bad" Per Base Sequence Quality plots from the FastQC documentation]

Which of the graphs does your data resemble more closely? What can we do if data fails the Per Base Sequence Quality module?

Now, you may have also noticed that most of the samples fail the "Per Base Sequence Content" module of FastQC. Let's look at our visualization:

[Figure: Per Base Sequence Content plot for one of our samples]

This module plots the proportion of each of the four bases at each position along the reads in a file, and raises a warning/error if there are large discrepancies in base proportions. The lines for each base should run roughly in parallel, indicating consistent base composition along the read. The A and T lines may appear separate from the G and C lines, which is a consequence of the GC content of the sample. The relative amount of each base should reflect its overall abundance in the genome, so the lines should not be strongly imbalanced; when this is not the case, the sequence composition is biased. A common cause of this is the use of primers, which throws off the sequence content at the beginning of the read. Fortunately, although the module warning stems from this bias, according to the FastQC documentation for this module it should not greatly impact our downstream analysis.

The other modules are explained in depth in the FastQC Module Help Documentation.

Filtering with KneadData

KneadData is a tool which "wraps" several programs to create a pipeline that can be executed with one command. Remember that these reads have already been trimmed for you - this is in fact one of KneadData's functionalities. For this tutorial though, we will use KneadData to filter our reads for contaminant sequences against a human database.

KneadData outputs many files for each sample we provide it. These include:

  • paired sequences which match our database;
  • singletons which match our database;
  • paired sequences that do not match our database;
  • singletons that do not match our database;
  • and some others.

First, we want to activate our conda environment where KneadData and our other tools for this tutorial are installed:

conda activate taxonomic

Then, run KneadData, using parallel, with the following command:

parallel -j 1 --eta --link 'kneaddata -i1 {1} -i2 {2} -o kneaddata_out -db ~/CourseData/MIC_data/tools/GRCh38_PhiX --bypass-trim --remove-intermediate-output' ::: raw_data/*R1_subsampled.fastq.gz ::: raw_data/*R2_subsampled.fastq.gz

While KneadData is running, consider the following: the GRCh38_PhiX database is made up of the human genome and the PhiX phage, a common sequencing control. Of the four output files specified above, which should we choose for analyzing the microbiome?

You can check out all of the files that kneaddata has produced by listing the contents of the output directory (there is a lot!). Take note of how the files are differentiated from one another, and try to identify some of the files we are interested in. Once kneaddata is complete, we want to stitch our reads together into a single file. This is accomplished with a Perl script from our very own Microbiome Helper. For your convenience, it is already on your student instance.
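
For example, to skim the output filenames (head simply truncates the long listing):

ls kneaddata_out/ | head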

With the Perl script, concatenate the reads into a single file:

perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_contam*.fastq

The script finds paired reads that match a given regex and outputs the combined files.

  • We first specify that our program is to be run with Perl, and then provide the path to the program.
  • The -p flag specifies how many processes to run in parallel. The default is to do one process at a time, so using -p 4 speeds things up.
  • The --no_R_match option tells the script that our read pairs are differentiated by *_1.fastq instead of *_R1.fastq.
  • The -o flag specifies the directory where we want the concatenated files to go.
  • Our regex matches the paired reads from the KneadData output that do not align to the human database. With this database, the reads that aren't flagged as "contaminants" are the ones that align to the human genome, so the files we keep are the ones that could contain microbial reads.
    • Consider that our files of interest are named something like MSMB4LXW_R1_subsampled_kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq. If we want to match all of our paired contaminant files with a regex, we can specify the string unique to those filenames _paired_contam, and use wildcards * to fill the parts of the filename that will change between samples.

If the above does not work, you may need to install Perl:

conda install conda-forge::perl

If it still does not work or you already have Perl installed, you may get an error saying you require Parallel::ForkManager. Fix by executing the following inside your conda environment:

conda install bioconda::perl-parallel-forkmanager

Generating Taxonomic Profiles

First, we should see how many reads are in our concatenated samples. Since .fastq files have 4 lines per read, we can divide the number of lines in a file by 4 to count its reads. Use the following command to check the number of lines in the output files:

wc -l cat_reads/*
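
If you would rather see read counts directly, you can do the division in the shell; a minimal sketch, where the filename is a placeholder for any one of your concatenated samples:

# number of reads = number of lines / 4
echo $(( $(wc -l < cat_reads/SAMPLE.fastq) / 4 ))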

Woah! There's almost nothing left in most of these files! One even has zero reads! From this, we can infer that KneadData found that the majority of our reads aligned to the human genome, leaving us with very few sequences in which to look for microbial reads. This is not entirely uncommon; however, the fact that our input reads are subsets of much larger samples exacerbates the effect.

For the purposes of this tutorial, we will continue with the raw data instead of our filtered data. We are only doing this to demonstrate the tools. This is NOT standard practice. In practice, you SHOULD NOT use unfiltered metagenomic data for taxonomic annotation.

To continue, we will concatenate the raw data, then unzip it (the ; lets you enter multiple commands that will execute in series when you press enter).

perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 -o cat_reads_full raw_data/*.fastq.gz;
gunzip cat_reads_full/*.gz

Now let's check how many reads we are dealing with:

wc -l cat_reads_full/*

How many reads are in each sample?

Annotation with Kraken2/Bracken

Now that we have our reads of interest, we want to understand what these reads are. To accomplish this, we use tools which annotate the reads based on different methods and databases. There are many tools which are capable of this, with varying degrees of speed and precision. For this tutorial, we will be using Kraken2 for fast exact k-mer matching against a database.

Our lab has also investigated which parameters impact tool performance in this Microbial Genomics paper. One of the most important factors is the contents of the database, which should include as many taxa as possible to avoid reads being assigned an incorrect taxonomic label. Generally, the bigger and more inclusive the database, the better. However, due to the constraints of our cloud instance, we will be using a "Standard 8GB" index provided by the Kraken2 developers. For your convenience, the database is already available on your instance.

First, you must create the appropriate output directories, or Kraken2 will not write any files.
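
One way to do this (the directory names are taken from the Kraken2 command below):

mkdir -p kraken2_outraw kraken2_kreport

Using parallel, we will then run Kraken2 for our concatenated reads as shown below: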

parallel -j 1 --eta 'kraken2 --db ~/CourseData/MIC_data/tools/k2_standard_08gb --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport --confidence 0 {}' ::: cat_reads_full/*.fastq

This process can take some time. While this runs, let's learn about what our command is doing!

  • We first specify our options for parallel, where:
    • the -j 1 option specifies that we want to run one job at a time;
    • the --eta option shows progress and the estimated time remaining as jobs are completed;
    • after the program contained in quotation marks, we specify our input files with :::, and use a regex to match all of the concatenated, unzipped .fastq files.
  • We then describe how we want kraken to run:
    • by first specifying the location of the database with the --db option;
    • then specifying the output directory for the raw kraken annotated reads;
      • notice that we use a special form of the brackets here, {/.}; this is a replacement string that tells parallel to remove both the file path and the extension when substituting the input into our kraken command. This is useful when files are going into different directories, and when we want to change the extension.
    • similarly, we also specify the output of our "report" files with the --report option;
    • the --confidence option allows us to filter annotations below a certain threshold (a float between 0 and 1) into the unclassified node. We are using 0 because our samples are already subset; however, this should generally be higher. See our paper for more information.
    • and finally, we use the empty brackets {} for parallel to tell kraken what our desired input file is.
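
While this runs (or once it finishes), you can peek at a report; the first lines of a .kreport file show the fraction of unclassified reads and the root of the taxonomy (the filename here is a placeholder for one of your samples):

head -n 5 kraken2_kreport/SAMPLE.kreport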

With Kraken2, we have annotated the reads in our sample with taxonomy information. If we want to use this to investigate diversity metrics, we need to find the abundances of taxa in our samples. This is done with Kraken2's companion tool, Bracken (Bayesian Reestimation of Abundance with KrakEN).

Let's run Bracken on our Kraken2 outputs!
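
Because the command below writes into a bracken_out directory, make sure it exists first:

mkdir -p bracken_out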

parallel -j 2 --eta 'bracken -d ~/CourseData/MIC_data/tools/k2_standard_08gb -i {} -o bracken_out/{/.}.species.bracken -r 100 -l S -t 1' ::: kraken2_kreport/*.kreport

After all of this, we are almost ready to create some profiles from our samples! The last step is to put everything into a format that R can easily handle. One standard format is the .biom format, which is a recognized standard for the Earth Microbiome Project and is supported by the Genomics Standards Consortium.

To transform our data, we will use kraken-biom, a tool which takes the .kreport files output by bracken and creates a new, combined file in the .biom format. This tool can also incorporate our sample metadata, and it is easier to merge this information now versus later.

First, we will have to copy the metadata file to our working directory:

cp ~/CourseData/MIC_data/AMB_data/amb_module1/mgs_metadata.tsv .

Let's have a look at this file; try reading it with cat:
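
cat mgs_metadata.tsv

If the column utility is available on your instance, it can also align the fields for easier reading:

column -t mgs_metadata.tsv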

Now that we have our metadata, let's run kraken-biom and merge our samples to one organized output file:

kraken-biom kraken2_kreport/*bracken_species.kreport -m mgs_metadata.tsv -o mgs.biom --fmt json

  • The inputs are a positional argument, and kraken-biom will accept lists. We can match all of our desired bracken-corrected .kreport files with the regex kraken2_kreport/*bracken_species.kreport.
  • the -m option is for specifying our metadata file. Note that kraken-biom is picky about the order of the metadata: the entries should be in the same order in which your files are found, which is typically alphanumeric/dictionary order.
  • the -o option specifies our output file.
  • the --fmt option specifies that we want the output to be in json format, which is not the default behavior of the program. JSON is a text-based version of the BIOM format, as opposed to HDF5 which is a binary file format.

Now, with all of that, we should have our final mgs.biom file ready to go! You can check out what this file looks like (it's a single line so head will not work here). It can be cumbersome to look at, but there are patterns. Fortunately, R is much better at reading these files than we are!
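
Since head works on lines, one way to peek anyway is by bytes; a minimal sketch:

head -c 300 mgs.biom; echo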
