Skip to content

Metagenomics (IMPACTT December 2022) Answers

Robyn Wright edited this page Dec 5, 2022 · 7 revisions

These are answers for the 2022 IMPACTT bioinformatics workshop running from May 30th - June 1st.

Authors: Robyn Wright and Morgan Langille

This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.

The answers shown here are for the Metagenomic Taxonomic Composition Tutorial and the Metagenomic Functional Composition Tutorial.

Taxonomic composition

Question 1: Based on the above line counts can you tell how many reads should be in the reverse FASTQs? Remember that a standard convention is to mark the reverse FASTQs with the characters R2

The number of reads should be the same in the reverse FASTQs as it is in the forward FASTQs.

Question 2: Take a look at kneaddata_read_counts.txt: how many reads in sample CSM7KOMH were dropped?

You can look at the number of reads in the "final pair1" or "final pair2" columns for this - they should show 24,007 reads, meaning that 993 reads were dropped.

Question 3: It's reasonable that the processed data was provided by IBDMDB since their goal is to provide standardized data that can be used for many projects. However, why is it important that normally all raw data be uploaded to public repositories?

There are a few different reasons that it's important that all raw data be uploaded to repositories, primarily that someone else should be able to verify your results by repeating exactly what you did, but also the best standards for processing sequencing data are constantly evolving and someone else may want to look at something different in those samples. For example, you may have looked at only the bacterial reads in your samples, but someone else may be interested in the archaeal, fungal or viral reads.

Question 4: What does the text {1/.}.kraken.txt and {1/.}.kreport do in the above command? (Hint look at our above discussion of parallel).

It removes the directory (cat_reads/) and file suffix (.fastq) from the file names, and also adds on .kraken.txt or .kreport to these file names.

Question 5: How many more reads in each sample are classified using the full RefSeq database (RefSeqCompleteV205) than with the minikraken database (with no confidence threshold)? *Hint: you can use the less command to look at both the .kreport and the .kraken.txt files.

For example, if we look at the CSM7KOMH_0.0_minikraken.kreport and CSM7KOMH_0.0_RefSeqCompleteV205.kreport files, we can see that with the minikraken database, 33,117 reads or 68.97% of reads were classified, while with the RefSeqCompleteV205 database, 49,463 reads or 98.98% of reads were classified.

Question 6: How many more reads in each sample are classified using a confidence threshold of 0.0 compared with a confidence threshold of 0.1 for the minikraken database?

For example, if we look at the same samples but with a confidence threshold of 0.1 (CSM7KOMH_0.1_minikraken.kreport and CSM7KOMH_0.1_RefSeqCompleteV205.kreport), then we find that 26,558 reads or 55.31% of reads were classified with the minikraken database, while 49,409 reads or 98.87% of reads were classified with the RefSeqCompleteV205 database. So we can see that there was a fairly large decrease in the number of reads classified with the minikraken database when we went from a confidence threshold of 0.0 to 0.1, while with the RefSeqCompleteV205 database, this difference was very small. As you can see, we need to take into account both the database and the confidence threshold when choosing which to use for your samples. If you want to read more about this, you can do so in our preprint.

Question 7: Take a look at the raw output for the same sample but run with a confidence threshold of 0.1. Was the second read classified? Why or why not?

In this example, you should see that the second read with a confidence threshold of 0.0 reads: C CB1APANXX170426:1:1101:10001:37312/1 Bacteroides (taxid 816) 101 0:57 816:5 0:4 816:1 And with a confidence threshold of 0.1 reads: U CB1APANXX170426:1:1101:10001:37312/1 unclassified (taxid 0) 101 0:57 816:5 0:4 816:1

If we calculate the confidence as we had before for the taxid 816, which the read is initially classified as, then this gives us (5+1)/(57+5+4+1) = 6/67 = 0.089, so this is below the confidence threshold of 0.1. Because the other k-mers within the read were unclassified, this read becomes unclassified when we set the confidence threshold to 0.1.

Question 8: How would you go about figuring out the total number of species we found within our ten different samples?

You can simply count the number of lines within the Bracken summary files, like so:

wc -l bracken_out_merged/merged_output_minikraken.species.bracken
wc -l bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken

Remember that these files also have a header row, so we'll need to subtract one from each, giving 415 species in the MiniKraken-classified samples and 2,381 species in the RefSeqCompleteV205-classified samples.

Question 9: How does the number of species classified by each of the confidence thresholds differ, and how does it differ between the minikraken and the RefSeqCompleteV205 databases?

You can use some commands like this to see the number of species classified within each file, in different groups at a time:

wc -l bracken_out/*minikraken*
wc -l bracken_out/*RefSeqCompleteV205*
wc -l bracken_out/*0.0_minikraken*
wc -l bracken_out/*0.1_minikraken*
wc -l bracken_out/*0.0_RefSeqCompleteV205*
wc -l bracken_out/*0.1_RefSeqCompleteV205*

Question 10: Have a look at the figures and see what differences you notice in the abundance profiles between: (A) the confidence thresholds of 0.0 and 0.1; (B) the CD and nonIBD samples; and (C) the MiniKraken and RefSeq Complete V205 databases.

Assembly

Clone this wiki locally