Metagenomics (IMPACTT December 2022) Answers
These are answers for the December 2022 IMPACTT bioinformatics workshop.
Authors: Robyn Wright and Morgan Langille
This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.
The answers shown here are for the Metagenomic Taxonomic Composition Tutorial and the Metagenomic Functional Composition Tutorial.
Question 1: Based on the above line counts, can you tell how many reads should be in the reverse FASTQs? Remember that a standard convention is to mark the reverse FASTQs with the characters R2.
The number of reads should be the same in the reverse FASTQs as it is in the forward FASTQs.
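If you want to confirm this yourself, each FASTQ record spans four lines, so dividing the line count by four gives the read count. A minimal sketch (the raw_data/ path is an assumption; point it at wherever your FASTQs live, and pipe gzipped files through zcat instead of reading them directly):
for fastq in raw_data/*_R2*.fastq; do
    # reads = number of lines / 4 (header, sequence, +, quality)
    echo "$fastq: $(( $(wc -l < "$fastq") / 4 )) reads"
done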
Question 2: Take a look at kneaddata_read_counts.txt: how many reads in sample CSM7KOMH were dropped?
You can look at the number of reads in the "final pair1" or "final pair2" columns for this - they should show 24,007 reads, meaning that 993 reads were dropped.
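If the raw table is hard to read in less, one option (assuming the table is tab-separated, as KneadData count tables normally are) is to line the columns up first so the "final pair1"/"final pair2" columns are easy to read off:
# -t builds aligned columns, -s sets tab as the input separator
column -t -s $'\t' kneaddata_read_counts.txt | less -S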
Question 3: It's reasonable that the processed data was provided by IBDMDB since their goal is to provide standardized data that can be used for many projects. However, why is it important that normally all raw data be uploaded to public repositories?
There are a few reasons why it's important for all raw data to be uploaded to public repositories. The main one is that someone else should be able to verify your results by repeating exactly what you did, but the best standards for processing sequencing data are also constantly evolving, and someone else may want to look at something different in those samples. For example, you may have looked at only the bacterial reads in your samples, but someone else may be interested in the archaeal, fungal or viral reads.
Question 4: What does the text {1/.}.kraken.txt and {1/.}.kreport do in the above command? (Hint: look at our above discussion of parallel.)
It removes the directory (cat_reads/) and the file suffix (.fastq) from the file names, and appends .kraken.txt or .kreport to these file names.
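You can see the substitution for yourself by echoing it for one of the input files (using the CSM7KOMH sample from this tutorial as an example):
# {1/.} takes the first input argument and strips its directory and its
# extension; .kraken.txt / .kreport are then appended to the result
parallel echo {1/.}.kraken.txt {1/.}.kreport ::: cat_reads/CSM7KOMH.fastq
# prints: CSM7KOMH.kraken.txt CSM7KOMH.kreport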
Question 5: How many more reads in each sample are classified using the full RefSeq database (RefSeqCompleteV205) than with the minikraken database (with no confidence threshold)?
Hint: you can use the less command to look at both the .kreport and the .kraken.txt files.
For example, if we look at the CSM7KOMH_0.0_minikraken.kreport and CSM7KOMH_0.0_RefSeqCompleteV205.kreport files, we can see that with the minikraken database, 33,117 reads or 68.97% of reads were classified, while with the RefSeqCompleteV205 database, 49,463 reads or 98.98% of reads were classified.
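One quick way to pull these numbers out (the kraken2_kreport/ directory name is an assumption; use wherever your reports were written):
# the first line of a Kraken2 report gives the unclassified read count and
# percentage, and the second ("root") line gives the classified count
head -n 2 kraken2_kreport/CSM7KOMH_0.0_minikraken.kreport
head -n 2 kraken2_kreport/CSM7KOMH_0.0_RefSeqCompleteV205.kreport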
Question 6: How many more reads in each sample are classified using a confidence threshold of 0.0 compared with a confidence threshold of 0.1 for the minikraken database?
For example, if we look at the same sample but with a confidence threshold of 0.1 (CSM7KOMH_0.1_minikraken.kreport and CSM7KOMH_0.1_RefSeqCompleteV205.kreport), we find that 26,558 reads or 55.31% of reads were classified with the minikraken database, while 49,409 reads or 98.87% of reads were classified with the RefSeqCompleteV205 database. So there was a fairly large decrease in the number of reads classified with the minikraken database when we went from a confidence threshold of 0.0 to 0.1, while with the RefSeqCompleteV205 database this difference was very small. As you can see, you need to take into account both the database and the confidence threshold when choosing what to use for your samples. If you want to read more about this, you can do so in our preprint.
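To compare all of the reports at once rather than opening them one by one, you could pull out just the "root" (classified) line from each file (again assuming the reports are in kraken2_kreport/):
# -H prints the file name in front of each matching line
grep -H $'\troot$' kraken2_kreport/*.kreport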
Question 7: Take a look at the raw output for the same sample but run with a confidence threshold of 0.1. Was the second read classified? Why or why not?
In this example, you should see that the second read with a confidence threshold of 0.0 reads:
C CB1APANXX170426:1:1101:10001:37312/1 Bacteroides (taxid 816) 101 0:57 816:5 0:4 816:1
And with a confidence threshold of 0.1 reads:
U CB1APANXX170426:1:1101:10001:37312/1 unclassified (taxid 0) 101 0:57 816:5 0:4 816:1
If we calculate the confidence as we had before for the taxid 816, which the read is initially classified as, then this gives us (5+1)/(57+5+4+1) = 6/67 = 0.089, so this is below the confidence threshold of 0.1. Because the other k-mers within the read were unclassified, this read becomes unclassified when we set the confidence threshold to 0.1.
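If you want to check this arithmetic without a calculator, here is a small sketch that sums the k-mer counts from the final column of the raw output. Note that this just reproduces the calculation above (Kraken2's actual confidence score counts k-mers assigned to the whole clade rooted at the taxon, which here is the same thing because taxid 816 is the only classified taxon in the string):
# sum the k-mers assigned to taxid 816 and divide by the total k-mers
echo "0:57 816:5 0:4 816:1" | awk '{
    for (i = 1; i <= NF; i++) {
        split($i, kv, ":")
        total += kv[2]
        if (kv[1] == 816) hits += kv[2]
    }
    printf "confidence = %d/%d = %.2f\n", hits, total, hits/total
}'
# prints: confidence = 6/67 = 0.09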
Question 8: How would you go about figuring out the total number of species we found within our ten different samples?
You can simply count the number of lines within the Bracken summary files, like so:
wc -l bracken_out_merged/merged_output_minikraken.species.bracken
wc -l bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken
Remember that these files also have a header row, so we'll need to subtract one from each, giving 415 species in the MiniKraken-classified samples and 2,381 species in the RefSeqCompleteV205-classified samples.
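You can also let the shell skip the header for you (these are the same merged files as above):
# tail -n +2 drops the header row before counting
tail -n +2 bracken_out_merged/merged_output_minikraken.species.bracken | wc -l
tail -n +2 bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken | wc -l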
Question 9: How does the number of species classified by each of the confidence thresholds differ, and how does it differ between the minikraken and the RefSeqCompleteV205 databases?
You can use commands like these to see the number of species classified within each file, grouping the files in different ways:
wc -l bracken_out/*minikraken*
wc -l bracken_out/*RefSeqCompleteV205*
wc -l bracken_out/*0.0_minikraken*
wc -l bracken_out/*0.1_minikraken*
wc -l bracken_out/*0.0_RefSeqCompleteV205*
wc -l bracken_out/*0.1_RefSeqCompleteV205*
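Remember that each of these files also includes a header row. If you want the header-corrected counts per file, a small sketch (the .bracken extension is an assumption; adjust the pattern to match your file names):
for f in bracken_out/*.bracken; do
    # subtract the single header line from the line count
    echo "$f: $(( $(wc -l < "$f") - 1 )) species"
done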
Question 10: Have a look at the figures and see what differences you notice in the abundance profiles between: (A) the confidence thresholds of 0.0 and 0.1; (B) the CD and nonIBD samples; and (C) the MiniKraken and RefSeq Complete V205 databases.
Question 1: How many contigs are there in the megahit_out/final.contigs.fa file?
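Each contig begins with a FASTA header line, so counting the header lines gives the number of contigs:
# count lines starting with ">" (one per contig)
grep -c "^>" megahit_out/final.contigs.fa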
Question 2: How many reads from sample XX were mapped to the contigs?
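One way to answer this, assuming the mapping step produced a BAM file for the sample (the mapping/SAMPLE.bam path is hypothetical; substitute the file your workflow actually created), is to look at the mapping statistics with samtools:
# flagstat reports the number and percentage of mapped reads
samtools flagstat mapping/SAMPLE.bam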
Question 3: How many fasta files/bins are there in the fasta_bins folder?
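Assuming the folder contains one FASTA file per bin, counting the files answers this:
# each file in fasta_bins/ corresponds to one bin
ls fasta_bins/ | wc -l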
Question 4: How many contigs are there in the largest fasta file?
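A sketch for this one (the .fa extension is an assumption; your bins may use .fasta instead):
# ls -S sorts by file size, largest first; then count that bin's contigs
largest=$(ls -S fasta_bins/*.fa | head -n 1)
grep -c "^>" "$largest"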
Question 5: If you were keeping only the MAGs with >90% completeness and <10% contamination/redundancy, how many MAGs would you be left with? -> 20 How about if you kept the MAGs with >50% completeness and <10% contamination/redundancy? -> 41
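If you would rather filter a completeness/contamination table than count by eye, a hedged sketch (the checkm_summary.tsv file name and the column numbers 12 and 13 are assumptions; check the header of your own table first):
# keep MAGs with >90% completeness and <10% contamination, then count them
awk -F'\t' 'NR > 1 && $12 > 90 && $13 < 10' checkm_summary.tsv | wc -l
# the same filter with the more relaxed >50% completeness threshold
awk -F'\t' 'NR > 1 && $12 > 50 && $13 < 10' checkm_summary.tsv | wc -l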
Question 6: Take a look at the most complete MAGs and the Kraken2/Bracken output. Do the most complete MAGs make sense to you based on the community profiles output by Kraken2/Bracken? Are there any taxa missing? If there are, why do you think this might be?
Question 7: Are there any MAGs with unusually small or large genome sizes? How do you think the completion or contamination/redundancy contributes to this?
- Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].