-
Notifications
You must be signed in to change notification settings - Fork 204
CBW‐ICG‐AMB‐Module3
This tutorial is part of the 2024 ICG Advanced Microbiome Analysis workshop (July 11-12 2024). It is modified from the CBW Advanced Microbiome Analysis (held in St John's, NL, May 27-28).
Author: Robyn Wright
Introduction
3.1 Copy in the MMseqs data we will be using
3.2 Run MMseqs
3.3 Get MMseqs top hits
3.4 Combine Kraken taxonomy and MMseqs function, normalise, and reformat
3.5 Visualise with JarrVis
3.6 Run HUMAnN
3.7 Combine HUMAnN output
3.8 Visualise HUMAnN output
3.9 AMR annotation of reads using CARD RGI
3.10 Answers\
The main goal of this tutorial is to introduce students to functional profiling of taxonomic reads using MMseqs and HUMAnN and to visualise the results of both the taxonomic and functional annotations. As in the taxonomic composition workshop (AMB module 1), we want to emphasize that there is not a one-size-fits-all pipeline for analyzing MGS data. Each of MMseqs and HUMAnN are typically used for general functional annotation, while CARD RGI is used specifically for AMR annotation of reads.
Each of the MMseqs, HUMAnN, and CARD RGI parts of this tutorial can stand alone, so feel free to start with whichever of these you feel will be most useful to you. Note that you should run step 3.1 first regardless of which tool you want to start with.
Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them.
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. We'll be using MMseqs to assign functions to our samples on a read-by-read basis by mapping them to the UniRef90 protein database, which allows us to link the function with the taxonomy that we've obtained from Kraken2 (although MMseqs can also be used for taxonomy assignment). MMseqs2 works by taking sequenced reads, translating them into protein and then mapping them against this protein database (in this case, UniRef90, a large protein database clustered at 90% identity).
HUMAnN3 (HMP Unified Metabolic Analysis Network) is a tool for profiling the presence/absence and abundance of microbial pathways in a community from metagenomic (or metatranscriptomic) sequencing data. HUMAnN3 works by: (1) identifying the species in the samples using MetaPhlAn, (2) mapping these reads to pangenomes of the species using Bowtie2, and (3) aligning the reads that could not be mapped to the pangenomes to a protein database (usually UniRef50) with DIAMOND.
As we've done previously, we'll start by activating the conda environment and creating a directory to use:
conda activate functional
cd ~/workspace
mkdir amb_module3
cd amb_module3
Then, we'll create symlinks to some of the files that we'll be using:
ln -s ~/CourseData/AMB_data/raw_data/ .
ln -s ~/workspace/amb_module1/cat_reads/ .
ln -s ~/CourseData/AMB_data/scripts/ .
ln -s ~/CourseData/AMB_data/MMSeqs2_db/ .
ln -s ~/CourseData/AMB_data/amb_module1/mgs_metadata.tsv .
And we'll unzip the MMSeqs database that we're using:
tar -xvf MMSeqs2_db/MMSeqs2_db.tar.gz
If this didn't work, you can download each of these files separately:
COGDBGeneLength.pbz2 mmseqsUniref90DB.dbtype mmseqsUniref90DB_h.index RefGeneDescription.pbz2 RsUpmap.pbz2
COGECmapped.pbz2 mmseqsUniref90DB_h mmseqsUniref90DB.index RefGeneLength.pbz2 UnirefGeneLength.pbz2
mmseqsUniref90DB mmseqsUniref90DB_h.dbtype mmseqsUniref90DB.lookup RsECmapped.pbz2 UpECmapped.pbz2
They should all be downloaded into a folder called MMSeqs2_db
, so this would probably look something like this:
mkdir MMSeqs2_db
cd MMSeqs2_db
wget http://kronos.pharmacology.dal.ca/public_files/tutorial_data/MMSeqs2_db/COGDBGeneLength.pbz2 http://kronos.pharmacology.dal.ca/public_files/tutorial_data/MMSeqs2_db/COGECmapped.pbz2
And this should be repeated for each of the files above. You can then change back into the main directory ~/workspace/amb_module3
.
You will notice that we'll be using the reads that we concatenated in the first module as well as the metadata and the MMSeqs database information.
Now, we'll start running MMseqs2. Note that these commands can actually all be combined for each sample, but so that we can see and understand what's going on, we're going to run each of them separately.
First, make a directory to store the output:
mkdir mmseqs_U90_out
Now, we'll use parallel
to create databases for all of our sample files:
parallel -j 8 --progress 'mmseqs createdb {} mmseqs_U90_out/mmseqs-{/.}-queryDB' ::: cat_reads/*
This command creates an MMseqs database from the the input fastq file. The creation of this database is necessary for MMseqs as it vastly increases the speed at which translated DNA sequences can be mapped against a protein database.
Next, we'll actually run the searches with MMseqs:
parallel -j 2 --progress 'mmseqs search mmseqs_U90_out/mmseqs-{/.}-queryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-{/.}-resultDB tmp --db-load-mode 3 --threads 4 --max-seqs 25 -s 1 -a -e 1e-5' ::: cat_reads/*
This command is the real meat of the job file and runs the freshly created sample database against the provided UniRef90 protien database. There are a number of parameters in this command:
-
--db-load-mode 3
- This parameter tells MMseqs how to deal with loading the database into memory. For more information you can check out this page. However, setting this parameter to 3 helps when running MMseqs on a cluster environment. -
--threads
- The number of processors we want MMseqs to use during the search -
--max-seqs 25
- This indicates that we want MMseqs to output at maximum 25 hits for each sequence -
-s 1
- This indicates the sensitivity that we want MMseqs to run at. Increasing this number will lower the speed at which MMseqs runs but will increase its sensitivity. For well-explored environments such as the human gut, a setting of 1 should suffice. -
-a
- This indicates that we want our results to output backtraces for each sequence match. These are needed to convert the resulting MMseqs file into a usable file format. -
-e 1e-5
- This indicates that we only want to keep matches that are below an E-value of 1e-5 (E-values are a measure of how well two sequences match one another, and the closer they are to zero, the better the match is). -
> /dev/null 2>&1
- We could add this part to the end of the command if we wanted to run the command without having too much text printed to our screen.
Got an error message?? Or it's taking a really long time?
We actually unfortunately don't have enough memory on these servers to run this command. If you haven't yet got an error message, you can stop this command with ctrl
+c
.
We'll just delete any files that we could have made if you ran that command, so that we don't confuse any further steps:
rm mmseqs_U90_out/*resultDB*
Note that you may get an error saying that there's no such file or directory. That's fine! You can't remove files that don't exist.
Copy over the output that we would have got from this command if we could run it:
wget http://kronos.pharmacology.dal.ca/public_files/tutorial_data/mmseqs_U90_out.tar.gz
tar -xvf mmseqs_U90_out.tar.gz
And now run the final command that allows us to convert the resulting file from the MMseqs2 format into one that is more usable:
parallel -j 2 --progress 'mmseqs convertalis mmseqs_U90_out/mmseqs-{/.}-queryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-{/.}-resultDB mmseqs_U90_out/mmseqs-{/.}-s1.m8 --db-load-mode 2' ::: cat_reads/*
This command is similar and takes as input the query database we made from our first command, the UniRef90 database we searched against and the resulting file from our search command. It will output the files mmseqs_U90_out/mmseqs-*-s1.m8
.
Again, if we didn't want to print the output of this then we could add > /dev/null 2>&1
to the end of the command.
This command will take a few minutes to run, so it's a good time for a break if you'd like one!!
Now, we'll move these *.m8
files to a new folder:
mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/
Let's take a quick look at one of the files we just moved into the directory mmseqs_m8_files using the less command:
less mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8
We you will see is a file in BLAST tabular format:
Column Number | Data Type |
---|---|
0 | query sequence ID |
1 | Subject (database) sequence ID |
2 | Percent Identity |
3 | Alignment Length |
4 | Number of gaps |
5 | Number of mismatches |
6 | Start on the query sequence |
7 | End on the query sequence |
8 | Start on the database sequence |
9 | End on the database sequence |
10 | E value - the expectation that this alignment is random given the length of the sequence and length of the database |
11 | bit score - the score of the alignment itself |
Use grep
to look at the matches for sequence CB1APANXX170426:1:1101:10000:16732/1
in mmseqs-CSM7KOMH-s1.m8
.
Question 1: How many matches are there?
Question 2: What is the highest percent identity match?
Question 3: What is the lowest E-value match (smallest value = best match)?
The next step we need to take is to get the name of the protein sequence that had the best alignment for each sequence read in our samples. We can achieve this by running the command:
mkdir mmseqs_U90_out_tophit
python3 scripts/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files --output_path mmseqs_U90_out_tophit
Now that we have the best protein sequence that matches best with each of the sequences in our samples we can begin creating our final data table containing the stratified abundance of each function in our samples. Because we (in the Langille lab) are continually updating some of these scripts, and the ones that we are using are ones that we have only recently developed, there are a few different steps that we'll take to get the files in the format that we want them to be in. Often as we develop things in bioinformatics, we'll try out a lot of different things, and as the protocols that we use things mature we can consolidate them into a single script. We have just about reached that point here, but we haven't consolidated things yet so we'll run these few steps.
We're going to match these to the full kraken databases, so we'll create a link to this output:
ln -s ~/CourseData/AMB_data/kraken2_outraw_RefSeqCompleteV205/ .
While we don't have that many files and could just make a file by hand, it's good practice to make these files with scripts for when we're working on our own data and may have hundreds of samples!
We'll be using Python for this, and we can open up Python (version 3) by typing in python3
and pressing enter. You should see some information about the Python version print out, and a new command prompt with >>>
pop up. This shows us that we're now using Python rather than bash.
Next, we'll import the package that we'll use:
import os
Note: You'll need to press enter after each line to make sure that it runs!
Now, we'll make a list of our samples:
samples = os.listdir('cat_reads/') #this command creates a list of the files in cat_reads/
samples = [s.split('.')[0] for s in samples] #and this command uses a for loop to get only the sample names - the part of the file names before the '.'
Print out the list to see what's in it:
print(samples)
Question 4: What's in the list?
Now we'll set up a few variables so that we don't need to keep typing them out:
kraken_path = 'kraken2_outraw_RefSeqCompleteV205/'
kraken_suffix = '_0.1_RefSeqCompleteV205.kraken.txt'
mmseqs_path = 'mmseqs_U90_out_tophit/mmseqs-'
mmseqs_suffix = '-s1.m8-parsed.txt'
mmseqs_m8_path = 'mmseqs_m8_files/mmseqs-'
mmseqs_m8_suffix = '-s1.m8'
And then we'll create a new file (multi-sample-outfiles-w-m8.txt
) and loop through the samples adding them to the new file:
with open('multi-sample-outfiles-w-m8.txt', 'w') as f:
for sample in samples:
string = sample+'\t'
string += kraken_path+sample+kraken_suffix+'\t'+'kraken2'+'\t'
string += mmseqs_path+sample+mmseqs_suffix+'\t'+'uniref'+'\t'
string += mmseqs_m8_path+sample+mmseqs_m8_suffix+'\n'
s = f.write(string)
Make sure that you press enter to execute this command.
You'll see that for each sample, we're adding the sample name, the path to the kraken output for that file, the path to the mmseqs output for that file, and the path to the mmseqs m8 output for that file, each separated by a tab (\t
).
Now we can quit Python again by entering quit()
(and pressing enter).
If you take a look at this file that we just made, you should see all of your samples in there along with the outputs from Kraken and MMseqs.
Now that we have this master file we can pass this information into the helper script to add all of it together for our samples:
ln -s MMSeqs2_db/*.pbz2 .
python3 scripts/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf Workshop_strat_matrix_RPKM.txt --stratified Y --map2EC Y
If you get an error about not having a module (package) called numpy, what do you think you should do? Note that conda can be useful for installing packages, so googling "conda numpy install" might help! You can use the same command for installing pandas.
The first link command will grab all the databases that contain information about the length of each gene in our UniRef protein database. This will be important to normalize the abundance of each functional classification.
The second command will generate a final stratified data matrix that shows the abundance of each EC number stratified by the different taxonomic classifications within the sample. This script also normalizes the abundances of each of these ECS into reads per kilobase per million (RPKM). This abundance metric is the number of reads that mapped to that EC number per million reads mapped within the sample, divided by the gene length of that EC. It's important that the abundances are normalized by gene length or there would be an unfair bias toward longer genes due to the higher chance of them being sequenced. We can also run the same command as above without the --stratified Y
option to generate functional profiles that are not broken down by the contributing taxa.
This might take a while, it might say "killed", or it might just give you an error (although it runs elsewhere). Either way, we can copy the output if we need to, and take a look at it:
wget http://kronos.pharmacology.dal.ca/public_files/tutorial_data/Workshop_strat_matrix_RPKM.txt
less Workshop_strat_matrix_RPKM.txt
Question 5: Describe what's in the Workshop_strat_matrix_RPKM.txt
file.
Now we're going to be running a few different scripts that will get the output in the format that we're going to need.
First, we'll activate the PICRUSt2 conda environment:
conda activate picrust2
Then we'll open up Python again and use this to split the first column that currently contains EC numbers and taxon information into two separate columns:
import pandas as pd #import the pandas package, and instead of needing to type pandas each time, we'll just call it pd
file = pd.read_csv('Workshop_strat_matrix_RPKM.txt', header=0, sep='\t') #open up the file as a dataframe
samples = [s for s in file.columns if s != 'function'] #get all of the sample names
file[['EC','taxon']] = file['function'].str.split('|',expand=True) #split the first column named 'function' based on the '|' symbol
file = file.drop('function', axis=1).loc[:, ['EC', 'taxon']+samples] #remove the original 'function' column and reorder the others to have 'EC' and 'taxon' before the sample names
file.to_csv('Workshop_strat_matrix_RPKM_split.txt', index=False, sep='\t') #save this as a new file
Make sure to quit Python again.
Now, we can use one of the PICRUSt2 scripts to add descriptions for the EC numbers to this file:
add_descriptions.py -i Workshop_strat_matrix_RPKM_split.txt -o Workshop_strat_matrix_RPKM_split_des.txt -m EC
Question 6: Look at this new file. Any ideas how you might check that these descriptions are correct?
Currently, this file just contains the taxon names, but we're going to need the full taxonomy information for all of these taxa. The database that we've used for classification of the reads with Kraken is from NCBI, so this is where we'll get the most up to date information on these taxa. We'll now be making a data dictionary - a mapping from the NCBI taxonomy ID's to the full taxonomy information.
First, we'll make a new folder for the NCBI taxonomy information and change into it:
mkdir taxonomy
cd taxonomy
Now we'll download the taxonomy information and extract the information frm the tar file:
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xvf taxdump.tar.gz
There are a couple of key files that we're going to be using from this: nodes.dmp
and names.dmp
- you can see information on everything that's in these files in readme.txt
, but what we're interested in is the taxid (taxonomy ID) and parent taxid from nodes.dmp
, and the taxid and name in the names.dmp
file. What we'll be doing is going through each of these files and essentially formatting the data to be how we want to have it for each of these taxa, so that it'll be quick and easy for us to access and we won't need to open up these large files every time that we want to access it.
Change out of the taxonomy
directory: cd ..
, start up python
again, and run this code:
import pickle
nodes_dict = {}
names_dict = {}
nodes_level_dict = {}
First we are importing the pickle package - this allows us to easily save python objects like dictionaries - and then we're setting up some empty dictionaries. In some programming languages, different types of brackets will be used to show different types of objects. For example, in python, regular brackets ()
will be used for giving arguments to functions, square brackets []
are used for lists or accessing information contained in lists, and curly brackets {}
are used to create dictionaries.
Now, we'll get the information that we need from the nodes.dmp
file:
for line in open('taxonomy/nodes.dmp', 'r'):
this_line = line.replace('\t', '').split('|')
node, parent, level = this_line[0], this_line[1], this_line[2]
nodes_dict[node] = parent
nodes_level_dict[node] = level
Here, we were saying that for each line of the nodes.dmp
file:
- remove the tabs and split the remaining string by the
|
symbol - define the node, parent and level of this as the first, second and third field (note that python indexing starts at 0, not 1 like R and some other programming languages)
- add the parent taxid to the name of this taxid in the nodes_dict
- add which taxonomy rank/level this is at to the nodes_level_dict
Now we'll do similar for the names.dmp
file:
for line in open('taxonomy/names.dmp', 'r'):
this_line = line.replace('\t', '').split('|')
taxid, name, name_type = this_line[0], this_line[1], this_line[3]
if name_type == 'scientific name':
names_dict[taxid] = name
So again we're saying that for each line of the names.dmp
file:
- remove the tabs and split the remaining string by the
|
symbol - define the taxid, name and type of name as the first, second and fourth fields
- if the name type is the scientific name (the two == signs means that we're asking it if it matches, rather than defining the name_type as 'scientific name'). Other options here would include synonym, common name, equivalent name, type material... but we only want the scientific name for each taxonomy ID
- if it is the scientific name, then we'll save the name to the taxid in the names_dict
Now we'll reformat these dictionaries to a new dictionary full_taxonomy to be usable for us and have everything in the same place:
full_taxonomy = {}
for taxid in nodes_dict:
parent_id = nodes_dict[taxid]
tax_list = [names_dict[taxid]]
nodes_list = [taxid]
level_list = [nodes_level_dict[taxid]]
while parent_id in nodes_dict:
if parent_id in ['1', '0']:
break
tax_list.append(names_dict[parent_id])
nodes_list.append(parent_id)
level_list.append(nodes_level_dict[parent_id])
parent_id = nodes_dict[parent_id]
full_taxonomy[taxid] = [nodes_list, tax_list, level_list]
So here we're looping through all of the taxid's in the nodes_dict and we're:
- getting the parent taxid (and calling it parent_id)
- starting a list called tax_list where the first thing in the list is the scientific name that we saved in names_dict - we'll be adding to this with the names of the parent taxid's
- starting a list called nodes_list where the first thing is this taxid - we'll be adding the parent taxid's
- starting a list called level_list where the first thing is the name is this level - we'll be adding the names of all levels to this. These would be e.g. genus, species, etc.
- starting a while loop - while loops will continue indefinitely until a certain condition is met. You always need to be a little careful with them, because it's easy to code something that would run forever, but here we're just saying that we want to continue running something while the parent taxid (parent_id) is still in the nodes_dict
- next, we're making sure the loop will stop by saying that if the parent taxid is either '1' or '0' (which means either 'root' or 'unclassified', respectively), then the loop has to break
- if the parent taxid is still in the nodes_dict, then we'll add the name of the parent to the names_dict
- we'll add the parent taxid to the nodes list
- we'll add the name of the level of the parent to the level_list
- and finally, we'll change the parent taxid to be the parent of the current parent (i.e., we'll move up the list)
- after we've broken out of the while loop, we'll save the new lists that we've made to the full_taxonomy dictionary
Now we're ready to save the new dictionary that we've made:
with open('taxonomy.dict', 'wb') as f:
pickle.dump(full_taxonomy, f)
And we can quit python:
quit()
Next, we're going to add the full taxonomy information that we just made (or copied across), and we'll resave the file in the full stratified format that JarrVis needs.
First, start up python again: python
Now import the packages that we'll need and open up the file:
import pandas as pd
import pickle
file = pd.read_csv('Workshop_strat_matrix_RPKM_split_des.txt', header=0, sep='\t')
Open up the taxonomy dictionary that we made:
with open('taxonomy.dict', 'rb') as f:
full_taxonomy = pickle.load(f)
Make a new dictionary that contains the prefixes for each taxonomy level:
tax_levels = {'phylum':'p__', 'class':'c__', 'order':'o__', 'family':'f__', 'genus':'g__', 'species':'s__'}
Now we'll go through each row of this file and replace the existing taxon field with the full taxonomy information:
drop = []
for row in file.index.values:
tax = file.loc[row, 'taxon']
taxid = tax.split('taxid ')[1].replace(')', '')
try:
full_tax = full_taxonomy[taxid]
this_tax = []
for level in tax_levels:
if level in full_tax[2]:
index = full_tax[2].index(level)
name = full_tax[1][index]
this_tax.append(tax_levels[level]+name)
else:
this_tax.append(tax_levels[level]+'Unclassified')
this_tax = ';'.join(this_tax)
file.loc[row, 'taxon'] = this_tax
except:
drop.append(row)
file = file.drop(drop, axis=0)
In this for loop, we are:
- getting the name of the taxon from the dataframe (when we're accessing information from a dataframe like this, we always give the row name first and then the column name)
- getting the taxid of this taxon by splitting the string based on the 'taxid ' (and then removing the additional bracket)
- getting all of the taxonomy information for this taxid from our dictionary
- creating an empty list called this_tax
- going through each of the levels in the tax_levels dictionary
- checking whether the level is in our full taxonomy, and if it is:
- getting the index (location) in the list that it is)
- getting the name at this level
- adding the prefix and the name to the this_tax list
- else (if it's not in the full taxonomy)
- just add that this taxon is 'Unclassified' at this level (along with the prefix)
- create a string from the this_tax list that is separated by semi-colons
- rename the dataframe cell with the full taxonomy information
Finally, after this we're dropping the few taxonomy ID's that we didn't find in our database. Unfortunately, this often happens because new versions of the NCBI database make small changes to the taxonomy and drop ID's.
Now we can save this file:
file.to_csv('Workshop_strat_matrix_RPKM_split_des_full_tax.txt', sep='\t', index=False)
If you like, to help understand what the previous lines of code did, take a look at the previous file (Workshop_strat_matrix_RPKM_split_des.txt
) and this new one that we've just made (Workshop_strat_matrix_RPKM_split_des_full_tax.txt
).
Question 7: Can you work out what was added?
Now we're going to change the format of this file so that rather than have all samples on the same line for each function/taxon, each function/taxon/sample will have it's own line:
new_file = []
samples = [s for s in file.columns if s not in ['EC', 'description', 'taxon']]
for row in file.index.values:
for sample in samples:
this_row = [sample, file.loc[row, 'taxon'], file.loc[row, 'EC']+': '+file.loc[row, 'description'], file.loc[row, sample]]
new_file.append(this_row)
Here, we first made an empty list called new_file, and got the names of the samples. Then, we looped through each row of the dataframe and:
- for each sample:
- we made a new row/list (this_row) that contained the sample name, the name of the taxon, the name of the EC number, the description of the EC number, and the abundance value of that function/taxon/sample
- added this new row/list (this_row) to the new_file list
You should be able to tell that here we're going to have many more rows than we did previously as we're going to be multiplying the number of rows that we did have by the number of samples. An easy way to check that everything is as we hope it will be is to look at both the previous dataframe and the length of the new list:
print(file)
print(len(new_file))
Now we can convert this new list (new_file) to a dataframe called jarrvis_file (with the column names that JarrVis is expecting) and save it:
jarrvis_file = pd.DataFrame(new_file, columns=['Sample', 'Genus', 'Gene', 'Contribution'])
jarrvis_file.to_csv('Workshop_stratified_for_jarrvis.txt', sep='\t', index=False)
And then finally, we can quit python:
quit()
Now we're finally ready to run JarrVis to visualise our taxonomy-function information from MMseqs!
- First off, download
Workshop_stratified_for_jarrvis.txt
and getmgs_metadata.txt
:wget http://kronos.pharmacology.dal.ca/public_files/tutorial_data/mgs_metadata.txt
. Download both of these to your laptop. - Open up and modify the
mgs_metadata.txt
so that 'sample' says 'sample_id' (it's important that this matches exactly!) - If you already have RStudio (desktop) installed, go ahead and open it up. If not, download and install it from here, and then open it up.
- Open the console and type:
install.packages('shiny')
- it may ask you more questions, but you probably want to say yes/y to everything. - Type:
library(shiny)
- Type:
runGist("943ff5fdbd94815cc27f302d9f56ff0b")
- you should see some things running and then a box will pop up! - In "Upload stratified output File (TSV)", click browse and choose the
Workshop_stratified_for_jarrvis.txt
file - In "Upload Sample Metadata File (TSV)", click browse and choose the
mgs_metadata.txt
file - Click "Select Metadata Categories" - you should see the
Metadata Categories
box populate itself with "disease_state" (if you had more metadata categories in your metadata file, these would come up here for you to choose from) - Choose the taxonomy level to collapse at - you can choose whatever you'd like, but I went with genera
- Click "Select Taxa Categories" - you won't see anything change, but the next "Filter by Taxa" box now has all of the taxa. Click on "Select All" here.
- Click "Select Function Categories" - this is the same as for taxa, so now click on "Filter by Function" and "Select All" again.
- Now click on "Update the Gene Contribution Threshold from data". You can move up the bottom filter so that we don't plot all of the really low abundance stuff - I've found that to ~500 seems to be sensible for visualising everything.
- Now scroll to the bottom of the window and click "Display plot".
- Now scroll back up and you should see the functions in each group and be able to start exploring them. Try clicking on some of the links and seeing how they go back to the sample groupings.
Note: Sometimes weird things happen! If you get an error when you try to display the plot, try quitting RStudio (if it asks you if you want to save your workspace click never), re-opening it, and starting again from step 5 here.
If you're struggling with getting this to run, you can download the html output from JarrVis here (this is equivalent to saving rather than viewing the files in JarrVis).
Question 8: What can you see in the JarrVis output?
Now we can run HUMAnN. Note that you should still run step 3.1 to set up the directories and copy across the data that we'll be using.
For this, we'll be using the output from MetaPhlAn that we generated yesterday.
First, change directory and make the HUMAnN output folder:
cd ~/workspace/amb_module3
mkdir humann_out
Now we'll make the environment:
conda create --name biobakery3-new python=3.7
conda activate biobakery3-new
conda install humann -c biobakery
conda install conda-forge::parallel
We'll download the databases:
humann_databases --download chocophlan full CHOCOPhlAn --update-config yes
humann_databases --download utility_mapping full humann_databases/ --update-config yes
humann_databases --download uniref uniref90_diamond humann_databases/ --update-config yes
mkdir metaphlan_out
parallel -j 2 --eta --progress 'humann -i {} -o humann_out/ --threads 8 --protein-database humann_databases/uniref --taxonomic-profile ~/workspace/amb_module1/metaphlan_out/{/.}.mpa' ::: cat_reads/*.fastq
Note: This might take quite a while to run so you may want to use tmux
for this. Check back in the previous module if you've forgotten how to open this. Remember to activate the biobakery3-new
environment if you do start a new tmux session!
The options here are hopefully quite straightforward:
-
-i
- the input files -
-o
- the output folder to use -
--threads
- the number of threads to run with -
--taxonomic-profile
- the location of the MetaPhlAn taxonomic profiles. Note that this is optional - if we didn't include this option, MetaPhlAn would be run again (but it would run again using MetaPhlAn4, which apparently works better but requires more memory to run than we have here) -
--protein-database
- the location of the uniref database to use. This is also optional, but if we don't provide it and HUMAnN doesn't find it in the folder that it expects it to be, it will download this database again, which could take quite a long time to do.
Note that this might take a long time to run, so you can continue on to the section after question 9 while this runs if you like!
As this finishes with each of the samples, you can take a look at the output files if you like. The key files will be *_genefamilies.tsv
, *_pathabundance.tsv
and *_pathcoverage.tsv
for each sample, and you can look in the *_tmp
folders, too.
After they've all finished, you can run some HUMAnN scripts to join the tables for all of the samples together into one. We run this separately for each file type:
humann_join_tables -i humann_out -o HMP_humann_genefamilies.tsv --file_name genefamilies
humann_join_tables -i humann_out -o HMP_humann_pathabundance.tsv --file_name pathabundance
humann_join_tables -i humann_out -o HMP_humann_pathcoverage.tsv --file_name pathcoverage
Question 9: Have a look at these files. Do you see any issues with them?
I've also run HUMAnN on the full samples (not sub-sampled), and you can download that like this:
wget -O HMP_humann_pathabundance_full.tsv https://drive.usercontent.google.com/download?id=1KVMz0JWC4Q3F70QFJ_ZoGMyJqlvwMKOP
As well as downloading a version that I modified to have the metadata as an additional line just under the header here:
wget -O HMP_humann_pathabundance_full_metadata.tsv https://drive.usercontent.google.com/download?id=1uxHb3-xLmyOD7Qgdtk-RMWmhW8OcffGG
Question 10: Does this look more reasonable to you?
HUMAnN has some built-in functions for plotting this output, so we'll run some of the pathways like so:
mkdir humann_plots
parallel -j 1 'humann_barplot --input HMP_humann_pathabundance_full_metadata.tsv --focal-metadata disease_state --last-metadata disease_state --output humann_plots/{}.png --focal-feature {} --sort sum metadata --scaling logstack' ::: PWY-7221 PEPTIDOGLYCANSYN-PWY PWY-1269 PWY-6277 FUCCAT-PWY P164-PWY PWY-7013 HSERMETANA-PWY P163-PWY
Note: these pathways were chosen more-or-less at random to show a few different patterns. Feel free to look through the file and run this with some other pathways, e.g.,:
humann_barplot --input HMP_humann_pathabundance_full_metadata.tsv --focal-metadata disease_state --last-metadata disease_state --output humann_plots/CALVIN-PWY.png --focal-feature CALVIN-PWY --sort sum metadata --scaling logstack
humann_barplot --input HMP_humann_pathabundance_full_metadata.tsv --focal-metadata disease_state --last-metadata disease_state --output humann_plots/ANAEROFRUCAT-PWY.png --focal-feature ANAEROFRUCAT-PWY --sort sum metadata --scaling logstack
Now take a look through these plots so that you can see how the functions are distributed among the taxa, and how this varies for different functions.
Question 11: Is this the same as you thought it would be based on the MMseqs output?
We can use the Comprehensive Antibiotic Resistance Database (CARD) Resistance Gene Identifier (RGI) on reads as well as the contigs from MAGs. If you made it to this part in Module 2, then you already have the data that we need. If not, go to Section 2.10 of Module 2 and follow the instructions for downloading the databases.
If you already have the data, just activate the conda environment again:
conda activate rgi
Make the output directory and then change to the card_data
directory to run the RGI from:
mkdir card_out
cd ~/workspace/card_data/
Now run CARD RGI:
parallel -j 1 --link 'rgi bwt --read_one {1} --read_two {2} --aligner bowtie2 --output_file ~/workspace/amb_module3/card_out/{1/.}_CARD --threads 2 --local --clean' \
::: ~/workspace/amb_module3/raw_data/*_R1_subsampled.fastq.gz ::: ~/workspace/amb_module3/raw_data/*_R2_subsampled.fastq.gz
You'll notice that this tool wants the information about the forward and reverse reads separately, we've told it to use the Bowtie2 aligned, where to put the output files, to use the local database, and to remove temporary files.
Now we can change to the directory with the CARD output files and have a look at some of them:
cd ~/workspace/amb_module3/card_out
less HSMA33J3_R1_subsampled.fastq_CARD.allele_mapping_data.txt
less HSM6XRQY_R1_subsampled.fastq_CARD.overall_mapping_stats.txt
less HSM7J4QT_R1_subsampled.fastq_CARD.reference_mapping_stats.txt
less HSM7J4QT_R1_subsampled.fastq_CARD.gene_mapping_data.txt
You can see some details about the output files here.
For now, we'll just separate the *gene_mapping_data*
files from the others as these are what we're most interested in:
mkdir gene_mapping
mv *gene_mapping_data* gene_mapping/
mkdir other_read_files
mv *subsampled* other_read_files
cd ..
If you open up one of these files in Excel (or similar), you can have a look at all of the information that is in here. You'll see that there is a lot of information about the quality of the matches to the CARD database, including the number of completely mapped reads in the sample, the mapped reads with the flanking sequence (sequence found surrounding the gene sequence in the database), the average percent of the gene that is covered (along with the length of the reference sequence - based on these and our 100bp reads, it's not very surprising that these percentages are relatively low!), etc. We're often most interested in the number of reads that are mapped, but we might also want to perform some filtering based on some of these other parameters, and we can find this information on a read-by-read basis in some of the other files. There's also information in this file about the AMR gene families that each gene belongs to, as well as the drug class that they give resistance to and the mechanism of resistance.
How exactly we want to filter these results would depend on the application, but you can see that in a recent study that I contributed to, where we were looking for AMR genes in the biofilms growing on plastics that were in a river next to a wastewater treatment plant, we chose to: remove AMR genes with <10 reads per sample, normalise the AMR genes to the length of the reference AMR gene within the CARD database, normalise the number of reads within each sample to give reads per kilobase per million (RPKM), and then we removed the target point mutations for antibiotic resistance (e.g. gyrase and ribosomal mutations) due to the elevated background noise that can occur when including point mutation ARGs from metagenomic data.
For this tutorial, we won't worry too much about filtering the data because these are subsampled sequencing data files anyway, but we will combine the files together and look at the number of reads giving resistance to different drug classes.
Start up python again: python
Import the pandas and os packages:
import pandas as pd
import os
import matplotlib.pyplot as plt
You might notice that we've used the os
package a few times now - it contains a lot of basic functions for working with directories, listing the contents, making new ones, etc.
Set up an empty list for all of the results to go to, a dictionary for the drug class that each AMR gene gives resistance to, and get the samples within the gene_mapping
folder:
all_df = []
gene_drug_class = {}
samples = os.listdir('card_out/gene_mapping')
Now, we'll loop through the samples, opening the files, getting only the gene abundance data, and adding the resulting dataframes to the all_df list:
for sample in samples:
this_sample = pd.read_csv('card_out/gene_mapping/'+sample, index_col=0, header=0, sep='\t')
sample_name = sample.split('_')[0]
for row in this_sample.index.values:
if row not in gene_drug_class:
gene_drug_class[row] = this_sample.loc[row, 'Drug Class']
this_sample = this_sample.loc[:, ['All Mapped Reads']]
all_df.append(this_sample.rename(columns={'All Mapped Reads':sample_name}))
Now we'll combine all of the sample data frames (in the all_df list) into a single data frame:
amr_df = pd.concat(all_df, axis=1).fillna(value=0)
And then we'll rename these AMR genes based on the drug class that they confer resistance to:
amr_df = amr_df.rename(index=gene_drug_class)
Remember that you can always print things out by either just typing the name or them, or using the print()
function: print(amr_df)
And now, as we have a lot of duplicate rows, we'll group them all together:
amr_df = amr_df.groupby(by=amr_df.index, axis=0).sum()
Note that we're taking the sum of the reads here, but in many cases there might be reasons to use the mean/median/etc.
Usually we might want to normalise these counts by the number of reads in each sample. This isn't really so necessary when we started with the same number of reads in each sample (25,000 in each sample), but we can still do it anyway to convert to relative abundance (%):
amr_df = amr_df.divide(25000).multiply(100)
Now we'll just plot this as a heatmap:
fig = plt.figure(figsize=(20,20))
ax = plt.subplot(111)
ax.pcolor(amr_df, cmap='plasma', edgecolor='black')
I've chosen to color this using the plasma colormap (cmap), but you could choose any of the others here, or there are ways to define your own.
Then we can add some other things to this, like x and y tick labels, and numbers into the cells of the heatmap, and then save the figure:
xt = plt.xticks([a+0.5 for a in range(len(amr_df.columns))], amr_df.columns.values, rotation=90)
yt = plt.yticks([a+0.5 for a in range(len(amr_df.index))], amr_df.index.values)
max_val = max(amr_df.max(axis=1).values)
for r in range(len(amr_df.index.values)):
for c in range(len(amr_df.columns)):
val = amr_df.iloc[r, c]
if val < (max_val/2): fc = 'white'
else: fc = 'black'
tx = ax.text(c+0.5, r+0.5, str(round(val, 4)), ha='center', va='center', fontsize=10, color=fc)
plt.savefig('AMR_heatmap.png', dpi=600, bbox_inches='tight')
In the for loop here, we're getting the number that's in each cell of the data frame (looping through the rows and then the columns), then if the value is less than half of the maximum % relative abundance that we have, we're making the text colour (fc) white, and if it's not, then it'll be black. Then we're adding text to the right cell of the heatmap.
If you take a look at the heatmap, you'll see that just a few drug classes seem to be abundant, with lots of the others having very low percentage abundances.
Question 12: Are there any patterns that you can notice about the samples that seem to have higher relative abundances of AMR genes than others?
Use grep
to look at the matches for sequence CB1APANXX170426:1:1101:10000:16732/1
in mmseqs-CSM7KOMH-s1.m8
.
Question 1: How many matches are there?
Using grep -c "CB1APANXX170426:1:1101:10000:16732/1" mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8
we can see that there are 25 matches for that sequence.
Question 2: What is the highest percent identity match?
Using grep "CB1APANXX170426:1:1101:10000:16732/1" mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8
to print the matches, we can look at the third column to see the % identity matches. The first two matches (UniRef90_A0A016DTE9 and UniRef90_J9H7P3) are both 1.00 (or 100%).
Question 3: What is the lowest E-value match (smallest value = best match)?
Using grep
as above for question 2, and looking at the E-values in the second to last column, we can see that the lowest values are again for the top two sequences (5.517E-15). The closer to 0 the E-values are, the better.
Question 4: What's in the list?
Your output should look like this: ['MSM9VZHR', 'HSMA33KE', 'HSM7J4QT', 'MSMB4LXW', 'MSM79HA3', 'CSM7KOMH', 'PSM7J18I', 'HSM6XRQY', 'HSMA33J3', 'CSM79HR8']
You can see that this is a list of sample names - you can tell that it's a list because it's enclosed in square brackets ([]
).
Question 5: Describe what's in the Workshop_strat_matrix_RPKM.txt
file.
We can see that this file contains a feature table containing samples as columns and stratified taxa/functions as rows. The cells each contain the abundance converted to RPKM (Reads Per Kilobase per Million). You can see that the functions (EC numbers) and taxa are separated by the "|" symbol.
Question 6: Look at this new file. Any ideas how you might check that these descriptions are correct?
It's always a good idea to check that scripts are doing what we expect of them! I'd usually check a few of the EC numbers/descriptions on the KEGG website, e.g. for EC:6.1.1.7 - I just found this page by googling the EC number.
Question 7: Can you work out what was added?
We added in the full ranked taxonomy here. As above, it's always a good idea to check some of these. We can use the NCBI taxonomy website if we want to have a look. Try searching the taxon name that we had before and looking at the full taxonomy information.
Question 8: What can you see in the JarrVis output?
In the left side of JarrVis, we can see our two sample groups, CD (Crohn's Disease) and nonIBD. Then we can see links to taxa - g__Bacteroides (the Bacteroides genus) has the largest number of connections, and it has connections to both sample groups. On the right, we can see the functions with their descriptions, and we can see how these link back to the taxa and samples that contribute them. Hovering over each of the grey lines shows us the RPKM are in this taxon/function grouping (note: it might take a second after hovering sometimes for the information to pop up!).
Question 9: Have a look at these files. Do you see any issues with them?
We don't have very much that's well classified. This is because we're using very sub-sampled samples (vs the millions of reads that is typical for metagenome samples), and for useful functional annotation we really need higher sequencing depth than for taxonomy.
Question 10: Does this look more reasonable to you?
Yes! Now we have lots more functions present in our samples.
Question 11: Is this the same as you thought it would be based on the MMseqs output?
It's a bit difficult to tell now that we're looking at pathways and not EC numbers! If we wanted to find out, we could either look at the same functions for each, or we could look at the pathways that the EC numbers contribute to (or vice versa).
Question 12: Are there any patterns that you can notice about the samples that seem to have higher relative abundances of AMR genes than others?
We can see that HSM7J4QT seems to be higher in aminoglycoside antibiotic resistance than other samples, HSM6XRQY and MSMB4LXW in tetracycline antibiotic resistance, and CSM79HR8 in macrolide/lincosamide (sometimes called MLS) antibiotic resistance. These are in the CD, nonIBD, nonIBD and CD groups, so based on this cursory look alone it doesn't look like there are massive differences in AMR gene abundance based on metadata grouping, but we haven't done any statistical testing or looked at the individual genes yet so can't really draw conclusions based on this!
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].