Skip to content

CBW 2021 PICRUSt2 Tutorial

Morgan Langille edited this page Aug 27, 2021 · 61 revisions

This tutorial is part of the 2021 Canadian Bioinformatic Workshop.

Authors: Jacob Nearing and Morgan Langille

This tutorial is an updated version of the 2018 CBW Picrust2 tutorial original written by Gavin Douglas and Morgan Langille.

Table of Contents

Introduction

PICRUSt2 is a tool that predicts the abundance of gene families and higher-level pathways present in a microbial community based on amplicon sequencing data. This tutorial will demonstrate each major step in the PICRUSt2 pipeline. Throughout the tutorial there will be questions that are meant to aid understanding. The answers are on this page, which you might want to open in a different tab.

For this lab we will be processing 16S sequencing of ileum and rectum biopsy samples from the Inflammatory Bowel Disease Multi'omics Database. Although shotgun metagenomics is becoming cheaper it typically isn't done with biopsy samples since it requires an enormous amount of sequencing (due to the large amount of host DNA!). 16S sequencing doesn't suffer from this issue since only the 16S is amplified - there typically is no host contamination.

This tutorial was meant for PICRUSt v2.4.1.

Teaching objectives

  1. Run and understand major steps of PICRUSt2 workflow.
  2. Understand what files PICRUSt2 outputs and what the abundances in each one refer to.

Bioinformatic tool citations

Activating the conda environment

PICRUSt2 is already installed on each of the students instances in a conda environment, which is a tool for managing environments. This is helpful since sometimes different tools require different versions of packages as dependencies. It also allows users to easily install software with a simple command. The environment we'll be using for this tutorial is called picrust2.

You can activate the environment for this tutorial with this command:

conda activate picrust2

When you're finished this tutorial you can deactivate the environment with this command:

conda deactivate

Exploring input files

Two input files are required to run PICRUSt2: (1) a FASTA file of each study sequence and (2) the abundance of each study sequence in each sample. The workflow will work with any sequences that can be reliably placed in the reference tree (which is based on 16S rRNA genes [16S] by default), but typically users will input either the representative sequences of OTUs or amplicon sequence variants (ASVs). Think back to your earlier sessions on 16S rRNA sequencing to remind yourself what ASVs/OTUs are!

The first step will be to make a new directory for this tutorial so that we can keep all of our files organized. We will also download the required files for this tutorial using the wget command.

mkdir ~/workspace/picrust2_tut
cd ~/workspace/picrust2_tut
mkdir input_files
wget https://www.dropbox.com/s/cjqy61z4g1pylbz/ASV_abun.tsv?dl=1 -O input_files/ASV_abun.tsv
wget https://www.dropbox.com/s/j865g782dc3t96g/ASVs.fna?dl=1 -O input_files/ASVs.fna
wget https://www.dropbox.com/s/mhly9pf53vsfx7l/picrust2_lab_metadata.tsv?dl=1 -O input_files/picrust2_lab_metadata.tsv

The above wget commands will have downloaded three files.

The first file ASV_abun.tsv represents each sample as a column (sample ids are in first row), each ASV as a row (ASV ids are in first column), and the numbers in the table as counts of ASVs per sample.

View the ASV_abun.tsv file with the less command and the option -S which turns off line wrapping (use the left and right arrow keys to scroll to see the entire file):

less -S input_files/ASV_abun.tsv

You can also see how many lines there in the sequence table with this command:

wc -l input_files/ASV_abun.tsv

There are 100 sequences in this table (the first line is the header!)

The second file is named picrust2_lab_metadata.tsv. This file contains all the metadata for the samples of interest. We will use this file along with some pre-made Rscripts to visualize how our predictions differ between Rectum and Ileum samples from individuals with CD and nonIBD.

The final file is named ASVs.fna. This file contains all of the representative sequences for the ASVs in our ASV abundance table. You will notice that this file ends in .fna and not .fasta. You may encounter this on your own but do not worry as .fna format is just another file identifier for FASTA formatted files.

Question 1: How many samples are in this dataset?

Question 2: How could you check how many sequences are in the FASTA file input_files/ASVs.fna?

Read placement

The first step of the workflow involves taking the study sequences and putting them into the pre-existing phylogenetic tree based on the 16S sequences from the reference genomes with the script place_seqs.py. As with all of the scripts in this workflow you can see the arguments and options for this tool by typing place_seqs.py --help. The default reference tree is based on full-length 16S sequences parsed from genomes in the Integrated Microbial Genomes database. The multiple-sequence alignment (MSA) of these reference 16S sequences is also required as input since it is required for the first placement step (done by HMMER). However, luckily for us this script automatically points to this MSA file that was included with our PICRUSt2 installation!

To run the sequence placement algorithm and produce a new phylogenetic tree with our sequences placed within them we can run the following command.

place_seqs.py -s input_files/ASVs.fna -o tutorial.tre -p 6 --intermediate place_seqs_working --verbose

These are the inputs to this tool (besides changing the default reference files we talked about eariler):

  • -s FASTA: your study sequences (i.e. FASTA of amplicon sequence variants or operational taxonomic units)
  • -o TREEFILE: Output tree with placed study sequences.
  • -p INT: Number of threads to use.
  • --intermediate: Option to specify a folder where intermediate files will be written (otherwise they will not be kept).
  • --verbose: Option to specify that wrapped commands will be printed to screen (useful for troubleshooting!).

The output of this tool is the tree tutorial.tre. Take a look at this file with less. The tree is in newick format, which captures how individual tips on the tree are related topologically and also the branch-length distances between them. Importantly, place_seqs.py takes the most likely placement for each input sequence, but that doesn't mean it's correct!

Note that we set the option --intermediate place_seqs_working above. In this folder you can find the intermediate output files from the placement pipeline. One particularly useful file is the JPLACE file (place_seqs_working/epa_out/epa_result.jplace). This file contains information on where each sequence was placed in the reference tree: often there will be multiple placement positions. The JPLACE file is in JSON format and contains the reference tree followed by each placed sequence. The name of each placed sequence is in the field starting with "n:" and the placement locations is in the preceeding "p": field.

Visualizing this output can be very informative. If you have time you can optionally view the placements using the interactive tree of life (iTol) website as described below.

Hidden-state prediction

Now that we have the study sequences placed in the reference tree we can proceed to the heart of the PICRUSt2 pipeline: hidden-state prediction (HSP). For each placed study sequence the predicted abundances of gene families of interest will be predicted, which below will be Enzyme Classification (E.C.) numbers.

We will also need to predict how many 16S copies are expected to be in the genome corresponding to each study sequence. Predicting how many 16S copies there are is important since this will be used in the next step to normalize the abundance of each sequence. We will also calculate the nearest-sequenced taxon index (NSTI), which is the measure of how distant each study sequence from the nearest reference sequence in the tree.

This command will run HSP using maximum parsimony to predict the number of 16S genes per predicted genome and will calculate NSTI values for each study sequence:

hsp.py -i 16S -t tutorial.tre -o 16S_predicted.tsv -m mp -p 6 -n

Similarly, this command will generate predictions on how many copies of each gene family are found in each predicted genome. We wont bother calculating NSTI values again since it would be the same as above.

hsp.py -i EC -t tutorial.tre -o EC_predicted.tsv -p 6 -m mp

The input arguments and options are:

  • -t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
  • -i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM', 'PHENO')
  • -o FILENAME: Named of output file containing predicted counts.
  • -m METHOD: Hidden-state prediction method to use.
  • -p INT: Number of processes to run in parallel.
  • -n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated.

There are a few other options which you can look at using the command hsp.py --help however I would suggest users stick to the default values for the other arguments unless they are using a custom database.

Next, it's a good idea to check whether any of your input sequences have very high NSTI values. It's possible that sequences with high NSTI values (e.g. > 1) could be uncharacterized taxa, but often they are simply garbage sequences.

Lets first inspect the 16S_predicted.tsv file using the less command.

less 16S_predicted.tsv

By doing this you will notice that there are three columns within the file: sequence, 16S_rRNA_Count and metadata_NSTI. You could sort this table by the NSTI values in the 3rd column using the sort command:

sort -k 3 16S_predicted.tsv

Question 3: What sequence has the highest NSTI value?

Question 4: Which sequence is predicted to have the most copies of the 16S gene? How many copies does it have?

You can also look at the predictions for ECs:

less -S EC_predicted.tsv

Each EC number is a different column and thus there are many more columns compared to the 16S copy number predictions!

Metagenome pipeline

Now that we have the predicted genomes for each study sequence we can predict the metagenomes for each sample. The basic approach here is to multiply the predicted number of genes by the abundance of each corresponding "genome" (i.e. the abundance of the study sequence). However, before predicting the metagenomes the sequence abundances are normalized by the number of marker gene copies in each predicted genome (i.e. how many 16S copies they have).

metagenome_pipeline.py -i input_files/ASV_abun.tsv -m 16S_predicted.tsv -f EC_predicted.tsv -o EC_metagenome_out --strat_out

The input arguments/options are:

  • -i STUDY.tsv - Table of ASV abundances across all samples (either TSV or BIOM format).
  • -m MARKER_PREDICTED.txt - Output predicted 16S copy numbers (or other marker) for all study sequences.
  • -f FUNC_PREDICTED.txt - Output predicted functional abundances for all study sequences.
  • --max_nsti INT - Max NSTI values per study sequence (sequences with values above this cut-off will be removed). Default: 2.
  • -o OUTPUT_DIR - Output directory.
  • --strat_out - Indicates we would like to generate an additional file with the functional contributions from each individual ASV.
  • --wide_table - Indicates we would like the stratified abundance predictions to be in a wide format rather than a long format.

The above command output the predicted metagenomes to the folder EC_metagenome_out by default. There are 4 output files in this directory:

  • pred_metagenome_strat.tsv.gz: Predicted functional contribution to the metagenome from each individual ASV
  • pred_metagenome_unstrat.tsv.gz: predicted metagenomes over all sequences.
  • seqtab_norm.tsv.gz: sequence table normalized by predicted number of marker genes.
  • weighted_nsti.tsv.gz: weighted NSTI values per sample (over all sequences weighted by the abundance in that sample).

You will notice that these files are compressed in gzip format. We will go ahead and uncompress them with the following command:

gunzip EC_metagenome_out/*

Take a look at the normalized sequence abundance table with less - S EC_metagenome_out/seqtab_norm.tsv. The abundances in this file are the number of reads divided by the predicted number of marker genes.

Question 6: What is the abundance of sequence 132b62bcbd3f280208e314eeab566a4d in sample HMP.2.206659 in this normalized table? Given that the unnormalized abundance was 11 how many predicted 16S sequences must there have been?

The output weighted NSTI values are also useful to explore. You can check out the option section at the bottom of this page if you have time!

You will also notice that this part of the pipeline output two different predicted metagenome files. The first file pred_metagenome_unstart.tsv represents a table similar to what we saw with the ASV_abun.tsv file. On columns we have each different sample and on the rows we have the predicted normalized abundance for each enzyme classification ID.

Question 7: Why would the column sums (total predicted abundance in each sample) in the unstratified predictions not be equal even if the input sequence abundance table was rarified so that all samples had the same depth?

The second predicted metagenome file pred_metagenome_strat.tsv is similar to the first file however it contains an additional column named sequence which breaks down the predicted EC abundance contributed to each sample per sequence (taxon) in our dataset.

Pathway inference

The final step in the PICRUSt2 pipeline is to infer pathway abundances based on the presence of gene families. This step is done by wrapping MinPath, which determines the minimum number of pathways that can be explained given the presence of gene families.

You can run this step on the Stratified table using this command:

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_strat.tsv -o pathways_out -p 3 --wide_table

The inputs to this tool are:

  • -i EC_metagenome_out/pred_metagenome_contrib.tsv - Stratified output of metagenome_pipeline.py step. (Note you can also input the unstratified table, however if a stratified table is input then it will generate both stratified and unstratifeid pathway tables)
  • -o pathways_out - The name of the folder to output files into.
  • -p INT - Number of processes to run simultaneously.
  • --wide_table - Indicates we are inputting stratified abundances that are in wide format.

Once again both stratified and unstratified output files are created: pathways_out/path_abun_strat.tsv.gz and pathways_out/path_abun_unstrat.tsv.gz respectively.

The abundance of pathways in these files is based on the mean abundance of the required gene families.

Question 8: Given the above, which one of the two following statements is correct?

1. The stratified pathway abundances represent the abundances expressed by predicted genomes independently
2. The stratified pathway abundances represent the abundances of the community-wide pathway levels contributed by an individual predicted genome.

Visualizing pathway abundances

*** Small section on how they will visiual the pathway abundances ***

Explore JPLACE file with iTOL (optional)

If there is time remaining you can try reading in the JPLACE file containing placed reads into the interactive Tree of Life website.

First, go to this website: https://itol.embl.de/

Then click on "Upload" at the top of the page.

Choose the JPLACE file output to the intermediate output folder produced at the sequence placement step: place_seqs_working/epa_out/epa_result.jplace. This file contains the possible placement positions for each ASV onto the reference tree along with each placements likelihood.

After a minute or so you should see a circular tree!

  • Under the "Basic" tab change the "Branch lengths" option to "Ignore".
  • Click on the switch beside "Phylogenetic Placements"

You should now see red dots where the placements are located within the tree. We hope to improve this functionality in the future (e.g. add taxonomic labels on the tips, colour clades, etc.).

  • To search for individual placements, click on the "show query form" button within the "Datasets" tab. A search box opens up on the left which can be used to search by ASV ids.

Question 11: Search for the ASV d3d5bc15a5f947217d626ad4a99c5757 - how many possible placements are there for this sequence?

Weighted NSTI exploration

Clone this wiki locally