Skip to content

CBW 2018 PICRUSt2 Tutorial

Gavin Douglas edited this page Jun 1, 2018 · 70 revisions

This tutorial is for Module 3 of the Canadian Bioinformatics Workshop running from June 5-7 2018 in Toronto.

Authors: Gavin Douglas and Morgan Langille

Table of Contents

Introduction

PICRUSt2 is a tool that predicts the abundance of gene families and higher-level pathways present in a microbial community based on amplicon sequencing data. This tutorial will demonstrate each major step in the PICRUSt2 pipeline. Throughout the tutorial there will be questions that are meant to aid understanding. The answers are on this page.

For this lab we will be processing 16S sequencing of biopsy samples from the Inflammatory Bowel Disease Multi'omics Database. Although shotgun metagenomics is becoming cheaper it typically isn't done with biopsy samples since it requires an enormous amount of sequencing (due to the large amount of host DNA!). 16S sequencing doesn't suffer from this issue since only the 16S is amplified - there typically is no host contamination.

This tutorial was meant for PICRUSt v2.0.0-b.2. The input files which are described below can be downloaded here, but these are already present on the student server!

Teaching objectives

  1. Run and understand major steps of PICRUSt2 workflow.
  2. Understand how to assess the prediction quality based on NSTI values.
  3. Understand what files PICRUSt2 outputs and what the abundances in each one refer to.

Bioinformatic tool citations

Activating the conda environment

PICRUSt2 is installed on the student server in a conda environment, which is a tool for managing environments. This is helpful since sometimes different tools require different versions of packages as dependencies. It also allows users to easily install software with a simple command. The environment we'll be using for this tutorial is called picrust2-dev.

A different version of Python is installed in this environment. Before activating this environment you can check where the global version of Python is installed and its version.

This command will identify where Python is installed.

which python

And this command will let you know its version:

python --version

You can activate this tutorial with this command:

source /usr/local/miniconda3/bin/activate picrust2-dev

Compare the location and version of Python in this environment to what was outputted above.

Note that although conda is often used for managing Python versions it can be used to manage any environment. For example, PICRUSt2 is dependent on certain R packages and a different version of R is installed in this environment.

Question 1: What version of R is installed in the picrust2-dev environment and where is the R script located (note that it's capitalized unlike python)?

When you're finished this tutorial you can deactivate the environment with this command:

source /usr/local/miniconda3/bin/deactivate

Exploring input files

Two input files are required to run PICRUSt2: (1) a FASTA file of each study sequence and (2) the abundance of each study sequence in each sample. The workflow will work with any sequences that can be reliably placed in the reference tree (which is based on 16S rRNA genes [16S] by default), but typically users will input either the representative sequences of OTUs or amplicon sequence variants (ASVs).

These two files are in the input_files folder:

  • ASV_abun.tsv
  • ASVs.fna

The ASV_abun.tsv represents each sample as a column (sample ids are in first row), each ASV as a row (ASV ids are in first column), and the numbers in the table as counts of ASVs per sample.

View the ASV_abun.tsv file with the less command and the option -S which turns off line wrapping (use the left and right arrow keys to scroll to see the entire file):

less -S input_files/ASV_abun.tsv

Question 2: How many samples are in this dataset?

You can also see how many lines there in the sequence table with this command:

wc -l input_files/ASV_abun.tsv

There are 100 sequences in this table (the first line is the header!)

Question 3: As a sanity check, how could you check how many sequences are in the FASTA file input_files/ASVs.fna?

Read placement

The first step of the workflow involves taking the study sequences and putting them into the pre-existing phylogenetic tree based on the 16S sequences from the reference genomes with the script place_seqs.py. As with all of the scripts in this workflow you can see the arguments and options for this tool by typing place_seqs.py --help. The default reference tree is based on full-length 16S sequences parsed from genomes in the Integrated Microbial Genomes database. The multiple-sequence alignment (MSA) of these reference 16S sequences is also required as input since it is required for the first placement step (done by PaPaRa).

Question 4: Based on the output of place_seqs.py --help where are the default reference files and how many sequences are in the reference MSA?

place_seqs.py -s input_files/ASVs.fna -o tutorial.tre --threads 4 --intermediate place_seqs_working

These are the inputs to this tool (besides changing the default reference files):

  • -s FASTA: your study sequences (i.e. FASTA of amplicon sequence variants or operational taxonomic units)
  • -o TREEFILE: Output tree with placed study sequences.
  • --threads INT: Number of threads to use.
  • --intermediate: Option to specify a folder where intermediate files will be written (otherwise they will not be kept).
  • --print_cmds: Option to specify that wrapped commands will be printed to screen (useful for troubleshooting!).

The output of this tool is the tree tutorial.tre. Take a look at this file with less. The tree is in newick format, which captures how individual tips on the tree are related topologically and also the branch-length distances between them. Importantly, place_seqs.py takes the most likely placement for each input sequence, but that doesn't mean it's correct!

Note that we set the option --intermediate place_seqs_working above. In this folder you can find the intermediate output files from the placement pipeline. One particularly useful file is the JPLACE file (place_seqs_working/epa_out/epa_result.jplace). This file contains information on where each sequence was placed in the reference tree: often there will be multiple placement positions. The JPLACE file is in JSON format and contains the reference tree followed by each placed sequence. The name of each placed sequence is in the field starting with "n:" and the placement locations is in the preceeding "p": field.

Visualizing this output can be very informative. If you have time you can optionally view the placements using the interactive tree of life (iTol) website as described below.

Hidden-state prediction

Now that we have the study sequences placed in the reference tree we can proceed to the heart of the PICRUSt2 pipeline: hidden-state prediction (HSP). For each placed study sequence the predicted abundances of gene families of interest will be predicted, which below will be Enzyme Classification (E.C.) numbers.

We will also predict how many 16S copies are expected to be in the genome corresponding to each study sequence. Predicting how many 16S copies there are is important since this will be used in the next step to normalize the abundance of each sequence. We will also calculate the nearest-sequenced taxon index (NSTI), which is the measure of how distant each study sequence from the nearest reference sequence in the tree.

This command will run HSP using maximum parsimony to predict the number 16S genes per predicted genome and will calculate NSTI values for each study sequence:

hsp.py -i 16S -n -t tutorial.tre -o 16S_predicted -m mp -n --seed 163

Similarly, this command will generate predictions how many copies of each gene family are found in each predicted genome. We wont bother calculating NSTI values again since it would be the same as above. Note that this command will take a lot longer (~3 minutes) since predictions are being made for 1000s of gene families instead of just 1 as above.

hsp.py -i EC -t tutorial.tre -o EC_predicted -p 4 -m mp --seed 163

The input arguments and options are:

  • -t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
  • -i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM')
  • -o PREFIX: Prefix for output files: RDS (R object containing state probabilities), predicted counts, and optionally a table of CIs.
  • --observed_trait_table TRAIT_COUNTFILE: Trait file to use if a non-default file is needed (most users should use one of the default trait options above).
  • -m METHOD: Hidden-state prediction method to use.
  • -p INT: Number of processes to run in parallel.
  • --chunk_size INT: Number of gene families to read in for a given processor.
  • -n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated.
  • -c: Specifies that pseudo-confidence intervals will be output.

Question 5: Note that a random seed (--seed 163) was specified for each above command. Why is this important to set?

It's a good idea to check whether any of your input sequences have very high NSTI values. It's possible that sequences with high NSTI values (e.g. > 1) could be uncharacterized taxa, but often they are simply garbage sequences.

To take a look at the NSTI values open RStudio in your browser and run these commands in R:

hsp_16S_nsti <- read.table("/path/to/16S_predicted.tsv", header=T, sep="\t", row.names=1)
hist(hsp_16S_nsti$metadata_NSTI, breaks=100, col="grey", main="", xlab="Nearest-sequenced taxon index", ylim=c(0,40))

You should get a histogram that looks like this:

images/picrust2_tutorial_NSTI_hist.png

The majority of sequences have reasonably low NSTI values (< 0.3). There is one outlier with a value >0.3. We could identify which sequence this is with these commands:

max_nsti_val <- max(hsp_16S_nsti$metadata_NSTI)

hsp_16S_nsti[which(hsp_16S_nsti$metadata_NSTI == max_nsti_val), ]

The above commands should have returned:

# X16S_rRNA_Count metadata_NSTI
#70334369bb15777be95200a4edaa90be               1      0.338791

Which identifies sequence 70334369bb15777be95200a4edaa90be as having the highest NSTI value of 0.338791.

Question 6: Using a similar approach as above how could you figure out what the lowest NSTI value is?

Question 7: Either within RStudio or by simply lessing the 16S_predicted.tsv file, which sequence is predicted to have the most copies of the 16S gene? How many copies does it have?

Metagenome pipeline

Now that we have the predicted genomes for each study sequence we can predict the metagenomes for each sample. The basic approach here is to multiply the predicted number of genes by the abundance of each corresponding "genome" (i.e. the abundance of the study sequence). However, before predicting the metagenomes the sequence abundances are normalized by the number of marker gene copies in each predicted genome (i.e. how many 16S copies they have).

metagenome_pipeline.py -i input_files/ASV_abun.tsv -f EC_predicted.tsv -m 16S_predicted.tsv -p 4

The input arguments/options are:

  • -i STUDY.tsv - Table of ASV abundances across all samples (either TSV or BIOM format).
  • -m MARKER_PREDICTED.txt - Output predicted 16S copy numbers (or other marker) for all study sequences.
  • -f FUNC_PREDICTED.txt - Output predicted functional abundances for all study sequences.
  • -p INT - Number of processes to run in parallel to speed up execution.
  • --max_nsti INT - Max NSTI values per study sequence (sequences with values above this cut-off will be removed). Default: 2.
  • -o OUTPUT_DIR - Output directory.

The above command output the predicted metagenomes to the folder metagenome_out by default. There are 4 output files in this directory:

  • pred_metagenome_strat.tsv: predicted metagenomes stratified by which sequence is contributing each function.
  • pred_metagenome_unstrat.tsv: predicted metagenomes over all sequences.
  • seqtab_norm.tsv: sequence table normalized by predicted number of marker genes.
  • weighted_nsti.tsv: weighted NSTI values per sample (over all sequences weighted by the abundance in that sample).

Take a look at the normalized sequence abundance table with less. The abundances in this file are the number of reads divided by the predicted number of marker genes.

Question 8: What is the abundance of sequence 132b62bcbd3f280208e314eeab566a4d in sample HMP.2.206659 in this normalized table? Given that the unnormalized abundance was 11 how many predicted 16S sequences must there have been?

The output weighted NSTI values are also useful to explore. You can check out the option section at the bottom of this page if you have time!

The key output files are the stratified and unstratified metagenome predictions. The key difference in the stratified table (metagenome_out/pred_metagenome_strat.tsv) is that it contains a column called "sequence", which breaks down the number of genes contributed by individual ASVs. Both of these predicted metagenome files contain the numbers of gene families in each sample.

Note that these values are derived from the input read depth; therefore, the magnitude of the values will depend on the sequencing depth per sample. Also, note that the sum of gene family abundances per sample (i.e. column sums) will not be equal.

Question 9: Why would the column sums not be equal even if the input sequence abundance table was rarified so that all samples had the same depth?

Pathway inference

The final step in the PICRUSt2 pipeline is to infer pathway abundances based on the presence of gene families. This step is done by wrapping MinPath, which determines the minimum number of pathways that can be explained given the presence of gene families.

You can run MinPath over all samples with this command:

run_minpath.py -i metagenome_out/pred_metagenome_strat.tsv -m /usr/local/picrust2/MinPath/ec2metacyc_picrust_prokaryotic.txt -p 4 --intermediate minpath_intermediate -o minpath_out

The inputs to this tool are:

  • -i metagenome_out/pred_metagenome_strat.tsv - Stratified output of metagenome_pipeline.py step.
  • -m MINPATH_MAPFILE - path to mapfile of gene families to pathways of interest.
  • -o OUT_PREFIX - Prefix for the two output files.
  • --intermediate - Optional folder where intermediate files will be written..
  • -p INT - Number of processes to run simultaneously.

Once again both stratified and unstratified output files are created: minpath_out_strat_path.tsv and minpath_out_unstrat_path.tsv respectively.

The abundance of pathways in these files is based on the mean abundance of the required gene families.

The mapping between pathways and the underlying gene families was given as an input file to the above command (/usr/local/picrust2/MinPath/ec2metacyc_picrust_prokaryotic.txt). The intermediate MinPath output files were saved in minpath_intermediate. There is a lot of information in this folder and it wont be used by most users, but it's useful to have these raw files in case you want to more information on how MinPath predicted the presence of a pathway in a sample.

Note that the stratified abundances sum to be equal to the unstratified abundances, which are the overall abundance of that pathway in the sample (NOTE: this is a different approach then what is outputted by HUMAnN2 which will be discussed in a future module!).

Question 10: Given the above, which one of the two following statements is correct?

1. The stratified pathway abundances represent the abundances expressed by predicted genomes independently
2. The stratified pathway abundances represent the abundances of the community-wide pathway levels contributed by an individual predicted genome.

Explore JPLACE file with iTOL (optional)

If there is time remaining you can try reading in the file containing placed reads into iTOL.

Explore weighted NSTI values (optional)

Explore weighted NSTI values in R, similar to above!

Compare NSTI values between biopsy locations

Clone this wiki locally