CBW 2018 PICRUSt2 Tutorial

CURRENTLY UNDER CONSTRUCTION

This tutorial is for Module 3 of the Canadian Bioinformatics Workshop running from June 5-7 2018 in Toronto.

Authors: Gavin Douglas and Morgan Langille

Introduction

PICRUSt2 is a tool that predicts the abundance of gene families and higher-level pathways present in a microbial community based on amplicon sequencing data. This tutorial will demonstrate each major step in the PICRUSt2 pipeline. Throughout the tutorial there will be questions that are meant to aid understanding. The answers are on this page.

For this lab we will be processing 16S sequencing of biopsy samples from the Inflammatory Bowel Disease Multi'omics Database. Although shotgun metagenomics is becoming cheaper it typically isn't done with biopsy samples since it requires an enormous amount of sequencing (due to the large amount of host DNA!). 16S sequencing doesn't suffer from this issue since only the 16S is amplified - there typically is no host contamination.

This tutorial was meant for PICRUSt v2.0.0-b.2. The input files which are described below can be downloaded here, but these are already present on the student server!

Teaching objectives

Run and understand major steps of PICRUSt2 workflow.
Understand how to assess the prediction quality based on NSTI values.
Understand what files PICRUSt2 outputs and what the abundances in each one refer to.

Bioinformatic tool citations

castor
EPA-NG
gappa
iTOL (website, paper)
MinPath
PaPaRa
PICRUSt2
RStudio

Activating conda environment

PICRUSt2 is installed on the student server in a conda environment, which is a tool for managing environments. This is helpful since sometimes different tools require different versions of packages as dependencies. It also allows users to easily install software with a simple command. The environment we'll be using for this tutorial is called picrust2-dev.

A different version of Python is installed in this environment. Before activating this environment you can check where the global version of Python is installed and its version.

This command will identify where Python is installed.

which python

And this command will let you know its version:

python --version

You can activate this tutorial with this command:

source /usr/local/miniconda3/bin/activate picrust2-dev

Compare the location and version of Python in this environment to what was outputted above.

Note that although conda is often used for managing Python versions it can be used to manage any environment. For example, PICRUSt2 is dependent on certain R packages and a different version of R is installed in this environment.

QUESTION: What version of R is installed in the picrust2-dev environment and where is the R script located (note that it's capitalized unlike python)?

When you're finished this tutorial you can deactivate the environment with this command:

source /usr/local/miniconda3/bin/deactivate

Exploring input files

Two input files are required to run PICRUSt2: (1) a FASTA file of each study sequence and (2) the abundance of each study sequence in each sample. The workflow will work with any sequences that can be reliably placed in the reference tree (which is based on 16S rRNA genes [16S] by default), but typically users will input either the representative sequences of OTUs or amplicon sequence variants (ASVs).

These two files are in the input_files folder:

ASV_abun.tsv
ASVs.fna

The ASV_abun.tsv represents each sample as a column (sample ids are in first row), each ASV as a row (ASV ids are in first column), and the numbers in the table as counts of ASVs per sample.

View the ASV_abun.tsv file with the less command and the option -S which turns off line wrapping (use the left and right arrow keys to scroll to see the entire file):

less -S input_files/ASV_abun.tsv

Question: How many samples are in this dataset?

You can also see how many lines there in the sequence table with this command:

wc -l input_files/ASV_abun.tsv

There are 100 sequences in this table (the first line is the header!)

QUESTION: As a sanity check, how could you check how many sequences are in the FASTA file input_files/ASVs.fna?

Read placement

The first step of the workflow involves taking the study sequences and putting them into the pre-existing phylogenetic tree based on the 16S sequences from the reference genomes with the script place_seqs.py. As with all of the scripts in this workflow you can see the arguments and options for this tool by typing place_seqs.py --help. The default reference tree is based on full-length 16S sequences parsed from genomes in the Integrated Microbial Genomes database. The multiple-sequence alignment (MSA) of these reference 16S sequences is also required as input since it is required for the first placement step (done by PaPaRa).

QUESTION: Based on the output of place_seqs.py --help where are the default reference files and how many sequences are in the reference MSA?

place_seqs.py -s input_files/ASVs.fna -o tutorial.tre --threads 4 --intermediate place_seqs_working

These are the inputs to this tool (besides changing the default reference files):

-s FASTA: your study sequences (i.e. FASTA of amplicon sequence variants or operational taxonomic units)
-o TREEFILE: Output tree with placed study sequences.
--threads INT: Number of threads to use.
--intermediate: Option to specify a folder where intermediate files will be written (otherwise they will not be kept).
--print_cmds: Option to specify that wrapped commands will be printed to screen (useful for troubleshooting!).

The output of this tool is the tree tutorial.tre. Take a look at this file with less. The tree is in newick format, which captures how individual tips on the tree are related topologically and also the branch-length distances between them. Importantly, place_seqs.py takes the most likely placement for each input sequence, but that doesn't mean it's correct!

Note that we set the option --intermediate place_seqs_working above. In this folder you can find the intermediate output files from the placement pipeline. One particularly useful file is the JPLACE file (place_seqs_working/epa_out/epa_result.jplace). This file contains information on where each sequence was placed in the reference tree: often there will be multiple placement positions. The JPLACE file is in JSON format and contains the reference tree followed by each placed sequence. The name of each placed sequence is in the field starting with "n:" and the placement locations is in the preceeding "p": field.

Visualizing this output can be very informative. If you have time you can optionally view the placements using the interactive tree of life (iTol) website as described below.

Hidden-state prediction

Now that we have the study sequences placed in the reference tree we can proceed to the heart of the PICRUSt2 pipeline: hidden-state prediction (HSP). For each placed study sequence the predicted abundances of gene families of interest will be predicted, which below will be Enzyme Classification (E.C.) numbers.

We will also predict how many 16S copies are expected to be in the genome corresponding to each study sequence. Predicting how many 16S copies there are is important since this will be used in the next step to normalize the abundance of each sequence. We will also calculate the nearest-sequenced taxon index (NSTI), which is the measure of how distant each study sequence from the nearest reference sequence in the tree.

This command will run HSP using maximum parsimony to predict the number 16S genes per predicted genome and will calculate NSTI values for each study sequence:

hsp.py -i 16S -n -t tutorial.tre -o 16S_predicted -m mp -n --seed 163

Similarly, this command will generate predictions how many copies of each gene family are found in each predicted genome. We wont bother calculating NSTI values again since it would be the same as above. Note that this command will take a lot longer (~3 minutes) since predictions are being made for 1000s of gene families instead of just 1 as above.

hsp.py -i EC -t tutorial.tre -o EC_predicted -p 4 -m mp --seed 163

The input arguments and options are:

-t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
-i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM')
-o PREFIX: Prefix for output files: RDS (R object containing state probabilities), predicted counts, and optionally a table of CIs.
--observed_trait_table TRAIT_COUNTFILE: Trait file to use if a non-default file is needed (most users should use one of the default trait options above).
-m METHOD: Hidden-state prediction method to use.
-p INT: Number of processes to run in parallel.
--chunk_size INT: Number of gene families to read in for a given processor.
-n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated.
-c: Specifies that pseudo-confidence intervals will be output.

QUESTION: Note that a random seed (--seed 163) was specified for each above command. Why is this important to set?

It's a good idea to check whether any of your input sequences have very high NSTI values. It's possible that sequences with high NSTI values (e.g. > 1) could be uncharacterized taxa, but often they are simply garbage sequences.

To take a look at the NSTI values open RStudio in your browser and run these commands in R:

hsp_16S_nsti <- read.table("/path/to/16S_predicted.tsv", header=T, sep="\t", row.names=1)
hist(hsp_16S_nsti$metadata_NSTI, breaks=100, col="grey", main="", xlab="Nearest-sequenced taxon index", ylim=c(0,40))

You should get a histogram that looks like this:

images/picrust2_tutorial_NSTI_hist.png

length(which(hsp_16S_nsti$metadata_NSTI == min(hsp_16S_nsti$metadata_NSTI)))

max(hsp_16S_nsti$metadata_NSTI)

hsp_16S_nsti[which(hsp_16S_nsti$metadata_NSTI == max(hsp_16S_nsti$metadata_NSTI)), ]

X16S_rRNA_Count metadata_NSTI

#70334369bb15777be95200a4edaa90be 1 0.338791


## Metagenome pipeline
Now that we have the predicted _genomes_ for each study sequence we can predict the _metagenomes_ for each sample. The basic approach here is to multiply the predicted number of genes by the abundance of each corresponding "genome" (i.e. the abundance of the study sequence). However, before predicting the metagenomes the sequence abundances are normalized by the number of marker gene copies in each predicted genome (i.e. how many 16S copies they have).

metagenome_pipeline.py -i input_files/ASV_abun.tsv -f EC_predicted.tsv -m 16S_predicted.tsv -p 4


This command outputs the predicted metagenomes in both stratified and unstratified format to the folder ```metagenome_out``` by default.

The input arguments/options are:
* ```-i STUDY.biom``` - BIOM file abundances of study variants across all samples.
* ```-m MARKER_PREDICTED.txt``` - Output predicted 16S copy numbers (or other marker) for all study sequences.
* ```-f FUNC_PREDICTED.txt``` - Output predicted functional abundances for all study sequences.
* ```-p INT``` - Number of processes to run in parallel to speed up execution.
* ```--max_nsti INT``` - Max NSTI values per study sequence (sequences with values above this cut-off will be removed). Default: 2.
* ```-o OUTPUT_DIR``` - Output directory.

**Explore normalized output file**

## Pathway inference

run_minpath.py -i metagenome_out/pred_metagenome_strat.tsv -m /home/gavin/github_repos/picrust_repos/picrust2/MinPath/ec2metacyc_picrust_prokaryotic.txt -p 4 --intermediate minpath_intermediate -o minpath_out


**Point out that intermediate MinPath files can be parsed manually in ```minpath_intermediate```.**

**Also again stratified and unstratified predictions are output**

**Clarify how the unstratified abundances differ from how HUMAnN2 does it**

**List options**

## Explore JPLACE file with iTOL (optional)

If there is time remaining you can try reading in the file containing placed reads into iTOL.



## Explore weighted NSTI values (optional)

**Explore weighted NSTI values in R, similar to above!**

**Compare NSTI values between biopsy locations**

Contact

Please feel free to post a question on the Microbiome Helper google group if you have any issues.
General comments or inquires about Microbiome Helper can be sent to [email protected].