PICRUSt Tutorial with de novo Variants

Note this workflow is a work-in-progress and is still being altered. It is currently being tested and so should be used cautiously.

One drawback of using PICRUSt is that it restricts users to picking against OTUs in the Greengenes database. This is not ideal for several reasons including that a large proportion of OTUs are typically de novo and OTU picking is gradually being replaced by error correction approaches like DADA2 and deblur.

To get around this drawback the user can run the genome prediction steps of PICRUSt themselves (rather than having a pre-calculated supplied for reference OTUs). The below workflow will show users how they can run these steps themselves on DADA2 variants. The dataset we will be using is the output of our DADA2 16S tutorial. A description of this data and the accompanying metadata is available at the beginning of that tutorial. This approach is currently being refined so this will likely change over time. This tutorial is based partially on the steps laid out here: https://github.com/vmaffei/dada2_to_picrust.

Acquire data

If you are running this tutorial as part of EMBL-EBI's Metagenomics workshop then the tutorial files should already be on your computer. You can copy the zipped folder to the Desktop, unzip it, and enter the folder to get started.

cd ~/Desktop
cp ~/Documents/picrust_tutorial_Oct2017.zip ~/Desktop
unzip picrust_tutorial_Oct2017.zip
cd picrust_tutorial_Oct2017

Otherwise you can download the tutorial files by replacing the cp command above with the below wget command.

wget http://kronos.pharmacology.dal.ca/public_files/tutorial_datasets/picrust_tutorial_Oct2017.zip

Overview of data

In the picrust_tutorial_Oct2017 directory you will see output files from the DADA2 pipeline (the files produced by our DADA2 16S tutorial) and required files that will be used for the genome prediction steps.

dada2_out/ contains the counts of the variants identified with DADA2 (seqtab_tax_rarified.biom) and their sequences in FASTA format (seqtab.fasta).
map.txt contains the metadata for each sample and will be used at the data analysis stage.
img_gg_starting_files/ contains IMG KEGG ortholog counts that were used to create PICRUSt's precalculated files for the Greengenes database. There are additional files which in this directory which will be described when needed.

Activate anaconda environment

Before proceeding with this workflow you will need to activate the qiime1 anaconda environment.

source ~/anaconda2/bin/activate qiime1

Create New Reference Tree

Trait predictions with PICRUSt are based on ancestral state reconstruction (ASR). This method requires a phylogenetic tree of all the taxa of interest (the study and reference sequences in this case). To make this tree we need to align all of these sequences of interest together.

The first step is to concatenate the study and reference 16S rRNA gene sequences into a single file. Note that the first argument below is the output of the DADA2 pipeline for our dataset. The second argument is the reference full-length 16S sequences for the subset of Greengenes OTUs that correspond to genomes in IMG with KEGG ortholog count data.

cat dada2_out/seqtab.fasta img_gg_starting_files/gg_13_5_img_subset.fasta > study_seqs_gg_13_5_img_subset.fasta

Next we need to align all of these sequences together. The below command will do this using PyNAST, which uses the full Greengenes reference alignment as a template.

align_seqs.py -e 90 -p 0.1 -i study_seqs_gg_13_5_img_subset.fasta -o study_seqs_gg_13_5_img_subset_pynast

Non-default options:

-e 90: Min sequence length of 90 to include in alignment.
-p 0.1: Min percent identity of 10% to include in the alignment.

Filter low quality columns from alignment. This is important since the alignment is being made between full 16S sequences and shorter variants of variable regions.

filter_alignment.py -i study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned.fasta -o study_seqs_gg_13_5_img_subset_pynast

We can now run fasttree to build a phylogenetic tree based on this alignment. The below command is the multi-threaded version of fasttree (you need to export a shell variable specifying the number of threads).

export OMP_NUM_THREADS=9
FastTreeMP  -nt -gamma -fastest -no2nd -spr 4 -constraints img_gg_starting_files/99_otus_IMG_pruned_no_names_constraint.txt < study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned_pfiltered.fasta > study_seqs_gg_13_5_img_subset.tre

There is a lot going on in this command, so here is the breakdown:

-nt: specifies the alignment is of nucleotides.
-gamma: rescale branch lengths and compute Gamma20-based likelihood.
-fastest: speed up neighbour-joining phase.
-no2nd: when used with -fastest yields an intermediate level of speed and search intensiveness.
-spr 4: specifies 4 subtree-pruning-regrafting steps (see paper).
-constraints img_gg_starting_files/99_otus_IMG_pruned_no_names_constraint.txt: file containing topology constraints for Greengenes sequences included in the tree (calculated in advance and based on the original Greengenes tree).
< FILE: this specifies the input file.
> FILE: this specifies the output file.

PICRUSt genome prediction

Start by formatting input tree and trait table. The below commands perform a few different checks of the input files and prepares them for the ASR step.

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/gg_13_5_img_16S_counts.txt -o format/16S

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/img_400_ko.tab -o format/KO -m img_gg_starting_files/gg_13_5_img_fixed.txt

Run ancestral state reconstruction steps for 16S rRNA gene counts to get estimates for every internal node in the tree. The default method is phylogenetic independent contrasts although there are other methods available as well.

ancestral_state_reconstruction.py -i format/16S/trait_table.tab -t format/16S/pruned_tree.newick -o asr/16S_asr_counts.tab -c asr/asr_ci_16S.tab

You could also run this command to run ASR for the KO table. This command takes > 30 minutes so we have made the output files available in img_gg_starting_files/KO_asr_precalc. Again you do not need to run the below command.

ancestral_state_reconstruction.py -i format/KO/trait_table.tab -t format/KO/pruned_tree.newick -o asr/KO_asr_counts.tab -c asr/asr_ci_KO.tab

Now that we have the trait predictions for all internal nodes we can extend this predictions to the tips of the tree (i.e. our study sequences of interest). Note that the ASR files for the KEGG orthologs have been pre-calculated.

predict_traits.py -i format/16S/trait_table.tab -t format/16S/reference_tree.newick -r asr/16S_asr_counts.tab -o predicted/16S_precalculated.tab -a -c asr/asr_ci_16S.tab

predict_traits.py -i format/KO/trait_table.tab -t format/KO/reference_tree.newick -r img_gg_starting_files/KO_asr_precalc/KO_asr_counts.tab -o predicted/KO_precalculated.tab -a -c img_gg_starting_files/KO_asr_precalc/asr_ci_KO.tab

Add metadata to KO table

echo "" >> predicted/KO_precalculated.tab
cat img_gg_starting_files/metadata/precalc_meta >> predicted/KO_precalculated.tab

Get predictions per sample based on the abundance of each variant.

normalize_by_copy_number.py -i dada2_out/seqtab_tax_rarified.biom -c predicted/16S_precalculated.tab -o seqtab_tax_rarified_norm.biom

predict_metagenomes.py -i seqtab_tax_rarified_norm.biom -c predicted/KO_precalculated.tab -o seqtab_tax_rarified_norm_ko.biom

categorize_by_function.py -i seqtab_tax_rarified_norm_ko.biom -o seqtab_tax_rarified_norm_ko_l3.biom -c KEGG_Pathways -l 3

Run metagenome contributions:

metagenome_contributions.py -i seqtab_tax_rarified_norm.biom -l K08602,K03699 -o metagenome_contributions.txt -c predicted/KO_precalculated.tab

Plot stacked barcharts of metagenome contributions:

plot_metagenome_contributions.R --input metagenome_contributions.txt --output K03699_contrib.pdf --function_id K03699 --rank Genus

You should see this figure:

Testing for Statistical Differences with STAMP

Convert to STAMP format.

biom_to_stamp.py -m "KEGG_Description" seqtab_tax_rarified_norm_ko.biom > seqtab_tax_rarified_norm_ko.spf
biom_to_stamp.py -m "KEGG_Pathways" seqtab_tax_rarified_norm_ko_l3.biom > seqtab_tax_rarified_norm_ko_l3.spf

The top KO after FDR multiple-test correction as barplot:

The top KEGG pathway after FDR multiple-test correction as boxplot:

Contact

Please feel free to post a question on the Microbiome Helper google group if you have any issues.
General comments or inquires about Microbiome Helper can be sent to [email protected].