Workflow PICRUSt2‐MPGA database

This page has been modified from the original Workflow page. You can see full details on the updates made to the PICRUSt2 database here.

Below is an overview of the PICRUSt2 workflow for the updated PICRUSt2-MPGA database using GTDB genomes, which includes example commands for processing 16S sequencing data and getting E.C. number and KEGG ortholog (KO) abundances. The E.C. numbers can then be used to calculate MetaCyc pathway abundances and coverages. Note that there are other gene family databases supported which may be more informative (but which cannot be collapsed to pathways by default). See the side-bar for more details on individual commands.

We have recently updated the PICRUSt2 database to include GTDB r214.1 genomes. You can see full details of the changes made and the improvements to performance here. We now have two separate phylogenetic trees that are used for inserting study sequences and predicting gene content, one for bacteria and one for archaea. This means that some of the functions used previously need to be run twice and some of the output files need to be combined. You can see all of the updated commands necessary on this page. If you wish to run the previous PICRUSt2 database, you can still do so. You can see details for how to do this on the previous pages here.

Note that you can type the option -h to get a description of each below script.

Installation

Currently, the default version of PICRUSt2 that is downloaded will be 2.5.3. As of version 2.6.0 we are introducing the updated database that is constructed using GTDB genomes and functional annotations from Eggnog. See the installation instructions here for information on installing and running the updated database.

The entire pipeline can be run with this command (details):

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1

If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.

Place amplicon sequence variants (or OTUs) into reference phylogeny (details)

Place sequences into bacterial reference phylogeny:

place_seqs.py -s study_seqs.fna -o bac_placed_seqs.tre -p 1 \
                    -r bacteria \
                    --intermediate placement_working_bac

Place sequences into archaeal reference phylogeny:

place_seqs.py -s study_seqs.fna -o arc_placed_seqs.tre -p 1 \
                    -r archaea \
                    --intermediate placement_working_arc

Note that it is not necessary to run both of these if, for example, you know that your dataset should only contain bacteria.

Run hidden-state prediction to get 16S copy numbers per predicted genome (details)

Note that NSTI values will be added to the 16S prediction table (since the -n option was given).

hsp.py -i 16S -r bacteria -t bac_placed_seqs.tre -o bac_marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i 16S -r archaea -t arc_placed_seqs.tre -o arc_marker_nsti_predicted.tsv.gz -p 1 -n

Determining the best domain for each sequence (details)

Now we want to check whether the NSTI was lower for the bacterial or the archaeal references for each of our study sequences:

pick_best_domain.py \
                    -n1 bacteria \
                    -n2 archaea \
                    --tree_dom1 bac_placed_seqs.tre \
                    --tree_dom2 arc_placed_seqs.tre \
                    --tree_out_dom1 bac_reduced_placed_seqs.tre \
                    --tree_out_dom2 arc_reduced_placed_seqs.tre \
                    --nsti_table_dom1 bac_marker_nsti_predicted.tsv.gz \
                    --nsti_table_dom2 arc_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_dom1 bac_reduced_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_dom2 arc_reduced_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_combined combined_marker_nsti_predicted.tsv.gz

Run hidden-state prediction to get E.C. number, and KO abundances per predicted genome (details)

hsp.py -i EC -r bacteria -t bac_reduced_placed_seqs.tre -o bac_EC_predicted.tsv.gz -p 1

hsp.py -i KO -r bacteria -t bac_reduced_placed_seqs.tre -o bac_KO_predicted.tsv.gz -p 1

hsp.py -i EC -r archaea -t arc_reduced_placed_seqs.tre -o arc_EC_predicted.tsv.gz -p 1

hsp.py -i KO -r archaea -t arc_reduced_placed_seqs.tre -o arc_KO_predicted.tsv.gz -p 1

Combine files from HSP (details)

combine_domains.py --table_dom1 bac_EC_predicted.tsv.gz --table_dom2 arc_EC_predicted.tsv.gz -o combined_EC_predicted.tsv.gz

combine_domains.py --table_dom1 bac_KO_predicted.tsv.gz --table_dom2 arc_KO_predicted.tsv.gz -o combined_KO_predicted.tsv.gz

Predict E.C. and KO abundances in sequencing samples (adjusts gene family abundances by 16S sequence abundance) (details)

Note that this is the same page as before. At this point, we have combined the predictions that we obtained separately for the bacterial and archaeal domains and are now running this in the same way as we did previously.

metagenome_pipeline.py -i study_seqs.biom \
                       -m combined_marker_nsti_predicted.tsv.gz \
                       -f combined_EC_predicted.tsv.gz \
                       -o EC_metagenome_out

metagenome_pipeline.py -i study_seqs.biom \
                       -m combined_marker_nsti_predicted.tsv.gz \
                       -f combined_KO_predicted.tsv.gz \
                       -o KO_metagenome_out

Infer MetaCyc pathway abundances and coverages based on predicted E.C. number abundances (details)

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate pathways_working \
                    -p 1

Add descriptions as new column in gene family and pathway abundance tables (details)

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                    -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz

Shuffling predictions

Note that this step has not changed.

An optional additional step is to shuffle the ASV labels in the genome prediction tables (i.e. the outputs of hsp.py). Any analyses based on these shuffled tables can then be compared with analyses based on the actual data to check if there is more signal in the unshuffled data. See here for more details.

Please first check our FAQ if you have any questions about PICRUSt2.

For other general questions and comments about PICRUSt2 please search the PICRUSt google group. If the question has not been previously answered then please make a new thread.

To report a bug or to make a feature request please make a new issue at the top of this page.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workflow PICRUSt2‐MPGA database

Installation

The entire pipeline can be run with this command (details):

Place amplicon sequence variants (or OTUs) into reference phylogeny (details)

Run hidden-state prediction to get 16S copy numbers per predicted genome (details)

Determining the best domain for each sequence (details)

Run hidden-state prediction to get E.C. number, and KO abundances per predicted genome (details)

Combine files from HSP (details)

Predict E.C. and KO abundances in sequencing samples (adjusts gene family abundances by 16S sequence abundance) (details)

Infer MetaCyc pathway abundances and coverages based on predicted E.C. number abundances (details)

Add descriptions as new column in gene family and pathway abundance tables (details)

Shuffling predictions

Home

Major bug reports and announcements

Key limitations

Installation

Workflow

Updated PICRUSt2-MPGA database

Tutorial

QIIME 2 plugin

Validation with paired metagenomes

FAQ

Clone this wiki locally