-
Notifications
You must be signed in to change notification settings - Fork 105
Workflow PICRUSt2‐MPGA database
This page has been modified from the original Workflow page. You can see full details on the updates made to the PICRUSt2 database here.
Below is an overview of the PICRUSt2 workflow for the updated PICRUSt2-MPGA database using GTDB genomes, which includes example commands for processing 16S sequencing data and getting E.C. number and KEGG ortholog (KO) abundances. The E.C. numbers can then be used to calculate MetaCyc pathway abundances and coverages. Note that there are other gene family databases supported which may be more informative (but which cannot be collapsed to pathways by default). See the side-bar for more details on individual commands.
We have recently updated the PICRUSt2 database to include GTDB r214.1 genomes. You can see full details of the changes made and the improvements to performance here. We now have two separate phylogenetic trees that are used for inserting study sequences and predicting gene content, one for bacteria and one for archaea. This means that some of the functions used previously need to be run twice and some of the output files need to be combined. You can see all of the updated commands necessary on this page. If you wish to run the previous PICRUSt2 database, you can still do so. You can see details for how to do this on the previous pages here.
Note that you can type the option -h
to get a description of each below script.
Currently, the default version of PICRUSt2 that is downloaded will be 2.5.3. As of version 2.6.0 we are introducing the updated database that is constructed using GTDB genomes and functional annotations from Eggnog. See the installation instructions here for information on installing and running the updated database.
The entire pipeline can be run with this command (details):
picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1
If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py
and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.
Place amplicon sequence variants (or OTUs) into reference phylogeny (details)
Place sequences into bacterial reference phylogeny:
place_seqs.py -s study_seqs.fna -o bac_placed_seqs.tre -p 1 \
-r bacteria \
--intermediate placement_working_bac
Place sequences into archaeal reference phylogeny:
place_seqs.py -s study_seqs.fna -o arc_placed_seqs.tre -p 1 \
-r archaea \
--intermediate placement_working_arc
Note that it is not necessary to run both of these if, for example, you know that your dataset should only contain bacteria.
Run hidden-state prediction to get 16S copy numbers per predicted genome (details)
Note that NSTI values will be added to the 16S prediction table (since the -n
option was given).
hsp.py -i 16S -r bacteria -t bac_placed_seqs.tre -o bac_marker_nsti_predicted.tsv.gz -p 1 -n
hsp.py -i 16S -r archaea -t arc_placed_seqs.tre -o arc_marker_nsti_predicted.tsv.gz -p 1 -n
Determining the best domain for each sequence (details)
Now we want to check whether the NSTI was lower for the bacterial or the archaeal references for each of our study sequences:
pick_best_domain.py \
-n1 bacteria \
-n2 archaea \
--tree_dom1 bac_placed_seqs.tre \
--tree_dom2 arc_placed_seqs.tre \
--tree_out_dom1 bac_reduced_placed_seqs.tre \
--tree_out_dom2 arc_reduced_placed_seqs.tre \
--nsti_table_dom1 bac_marker_nsti_predicted.tsv.gz \
--nsti_table_dom2 arc_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom1 bac_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom2 arc_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_combined combined_marker_nsti_predicted.tsv.gz
Run hidden-state prediction to get E.C. number, and KO abundances per predicted genome (details)
hsp.py -i EC -r bacteria -t bac_reduced_placed_seqs.tre -o bac_EC_predicted.tsv.gz -p 1
hsp.py -i KO -r bacteria -t bac_reduced_placed_seqs.tre -o bac_KO_predicted.tsv.gz -p 1
hsp.py -i EC -r archaea -t arc_reduced_placed_seqs.tre -o arc_EC_predicted.tsv.gz -p 1
hsp.py -i KO -r archaea -t arc_reduced_placed_seqs.tre -o arc_KO_predicted.tsv.gz -p 1
Combine files from HSP (details)
combine_domains.py --table_dom1 bac_EC_predicted.tsv.gz --table_dom2 arc_EC_predicted.tsv.gz -o combined_EC_predicted.tsv.gz
combine_domains.py --table_dom1 bac_KO_predicted.tsv.gz --table_dom2 arc_KO_predicted.tsv.gz -o combined_KO_predicted.tsv.gz
Predict E.C. and KO abundances in sequencing samples (adjusts gene family abundances by 16S sequence abundance) (details)
Note that this is the same page as before. At this point, we have combined the predictions that we obtained separately for the bacterial and archaeal domains and are now running this in the same way as we did previously.
metagenome_pipeline.py -i study_seqs.biom \
-m combined_marker_nsti_predicted.tsv.gz \
-f combined_EC_predicted.tsv.gz \
-o EC_metagenome_out
metagenome_pipeline.py -i study_seqs.biom \
-m combined_marker_nsti_predicted.tsv.gz \
-f combined_KO_predicted.tsv.gz \
-o KO_metagenome_out
Infer MetaCyc pathway abundances and coverages based on predicted E.C. number abundances (details)
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
-o pathways_out \
--intermediate pathways_working \
-p 1
Add descriptions as new column in gene family and pathway abundance tables (details)
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
-o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
-o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
-o pathways_out/path_abun_unstrat_descrip.tsv.gz
Note that this step has not changed.
An optional additional step is to shuffle the ASV labels in the genome prediction tables (i.e. the outputs of hsp.py
). Any analyses based on these shuffled tables can then be compared with analyses based on the actual data to check if there is more signal in the unshuffled data. See here for more details.
Please first check our FAQ if you have any questions about PICRUSt2.
For other general questions and comments about PICRUSt2 please search the PICRUSt google group. If the question has not been previously answered then please make a new thread.
To report a bug or to make a feature request please make a new issue at the top of this page.