-
Notifications
You must be signed in to change notification settings - Fork 105
Testing the updated PICRUSt2 database
I have been working on an updated PICRUSt2 database for a while now. This uses GTDB r214 genomes that have been annotated using Eggnog v2.1.12, giving BiGG reaction, CAZy, EC, GO, KO, PFAM and gene name annotations for all genomes. I am currently working on benchmarking this database against the previous database.
Note that this new database has not yet been published and we are currently in the testing phase. If you would like to use it for your work and provide us with feedback, that would be very much appreciated. Please reach out to Robyn Wright ([email protected]) with questions.
For genomes to be included in this new database, we required them to be present in the GTDB bacterial or archaeal phylogenetic trees, have at least one 16S gene copy (>80% expected 16S length), be >90% complete, and be <10% redundant. This left 26,868 bacterial and 1,002 archaeal genomes. Details on the included genomes are in the archaea/archaea_metadata.csv.gz and bacteria/bacteria_metadata.csv.gz files within the default_files folder. This includes taxonomy information.
GTDB provides phylogenetic trees that are based on multiple markers (please see their methods for more details). We are now inserting study sequences into these trees using an alignment of the 16S genes, but the trees should theoretically be much more robust than the previous one using only 16S sequences.
In the new database, we have 1.3-fold more KOs (14,106 vs 10,543) and 1.3-fold more EC numbers (3,763 vs 2,913) than the previous database. We've updated the pathway and reaction mapping files to those used in HUMAnN 3.
For users that want to run PICRUSt2 in the same way as they have previously but with new datasets, we have left the current scripts as-is, but we have added a new picrust2_pipeline_split.py
script that will run the following steps:
- Place study sequences in bacterial and archaeal reference trees
- Calculate Nearest Sequenced Taxon Indices for all study sequences in both trees
- Determine which tree is the best fit (lowest NSTI) for each study sequence. Reduce the output trees to only contain the sequences that fit best with that domain
- Carry out hidden state prediction for all traits for both bacteria and archaea
- Combine the prediction tables for bacteria and archaea for each trait
- Metagenome prediction
- Pathway prediction
This new version can currently be installed from the Github repository only. You can do that like so:
git clone --single-branch -b PICRUSt2-v2.6.0 https://github.com/picrust/picrust2
mv picrust2 picrust2-v2.6.0
cd picrust2-v2.6.0
conda env create -n picrust2-v2.6.0 -f picrust2-env.yaml
conda activate picrust2-v2.6.0
pip install --editable .
I haven't yet updated every part of the package, so if some things (like pytest or the version number) are different, please don't be concerned. If you can run the pipeline below then it is working as expected.
picrust2_pipeline_split.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1
You can see a description of most of these files on the previous page.
The key output files remain the same:
-
EC_metagenome_out
- Folder containing unstratified EC number metagenome predictions (pred_metagenome_unstrat.tsv.gz
), sequence table normalized by predicted 16S copy number abundances (seqtab_norm.tsv.gz
), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz
). -
KO_metagenome_out
- AsEC_metagenome_out
above, but for KO metagenomes. -
pathways_out
- Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.
You'll see some additional intermediate files in the folder, including:
-
combined_marker_predicted_and_nsti.tsv.gz
- this file now includes additional columns that tells you whether each sequence fitted better in the bacterial or the archaeal tree as well as which the genome with the lowest NSTI was. You can search for this in the metadata file if you like. It is worth checking whether this fits your expectations based on your taxonomic classifications (if you have them). -
[bac/arc].tre
- these are the output tree files created in step 1 above. -
[bac/arc]_marker_predicted_and_nsti.tsv.gz
- these are the original files calculated in step 2 above. -
[bac/arc]_reduced.tre
and[bac/arc]_reduced_marker_predicted_and_nsti.tsv.gz
- these are the filtered files created in step 3 above. -
[bac/arc]_[EC/KO]_predicted.tsv.gz
- these are the predictions obtained using the filtered files in step 4 above. -
combined_[EC/KO]_predicted.tsv.gz
- these are the files obtained from step 5 above.
This can be run in the same way as the previous add_descriptions.py
script:
add_descriptions_split.py -i picrust2_out_pipeline_split/KO_metagenome_out/pred_metagenome_unstrat.tsv.gz \
-m KO -o picrust2_out_pipeline_split/KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
Currently, the available options for this are KO, EC and METACYC.
We have been using samples from the Blueberry, Cameroon (stool), HMP, Indian (stool), Mammal, Ocean and Primate datasets that were used in the PICRUSt2 manuscriptPICRUSt2 manuscript.
In our testing with these datasets, we found that 21,453 of 21,552 sequences (99.9%) were placed in the tree that we expected them to be placed. The remaining sequences came from the Ocean (n=8) or Primate (n=1) datasets. We'll add some more details on this later, but for the most part, these sequences appeared to not have high similarity with any sequences but were likely archaea.
The median NSTI for the testing datasets ranged between 0.05 (HMP) and 0.32 (Primate) with the default PICRUSt2 database, and it is now between 0.03 (HMP) and 0.15 (Ocean) with the new database. We believe that this should lead to improvements in the reliability of predictions.
Comparing the intersection of the predictions that we obtain using the updated database with the previous PICRUSt2 database, we get Spearman correlation coefficients of 0.89 (Primate) - 0.93 (Ocean) for EC numbers, 0.85 (Indian) - 0.90 (Blueberry) for KOs, and 0.85 (Mammal) - 0.90 (Cameroon) for pathways. We therefore believe that we are obtaining reasonable predictions using this new database.
Previously, PICRUSt2 functional predictions were compared with functional annotations obtained with HUMAnN2 on paired shotgun metagenome data as a "gold standard". We did the same thing again and found that Spearman correlation coefficients were slightly lower for the updated database than they were for the default database for all datasets aside from Primate, which had slightly higher correlations for the updated database. Bray-Curtis distances were slightly lower for the updated database for EC numbers and slightly higher for the updated database for KOs.
As this comparison is inherently biased due to the use of specific versions of reference databases within HUMAnN2 (or any other functional annotation tool), we have also made some mock samples using genomes that are not present within the PICRUSt2 database. The abundances and taxonomy of these genomes is loosely based on the datasets used in the original PICRUSt2 manuscript. Briefly, feature tables for the datasets were collapsed at the genus level and a genome that matched each genus as best as possible was chosen that was >= 90% complete and <=10% redundant. 16S copies were identified in these genomes and the feature tables were filtered to only these "genera". 16S genes were samples randomly from the genomes matching each genus to make up the counts within the feature tables and the 16S sequences were clustered at 100% identity to make new mock feature tables/representative sequences. Both the default and updated PICRUSt2 databases were used for functional predictions, and the "real" metagenomes were constructed by multiplying the abundances of all genomes by the counts of EC/KOs in their genomes (as annotated by Eggnog). Spearman correlation coefficients were higher and Bray-Curtis distances were lower for the updated than default database for all datasets, with the differences typically being greater for the non-human associated (Blueberry Soil, Mammal, Ocean and Primate) than the human associated (Cameroon, HMP and Indian) datasets.
Please first check our FAQ if you have any questions about PICRUSt2.
For other general questions and comments about PICRUSt2 please search the PICRUSt google group. If the question has not been previously answered then please make a new thread.
To report a bug or to make a feature request please make a new issue at the top of this page.