Sequence placement

Robyn Wright edited this page Jan 14, 2025 · 21 revisions

PICRUSt2 wraps HMMER to align study sequences against a reference multiple-sequence alignment and then places them into the reference phylogeny with EPA-NG or SEPP. Under the typical workflow, the "study sequences" are the representative sequences of your OTUs and/or ASVs. GAPPA is then used to convert the resulting .jplace file into Newick format.

Please note that before PICRUSt2-v2.6.0 this command used the PICRUSt2-oldIMG database by default. As of PICRUSt2-v2.6.0 the default database is PICRUSt2-MPGA. See here for further details on this new database. To use the PICRUSt2-oldIMG database with PICRUSt2-v2.6.0, see the details for the --ref_dir option.

Note that your input study sequences need to be on the positive strand!

place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 --intermediate placement_working

The script takes these arguments/options:

  • -s FASTA: your study sequences (i.e. FASTA of amplicon sequence variants or operational taxonomic units)
  • --ref_dir DIRECTORY: Reference files to use for sequence placement. As of PICRUSt2-v2.6.0 the default is to place sequences in the bacterial tree, and the option also accepts 'bac'/'bacteria', 'arc'/'archaea' or 'oldIMG': use --ref_dir bac to place sequences in the bacterial tree, --ref_dir arc to place them in the archaeal tree, and --ref_dir oldIMG to use the oldIMG database. To use non-default reference files, give a directory containing the four expected files (see below).
  • -o TREEFILE: Output tree with placed study sequences.
  • -t epa-ng|sepp: Placement tool to use when placing sequences into reference tree. One of "epa-ng" or "sepp" must be input (default: epa-ng)
  • -p INT: Number of processes to run in parallel.
  • --intermediate: Option to specify a folder where intermediate files will be written (otherwise they will not be kept).
  • --chunk_size: Number of query seqs to read in at once for EPA-NG (default: 5000).
  • --verbose: Option to specify that wrapped commands will be printed to screen (useful for troubleshooting!).
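As a minimal sketch of the --ref_dir values described above, the loop below builds one placement command per built-in reference. The input and output file names are hypothetical, and the command is only executed when place_seqs.py is on your PATH and the input file exists; otherwise it is printed.

```shell
# Hypothetical input file name; replace with your own FASTA of ASVs/OTUs.
study_seqs="study_seqs.fna"

# 'bac', 'arc' and 'oldIMG' are the --ref_dir values listed above (v2.6.0+).
for ref in bac arc oldIMG; do
    cmd="place_seqs.py -s ${study_seqs} -o placed_seqs_${ref}.tre --ref_dir ${ref} -p 1"
    if command -v place_seqs.py >/dev/null 2>&1 && [ -f "${study_seqs}" ]; then
        # Intentionally unquoted so the string splits into separate arguments.
        ${cmd}
    else
        echo "would run: ${cmd}"
    fi
done
```

Each run writes its own output tree, so the bacterial, archaeal and oldIMG placements do not overwrite one another.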

Using Custom Reference Files

To use custom reference files you need to specify a directory with --ref_dir that contains:

  1. A multiple-sequence alignment (extension .fna or .fasta, optionally gzipped)
  2. A tree in Newick format (extension .tre)
  3. A hidden Markov model of the multiple-sequence alignment (extension .hmm)
  4. A model file output by RAxML specifying the best parameters for the tree (extension .model)

Note that the prefix of these files needs to be the same as the specified folder name. For instance, the default reference files (prokaryotic 16S rRNA gene alignment) are in picrust2/default_files/prokaryotic/pro_ref and they all have the prefix "pro_ref":

pro_ref.fna.gz
pro_ref.hmm
pro_ref.model
pro_ref.tre
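Before pointing --ref_dir at a custom folder, it can help to confirm that the four files are present and share the folder name as their prefix. The function below is a rough sketch of such a check; check_ref_dir is a hypothetical helper and is not part of PICRUSt2.

```shell
# check_ref_dir: hypothetical helper (not shipped with PICRUSt2).
# Returns 0 if DIR contains <prefix>.{fna|fasta}[.gz], .tre, .hmm and .model,
# where <prefix> is the folder name; otherwise reports what is missing.
# Note: variables are not declared local, to stay POSIX sh compatible.
check_ref_dir() {
    dir="${1%/}"                      # drop a trailing slash, if any
    prefix="$(basename "$dir")"       # file prefix must match the folder name
    status=0

    # The alignment may be .fna or .fasta, optionally gzipped.
    found_aln=0
    for ext in fna fna.gz fasta fasta.gz; do
        if [ -f "$dir/$prefix.$ext" ]; then
            found_aln=1
        fi
    done
    if [ "$found_aln" -eq 0 ]; then
        echo "missing alignment: $dir/$prefix.fna or .fasta (optionally .gz)" >&2
        status=1
    fi

    # Tree, HMM and model file.
    for ext in tre hmm model; do
        if [ ! -f "$dir/$prefix.$ext" ]; then
            echo "missing: $dir/$prefix.$ext" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_ref_dir path/to/my_ref && place_seqs.py -s study_seqs.fna -o placed_seqs.tre --ref_dir path/to/my_ref` would only start placement when the folder looks complete (my_ref is a placeholder name).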

If you do not have a model file you can create one by following these instructions. You can create an HMM of your alignment with hmmbuild.
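As an illustration of the hmmbuild step, the sketch below assumes a custom reference folder named my_ref (a hypothetical name) holding an uncompressed nucleotide alignment my_ref.fna. hmmbuild takes the output HMM file first and the alignment second; the --dna flag is an assumption that the alignment is DNA. The command is printed rather than run if hmmbuild or the alignment is not available.

```shell
# Hypothetical reference folder; files must share the folder name as prefix.
ref_name="my_ref"

# hmmbuild usage: hmmbuild [options] <hmmfile_out> <msafile>
hmm_cmd="hmmbuild --dna ${ref_name}/${ref_name}.hmm ${ref_name}/${ref_name}.fna"

if command -v hmmbuild >/dev/null 2>&1 && [ -f "${ref_name}/${ref_name}.fna" ]; then
    # Intentionally unquoted so the string splits into separate arguments.
    ${hmm_cmd}
else
    echo "would run: ${hmm_cmd}"
fi
```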

Further details on creating these files can be found in the wiki describing how the updated database was built here.
