
Metagenome prediction

Gavin Douglas edited this page Aug 1, 2019 · 21 revisions

The script metagenome_pipeline.py reads in a sequence abundance table (the abundances of OTUs or ASVs in BIOM, TSV, or mothur shared file format), the predicted marker gene abundances, and the predicted gene family abundances (these last two files are output by hsp.py). The sequence abundances should be read counts, not relative abundances. The script normalizes the input sequence abundance table by the predicted number of marker gene copies and then determines the predicted functional profile of each sample. Output stratified by sequence id (i.e. by taxonomic contributor) is also written if the --strat_out option is used, and rare ASVs can be collapsed into a single category in the stratified output table with the --min_reads and --min_samples options. Note that the output files are tab-delimited even if the input files were in BIOM format. The normalized sequence abundance table and the per-sample weighted nearest-sequenced taxon index (NSTI) values are also written to the output directory as separate files.
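The normalization and prediction steps described above can be illustrated with a small Python sketch. This is not the PICRUSt2 implementation: the function names, the nested-dict table layout, and the toy values are all assumptions made for illustration. It divides each sequence's read count by its predicted marker copy number, then sums normalized abundance times predicted gene family copy number over all sequences in each sample.

```python
# Illustrative sketch only: not the PICRUSt2 source code.
# Function names, table layout, and toy values are hypothetical.

def normalize_by_marker(seq_abun, marker_copies):
    """Divide each sequence's read count by its predicted marker copy number."""
    return {
        seq: {sample: count / marker_copies[seq] for sample, count in samples.items()}
        for seq, samples in seq_abun.items()
    }


def predict_unstratified(norm_abun, func_copies):
    """Sum (normalized abundance x predicted gene family copies) per sample."""
    profile = {}
    for seq, samples in norm_abun.items():
        for func, copies in func_copies[seq].items():
            per_sample = profile.setdefault(func, {})
            for sample, abun in samples.items():
                per_sample[sample] = per_sample.get(sample, 0.0) + abun * copies
    return profile


# Toy inputs: read counts per sample, predicted marker copies,
# and predicted gene family copies per sequence.
seq_abun = {"ASV1": {"S1": 10, "S2": 0}, "ASV2": {"S1": 6, "S2": 4}}
marker_copies = {"ASV1": 2, "ASV2": 1}
func_copies = {
    "ASV1": {"K00001": 1, "K00002": 0},
    "ASV2": {"K00001": 2, "K00002": 1},
}

norm = normalize_by_marker(seq_abun, marker_copies)
pred = predict_unstratified(norm, func_copies)
print(pred)
```

In this toy case ASV1 has two marker copies, so its 10 reads in S1 count as 5 genomes before its gene family copies are added to the sample total.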

Below are examples of how this command is run to get unstratified metagenome predictions for EC numbers and KEGG orthologs (KOs):

metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f EC_predicted.tsv.gz \
                       -o EC_metagenome_out


metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f KO_predicted.tsv.gz \
                       -o KO_metagenome_out

The input arguments/options are:

  • -i STUDY.biom - Tab-delimited table, BIOM, or mothur shared file containing counts of study variants across all samples.

  • -m MARKER_PREDICTED.tsv.gz - Table of predicted 16S (or other marker gene) copy numbers for all study sequences, as output by hsp.py.

  • -f FUNC_PREDICTED.tsv.gz - Table of predicted functional abundances for all study sequences, as output by hsp.py.

  • --max_nsti INT - Maximum NSTI value per study sequence (sequences with values above this cut-off will be removed). Default: 2.

  • --min_reads INT - Min number of reads an ASV must have to NOT be collapsed into "RARE" category. Default: 1.

  • --min_samples INT - Min number of samples an ASV must be in to NOT be collapsed into "RARE" category. Default: 1.

  • --metagenome_contrib - Output long-form gzipped table called "metagenome_contrib.tsv.gz" that breaks down how each input ASV contributes to each predicted gene family. Note that the column names of this file refer to OTUs for backwards compatibility. (This option was only present in v2.1.4-b; as of v2.2.0-b this is the default stratified output format.)

  • --strat_out - Flag to indicate that stratified output should also be generated.

  • --wide_table - Output wide-format stratified table of metagenome predictions when --strat_out is set. This is the deprecated method of generating stratified tables since it is extremely memory intensive. The stratified outfile is named pred_metagenome_strat.tsv.gz when this option is set (added in v2.2.0-b).

  • --skip_norm - Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S rRNA genes). This step will be performed automatically unless this option is specified (added in v2.2.0-b).

  • -o metagenome_out - Output directory.
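
The filtering behaviour of --max_nsti, --min_reads, and --min_samples can be pictured with the Python sketch below. It is a hedged illustration rather than the PICRUSt2 code: the function names and toy data are hypothetical, it assumes --min_reads is compared against an ASV's total reads across all samples, and it collapses rows of an abundance table, whereas the script applies the "RARE" category to the stratified output.

```python
# Illustrative sketch only: not the PICRUSt2 source code.
# Function names, toy values, and the "total reads across samples"
# interpretation of min_reads are assumptions.

def filter_by_nsti(seq_abun, nsti, max_nsti=2):
    """Drop sequences whose NSTI value is above the cut-off."""
    return {seq: s for seq, s in seq_abun.items() if nsti[seq] <= max_nsti}


def collapse_rare(seq_abun, min_reads=1, min_samples=1):
    """Sum ASVs that fail either threshold into a single "RARE" row."""
    collapsed = {}
    for seq, samples in seq_abun.items():
        total_reads = sum(samples.values())
        n_samples = sum(1 for count in samples.values() if count > 0)
        keep = total_reads >= min_reads and n_samples >= min_samples
        target = collapsed.setdefault(seq if keep else "RARE", {})
        for sample, count in samples.items():
            target[sample] = target.get(sample, 0) + count
    return collapsed


# Toy inputs: read counts per sample and an NSTI value per sequence.
seq_abun = {
    "ASV1": {"S1": 10, "S2": 5},
    "ASV2": {"S1": 1, "S2": 0},
    "ASV3": {"S1": 0, "S2": 2},
}
nsti = {"ASV1": 0.1, "ASV2": 0.5, "ASV3": 2.5}

kept = filter_by_nsti(seq_abun, nsti, max_nsti=2)
result = collapse_rare(kept, min_reads=2, min_samples=1)
print(result)
```

Here ASV3 is removed because its NSTI exceeds the cut-off, and ASV2 is folded into "RARE" because it has fewer than two reads in total.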
