Skip to content

PICRUSt Tutorial with de novo Variants

Gavin Douglas edited this page Sep 27, 2017 · 31 revisions

UNDER CONSTRUCTION

Note this workflow is a work-in-progress and is still being altered. It is currently being tested and so should be used cautiously.

One drawback of using PICRUSt is that it restricts users to picking against OTUs in the Greengenes database. This is not ideal for several reasons including that a large proportion of OTUs are typically de novo and OTU picking is gradually being replaced by error correction approaches like DADA2 and deblur.

To get around this drawback the user can run the genome prediction steps of PICRUSt themselves (rather than having a pre-calculated supplied for reference OTUs).

Concatenate study and reference 16S rRNA gene sequences into a single file.

cat dada2_out/seqtab.fasta img_gg_starting_files/gg_13_5_img_subset.fasta > study_seqs_gg_13_5_img_subset.fasta

Discuss optional vsearch step.

Align all sequences to full Greengenes reference using PyNAST ~ 1m30s:

align_seqs.py -e 90 -p 0.1 -i study_seqs_gg_13_5_img_subset.fasta -o study_seqs_gg_13_5_img_subset_pynast

Filter alignment:

filter_alignment.py -i study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned.fasta -o study_seqs_gg_13_5_img_subset_pynast

Run fasttree with topology constraints for reference sequences, ~1min

export OMP_NUM_THREADS=9
FastTreeMP  -nt -gamma -fastest -no2nd -spr 4 -constraints img_gg_starting_files/99_otus_IMG_pruned_no_names_constraint.txt < study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned_pfiltered.fasta > study_seqs_gg_13_5_img_subset.tre

PICRUSt genome prediction

Start by formatting input tree and trait table.

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/gg_13_5_img_16S_counts.txt -o format/16S

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/img_400_ko.tab -o format/KO -m img_gg_starting_files/gg_13_5_img_fixed.txt

Run ancestral state reconstruction steps for both traits (~30s and ~24min respectively)

ancestral_state_reconstruction.py -i format/16S/trait_table.tab -t format/16S/pruned_tree.newick -o asr/16S_asr_counts.tab -c asr/asr_ci_16S.tab

You could also run this command to run ASR for the KO table. This command takes ~24 minutes so we have made the output files available in img_gg_starting_files/KO_asr_precalc.

ancestral_state_reconstruction.py -i format/KO/trait_table.tab -t format/KO/pruned_tree.newick -o asr/KO_asr_counts.tab -c asr/asr_ci_KO.tab

Get names of dada2 variants

grep ">seq" dada2_out/seqtab.fasta | tr -d ">" | tr "\n" "," > study_ids.txt

Run predict traits:

predict_traits.py -i format/16S/trait_table.tab -t format/16S/reference_tree.newick -r asr/16S_asr_counts.tab -o predict_traits/16S_precalculated.tab -a -c asr/asr_ci_16S.tab -g "$(< study_ids.txt)"

predict_traits.py -i format/KO/trait_table.tab -t format/KO/reference_tree.newick -r img_gg_starting_files/KO_asr_precalc/KO_asr_counts.tab -o predicted/KO_precalculated.tab -a -c img_gg_starting_files/KO_asr_precalc/asr_ci_KO.tab -g "$(< study_ids.txt)" 

Add metadata to KO table

echo "" >> predicted/KO_precalculated.tab
cat img_gg_starting_files/metadata/precalc_meta >> predicted/KO_precalculated.tab

Get predictions per sample based on the abundance of each variant.

normalize_by_copy_number.py -i dada2_out/seqtab_tax_rarified.biom -c predicted/16S_precalculated.tab -o seqtab_tax_rarified_norm.biom

predict_metagenomes.py -i dada2_out/seqtab_tax_rarified.biom -c predicted/KO_precalculated.tab -o seqtab_tax_rarified_norm_ko.biom

categorize_by_function.py -i seqtab_tax_rarified_norm_ko.biom -o seqtab_tax_rarified_norm_ko_l3.biom -c KEGG_Pathways -l 3

Convert to STAMP format.

biom_to_stamp.py -m "KEGG_Description" seqtab_tax_rarified_norm_ko.biom > seqtab_tax_rarified_norm_ko.spf
biom_to_stamp.py -m "KEGG_Pathways" seqtab_tax_rarified_norm_ko_l3.biom > seqtab_tax_rarified_norm_ko_l3.spf
Clone this wiki locally