
PICRUSt Tutorial with de novo Variants

Gavin Douglas edited this page Sep 27, 2017 · 31 revisions

Note: this workflow is a work in progress and is still changing. It is currently being tested and should be used with caution.

One drawback of PICRUSt is that it restricts users to picking OTUs against the Greengenes database. This is not ideal for several reasons: for example, a large proportion of OTUs are typically de novo, and OTU picking is gradually being replaced by error-correction approaches such as DADA2 and deblur.

To get around this drawback, the user can run the genome prediction steps of PICRUSt themselves (rather than relying on the pre-calculated table supplied for reference OTUs).

Concatenate study and reference 16S rRNA gene sequences into a single file.

cat dada2_out/seqtab.fasta img_gg_starting_files/gg_13_5_img_subset.fasta > study_seqs_gg_13_5_img_subset.fasta
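Because the study and reference sequences end up in a single alignment and tree, an ID that occurs in both input files would be ambiguous in later steps. As a rough sanity check before concatenating, something like the sketch below could be used. `duplicate_fasta_ids` is a hypothetical helper name (it is not part of PICRUSt or QIIME), and it assumes the ID is the first whitespace-delimited word of each header line.

```shell
# Hypothetical helper (not part of PICRUSt/QIIME): print every sequence ID
# that occurs more than once across the FASTA files given as arguments.
# Assumes the ID is the first whitespace-delimited word of each header line.
duplicate_fasta_ids() {
    grep -h '^>' "$@" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}
```

For example, `duplicate_fasta_ids dada2_out/seqtab.fasta img_gg_starting_files/gg_13_5_img_subset.fasta` should print nothing if no ID is shared between the two files.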

Discuss optional vsearch step.

Align all sequences to the full Greengenes reference using PyNAST (~1m30s):

align_seqs.py -e 90 -p 0.1 -i study_seqs_gg_13_5_img_subset.fasta -o study_seqs_gg_13_5_img_subset_pynast

Filter alignment:

filter_alignment.py -i study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned.fasta -o study_seqs_gg_13_5_img_subset_pynast
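Sequences that PyNAST could not align are not carried into the alignment, so it can be worth comparing record counts before and after these two steps. The following is a minimal sketch; both function names are hypothetical helpers, not QIIME commands.

```shell
# Hypothetical helper: count the records (header lines) in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: print how many more records the first FASTA file
# contains than the second (e.g. input sequences vs. filtered alignment).
missing_record_count() {
    echo $(( $(count_fasta_records "$1") - $(count_fasta_records "$2") ))
}
```

For example, `missing_record_count study_seqs_gg_13_5_img_subset.fasta study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned_pfiltered.fasta` prints the number of sequences lost between the concatenated input and the filtered alignment.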

Run FastTree with topology constraints for the reference sequences (~1 min):

export OMP_NUM_THREADS=9
FastTreeMP -nt -gamma -fastest -no2nd -spr 4 -constraints img_gg_starting_files/99_otus_IMG_pruned_no_names_constraint.txt < study_seqs_gg_13_5_img_subset_pynast/study_seqs_gg_13_5_img_subset_aligned_pfiltered.fasta > study_seqs_gg_13_5_img_subset.tre
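Before moving on to genome prediction, it may help to confirm that the study variants actually appear as tips in the resulting tree. The sketch below lists tip labels from a Newick file using only standard text tools. `newick_tip_labels` is a hypothetical helper, and it assumes unquoted labels that contain none of the characters `( ) , : ;`. It is not a general Newick parser.

```shell
# Hypothetical helper: list the tip labels of a Newick tree, one per line.
# Assumes unquoted labels containing none of the characters ( ) , : ;
newick_tip_labels() {
    tr -d '\n' < "$1" \
        | sed 's/)[^(),:;]*//g' \
        | tr '(,' '\n\n' \
        | sed -e 's/:.*//' -e 's/;$//' \
        | grep -v '^$' \
        | sort
}
```

The first `sed` drops each closing parenthesis together with any internal node label or support value that follows it, so only tip entries remain once the string is split on `(` and `,`. For example, `newick_tip_labels study_seqs_gg_13_5_img_subset.tre | grep -c '^seq'` counts the tips whose names start with `seq`, which should match the number of study variants if they are named that way.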

PICRUSt genome prediction

Start by formatting the input tree and the trait tables:

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/gg_13_5_img_16S_counts.txt -o format/16S

format_tree_and_trait_table.py -t study_seqs_gg_13_5_img_subset.tre -i img_gg_starting_files/img_400_ko.tab -o format/KO -m img_gg_starting_files/gg_13_5_img_fixed.txt

Run the ancestral state reconstruction steps for both traits (~30s and ~24min respectively):

ancestral_state_reconstruction.py -i format/16S/trait_table.tab -t format/16S/pruned_tree.newick -o asr/16S_asr_counts.tab -c asr/asr_ci_16S.tab -p -j multithreaded -n 9

ancestral_state_reconstruction.py -i format/KO/trait_table.tab -t format/KO/pruned_tree.newick -o asr/KO_asr_counts.tab -c asr/asr_ci_KO.tab -p -j multithreaded -n 9

Get the names of the DADA2 variants:

grep ">seq" dada2_out/seqtab.fasta | tr -d ">" | tr "\n" "," > study_ids.txt
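Because every newline is turned into a comma, the command above leaves a trailing comma (and no final newline) in study_ids.txt. If that turns out to matter downstream, a possible alternative is sketched here. `fasta_ids_csv` is a hypothetical helper, and unlike the command above it also drops anything after the first whitespace in each header.

```shell
# Hypothetical helper: print the IDs of all FASTA records whose header starts
# with the given prefix as one comma-separated line, with no trailing comma.
# $1 = FASTA file, $2 = ID prefix (e.g. "seq")
fasta_ids_csv() {
    grep "^>$2" "$1" | sed -e 's/^>//' -e 's/[[:space:]].*//' | paste -sd, -
}
```

For example, `fasta_ids_csv dada2_out/seqtab.fasta seq > study_ids.txt` would write the same IDs as a single comma-separated line.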

Run predict traits:

predict_traits.py -i format/16S/trait_table.tab -t format/16S/reference_tree.newick -r asr/16S_asr_counts.tab -o predict_traits/16S_precalculated.tab -a -c asr/asr_ci_16S.tab -g "$(< study_ids.txt)"

predict_traits.py -i format/KO/trait_table.tab -t format/KO/reference_tree.newick -r asr/KO_asr_counts.tab -o predict_traits/KO_precalculated.tab -a -c asr/asr_ci_KO.tab -g "$(< study_ids.txt)" 

Add metadata to the KO table (adjust the path to add_picrust_metadata.py to match your local copy of microbiome_helper):

python /home/gavin/github_repos/microbiome_helper/add_picrust_metadata.py -i img_gg_starting_files/img_400_ko.tab -m img_gg_starting_files/KEGG_Pathways.tab -o img_kegg_pathway_metadata.txt

python /home/gavin/github_repos/microbiome_helper/add_picrust_metadata.py -i img_gg_starting_files/img_400_ko.tab -m img_gg_starting_files/KEGG_Description.tab -o img_kegg_description_metadata.txt

cat img_kegg_description_metadata.txt >> predict_traits/KO_precalculated.tab
cat img_kegg_pathway_metadata.txt >> predict_traits/KO_precalculated.tab
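Since `cat ... >>` appends bytes verbatim, a file that does not end in a newline would have its last row fused with the first appended row. A small check like the following could be run on the table and on each metadata file before appending. `ends_with_newline` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) if the file is empty or its last
# byte is a newline, so that appending to it starts on a fresh line.
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}
```

For example, `ends_with_newline predict_traits/KO_precalculated.tab || echo "missing final newline"` prints a warning only when the table lacks a trailing newline.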