-
Notifications
You must be signed in to change notification settings - Fork 204
Microbiome Helper 2 Functional annotation of metagenomic reads with HMMs
Authors: Robyn Wright
Modifications by: NA
Note: this page is still a work in progress
These are the steps that we have used for identifying hit sequences for a single gene of interest and normalising the abundance of that gene using the abundance of genes that are known to appear as single copies in most genomes (YchF [COG0012], PheS [COG0016], ArgS [COG0018], SerS [COG0172], CysS [COG0215], LeuS [COG0495], ValS [COG0525], TsaD [COG0533], Ffh [COG0541], and FtsY [COG0552]; as used by Delmont et al. 2018). We have used this for e.g. identifying genes in reads that are likely involved in plastic degradation by searching for similarity to genes that have been experimentally verified to be capable of these functions. For most use cases, we recommend annotating all reads using the same functional database (see our other workflows for this here), but this may be useful for users that are interested in a particular gene and want to be very confident in the matches for that gene that are obtained.
Currently, the commands used here are what have been used and they have not yet been modified to be generic.
- Fasta file containing multiple representatives of a gene of interest
- If you do not already have your own list of representatives, you may wish to download them from somewhere like UniProt
- We have an example file here (example to be added!!) that you can try
- Reads or contigs that have already been quality controlled or genomes of interest
Instructions for installing these packages will later be added to this page.
- filter_hits.py
- check_alignment.py
- get_known_tips.py
- distance_to_reference.R
These can all be found in microbiome_helper2/HMM_functional_annotation
clustalo --in=alkB_nt.fasta --out=alkB_nt.sto --force --outfmt=st --wrap=80 --threads 24
Convert alignment to fasta:
esl-reformat -o alkB_nt.afa afa alkB_nt.sto
hmmbuild --cpu 24 alkB_nt.hmm alkB_nt.sto
Note that the tree that I have used here was built using amino acid sequences and not the nucleotide sequences that I am using for the rest.
mkdir raxml_evaluate
raxml-ng --evaluate --msa alkB_nt.afa --tree alkB.nwk --prefix raxml_evaluate/alkB_model.info --model GTR+G+F --msa-format FASTA
I am using some of these that I have already built previously.
This section is not essential but probably will be important for determining appropriate thresholds for inclusion.
For me, these are in a folder called genomes_17_strains
This section is mainly necessary if you have made the SCG HMMs for yourself. Otherwise, I have already verified that we get the expected ~1 copy per genome.
mkdir hmmsearch_SCG_17_strains
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_SCG_17_strains/{1/.}-{2/.}.out {1} {2}' ::: single_copy_genes/*.hmm :::
genomes_17_strains/*.ffn
mkdir SCG_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_SCG_17_strains/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o SCG_hits/{1/.}-{2/.}.hits' ::: single_copy_genes/*.hmm ::: genomes_17_strains/*.ffn
Make count table:
cat SCG_hits/* > SCG_hits_17_genomes.hits
And check whether this is the expected number of hits per genome (i.e. we're expecting a median of 1 hit per genome)
mkdir hmmsearch_alkB_17_strains
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_alkB_17_strains/{1/.}-{2/.}.out {1} {2}' ::: alkB_nt.hmm ::: genomes_17_strains/*.ffn
mkdir alkB_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_alkB_17_strains/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o alkB_hits/{1/.}-{2/.}.hits -f --in_fasta {2} --out_fasta alkB_hits/{1/.}-{2/.}.fasta' ::: alkB_nt.hmm ::: genomes_17_strains/*.ffn
Combine counts:
cat alkB_hits/*hits > alkB_hits_17_genomes.hits
Combine sequences:
cat alkB_hits/*fasta > alkB_hits_17_genomes.fasta
clustalo --in=alkB_hits_17_genomes.fasta --out=alkB_hits_17_genomes.afa --force --outfmt=a2m --threads 24 --profile1 alkB_nt.afa
And filter aligned output file to include only hits (not reference sequences):
python get_hit_aligned_only.py -i alkB_hits_17_genomes.afa -r alkB_nt.afa -o alkB_hits_17_genomes_only.afa
NA (see below)
mkdir alkB_epa_out
epa-ng --tree alkB.nwk \
--ref-msa alkB_nt.afa \
--query alkB_hits_17_genomes_only.afa \
-T 24 \
-m raxml_evaluate/alkB_model.info.raxml.bestModel \
-w alkB_epa_out \
--filter-acc-lwr 0.99 \
--filter-max 100
Convert resulting files from jplace into newick using gappa:
gappa examine graft --jplace-path alkB_epa_out/epa_result.jplace \
--fully-resolve \
--out-dir alkB_epa_out/
python get_known_tips.py -i alkB_nt.fasta -o alkB_tips.csv
Rscript distance_to_reference.R alkB_epa_out/epa_result.newick alkB_tips.csv alkB_known_tip_distance.tsv
I am using contigs that are in a folder called prodigal_contigs
.
mkdir hmmsearch_contigs
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_contigs/{1/.}-{2/.}.out {1} {2}' ::: single_copy_genes/*.hmm ::: prodigal_contigs/*.fna
mkdir SCG_contigs_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_contigs/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o SCG_contigs_hits/{1/.}-{2/.}.hits' ::: single_copy_genes/*.hmm ::: prodigal_contigs/*.fna
cat SCG_contigs_hits/* > SCG_contigs.hits
mkdir hmmsearch_alkB_contigs
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_alkB_contigs/{1/.}-{2/.}.out {1} {2}' ::: alkB_nt.hmm ::: prodigal_contigs/*.fna
mkdir alkB_hits_contigs
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_alkB_contigs/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o alkB_hits_contigs/{1/.}-{2/.}.hits -f --in_fasta {2} --out_fasta alkB_hits_contigs/{1/.}-{2/.}.fasta' ::: alkB_nt.hmm ::: prodigal_contigs/*.fna
Combine counts:
cat alkB_hits_contigs/*hits > alkB_hits_contigs.hits
Combine sequences:
cat alkB_hits_contigs/*fasta > alkB_hits_contigs.fasta
Had some issues with clustal omega so using hmmalign here:
hmmalign --trim --dna --mapali alkB_nt.afa --informat FASTA -o alkB_hits_contigs_hmmalign.sto alkB_nt.hmm alkB_hits_contigs.fasta
Convert hmmalign out to fasta:
esl-reformat -o alkB_hits_contigs_hmmalign.afa afa alkB_hits_contigs_hmmalign.sto
python check_alignment.py --ref_msa alkB_nt.afa --in_msa alkB_hits_contigs_hmmalign.afa --in_fasta alkB_hits_contigs.fasta --out_ref_msa alkB_ref_checked_contigs.afa --out_msa alkB_hits_checked_contigs.afa
mkdir alkB_contigs_epa_out
epa-ng --tree alkB.nwk \
--ref-msa alkB_ref_checked_contigs.afa \
--query alkB_hits_checked_contigs.afa \
-T 24 \
-m raxml_evaluate/alkB_model.info.raxml.bestModel \
-w alkB_contigs_epa_out \
--filter-acc-lwr 0.99 \
--filter-max 100
Convert resulting files from jplace into newick using gappa:
gappa examine graft --jplace-path alkB_contigs_epa_out/epa_result.jplace \
--fully-resolve \
--out-dir alkB_contigs_epa_out/
Should already have the alkB_tips.csv
file from above, so not re-running this step.
#python get_known_tips.py -i alkB_ref_checked_contigs.afa -o alkB_tips.csv
Rscript distance_to_reference.R alkB_contigs_epa_out/epa_result.newick alkB_tips.csv alkB_known_tip_distance_contigs.tsv
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].