Skip to content

Microbiome Helper 2 Functional annotation of metagenomic reads with HMMs

Robyn Wright edited this page Dec 20, 2024 · 7 revisions

Authors: Robyn Wright
Modifications by: NA

Note: this page is still a work in progress

Introduction

These are the steps that we have used for identifying hit sequences for a single gene of interest and normalising the abundance of that gene using the abundance of genes that are known to appear as single copies in most genomes (YchF [COG0012], PheS [COG0016], ArgS [COG0018], SerS [COG0172], CysS [COG0215], LeuS [COG0495], ValS [COG0525], TsaD [COG0533], Ffh [COG0541], and FtsY [COG0552]; as used by Delmont et al. 2018). We have used this for e.g. identifying genes in reads that are likely involved in plastic degradation by searching for similarity to genes that have been experimentally verified to be capable of these functions. For most use cases, we recommend annotating all reads using the same functional database (see our other workflows for this here), but this may be useful for users that are interested in a particular gene and want to be very confident in the matches for that gene that are obtained.

Currently, the commands used here are what have been used and they have not yet been modified to be generic.

Requirements

Data needed

  • Fasta file containing multiple representatives of a gene of interest
    • If you do not already have your own list of representatives, you may wish to download them from somewhere like UniProt
    • We have an example file here (example to be added!!) that you can try
  • Reads or contigs that have already been quality controlled or genomes of interest

Programs needed

Instructions for installing these packages will later be added to this page.

Scripts needed

  • filter_hits.py
  • check_alignment.py
  • get_known_tips.py
  • distance_to_reference.R

These can all be found in microbiome_helper2/HMM_functional_annotation

1. Creating the Hidden Markov Model with a gene of interest

1.1 Make Multiple Sequence Alignments using Clustal Omega

clustalo --in=alkB_nt.fasta --out=alkB_nt.sto --force --outfmt=st --wrap=80 --threads 24

Convert alignment to fasta:

esl-reformat -o alkB_nt.afa afa alkB_nt.sto

1.2 Build HMMs with HMMER (hmmbuild)

hmmbuild --cpu 24 alkB_nt.hmm alkB_nt.sto

1.3 Optional: Make trees with gene of interest

1.3.1 Build tree with RAxML

Note that the tree that I have used here was built using amino acid sequences and not the nucleotide sequences that I am using for the rest.

1.3.2 Optimise tree branch length and model parameters with RAxML

mkdir raxml_evaluate
raxml-ng --evaluate --msa alkB_nt.afa --tree alkB.nwk --prefix raxml_evaluate/alkB_model.info --model GTR+G+F --msa-format FASTA

2. Get Single Copy Core Gene HMMs

I am using some of these that I have already built previously.

2.1 Option for downloading these

2.2 Build SCG HMMs

2.2.1 Get representative genomes

2.2.2 Get sequences matching SCG names from representative genomes

2.2.3 Make Multiple Sequence Alignments using Clustal Omega

2.2.4 Build HMMs with HMMER (hmmbuild)

3. Use HMM to search representative genomes

This section is not essential but probably will be important for determining appropriate thresholds for inclusion.

For me, these are in a folder called genomes_17_strains

3.1 Optional: Search in genomes using HMMER (hmmsearch) with SCGs

This section is mainly necessary if you have made the SCG HMMs for yourself. Otherwise, I have already verified that we get the expected ~1 copy per genome.

3.1.1 Run hmmsearch in genomes

mkdir hmmsearch_SCG_17_strains
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_SCG_17_strains/{1/.}-{2/.}.out {1} {2}' ::: single_copy_genes/*.hmm ::: 
genomes_17_strains/*.ffn

3.1.2 Filter hits e-value < 10-25 from output

mkdir SCG_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_SCG_17_strains/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o SCG_hits/{1/.}-{2/.}.hits' ::: single_copy_genes/*.hmm ::: genomes_17_strains/*.ffn

Make count table:

cat SCG_hits/* > SCG_hits_17_genomes.hits

And check whether this is the expected number of hits per genome (i.e. we're expecting a median of 1 hit per genome)

3.2 Search in genomes using HMMER (hmmsearch) with gene of interest

mkdir hmmsearch_alkB_17_strains
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_alkB_17_strains/{1/.}-{2/.}.out {1} {2}' ::: alkB_nt.hmm ::: genomes_17_strains/*.ffn

3.3 Filter hits e-value < 10-25 from output

mkdir alkB_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_alkB_17_strains/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o alkB_hits/{1/.}-{2/.}.hits -f --in_fasta {2} --out_fasta alkB_hits/{1/.}-{2/.}.fasta' ::: alkB_nt.hmm ::: genomes_17_strains/*.ffn

Combine counts:

cat alkB_hits/*hits > alkB_hits_17_genomes.hits

Combine sequences:

cat alkB_hits/*fasta > alkB_hits_17_genomes.fasta

3.4 Align hit sequences against reference sequences using Clustal Omega

clustalo --in=alkB_hits_17_genomes.fasta --out=alkB_hits_17_genomes.afa --force --outfmt=a2m --threads 24 --profile1 alkB_nt.afa 

And filter aligned output file to include only hits (not reference sequences):

python get_hit_aligned_only.py -i alkB_hits_17_genomes.afa -r alkB_nt.afa -o alkB_hits_17_genomes_only.afa

3.5 Discard hit sequences that don't align well with reference

NA (see below)

3.6 Insert hit sequences into reference tree with EPA-NG

mkdir alkB_epa_out
epa-ng --tree alkB.nwk \
        --ref-msa alkB_nt.afa \
        --query alkB_hits_17_genomes_only.afa \
        -T 24 \
        -m raxml_evaluate/alkB_model.info.raxml.bestModel \
        -w alkB_epa_out \
        --filter-acc-lwr 0.99 \
        --filter-max 100

Convert resulting files from jplace into newick using gappa:

gappa examine graft --jplace-path alkB_epa_out/epa_result.jplace \
                    --fully-resolve \
                    --out-dir alkB_epa_out/

3.7 Get distances to the nearest reference tips in tree using castor

python get_known_tips.py -i alkB_nt.fasta -o alkB_tips.csv
Rscript distance_to_reference.R alkB_epa_out/epa_result.newick alkB_tips.csv alkB_known_tip_distance.tsv

3.8 Optional: Visualise tree

Something something

4. Use HMM to search reads/contigs/genomes of interest

I am using contigs that are in a folder called prodigal_contigs.

4.1 Run hmmsearch with SCGs

mkdir hmmsearch_contigs
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_contigs/{1/.}-{2/.}.out {1} {2}' ::: single_copy_genes/*.hmm ::: prodigal_contigs/*.fna

4.2 Filter SCG hits e-value < 10-25 from output

mkdir SCG_contigs_hits
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_contigs/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o SCG_contigs_hits/{1/.}-{2/.}.hits' ::: single_copy_genes/*.hmm ::: prodigal_contigs/*.fna
cat SCG_contigs_hits/* > SCG_contigs.hits

4.3 Run hmmsearch with gene of interest

mkdir hmmsearch_alkB_contigs
parallel -j 2 --eta 'hmmsearch --cpu 12 --tblout hmmsearch_alkB_contigs/{1/.}-{2/.}.out {1} {2}' ::: alkB_nt.hmm ::: prodigal_contigs/*.fna

4.4 Filter gene of interest hits e-value < 10-25 from output

mkdir alkB_hits_contigs
parallel -j 12 --eta 'python filter_hits.py -hf hmmsearch_alkB_contigs/{1/.}-{2/.}.out -g {1/.} -sn {2/.} -o alkB_hits_contigs/{1/.}-{2/.}.hits -f --in_fasta {2} --out_fasta alkB_hits_contigs/{1/.}-{2/.}.fasta' ::: alkB_nt.hmm ::: prodigal_contigs/*.fna

Combine counts:

cat alkB_hits_contigs/*hits > alkB_hits_contigs.hits

Combine sequences:

cat alkB_hits_contigs/*fasta > alkB_hits_contigs.fasta

4.5 Align hit sequences against reference sequences

Had some issues with clustal omega so using hmmalign here:

hmmalign --trim --dna --mapali alkB_nt.afa --informat FASTA -o alkB_hits_contigs_hmmalign.sto alkB_nt.hmm alkB_hits_contigs.fasta

Convert hmmalign out to fasta:

esl-reformat -o alkB_hits_contigs_hmmalign.afa afa alkB_hits_contigs_hmmalign.sto

4.6 Discard hit sequences that don't align well with reference

python check_alignment.py --ref_msa alkB_nt.afa --in_msa alkB_hits_contigs_hmmalign.afa --in_fasta alkB_hits_contigs.fasta --out_ref_msa alkB_ref_checked_contigs.afa --out_msa alkB_hits_checked_contigs.afa

4.7 Insert hit sequences into reference tree with EPA-NG

mkdir alkB_contigs_epa_out
epa-ng --tree alkB.nwk \
        --ref-msa alkB_ref_checked_contigs.afa \
        --query alkB_hits_checked_contigs.afa \
        -T 24 \
        -m raxml_evaluate/alkB_model.info.raxml.bestModel \
        -w alkB_contigs_epa_out \
        --filter-acc-lwr 0.99 \
        --filter-max 100

Convert resulting files from jplace into newick using gappa:

gappa examine graft --jplace-path alkB_contigs_epa_out/epa_result.jplace \
                    --fully-resolve \
                    --out-dir alkB_contigs_epa_out/

4.8 Get distances to the nearest reference tips in tree using castor

Should already have the alkB_tips.csv file from above, so not re-running this step.

#python get_known_tips.py -i alkB_ref_checked_contigs.afa -o alkB_tips.csv
Rscript distance_to_reference.R alkB_contigs_epa_out/epa_result.newick alkB_tips.csv alkB_known_tip_distance_contigs.tsv

5. Normalise hits to gene of interest by SCG hits

Clone this wiki locally