Discrepancy between MMseqs search and MMseqs taxonomy #859

Open
pbelmann opened this issue Jul 1, 2024 · 0 comments

Hi all,

If I run the following mmseqs taxonomy command, my protein sequence is assigned to Clostridium AM magnum.

mmseqs taxonomy queryDB /vol/scratch/databases/mmseqs2/gtdb/out/gtdb_database test.faa.gz tmp  --lca-ranks superkingdom,phylum,class,order,family,genus,species,subspecies -c 0.8 --max-seqs 300 --max-accept 50 --cov-mode 0 -e 0.001 --e-profile 0.01  -s 6 --threads 28  --blacklist ""

mmseqs  createtsv queryDB 14_First_11_21_S2_binned.gtdb.taxresults.database  test.taxonomy.tsv --threads 28

test.taxonomy.tsv output:

14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326     30313   species Clostridium AM magnum   Bacteria;Bacillota A;Clostridia;Clostridiales;Clostridiaceae;Clostridium AM;Clostridium AM magnum;uc_Clostridium AM magnum

If I repeat the same with mmseqs search, I get the following hits:

mmseqs search queryDB /vol/scratch/databases/mmseqs2/gtdb/out/gtdb_database test.faa.gz tmp  --max-seqs 300 --max-accept 50 --cov-mode 0 -e 0.001 --e-profile 0.01  -s 6 --threads 28
mmseqs convertalis queryDB /vol/scratch/databases/mmseqs2/gtdb/out/gtdb_database test.faa.gz mmseqs.out.tsv

mmseqs.out.tsv output:

query                                                target                  pident  alnlen  mismatch  gapopen  qstart  qend  tstart  tend  evalue     bits  qlen  tlen  qcov   tcov
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_JH601103.1_106       88.200  163     19        0        1       163   1       163   1.803E-90  301   164   164   0.994  0.994
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_FMWM01000002.1_685   49.700  154     76        0        7       159   5       158   6.046E-36  144   164   170   0.933  0.906
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_FLQT01000003.1_544   49.200  154     77        0        7       159   5       158   2.131E-35  143   164   167   0.933  0.922
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_LT707417.1_163       49.100  154     77        0        7       159   5       158   2.921E-35  142   164   167   0.933  0.922
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  JAEXBU010000001.1_1     49.000  154     78        0        7       159   5       158   4.002E-35  142   164   167   0.933  0.922
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_OPYI01000008.1_105   48.900  154     78        0        7       159   5       158   5.484E-35  141   164   167   0.933  0.922
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  NZ_CAIJCS010000014.1_2  48.600  155     79        0        5       158   6       160   7.514E-35  141   164   171   0.939  0.906
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  CABJAE010000022.1_14    48.700  154     78        0        7       159   5       158   1.030E-34  141   164   167   0.933  0.922
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326  CYUK01000003.1_1117     48.100  156     80        0        5       159   3       158   1.411E-34  140   164   174   0.945  0.897
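To make the comparison reproducible, here is a minimal Python sketch that picks the best hit per query from a table like the one above. It assumes whitespace-separated fields in the column order shown; the helper names `parse_hits` and `best_hit_per_query` are hypothetical and not part of MMseqs2. The sample holds the first two rows of the table.

```python
from typing import Dict, List

# Column order as shown in the mmseqs.out.tsv header above.
COLUMNS = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits",
           "qlen", "tlen", "qcov", "tcov"]


def parse_hits(text: str) -> List[Dict[str, object]]:
    """Parse whitespace-separated alignment rows, skipping header or malformed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        try:
            row["bits"] = float(fields[COLUMNS.index("bits")])
        except ValueError:
            continue  # header line: the bits column is not numeric
        rows.append(row)
    return rows


def best_hit_per_query(rows: List[Dict[str, object]]) -> Dict[str, Dict[str, object]]:
    """Keep the row with the highest bit score for each query."""
    best: Dict[str, Dict[str, object]] = {}
    for row in rows:
        query = str(row["query"])
        if query not in best or row["bits"] > best[query]["bits"]:
            best[query] = row
    return best


sample = """\
query target pident alnlen mismatch gapopen qstart qend tstart tend evalue bits qlen tlen qcov tcov
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326 NZ_JH601103.1_106 88.200 163 19 0 1 163 1 163 1.803E-90 301 164 164 0.994 0.994
14_First_11_21_S2_bin.16_6644_ef2cc1_CJIKHDKB_00326 NZ_FMWM01000002.1_685 49.700 154 76 0 7 159 5 158 6.046E-36 144 164 170 0.933 0.906
"""

hits = parse_hits(sample)
top = best_hit_per_query(hits)
for query, row in top.items():
    print(query, row["target"], row["bits"])
```

The target of the best row can then be looked up in the database's taxonomy mapping and compared with the taxon reported in test.taxonomy.tsv.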

The mmseqs search output shows that the top hit (NZ_JH601103.1) has the best query coverage, target coverage, and E-value.
The top hit belongs to the organism Dolosigranulum pigrum (https://www.ncbi.nlm.nih.gov/nuccore/NZ_JH601103), which has the following taxonomy:

d__Bacteria; p__Bacillota; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae; g__Dolosigranulum; s__Dolosigranulum pigrum
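To quantify how far apart the two assignments are, the following Python sketch compares the lineage from test.taxonomy.tsv with the lineage of the top hit and returns the ranks they share. It assumes that rank prefixes such as `d__` can simply be stripped and that names are compared verbatim; `normalise` and `shared_prefix` are hypothetical helper names. It only measures the disagreement and does not explain it.

```python
import re
from typing import List


def normalise(lineage: str) -> List[str]:
    """Split a semicolon-separated lineage and strip rank prefixes like 'd__'."""
    names = []
    for part in lineage.split(";"):
        name = re.sub(r"^[a-z]__", "", part.strip())
        if name:
            names.append(name)
    return names


def shared_prefix(lineage_a: str, lineage_b: str) -> List[str]:
    """Return the leading taxon names that both lineages have in common."""
    shared = []
    for a, b in zip(normalise(lineage_a), normalise(lineage_b)):
        if a != b:
            break
        shared.append(a)
    return shared


# Lineage reported by mmseqs taxonomy (from test.taxonomy.tsv).
taxonomy_lineage = ("Bacteria;Bacillota A;Clostridia;Clostridiales;Clostridiaceae;"
                    "Clostridium AM;Clostridium AM magnum;uc_Clostridium AM magnum")
# Lineage of the mmseqs search top hit.
top_hit_lineage = ("d__Bacteria; p__Bacillota; c__Bacilli; o__Lactobacillales; "
                   "f__Carnobacteriaceae; g__Dolosigranulum; s__Dolosigranulum pigrum")

print(shared_prefix(taxonomy_lineage, top_hit_lineage))  # ['Bacteria']
```

With the two lineages from this report, the assignments agree only at the domain level and already differ at the phylum.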

How can this discrepancy arise between the taxonomy of the mmseqs search top hit and the taxonomy reported by mmseqs taxonomy?

Steps to Reproduce (for bugs)

The GTDB database I used can be downloaded here

The test sequence can be found here

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 0b27c9d
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake: 3.16.3
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 236 GB RAM
  • Operating system and version: Ubuntu 20.04.6