LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences (>150 bp) against up to millions of prokaryotic genomes.
Documents: https://bioinf.shenwei.me/LexicMap
For the latest features and improvements, please download the pre-release binaries.
Please cite:
Wei Shen, John A. Lees, Zamin Iqbal. (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02812-8
- Features
- Introduction
- Quick start
- Performance
- Installation
- Algorithm overview
- Citation
- Limitations
- Terminology differences
- Support
- License
- Related projects
- The accuracy of LexicMap is comparable with Blastn, MMseqs2, and Minimap2. It
- performs base-level alignment, with
qcovGnm
,qcovHSP
,pident
,evalue
andbitscore
returned, both in TSV and pairwise alignment format (output format).- provides a genome-wide query coverage metric (
qcovGnm
), which enables accurate interpretation of search results - particularly for circular queries (such as plasmid, virus, and mtDNA) against both complete and fragmented assemblies.
- provides a genome-wide query coverage metric (
- returns all possible matches, including multiple copies of a gene in a genome.
- performs base-level alignment, with
- The alignment is fast and memory-efficient, scalable to up to millions of prokaryotic genomes.
- LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs).
- LexicMap is easy to use (see tutorials, usages, and FAQs).
- Database building requires only a simple command, accepting input from files, a file list, or even a directory.
- Sequence searching supports limiting search by TaxId(s), provides a progress bar.
- Several utility commands are available to resume unfinished indexing, explore the index data, merge search results, extract matched subsequences and more.
Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.
- Existing full alignment tools face challenges of high memory consumption and slow speeds.
- Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis.
- Mapping tools, or those utilizing compressed full-text indexes, return only the most similar matches.
- Prefilter+Align strategies have the sensitivity issue in the prefiltering step.
Methods: (algorithm overview)
- A rewritten and improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently.
- We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee.
- We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched.
- A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching).
- A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment.
- A reimplemented Wavefront alignment algorithm is used for base-level alignment.
Results:
-
LexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies respectively).
-
When searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Blastn is unable to run with this dataset on common servers as it requires >2000 GB RAM. (see performance).
With LexicMap v0.7.0 (48 CPUs, indexes and queries queries in HDDs),
Query Genome hits Genome hits
(high-similarity)Genome hits
(medium-similarity)Genome hits
(low-similarity)Time RAM A 1.3-kb marker gene 41,718 11,746 115 29,857 3m:06s 3.97 GB A 1.5-kb 16S rRNA 1,955,167 245,884 501,691 1,207,592 32m:59s 11.09 GB A 52.8-kb plasmid 560,330 96 15,370 544,864 52m:22s 14.48 GB 1003 AMR genes 30,967,882 7,636,386 4,858,063 18,473,433 15h:52m:08s 24.86 GB Notes:
- Default paramters are used, for returning all possible matches.
- Only the best alignment of a genome is used to evaluate alignment similarity:
- high-similarity: (a) qcov >= 90% (genes) or 70% (plasmids), (b) pident>=90%.
- medium-similarity: (a) not belong to high-similarity, (b) qcov >= 50% (genes) or 30% (plasmids), (c) pident>=80%.
- low-similarity: the remaining.
- The search time varies in different computing environments and mainly depends on the I/O speed and the number of threads.
More documents: https://bioinf.shenwei.me/LexicMap.
Building an index (see the tutorial of building an index).
# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi
# From a file list with one file per line
lexicmap index -S -X files.txt -O db.lmi
Querying (see the tutorial of searching).
# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--align-min-match-pident 80 --min-qcov-per-hsp 70 --min-qcov-per-genome 70 \
--top-n-genomes 10000
# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--align-min-match-pident 70 --min-qcov-per-hsp 0 --min-qcov-per-genome 50 \
--align-min-match-len 1000 \
--top-n-genomes 0
Sample output (queries are a few Nanopore Q20 reads). See output format details.
query qlen hits sgenome sseqid qcovGnm cls hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen evalue bitscore
------------------ ---- ---- --------------- ----------------- ------- --- --- ------- ------- ------ ---- ------ ---- ------- ------- ---- ------- --------- --------
ERR5396170.1000004 190 1 GCF_000227465.1 NC_016047.1 84.211 1 1 84.211 165 89.091 5 14 173 4189372 4189536 - 4207222 1.93e-63 253
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 1 1 99.623 801 97.628 9 4 796 1138907 1139706 + 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 2 2 99.623 801 97.628 9 4 796 32607 33406 + 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 3 3 99.623 801 97.628 9 4 796 134468 135267 - 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 4 4 99.623 801 97.503 9 4 796 1768896 1769695 + 1887974 0.00e+00 1427
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 5 5 99.623 801 97.378 9 4 796 242012 242811 - 1887974 0.00e+00 1422
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 6 6 99.623 801 96.879 12 4 796 154380 155176 - 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 7 7 57.915 469 95.736 9 4 464 1280313 1280780 + 1887974 3.71e-236 829
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 8 8 42.839 341 99.120 0 456 796 1282477 1282817 + 1887974 6.91e-168 601
ERR5396170.1000006 796 3 GCF_009663775.1 NZ_RDBR01000008.1 99.623 1 1 99.623 801 93.383 9 4 796 21391 22190 - 52610 0.00e+00 1278
ERR5396170.1000006 796 3 GCF_003344625.1 NZ_QPKJ02000188.1 97.362 1 1 87.437 700 98.143 5 22 717 1 699 - 826 0.00e+00 1249
ERR5396170.1000006 796 3 GCF_003344625.1 NZ_QPKJ02000423.1 97.362 2 2 27.889 222 99.550 0 575 796 1 222 + 510 3.47e-106 396
ERR5396170.1000000 698 2 GCF_001457615.1 NZ_LN831024.1 92.264 1 1 92.264 656 96.341 13 53 696 4452083 4452737 + 6316979 0.00e+00 1169
ERR5396170.1000000 698 2 GCF_000949385.2 NZ_JYKO02000001.1 91.977 1 1 91.977 654 78.135 13 55 696 5638788 5639440 - 5912440 2.68e-176 630
ERR5396170.1000001 2505 3 GCF_000307025.1 NC_018584.1 67.066 1 1 67.066 1690 97.633 16 47 1726 1905511 1907194 - 2951805 0.00e+00 2985
ERR5396170.1000001 2505 3 GCF_900187225.1 NZ_LT906436.1 65.070 1 1 65.070 1641 93.723 20 95 1724 1869503 1871134 - 2864663 0.00e+00 2626
ERR5396170.1000001 2505 3 GCF_013394085.1 NZ_CP040910.1 30.858 1 1 30.858 780 97.692 9 1726 2498 183873 184650 + 1887974 0.00e+00 1384
ERR5396170.1000001 2505 3 GCF_013394085.1 NZ_CP040910.1 30.858 2 2 5.030 127 87.402 1 2233 2358 1236170 1236296 + 1887974 1.73e-37 167
ERR5396170.1000001 2505 3 GCF_013394085.1 NZ_CP040910.1 30.858 3 3 5.150 130 80.769 12 2233 2361 930381 930499 - 1887974 6.61e-43 185
ERR5396170.1000001 2505 3 GCF_013394085.1 NZ_CP040910.1 30.858 4 4 3.713 93 93.548 0 2257 2349 1104581 1104673 - 1887974 5.09e-30 141
CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag -a/--all
.
Extracting matched sequences:
# Extracting similar sequences for a query gene.
# search matches with query coverage >= 90%
lexicmap search -d demo.lmi/ bench/b.gene_E_faecalis_SecY.fasta -o results.tsv \
--min-qcov-per-hsp 90
# extract matched sequences as FASTA format
lexicmap utils subseq -d demo.lmi -f results.tsv -o results.tsv.aligned.fasta
seqkit head -n 1 results.tsv.aligned.fasta | head -n 3
>NZ_KB944588.1:228637-229935:+ query=lcl|NZ_CP064374.1_cds_WP_002359350.1_906 sgenome=GCF_000392875.1 sseqid=NZ_KB944588.1 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1299 pident=100.000 gaps=0 qstart=1 qend=1299 sstart=228637 send=229935 sstr=+ slen=274762 evalue=0.00e+00 bitscore=2343
TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT
ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
Export Blast-style pairwise alignment format:
# here, we only align <=200 bp queries and show one medium-similarity result.
$ seqkit seq -g -M 200 q.long-reads.fasta.gz \
| lexicmap search -d demo.lmi/ -a \
| csvtk filter2 -t -f '$pident >80 && $pident < 90' \
| csvtk head -t -n 1 \
| lexicmap utils 2blast --kv-file-genome ass2species.map
Query = GCF_003697165.2_r40
Length = 186
[Subject genome #1/2] = GCF_002950215.1 Shigella flexneri
Query coverage per genome = 93.548%
>NZ_CP026788.1
Length = 4659463
HSP cluster #1, HSP #1
Score = 279 bits, Expect = 9.66e-75
Query coverage per seq = 93.548%, Aligned length = 177, Identities = 88.701%, Gaps = 6
Query range = 13-186, Subject range = 1124816-1124989, Strand = Plus/Plus
Query 13 CGGAAACTGAAACA-CCAGATTCTACGATGATTATGATGATTTA-TGCTTTCTTTACTAA 70
|||||||||||||| |||||||||| | |||||||||||||||| |||||||||| ||||
Sbjct 1124816 CGGAAACTGAAACAACCAGATTCTATGTTGATTATGATGATTTAATGCTTTCTTTGCTAA 1124875
Query 71 AAAGTAAGCGGCCAAAAAAATGAT-AACACCTGTAATGAGTATCAGAAAAGACACGGTAA 129
|| |||||||||||||||||| |||||||||||||||||||||||||||||||||||
Sbjct 1124876 AA--GCAGCGGCCAAAAAAATGATTAACACCTGTAATGAGTATCAGAAAAGACACGGTAA 1124933
Query 130 GAAAACACTCTTTTGGATACCTAGAGTCTGATAAGCGATTATTCTCTCTATGTTACT 186
|| ||||||||| ||||| |||||||||||||||||||||||| |||| |||
Sbjct 1124934 AAAGACACTCTTTGAAGTACCTGAAGTCTGATAAGCGATTATTCTCTCCATGT-ACT 1124989
Learn more: demo, tutorials, or usages.
See the paper.
LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.
Or install with conda or pixi:
conda install -c bioconda lexicmap
We also provide pre-release binaries, with new features and improvements.
See the paper for details.
Wei Shen, John A. Lees, Zamin Iqbal. (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02812-8
- The queries need to be longer than 100 bp, though some shorter ones can also be aligned.
- LexicMap is slow for ultra-long (>1Mb) queries, and the alignment might be fragmented.
- LexicMap is slow for batch searching with more than hundreds of queries. However, there are some ways to improve the search speed of lexicmap search, such as keeping the top N genome matches via
-n/--top-n-genomes
or storing the index on solid state drives (SSDs).
- In the LexicMap source code and command line options, the term "mask" is used, following the terminology in the LexicHash paper.
- In the LexicMap manuscript, however, we use "probe" as it is easier to understand. Because these masks, which consist of thousands of designed k-mers and they capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology.
Please open an issue to report bugs, propose new functions or ask for help.
- High-performance LexicHash computation in Go.
- Wavefront alignment algorithm (WFA) in Golang.