This repo contains scripts and steps for evaluating Manticore Search (MS) on example datasets for Information Retrieval (IR).
We evaluate how MS compares with Elasticsearch (ES), and how both perform at retrieval using BM25.
We try to mimic the ES settings for BM25 search as described here.
The evaluation compares various IR benchmarking metrics implemented in BEIR, a Python package for benchmarking models/algorithms for IR tasks.
Look at all the results here.
Look at all the updated results here. Thanks to the Manticore team for addressing the concerns we raised!
We evaluate on the datasets below.
- TREC-COVID https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
- NF-CORPUS https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
Run the commands below from the directory where you cloned the repo (the downloads are zip archives, so we extract them with `unzip`):

```bash
cd data
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
unzip nfcorpus.zip
unzip trec-covid.zip
cd ..
```
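Each extracted `corpus.jsonl` file contains one JSON document per line. If you want to sanity-check the downloads, a minimal Python sketch (stdlib only) that inspects the first record:

```python
import json

# Peek at the first document of a BEIR corpus file to see its fields.
with open("data/nfcorpus/corpus.jsonl") as f:
    first_doc = json.loads(next(f))

print(sorted(first_doc.keys()))       # field names of a corpus record
print(first_doc.get("title", "")[:80])  # start of the title field
```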
NOTES:

Meta information about the datasets:

- Each dataset has two fields whose contents need to be indexed: a `title` and a `txt` field.
- The trec-covid dataset has `171332` documents, while nfcorpus has `3633` documents.
- For IR evaluation, each dataset has a fixed set of queries and corresponding relevant documents.
- The trec-covid dataset has `50` queries, while nfcorpus has `323` queries (see the verification sketch below).
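The document and query counts above can be verified with BEIR's data loader once the dependencies below are installed; a minimal sketch, assuming the datasets were unpacked under `data/` as described:

```python
from beir.datasets.data_loader import GenericDataLoader

# Load the test split: corpus is a dict of documents, queries a dict
# of query strings, qrels the relevance judgments.
corpus, queries, qrels = GenericDataLoader(data_folder="data/nfcorpus").load(split="test")

print(len(corpus))   # expected: 3633 documents
print(len(queries))  # expected: 323 test queries
```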
- Create and activate a conda env:

```bash
conda create --name ir-bm25-benchmark python=3.10
conda activate ir-bm25-benchmark
```
- Install dependencies:

```bash
pip install -r requirements.txt
pip install --no-deps -r requirements_no_deps.txt
```

The updated results require a `manticoresearch-python` client version that was not available via pip as of this writing. To install the latest version, use:

```bash
pip install git+https://github.com/manticoresoftware/manticoresearch-python.git@master
```

Confirm with `pip list` that the installed version is `>=2.0.0`:

```bash
pip list | grep manticoresearch
manticoresearch 2.0.0
```
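Equivalently, you can check the installed client version from Python:

```python
from importlib.metadata import version

# Should print >= 2.0.0 if the git install above succeeded.
print(version("manticoresearch"))
```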
- Pull the docker image and start a container.

For the Manticore release version:

```bash
docker pull manticoresearch/manticore
docker run -p 9306:9306 -p 9308:9308 manticoresearch/manticore
```

For the Manticore dev version (MS dev):

```bash
docker pull manticoresearch/manticore:dev
docker run -p 9306:9306 -p 9308:9308 manticoresearch/manticore:dev
```
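To confirm the container is reachable from the Python client, a minimal sketch assuming the default HTTP port mapping above (the exact response shape may vary with the client version):

```python
import manticoresearch

# Connect over the HTTP API exposed on port 9308.
config = manticoresearch.Configuration(host="http://127.0.0.1:9308")
with manticoresearch.ApiClient(config) as api_client:
    utils_api = manticoresearch.UtilsApi(api_client)
    # Run a raw SQL statement; an empty table list is fine on a fresh container.
    print(utils_api.sql("SHOW TABLES"))
```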
- Create and populate indices:

a. Create indices with default settings:

```bash
python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus
```

b. Create indices with settings to mimic ES-like BM25 behavior for search:

```bash
python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid_es_like --index-es-like
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus_es_like --index-es-like
```
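The prepare module handles all of this for you; purely to illustrate what "populate" means here, a rough sketch that inserts each corpus document over SQL (the field names, missing batching, and naive escaping are simplifying assumptions; the repo's actual code is the authoritative version):

```python
import json
import manticoresearch

config = manticoresearch.Configuration(host="http://127.0.0.1:9308")
with manticoresearch.ApiClient(config) as api_client:
    utils_api = manticoresearch.UtilsApi(api_client)
    with open("data/nfcorpus/corpus.jsonl") as f:
        for line in f:
            doc = json.loads(line)
            # Map the BEIR fields onto the two indexed text fields.
            title = doc.get("title", "").replace("'", "\\'")
            content = doc.get("text", "").replace("'", "\\'")
            utils_api.sql(
                f"INSERT INTO nfcorpus (title, content) VALUES ('{title}', '{content}')"
            )
```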
NOTE 1: The following options are set on the indices for the ES-like BM25 behaviour:

```
stopwords='en'
stopwords_unstemmed='1'
morphology='stem_en'
html_strip='1'
index_exact_words='1'
index_field_lengths='1'
```

These options apply to the two text fields of the document collections. More details about indexing can be found in this function.
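For reference, a hand-rolled equivalent of the ES-like index creation over SQL might look like the sketch below; the field names `title` and `content` follow the ranker expression in NOTE 2, and the repo's prepare function remains the authoritative version:

```python
import manticoresearch

config = manticoresearch.Configuration(host="http://127.0.0.1:9308")
with manticoresearch.ApiClient(config) as api_client:
    utils_api = manticoresearch.UtilsApi(api_client)
    # Real-time table with the ES-like indexing options from NOTE 1.
    utils_api.sql(
        "CREATE TABLE nfcorpus_es_like (title text, content text) "
        "stopwords='en' stopwords_unstemmed='1' morphology='stem_en' "
        "html_strip='1' index_exact_words='1' index_field_lengths='1'"
    )
```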
NOTE 2: The following MS ranking options are set for the evaluation of the ES-like BM25 behaviour:

```
ranker=expr('sum(10000 * bm25f(1.2,0.75,{{title=1,content=1}}))'), idf='plain,tfidf_unnormalized'
```
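At query time these options translate into an `OPTION` clause on the `SELECT`. A minimal sketch (the doubled braces above are Python format-string escaping from the repo's code; in raw SQL the braces are single, and a real query string would need proper escaping):

```python
import manticoresearch

config = manticoresearch.Configuration(host="http://127.0.0.1:9308")
with manticoresearch.ApiClient(config) as api_client:
    utils_api = manticoresearch.UtilsApi(api_client)
    query = "coronavirus origin"  # hypothetical example query
    result = utils_api.sql(
        f"SELECT id, weight() FROM nfcorpus_es_like WHERE MATCH('{query}') "
        "OPTION ranker=expr('sum(10000 * bm25f(1.2,0.75,{title=1,content=1}))'), "
        "idf='plain,tfidf_unnormalized' LIMIT 10"
    )
    print(result)
```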
NOTE 3: Manticore's default English stop-word list is much longer than Elasticsearch's. For the `*_es_like` indices you can use the same stop words as Elasticsearch, but we've noticed that our evaluation performance is poor when we limit indexing to the ES stop words only. To try it, copy the file `data/elasticsearch_en_stop_words` to your Manticore docker container, say at location `/var/lib/manticore/data/` (e.g. with `docker cp data/elasticsearch_en_stop_words <container-id>:/var/lib/manticore/data/`). You can then change your index preparation commands to:

```bash
python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid_es_like --index-es-like --stop-words /var/lib/manticore/data/elasticsearch_en_stop_words
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus_es_like --index-es-like --stop-words /var/lib/manticore/data/elasticsearch_en_stop_words
```
NOTE 4: Elasticsearch indices are built according to this function in BEIR, where the two text fields are indexed with the ES English analyzer. The resulting indices are then queried with a multi-match query over these two fields, as detailed in this function.
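For the curious, the BEIR side of that pipeline looks roughly like the sketch below; `BM25Search` is BEIR's Elasticsearch-backed lexical retriever, and the hostname and index name here are assumptions:

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search
from beir.retrieval.evaluation import EvaluateRetrieval

corpus, queries, qrels = GenericDataLoader(data_folder="data/nfcorpus").load(split="test")

# initialize=True (re-)creates the index; title and body are indexed
# with the ES English analyzer, then multi-match queried over both fields.
model = BM25Search(index_name="nfcorpus", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)
```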
Evaluate:

a. Evaluate retrieval for MS default settings:

```bash
python -m benchmark.manticore.evaluate data/nfcorpus test nfcorpus
python -m benchmark.manticore.evaluate data/trec-covid test trec_covid
```

b. Evaluate retrieval for MS with ES-like settings:

```bash
python -m benchmark.manticore.evaluate data/nfcorpus test nfcorpus_es_like
python -m benchmark.manticore.evaluate data/trec-covid test trec_covid_es_like
```
- Run Elasticsearch in a docker container:

```bash
docker pull elasticsearch:7.17.0
docker-compose up
```

Wait a couple of minutes for the docker container to be ready (or poll it, as in the sketch below).
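Instead of waiting a fixed amount of time, you can poll the cluster health endpoint until ES reports at least yellow status; a minimal sketch assuming ES listens on the default port 9200:

```python
import time
import requests

# Poll Elasticsearch until the cluster is at least 'yellow'.
for _ in range(60):
    try:
        status = requests.get("http://localhost:9200/_cluster/health").json()["status"]
        if status in ("yellow", "green"):
            print("Elasticsearch is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # container still starting up
    time.sleep(5)
```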
- Evaluate (this re-creates the index each time you evaluate):

```bash
python -m benchmark.es.evaluate_bm25 data/trec-covid test trec_covid
python -m benchmark.es.evaluate_bm25 data/nfcorpus test nfcorpus
```

Note: There is a 10-second sleep between index creation and evaluation in the above script. This allows ES to finish indexing before we run the evaluations.
We compare all the different indexing and search strategies using the `NDCG@10` metric. This is the metric reported in the BEIR paper and can be accessed here for these two datasets and others. The other metrics printed below are simply sanity checks.
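Given qrels and retrieval results, BEIR computes all of these metrics in one call; a minimal sketch that picks up `qrels` and `results` from the retrieval sketch in NOTE 4 above:

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# `qrels` and `results` come from the loading/retrieval steps above.
# k_values controls which cutoffs are reported; we care about NDCG@10.
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values=[10, 100])
print(ndcg["NDCG@10"])
```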
Comments (in the context of Manticore 4.2.0; the concerns raised were fixed in 4.2.1, see the updated results here):

- Comparing to the results for `NDCG@10` achieved with MS using ES-like settings:
  - For the trec-covid dataset: `NDCG@10` jumps to `0.59764`, but we still fall short of the best of `0.68803` reported with ES.
  - For the nfcorpus dataset: `NDCG@10` jumps to `0.31715`, but we still fall short of the best of `0.34281` reported with ES.
- Comparing to the results for `NDCG@10` achieved with ES:
  - MS performs very poorly on the trec-covid dataset: `0.29494` compared to `0.68803` for ES.
  - MS performs slightly worse on the nfcorpus dataset: `0.28791` compared to `0.34281` for ES.
- Comparing to the results for `NDCG@10` reported by BEIR against ES:
  - These numbers should match exactly, but ours are actually better in reality.
  - The reported benchmark had a bug concerning reproducibility. More details here.
Results for trec-covid:

| dataset | settings | NDCG@10 |
|---|---|---|
| trec-covid | MS (default) | 0.29494 |
| trec-covid | MS (es-like) | 0.59764 |
| trec-covid | MS dev (es-like) | 0.71211 |
| trec-covid | ES | 0.68803 |
| trec-covid | ES (reported in BEIR) | 0.616 |
Results for nfcorpus:

| dataset | settings | NDCG@10 |
|---|---|---|
| nfcorpus | MS (default) | 0.28791 |
| nfcorpus | MS (es-like) | 0.31715 |
| nfcorpus | MS dev (es-like) | 0.34537 |
| nfcorpus | ES | 0.34281 |
| nfcorpus | ES (reported in BEIR) | 0.297 |
Elasticsearch version: 7.17.0
Run using this docker image.
Manticore Search version: Manticore 4.2.0 15e927b28@211223 release
Run using this docker image.
Manticore Search version (with fixes): Manticore 4.2.1 d039fba84@220407 release
Run using this docker image.