[FEATURE] Add z-score for the normalization processor #376 #470

samuel-oci · 2023-10-21T18:44:24Z

Description

This change implements #376

Add z-score for hybrid query normalization processor
Add IT that test normalization end to end

Issues Resolved

Resolving #376

Check List

[x ] New functionality includes testing.
- [ x] All tests pass
[x ] New functionality has been documented.
- [ x] New functionality has javadoc added
[ x] Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Samuel Herman <[email protected]>

samuel-oci · 2023-10-21T18:45:11Z

re opening the PR previously at #468, but this time against the feature branch instead of main.

samuel-oci · 2023-10-21T18:46:47Z

Hi @navneet1v @martin-gaievski @heemin32 this is the new PR that opened this time against the feature branch, feel free to continue providing your feedback here as I closed the original PR.
This PR should hopefully address all your comments from the previous PR.

martin-gaievski

Couple of generic things:

did you test accuracy and performance of your solution? For our implementation we used beir challenge framework with some custom scripts, ideally results should look something like in a blog post where the feature has been announced.
please fix all CI checks, you can simulate them by running gradle check locally.

martin-gaievski · 2023-10-24T21:31:41Z

...a/org/opensearch/neuralsearch/processor/normalization/ZScoreNormalizationTechniqueTests.java

+                new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new ScoreDoc[0]),
+                new TopDocs(
+                    new TotalHits(3, TotalHits.Relation.EQUAL_TO),
+                    new ScoreDoc[] { new ScoreDoc(3, 0.98058068f), new ScoreDoc(4, 0.39223227f), new ScoreDoc(2, -1.37281295f) }


Can you please add a simple formula or a method as part of the code comments, so we can understand how that score calculated out of provided individual scores. Having a reference to a method description is good, but not the same. Something like you added for integ test assertions will be good.

samuel-oci · 2023-11-15T00:24:59Z

Couple of generic things:

did you test accuracy and performance of your solution? For our implementation we used beir challenge framework with some custom scripts, ideally results should look something like in a blog post where the feature has been announced.

please fix all CI checks, you can simulate them by running gradle check locally.

@martin-gaievski my results look as follows, overall z-score is showing best even among hybrid query results.
It seems to support the results from the blog https://towardsdatascience.com/hybrid-search-2-0-the-pursuit-of-better-search-ce44d6f20c08
I can do more experiments, but do you think that would be sufficient for now for us to include it?

For BM25 as baseline

2023-11-14 15:52:31 - NDCG@5: 0.6347
2023-11-14 15:52:31 - NDCG@10: 0.6563
2023-11-14 15:52:31 - NDCG@100: 0.6810
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - MAP@5: 0.6018
2023-11-14 15:52:31 - MAP@10: 0.6119
2023-11-14 15:52:31 - MAP@100: 0.6179
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - Recall@5: 0.7154
2023-11-14 15:52:31 - Recall@10: 0.7790
2023-11-14 15:52:31 - Recall@100: 0.8842
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - P@5: 0.1560
2023-11-14 15:52:31 - P@10: 0.0853
2023-11-14 15:52:31 - P@100: 0.0100

For Neural search

2023-11-14 15:54:19 - NDCG@5: 0.5747
2023-11-14 15:54:19 - NDCG@10: 0.6073
2023-11-14 15:54:19 - NDCG@100: 0.6381
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - MAP@5: 0.5368
2023-11-14 15:54:19 - MAP@10: 0.5512
2023-11-14 15:54:19 - MAP@100: 0.5585
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - Recall@5: 0.6711
2023-11-14 15:54:19 - Recall@10: 0.7693
2023-11-14 15:54:19 - Recall@100: 0.9067
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - P@5: 0.1480
2023-11-14 15:54:19 - P@10: 0.0867
2023-11-14 15:54:19 - P@100: 0.0103
p50: 33.5
p90: 43.0
p99: 49.52999999999997

For min-max hybrid (weights 0.4, 0.3, 0.3):

2023-11-14 15:57:04 - NDCG@5: 0.6449
2023-11-14 15:57:04 - NDCG@10: 0.6757
2023-11-14 15:57:04 - NDCG@100: 0.7042
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - MAP@5: 0.6113
2023-11-14 15:57:04 - MAP@10: 0.6249
2023-11-14 15:57:04 - MAP@100: 0.6311
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - Recall@5: 0.7257
2023-11-14 15:57:04 - Recall@10: 0.8194
2023-11-14 15:57:04 - Recall@100: 0.9477
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - P@5: 0.1600
2023-11-14 15:57:04 - P@10: 0.0917
2023-11-14 15:57:04 - P@100: 0.0107
p50: 51.0
p90: 69.10000000000002
p99: 101.1099999999999

For Zscore Hybrid (weights 0.4, 0.3, 0.3):

2023-11-14 15:59:06 - NDCG@5: 0.6518
2023-11-14 15:59:06 - NDCG@10: 0.6710
2023-11-14 15:59:06 - NDCG@100: 0.7052
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - MAP@5: 0.6105
2023-11-14 15:59:06 - MAP@10: 0.6204
2023-11-14 15:59:06 - MAP@100: 0.6291
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - Recall@5: 0.7561
2023-11-14 15:59:06 - Recall@10: 0.8100
2023-11-14 15:59:06 - Recall@100: 0.9543
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - P@5: 0.1653
2023-11-14 15:59:06 - P@10: 0.0903
2023-11-14 15:59:06 - P@100: 0.0108
p50: 47.5
p90: 60.60000000000002
p99: 69.00999999999999

Note: Edited to reformat the results into a table

Method	NDCG@5	NDCG@10	NDCG@100	MAP@5	MAP@10	MAP@100	Recall@5	Recall@10	Recall@100	P@5	P@10	P@100
BM25	0.6347	0.6563	0.7042	0.6018	0.6119	0.6179	0.7154	0.7790	0.8842	0.1560	0.0853	0.0100
Neural	0.5747	0.6073	0.6381	0.5368	0.5512	0.5585	0.6711	0.7693	0.9067	0.1480	0.0867	0.0103
Hybrid (min-max norm)	0.6449	0.6757	0.7042	0.6113	0.6249	0.6311	0.7257	0.8194	0.9477	0.1600	0.0917	0.0107
Hybrid (z-score norm)	0.6518	0.6710	0.7052	0.6105	0.6204	0.6291	0.7561	0.8100	0.9543	0.1653	0.0903	0.0108

martin-gaievski · 2023-11-16T01:34:40Z

Couple of generic things:

did you test accuracy and performance of your solution? For our implementation we used beir challenge framework with some custom scripts, ideally results should look something like in a blog post where the feature has been announced.

please fix all CI checks, you can simulate them by running gradle check locally.

@martin-gaievski my results look as follows, overall z-score is showing best even among hybrid query results. It seems to support the results from the blog https://towardsdatascience.com/hybrid-search-2-0-the-pursuit-of-better-search-ce44d6f20c08 I can do more experiments, but do you think that would be sufficient for now for us to include it?

For BM25 as baseline
2023-11-14 15:52:31 - NDCG@5: 0.6347
2023-11-14 15:52:31 - NDCG@10: 0.6563
2023-11-14 15:52:31 - NDCG@100: 0.6810
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - MAP@5: 0.6018
2023-11-14 15:52:31 - MAP@10: 0.6119
2023-11-14 15:52:31 - MAP@100: 0.6179
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - Recall@5: 0.7154
2023-11-14 15:52:31 - Recall@10: 0.7790
2023-11-14 15:52:31 - Recall@100: 0.8842
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - P@5: 0.1560
2023-11-14 15:52:31 - P@10: 0.0853
2023-11-14 15:52:31 - P@100: 0.0100
For Neural search
2023-11-14 15:54:19 - NDCG@5: 0.5747
2023-11-14 15:54:19 - NDCG@10: 0.6073
2023-11-14 15:54:19 - NDCG@100: 0.6381
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - MAP@5: 0.5368
2023-11-14 15:54:19 - MAP@10: 0.5512
2023-11-14 15:54:19 - MAP@100: 0.5585
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - Recall@5: 0.6711
2023-11-14 15:54:19 - Recall@10: 0.7693
2023-11-14 15:54:19 - Recall@100: 0.9067
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - P@5: 0.1480
2023-11-14 15:54:19 - P@10: 0.0867
2023-11-14 15:54:19 - P@100: 0.0103
p50: 33.5
p90: 43.0
p99: 49.52999999999997
For min-max hybrid (weights 0.4, 0.3, 0.3):
2023-11-14 15:57:04 - NDCG@5: 0.6449
2023-11-14 15:57:04 - NDCG@10: 0.6757
2023-11-14 15:57:04 - NDCG@100: 0.7042
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - MAP@5: 0.6113
2023-11-14 15:57:04 - MAP@10: 0.6249
2023-11-14 15:57:04 - MAP@100: 0.6311
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - Recall@5: 0.7257
2023-11-14 15:57:04 - Recall@10: 0.8194
2023-11-14 15:57:04 - Recall@100: 0.9477
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - P@5: 0.1600
2023-11-14 15:57:04 - P@10: 0.0917
2023-11-14 15:57:04 - P@100: 0.0107
p50: 51.0
p90: 69.10000000000002
p99: 101.1099999999999
For Zscore Hybrid (weights 0.4, 0.3, 0.3):
2023-11-14 15:59:06 - NDCG@5: 0.6518
2023-11-14 15:59:06 - NDCG@10: 0.6710
2023-11-14 15:59:06 - NDCG@100: 0.7052
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - MAP@5: 0.6105
2023-11-14 15:59:06 - MAP@10: 0.6204
2023-11-14 15:59:06 - MAP@100: 0.6291
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - Recall@5: 0.7561
2023-11-14 15:59:06 - Recall@10: 0.8100
2023-11-14 15:59:06 - Recall@100: 0.9543
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - P@5: 0.1653
2023-11-14 15:59:06 - P@10: 0.0903
2023-11-14 15:59:06 - P@100: 0.0108
p50: 47.5
p90: 60.60000000000002
p99: 69.00999999999999
Note: Edited to reformat the results into a table

Method NDCG@5 NDCG@10 NDCG@100 MAP@5 MAP@10 MAP@100 Recall@5 Recall@10 Recall@100 P@5 P@10 P@100
BM25 0.6347 0.6563 0.7042 0.6018 0.6119 0.6179 0.7154 0.7790 0.8842 0.1560 0.0853 0.0100
Neural 0.5747 0.6073 0.6381 0.5368 0.5512 0.5585 0.6711 0.7693 0.9067 0.1480 0.0867 0.0103
Hybrid (min-max norm) 0.6449 0.6757 0.7042 0.6113 0.6249 0.6311 0.7257 0.8194 0.9477 0.1600 0.0917 0.0107
Hybrid (z-score norm) 0.6518 0.6710 0.7052 0.6105 0.6204 0.6291 0.7561 0.8100 0.9543 0.1653 0.0903 0.0108

@samuel-oci That looks reasonable. Can you please add more info:

exact queries you've used for BM25 and hybrid
dataset(s)
model, was is generic or fine-tuned
maybe scripts that you've used to run the benchmark

samuel-oci · 2023-11-16T18:32:13Z

Note: Edited to reformat the results into a table
Method NDCG@5 NDCG@10 NDCG@100 MAP@5 MAP@10 MAP@100 Recall@5 Recall@10 Recall@100 P@5 P@10 P@100
BM25 0.6347 0.6563 0.7042 0.6018 0.6119 0.6179 0.7154 0.7790 0.8842 0.1560 0.0853 0.0100
Neural 0.5747 0.6073 0.6381 0.5368 0.5512 0.5585 0.6711 0.7693 0.9067 0.1480 0.0867 0.0103
Hybrid (min-max norm) 0.6449 0.6757 0.7042 0.6113 0.6249 0.6311 0.7257 0.8194 0.9477 0.1600 0.0917 0.0107
Hybrid (z-score norm) 0.6518 0.6710 0.7052 0.6105 0.6204 0.6291 0.7561 0.8100 0.9543 0.1653 0.0903 0.0108

@samuel-oci That looks reasonable. Can you please add more info:

exact queries you've used for BM25 and hybrid

dataset(s)

model, was is generic or fine-tuned

maybe scripts that you've used to run the benchmark

Sure @martin-gaievski
repasting results:

Method	NDCG@5	NDCG@10	NDCG@100	MAP@5	MAP@10	MAP@100	Recall@5	Recall@10	Recall@100	P@5	P@10	P@100
BM25	0.6347	0.6563	0.7042	0.6018	0.6119	0.6179	0.7154	0.7790	0.8842	0.1560	0.0853	0.0100
Neural	0.5747	0.6073	0.6381	0.5368	0.5512	0.5585	0.6711	0.7693	0.9067	0.1480	0.0867	0.0103
Hybrid (min-max norm)	0.6449	0.6757	0.7042	0.6113	0.6249	0.6311	0.7257	0.8194	0.9477	0.1600	0.0917	0.0107
Hybrid (z-score norm)	0.6518	0.6710	0.7052	0.6105	0.6204	0.6291	0.7561	0.8100	0.9543	0.1653	0.0903	0.0108

Dataset: Scifact
Queries: same as here I made minor modifications just to get the project to run properly but didn't change the queries (will share that as well after some cleanup)
Model: generic pre-trained all-MiniLM-L12-v2

scripts:

PORT=50365
HOST=localhost
URL="$HOST:$PORT"

curl -XPUT -H "Content-Type: application/json" $URL/_ingest/pipeline/nlp-pipeline -d '
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "AXA30IsByAqY8FkWHdIF",
        "field_map": {
           "passage_text": "passage_embedding"
        }
      }
    }
  ]
}'

curl -XDELETE $URL/scifact

curl -XPUT -H "Content-Type: application/json" $URL/scifact -d '
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline"
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name":"hnsw",
                    "engine":"lucene",
                    "space_type": "l2",
                    "parameters":{
                        "m":16,
                        "ef_construction": 512
                    }
                }
            },
            "passage_text": { 
                "type": "text"            
            },
            "passage_key": { 
                "type": "text"            
            },
            "passage_title": { 
                "type": "text"            
            }
        }
    }
}'

curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-minmax-pipeline-hybrid -d '
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.4,
              0.3,
              0.3
            ]
          }
        }
      }
    }
  ]
}'

curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-zscore-pipeline-hybrid -d '
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "z_score"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.4,
              0.3,
              0.3
            ]
          }
        }
      }
    }
  ]
}'

To use later with

PORT=50365
MODEL_ID="AXA30IsByAqY8FkWHdIF"
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=ingest
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=bm25
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=neural --pipelines=norm-minmax-pipeline --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-minmax-pipeline-hybrid --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-zscore-pipeline-hybrid --os_model_id=$MODEL_ID

Signed-off-by: Samuel Herman <[email protected]>

Signed-off-by: samuel-oci <[email protected]>

samuel-oci · 2023-11-17T22:46:38Z

FYI: I noticed a few of the IT tests (which were not changed in this PR) are broken after merge with the upstream feature branch.

martin-gaievski · 2023-11-18T02:08:58Z

@samuel-oci thank you for sharing details of the benchmark. It's not exactly what we have used to run benchmarks from our side. Is it possible for you to adjust some things and run one more round? This way we can compare apples to apples your numbers with those we've got before. Here is the list of what needs to be adjusted:

datasets should be one from https://opensearch.org/blog/hybrid-search/, for instance, scidocs or trec-covid.
model sentence tas-b
use following for index settings, there are few things that are different from the setting you've used:

{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline",
        "number_of_shards": 4
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "innerproduct",
                    "parameters": {}
                }
            },
            "passage_text": {
                "type": "text"
            },
            "title_key": {
                "type": "text", "analyzer" : "english"
            },
            "text_key": {
                "type": "text", "analyzer" : "english"
            }
        }
    }
}

search pipeline with default weights, like below:

{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}'

samuel-oci · 2023-11-19T04:21:05Z

@samuel-oci thank you for sharing details of the benchmark. It's not exactly what we have used to run benchmarks from our side. Is it possible for you to adjust some things and run one more round? This way we can compare apples to apples your numbers with those we've got before. Here is the list of what needs to be adjusted:

datasets should be one from https://opensearch.org/blog/hybrid-search/, for instance, scidocs or trec-covid.

model sentence tas-b

use following for index settings, there are few things that are different from the setting you've used:
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline",
        "number_of_shards": 4
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "innerproduct",
                    "parameters": {}
                }
            },
            "passage_text": {
                "type": "text"
            },
            "title_key": {
                "type": "text", "analyzer" : "english"
            },
            "text_key": {
                "type": "text", "analyzer" : "english"
            }
        }
    }
}
search pipeline with default weights, like below:
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}'

Sure thing @martin-gaievski , I reproduced the results with the settings you suggested above and used trec-covid for dataset. Moreover, I included L2 normalization this time as well for additional reference with min-max and z-score. Got the following results:

Method	NDCG@5	NDCG@10	NDCG@100	MAP@5	MAP@10	MAP@100	Recall@5	Recall@10	Recall@100	P@5	P@10	P@100
BM25	0.7336	0.6850	0.4734	0.0093	0.0163	0.0798	0.0100	0.0185	0.1120	0.7800	0.7260	0.4916
Neural	0.5179	0.4799	0.3509	0.0060	0.0104	0.0477	0.0072	0.0131	0.0781	0.5760	0.5200	0.3604
Hybrid min-max	0.7497	0.7263	0.4968	0.0099	0.0176	0.0851	0.0104	0.0197	0.1136	0.8000	0.7720	0.5046
Hybrid l2	0.7398	0.7150	0.4919	0.0096	0.0171	0.0830	0.0101	0.0193	0.1134	0.7840	0.7640	0.4992
hybrid z-score	0.6867	0.6467	0.4382	0.0086	0.0150	0.0710	0.0095	0.0173	0.1027	0.7360	0.6800	0.4456

Signed-off-by: Samuel Herman <[email protected]>

codecov · 2023-11-20T18:46:26Z

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (46499fa) 84.37% compared to head (9a19fe7) 84.34%.

❗ Current head 9a19fe7 differs from pull request most recent head 4843b7b. Consider uploading reports for the commit 4843b7b to get more accurate results

Files	Patch %	Lines
...or/normalization/ZScoreNormalizationTechnique.java	84.05%	5 Missing and 6 partials ⚠️

Additional details and impacted files

@@                         Coverage Diff                         @@
##             feature/z-score-normalization     #470      +/-   ##
===================================================================
- Coverage                            84.37%   84.34%   -0.03%     
- Complexity                             498      523      +25     
===================================================================
  Files                                   40       41       +1     
  Lines                                 1491     1559      +68     
  Branches                               228      247      +19     
===================================================================
+ Hits                                  1258     1315      +57     
- Misses                                 133      138       +5     
- Partials                               100      106       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

martin-gaievski · 2023-11-20T19:57:33Z

@samuel-oci from the data you've provided for trec-covid dataset it seems that z-score performing not that great comparing to other techniques. We're mainly looking to the NDCG metric to compare score accuracy performance.

We need more information/datapoints to understand z-score performance better. Can you please run same test for other datasets mentioned in the blog https://opensearch.org/blog/hybrid-search/? Idea is to find if z-score performing better than min-max and l2 for any of the datasets. If that's the case we need to find what is specific about that dataset(s) so z-score performing better. If we cannot find such dataset then we'll need to rethink if we want to add this technique or not.

This is the list of datasets we're used:

NFCorpus
Trec-Covid
Scidocs
Quora
Amazon ESCI
DBPedia
FiQA

I think DBPedia can be a problem due to large size => longest time to ingest data, so you can skip it, the rest should be doable.

Another point - I reviews the configuration I've shared with you previously, there is one adjustment you'll need to make.

This is the mapping we used in our benchmarking:

{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline",
        "number_of_shards": 12
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "innerproduct",
                    "parameters": {}
                }
            },
            "passage_text": {
                "type": "text"
            },
            "title_key": {
                "type": "text", "analyzer" : "english"
            },
            "text_key": {
                "type": "text", "analyzer" : "english"
            }
        }
    }
}

there are 12 shards in the configuration, but we also used 3 data nodes, so I've gave number of shards to 4. If you want to recreate our setup exactly, then you need 12 shards on 3 nodes. But in our case we were doing that to measure latencies, seems that you have good numbers there so it's not a concern.
To summarize - please change number of shards to 12. You can keep 1 data node, or make it 3 - up to you.

We're working on adding all these details to a separate issue to formalize the intake process for new techniques, that's work in progress now: #444

samuel-oci · 2023-11-21T03:03:05Z

https://opensearch.org/blog/hybrid-search/

@martin-gaievski , while the PR is mostly trivial I think the main issue that I see here is actually the time and effort it takes to setup and reproduce results for overall quite small datasets and workloads that shouldn't require an external environment.
I have some ideas on how to address those efficiently.
Will continue this discussion on the benchmark framework for neural search in this thread:
#430

Regarding the benchmark itself, we would also require to change combiner logic to something more z-score friendly. current combination techniques have some limitations because they only support greater than 0 score.

here is an example for the same benchmark on scifact dataset, this time I also added the combiner that can take into account negative values for z-score in arithmetic mean (proper z-score combiner should not be arithmetic mean whether with negatives or not but we can use it as an approximation for now).
This one is showing advantage to the z-score normalization approach. I can get the rest of the datasets as well, but for now can we have this dataset benchmark (it's part of the BEIR datasets) as a sufficient justification?

Method	NDCG@5	NDCG@10	NDCG@100	MAP@5	MAP@10	MAP@100	Recall@5	Recall@10	Recall@100	P@5	P@10	P@100
BM25	0.6577	0.6809	0.7036	0.6211	0.6327	0.6382	0.7479	0.8131	0.9109	0.1620	0.0900	0.0103
Neural	0.5446	0.5615	0.5946	0.5134	0.5219	0.5292	0.6177	0.6654	0.8192	0.1367	0.0753	0.0093
Hybrid min-max	0.6248	0.6483	0.6790	0.5867	0.5979	0.6058	0.7178	0.7861	0.9177	0.1573	0.0880	0.0104
Hybrid l2	0.6220	0.6379	0.6723	0.5854	0.5931	0.6003	0.7136	0.7598	0.9192	0.1553	0.0843	0.0104
hybrid z-score (default arithmetic mean combiner)	0.6475	0.6705	0.6991	0.6100	0.6217	0.6282	0.7421	0.8044	0.9343	0.1607	0.0900	0.0106
hybrid z-score (arithmetic mean combiner with negatives)	0.6595	0.6770	0.7045	0.6182	0.6275	0.6344	0.7644	0.8111	0.9327	0.9327	0.1660	0.0907

martin-gaievski · 2023-11-21T16:58:33Z

https://opensearch.org/blog/hybrid-search/

@martin-gaievski , while the PR is mostly trivial I think the main issue that I see here is actually the time and effort it takes to setup and reproduce results for overall quite small datasets and workloads that shouldn't require an external environment. I have some ideas on how to address those efficiently. Will continue this discussion on the benchmark framework for neural search in this thread: #430

Regarding the benchmark itself, we would also require to change combiner logic to something more z-score friendly. current combination techniques have some limitations because they only support greater than 0 score.

here is an example for the same benchmark on scifact dataset, this time I also added the combiner that can take into account negative values for z-score in arithmetic mean (proper z-score combiner should not be arithmetic mean whether with negatives or not but we can use it as an approximation for now). This one is showing advantage to the z-score normalization approach. I can get the rest of the datasets as well, but for now can we have this dataset benchmark (it's part of the BEIR datasets) as a sufficient justification?

@samuel-oci I agree that testing for such change is the main effort timewise, but that is absolute necessity. Main point for it is - we need to have a strong point of why we're adding it. My view is that what we do have now is a baseline, and anything we're adding after that should be compared and added if it works better for some/all cases. As soon as it became part of the codebase it can be used by any customer and it's a maintainer's repressibility to respond on requests such: "is/when this technique is better than technique X?"

Main datapoint we're looking for looking now is: how z-score results are better/worse than min/max and l2 on different datasets. For scifact it was shown before that z-score gives better NDCG, there isn't a new thing. I would love to see results for trec-covid, in the benchmark you've shared before the z-score performed worse than both min-max and l2. Would that adjusted combiner be a tipping point that is a question.

Did you check how combiner for negative scores affect the score from min-max and l2? If those scores remain same we can make such combiner a default one, or you can make it flexible if z-score only shows good results with such combiner.

samuel-oci · 2023-11-21T18:21:07Z

@samuel-oci I agree that testing for such change is the main effort timewise, but that is absolute necessity. Main point for it is - we need to have a strong point of why we're adding it. My view is that what we do have now is a baseline, and anything we're adding after that should be compared and added if it works better for some/all cases. As soon as it became part of the codebase it can be used by any customer and it's a maintainer's repressibility to respond on requests such: "is/when this technique is better than technique X?"

Hi @martin-gaievski that makes sense to me, if I understand your point here there are two things you are looking for:

which use cases are going to be supported better by z-score normalization
leave sufficient documentation to make it easy for end user to understand

Main datapoint we're looking for looking now is: how z-score results are better/worse than min/max and l2 on different datasets. For scifact it was shown before that z-score gives better NDCG, there isn't a new thing. I would love to see results for trec-covid, in the benchmark you've shared before the z-score performed worse than both min-max and l2. Would that adjusted combiner be a tipping point that is a question.

For how many data sets should we benchmark this? Or is it more fluid limit of as long as it takes to find the answer to the previous two questions?

Did you check how combiner for negative scores affect the score from min-max and l2? If those scores remain same we can make such combiner a default one, or you can make it flexible if z-score only shows good results with such combiner.

I didn't check new combiner code yet on l2, min-max I was hoping to contribute in parts and leave combiner for later.
Currently I don't expect the geometric mean with negative values combiner to have an effect on existing techniques (min-max and l2) but it broke some tests which is why I decided to not do it for now.
If we do want to test with appropriate z-score combiner I will have to create a specific combiner for z-score that can combine two z-scores to a new z-score in the following way:
https://stats.stackexchange.com/questions/348192/combining-z-scores-by-weighted-average-sanity-check-please
I can try it out and add that in the benchmark as well.

Signed-off-by: Samuel Herman <[email protected]>

martin-gaievski · 2023-12-12T16:33:58Z

@samuel-oci are you planning to continue work on this feature? Checking as there were no activity for last couple of weeks.

samuel-oci · 2023-12-12T17:58:24Z

@samuel-oci are you planning to continue work on this feature? Checking as there were no activity for last couple of weeks.

@martin-gaievski yes, just added the commits that include the scripts used for testing as well, hopefully those will benefit others too when using BEIR for neural-search testing.

It's been a bit of a challenge to get my hands on hardware that will allow me to run the tests on the larger datasets in BEIR, but I think I should be able to get those numbers soon.

jmazanec15 · 2024-10-02T00:51:09Z

@samuel-oci is this still being worked on?

minalsha · 2024-10-15T19:32:24Z

Hi @samuel-oci is this still being worked on?

samuel-oci · 2024-10-25T13:52:35Z

No not working at this anymore, feel free to close this.

vibrantvarun · 2024-10-30T00:15:22Z

Next action items: Deep dive needed on search relevance and perform benchmarking results on some more datasets.

samuel-oci added 5 commits October 17, 2023 13:43

add z-score and logging for tests

e6d2130

Signed-off-by: Samuel Herman <[email protected]>

wire z-score normalization

9cce739

Signed-off-by: Samuel Herman <[email protected]>

add IT test

10b5646

Signed-off-by: Samuel Herman <[email protected]>

fix IT test

266a34b

Signed-off-by: Samuel Herman <[email protected]>

review feedback

0effd07

Signed-off-by: Samuel Herman <[email protected]>

samuel-oci requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, wujunshen, zane-neo, ylwu-amzn and jngz-es as code owners October 21, 2023 18:44

martin-gaievski reviewed Oct 24, 2023

View reviewed changes

samuel-oci added 2 commits November 17, 2023 12:41

minor typo fixes and checkstyle

21954c9

Signed-off-by: Samuel Herman <[email protected]>

add explanation on test score calculation

8d7c3d9

Signed-off-by: Samuel Herman <[email protected]>

samuel-oci force-pushed the add-z-score branch from c5ad3c8 to 8d7c3d9 Compare November 17, 2023 21:19

samuel-oci and others added 2 commits November 17, 2023 13:33

Merge branch 'feature/z-score-normalization' into add-z-score

c4281d6

Signed-off-by: samuel-oci <[email protected]>

fix issues due to merge

b254336

add changelog

9a19fe7

Signed-off-by: Samuel Herman <[email protected]>

samuel-oci added 2 commits December 8, 2023 19:41

add combiner with negative score support

2b84c48

Signed-off-by: Samuel Herman <[email protected]>

adding scripts for BEIR testing

4843b7b

Signed-off-by: Samuel Herman <[email protected]>

martin-gaievski added the Enhancements Increases software capabilities beyond original client specifications label Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Add z-score for the normalization processor #376 #470

[FEATURE] Add z-score for the normalization processor #376 #470

samuel-oci commented Oct 21, 2023

samuel-oci commented Oct 21, 2023

samuel-oci commented Oct 21, 2023

martin-gaievski left a comment

martin-gaievski Oct 24, 2023

samuel-oci Nov 18, 2023

samuel-oci commented Nov 15, 2023 •

edited

Loading

martin-gaievski commented Nov 16, 2023

samuel-oci commented Nov 16, 2023 •

edited

Loading

samuel-oci commented Nov 17, 2023 •

edited

Loading

martin-gaievski commented Nov 18, 2023

samuel-oci commented Nov 19, 2023

codecov bot commented Nov 20, 2023 •

edited

Loading

martin-gaievski commented Nov 20, 2023

samuel-oci commented Nov 21, 2023 •

edited

Loading

martin-gaievski commented Nov 21, 2023

samuel-oci commented Nov 21, 2023

martin-gaievski commented Dec 12, 2023

samuel-oci commented Dec 12, 2023

jmazanec15 commented Oct 2, 2024

minalsha commented Oct 15, 2024

samuel-oci commented Oct 25, 2024

vibrantvarun commented Oct 30, 2024

[FEATURE] Add z-score for the normalization processor #376 #470

Are you sure you want to change the base?

[FEATURE] Add z-score for the normalization processor #376 #470

Conversation

samuel-oci commented Oct 21, 2023

Description

Issues Resolved

Check List

samuel-oci commented Oct 21, 2023

samuel-oci commented Oct 21, 2023

martin-gaievski left a comment

Choose a reason for hiding this comment

martin-gaievski Oct 24, 2023

Choose a reason for hiding this comment

samuel-oci Nov 18, 2023

Choose a reason for hiding this comment

samuel-oci commented Nov 15, 2023 • edited Loading

martin-gaievski commented Nov 16, 2023

samuel-oci commented Nov 16, 2023 • edited Loading

samuel-oci commented Nov 17, 2023 • edited Loading

martin-gaievski commented Nov 18, 2023

samuel-oci commented Nov 19, 2023

codecov bot commented Nov 20, 2023 • edited Loading

Codecov Report

martin-gaievski commented Nov 20, 2023

samuel-oci commented Nov 21, 2023 • edited Loading

martin-gaievski commented Nov 21, 2023

samuel-oci commented Nov 21, 2023

martin-gaievski commented Dec 12, 2023

samuel-oci commented Dec 12, 2023

jmazanec15 commented Oct 2, 2024

minalsha commented Oct 15, 2024

samuel-oci commented Oct 25, 2024

vibrantvarun commented Oct 30, 2024

samuel-oci commented Nov 15, 2023 •

edited

Loading

samuel-oci commented Nov 16, 2023 •

edited

Loading

samuel-oci commented Nov 17, 2023 •

edited

Loading

codecov bot commented Nov 20, 2023 •

edited

Loading

samuel-oci commented Nov 21, 2023 •

edited

Loading