-
Really appreciate your plugin and your documentation. I am, however, struggling to understand how to pick the LSH parameters. Are you able to give some ideas of what would be practical/sensible for different vector sizes, or how one might go about calculating this? As an example, I have 768-dimensional vectors from a sentence encoder, with circa 50M documents in one scenario and 1M documents in another.
-
Hi Ben, I would recommend you look at the Glove100 benchmarks as a starting point: https://elastiknn.com/performance/#annb-glove100. Glove100 is a dataset of about 1 million 100-dimensional vectors. You can download the Pareto configs for the LSH rows; that gives you a CSV of mapping and query parameter combinations along with the recall and throughput each one achieved. The intuition behind the parameters: L is the number of hash tables, so increasing it surfaces more candidates (higher recall, slower queries), while k is the number of hash functions concatenated per table, so increasing it makes each hash more selective (fewer, more precise candidates). For your scenario, I would take a guess of starting around L=200 and k=7. With 50M docs you'll definitely need to parallelize across multiple ES nodes to get queries in the 100s of millis.
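Concretely, a mapping along those lines for your 768-dimensional vectors might look roughly like the sketch below. The index and field names are placeholders, and the exact parameter names should be checked against the API docs for your plugin version (e.g. older releases use "angular" where newer ones use "cosine").

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

# Create the index, then map a 768-dim vector field with the (cosine) LSH model.
requests.put(f"{ES}/{INDEX}")
mapping = {
    "properties": {
        "embedding": {  # placeholder field name
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 768,             # sentence-encoder vector size
                "model": "lsh",
                "similarity": "cosine",  # "angular" in older Elastiknn releases
                "L": 200,                # number of hash tables (suggested starting point)
                "k": 7,                  # hash functions concatenated per table
            },
        }
    }
}
resp = requests.put(f"{ES}/{INDEX}/_mapping", json=mapping)
print(resp.json())
```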
-
I'll go ahead and close to keep things tidy, but please feel free to re-open if you have more questions.
-
Thank you so much for your response, and so quickly! I massively back your plugin and efforts. Out of curiosity, would this issue remove the need to "pick" these values: #108
-
@bennimmo Thanks for the kind words. The model in that issue also has some hyperparameters that need to be set. You might also look at Permutation LSH: https://elastiknn.com/api/#permutation-lsh-mapping. This works for angular similarity and is a bit simpler than LSH, IMO.
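For comparison, a Permutation LSH mapping might look roughly like this sketch; the index and field names are placeholders, and k/repeating are values you would tune, so treat the linked docs as the authoritative reference.

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

# Map the same 768-dim field with the Permutation LSH model instead of LSH.
mapping = {
    "properties": {
        "embedding": {  # placeholder field name
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 768,
                "model": "permutation_lsh",
                "k": 128,           # how many top-magnitude indices represent each vector (illustrative)
                "repeating": True,  # repeat higher-ranked indices so they carry more weight
            },
        }
    }
}
print(requests.put(f"{ES}/{INDEX}/_mapping", json=mapping).json())
```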
-
@alexklibisz thanks for the advice on this. I am, however, getting odd results from the permutation LSH. I'm currently running a small test of 10,000 vectors and looking at the results while varying the indexes, K, and candidates in the search. At best I am getting a recall rate (measured by carrying out an exact search first and comparing the results) of 0.5. I only get that by setting K to 250 and the number of candidates to something like 40, which is essentially searching the whole space (I think), but even then it comes out with a recall of 0.5. Any advice?
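For reference, the recall I'm quoting is computed roughly like the minimal sketch below; the result lists are made up, and in practice they are the document ids returned by an exact query and an approximate query for the same query vector.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbours also present in the approximate top-k."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# Illustrative ids only: half of the exact neighbours show up in the approximate results.
print(recall_at_k(["a", "b", "c", "d"], ["a", "c", "x", "y"], k=4))  # 0.5
```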
-
@bennimmo A couple of thoughts:
-
Noted on point 2, that slipped past me... apologies. I'm currently working with 768 dimensions.
-
No worries, it's subtle.
Got it. So K=250 is not totally crazy. :) You're basically saying, "I should be able to identify a vector by ranking the indexes of its 250 largest-magnitude values." There is a more precise example of how this algorithm works in the docs: https://elastiknn.com/api/#permutation-lsh-mapping

Next I would say: look at the distribution of your vectors' values. If the values are extremely tight around some point, or they are floats with very high precision, permutation LSH is probably not a good algorithm. This is the beauty (or the pain :) ) of approximate nearest neighbor search... the best method is highly dependent on the characteristics of your data.

After digging deep on this problem for a while, I tend to suggest pre-filtering by some other property, so that you have a search space of, say, 10k vectors, and then you can efficiently run exact search.
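To make the ranking idea concrete, here is a tiny illustrative sketch of the hashing step, not the plugin's actual implementation: each vector is reduced to the indices of its K largest-magnitude values, ordered by magnitude, and two vectors look similar when those index rankings overlap.

```python
import numpy as np

def permutation_hash(vec, k):
    """Indices of the k largest-magnitude values, ordered from largest to smallest |value|."""
    idx = np.argsort(-np.abs(np.asarray(vec)))[:k]
    return [int(i) for i in idx]

v = [0.1, -2.0, 0.05, 1.5, -0.3]
print(permutation_hash(v, k=3))  # [1, 3, 4]
```

If your values are all packed tightly around the same point, these rankings become unstable from one vector to the next, which is exactly the failure mode described above.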
-
To close the loop on this... Many factors played into the odd results, some of which were mentioned above, plus, if I'm honest, a bug in the test code I wrote myself. The optimum was a K of circa 50 with 6 indexes and 20 candidates. This was on a small dataset of 10K vectors, though, and would be unlikely to scale... I just wanted to report back for anyone stumbling across this in the future.
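For completeness, the search side with those settings looks roughly like the sketch below (placeholder index/field names and a dummy query vector; the K of ~50 goes in the permutation_lsh mapping, while candidates is a query-time parameter).

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

query = {
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "embedding",            # placeholder field mapped with permutation_lsh
            "vec": {"values": [0.0] * 768},  # dummy query vector
            "model": "permutation_lsh",
            "similarity": "cosine",          # "angular" in older Elastiknn releases
            "candidates": 20,                # candidates re-scored with the exact similarity
        }
    }
}
print(requests.post(f"{ES}/{INDEX}/_search", json=query).json())
```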