-
Really appreciate your plugin and your documentation. I am, however, struggling to understand how to pick the LSH parameters. Are you able to give some ideas of what would be practical/sensible for different vector sizes, or how one might go about calculating this? As an example, I have 768-dimensional vectors from a sentence encoder, with circa 50M documents in one scenario and 1M documents in another.
-
Hi Ben, I would recommend you look at the Glove100 benchmarks as a starting point: https://elastiknn.com/performance/#annb-glove100. Glove100 is a dataset of about 1 million 100-dimensional vectors. You can download the Pareto configs for the LSH rows; that gives you a CSV of mapping and query parameter combinations along with the recall and throughput each one achieved. The intuition behind the parameters: L is the number of hash tables, so increasing it surfaces more candidates (higher recall, slower queries), while k is the number of hash functions concatenated per table, so increasing it makes each hash more selective (fewer, more precise candidates). For your scenario, I would take a guess of starting around L=200 and k=7. With 50M docs you'll definitely need to parallelize across multiple ES nodes to get queries in the 100s of millis.
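Concretely, a mapping along those lines for your 768-dimensional vectors might look roughly like the sketch below. The index and field names are placeholders, and the exact parameter names should be checked against the API docs for your plugin version (e.g. older releases use "angular" where newer ones use "cosine").

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

# Create the index, then map a 768-dim vector field with the (cosine) LSH model.
requests.put(f"{ES}/{INDEX}")
mapping = {
    "properties": {
        "embedding": {  # placeholder field name
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 768,             # sentence-encoder vector size
                "model": "lsh",
                "similarity": "cosine",  # "angular" in older Elastiknn releases
                "L": 200,                # number of hash tables (suggested starting point)
                "k": 7,                  # hash functions concatenated per table
            },
        }
    }
}
resp = requests.put(f"{ES}/{INDEX}/_mapping", json=mapping)
print(resp.json())
```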
-
I'll go ahead and close to keep things tidy, but please feel free to re-open if you have more questions.
-
Thank you so much for your response, and so quickly! I massively back your plugin and efforts. Out of curiosity, would this issue remove the need to "pick" these values: #108
-
@bennimmo Thanks for the kind words. The model in that issue also has some hyperparameters that need to be set. You might also look at Permutation LSH: https://elastiknn.com/api/#permutation-lsh-mapping. This works for angular similarity and is a bit simpler than LSH, IMO.
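For comparison, a Permutation LSH mapping might look roughly like this sketch; the index and field names are placeholders, and k/repeating are values you would tune, so treat the linked docs as the authoritative reference.

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

# Map the same 768-dim field with the Permutation LSH model instead of LSH.
mapping = {
    "properties": {
        "embedding": {  # placeholder field name
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 768,
                "model": "permutation_lsh",
                "k": 128,           # how many top-magnitude indices represent each vector (illustrative)
                "repeating": True,  # repeat higher-ranked indices so they carry more weight
            },
        }
    }
}
print(requests.put(f"{ES}/{INDEX}/_mapping", json=mapping).json())
```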
-
@alexklibisz thanks for the advice on this. I am, however, getting odd results from the permutation LSH. I'm currently running a small test of 10,000 vectors and looking at the results while varying the indexes, K, and candidates in the search. At best I am getting a recall rate (measured by carrying out an exact search first and comparing the results) of 0.5. I only get that by setting K to 250 and the number of candidates to something like 40, which is essentially searching the whole space (I think), but even then it comes out with a recall of 0.5. Any advice?
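For reference, the recall I'm quoting is computed roughly like the minimal sketch below; the result lists are made up, and in practice they are the document ids returned by an exact query and an approximate query for the same query vector.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbours also present in the approximate top-k."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# Illustrative ids only: half of the exact neighbours show up in the approximate results.
print(recall_at_k(["a", "b", "c", "d"], ["a", "c", "x", "y"], k=4))  # 0.5
```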
-
@bennimmo A couple of thoughts:
-
Noted on point 2, that slipped past me... apologies. I'm currently working with 768 dimensions.
-
No worries, it's subtle.
Got it. So K=250 is not totally crazy. :) You're basically saying, "I should be able to identify a vector by ranking the indexes of its 250 largest-magnitude values." There is a more precise example of how this algorithm works in the docs: https://elastiknn.com/api/#permutation-lsh-mapping

Next I would say: look at the distribution of your vectors' values. If the values are extremely tight around some point, or they are floats with very high precision, permutation LSH is probably not a good algorithm. This is the beauty (or the pain :) ) of approximate nearest neighbor search... the best method is highly dependent on the characteristics of your data.

After digging deep on this problem for a while, I tend to suggest pre-filtering by some other property, so that you have a search space of, say, 10k vectors, and then you can efficiently run exact search.
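To make the ranking idea concrete, here is a tiny illustrative sketch of the hashing step, not the plugin's actual implementation: each vector is reduced to the indices of its K largest-magnitude values, ordered by magnitude, and two vectors look similar when those index rankings overlap.

```python
import numpy as np

def permutation_hash(vec, k):
    """Indices of the k largest-magnitude values, ordered from largest to smallest |value|."""
    idx = np.argsort(-np.abs(np.asarray(vec)))[:k]
    return [int(i) for i in idx]

v = [0.1, -2.0, 0.05, 1.5, -0.3]
print(permutation_hash(v, k=3))  # [1, 3, 4]
```

If your values are all packed tightly around the same point, these rankings become unstable from one vector to the next, which is exactly the failure mode described above.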
-
To close the loop on this... Many factors played into the odd results, some of which were mentioned above, plus, if I'm honest, a bug in the test code I wrote myself. The optimum was a K of circa 50 with 6 indexes and 20 candidates. This was on a small dataset of 10K vectors, though, and would be unlikely to scale... I just wanted to report back for anyone stumbling across this in the future.
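For completeness, the search side with those settings looks roughly like the sketch below (placeholder index/field names and a dummy query vector; the K of ~50 goes in the permutation_lsh mapping, while candidates is a query-time parameter).

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "docs"                 # placeholder index name

query = {
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "embedding",            # placeholder field mapped with permutation_lsh
            "vec": {"values": [0.0] * 768},  # dummy query vector
            "model": "permutation_lsh",
            "similarity": "cosine",          # "angular" in older Elastiknn releases
            "candidates": 20,                # candidates re-scored with the exact similarity
        }
    }
}
print(requests.post(f"{ES}/{INDEX}/_search", json=query).json())
```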