Q&A: Scale effects #661
-
Hi @alexklibisz, first of all thanks for your time and dedication in building elastiknn. I'd like to share our use case and the scaling behavior we are observing. We have indexed about 150M documents in an Elasticsearch cluster, including both the document text and a 768-dimensional vector for each document. We are considering using elastiknn as our nearest-neighbor search solution, but queries are currently taking around 60 seconds. I suspect this might be a memory issue, so I have a few questions related to this.
Many thanks again!
-
Hi @ezorita, these are some good questions. I'll try to answer below.
150M is more than I've ever tested with. It's not surprising that it takes longer, but 60s sounds like the cluster might just lack resources for that amount of data. I'm assuming that these are LSH (approximate) queries. As a sanity check, how long does it take to run a standard term query on 150M documents with your current infrastructure? Any Elastiknn vector query is basically matching a bunch of terms, so the vector query will necessarily be slower than a term query. I would set a baseline with term queries. More tips below.
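To make the baseline concrete, here's a minimal sketch of that comparison, assuming plain HTTP via Python's `requests`; the cluster address, index name, field names, and LSH parameters are all placeholders for your own setup:

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder cluster address
INDEX = "docs"                # placeholder index name

def timed_search(body):
    """POST a search and return (wall-clock seconds, ES-reported 'took' ms)."""
    t0 = time.time()
    r = requests.post(f"{ES}/{INDEX}/_search", json=body)
    r.raise_for_status()
    return time.time() - t0, r.json()["took"]

# Baseline: an ordinary term query over the same 150M documents.
term_body = {"size": 10, "query": {"term": {"title": "example"}}}

# Elastiknn LSH query: under the hood this also matches a set of terms
# (the vector's hashes), so it should be slower than, but comparable in
# shape to, the term query above.
knn_body = {
    "size": 10,
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "vec",                  # placeholder vector field
            "vec": {"values": [0.1] * 768},  # your 768-dim query vector
            "model": "lsh",
            "similarity": "cosine",
            "candidates": 100,
        }
    },
}

for name, body in [("term", term_body), ("elastiknn lsh", knn_body)]:
    wall, took = timed_search(body)
    print(f"{name}: wall={wall:.2f}s took={took}ms")
```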
The filtering is strictly pre-filtering. So filtering should actually improve performance. More here: https://alexklibisz.github.io/elastiknn/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
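Roughly, the pattern from those docs is to wrap the `elastiknn_nearest_neighbors` query in a `bool` query whose `filter` clause narrows the candidate set first. A sketch, with placeholder field names:

```python
# Pre-filtering sketch: the bool filter narrows the document set first,
# so the LSH term matching only scores documents that pass the filter.
# Field names ("lang", "vec") are placeholders.
filtered_body = {
    "size": 10,
    "query": {
        "bool": {
            "filter": [{"term": {"lang": "en"}}],  # any standard ES filter
            "must": {
                "elastiknn_nearest_neighbors": {
                    "field": "vec",
                    "vec": {"values": [0.1] * 768},
                    "model": "lsh",
                    "similarity": "cosine",
                    "candidates": 100,
                }
            },
        }
    },
}
# e.g. requests.post(f"{ES}/{INDEX}/_search", json=filtered_body)
```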
It depends on the number of vectors and the LSH parameters. For cosine LSH, index size will scale with the number of vectors times the number of hash tables (the L parameter), since each indexed vector stores L hash terms in addition to the vector itself.
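To make that concrete, here's a rough back-of-envelope sketch; the L value and per-term byte figure below are assumptions for illustration, not measured constants:

```python
# Back-of-envelope index size estimate for 150M cosine-LSH vectors.
n_vectors = 150_000_000
dims = 768
L = 99                      # example number of hash tables
bytes_per_float = 4         # float32 vector components
bytes_per_hash_term = 16    # assumed average incl. postings overhead

raw_vectors_gb = n_vectors * dims * bytes_per_float / 1e9
hash_terms_gb = n_vectors * L * bytes_per_hash_term / 1e9
print(f"raw vectors: ~{raw_vectors_gb:.0f} GB")  # ~461 GB
print(f"LSH hashes:  ~{hash_terms_gb:.0f} GB")   # ~238 GB
```

The point is just that both components grow linearly with the number of vectors, and the hash component grows linearly with L.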
Ideally yes. But this is more of a general concern with Elasticsearch and Lucene. AFAIK, for low-latency search, the index files should ideally be cached in the file system cache. You can monitor IOPS or similar metrics to verify that you're reading from memory and not from disk/SSD.
If it has the space, the operating system should eventually cache the index files in file system cache automatically. Elasticsearch has some advanced settings to control this more precisely, e.g., https://www.elastic.co/guide/en/elasticsearch/reference/current/preload-data-to-file-system-cache.html. I haven't tried this. I usually just trust that if I've provided enough system (non-JVM) memory, then the file system will cache the index files.
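For reference, applying that setting looks roughly like the sketch below. `index.store.preload` is a static setting, so it has to be set at index creation (or on a closed index); the index name here is a placeholder, and `nvd`/`dvd` are the file extensions used in the example from the linked docs:

```python
# Sketch: eagerly load certain Lucene file extensions into the file
# system cache via index.store.preload (from the ES docs linked above).
import requests

ES = "http://localhost:9200"  # placeholder cluster address
body = {"settings": {"index": {"store": {"preload": ["nvd", "dvd"]}}}}
r = requests.put(f"{ES}/docs-preloaded", json=body)  # create a new index
r.raise_for_status()
```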
I haven't really pushed elastiknn to that kind of scale recently. I've been benchmarking mostly with the Fashion-MNIST dataset, which is ~60k vectors. My general advice is the following:
Yeah, I don't have any plans to add functionality. I've been tinkering with performance when I have the time and when I have ideas, mostly just because I'm interested in performance optimization. If someone is interested in adding functionality to Elastiknn, I would review the PRs. I would also have a high standard for including a new feature. I don't want it to be flaky or a burden to test/maintain.
I haven't looked at ES's native vector search in a long time, so I'm not familiar with the features. If they don't offer pre-filtering, then it's probably not a fundamental limitation; Elastiknn has had pre-filtering since 2020, implemented with existing Elasticsearch and Lucene APIs.

At a strategic level, the difference is that Elasticsearch uses the HNSW model for ANN, which is built into Lucene as a dedicated feature, whereas Elastiknn uses the LSH model for ANN, based on standard Lucene term queries: convert the vector to a set of hashes, store each hash as a term, and use the existing APIs to query for terms. On benchmarks, HNSW seems to be much better than LSH. I haven't seen a direct comparison of Elastiknn LSH vs. Elasticsearch's HNSW; I'd be very interested, and I would hope the HNSW queries are much faster given the amount of effort devoted to this in Lucene over the past ~5 years.
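For contrast, here's a minimal sketch of the native side of that comparison, assuming a recent Elasticsearch (8.x) with a `dense_vector` field; the field name is a placeholder and the field must be mapped with `index: true`:

```python
# Elasticsearch's native HNSW search: the top-level "knn" search option
# (ES 8.x). Unlike the Elastiknn LSH query, this hits a dedicated
# Lucene HNSW graph rather than term postings.
native_knn = {
    "knn": {
        "field": "vec",               # placeholder dense_vector field
        "query_vector": [0.1] * 768,
        "k": 10,
        "num_candidates": 100,
    }
}
# e.g. requests.post(f"{ES}/{INDEX}/_search", json=native_knn)
```

I hope that helps!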
-
Converting this to a discussion. Still trying to decide exactly how to distinguish Issues vs. Discussions, but this feels more like a discussion than a specific issue to resolve or implement.