Q&A: Scale effects #661
-
Hi @alexklibisz, first of all thanks for your time and dedication in building elastiknn. I'd like to share our use case and the scaling behavior we are observing. We have indexed about 150M documents in an Elasticsearch cluster, including both the document text and a 768-dimensional vector for each document. We are considering using elastiknn as our nearest-neighbor search solution, but queries are currently taking around 60 seconds. I suspect this might be a memory issue, so I have a few questions related to this.
Many thanks again!
-
Hi @ezorita, these are some good questions. I'll try to answer below.
150M is more than I've ever tested with. It's not surprising that it takes longer, but 60s sounds like the cluster might just lack resources for that amount of data. I'm assuming that these are LSH (approximate) queries. As a sanity check, how long does it take to run a standard term query on 150M documents with your current infrastructure? Any Elastiknn vector query is basically matching a bunch of terms, so the vector query will necessarily be slower than a term query. I would set a baseline with term queries. More tips below.
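To make the baseline concrete, here's a minimal sketch of that comparison, assuming plain HTTP via Python's `requests`; the cluster address, index name, field names, and LSH parameters are all placeholders for your own setup:

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder cluster address
INDEX = "docs"                # placeholder index name

def timed_search(body):
    """POST a search and return (wall-clock seconds, ES-reported 'took' ms)."""
    t0 = time.time()
    r = requests.post(f"{ES}/{INDEX}/_search", json=body)
    r.raise_for_status()
    return time.time() - t0, r.json()["took"]

# Baseline: an ordinary term query over the same 150M documents.
term_body = {"size": 10, "query": {"term": {"title": "example"}}}

# Elastiknn LSH query: under the hood this also matches a set of terms
# (the vector's hashes), so it should be slower than, but comparable in
# shape to, the term query above.
knn_body = {
    "size": 10,
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "vec",                  # placeholder vector field
            "vec": {"values": [0.1] * 768},  # your 768-dim query vector
            "model": "lsh",
            "similarity": "cosine",
            "candidates": 100,
        }
    },
}

for name, body in [("term", term_body), ("elastiknn lsh", knn_body)]:
    wall, took = timed_search(body)
    print(f"{name}: wall={wall:.2f}s took={took}ms")
```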
The filtering is strictly pre-filtering. So filtering should actually improve performance. More here: https://alexklibisz.github.io/elastiknn/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
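Roughly, the pattern from those docs is to wrap the `elastiknn_nearest_neighbors` query in a `bool` query whose `filter` clause narrows the candidate set first. A sketch, with placeholder field names:

```python
# Pre-filtering sketch: the bool filter narrows the document set first,
# so the LSH term matching only scores documents that pass the filter.
# Field names ("lang", "vec") are placeholders.
filtered_body = {
    "size": 10,
    "query": {
        "bool": {
            "filter": [{"term": {"lang": "en"}}],  # any standard ES filter
            "must": {
                "elastiknn_nearest_neighbors": {
                    "field": "vec",
                    "vec": {"values": [0.1] * 768},
                    "model": "lsh",
                    "similarity": "cosine",
                    "candidates": 100,
                }
            },
        }
    },
}
# e.g. requests.post(f"{ES}/{INDEX}/_search", json=filtered_body)
```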
It depends on the number of vectors and the LSH parameters. For cosine LSH, index size will scale with the number of vectors times the number of hash tables (the L parameter), since each indexed vector stores L hash terms in addition to the vector itself.
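To make that concrete, here's a rough back-of-envelope sketch; the L value and per-term byte figure below are assumptions for illustration, not measured constants:

```python
# Back-of-envelope index size estimate for 150M cosine-LSH vectors.
n_vectors = 150_000_000
dims = 768
L = 99                      # example number of hash tables
bytes_per_float = 4         # float32 vector components
bytes_per_hash_term = 16    # assumed average incl. postings overhead

raw_vectors_gb = n_vectors * dims * bytes_per_float / 1e9
hash_terms_gb = n_vectors * L * bytes_per_hash_term / 1e9
print(f"raw vectors: ~{raw_vectors_gb:.0f} GB")  # ~461 GB
print(f"LSH hashes:  ~{hash_terms_gb:.0f} GB")   # ~238 GB
```

The point is just that both components grow linearly with the number of vectors, and the hash component grows linearly with L.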
Ideally yes. But this is more of a general concern with Elasticsearch and Lucene. AFAIK, for low-latency search, the index files should ideally be cached in the file system cache. You can monitor IOPS or similar metrics to verify that you're reading from memory and not from disk/SSD.
If it has the space, the operating system should eventually cache the index files in file system cache automatically. Elasticsearch has some advanced settings to control this more precisely, e.g., https://www.elastic.co/guide/en/elasticsearch/reference/current/preload-data-to-file-system-cache.html. I haven't tried this. I usually just trust that if I've provided enough system (non-JVM) memory, then the file system will cache the index files.
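For reference, applying that setting looks roughly like the sketch below. `index.store.preload` is a static setting, so it has to be set at index creation (or on a closed index); the index name here is a placeholder, and `nvd`/`dvd` are the file extensions used in the example from the linked docs:

```python
# Sketch: eagerly load certain Lucene file extensions into the file
# system cache via index.store.preload (from the ES docs linked above).
import requests

ES = "http://localhost:9200"  # placeholder cluster address
body = {"settings": {"index": {"store": {"preload": ["nvd", "dvd"]}}}}
r = requests.put(f"{ES}/docs-preloaded", json=body)  # create a new index
r.raise_for_status()
```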
I haven't really pushed elastiknn to that kind of scale recently. I've been benchmarking mostly with the Fashion-MNIST dataset, which is ~60k vectors. My general advice is the following:
Yeah, I don't have any plans to add functionality. I've been tinkering with performance when I have the time and when I have ideas, mostly just because I'm interested in performance optimization. If someone is interested in adding functionality to Elastiknn, I would review the PRs. I would also have a high standard for including a new feature. I don't want it to be flaky or a burden to test/maintain.
I haven't looked at ES's native vector search in a long time, so I'm not familiar with the features. If they don't offer pre-filtering, then it's probably not a fundamental limitation; Elastiknn has had pre-filtering since 2020, implemented with existing Elasticsearch and Lucene APIs.

At a strategic level, the difference is that Elasticsearch uses the HNSW model for ANN, which is built into Lucene as a dedicated feature, whereas Elastiknn uses the LSH model for ANN, based on standard Lucene term queries: convert the vector to a set of hashes, store each hash as a term, and use the existing APIs to query for terms. On benchmarks, HNSW seems to be much better than LSH. I haven't seen a direct comparison of Elastiknn LSH vs. Elasticsearch's HNSW; I'd be very interested, and I would hope the HNSW queries are much faster given the amount of effort devoted to this in Lucene over the past ~5 years.
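For contrast, here's a minimal sketch of the native side of that comparison, assuming a recent Elasticsearch (8.x) with a `dense_vector` field; the field name is a placeholder and the field must be mapped with `index: true`:

```python
# Elasticsearch's native HNSW search: the top-level "knn" search option
# (ES 8.x). Unlike the Elastiknn LSH query, this hits a dedicated
# Lucene HNSW graph rather than term postings.
native_knn = {
    "knn": {
        "field": "vec",               # placeholder dense_vector field
        "query_vector": [0.1] * 768,
        "k": 10,
        "num_candidates": 100,
    }
}
# e.g. requests.post(f"{ES}/{INDEX}/_search", json=native_knn)
```

I hope that helps!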
-
Converting this to a discussion. Still trying to decide exactly how to distinguish Issues vs. Discussions, but this feels more like a discussion than a specific issue to resolve or implement.