How to find the optimal cache configuration #751

fracek · 2024-10-05T18:31:47Z


I recently started using Foyer to cache objects stored on S3. I'm using version 0.12 from the main branch. Let me start by saying that I like it a lot: it was very easy to get started, and I love the fetch API with built-in request deduplication!

My workload uses files of different sizes, ranging from 500 KiB to 150 MiB or more¹. It also downloads many files in short bursts: a few gigabytes in under 30 s, if it can. I noticed that changing the configuration makes a big difference to the speed of my workload, since it benefits a lot from cached data.

I tried to look at the configuration options but there are a lot of them and not everything is clear.

shards: how many shards should I choose? Once I pick a value, is there any way to know if I should increase or decrease the shards?
admission picker: my understanding is that I should use a rate limited admission picker to avoid warning messages about the flushers not keeping up. I assume this value should be roughly my disk write throughput?
What flushers are is clear enough. But what are reclaimers? Should I scale them with the flushers?
buffer pool size: my understanding is that this buffer should be larger than the largest file I handle multiplied by the number of flushers. So if I have 2 flushers and my largest file is 150 MiB, this buffer should be at least 300 MiB (plus extra space for the keys and metadata). I noticed that if I set it too small, I get a [lodt batch] serialize entry error caused by not enough space in the buffer.
submit queue size threshold: something about inflight requests?
clean region threshold: no idea.

I think these options would be much easier to understand with a good mental model of Foyer, but at the moment how it works is a bit fuzzy to me². A guide explaining the "life of a request" to fetch_with_context would help a lot, since I'm sure it touches all (or most) of the options above!

¹ I can tweak the file size on my side.
² I see it as: look up the value in memory; if it's not there, look it up on disk. If it's not there either, call the fetcher, push the value into the in-memory cache, and return it. Once the in-memory cache is full, push older entries to disk.
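Here is a minimal Rust sketch of the two-tier model described in footnote 2. It is for illustration only and is neither foyer's implementation nor its API. `TwoTierCache`, its fields, and the FIFO demotion rule are hypothetical. The "disk" tier is simply a second in-memory `HashMap`.

```rust
use std::collections::{HashMap, VecDeque};

/// Toy two-tier cache following the mental model in footnote 2.
/// Hypothetical: NOT foyer's implementation; the "disk" tier is a HashMap.
pub struct TwoTierCache {
    memory: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // in-memory keys, oldest first (FIFO demotion)
    memory_capacity: usize,  // max number of entries held in memory
    disk: HashMap<String, Vec<u8>>,
}

impl TwoTierCache {
    pub fn new(memory_capacity: usize) -> Self {
        Self {
            memory: HashMap::new(),
            order: VecDeque::new(),
            memory_capacity,
            disk: HashMap::new(),
        }
    }

    /// Look up in memory, then on "disk"; on a full miss call `fetch`,
    /// insert the value into memory, and return it.
    pub fn fetch<F: FnOnce() -> Vec<u8>>(&mut self, key: &str, fetch: F) -> Vec<u8> {
        if let Some(v) = self.memory.get(key) {
            return v.clone();
        }
        if let Some(v) = self.disk.get(key) {
            return v.clone();
        }
        let value = fetch();
        self.insert_memory(key.to_string(), value.clone());
        value
    }

    fn insert_memory(&mut self, key: String, value: Vec<u8>) {
        // Memory full: demote the oldest in-memory entry to the disk tier.
        if self.memory.len() >= self.memory_capacity {
            if let Some(oldest) = self.order.pop_front() {
                if let Some(v) = self.memory.remove(&oldest) {
                    self.disk.insert(oldest, v);
                }
            }
        }
        self.order.push_back(key.clone());
        self.memory.insert(key, value);
    }

    /// True if `key` currently lives in the disk tier.
    pub fn on_disk(&self, key: &str) -> bool {
        self.disk.contains_key(key)
    }
}
```

With a memory capacity of 1, fetching "a" and then "b" demotes "a" to the disk tier. A later fetch of "a" is then served from disk without calling the fetcher.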

Answered by MrCroxx


MrCroxx · 2024-10-08T03:03:11Z

Maintainer

Hi @fracek, thanks for asking, and sorry for the late reply. I just came back from vacation.

For your questions, let me answer them one by one.

First, based on your workload, if you are using the main branch, make sure you use the large object engine only. The mixed engine won't help in your case.

shards: how many shards should I choose? Once I pick a value, is there any way to know if I should increase or decrease the shards?

You can start with the default configuration or 8/16/32/64. You don't need to tune it unless you run into a problem. You can measure the locking overhead related to sharding with perf and a flame graph.
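Shard count matters because of lock contention. Each shard is guarded by its own lock, so two requests only block each other when their keys hash to the same shard. The sketch below shows how a key is routed to a shard. It assumes a simple `Mutex<HashMap>` per shard, and `ShardedMap` is a hypothetical type, not foyer's internal one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical sharded map: each shard has its own lock, so operations on
/// keys in different shards never wait on each other.
pub struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V: Clone> ShardedMap<V> {
    pub fn new(num_shards: usize) -> Self {
        assert!(num_shards > 0, "need at least one shard");
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Route a key to a shard: hash it, then take the hash modulo the shard count.
    pub fn shard_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.shards.len() as u64) as usize
    }

    pub fn insert(&self, key: &str, value: V) {
        let idx = self.shard_index(key);
        // Only this shard is locked; the other shards stay available.
        self.shards[idx].lock().unwrap().insert(key.to_string(), value);
    }

    pub fn get(&self, key: &str) -> Option<V> {
        let idx = self.shard_index(key);
        self.shards[idx].lock().unwrap().get(key).cloned()
    }
}
```

With more shards, concurrent requests are less likely to hit the same lock. Time spent waiting on these mutexes is the overhead that perf and a flame graph would reveal.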

admission picker: my understanding is that I should use a rate limited admission picker to avoid warning messages about the flushers not keeping up. I assume this value should be roughly my disk write throughput?

Yes, you are right! A rate limiter is all you need in most cases. We recommend setting it to about 80% of the disk write throughput. Besides, if you have better knowledge of your workload, you can use a customized admission picker.
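As a rough illustration of the 80% rule, here is a token-bucket admission check. It is a hypothetical stand-in written for this thread, not foyer's rate-limited picker, and `RateLimitedAdmission` and its methods are assumed names. The byte budget refills at 80% of the measured disk write throughput and is capped at one second's worth of bytes.

```rust
use std::time::Instant;

/// Hypothetical token-bucket admission check, NOT foyer's rate-limited picker.
/// The bucket refills at 80% of the disk write throughput and holds at most
/// one second's worth of bytes.
pub struct RateLimitedAdmission {
    rate: f64,     // admitted bytes per second
    tokens: f64,   // current byte budget
    last: Instant, // time of the last refill
}

impl RateLimitedAdmission {
    /// `disk_write_bps`: measured disk write throughput in bytes per second.
    pub fn new(disk_write_bps: f64, now: Instant) -> Self {
        let rate = disk_write_bps * 0.8; // 80% of the disk write throughput
        Self { rate, tokens: rate, last: now }
    }

    pub fn rate(&self) -> f64 {
        self.rate
    }

    /// Returns true if an entry of `size` bytes may be written to disk at `now`.
    pub fn admit_at(&mut self, size: usize, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.last = self.last.max(now);
        // Refill in proportion to elapsed time, capped at one second of budget.
        self.tokens = (self.tokens + elapsed * self.rate).min(self.rate);
        if self.tokens >= size as f64 {
            self.tokens -= size as f64;
            true
        } else {
            false // rejected entries do not consume budget
        }
    }
}
```

For example, a disk that sustains 100 MiB/s of writes gives a limit of 80 MiB/s. With this cap, a single entry larger than one second's budget is never admitted. A real picker may handle oversized entries differently.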

what flushers are is clear enough. But what are reclaimers? Should I scale them with flushers?

Reclaimers evict a region when the disk cache is full and the flushers need more space. A reclaimer deletes the index entries of the evicted entries and performs reinsertion (if a reinsertion picker is set). However, setting a reinsertion picker is risky if its policy is not good enough; in most cases, you will not need it. You can scale reclaimers with flushers, but you don't need many of them: 1/2/4/8 is enough for most cases.
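The reclamation step described above can be modeled roughly as follows. This is a simplified, hypothetical model, and `RegionCache` and its maps are illustrative names, not foyer internals. Reclaiming a region removes the index entries for the keys stored in it. If a reinsertion picker is configured, the selected keys are handed back for reinsertion.

```rust
use std::collections::HashMap;

/// Hypothetical model of a disk cache made of regions (not foyer internals).
pub struct RegionCache {
    /// key -> id of the region holding the entry
    pub index: HashMap<String, usize>,
    /// region id -> keys that were written into that region
    pub regions: HashMap<usize, Vec<String>>,
}

impl RegionCache {
    /// Reclaim one region: drop its index entries so lookups miss, and pass
    /// each evicted key to the optional reinsertion picker `reinsert`.
    /// Returns the keys the picker selected for reinsertion.
    pub fn reclaim<P: Fn(&str) -> bool>(&mut self, region: usize, reinsert: Option<P>) -> Vec<String> {
        let keys = self.regions.remove(&region).unwrap_or_default();
        let mut reinserted = Vec::new();
        for key in keys {
            // Skip keys whose index entry already points at a newer region.
            if self.index.get(&key) == Some(&region) {
                self.index.remove(&key);
                if let Some(pick) = &reinsert {
                    if pick(key.as_str()) {
                        reinserted.push(key);
                    }
                }
            }
        }
        reinserted
    }
}
```

Passing `None` as the picker corresponds to the recommended default of no reinsertion.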

buffer pool size: my understanding is that this buffer size should be greater than the largest file I handle, multiplied by the number of flusher. So if I have 2 flushers and my largest file is 150MiB, this buffer should be at least 300MiB (+ extra space for the keys and metadata). I noticed that if I set it too small I get a [lodt batch] serialize entry error caused by not enough space in the buffer.

Yes. BTW, could you please provide the error you got? The expected behavior is to log a warning and reject the entry; raising an error or panicking would be a bug.
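The sizing rule confirmed above is simple arithmetic, and the helper below just spells it out. `min_buffer_pool_size` is a hypothetical name. `per_entry_overhead` is a placeholder for the unspecified extra space for keys and metadata.

```rust
/// Bytes in one mebibyte.
pub const MIB: usize = 1024 * 1024;

/// Rule of thumb from this thread: every flusher may hold the largest entry
/// at the same time, plus room for its key and metadata.
/// `per_entry_overhead` is a placeholder, not a documented foyer value.
/// Returns None if the result would overflow usize.
pub fn min_buffer_pool_size(flushers: usize, largest_entry: usize, per_entry_overhead: usize) -> Option<usize> {
    largest_entry.checked_add(per_entry_overhead)?.checked_mul(flushers)
}
```

With 2 flushers and a largest entry of 150 MiB, this gives 300 MiB plus twice the overhead term, matching the estimate in the question.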

submit queue size threshold: something about inflight requests?

Yes. It prevents too many entries from piling up in the channel.
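One way to picture a submit queue size threshold is as a counter of bytes that have been submitted but not yet flushed. New entries are refused once that counter would exceed the threshold. The sketch below assumes this behavior, with byte-based accounting and rejection rather than blocking. `SubmitQueueGuard` is a hypothetical type, not foyer's code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical submit-queue guard (not foyer's code): tracks bytes that are
/// submitted but not yet flushed and rejects entries past `threshold`.
pub struct SubmitQueueGuard {
    in_flight: AtomicUsize,
    threshold: usize,
}

impl SubmitQueueGuard {
    pub fn new(threshold: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), threshold }
    }

    /// Try to enqueue an entry of `size` bytes; false means "drop it".
    pub fn try_submit(&self, size: usize) -> bool {
        let mut current = self.in_flight.load(Ordering::Relaxed);
        loop {
            if current.saturating_add(size) > self.threshold {
                return false; // queue would grow past the threshold
            }
            match self.in_flight.compare_exchange_weak(
                current,
                current + size,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // raced with another thread; retry
            }
        }
    }

    /// Called by a flusher once an entry of `size` bytes has been written out.
    pub fn complete(&self, size: usize) {
        self.in_flight.fetch_sub(size, Ordering::AcqRel);
    }

    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Relaxed)
    }
}
```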

clean region threshold: no idea.

It is used to control when the reclaimers should wake up and do reclamation. You can use the default configuration (without providing a value).
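Here is a hedged sketch of the idea. Reclaimers wake up when the number of clean (free, writable) regions falls below the threshold, and they reclaim enough regions to get back to it. The function names and the exact rule are assumptions for illustration, not foyer's implementation.

```rust
/// Hypothetical wake-up rule for reclaimers, for illustration only:
/// wake when clean (free, writable) regions drop below the threshold.
pub fn should_wake_reclaimers(clean_regions: usize, clean_region_threshold: usize) -> bool {
    clean_regions < clean_region_threshold
}

/// How many regions must be reclaimed to get back up to the threshold.
pub fn regions_to_reclaim(clean_regions: usize, clean_region_threshold: usize) -> usize {
    clean_region_threshold.saturating_sub(clean_regions)
}
```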


fracek · Oct 8, 2024 (Author)

Thank you, it's now much clearer how the parameters work and I think I can configure my workload properly now!

Yes. BTW, could you please provide the error you got? It is expected to raise a warning and reject the entry. Raising errors or panics should be bugs.

Sorry for the confusion: it is a foyer warning message caused by an inner error.

MrCroxx · 2024-10-08T03:04:08Z

Maintainer

And thank you for your advice, @fracek. I'm working on the code comments and documentation. Please ask for anything you need. Let's make them better. 🙌


MrCroxx · 2024-10-08T03:06:36Z

Maintainer

buffer pool size: my understanding is that this buffer size should be greater than the largest file I handle, multiplied by the number of flusher. So if I have 2 flushers and my largest file is 150MiB, this buffer should be at least 300MiB (+ extra space for the keys and metadata). I noticed that if I set it too small I get a [lodt batch] serialize entry error caused by not enough space in the buffer.

This may be related to the in-memory cache. Please let me know if the disk cache also has the problem. #467

