Dynamically adjust fetch debounce with PID controller #23313

Merged
merged 2 commits into redpanda-data:dev on Oct 9, 2024

Conversation

ballard26
Contributor

Often when Redpanda is at CPU saturation the fetch scheduling group can
starve other operations. In these situations increasing the time a fetch
request waits on the server before starting allows for Redpanda to apply
backpressure to the clients and increase batching for fetch responses.
This increased batching frees up CPU resources for other operations to
use and tends to decrease end-to-end latency.

From testing, it's been found empirically that when Redpanda is at
saturation, restricting the fetch scheduling group to consume only 20% of
overall reactor utilization improves end-to-end latency for a variety of
workloads.

This commit implements a PID controller that will dynamically adjust
fetch debounce to ensure that the fetch scheduling group is only
consuming 20% of overall reactor utilization when Redpanda is at
saturation.

The test results for the controller can be found here
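
For illustration, a minimal sketch of the kind of controller described above follows. It is not the actual Redpanda implementation; the class name, gains, and limits are assumptions made for the example. The idea is that the controller periodically samples the fetch scheduling group's share of reactor utilization and nudges the fetch debounce delay up or down until that share converges on the configured target (20% in the description above).

// Hypothetical sketch only; names, gains, and limits are illustrative and
// not taken from the Redpanda sources.
#include <algorithm>
#include <chrono>

class fetch_debounce_pid {
public:
    fetch_debounce_pid(double target_util, double kp, double ki, double kd)
      : _target(target_util), _kp(kp), _ki(ki), _kd(kd) {}

    // fetch_util: the fetch scheduling group's share of overall reactor
    // utilization, a fraction between 0 and 1, sampled once per interval.
    std::chrono::milliseconds update(double fetch_util) {
        // Positive error means fetch is using more than its target share,
        // so the debounce delay should grow and force more batching.
        const double error = fetch_util - _target;
        _integral = std::clamp(_integral + error, -_integral_limit, _integral_limit);
        const double derivative = error - _prev_error;
        _prev_error = error;

        const double adjustment = _kp * error + _ki * _integral + _kd * derivative;
        // Keep the delay in a bounded range: 0 adds no extra wait, the
        // maximum strongly throttles how often fetches are scheduled.
        _delay_ms = std::clamp(_delay_ms + adjustment, 0.0, _max_delay_ms);
        return std::chrono::milliseconds(static_cast<long>(_delay_ms));
    }

private:
    double _target;
    double _kp, _ki, _kd;
    double _integral{0.0};
    double _prev_error{0.0};
    double _delay_ms{0.0};
    static constexpr double _integral_limit = 100.0;
    static constexpr double _max_delay_ms = 100.0;
};

A real controller would also need to handle anti-windup and reset its state when the feature is toggled, but the control loop itself is roughly this small.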

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

Member

@StephanDollberg left a comment


few mechanical comments.

Will review the actual function later.

src/v/config/configuration.cc (outdated, resolved)
src/v/kafka/server/handlers/fetch.cc (outdated, resolved)
src/v/kafka/server/handlers/fetch.cc (resolved)
Feediver1 previously approved these changes Sep 16, 2024

@Feediver1 left a comment


please add units in src/v/config/configuration.cc L687-8

@StephanDollberg
Member

Just restating what we discussed in person here:

I think we should just go ahead and merge the simplest form (PID controller on the coordinator shard) behind a feature flag. At the same time, add some metrics that help us judge how well it works. Then we can selectively enable it in cloud.
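
As a rough illustration of that suggestion, the controller output could be gated behind a config flag and exported as a metric. The property, struct, and gauge names below are invented for this sketch; they are not the ones added by the PR.

// Hypothetical sketch of gating the controller behind a feature flag and
// recording its output; all names here are invented for illustration.
#include <atomic>
#include <chrono>

struct fetch_debounce_config {
    bool pid_controller_enabled{false};           // cluster config flag
    std::chrono::milliseconds static_debounce{1}; // existing fixed debounce
};

// Gauge an operator could watch to judge how well the controller works.
std::atomic<long> fetch_pid_delay_ms_gauge{0};

std::chrono::milliseconds effective_fetch_debounce(
  const fetch_debounce_config& cfg, std::chrono::milliseconds pid_delay) {
    if (!cfg.pid_controller_enabled) {
        // Feature flag off: behave exactly as before.
        return cfg.static_debounce;
    }
    fetch_pid_delay_ms_gauge.store(pid_delay.count());
    return pid_delay;
}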

@travisdowns
Member

Often when Redpanda is at CPU saturation the fetch scheduling group can
starve other operations. In these situations increasing the time a fetch
request waits on the server before starting allows for Redpanda to apply
backpressure to the clients and increase batching for fetch responses.

Is this right? I wouldn't say it's a case of fetch starving other groups specifically, but of being at CPU saturation, so in a sense all groups are starving all other groups: i.e., fetch isn't specifically a bad guy. What's different about fetch is that we have pretty good control over fetch granularity, and can decide to return 10 * x bytes in 1 response instead of x bytes in 10 responses, which could be close to 10x more efficient in many cases.

I'm not even sure I'd call this "backpressure" either: the client may consume at the same rate (or even faster!); it's just that we are applying server-side logic to send larger batches to increase batching. That doesn't matter too much for the PR message, but we should cover it on our wiki doc page, and I think it matters more there.

StephanDollberg previously approved these changes Oct 2, 2024
@ballard26
Contributor Author

Often when Redpanda is at CPU saturation the fetch scheduling group can
starve other operations. In these situations increasing the time a fetch
request waits on the server before starting allows for Redpanda to apply
backpressure to the clients and increase batching for fetch responses.

Is this right? I wouldn't say it's a case of fetch starving other groups specifically, but of being at CPU saturation, so in a sense all groups are starving all other groups: i.e., fetch isn't specifically a bad guy. What's different about fetch is that we have pretty good control over fetch granularity, and can decide to return 10 * x bytes in 1 response instead of x bytes in 10 responses, which could be close to 10x more efficient in many cases.

I'm not even sure I'd call this "backpressure" either: the client may consume at the same rate (or even faster!); it's just that we are applying server-side logic to send larger batches to increase batching. That doesn't matter too much for the PR message, but we should cover it on our wiki doc page, and I think it matters more there.

I largely agree with this. I think it's just a matter of different (likely better) wording. At the end of the day, when RP is at saturation all groups are starved for resources. However, with the fetch group we can schedule requests more efficiently with the PID controller, allowing the same overall amount of data to be consumed with the same or better overall latency while requiring fewer reactor resources to fulfill it.

, fetch_pid_target_utilization(
*this,
"fetch_pid_target_utilization",
"The overall precentage of reactor utilization the fetch scheduling "
Member


This should say "fraction between 0 and 1" not "precentage" since percentage is always between 0 and 100.

Contributor Author


Updated the description to make it clear that it's a fraction, not a percentage.
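
For reference, the reworded registration presumably reads something like the following; the exact string is assumed here and not copied from the final diff:

, fetch_pid_target_utilization(
*this,
"fetch_pid_target_utilization",
"The fraction of overall reactor utilization (a value between 0 and 1) "
"that the fetch scheduling group should consume when Redpanda is at "
"saturation"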

Member

@travisdowns left a comment


just change the percentage thing in the config

Member

@travisdowns left a comment


Great, looking forward to this!

@vbotbuildovich (Collaborator) commented Oct 9, 2024

@piyushredpanda piyushredpanda merged commit 497c5de into redpanda-data:dev Oct 9, 2024
17 checks passed