Specialize Dispatcher and Worker looping #444
Conversation
Thanks @hms!
This overhead is minimal, though. Active Record does effectively a no-op when the limit you pass is 0, which is why I did this, in favour of simpler code 🤔
Agreed, claimed_executions is clever and does rely on limit(0). But not before going through ReadyExecution.claim, which has to do a little work. Admittedly, this is a minor performance improvement at best. But it does free SQ down the road for optimizations that might not be able to take advantage of the limit(0) trick as easily.
Unfortunately, this was one in a series of planned PRs, and its parent, #425, needs a bit of cleanup. Since I have to do a small amount of git surgery to deal with the unnecessary dependency on #425, if you don't believe this change brings enough value, I'm happy to withdraw the PR. Otherwise, I'll rebase it on main (which I should have done in the first place).
That's true! Perhaps the easiest would be to handle the limit 0 case in …
I think this would be a bit different from before, in that it'll wait no matter what if there are fewer scheduled jobs than the batch_size.
I think there are two issues to help frame the way I'm thinking about these proposed changes:
With this PR:
Addressing your feedback on Worker changes:
I believe this reflects our divergence on the third bullet above. When Pool.idle? == false, #claim_executions does not change any SQ state, making it a no-op. While database engines have gotten a lot smarter about cutting off a query after parsing and before execution when limit(0) is detected, the query still has to be parsed (I'm not sure whether Rails is smart enough to use a prepared statement here), and the request represents a network round trip for non-SQLite installations -- once every Worker.polling_interval. On "Big Iron", say that spiffy new and over-provisioned Dell that David is always blogging about 😏, the DB overhead is almost nothing. But on my tiny little slice of Heroku (or any other small VPS), it represents measurable work that returns zero value. We have the heartbeat for proof-of-life 😇. So, entering an extended interruptible_sleep instead of polling is effectively the same as the current implementation (just with less polling).
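To make the Worker half of this concrete, here is a minimal sketch of the shape being described. The names pool, pool.idle?, pool.post, claim_executions and polling_interval are taken from the discussion; the body is an assumed illustration of the idea, not the PR's exact code.

```ruby
# Sketch only: skip claiming when the pool has no idle threads and hand the
# poller a long delay so it effectively waits for a wake-up event instead of
# polling. Method and attribute names are assumed from the discussion above.
def poll
  if pool.idle?
    claim_executions.each { |execution| pool.post(execution) }
    polling_interval   # keep the normal poll cadence while there is capacity
  else
    10.minutes         # "wake-on-event": nothing to do until capacity frees up
  end
end
```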
Addressing your feedback on Dispatcher changes:

I think the use-case I missed 😞 is where even just one of the dispatched jobs was on a priority queue that had the resources to be processed within the Dispatcher.polling_interval window. In that case, the current PR implementation could represent a degradation of service. Changing from […] returns the original priority queue performance profile. I'll make that correction and resubmit.
Ok, sounds good! Thanks for taking the time to write down these arguments 🙇‍♀️ Agree on the changes!
PR feedback addressed. As usual, thank you for all of your time helping me get these changes right.
@@ -24,7 +24,8 @@ def metadata
     private
       def poll
         batch = dispatch_next_batch
-        batch.size
+
+        batch.size.zero? ? polling_interval : 0.seconds
Does Concurrent::Promises.future(time) work correctly for time = 0.seconds?
I did test it with the queue-based version of interruptible_sleep. I could add a test that would help everyone sleep better "knowing" it works.
I know I wouldn't take this @hms guy's word for it... 😉
Thanks a lot to you for your thoughtful explanations and changes! 🙏 I think this is ready, just would need some rebasing/cherry-picking to separate the changes about polling from the rest.
The Worker and Dispatcher share the same poll loop logic (Poller#start_loop) while having different functional requirements. The Worker keeps poll looping despite not being able to execute new jobs when at capacity. The Dispatcher does require polling, but relies on shared logic in Poller#start_loop for a Dispatcher-specific optimization.

Changes: Move the logic controlling the sleep interval per poll from Poller#start_loop into Worker#poll and Dispatcher#poll by requiring #poll to return the delay value passed into interruptible_sleep (a sketch of the resulting loop follows below).

Poller#start_loop:
* Removes the test based on the number of rows processed by #poll. This was Dispatcher-specific logic.

Worker#poll:
* When the Worker is at full capacity: return a large value (10.minutes), effectively transforming Poller#start_loop from polling to wake-on-event.
* When the Worker is below capacity: return polling_interval and maintain the poll timing until ReadyExecutions become available.

Dispatcher#poll:
* When due ScheduledExecutions.zero?: return polling_interval and maintain the existing poll timing when no ScheduledExecutions are available to process.
* When due ScheduledExecutions.positive?: return 0. This results in interruptible_sleep(0), which returns immediately without introducing any delays/sleeps between polls. It also preserves the existing behavior of looping through ScheduledExecutions via poll in order to check for shutdown requests between dispatch_next_batch iterations.
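Concretely, the loop shape this describes would look roughly like the following. This is an assumed sketch based on the summary above (shutting_down? and interruptible_sleep are referenced in the discussion), not the PR's code verbatim.

```ruby
# Sketch: start_loop no longer decides how long to sleep; each #poll returns
# the delay to pass to interruptible_sleep (0 means loop again immediately).
def start_loop
  loop do
    break if shutting_down?

    delay = poll
    interruptible_sleep(delay)
  end
end
```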
Rebased without the extra code that's not part of this PR. Also, it turns out I did write a test to prove that Dispatcher#poll returning 0 did, in fact, sleep 0 -- although the confirmation is more via side-effect than direct test (see the "sleeps…" test in dispatcher_test.rb).
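For illustration, the kind of side-effect assertion being referred to could look something like the sketch below. The class name, constructor options, and setup here are hypothetical, not the actual dispatcher_test.rb test.

```ruby
require "test_helper"
require "benchmark"

# Hypothetical sketch, not the real test: confirm that a 0-second delay from
# #poll translates into an interruptible_sleep that returns immediately.
# Dispatcher constructor options and interruptible_sleep access are assumed.
class DispatcherPollTimingTest < ActiveSupport::TestCase
  test "a 0-second delay from poll means no sleep before the next poll" do
    dispatcher = SolidQueue::Dispatcher.new(polling_interval: 5, batch_size: 10)

    elapsed = Benchmark.realtime { dispatcher.send(:interruptible_sleep, 0.seconds) }

    assert elapsed < 0.1, "expected interruptible_sleep(0) to return almost immediately"
  end
end
```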
Thanks a lot @hms! I'm going to run this for a bit in production before merging 🙏
We've been running this in production for 2 days and it's working well 👍 Thanks @hms!
The Worker and Dispatcher share the same poll loop logic (Poller#start_loop) while having different functional requirements. The Worker keeps polling even when it cannot execute new jobs because it is at capacity. The Dispatcher does require polling, but relies on shared logic in Poller#start_loop for a Dispatcher-specific optimization.
This PR allows the Worker to switch from polling to wake-on-event when it's at capacity, eliminating the overhead of Worker#poll (really Worker#claim_executions) when it's known ahead of time that #poll will be a no-op.
Changes:

Move the logic controlling the sleep interval from Poller#start_loop into Worker#poll and Dispatcher#poll by requiring #poll to return the delay value passed into interruptible_sleep.

Poller#start_loop:
* Removes the test based on the number of rows processed by #poll. This was Dispatcher-specific logic.

Worker#poll:
* When the Worker is at full capacity: return a large value (10.minutes), effectively transforming Poller#start_loop from polling to wake-on-event.
* When the Worker is below capacity: return polling_interval and maintain the poll timing until ReadyExecutions become available.

Dispatcher#poll (see the sketch after this list):
* When due ScheduledExecutions < batch_size: return polling_interval and maintain the existing poll timing.
* When due ScheduledExecutions >= batch_size: return 0 and do not sleep between loops until all due ScheduledExecutions are processed. Loop via poll requests with sleep 0 (instead of a simple loop in #poll) to check for shutdown requests between loops.
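Read literally, the Dispatcher#poll described in this list would look something like the sketch below. Treat it as an illustration of this description rather than the final code: the diff discussed earlier in the conversation compares against zero rather than batch_size.

```ruby
# Sketch of Dispatcher#poll as described above: keep the polling_interval only
# when the dispatched batch came back smaller than batch_size, otherwise return
# 0 so the poller loops again immediately without sleeping.
def poll
  batch = dispatch_next_batch
  batch.size < batch_size ? polling_interval : 0.seconds
end
```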