Specialize Dispatcher and Worker looping #444

Merged: 1 commit into rails:main on Dec 19, 2024

Conversation

@hms (Contributor) commented Dec 6, 2024

The Worker and Dispatcher share the same poll loop logic (Poller#start_loop) while having different functional requirements. The Worker keeps polling even when it is at capacity and cannot execute new jobs. The Dispatcher does require polling, but relies on shared logic in Poller#start_loop for a Dispatcher-specific optimization.

This PR allows the Worker to switch from polling to wake-on-event when it's at capacity, and eliminates the overhead of Worker#poll (really Worker#claim_executions) when it's known ahead of time that #poll will be a no-op.

Changes:

Move the logic controlling the sleep interval from Poller#start_loop into Worker#poll and Dispatcher#poll by requiring #poll to return the delay value passed into interruptible_sleep (a sketch of the resulting shape follows this list).

Poller#start_loop:

  • Removes the test based on the number of rows processed by #poll. This was Dispatcher-specific logic.

Worker#poll:

  • When Worker at full capacity: return a large value (10.minutes) effectively transforming Poller#start_loop from polling to wake-on-event.
  • When Worker < capacity: return polling_interval and maintain the poll timing until ReadyExecutions become available.

Dispatcher#poll:

  • When due ScheduledExecutions < batch_size: return polling_interval and maintain the existing poll timing.
  • When due ScheduledExecutions >= batch_size: return 0 and do not sleep between loops until all due ScheduledExecutions are processed. Looping via poll with a sleep of 0 (instead of a simple loop inside #poll) allows shutdown requests to be checked between batches.
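A rough sketch of the resulting shape (illustrative only, not the actual Solid Queue source: `pool_has_capacity?` is a stand-in for the real capacity check, and the Dispatcher condition shown is the one proposed here, which gets revised later in this conversation):

```ruby
# Illustrative sketch of the proposed split -- not the actual Solid Queue code.
module Poller
  # The loop no longer inspects what #poll accomplished; it only paces itself
  # by the delay #poll returns and checks for shutdown between iterations.
  def start_loop
    loop do
      break if shutting_down?

      delay = poll                 # Worker#poll / Dispatcher#poll decide the pacing
      interruptible_sleep(delay)   # a delay of 0 loops again immediately
    end
  end
end

class Worker
  def poll
    return 10.minutes unless pool_has_capacity?  # stand-in predicate: at capacity, wait to be woken

    claim_executions               # claim ready executions and hand them to the pool (simplified)
    polling_interval
  end
end

class Dispatcher
  def poll
    batch = dispatch_next_batch
    # A full batch suggests more due jobs are waiting: loop again without sleeping.
    batch.size < batch_size ? polling_interval : 0.seconds
  end
end
```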

@rosa (Member) commented Dec 7, 2024

Thanks @hms!

eliminates the overhead of Worker#poll (really Worker#claim_executions) when it's known ahead of time #poll will be a no-op.

This overhead is minimal, though. Active Record does effectively a no-op when the limit you pass is 0, which is why I did this, in favour of simpler code 🤔

@hms (Contributor, Author) commented Dec 7, 2024

Agreed, claim_executions is clever and uses the limit 0 trick. But not before going through ReadyExecution.claim, which has to do a little work.

Admittedly, this is a minor performance improvement at best. But it does free SQ down the road for optimizations that might not be able to take advantage of the limit 0 trick as easily.

@hms (Contributor, Author) commented Dec 9, 2024

@rosa

Unfortunately, this was one in a series of planned PRs, and its parent, #425, needs a bit of cleanup. Since I have to do a small amount of git surgery to deal with the unnecessary dependency on #425, if you don't believe this change brings enough value, I'm happy to withdraw the PR. Otherwise, I'll rebase it on main (which I should have done in the first place).

@rosa (Member) commented Dec 10, 2024

Agreed, claim_executions is clever and uses the limit 0 trick. But not before going through ReadyExecution.claim, which has to do a little work.

That's true! Perhaps the easiest would be to handle the limit 0 case in ReadyExecution.claim 🤔
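Roughly something like this (just a sketch, assuming claim keeps its current (queue_list, limit, process_id) signature):

```ruby
# Rough sketch only -- an early return before any queue parsing or SQL.
module SolidQueue
  class ReadyExecution < Record
    def self.claim(queue_list, limit, process_id)
      return [] if limit <= 0  # nothing can be claimed; skip the round trip entirely

      # ... existing claiming logic would follow here unchanged ...
    end
  end
end
```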

About Dispatcher#poll

When due ScheduledExecutions < batch_size: return polling_interval and maintain the existing poll timing.

I think this would be a bit different from before in that it'll wait no matter what if there are fewer scheduled jobs than the batch_size, even if new jobs have become due while dispatching the previous, smaller-than-batch_size batch. Not a big deal in any case, depending on batch_size and polling_interval, but the dispatching will be slower 🤔

@hms (Contributor, Author) commented Dec 10, 2024

@rosa

I think there are two issues to help frame the way I'm thinking about these proposed changes:

  • Should Poller#start_loop be "smart" -- have some logic controlling the loop (like it does now)?
  • If Poller#start_loop is just responsible for looping and checking for shutdown, where should the "smarts", if any, belong?

With this PR:

  • Poller#start_loop doesn't have any logic specific to the client (Dispatcher, Worker). It is a "dumb" loop.
  • Dispatcher#poll and Worker#poll drive the required looping behavior via logic expressed in one simple line of code.
  • It's clearer and more efficient to make looping decisions in #poll than to bury looping-specific optimizations in queries (see comments below).

Addressing your feedback on Worker changes:

That's true! Perhaps the easiest would be to handle the limit 0 case in ReadyExecution.claim 🤔

I believe this reflects our divergence on the third bullet above. When Pool.idle? == false, #claim_executions does not change any SQ state, making it a no-op. While all of the database engines have gotten a lot smarter about cutting the query off after parsing and before execution when limit(0) is detected, the query still has to be parsed (I'm not sure if Rails is smart enough to use a prepared query here), and the request represents a network round trip for non-SQLite implementations -- every Worker.polling_interval.

On "Big Iron", say that spiffy new and over provisioned Dell that David is always blogging about 😏, the DB overhead is almost nothing. But on my tiny little slice of Heroku (or any other small VSP), it represents something that's measurable that returns zero value. We have the heartbeat for proof-of-life 😇.

So, entering an extended interruptible_sleep instead of polling is effectively the same as the current implementation, just with less polling.
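For a rough sense of scale (assuming the default 0.1 second worker polling_interval): a worker pinned at capacity would otherwise fire that no-op claim query about 10 times per second, on the order of 36,000 round trips per hour that can never return work.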

Addressing your feedback on Dispatcher changes:

I think this would be a bit different from before in that it'll wait no matter what if there are fewer scheduled jobs than the batch_size, even if new jobs have become due while dispatching the previous, smaller-than-batch_size batch. Not a big deal in any case, depending on batch_size and polling_interval, but the dispatching will be slower 🤔

I think the use-case I missed 😞 is where even just one of the dispatched jobs is on a priority queue with the resources to process it within the Dispatcher.polling_interval window. Then the current PR implementation could represent a degradation of service.

Changing from:
`batch.size < batch_size ? polling_interval : 0.seconds`
to:
`batch.size.zero? ? polling_interval : 0.seconds`

returns the original priority queue performance profile. I'll make that correction and resubmit.

@rosa (Member) commented Dec 10, 2024

Ok, sounds good! Thanks for taking the time to write down these arguments 🙇‍♀️ Agree on the changes!

@hms (Contributor, Author) commented Dec 10, 2024

@rosa

PR feedback addressed. As usual, thank you for all of your time helping me get these changes right.

@@ -24,7 +24,8 @@ def metadata
    private
      def poll
        batch = dispatch_next_batch
-       batch.size
+
+       batch.size.zero? ? polling_interval : 0.seconds
@rosa (Member) commented on the diff:
Does Concurrent::Promises.future(time) work correctly for time = 0.seconds?

@hms (Contributor, Author) replied:
I did test it with the queue-based version of interruptible_sleep. I could add a test that would help everyone sleep better "knowing" it works.

I know I wouldn't take this @hms guy's word for it... 😉
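For the record, here's roughly how the queue-based version behaves with a 0-second timeout. This is just an illustrative stand-in, not the SolidQueue implementation, and it assumes Ruby >= 3.2 for Queue#pop(timeout:):

```ruby
# Minimal stand-in for a queue-based interruptible sleep -- not SolidQueue's code.
# Requires Ruby >= 3.2 for Thread::Queue#pop(timeout:).
class SleeperDemo
  def initialize
    @queue = Thread::Queue.new
  end

  # Blocks for up to `seconds`, or returns early if #wake_up pushes a message.
  # With seconds = 0 the pop times out immediately and the caller loops again.
  def interruptible_sleep(seconds)
    @queue.pop(timeout: seconds.to_f)
  end

  def wake_up
    @queue << true
  end
end

demo = SleeperDemo.new
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
demo.interruptible_sleep(0)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts format("interruptible_sleep(0) returned after %.4fs", elapsed)
```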

@rosa (Member) commented Dec 12, 2024

Thanks a lot to you for your thoughtful explanations and changes! 🙏

I think this is ready, just would need some rebasing/cherry-picking to separate the changes about polling from the rest.

The Worker and Dispatcher share the same poll loop logic
(Poller#start_loop) while having different functional requirements.

The Worker keeps poll looping even when it is at capacity and cannot
execute new jobs.

The Dispatcher does require polling, but is reliant on shared
logic in Poller#start_loop for a Dispatcher-specific optimization.

Changes:

Move the logic controlling the sleep interval per poll from
Poller#start_loop into Worker#poll and Dispatcher#poll by
requiring #poll to return the `delay` value passed into
interruptible_sleep.

Poller#start_loop:
* Removes the test based on the number of rows processed by #poll. This
  was Dispatcher-specific logic.

Worker#poll:
* When Worker at full capacity: return a large value (10.minutes)
  effectively transforming Poller#start_loop from polling to wake-on-event.
* When Worker < capacity: return `polling_interval` and maintain the poll
  timing until ReadyExecutions become available.

Dispatcher#poll:
* When the batch of due ScheduledExecutions is empty: return
  `polling_interval` and maintain the existing poll timing.
* When the batch of due ScheduledExecutions is non-empty: return 0. This
  results in interruptible_sleep(0), which returns immediately and without
  introducing any delays/sleeps between polls. This also allows the existing
  behavior of looping through ScheduledExecutions via poll in order to check
  for shutdown requests between dispatch_next_batch iterations.
@hms (Contributor, Author) commented Dec 13, 2024

@rosa

Rebased without the extra code that's not part of this PR.

Also, it turns out I did write a test to prove that Dispatcher#poll returning 0 does, in fact, sleep 0 -- although the confirmation is more via side-effect than a direct test (see dispatcher_test.rb, "sleeps 0.seconds between polls if there are any ready to dispatch jobs").

@rosa (Member) commented Dec 17, 2024

Thanks a lot @hms! I'm going to run this for a bit in production before merging 🙏

@rosa (Member) commented Dec 19, 2024

We've been running this in production for 2 days and it's working well 👍 Thanks @hms!

@rosa merged commit 6f10ab3 into rails:main on Dec 19, 2024
4 checks passed