Specialize Dispatcher and Worker looping #444

Merged: 1 commit into rails:main on Dec 19, 2024

Conversation

@hms (Contributor) commented Dec 6, 2024

The Worker and Dispatcher share the same poll loop logic (Poller#start_loop) while having different functional requirements. The Worker keeps polling even when it is at capacity and cannot execute new jobs. The Dispatcher does require polling, but relies on shared logic in Poller#start_loop for a Dispatcher-specific optimization.

This PR allows the Worker to switch from polling to wake-on-event when it's at capacity, and eliminates the overhead of Worker#poll (really Worker#claim_executions) when it's known ahead of time that #poll will be a no-op.

Changes:

Move the logic controlling the sleep interval from Poller#start_loop into Worker#poll and Dispatcher#poll by requiring #poll to return the delay value passed into interruptible_sleep (a sketch of the resulting shape follows this list).

Poller#start_loop:

  • Removes the test based on the number of rows processed by #poll. This was Dispatcher-specific logic.

Worker#poll:

  • When Worker at full capacity: return a large value (10.minutes) effectively transforming Poller#start_loop from polling to wake-on-event.
  • When Worker < capacity: return polling_interval and maintain the poll timing until ReadyExecutions become available.

Dispatcher#poll:

  • When due ScheduledExecutions < batch_size: return polling_interval and maintain the existing poll timing.
  • When due ScheduledExecutions >= batch_size: return 0 and do not sleep between loops until all due ScheduledExecutions are processed. Looping via poll with a sleep of 0 (instead of a simple loop inside #poll) allows shutdown requests to be checked between batches.
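A rough sketch of the resulting shape (illustrative only, not the actual Solid Queue source: `pool_has_capacity?` is a stand-in for the real capacity check, and the Dispatcher condition shown is the one proposed here, which gets revised later in this conversation):

```ruby
# Illustrative sketch of the proposed split -- not the actual Solid Queue code.
module Poller
  # The loop no longer inspects what #poll accomplished; it only paces itself
  # by the delay #poll returns and checks for shutdown between iterations.
  def start_loop
    loop do
      break if shutting_down?

      delay = poll                 # Worker#poll / Dispatcher#poll decide the pacing
      interruptible_sleep(delay)   # a delay of 0 loops again immediately
    end
  end
end

class Worker
  def poll
    return 10.minutes unless pool_has_capacity?  # stand-in predicate: at capacity, wait to be woken

    claim_executions               # claim ready executions and hand them to the pool (simplified)
    polling_interval
  end
end

class Dispatcher
  def poll
    batch = dispatch_next_batch
    # A full batch suggests more due jobs are waiting: loop again without sleeping.
    batch.size < batch_size ? polling_interval : 0.seconds
  end
end
```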

@rosa (Member) commented Dec 7, 2024

Thanks @hms!

eliminates the overhead of Worker#poll (really Worker#claim_executions) when it's known ahead of time #poll will be a no-op.

This overhead is minimal, though. Active Record does effectively a no-op when the limit you pass is 0, which is why I did this, in favour of simpler code 🤔

@hms (Contributor, Author) commented Dec 7, 2024

Agreed, claim_executions is clever and uses the limit 0 trick. But not before going through ReadyExecution.claim, which has to do a little work.

Admittedly, this is a minor performance improvement at best. But it does free SQ down the road for optimizations that might not be able to take advantage of the limit 0 trick as easily.

@hms (Contributor, Author) commented Dec 9, 2024

@rosa

Unfortunately, this was one in a series of planned PRs, and its parent, #425, needs a bit of cleanup. Since I have to do a small amount of git surgery to deal with the unnecessary dependency on #425, if you don't believe this change brings enough value, I'm happy to withdraw the PR. Otherwise, I'll rebase it on main (which I should have done in the first place).

@rosa (Member) commented Dec 10, 2024

Agreed, claim_executions is clever and uses the limit 0 trick. But not before going through ReadyExecution.claim, which has to do a little work.

That's true! Perhaps the easiest would be to handle the limit 0 case in ReadyExecution.claim 🤔
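Roughly something like this (just a sketch, assuming claim keeps its current (queue_list, limit, process_id) signature):

```ruby
# Rough sketch only -- an early return before any queue parsing or SQL.
module SolidQueue
  class ReadyExecution < Record
    def self.claim(queue_list, limit, process_id)
      return [] if limit <= 0  # nothing can be claimed; skip the round trip entirely

      # ... existing claiming logic would follow here unchanged ...
    end
  end
end
```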

About Dispatcher#poll

When due ScheduledExecutions < batch_size: return polling_interval and maintain the existing poll timing.

I think this would be a bit different from before in that it'll wait no matter what if there are fewer scheduled jobs than the batch_size, even if new jobs have become due while dispatching the previous, smaller-than-batch_size batch. Not a big deal in any case, depending on batch_size and polling_interval, but the dispatching will be slower 🤔

@hms (Contributor, Author) commented Dec 10, 2024

@rosa

I think there are two issues to help frame the way I'm thinking about these proposed changes:

  • Should Poller#start_loop be "smart" -- have some logic controlling the loop (like it does now)?
  • If Poller#start_loop is just responsible for looping and checking for shutdown, where should the "smarts", if any, belong?

With this PR:

  • Poller#start_loop doesn't have any logic specific to the client (Dispatcher, Worker). It is a "dumb" loop.
  • Dispatcher#poll and Worker#poll drive the required looping behavior via logic expressed in one simple line of code.
  • It's clearer and more efficient to make looping decisions in #poll than to bury looping-specific optimizations in queries (see comments below).

Addressing your feedback on Worker changes:

That's true! Perhaps the easiest would be to handle the limit 0 case in ReadyExecution.claim 🤔

I believe this reflects our divergence on the third bullet above. When Pool.idle? == false, #claim_executions does not change any SQ state, making it a no-op. While all of the database engines have gotten a lot smarter about cutting the query off after parsing and before execution when limit(0) is detected, the query still has to be parsed (I'm not sure if Rails is smart enough to use a prepared query here), and the request represents a network round trip for non-SQLite implementations -- every Worker.polling_interval.

On "Big Iron", say that spiffy new and over provisioned Dell that David is always blogging about 😏, the DB overhead is almost nothing. But on my tiny little slice of Heroku (or any other small VSP), it represents something that's measurable that returns zero value. We have the heartbeat for proof-of-life 😇.

So, entering an extended interruptible_sleep instead of polling is effectively the same as the current implementation, just with less polling.
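For a rough sense of scale (assuming the default 0.1 second worker polling_interval): a worker pinned at capacity would otherwise fire that no-op claim query about 10 times per second, on the order of 36,000 round trips per hour that can never return work.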

Addressing your feedback on Dispatcher changes:

I think this would be a bit different from before in that it'll wait no matter what if there are fewer scheduled jobs than the batch_size, even if new jobs have become due while dispatching the previous, smaller-than-batch_size batch. Not a big deal in any case, depending on batch_size and polling_interval, but the dispatching will be slower 🤔

I think the use-case I missed 😞 is where even just one of the dispatched jobs is on a priority queue with the resources to process it within the Dispatcher.polling_interval window. Then the current PR implementation could represent a degradation of service.

Changing from:
`batch.size < batch_size ? polling_interval : 0.seconds`
to:
`batch.size.zero? ? polling_interval : 0.seconds`

returns the original priority queue performance profile. I'll make that correction and resubmit.

@rosa (Member) commented Dec 10, 2024

Ok, sounds good! Thanks for taking the time to write down these arguments 🙇‍♀️ Agree on the changes!

@hms (Contributor, Author) commented Dec 10, 2024

@rosa

PR feedback addressed. As usual, thank you for all of your time helping me get these changes right.

@@ -24,7 +24,8 @@ def metadata
    private
      def poll
        batch = dispatch_next_batch
-       batch.size
+
+       batch.size.zero? ? polling_interval : 0.seconds
@rosa (Member) commented on the diff:
Does Concurrent::Promises.future(time) work correctly for time = 0.seconds?

@hms (Contributor, Author) replied:
I did test it with the queue-based version of interruptible_sleep. I could add a test that would help everyone sleep better "knowing" it works.

I know I wouldn't take this @hms guy's word for it... 😉
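For the record, here's roughly how the queue-based version behaves with a 0-second timeout. This is just an illustrative stand-in, not the SolidQueue implementation, and it assumes Ruby >= 3.2 for Queue#pop(timeout:):

```ruby
# Minimal stand-in for a queue-based interruptible sleep -- not SolidQueue's code.
# Requires Ruby >= 3.2 for Thread::Queue#pop(timeout:).
class SleeperDemo
  def initialize
    @queue = Thread::Queue.new
  end

  # Blocks for up to `seconds`, or returns early if #wake_up pushes a message.
  # With seconds = 0 the pop times out immediately and the caller loops again.
  def interruptible_sleep(seconds)
    @queue.pop(timeout: seconds.to_f)
  end

  def wake_up
    @queue << true
  end
end

demo = SleeperDemo.new
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
demo.interruptible_sleep(0)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts format("interruptible_sleep(0) returned after %.4fs", elapsed)
```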

@rosa (Member) commented Dec 12, 2024

Thanks a lot to you for your thoughtful explanations and changes! 🙏

I think this is ready, just would need some rebasing/cherry-picking to separate the changes about polling from the rest.

The Worker and Dispatcher share the same poll loop logic
(Poller#start_loop) while having different functional requirements.

The Worker keeps poll looping even when it is at capacity and cannot
execute new jobs.

The Dispatcher does require polling, but is reliant on shared
logic in Poller#start_loop for a Dispatcher-specific optimization.

Changes:

Move the logic controlling the sleep interval per poll from
Poller#start_loop into Worker#poll and Dispatcher#poll by
requiring #poll to return the `delay` value passed into
interruptible_sleep.

Poller#start_loop:
* Removes the test based on the number of rows processed by #poll. This
  was Dispatcher-specific logic.

Worker#poll:
* When Worker at full capacity: return a large value (10.minutes)
  effectively transforming Poller#start_loop from polling to wake-on-event.
* When Worker < capacity: return `polling_interval` and maintain the poll
  timing until ReadyExecutions become available.

Dispatcher#poll:
* When the batch of due ScheduledExecutions is empty: return
  `polling_interval` and maintain the existing poll timing.
* When the batch of due ScheduledExecutions is non-empty: return 0. This
  results in interruptible_sleep(0), which returns immediately and without
  introducing any delays/sleeps between polls. This also allows the existing
  behavior of looping through ScheduledExecutions via poll in order to check
  for shutdown requests between dispatch_next_batch iterations.
@hms (Contributor, Author) commented Dec 13, 2024

@rosa

Rebased without the extra code that's not part of this PR.

Also, it turns out I did write a test to prove that Dispatcher#poll returning 0 does, in fact, sleep 0 -- although the confirmation is more via side-effect than a direct test (see dispatcher_test.rb, "sleeps 0.seconds between polls if there are any ready to dispatch jobs").

@rosa (Member) commented Dec 17, 2024

Thanks a lot @hms! I'm going to run this for a bit in production before merging 🙏

@rosa (Member) commented Dec 19, 2024

We've been running this in production for 2 days and it's working well 👍 Thanks @hms!

@rosa merged commit 6f10ab3 into rails:main on Dec 19, 2024
4 checks passed