rfc: optimize and reduce traffic from delivery workers #970
Open
epoberezkin wants to merge 1 commit into master from ep/rfc-delivery-workers

# Optimizing traffic from delivery workers

## Problem

Currently we create a separate delivery worker for each queue. The historic reason is that the message queue used to be held in client memory, initialized from the database on client start, and each failed message is retried on temporary errors. Temporary errors are of two types: server availability (network error, server is down, etc.) and queue quota exceeded (the queue becomes available again only once the recipient receives the messages or they expire on the server). The latter error is per queue, which is the reason to have a separate worker for each queue - so that delivery to other queues on the same transport session is not blocked - while the former is consistent across the session.

Having multiple workers per session (in some rare cases there can be hundreds of workers per session) creates a lot of contention for the database, and also too much traffic when the network or server is slow. Concurrent workers sending to the same session can create a lot of traffic, particularly when the server is overloaded: the clients keep failing on timeouts, yet they still send the requests and ignore the responses, potentially resulting in duplicate deliveries (TODO: measure it), thus creating even more traffic for the recipients as well.

## Solution

The recent changes in message delivery moved the message delivery queues to the database, so the worker now reads the next message ID to deliver from the database (and not only its data, as before). We could optimize this further by having a single worker per transport session that treats quota exceeded errors differently from network and timeout errors.
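
A minimal sketch of how such a single per-session worker might pick the next deliverable message. The `PendingMsg`/`SndQueue` types, the in-memory lists standing in for database tables, and the queue-level `deliver_after` timestamp (described in the next paragraph) are illustrative assumptions, not the actual simplexmq schema or code:

```haskell
-- Illustrative sketch only: one worker per transport session picks the next
-- deliverable message, skipping queues whose deliver_after is still in the
-- future. Types and field names are assumptions, not the real schema.
import Data.List (sortOn)
import Data.Maybe (listToMaybe)

type QueueId = Int
type EpochSeconds = Int

data PendingMsg = PendingMsg
  { msgQueue :: QueueId
  , msgId :: Int
  , msgCreatedAt :: EpochSeconds
  } deriving (Show)

data SndQueue = SndQueue
  { queueId :: QueueId
  , deliverAfter :: Maybe EpochSeconds -- set on quota exceeded / repeated errors
  } deriving (Show)

-- The session worker would run a database query equivalent to this:
-- the oldest pending message whose queue is not blocked by deliver_after.
nextDeliverable :: EpochSeconds -> [SndQueue] -> [PendingMsg] -> Maybe PendingMsg
nextDeliverable now queues = listToMaybe . sortOn msgCreatedAt . filter deliverable
  where
    blockedUntil = [(queueId q, t) | q <- queues, Just t <- [deliverAfter q]]
    deliverable m = case lookup (msgQueue m) blockedUntil of
      Just t -> t <= now
      Nothing -> True

main :: IO ()
main = do
  let queues = [SndQueue 1 Nothing, SndQueue 2 (Just 2000)]
      msgs = [PendingMsg 2 10 100, PendingMsg 1 11 150]
  -- at time 1000, queue 2 is blocked until 2000, so message 11 from queue 1 is next
  print $ nextDeliverable 1000 queues msgs
```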

When queue quota is exceeded, the sending queue will be marked with a `deliver_after` timestamp, and messages that need to be delivered to this queue will not be read by the worker until that timestamp is reached. When a network or timeout error happens, the message will be retried up to N times (probably 3 to 5, to avoid the risk of message-specific errors being misinterpreted as network errors); after that the whole queue will be marked with a `deliver_after` timestamp with a smaller interval than in case of the quota exceeded error, but larger than between retries of a single message. Exponential backoff would still be used.

> How is `deliver_after` to be treated after a client restart? Ignore it and attempt once, preserving the retry interval? It may be better to have separate fields for quota exceeded and for network error queues, in case these situations require different approaches.
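
A minimal sketch of the error handling described above; the function and constructor names (`handleTempError`, `BlockQueue`), the retry limit and the intervals are placeholder assumptions, not settled values:

```haskell
-- Illustrative sketch only: how a session worker might react to the two kinds
-- of temporary errors. Retry limit and intervals are assumptions.
type Seconds = Int

data TempError = QuotaExceeded | NetworkOrTimeout
  deriving (Show, Eq)

data Action
  = RetryMessage Seconds -- retry the same message after this delay
  | BlockQueue Seconds   -- set deliver_after = now + delay for the whole queue
  deriving (Show)

maxMsgRetries :: Int
maxMsgRetries = 4 -- "probably 3 to 5"

-- exponential backoff with a cap
backoff :: Seconds -> Int -> Seconds
backoff base attempt = min 3600 (base * 2 ^ attempt)

handleTempError :: TempError -> Int -> Action
handleTempError QuotaExceeded _ =
  -- quota exceeded blocks the whole queue for a long interval right away
  BlockQueue (backoff 1800 0)
handleTempError NetworkOrTimeout attempt
  | attempt < maxMsgRetries = RetryMessage (backoff 5 attempt)
  | otherwise =
      -- after N failed attempts, block the whole queue with a shorter interval
      -- than for quota exceeded, but longer than between per-message retries
      BlockQueue (backoff 60 0)

main :: IO ()
main = mapM_ print
  [ handleTempError NetworkOrTimeout 1 -- RetryMessage 10
  , handleTempError NetworkOrTimeout 4 -- BlockQueue 60
  , handleTempError QuotaExceeded 0    -- BlockQueue 1800
  ]
```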

The benefits of this change could include reduced traffic, battery usage, server load and spikes of server load. It could also reduce complexity and remove the need for the dual retry intervals that are currently set per message - longer delays would be handled at the queue level.

> Possibly, alternatively, we could keep the workers but block them at the session client, e.g. it may process one message at a time, managing the session network retry interval; quota-exceeded workers may cooperate and keep their own retry interval. This may be a smaller change in terms of affecting delivery to good queues (e.g. delivery getting stuck due to some newly introduced bug). Also, if the "work" of all session workers is to be read by a single worker, the query may need more special filters (e.g. to account for the user in the session). Maybe it's not a problem.

> Per-transport workers would make using (per-transport) proxies simple.

> I am less concerned about complicating the database logic than about this change substantially reducing sending concurrency, slowing down actual sending to groups and making the problem of out-of-order delivery worse. On the other hand, possibly the current concurrency level creates too many timed-out deliveries that result in duplicate deliveries, increasing the traffic both for senders and for recipients. So the first steps would be to reduce traffic and battery usage: