Reply offload #1457

Open
alexander-shabanov wants to merge 2 commits into base: unstable

Conversation

@alexander-shabanov commented Dec 18, 2024

Overview

This PR introduces the ability to offload replies to I/O threads as described in #1353.

Key Changes

  • Added a capability to reply construction that allows interleaving regular replies with offloaded replies in client reply buffers
  • Extended the write-to-client handlers to support offloaded replies
  • Added offloading of bulk replies when reply offload is enabled
  • Minor changes to cluster slot stats to support network-bytes-out accounting for offloaded replies

Note: When reply offload is disabled, the content and handling of client reply buffers remain as before this PR.

Performance Impact

The TPS for GET commands with a data size of 512 bytes increased from 1.09 million to 1.33 million requests per second in a test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads unless a decrease in client output buffer (cob) size is highly important for some customers.

Implementation Details

Reply construction:

  1. The original _addReplyToBuffer and _addReplyProtoToList have been renamed to _addReplyPayloadToBuffer and _addReplyPayloadToList and extended to support different types of payloads: regular replies and offloaded replies.
  2. The new _addReplyToBuffer and _addReplyProtoToList now call _addReplyPayloadToBuffer and _addReplyPayloadToList and are used for adding regular replies to client reply buffers.
  3. The newly introduced _addBulkOffloadToBuffer and _addBulkOffloadToList are used for adding offloaded replies to client reply buffers (a sketch follows this list).
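
A minimal sketch of the idea, assuming a payloadHeader with type/slot/actual_len fields; the function name appendBulkOffload and the exact buffer layout are illustrative assumptions, not the PR's actual code:

#include <stdint.h>
#include <string.h>
#include "server.h" /* robj, incrRefCount */

typedef enum {
    CLIENT_REPLY_PAYLOAD_DATA = 0,
    CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD,
} clientReplyPayloadType;

typedef struct payloadHeader {
    uint8_t type;      /* CLIENT_REPLY_PAYLOAD_DATA or CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD */
    int slot;          /* slot for cluster-slot-stats accounting */
    size_t actual_len; /* network bytes this payload will produce on the wire */
} payloadHeader;

/* Append an offloaded bulk reply: a header followed by a raw robj pointer
 * instead of the reply bytes themselves. */
static void appendBulkOffload(char *buf, size_t *bufpos, robj *obj, int slot, size_t net_len) {
    payloadHeader header = {CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD, slot, net_len};
    memcpy(buf + *bufpos, &header, sizeof(header));
    *bufpos += sizeof(header);
    incrRefCount(obj);                        /* keep the object alive until it is written */
    memcpy(buf + *bufpos, &obj, sizeof(obj)); /* store only the pointer, not the data */
    *bufpos += sizeof(obj);
}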

Write-to-client infrastructure:

writevToClient and _postWriteToClient have been significantly changed to support the reply offload capability; a simplified sketch of the core idea follows.
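
The sketch below shows, in a self-contained way, why writev is used: the iovec can point both at bytes already copied into the reply buffer and directly at the offloaded object's data, so the value reaches the socket without ever being copied into the cob. Names and structure are simplified for illustration and are not the PR's actual code.

#include <sys/types.h>
#include <sys/uio.h>

static ssize_t writeInterleavedReply(int fd, char *copied_reply, size_t copied_len,
                                     char *bulk_hdr, size_t bulk_hdr_len,
                                     void *bulk_body, size_t bulk_body_len) {
    struct iovec iov[3];
    iov[0].iov_base = copied_reply; /* regular reply bytes already in the cob */
    iov[0].iov_len = copied_len;
    iov[1].iov_base = bulk_hdr;     /* RESP bulk header, e.g. "$512\r\n" */
    iov[1].iov_len = bulk_hdr_len;
    iov[2].iov_base = bulk_body;    /* obj->ptr of the offloaded value, no copy into the cob */
    iov[2].iov_len = bulk_body_len;
    return writev(fd, iov, 3);      /* kernel gathers all three pieces in one syscall */
}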

Testing

  1. Existing unit and integration tests passed
  2. Added unit tests for reply offload functionality

Comment on lines +1442 to +1446
# For use cases where command replies include Bulk strings (e.g. GET, MGET)
# reply offload can be enabled to eliminate expensive memory access
# and redundant data copies performed by the main thread
#
# reply-offload yes
Member

Do we expect there to be cases where tuning this variable makes sense? Generally we want to avoid configuration in Valkey to make it simple to operate. Can we make real-time decisions about offloading?

Contributor

I'd prefer to avoid the config too. It's better to start with no config and, if it turns out we need it later, we can add it then. The reverse is not possible because removing a config is a breaking change.

@@ -3206,6 +3206,7 @@ standardConfig static_configs[] = {
createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL),
createBoolConfig("hide-user-data-from-log", NULL, MODIFIABLE_CONFIG, server.hide_user_data_from_log, 1, NULL, NULL),
createBoolConfig("import-mode", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, server.import_mode, 0, NULL, NULL),
createBoolConfig("reply-offload", NULL, MODIFIABLE_CONFIG, server.reply_offload_enabled, 0, NULL, NULL),
Member

I guess also, why is the default off? I/O threading is off by default, so it seems fine to allow this to be on by default.

src/networking.c Outdated
} clientReplyPayloadType;

/* Reply payload header */
typedef struct __attribute__((__packed__)) payloadHeader {
Member

Why is this packed? I would generally prefer we let the compiler decide.

Author

I wanted the headers to consume fewer bytes from the client reply buffers. Removed it.
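
For illustration, a small standalone example of the trade-off being discussed; the field set here is invented and only shows how packing removes alignment padding at the cost of potentially unaligned accesses:

#include <stdint.h>
#include <stdio.h>

typedef struct __attribute__((__packed__)) packedHeader {
    uint8_t type;
    uint16_t slot;
    uint64_t actual_len;
} packedHeader;

typedef struct plainHeader {
    uint8_t type;
    uint16_t slot;
    uint64_t actual_len;
} plainHeader;

int main(void) {
    /* Typically prints packed=11 plain=16 on x86-64: packing saves cob bytes per
     * header but may force unaligned loads when the header is read back. */
    printf("packed=%zu plain=%zu\n", sizeof(packedHeader), sizeof(plainHeader));
    return 0;
}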

Member

@madolson left a comment

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

src/networking.c Outdated
Comment on lines 2368 to 2392
* INTERNALS
* writevToClient strives to write all client reply buffers to the client connection.
* However, it may hit the NET_MAX_WRITES_PER_EVENT, IOV_MAX, or socket limits. In such cases,
* some client reply buffers will be written completely and some only partially.
* In the next invocation writevToClient should resume from the exact position where it stopped.
* writevToClient should also communicate to _postWriteToClient which buffers were written completely
* and can be released. This is intricate in the case of reply offloading because the length of a reply
* buffer does not match the network bytes out.
*
* For this purpose, writevToClient uses 3 data members on the client struct as input/output parameters:
* io_last_written_buf - Last buffer that has been written to the client connection
* io_last_written_bufpos - The position up to which this buffer has been written
* io_last_written_data_len - The actual length of the data written from this buffer
*                            This length differs from the written bufpos in case of reply offload
*
* writevToClient uses addBufferToReplyIOV, addCompoundBufferToReplyIOV, addOffloadedBulkToReplyIOV and
* addPlainBufferToReplyIOV to build the reply iovec array. These functions know how to skip
* io_last_written_data_len, specifically addPlainBufferToReplyIOV.
*
* At the end of execution writevToClient calls saveLastWrittenBuf to calculate the "last written"
* buf/pos/data_len and store them on the client. While building the reply iov, writevToClient gathers
* auxiliary bufWriteMetadata that helps in this calculation. In some cases it may take several (> 2)
* invocations for writevToClient to write the reply from a single buffer, but saveLastWrittenBuf knows
* how to calculate the "last written" buf/pos/data_len properly.
*
* _postWriteToClient uses io_last_written_buf and io_last_written_bufpos to detect completely written
* buffers and release them.
Member

Generally internal comments should be near the code they are describing. Can we move this into the function near the relevant sections?

Author

Moved (spread) comments to relevant functions

src/networking.c Outdated
typedef enum {
CLIENT_REPLY_PAYLOAD_DATA = 0,
CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD,
} clientReplyPayloadType;
Member

Suggested change
} clientReplyPayloadType;
} clientReplyType;

You use the term payload which seems unnecessary. The noun is the Reply.

Author

Applied suggestion

src/networking.c Outdated (resolved)
src/networking.c Outdated
Comment on lines 2539 to 2545
payloadHeader *header = (payloadHeader *)ptr;
ptr += sizeof(payloadHeader);

if (header->type == CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD) {
clusterSlotStatsAddNetworkBytesOutForSlot(header->slot, header->actual_len);

robj** obj_ptr = (robj**)ptr;
Member

Code like this would benefit a lot from some helper methods, instead of constantly moving and recasting values. Something like:

robj *getValkeyObjectFromHeader(payloadHeader *header) {
    char *ptr = (char *)header;
    ptr += sizeof(payloadHeader);
    return *(robj **)ptr;
}

Author

The suggested helper function does not address all the needs, as the buffer can contain content like header1|ptr1|ptr2|ptr3|header2|plain_reply|header3|ptr4|ptr5, so it is more convenient to move ptr and objv accordingly. A sketch of the iteration follows.
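
A hypothetical sketch of walking such a buffer (not the PR's code): the len field on the header and the helpers sendOffloadedBulk/sendPlainReply are invented for illustration; the real payloadHeader and iteration live in networking.c.

static void walkReplyBuffer(char *buf, size_t bufpos) {
    char *ptr = buf;
    while (ptr < buf + bufpos) {
        payloadHeader *header = (payloadHeader *)ptr;
        ptr += sizeof(payloadHeader);
        if (header->type == CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD) {
            robj **objv = (robj **)ptr;                  /* one or more object pointers follow */
            size_t count = header->len / sizeof(robj *); /* assumed: len = payload bytes after the header */
            for (size_t i = 0; i < count; i++) sendOffloadedBulk(objv[i]);
        } else {
            sendPlainReply(ptr, header->len);            /* regular protocol bytes follow */
        }
        ptr += header->len; /* jump over this payload to the next header */
    }
}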

@zuiderkwast
Contributor

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as: if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Signed-off-by: Alexander Shabanov <[email protected]>
@alexander-shabanov
Author

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

Fixed

Signed-off-by: Alexander Shabanov <[email protected]>
@alexander-shabanov
Author

alexander-shabanov commented Dec 20, 2024

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

From the tests and perf profiling it appears that the main cause of the performance improvement from this feature is eliminating the expensive memory access to obj->ptr by the main thread, and much less so eliminating the copy into the reply buffers. Without I/O threads, the main thread still needs to access obj->ptr, and the writev flow is a bit slower (it requires additional preparation work) than the plain write flow. I will publish results of various tests with and without I/O threads and with different data sizes next week.

@alexander-shabanov
Author

alexander-shabanov commented Dec 20, 2024

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as: if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes next week. IMPORTANT NOTE: we can't switch reply offload on or off dynamically according to the obj (string) size, because the main optimization is to eliminate the expensive memory access to obj->ptr by the main thread (eliminating the copy is much less important).

@zuiderkwast
Contributor

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode, where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In that case we would have one expensive memory access, but we could avoid copying megabytes. Let's see from the benchmark results whether this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.
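
A hypothetical sketch of that idea (not part of this PR): the threshold name and value are made up, and the real decision point would live in the reply-construction path.

#include "server.h" /* robj, OBJ_ENCODING_RAW, sdslen */

#define REPLY_OFFLOAD_MIN_SINGLE_THREADED_SIZE (1024 * 1024) /* e.g. 1 MB, made-up value */

static int shouldOffloadBulk(const robj *obj, int io_threads_active) {
    if (obj->encoding != OBJ_ENCODING_RAW) return 0; /* embedded strings are copied anyway */
    if (io_threads_active) return 1;                 /* with I/O threads, offload aggressively */
    /* Single-threaded: offload only when skipping the copy outweighs the extra
     * cold memory access to obj->ptr. */
    return sdslen(obj->ptr) >= REPLY_OFFLOAD_MIN_SINGLE_THREADED_SIZE;
}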
