-
Notifications
You must be signed in to change notification settings - Fork 7k
[Core] Fix crash when receiving empty message batch in RaySyncer #59298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The OnReadDone callback in RaySyncerBidiReactorBase was asserting that message batches are never empty when ok=true. However, gRPC's callback streaming API may call OnReadDone with ok=true even when the message batch is empty in certain edge cases, such as: - When a connection is established but no data has been sent yet - During race conditions in concurrent read operations - When the remote side sends an empty batch This fix replaces the RAY_CHECK assertion with a graceful check that logs a debug message and continues reading when an empty batch is received, preventing the GCS server from crashing. The bug was introduced in commit 4a6ed09 when the code was refactored from single message handling to batch message handling, but the edge case of empty batches was not properly handled. Signed-off-by: mingfei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request modifies the RaySyncerBidiReactorBase class to handle cases where gRPC's OnReadDone callback might return an empty message batch even when ok=true. Previously, this would trigger a RAY_CHECK assertion. The change replaces the assertion with a conditional check that logs a debug message, explains the potential gRPC behavior in a comment, and then calls StartPull() to initiate the next read, thus preventing a crash in these edge cases.
|
@ZacAttack @Sparks0219 PTAL |
ZacAttack
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! Looks good!
|
Would you mind clarifying if this is part of the grpc streaming API behavior or is a raylet ray syncer sending a message with no resource view updates to the gcs? If this is expected behavior, might just be better to delete the RAY_CHECK from ReceiveUpdate and not conditionally check for it like was done prior the batching pr. |
Actually this is a good point. Are you sure that you're not failing a serialization silently and that's why the batch is empty? Since there was a proto change as a result of the highlighted version. Ray does not support any compatability for wire protocol across versions..... |
|
I think what should be investigated is why the version mismatch error didn't trigger earlier as it should since the RAY versions are different. All ray nodes in general should run on the same RAY version as we have no guarantees our internal proto files are backwards compatible. However the version check should've fired and caught this. Would you mind checking why the version check didn't fire @g199209 it's defined here: https://github.com/ray-project/ray/blob/00b1f9d5d3aa37a01c74ea29ef0f8c7d7a31e368/python/ray/_private/utils.py#L1210C5-L1210C18 |
The OnReadDone callback in RaySyncerBidiReactorBase was asserting that message batches are never empty when
ok=true. However, gRPC's callback streaming API may callOnReadDonewithok=trueeven when the message batch is empty in certain edge cases, such as:This fix replaces the
RAY_CHECKassertion with a graceful check that logs a debug message and continues reading when an empty batch is received, preventing the GCS server from crashing.The bug was introduced in commit 4a6ed09 (#57641) when the code was refactored from single message handling to batch message handling, but the edge case of empty batches was not properly handled.
How to reproduce:
ray start --headray start --address=xxx:6379The ray head GCS service will crash immediately:
After this patch, the head node will not crash, and worker node will get the error message as expect:
cc @yancanmao