
Conversation


@g199209 g199209 commented Dec 9, 2025

The OnReadDone callback in RaySyncerBidiReactorBase was asserting that message batches are never empty when ok=true. However, gRPC's callback streaming API may call OnReadDone with ok=true even when the message batch is empty in certain edge cases, such as:

  • When a connection is established but no data has been sent yet
  • During race conditions in concurrent read operations
  • When the remote side sends an empty batch

This fix replaces the RAY_CHECK assertion with a graceful check that logs a debug message and continues reading when an empty batch is received, preventing the GCS server from crashing.
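
For reference, here is a minimal sketch of the changed callback. The member names (msg_batch, StartPull(), ReceiveUpdate()) are inferred from the check-failure message and stack trace below, and the actual code dispatches this work onto an io_context, which is omitted here; treat this as an illustration of the fix, not the exact diff.

void OnReadDone(bool ok) override {
  if (!ok) {
    // The read failed; tear the stream down as before.
    Disconnect();
    return;
  }
  // Previously: RAY_CHECK(!msg_batch->messages().empty());
  if (msg_batch->messages().empty()) {
    // gRPC may deliver ok=true with no messages; log and keep reading
    // instead of crashing the GCS server.
    RAY_LOG(DEBUG) << "Received an empty message batch, continue reading.";
    StartPull();
    return;
  }
  for (auto &message : *msg_batch->mutable_messages()) {
    ReceiveUpdate(std::move(message));
  }
  StartPull();
}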

The bug was introduced in commit 4a6ed09 (#57641) when the code was refactored from single message handling to batch message handling, but the edge case of empty batches was not properly handled.


How to reproduce:

  1. Start a Ray head on the new version (master): ray start --head
  2. Start a Ray worker on an old version (e.g. 2.51.1): ray start --address=xxx:6379

The GCS server on the Ray head will crash immediately:

[2025-12-09 15:28:58,653 C 1715644 1715671] (gcs_server) ray_syncer_bidi_reactor_base.h:250:  An unexpected system state has occurred. You have likely discovered a bug in Ray. Please report this issue at https://github.com/ray-project/ray/issues and we'll work with you to fix it. Check failed: !msg_batch->messages().empty()
*** StackTrace Information ***
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0xe5c48a) [0x56033011e48a] ray::operator<<()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0xe5ec75) [0x560330120c75] ray::RayLog::~RayLog()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0x50fc90) [0x56032f7d1c90] ray::syncer::RaySyncerBidiReactorBase<>::OnReadDone()::{lambda()#1}::operator()()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0x731b8c) [0x56032f9f3b8c] EventTracker::RecordExecution()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0x726ec4) [0x56032f9e8ec4] boost::asio::detail::executor_op<>::do_complete()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0xd1b4df) [0x56032ffdd4df] boost::asio::detail::scheduler::do_run_one()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0xd1cf11) [0x56032ffdef11] boost::asio::detail::scheduler::run()
.venv/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server(+0xd1d381) [0x56032ffdf381] boost::asio::io_context::run()
/lib64/libstdc++.so.6(+0xc2b13) [0x7fe9920d4b13]
/lib64/libpthread.so.0(+0x81ca) [0x7fe9929391ca] start_thread
/lib64/libc.so.6(clone+0x43) [0x7fe991a6ee73] clone

After this patch, the head node no longer crashes, and the worker node gets the expected error message:

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 3.0.0
    Python: 3.12.10
This process on node 10.2.1.98 was started with:
    Ray: 2.51.1
    Python: 3.12.10

cc @yancanmao

@g199209 g199209 requested a review from a team as a code owner December 9, 2025 08:58
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request modifies the RaySyncerBidiReactorBase class to handle cases where gRPC's OnReadDone callback might return an empty message batch even when ok=true. Previously, this would trigger a RAY_CHECK assertion. The change replaces the assertion with a conditional check that logs a debug message, explains the potential gRPC behavior in a comment, and then calls StartPull() to initiate the next read, thus preventing a crash in these edge cases.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Dec 9, 2025
Collaborator

edoakes commented Dec 9, 2025

@ZacAttack @Sparks0219 PTAL

@ZacAttack ZacAttack added the go add ONLY when ready to merge, run all tests label Dec 9, 2025
Contributor

@ZacAttack ZacAttack left a comment


Thanks for the patch! Looks good!

@Sparks0219
Contributor

Would you mind clarifying whether this is part of the gRPC streaming API's behavior, or whether a raylet's ray syncer is sending a message with no resource view updates to the GCS?

If this is expected behavior, it might be better to just delete the RAY_CHECK from ReceiveUpdate and not conditionally check for it, as the code did prior to the batching PR.
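
(For illustration, that alternative would roughly reduce to the loop below: since iterating an empty batch is already a no-op, dropping the check needs no extra branch. Names reuse the hypothetical sketch from the description, not the actual diff.)

if (ok) {
  // No emptiness check at all: an empty batch simply skips the loop.
  for (auto &message : *msg_batch->mutable_messages()) {
    ReceiveUpdate(std::move(message));
  }
  StartPull();  // Re-arm the next read either way.
}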

@ZacAttack
Contributor

> Would you mind clarifying whether this is part of the gRPC streaming API's behavior, or whether a raylet's ray syncer is sending a message with no resource view updates to the GCS?
>
> If this is expected behavior, it might be better to just delete the RAY_CHECK from ReceiveUpdate and not conditionally check for it, as the code did prior to the batching PR.

Actually this is a good point. Are you sure that you're not silently failing a serialization, and that's why the batch is empty? There was a proto change as a result of the highlighted version, and Ray does not support any wire-protocol compatibility across versions.

@Sparks0219
Contributor

I think what should be investigated is why the version-mismatch error didn't trigger earlier, as it should have, since the Ray versions are different. All Ray nodes should generally run the same Ray version, as we have no guarantees that our internal proto files are backwards compatible. However, the version check should have fired and caught this. Would you mind checking why the version check didn't fire, @g199209? It's defined here: https://github.com/ray-project/ray/blob/00b1f9d5d3aa37a01c74ea29ef0f8c7d7a31e368/python/ray/_private/utils.py#L1210C5-L1210C18
