fix node connected status flapping #4587
Merged
Problem Statement
Node connection status has been observed to flap between connected and disconnected states due to race conditions in the HeartbeatServer and its interaction with the priority queue for heartbeat management.
Root Cause Analysis
The issue stems from non-atomic operations in the existing heartbeat message handler implementation:
This sequence of operations is vulnerable to race conditions when concurrent heartbeats arrive from the same node, potentially resulting in multiple heartbeats for a single node in the queue. This results in unexpected behaviour, as the HashedPriorityQueue is expected to hold a single item per key (node).
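To make the failure mode concrete, here is a minimal sketch that replays one possible interleaving of two handlers processing heartbeats from the same node. It assumes the handler removes any existing entry for the node and then re-enqueues it as two separate steps; `naiveQueue` and `replayRace` are hypothetical names for illustration and are not the actual HeartbeatServer or HashedPriorityQueue code.

```go
package main

// naiveQueue is a hypothetical stand-in for a queue that does not enforce
// one entry per node on its own; callers must remove and re-enqueue manually.
type naiveQueue struct{ nodeIDs []string }

// removeAll deletes every entry for nodeID and reports how many were removed.
func (q *naiveQueue) removeAll(nodeID string) int {
	kept := q.nodeIDs[:0]
	removed := 0
	for _, id := range q.nodeIDs {
		if id == nodeID {
			removed++
			continue
		}
		kept = append(kept, id)
	}
	q.nodeIDs = kept
	return removed
}

// enqueue appends a new entry without checking for an existing one.
func (q *naiveQueue) enqueue(nodeID string) {
	q.nodeIDs = append(q.nodeIDs, nodeID)
}

// count returns how many entries the queue holds for nodeID.
func (q *naiveQueue) count(nodeID string) int {
	n := 0
	for _, id := range q.nodeIDs {
		if id == nodeID {
			n++
		}
	}
	return n
}

// replayRace replays, sequentially and deterministically, an interleaving
// that two concurrent handlers (A and B) could produce for the same node.
// It returns the number of entries left in the queue for that node.
func replayRace(q *naiveQueue, nodeID string) int {
	q.removeAll(nodeID) // handler A removes the stale entry
	q.removeAll(nodeID) // handler B finds nothing left to remove
	q.enqueue(nodeID)   // handler A re-enqueues
	q.enqueue(nodeID)   // handler B re-enqueues, creating a duplicate
	return q.count(nodeID)
}
```

Starting from a queue with one entry for a node, `replayRace` leaves two entries for it, which is the duplicate state described above. Because each step is individually valid, the duplicate can only be prevented by making the whole sequence atomic.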
Why Now?
While this bug has existed since version 1.4, it only became apparent with the introduction of concurrent heartbeats in version 1.5. The new version requires nodes to heartbeat to two topics:
As a result, 1.5 orchestrators now receive two concurrent heartbeats from 1.5 compute nodes, exposing the race condition.
Reproduction Steps
Solution
Instead of simply locking the HeartbeatServer.Handle() method, I've implemented a more comprehensive fix to address the underlying issues:
- Changed HashedPriorityQueue to enforce a single item per key atomically within the queue
- Added a Peek method to allow HeartbeatServer to examine the oldest item without removing it and without having to loop over all items using DequeueWhere
These changes eliminate the need for manual checks, dequeues, and re-enqueues, while also improving the overall efficiency of the queue operations.
Implementation Details
HashedPriorityQueue Modifications:
New Peek Method:
- Supports HeartbeatServer operations without having to loop over all items using DequeueWhere
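As a rough illustration of why a peek operation helps here, the sketch below checks for stale nodes by looking only at the oldest item and stopping as soon as it is still fresh. Everything in it is an assumption made for illustration: `heartbeat`, `oldestFirstQueue`, `expireStale`, and the timeout handling are hypothetical and do not mirror the actual HeartbeatServer code.

```go
package main

// heartbeat is a hypothetical queue item: a node ID and the Unix time
// (in seconds) at which its last heartbeat was received.
type heartbeat struct {
	nodeID   string
	lastSeen int64
}

// oldestFirstQueue is a minimal stand-in whose items are assumed to be kept
// sorted by lastSeen, oldest first. It is not the real HashedPriorityQueue.
type oldestFirstQueue struct{ items []heartbeat }

// Peek returns the oldest heartbeat without removing it.
func (q *oldestFirstQueue) Peek() (heartbeat, bool) {
	if len(q.items) == 0 {
		return heartbeat{}, false
	}
	return q.items[0], true
}

// Dequeue removes and returns the oldest heartbeat.
func (q *oldestFirstQueue) Dequeue() (heartbeat, bool) {
	hb, ok := q.Peek()
	if ok {
		q.items = q.items[1:]
	}
	return hb, ok
}

// expireStale removes and returns the IDs of nodes whose last heartbeat is
// older than timeout seconds. It stops at the first fresh item, so it never
// scans the whole queue.
func expireStale(q *oldestFirstQueue, now, timeout int64) []string {
	var expired []string
	for {
		hb, ok := q.Peek()
		if !ok || now-hb.lastSeen <= timeout {
			return expired
		}
		q.Dequeue()
		expired = append(expired, hb.nodeID)
	}
}
```

Because the queue is ordered by age, the first fresh item proves every later item is fresh too, which is what makes a full DequeueWhere pass unnecessary in this sketch.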
Heartbeat Event Prioritization:
Testing Conducted
- HashedPriorityQueue: to ensure unique items per key
- HeartbeatServer