fix node connected status flapping #4587

wdbaruni · 2024-10-06T16:35:31Z

Problem Statement

Node connection status has been observed to flap between connected and disconnected states due to race conditions in the HeartbeatServer and its interaction with the priority queue for heartbeat management.

Root Cause Analysis

The issue stems from non-atomic operations in the existing heartbeat message handler implementation:

- Checking for an older heartbeat
- Removing the old heartbeat
- Enqueuing a new heartbeat

This sequence of operations is vulnerable to race conditions when concurrent heartbeats arrive from the same node, potentially leaving multiple heartbeats for a single node in the queue. This results in unexpected behaviour, as the HashedPriorityQueue is expected to hold a single item per key/node.
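The interleaving described above can be replayed deterministically. The following is a minimal sketch, not the actual HeartbeatServer code: `heartbeat`, `naiveQueue`, and `interleavedHandlers` are hypothetical names, and the two "handlers" are stepped through by hand instead of running as goroutines so the outcome is reproducible.

```go
package main

// heartbeat is a hypothetical, simplified stand-in for a queued heartbeat.
type heartbeat struct {
	nodeID string
	seq    int
}

// naiveQueue is a hypothetical queue that does NOT enforce one item per key.
type naiveQueue struct {
	items []heartbeat
}

// contains reports whether any heartbeat for nodeID is queued.
func (q *naiveQueue) contains(nodeID string) bool {
	for _, it := range q.items {
		if it.nodeID == nodeID {
			return true
		}
	}
	return false
}

// remove drops every queued heartbeat for nodeID.
func (q *naiveQueue) remove(nodeID string) {
	kept := q.items[:0]
	for _, it := range q.items {
		if it.nodeID != nodeID {
			kept = append(kept, it)
		}
	}
	q.items = kept
}

// enqueue appends a heartbeat without any uniqueness check.
func (q *naiveQueue) enqueue(hb heartbeat) {
	q.items = append(q.items, hb)
}

// interleavedHandlers replays one possible interleaving of two handlers (A and B)
// processing heartbeats from the same node, and returns the resulting queue length.
func interleavedHandlers() int {
	q := &naiveQueue{}

	// Step 1: both handlers check for an older heartbeat; neither finds one.
	aSawOld := q.contains("node-1")
	bSawOld := q.contains("node-1")

	// Step 2: each handler removes the old heartbeat only if its check found one.
	if aSawOld {
		q.remove("node-1")
	}
	if bSawOld {
		q.remove("node-1")
	}

	// Step 3: both handlers enqueue, leaving two items for a single node.
	q.enqueue(heartbeat{nodeID: "node-1", seq: 1})
	q.enqueue(heartbeat{nodeID: "node-1", seq: 2})

	return len(q.items)
}
```

In this replay `interleavedHandlers()` leaves two entries for `node-1`, which is the state the queue's consumers do not expect.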

Why Now?

While this bug has existed since version 1.4, it only became apparent with the introduction of concurrent heartbeats in version 1.5. The new version requires nodes to heartbeat to two topics:

- The old topic (supported by 1.4 orchestrators)
- A new topic (supported by 1.5 orchestrators)

As a result, 1.5 orchestrators now receive two concurrent heartbeats from 1.5 compute nodes, exposing the race condition.

Reproduction Steps

- Set up a devstack environment with approximately 10 nodes
- Observe the node connection status
- Note the flapping between connected and disconnected states

Solution

Instead of simply locking the HeartbeatServer.Handle() method, I've implemented a more comprehensive fix to address the underlying issues:

- Modified HashedPriorityQueue to enforce a single item per key atomically within the queue
- Introduced a Peek method so that HeartbeatServer can examine the oldest item without removing it and without looping over all items using DequeueWhere
- Corrected the priority and ordering of heartbeat events in the queue

These changes eliminate the need for manual checks, dequeues, and re-enqueues, while also improving the overall efficiency of the queue operations.

Implementation Details

HashedPriorityQueue Modifications:
- Ensure atomic operations for maintaining a single item per key
- Implement version tracking for items so that enqueues remain fast, while dequeues will lazily filter out and remove items that don't match the latest version for the same key
New Peek Method:
- Allow examination of the oldest item without altering the queue state
- Improve efficiency of HeartbeatServer operations by avoiding a loop over all items using DequeueWhere
Heartbeat Event Prioritization:
- Adjust priority calculation to ensure oldest events are dequeued first
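The version-tracking idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual HashedPriorityQueue code: `versionedQueue`, `entry`, and all method signatures here are hypothetical, the queue is assumed to serve the highest priority first, and the caller is assumed to pass the negated heartbeat timestamp as the priority so that the oldest event comes out first.

```go
package main

import (
	"container/heap"
	"sync"
)

// entry is a hypothetical queue item tagged with the version assigned at enqueue time.
type entry struct {
	key      string
	priority int64
	version  uint64
}

// entryHeap implements heap.Interface; the highest priority sits at the root.
type entryHeap []entry

func (h entryHeap) Len() int           { return len(h) }
func (h entryHeap) Less(i, j int) bool { return h[i].priority > h[j].priority }
func (h entryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// versionedQueue keeps at most one live item per key. Enqueue only pushes and
// bumps a version; stale items are discarded lazily by Peek and Dequeue.
type versionedQueue struct {
	mu     sync.Mutex
	items  entryHeap
	latest map[string]uint64 // key -> version of its live item
	next   uint64            // monotonically increasing version counter
}

func newVersionedQueue() *versionedQueue {
	return &versionedQueue{latest: make(map[string]uint64)}
}

// Enqueue supersedes any earlier item for key in a single critical section.
func (q *versionedQueue) Enqueue(key string, priority int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.next++
	q.latest[key] = q.next
	heap.Push(&q.items, entry{key: key, priority: priority, version: q.next})
}

// dropStale pops root items whose version is no longer the latest for their key.
// The caller must hold q.mu.
func (q *versionedQueue) dropStale() {
	for q.items.Len() > 0 && q.items[0].version != q.latest[q.items[0].key] {
		heap.Pop(&q.items)
	}
}

// Peek returns the highest-priority live item without removing it.
func (q *versionedQueue) Peek() (key string, priority int64, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.dropStale()
	if q.items.Len() == 0 {
		return "", 0, false
	}
	return q.items[0].key, q.items[0].priority, true
}

// Dequeue removes and returns the highest-priority live item.
func (q *versionedQueue) Dequeue() (key string, priority int64, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.dropStale()
	if q.items.Len() == 0 {
		return "", 0, false
	}
	top := heap.Pop(&q.items).(entry)
	delete(q.latest, top.key) // versions start at 1, so leftovers for this key stay stale
	return top.key, top.priority, true
}

// Len reports the number of live keys, ignoring stale items still in the heap.
func (q *versionedQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.latest)
}
```

With this shape, a heartbeat handler only needs a single `Enqueue(nodeID, -timestamp)` call per heartbeat, and an expiry check can `Peek` at the oldest live heartbeat and `Dequeue` only when it has actually expired. The trade-off in this sketch is that superseded items stay in the heap until they surface at the root.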

Testing Conducted

- Enhanced test coverage for HashedPriorityQueue to ensure unique items per key
- Improved concurrent heartbeat testing in HeartbeatServer
- Manual testing using devstack environments

udsamani

Thank you for documenting it so well.

fix node connected status flappging

c08da5a

wdbaruni requested a review from udsamani October 7, 2024 09:54

udsamani approved these changes Oct 7, 2024

wdbaruni merged commit 7285a99 into main Oct 7, 2024
3 of 4 checks passed

wdbaruni deleted the fix-heartbeat branch October 7, 2024 12:33

frrist mentioned this pull request Oct 7, 2024

add logging for debuggin hb issue #4584

Closed

wdbaruni mentioned this pull request Oct 8, 2024

Heartbeat between Orchestrator and Compute node flapping connection state #4585

Closed
