Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix node connected status flappging #4587

Merged
merged 1 commit into from
Oct 7, 2024
Merged

fix node connected status flappging #4587

merged 1 commit into from
Oct 7, 2024

Conversation

wdbaruni
Copy link
Member

@wdbaruni wdbaruni commented Oct 6, 2024

Problem Statement

Node connection status has been observed to flap between connected and disconnected states due to race conditions in the HeartbeatServer and its interaction with the priority queue for heartbeat management.

Root Cause Analysis

The issue stems from non-atomic operations in the existing heartbeat message handler implementation:

  1. Checking for an older heartbeat
  2. Removing the old heartbeat
  3. Enqueuing a new heartbeat

This sequence of operations is vulnerable to race conditions when concurrent heartbeats arrive from the same node, potentially resulting in multiple heartbeats for a single node in the queue. This result in unexpected behaviour as the HashedPriorityQueue is expected to have a single item in the queue for the same key/node

Why Now?

While this bug has existed since version 1.4, it only became apparent with the introduction of concurrent heartbeats in version 1.5. The new version requires nodes to heartbeat to two topics:

  • The old topic (supported by 1.4 orchestrators)
  • A new topic (supported by 1.5 orchestrators)

As a result, 1.5 orchestrators now receive two concurrent heartbeats from 1.5 compute nodes, exposing the race condition.

Reproduction Steps

  1. Set up a devstack environment with approximately 10 nodes
  2. Observe the node connection status
  3. Note the flapping between connected and disconnected states

Solution

Instead of simply locking the HeartbeatServer.Handle() method, I've implemented a more comprehensive fix to address the underlying issues:

  1. Modified HashedPriorityQueue to enforce a single item per key atomically within the queue
  2. Introduced a Peek method to allow HeartbeatServer to examine the oldest item without removal and without having to loop over all item using DequeueWhere
  3. Corrected the priority and ordering of heartbeat events in the queue

These changes eliminate the need for manual checks, dequeues, and re-enqueues, while also improving the overall efficiency of the queue operations.

Implementation Details

  1. HashedPriorityQueue Modifications:

    • Ensure atomic operations for maintaining a single item per key
    • Implement version tracking for items so that enqueues remain fast, while dequeues will lazily filter out and remove items that don't match the latest version for the same key
  2. New Peek Method:

    • Allow examination of the oldest item without altering the queue state
    • Improve efficiency of HeartbeatServer operations without having to loop over all item using DequeueWhere
  3. Heartbeat Event Prioritization:

    • Adjust priority calculation to ensure oldest events are dequeued first

Testing Conducted

  • Enhanced test coverage for HashedPriorityQueue to ensure unique items per key
  • Improved concurrent heartbeat testing in HeartbeatServer
  • Manual testing using devstack environments

@wdbaruni wdbaruni requested a review from udsamani October 7, 2024 09:54
Copy link
Contributor

@udsamani udsamani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for documenting it so well.

@wdbaruni wdbaruni merged commit 7285a99 into main Oct 7, 2024
3 of 4 checks passed
@wdbaruni wdbaruni deleted the fix-heartbeat branch October 7, 2024 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants