-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster reconnect on handshake required response #4772
Faster reconnect on handshake required response #4772
Conversation
WalkthroughThis pull request introduces a comprehensive enhancement to the connection management and error handling mechanisms in the network communication layer. The changes focus on improving the robustness of heartbeat and handshake processes across multiple components. A new Changes
Sequence DiagramsequenceDiagram
participant CP as Control Plane
participant HT as Health Tracker
participant CM as Connection Manager
CP->>HT: Heartbeat
alt Handshake Required Error
HT->>HT: Set HandshakeRequired flag
CM->>CM: Transition to Disconnected
CM->>CP: Attempt Reconnection
else Normal Error
HT->>HT: Continue Tracking
CM->>CM: Continue Heartbeats
end
Possibly related PRs
Poem
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/transport/nclprotocol/types.go (1)
53-53
: LGTM! Consider adding field documentation.The addition of
HandshakeRequired
field is well-placed and appropriate for tracking handshake state. Consider adding a comment to document its purpose and when it gets set/reset.type ConnectionHealth struct { StartTime time.Time LastSuccessfulHeartbeat time.Time LastSuccessfulUpdate time.Time CurrentState ConnectionState ConsecutiveFailures int LastError error ConnectedSince time.Time - HandshakeRequired bool + // HandshakeRequired indicates whether a new handshake is needed with the orchestrator + HandshakeRequired bool }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/transport/nclprotocol/compute/controlplane.go
(2 hunks)pkg/transport/nclprotocol/compute/controlplane_test.go
(2 hunks)pkg/transport/nclprotocol/compute/health_tracker.go
(5 hunks)pkg/transport/nclprotocol/compute/health_tracker_test.go
(3 hunks)pkg/transport/nclprotocol/compute/manager.go
(2 hunks)pkg/transport/nclprotocol/compute/manager_test.go
(3 hunks)pkg/transport/nclprotocol/types.go
(1 hunks)
🔇 Additional comments (11)
pkg/transport/nclprotocol/compute/health_tracker.go (2)
39-39
: LGTM! Consistent state management.
The HandshakeRequired flag is properly reset across all state transitions (Connected, Disconnected, Connecting), ensuring consistent state management.
Also applies to: 50-50, 59-59
76-81
: LGTM! Thread-safe implementation of handshake methods.
The new HandshakeRequired methods are well-implemented with:
- Proper mutex usage for thread safety
- Clear separation of setter and getter
- Focused single responsibility
Also applies to: 97-102
pkg/transport/nclprotocol/compute/health_tracker_test.go (2)
42-42
: LGTM! Comprehensive test coverage in existing tests.
The HandshakeRequired state is properly verified in existing tests:
- Initial state verification
- State transitions during connect/disconnect
- Proper setup and assertions
Also applies to: 46-48, 61-61, 76-76
84-107
: LGTM! Well-structured dedicated test method.
TestHandshakeRequired provides thorough coverage:
- Verifies initial state
- Tests setting and clearing of the flag
- Validates behavior across all connection state transitions
pkg/transport/nclprotocol/compute/controlplane.go (1)
6-6
: LGTM: Import added for error message handling.
The strings package is required for the new error message checking functionality.
pkg/transport/nclprotocol/compute/manager_test.go (2)
244-286
: LGTM: Comprehensive test coverage for handshake required scenario.
The test thoroughly validates the core functionality:
- Immediate disconnect on handshake required error
- Automatic reconnection attempt
- Heartbeat resumption after reconnection
- HandshakeRequired flag state transitions
288-313
: LGTM: Good negative test case for error handling.
The test ensures that similar but different errors don't trigger the fast-path handshake required behavior, maintaining the normal heartbeat failure mechanism.
pkg/transport/nclprotocol/compute/controlplane_test.go (2)
183-220
: LGTM: Well-structured test for fast-path error handling.
The test effectively validates:
- Immediate handshake required flag setting
- No successful heartbeat recording
- Control loop termination
222-247
: LGTM: Good resilience test for other error types.
The test ensures the control plane continues operation on non-handshake errors, maintaining normal error handling behavior.
pkg/transport/nclprotocol/compute/manager.go (2)
Line range hint 434-449
: Verify test coverage for the new handshake requirement check.
Please ensure that the test suite includes scenarios that verify:
- The connection transitions to disconnected state when handshake is required
- The error message correctly indicates "handshake required" as the reason
- The reconnection is triggered immediately upon handshake requirement
Let's check the test coverage:
✅ Verification successful
Test coverage for handshake requirement check is adequate
The test suite already includes comprehensive scenarios that verify:
- Connection transitions to disconnected state when handshake is required
- Error message correctly indicates "handshake required" as the reason
- Automatic reconnection is triggered after handshake requirement
- Different error conditions are properly distinguished from handshake requirements
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify test coverage for handshake required scenarios
# Expected: Test cases should exist for the new handshake requirement check
# Check for test cases involving handshake required scenarios
rg -l 'TestHeartbeat.*Handshake' --type go
rg -A 5 'func Test.*Handshake.*\(' --type go
Length of output: 235
Script:
#!/bin/bash
# Let's examine the actual test implementations in manager_test.go
rg -B 2 -A 10 "handshake.*required" pkg/transport/nclprotocol/compute/manager_test.go
# Check for test setup and assertions related to health tracker
rg -B 2 -A 10 "IsHandshakeRequired" pkg/transport/nclprotocol/compute/manager_test.go
# Look for test cases that verify disconnection states
rg -B 2 -A 10 "TestManager.*Disconnect" pkg/transport/nclprotocol/compute/manager_test.go
Length of output: 1705
Line range hint 434-449
: LGTM! The implementation aligns well with the PR objective.
The addition of the handshake requirement check in checkConnectionHealth
effectively addresses the PR's goal of faster reconnection when the orchestrator restarts. The implementation:
- Follows the existing pattern of health checks
- Maintains consistency with other health check conditions
- Provides clear error messaging for debugging
Let's verify the integration with the health tracker:
✅ Verification successful
LGTM! The handshake requirement check is properly integrated and tested.
The verification confirms that:
- The
IsHandshakeRequired
flag is properly implemented in the health tracker with thread-safe access - The flag is correctly set when handshake errors are detected and cleared on connection state changes
- The implementation is well-tested, including:
- Fast reconnect behavior on handshake required errors
- Proper state transitions
- Error handling differentiation between handshake-required and other errors
The changes in checkConnectionHealth
are consistent with the existing codebase's handling of handshake requirements and will effectively trigger faster reconnection when the orchestrator restarts.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the HandshakeRequired flag usage across the codebase
# Expected: The flag should be set in response to handshake required errors
# Check for HandshakeRequired usage in health tracker
ast-grep --pattern 'func $_(h *HealthTracker) IsHandshakeRequired() bool {
$$$
}'
# Check for places where HandshakeRequired is set
rg -A 3 'HandshakeRequired.*=.*true'
Length of output: 416
Script:
#!/bin/bash
# Let's verify the complete implementation of health tracker and its usage
# Check the implementation of IsHandshakeRequired
ast-grep --pattern 'func (h *HealthTracker) IsHandshakeRequired() bool {
$$$
}'
# Check where handshake required is set to true and the surrounding context
rg -B 5 -A 5 'HandshakeRequired.*=.*true'
# Check for any error handling related to handshake
rg -A 3 -i 'handshake.*error|handshake.*required'
Length of output: 24700
if strings.Contains(err.Error(), "handshake required") { | ||
cp.healthTracker.HandshakeRequired() | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider using error type assertion instead of string matching.
The current implementation uses string matching to detect handshake required errors, which can be fragile if error messages change. Consider using type assertion or error wrapping to make the error handling more robust.
-if strings.Contains(err.Error(), "handshake required") {
+var handshakeErr *nodes.ErrHandshakeRequired
+if errors.As(err, &handshakeErr) {
cp.healthTracker.HandshakeRequired()
return
}
Committable suggestion skipped: line range outside the PR's diff.
When orchestrator restarts, compute nodes wait for 5 failed heartbeats (~75s) before attempting to reconnect, even though orchestrator immediately returns "Handshake required" errors.
Modify compute nodes to detect this specific error and trigger immediate reconnection, rather than waiting for the heartbeat failure threshold.
Summary by CodeRabbit
New Features
HandshakeRequired
added to track handshake necessity.Bug Fixes
Tests
ControlPlaneTestSuite
andConnectionManagerTestSuite
.HealthTracker
functionality regarding handshake states.Documentation