Skip to content

Commit

Permalink
Update nclprotocol readme with seqNum exchange changes (#4779)
Browse files Browse the repository at this point in the history
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Expanded documentation for the NCL Protocol, including detailed
descriptions of sequence number management, connection lifecycle, and
failure recovery processes.
- Added new sections and clarified existing ones to enhance
understanding of message sequencing and connection management.
- Updated structures in the documentation to include comments and new
fields for better clarity on their usage.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
  • Loading branch information
wdbaruni authored Dec 18, 2024
1 parent 7498c46 commit caaf92c
Showing 1 changed file with 37 additions and 8 deletions.
45 changes: 37 additions & 8 deletions pkg/transport/nclprotocol/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,26 @@ Local Event Store NCL Protocol Remote Node
└──────────────┘ └─────────────────────┘ └──────────────┘
```

### Sequence Number Management

Each side maintains its own view of message processing:

1. **Orchestrator's View**
- Tracks which messages each compute node has processed
- Updates this view based on heartbeat reports
- Uses its stored view for reconnection decisions
- For new nodes, starts from latest sequence to avoid overwhelming them

2. **Compute Node's View**
- Tracks which messages it has processed from orchestrator
- Reports progress through periodic heartbeats
- Maintains local checkpoints for recovery

This independent tracking helps handle edge cases like:
- Node restarts with fresh state
- Orchestrator recovery from data loss
- Network partitions

### Key Components

1. **Event Store**
Expand Down Expand Up @@ -112,6 +132,9 @@ Local Event Store NCL Protocol Remote Node
- Includes node info, start time, and last processed sequence number
- Orchestrator validates request and accepts/rejects connection
- On acceptance, orchestrator creates dedicated data plane for node
- Orchestrator determines starting sequence based on stored state:
- For reconnecting nodes: Uses last known processed sequence
- For new nodes: Starts from latest sequence number

2. **Data Plane Setup**
- Both sides establish message subscriptions
Expand All @@ -123,8 +146,8 @@ Local Event Store NCL Protocol Remote Node

1. **Health Monitoring**
- Compute nodes send periodic heartbeats
- Include current capacity and last processed sequence
- Orchestrator tracks node health and connection state
- Include current capacity and last processed sequence number
- Orchestrator updates its view of node progress
- Missing heartbeats trigger disconnection

2. **Node Info Updates**
Expand All @@ -135,7 +158,7 @@ Local Event Store NCL Protocol Remote Node
3. **Message Flow**
- Data flows through separate control/data subjects
- Messages include sequence numbers for ordering
- Both sides track processed sequences
- Both sides track processed sequences independently
- Failed deliveries trigger automatic recovery

## Message Contracts
Expand All @@ -147,14 +170,15 @@ Local Event Store NCL Protocol Remote Node
HandshakeRequest {
NodeInfo: models.NodeInfo
StartTime: Time
LastOrchestratorSeqNum: uint64
LastOrchestratorSeqNum: uint64 // For reference only
}

// Response from orchestrator
HandshakeResponse {
Accepted: boolean
Reason: string // Only set if not accepted
LastComputeSeqNum: uint64
StartingOrchestratorSeqNum: uint64 // Determined by orchestrator
}
```

Expand All @@ -166,7 +190,7 @@ HeartbeatRequest {
NodeID: string
AvailableCapacity: Resources
QueueUsedCapacity: Resources
LastOrchestratorSeqNum: uint64
LastOrchestratorSeqNum: uint64 // Used to update orchestrator's view
}

// Acknowledgment from orchestrator
Expand Down Expand Up @@ -202,10 +226,11 @@ sequenceDiagram
C->>O: HandshakeRequest(NodeInfo, StartTime, LastSeqNum)
Note over O: Validate Node
Note over O: Determine starting sequence based on stored state
alt Valid Node
O->>O: Create Data Plane
O->>O: Setup Message Handlers
O-->>C: HandshakeResponse(Accepted=true, LastSeqNum)
O-->>C: HandshakeResponse(Accepted=true, StartingSeqNum)
Note over C: Setup Data Plane
C->>C: Start Control Plane
Expand All @@ -232,6 +257,7 @@ sequenceDiagram
Note over C,O: Periodic Health Monitoring
loop Every HeartbeatInterval
C->>O: HeartbeatRequest(NodeID, Capacity, LastSeqNum)
Note over O: Update sequence tracking
O-->>C: HeartbeatResponse()
end
end
Expand All @@ -253,6 +279,7 @@ sequenceDiagram

During regular operation:
- Heartbeats occur every HeartbeatInterval (default 15s)
- Heartbeats include last processed sequence number, allowing orchestrator to track progress
- Configuration changes trigger immediate updates
- Data plane messages flow continuously in both directions
- Both sides maintain sequence tracking and acknowledgments
Expand Down Expand Up @@ -283,7 +310,8 @@ sequenceDiagram
loop Until Connected
Note over C: Exponential Backoff
C->>O: HandshakeRequest(LastSeqNum)
O-->>C: HandshakeResponse(Accepted)
Note over O: Use stored sequence state
O-->>C: HandshakeResponse(StartingSeqNum)
end
Note over C,O: Resume from Last Checkpoint
Expand All @@ -309,10 +337,11 @@ sequenceDiagram
5. Continue normal operation

This process ensures:
- No events are lost
- No messages are lost
- Messages remain ordered
- Efficient recovery
- At-least-once delivery
- Proper handling of node restarts and state resets

## Component Dependencies

Expand Down

0 comments on commit caaf92c

Please sign in to comment.