Skip to content

Commit 6add731

Browse files
authored
Merge branch 'main' into eng-471-fix-concurrent-modification-detected-test
2 parents fbd8819 + caaf92c commit 6add731

File tree

1 file changed

+37
-8
lines changed

1 file changed

+37
-8
lines changed

pkg/transport/nclprotocol/README.md

+37-8
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,26 @@ Local Event Store NCL Protocol Remote Node
8585
└──────────────┘ └─────────────────────┘ └──────────────┘
8686
```
8787

88+
### Sequence Number Management
89+
90+
Each side maintains its own view of message processing:
91+
92+
1. **Orchestrator's View**
93+
- Tracks which messages each compute node has processed
94+
- Updates this view based on heartbeat reports
95+
- Uses its stored view for reconnection decisions
96+
- For new nodes, starts from latest sequence to avoid overwhelming them
97+
98+
2. **Compute Node's View**
99+
- Tracks which messages it has processed from orchestrator
100+
- Reports progress through periodic heartbeats
101+
- Maintains local checkpoints for recovery
102+
103+
This independent tracking helps handle edge cases like:
104+
- Node restarts with fresh state
105+
- Orchestrator recovery from data loss
106+
- Network partitions
107+
88108
### Key Components
89109

90110
1. **Event Store**
@@ -112,6 +132,9 @@ Local Event Store NCL Protocol Remote Node
112132
- Includes node info, start time, and last processed sequence number
113133
- Orchestrator validates request and accepts/rejects connection
114134
- On acceptance, orchestrator creates dedicated data plane for node
135+
- Orchestrator determines starting sequence based on stored state:
136+
- For reconnecting nodes: Uses last known processed sequence
137+
- For new nodes: Starts from latest sequence number
115138

116139
2. **Data Plane Setup**
117140
- Both sides establish message subscriptions
@@ -123,8 +146,8 @@ Local Event Store NCL Protocol Remote Node
123146

124147
1. **Health Monitoring**
125148
- Compute nodes send periodic heartbeats
126-
- Include current capacity and last processed sequence
127-
- Orchestrator tracks node health and connection state
149+
- Include current capacity and last processed sequence number
150+
- Orchestrator updates its view of node progress
128151
- Missing heartbeats trigger disconnection
129152

130153
2. **Node Info Updates**
@@ -135,7 +158,7 @@ Local Event Store NCL Protocol Remote Node
135158
3. **Message Flow**
136159
- Data flows through separate control/data subjects
137160
- Messages include sequence numbers for ordering
138-
- Both sides track processed sequences
161+
- Both sides track processed sequences independently
139162
- Failed deliveries trigger automatic recovery
140163

141164
## Message Contracts
@@ -147,14 +170,15 @@ Local Event Store NCL Protocol Remote Node
147170
HandshakeRequest {
148171
NodeInfo: models.NodeInfo
149172
StartTime: Time
150-
LastOrchestratorSeqNum: uint64
173+
LastOrchestratorSeqNum: uint64 // For reference only
151174
}
152175

153176
// Response from orchestrator
154177
HandshakeResponse {
155178
Accepted: boolean
156179
Reason: string // Only set if not accepted
157180
LastComputeSeqNum: uint64
181+
StartingOrchestratorSeqNum: uint64 // Determined by orchestrator
158182
}
159183
```
160184

@@ -166,7 +190,7 @@ HeartbeatRequest {
166190
NodeID: string
167191
AvailableCapacity: Resources
168192
QueueUsedCapacity: Resources
169-
LastOrchestratorSeqNum: uint64
193+
LastOrchestratorSeqNum: uint64 // Used to update orchestrator's view
170194
}
171195

172196
// Acknowledgment from orchestrator
@@ -202,10 +226,11 @@ sequenceDiagram
202226
C->>O: HandshakeRequest(NodeInfo, StartTime, LastSeqNum)
203227
204228
Note over O: Validate Node
229+
Note over O: Determine starting sequence based on stored state
205230
alt Valid Node
206231
O->>O: Create Data Plane
207232
O->>O: Setup Message Handlers
208-
O-->>C: HandshakeResponse(Accepted=true, LastSeqNum)
233+
O-->>C: HandshakeResponse(Accepted=true, StartingSeqNum)
209234
210235
Note over C: Setup Data Plane
211236
C->>C: Start Control Plane
@@ -232,6 +257,7 @@ sequenceDiagram
232257
Note over C,O: Periodic Health Monitoring
233258
loop Every HeartbeatInterval
234259
C->>O: HeartbeatRequest(NodeID, Capacity, LastSeqNum)
260+
Note over O: Update sequence tracking
235261
O-->>C: HeartbeatResponse()
236262
end
237263
end
@@ -253,6 +279,7 @@ sequenceDiagram
253279

254280
During regular operation:
255281
- Heartbeats occur every HeartbeatInterval (default 15s)
282+
- Heartbeats include last processed sequence number, allowing orchestrator to track progress
256283
- Configuration changes trigger immediate updates
257284
- Data plane messages flow continuously in both directions
258285
- Both sides maintain sequence tracking and acknowledgments
@@ -283,7 +310,8 @@ sequenceDiagram
283310
loop Until Connected
284311
Note over C: Exponential Backoff
285312
C->>O: HandshakeRequest(LastSeqNum)
286-
O-->>C: HandshakeResponse(Accepted)
313+
Note over O: Use stored sequence state
314+
O-->>C: HandshakeResponse(StartingSeqNum)
287315
end
288316
289317
Note over C,O: Resume from Last Checkpoint
@@ -309,10 +337,11 @@ sequenceDiagram
309337
5. Continue normal operation
310338

311339
This process ensures:
312-
- No events are lost
340+
- No messages are lost
313341
- Messages remain ordered
314342
- Efficient recovery
315343
- At-least-once delivery
344+
- Proper handling of node restarts and state resets
316345

317346
## Component Dependencies
318347

0 commit comments

Comments
 (0)