@@ -85,6 +85,26 @@ Local Event Store NCL Protocol Remote Node
85
85
└──────────────┘ └─────────────────────┘ └──────────────┘
86
86
```
87
87
88
+ ### Sequence Number Management
89
+
90
+ Each side maintains its own view of message processing:
91
+
92
+ 1 . ** Orchestrator's View**
93
+ - Tracks which messages each compute node has processed
94
+ - Updates this view based on heartbeat reports
95
+ - Uses its stored view for reconnection decisions
96
+ - For new nodes, starts from latest sequence to avoid overwhelming them
97
+
98
+ 2 . ** Compute Node's View**
99
+ - Tracks which messages it has processed from orchestrator
100
+ - Reports progress through periodic heartbeats
101
+ - Maintains local checkpoints for recovery
102
+
103
+ This independent tracking helps handle edge cases like:
104
+ - Node restarts with fresh state
105
+ - Orchestrator recovery from data loss
106
+ - Network partitions
107
+
88
108
### Key Components
89
109
90
110
1 . ** Event Store**
@@ -112,6 +132,9 @@ Local Event Store NCL Protocol Remote Node
112
132
- Includes node info, start time, and last processed sequence number
113
133
- Orchestrator validates request and accepts/rejects connection
114
134
- On acceptance, orchestrator creates dedicated data plane for node
135
+ - Orchestrator determines starting sequence based on stored state:
136
+ - For reconnecting nodes: Uses last known processed sequence
137
+ - For new nodes: Starts from latest sequence number
115
138
116
139
2 . ** Data Plane Setup**
117
140
- Both sides establish message subscriptions
@@ -123,8 +146,8 @@ Local Event Store NCL Protocol Remote Node
123
146
124
147
1 . ** Health Monitoring**
125
148
- Compute nodes send periodic heartbeats
126
- - Include current capacity and last processed sequence
127
- - Orchestrator tracks node health and connection state
149
+ - Include current capacity and last processed sequence number
150
+ - Orchestrator updates its view of node progress
128
151
- Missing heartbeats trigger disconnection
129
152
130
153
2 . ** Node Info Updates**
@@ -135,7 +158,7 @@ Local Event Store NCL Protocol Remote Node
135
158
3 . ** Message Flow**
136
159
- Data flows through separate control/data subjects
137
160
- Messages include sequence numbers for ordering
138
- - Both sides track processed sequences
161
+ - Both sides track processed sequences independently
139
162
- Failed deliveries trigger automatic recovery
140
163
141
164
## Message Contracts
@@ -147,14 +170,15 @@ Local Event Store NCL Protocol Remote Node
147
170
HandshakeRequest {
148
171
NodeInfo : models .NodeInfo
149
172
StartTime : Time
150
- LastOrchestratorSeqNum : uint64
173
+ LastOrchestratorSeqNum : uint64 // For reference only
151
174
}
152
175
153
176
// Response from orchestrator
154
177
HandshakeResponse {
155
178
Accepted : boolean
156
179
Reason : string // Only set if not accepted
157
180
LastComputeSeqNum : uint64
181
+ StartingOrchestratorSeqNum : uint64 // Determined by orchestrator
158
182
}
159
183
```
160
184
@@ -166,7 +190,7 @@ HeartbeatRequest {
166
190
NodeID : string
167
191
AvailableCapacity : Resources
168
192
QueueUsedCapacity : Resources
169
- LastOrchestratorSeqNum : uint64
193
+ LastOrchestratorSeqNum : uint64 // Used to update orchestrator's view
170
194
}
171
195
172
196
// Acknowledgment from orchestrator
@@ -202,10 +226,11 @@ sequenceDiagram
202
226
C->>O: HandshakeRequest(NodeInfo, StartTime, LastSeqNum)
203
227
204
228
Note over O: Validate Node
229
+ Note over O: Determine starting sequence based on stored state
205
230
alt Valid Node
206
231
O->>O: Create Data Plane
207
232
O->>O: Setup Message Handlers
208
- O-->>C: HandshakeResponse(Accepted=true, LastSeqNum )
233
+ O-->>C: HandshakeResponse(Accepted=true, StartingSeqNum )
209
234
210
235
Note over C: Setup Data Plane
211
236
C->>C: Start Control Plane
@@ -232,6 +257,7 @@ sequenceDiagram
232
257
Note over C,O: Periodic Health Monitoring
233
258
loop Every HeartbeatInterval
234
259
C->>O: HeartbeatRequest(NodeID, Capacity, LastSeqNum)
260
+ Note over O: Update sequence tracking
235
261
O-->>C: HeartbeatResponse()
236
262
end
237
263
end
@@ -253,6 +279,7 @@ sequenceDiagram
253
279
254
280
During regular operation:
255
281
- Heartbeats occur every HeartbeatInterval (default 15s)
282
+ - Heartbeats include last processed sequence number, allowing orchestrator to track progress
256
283
- Configuration changes trigger immediate updates
257
284
- Data plane messages flow continuously in both directions
258
285
- Both sides maintain sequence tracking and acknowledgments
@@ -283,7 +310,8 @@ sequenceDiagram
283
310
loop Until Connected
284
311
Note over C: Exponential Backoff
285
312
C->>O: HandshakeRequest(LastSeqNum)
286
- O-->>C: HandshakeResponse(Accepted)
313
+ Note over O: Use stored sequence state
314
+ O-->>C: HandshakeResponse(StartingSeqNum)
287
315
end
288
316
289
317
Note over C,O: Resume from Last Checkpoint
@@ -309,10 +337,11 @@ sequenceDiagram
309
337
5 . Continue normal operation
310
338
311
339
This process ensures:
312
- - No events are lost
340
+ - No messages are lost
313
341
- Messages remain ordered
314
342
- Efficient recovery
315
343
- At-least-once delivery
344
+ - Proper handling of node restarts and state resets
316
345
317
346
## Component Dependencies
318
347
0 commit comments