Make cluster controller highly-available #1815
Comments
One idea could be the following: All nodes that run the The way a candidate becomes leader is to write their generational node id, an incrementing epoch and a timestamp to the While acting as the leading cluster controller, the leader monitors for Control messages that are being sent to a With this approach, there is the problem of a possible leadership ping-pong if two admin nodes are partitioned from each other. Ways to mitigate this problem include introducing a grace period before a candidate runs for leadership again, or sharing liveness information with other nodes to make the liveness mechanism more robust. The leadership selection would benefit from a more refined heartbeat mechanism that generates fewer false positives. The leadership information does not have to be written to the
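The grace-period mitigation mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LeaderRecord` type, its field names, the `should_campaign` and `claim` helpers, and the use of the timestamp as a staleness signal are all hypothetical and not taken from the Restate code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaderRecord:
    # The three pieces of information named in the comment; field names are hypothetical.
    node_id: str      # generational node id of the current leader
    epoch: int        # incremented on every leadership change
    timestamp: float  # seconds; assumed to be refreshed by a live leader


def should_campaign(
    record: Optional[LeaderRecord],
    now: float,
    stale_after: float,
    last_deposed_at: Optional[float],
    grace_period: float,
) -> bool:
    """Decide whether this candidate may run for leadership right now."""
    # A candidate that recently lost leadership waits out the grace period,
    # which dampens ping-pong between two partitioned admin nodes.
    if last_deposed_at is not None and now - last_deposed_at < grace_period:
        return False
    # No leader has ever been recorded: free to run.
    if record is None:
        return True
    # Assumption: a record that has not been refreshed for `stale_after`
    # seconds means the leader is considered dead.
    return now - record.timestamp > stale_after


def claim(record: Optional[LeaderRecord], node_id: str, now: float) -> LeaderRecord:
    """Build the record a winning candidate would write (epoch strictly increases)."""
    next_epoch = 1 if record is None else record.epoch + 1
    return LeaderRecord(node_id=node_id, epoch=next_epoch, timestamp=now)
```

A longer grace period reduces flapping at the cost of a slower takeover after a genuine failure, so the two timeouts would need to be tuned together.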
Discussion Summary
Phase 1
Phase 2
For the preview version we have done everything we need.
In order to tolerate the loss of a cluster controller, we need another cluster controller taking over. Otherwise, we risk that the Restate cluster becomes unavailable because the cluster controller is responsible for electing new Restate leaders. One high-level idea could be that the nodes running a cluster controller gossip among each other to notice if a cluster controller goes down. Additionally, the cluster controllers could obtain a leader epoch from the metadata store to decide who is the current leader. Only after the current leader is thought to be dead, another cluster controller would start campaigning for the leadership by obtaining a higher leader epoch and telling the others about it.
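As a minimal sketch of the epoch-based campaign described above, the following uses an in-memory stand-in for the metadata store. `InMemoryMetadataStore`, `compare_and_set`, `campaign`, and the `leader_is_alive` callback (standing in for the gossip-based failure detection) are hypothetical names chosen for illustration; the real metadata store API may look quite different.

```python
import threading
from typing import Callable, Optional, Tuple


class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata store with conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._epoch = 0
        self._leader: Optional[str] = None

    def read(self) -> Tuple[int, Optional[str]]:
        with self._lock:
            return self._epoch, self._leader

    def compare_and_set(self, expected_epoch: int, candidate: str) -> bool:
        # The write succeeds only if nobody bumped the epoch in the meantime,
        # so at most one candidate can win a given epoch.
        with self._lock:
            if self._epoch != expected_epoch:
                return False
            self._epoch = expected_epoch + 1
            self._leader = candidate
            return True


def campaign(
    store: InMemoryMetadataStore,
    candidate: str,
    leader_is_alive: Callable[[str], bool],
) -> Optional[int]:
    """Try to become leader; return the new leader epoch, or None if not elected."""
    epoch, leader = store.read()
    # Only campaign once the current leader is thought to be dead.
    if leader is not None and leader_is_alive(leader):
        return None
    # Obtain a higher leader epoch; losing the race means another candidate won.
    if store.compare_and_set(epoch, candidate):
        return epoch + 1
    return None
```

After a successful campaign, the winner would tell the other cluster controllers about its new epoch, and they would ignore any leader claiming a lower one.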
Tasks