Make cluster controller highly-available #1815

Closed · 2 tasks done · Tracked by #1675
tillrohrmann opened this issue Aug 9, 2024 · 3 comments

tillrohrmann (Contributor) commented Aug 9, 2024

In order to tolerate the loss of a cluster controller, we need another cluster controller to take over. Otherwise, we risk the Restate cluster becoming unavailable, because the cluster controller is responsible for electing new Restate leaders. One high-level idea could be that the nodes running a cluster controller gossip among each other to notice when a cluster controller goes down. Additionally, the cluster controllers could obtain a leader epoch from the metadata store to decide who the current leader is. Only after the current leader is believed to be dead would another cluster controller start campaigning for leadership by obtaining a higher leader epoch and announcing it to the others.
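
To make the leader-epoch idea concrete, here is a minimal sketch in Rust. The `MetadataStore` trait, the `LeaderState` type and the key name are hypothetical and only illustrate the pattern of claiming a higher epoch with a conditional (versioned) write; they are not Restate's actual APIs.

```rust
use std::time::SystemTime;

#[derive(Clone, Debug)]
struct LeaderState {
    node_id: String,        // generational node id of the leader
    epoch: u64,             // monotonically increasing leader epoch
    elected_at: SystemTime, // when this leader claimed the epoch
}

trait MetadataStore {
    /// Returns the current value together with a version usable for conditional writes.
    fn read(&self, key: &str) -> Option<(LeaderState, u64)>;
    /// Writes `value` only if the stored version still equals `expected_version`.
    fn write_if_version(&self, key: &str, value: LeaderState, expected_version: u64) -> bool;
}

/// Campaign for leadership by claiming a higher epoch; returns the new state on success.
fn campaign(store: &dyn MetadataStore, my_node_id: &str) -> Option<LeaderState> {
    // Treat a missing key as epoch 0 / version 0 in this sketch.
    let (current_epoch, version) = match store.read("cluster_controller/leader") {
        Some((state, version)) => (state.epoch, version),
        None => (0, 0),
    };
    let candidate = LeaderState {
        node_id: my_node_id.to_owned(),
        epoch: current_epoch + 1,
        elected_at: SystemTime::now(),
    };
    // At most one candidate wins this conditional write; losers observe the new
    // leader on their next read and go back to following it.
    store
        .write_if_version("cluster_controller/leader", candidate.clone(), version)
        .then_some(candidate)
}
```

The conditional write is what prevents two candidates from both successfully claiming the same epoch.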

Tasks

  1. muhamadazmy
  2. muhamadazmy
tillrohrmann (Contributor, Author) commented Nov 6, 2024

One idea could be the following:

All nodes that run the Admin role will be cluster controller candidates. The cluster controller candidates heartbeat each other and discover each other through the NodesConfiguration. When the current cluster controller leader does not respond to heartbeats, the other candidates can try to become the leader.
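
A rough sketch of the failure-detection side, assuming a fixed heartbeat interval and timeout (the values and the `LeaderMonitor` type are illustrative; the actual candidate set would come from the NodesConfiguration):

```rust
use std::time::{Duration, Instant};

const HEARTBEAT_INTERVAL: Duration = Duration::from_millis(500); // assumed value
const LEADER_TIMEOUT: Duration = Duration::from_secs(3);         // assumed value

/// Tracks when the current leader last answered a heartbeat.
struct LeaderMonitor {
    last_ack: Instant,
}

impl LeaderMonitor {
    fn new() -> Self {
        Self { last_ack: Instant::now() }
    }

    /// Called by the heartbeat loop whenever the leader responds.
    fn record_ack(&mut self) {
        self.last_ack = Instant::now();
    }

    /// The leader is suspected dead once it has missed heartbeats for longer
    /// than the timeout; only then does this candidate start campaigning.
    fn leader_suspected_dead(&self) -> bool {
        self.last_ack.elapsed() > LEADER_TIMEOUT
    }
}
```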

The way a candidate becomes leader is to write its generational node id, an incremented epoch and a timestamp to the NodesConfiguration. The candidate that wins the race to update the NodesConfiguration with its leader information becomes the new leader. The other candidates start heartbeating the new leader, waiting for the chance to step up.

While acting as the leading cluster controller, the leader monitors the NodesConfiguration for changes in order to step down if it is no longer the leading cluster controller.

Control messages sent to a node can contain the leader epoch. The recipient can use it to filter out messages from outdated leaders.
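
On the receiving side, the fencing could look like the following sketch (the `ControlMessage` shape is an assumption; only the epoch comparison matters):

```rust
/// Hypothetical control message carrying the sender's leader epoch.
struct ControlMessage {
    leader_epoch: u64,
    // payload fields elided
}

/// Drops messages whose epoch is lower than the highest epoch seen so far.
struct EpochFilter {
    highest_seen_epoch: u64,
}

impl EpochFilter {
    fn accept(&mut self, msg: &ControlMessage) -> bool {
        if msg.leader_epoch < self.highest_seen_epoch {
            // Message from an outdated leader; ignore it.
            return false;
        }
        self.highest_seen_epoch = msg.leader_epoch;
        true
    }
}
```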

With this approach, there is the problem of a possible leadership ping-pong if two admin nodes are partitioned from each other. Ways to mitigate this problem are to introduce a grace period before a candidate runs again for leadership, or to share liveness information with other nodes to obtain a more robust liveness mechanism. The leader selection would also benefit from a more refined heartbeat mechanism that generates fewer false positives.
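
The grace-period mitigation could be as simple as the following sketch (the duration is an assumed tuning knob, not a value from this discussion):

```rust
use std::time::{Duration, Instant};

const RECAMPAIGN_GRACE_PERIOD: Duration = Duration::from_secs(10); // assumed value

/// After losing or giving up leadership, a candidate waits before running again.
#[derive(Default)]
struct CampaignThrottle {
    lost_leadership_at: Option<Instant>,
}

impl CampaignThrottle {
    fn on_leadership_lost(&mut self) {
        self.lost_leadership_at = Some(Instant::now());
    }

    /// A candidate may only campaign again once the grace period has elapsed.
    fn may_campaign(&self) -> bool {
        match self.lost_leadership_at {
            Some(at) => at.elapsed() >= RECAMPAIGN_GRACE_PERIOD,
            None => true,
        }
    }
}
```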

The leadership information does not have to be written to the NodesConfiguration. Writing it to the NodesConfiguration has the advantage that this information is automatically spread throughout the cluster.

muhamadazmy (Contributor) commented:

Discussion Summary

Phase 1

  • Cluster Controller(s) continue to collect all node state (observed state) by means of heartbeats
  • If a CC (cluster controller) holds the lowest node id of all nodes with the admin role, the CC assumes itself to be the leader (see the sketch after this list)
    • Otherwise it takes no action on the current cluster state
  • The collected state is verified to be a valid state. This avoids unnecessary movements or changes even if it doesn't match the CC's target state, so as not to conflict with other possibly running CCs that think they are leaders
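
A minimal sketch of the Phase 1 rule (the `NodeEntry` type and the admin-role flag are stand-ins for the real NodesConfiguration entries):

```rust
type NodeId = u32; // stand-in for the real node id type

struct NodeEntry {
    id: NodeId,
    has_admin_role: bool,
}

/// A CC considers itself the leader iff its node id is the lowest among admin nodes.
fn is_leader(my_id: NodeId, nodes: &[NodeEntry]) -> bool {
    nodes
        .iter()
        .filter(|n| n.has_admin_role)
        .map(|n| n.id)
        .min()
        .map_or(false, |lowest| lowest == my_id)
}
```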

Phase 2

  • Implement a gossip protocol. All nodes will have an observed view of the cluster; this view can then be used by other components in the system, including the CC on that node (if it runs the admin role)
  • Drop the Scheduler Plan from the metadata store? (As far as I understand, this won't be needed since all nodes will already have the observed plan.)
  • Drop AttachRequest (not sure what it does yet)

tillrohrmann (Contributor, Author) commented:

For the preview version we have done everything we need.
