[EPIC] Resiliency in shared mode #170

@jp-gouin

Description

The kubelet and the server should be resilient to pod restarts.
The kubelet has issues restarting because of the webhook cert.
The server has issues restarting because of the bootstrap secret.

Issue encountered while using persistence.type: dynamic to persist ETCD data:

If the server pod IP changes (pod killed, rescheduled, node restarted, ...), the server fails to come up again:

time="2025-01-03T15:29:35Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3k-mycluster-server-0-dc22a3d8=https://10.244.3.241:2380], expect: k3k-mycluster-server-0-dc22a3d8=https://10.244.3.243:2380"

Shared mode uses the embedded ETCD, which uses the local IP (Pod IP) to register the member. When the Pod IP changes, ETCD fails to start.
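
To make the failure concrete, here is a minimal sketch of the kind of membership comparison that the log message implies. It is illustrative only and is not the actual k3s or etcd code; the function names `parse_members` and `is_member` are hypothetical. The member strings are copied from the log above.

```python
def parse_members(member_list: str) -> dict[str, str]:
    """Parse a 'name=url,name=url' string into a {name: url} mapping."""
    members = {}
    for entry in member_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first '=' only; the URL itself contains no '='.
        name, _, url = entry.partition("=")
        members[name] = url
    return members


def is_member(found: str, expected: str) -> bool:
    """Return True only if every expected name=url pair is registered as-is."""
    found_members = parse_members(found)
    expected_members = parse_members(expected)
    return all(found_members.get(name) == url for name, url in expected_members.items())


# Values taken from the log line: same member name, different Pod IP.
found = "k3k-mycluster-server-0-dc22a3d8=https://10.244.3.241:2380"
expected = "k3k-mycluster-server-0-dc22a3d8=https://10.244.3.243:2380"

print(is_member(found, expected))  # False: the peer URL stored in ETCD still has the old IP
```

The member name survives the restart because the data is persisted, but the peer URL recorded in the data store no longer matches the one derived from the new Pod IP.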

Potential solutions

Set the proper ETCD startup config using the headless service or services

etcd-arg:
  - --initial-cluster=... 
  - --advertise-client-urls=...
  - --initial-advertise-peer-urls=...

This remains challenging due to the embedded nature of ETCD. The server pod is not considered running until ETCD is running, and ETCD won't start because DNS resolution will fail.
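
As a rough sketch of what DNS-based values for these flags could look like, the snippet below builds them from the standard Kubernetes StatefulSet pod DNS pattern (`<pod>.<service>.<namespace>.svc.<cluster-domain>`). Everything specific here is an assumption: the helper `stable_etcd_args`, the headless service name, the namespace, and the pod name are hypothetical, and the default ETCD ports 2379 (client) and 2380 (peer) are assumed. Whether the embedded ETCD accepts these overrides is not verified, and the readiness problem described above still applies.

```python
def stable_etcd_args(members, service, namespace, self_pod, cluster_domain="cluster.local"):
    """Build etcd-arg values that use stable DNS names instead of Pod IPs.

    members: list of (etcd_member_name, pod_name) tuples.
    self_pod: pod name of the server these arguments are generated for.
    """
    def host(pod):
        # Per-pod DNS record published by a headless service for a StatefulSet.
        return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"

    initial_cluster = ",".join(
        f"{member}=https://{host(pod)}:2380" for member, pod in members
    )
    return [
        f"--initial-cluster={initial_cluster}",
        f"--advertise-client-urls=https://{host(self_pod)}:2379",
        f"--initial-advertise-peer-urls=https://{host(self_pod)}:2380",
    ]


# Hypothetical names; only the member name comes from the log above.
args = stable_etcd_args(
    members=[("k3k-mycluster-server-0-dc22a3d8", "k3k-mycluster-server-0")],
    service="k3k-mycluster-server-headless",
    namespace="k3k-mycluster",
    self_pod="k3k-mycluster-server-0",
)
for arg in args:
    print(arg)
```

Because the peer URL is derived from the pod name rather than the Pod IP, it would stay the same across reschedules, provided the name resolves at the time ETCD starts.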

Use SQLite in shared mode instead of ETCD

This will require some rework of the bootstrap part, which the kubelet currently relies on to connect to the cluster.
We could use --write-kubeconfig to store the kubeconfig for the kubelet directly.
