
Resiliency in shared mode #170

Open
jp-gouin opened this issue Jan 3, 2025 · 2 comments
Labels: bug (Something isn't working) · priority/0 (high priority) · virtual-kubelet (all virtual kubelet related issues)

Comments

jp-gouin (Collaborator) commented Jan 3, 2025

If the server pod IP changes (pod killed, rescheduled, node restarted, ...), the server fails to come up again:

time="2025-01-03T15:29:35Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3k-mycluster-server-0-dc22a3d8=https://10.244.3.241:2380], expect: k3k-mycluster-server-0-dc22a3d8=https://10.244.3.243:2380"

Shared mode uses the embedded ETCD, which registers the member with the local IP (the Pod IP). When the Pod IP changes, ETCD fails to start.

Potential solutions

Set the proper ETCD startup configuration using the headless service (or services):

etcd-arg:
  - --initial-cluster=... 
  - --advertise-client-urls=...
  - --initial-advertise-peer-urls=...

This remains challenging due to the embedded nature of ETCD: the server pod is not considered running until ETCD is running, and ETCD won't start because DNS resolution of the headless service will fail until then.
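
For illustration, a more concrete version of that configuration might look like the sketch below. The headless service name (k3k-mycluster-server-headless), the namespace (k3k-mycluster), and the client port are assumptions, and the member name on the left of --initial-cluster would still have to match the one K3s generates (e.g. k3k-mycluster-server-0-dc22a3d8 in the log above).

# Hypothetical K3s config.yaml snippet: advertise stable DNS names instead of the Pod IP.
# Service name, namespace, and member name are assumptions, not the actual k3k values.
etcd-arg:
  - --initial-cluster=k3k-mycluster-server-0-dc22a3d8=https://k3k-mycluster-server-0.k3k-mycluster-server-headless.k3k-mycluster.svc.cluster.local:2380
  - --advertise-client-urls=https://k3k-mycluster-server-0.k3k-mycluster-server-headless.k3k-mycluster.svc.cluster.local:2379
  - --initial-advertise-peer-urls=https://k3k-mycluster-server-0.k3k-mycluster-server-headless.k3k-mycluster.svc.cluster.local:2380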

Use SQLite in shared mode instead of ETCD

This would require some rework of the bootstrap part, which the kubelet currently relies on to connect to the cluster.
We could use --write-kubeconfig to directly store the kubeconfig for the kubelet (see the sketch below).
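
A minimal sketch of that idea, assuming the server is started without --cluster-init so that K3s falls back to its default SQLite datastore; the kubeconfig path and file mode below are assumptions:

# Hypothetical K3s config.yaml snippet: no embedded ETCD, write the admin kubeconfig
# to a path the virtual kubelet could read (path and mode are assumptions).
write-kubeconfig: /var/lib/rancher/k3s/server/kubeconfig.yaml
write-kubeconfig-mode: "0644"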

jp-gouin added the bug (Something isn't working), virtual-kubelet (all virtual kubelet related issues), and priority/0 (high priority) labels Jan 3, 2025
enrichman self-assigned this Jan 13, 2025
enrichman (Collaborator) commented

While investigating this issue I've seen that one of the problems is related to the bootstrap secret: when the server pod is restarted, the old secret is lost and recreated.

Another issue is that deleting the bootstrap secret will not recreate it.

I'm trying to see whether mounting an external secret to store the certs is possible, and whether it works. If so, this should solve most of the issues. We will still need to think about a way to provide already existing or custom certs, though.
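
For illustration only, the mount being considered might look roughly like the following excerpt of the server pod spec; the Secret name and the mount path are assumptions:

# Hypothetical excerpt of the server pod spec: certs come from a pre-existing Secret
# so they survive pod restarts. Names and paths below are assumptions.
volumes:
  - name: bootstrap-certs
    secret:
      secretName: k3k-mycluster-bootstrap
containers:
  - name: k3k-server
    volumeMounts:
      - name: bootstrap-certs
        mountPath: /var/lib/rancher/k3s/server/tls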

enrichman (Collaborator) commented

While trying to watch an owned resource (a Secret), I'm facing an infinite reconciliation loop.

After a bit of investigation I've seen that this is because the webhook certs are updated on every loop. At the moment the reconciliation is pretty convoluted, and adding the fix is not trivial, or even possible, without a small refactor.

I'll open a small PR with some of these changes. Nice related article: https://ahmet.im/blog/controller-pitfalls/
