Instance in machine pool failed to join cluster with error bootstrap token not found #11034

Open
archerwu9425 opened this issue Aug 12, 2024 · 3 comments · May be fixed by #11037
Labels
area/bootstrap Issues or PRs related to bootstrap providers area/machinepool Issues or PRs related to machinepools kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@archerwu9425
Contributor

What steps did you take and what happened?

After I used clusterctl move to migrate an existing workload cluster to a new management cluster, instances in the AWS machine pool failed to join the cluster, and EC2 instances were created and terminated in an endless loop.

Error log found in the kubeadm bootstrap controller:

E0802 08:52:55.552210       1 controller.go:329] "Reconciler error" err="failed to get bootstrap token secret in order to refresh it: secrets \"bootstrap-token-a0o08u\" not found" controller="kubeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="KubeadmConfig" KubeadmConfig="kubed-08/kubed-08-worker-29521" namespace="kubed-08" name="kubed-08-worker-29521" reconcileID="e7d2eb93-428f-4d9c-b64c-95f639a586ff"

The root cause appears to be:

  1. During clusterctl move, the cluster's paused field is set and reconciliation stops.
  2. Due to a provider version issue, the move failed the first time and took longer than usual.
  3. The bootstrap controller uses the default bootstrap token TTL of 15 minutes, so the token expired and was deleted in the workload cluster during the paused period.
  4. The machine pool size range is 0-10, and we use cluster autoscaler in the workload cluster. It scaled the machine pool up from 0 to 1 during the paused period, so the MachinePool has replicas set to 1 but no nodeRef in its status. See this code block: https://github.com/kubernetes-sigs/cluster-api/blob/v1.7.4/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L274-L280

What did you expect to happen?

In the refreshBootstrapTokenIfNeeded function, if the token is not found, the controller should create a new one instead of just returning an error:
https://github.com/kubernetes-sigs/cluster-api/blob/v1.7.4/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L326-L329

Cluster API version

v1.7.4

Kubernetes version

v1.27.12

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@archerwu9425
Contributor Author

/area bootstrap

@k8s-ci-robot k8s-ci-robot added the area/bootstrap Issues or PRs related to bootstrap providers label Aug 12, 2024
@sbueringer sbueringer added the area/machinepool Issues or PRs related to machinepools label Aug 12, 2024
@sbueringer sbueringer added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 21, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Aug 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2024
@chrischdi chrischdi changed the title Instance in machine pool failed to join cluster withe error bootstrap token not found Instance in machine pool failed to join cluster with error bootstrap token not found Nov 20, 2024
4 participants