Unpin Karpenter controller pod #207

Open
asmacdo opened this issue Nov 21, 2024 · 0 comments
asmacdo commented Nov 21, 2024

Investigate Karpenter 0.37

When redeploying LINC, Karpenter failed to bring up a new node, which presents as a user-pod launch hanging.

Useful debug information was scarce, but describing the nodeclaim and getting the logs of the Karpenter pod did indicate that something was going wrong. (The exact error messages weren't recorded, and they weren't very helpful.)
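
For anyone retracing this, the two checks mentioned above can be sketched roughly as follows. This is only a sketch: the `karpenter` namespace and the `app.kubernetes.io/name=karpenter` label selector are assumptions based on common Helm defaults, not values recorded in this issue, and `karpenter_debug` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumptions (not confirmed for this cluster): the namespace and the pod
# label selector below. Override them via the environment if they differ.
KARPENTER_NS="${KARPENTER_NS:-karpenter}"
KARPENTER_SELECTOR="${KARPENTER_SELECTOR:-app.kubernetes.io/name=karpenter}"

# Hypothetical helper: gathers the NodeClaim state and the controller logs.
karpenter_debug() {
  # NodeClaims are cluster-scoped, so no namespace flag is needed.
  kubectl get nodeclaims
  # With no name given, describe prints every NodeClaim, including events.
  kubectl describe nodeclaims
  # With a label selector, kubectl logs shows only 10 lines per pod by
  # default, so request a longer tail explicitly.
  kubectl logs -n "$KARPENTER_NS" -l "$KARPENTER_SELECTOR" --all-containers --tail=200
}
```

Sourcing the file and running `karpenter_debug` while a user pod is stuck pending should capture the error messages that were not recorded this time.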

Currently, on Dandihub we are using Karpenter controller image 0.35.0, but LINC was using 0.37.

We patched the Karpenter config in addons.tf in #205, which has temporarily mitigated the problem, but we are now pinned and should investigate what is necessary to bring this up to 0.37 (and possibly 1.0+?).

  #---------------------------------------
  # Karpenter Autoscaler for EKS Cluster
  #---------------------------------------
  enable_karpenter                  = true
  karpenter_enable_spot_termination = true
  karpenter = {
    timeout             = "300"
    repository_username = data.aws_ecrpublic_authorization_token.token.user_name
    repository_password = data.aws_ecrpublic_authorization_token.token.password
    values = [<<EOT
        controller:
          image: public.ecr.aws/karpenter/controller:0.35.0@sha256:48d1246f6b2066404e300cbf3e26d0bcdc57a76531dcb634d571f4f0e050cb57
    EOT
    ]
  }