Contributor

@knottnt knottnt commented Dec 11, 2025

Description

Updated the AMI family checks for NodeGroups and Managed NodeGroups so that the AL2023 AMI family is included when collecting taints to apply as tolerations to the Nvidia device plugin daemonset.

Fixes: #8550

Checklist

  • Added tests that cover your change (if possible) yes, added unit tests
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested, yes
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@github-actions
Contributor

Hello knottnt 👋 Thank you for opening a Pull Request in the eksctl project. The team will review the Pull Request and aim to respond within 1-10 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.

@knottnt knottnt changed the title from "Apply NodeGroup taints for Nvidia device plugin daemonset to include AL2023 AMIs" to "Add taints for AL2023 NodeGroups as tolerations for Nvidia device plugin daemonset" Dec 11, 2025
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

lgtm, left a question

}
}
for _, ng := range n.spec.ManagedNodeGroups {
if api.HasInstanceTypeManaged(ng, instance.IsNvidiaInstanceType) &&
Contributor

It is weird to me that we even check the AMIFamily.

Are there AMI families we wouldn't want to apply this to?

Contributor Author

From the GitHub issue, it looks like the original intent was to target NodeGroups with an EKS-optimized AMI.

The device plugin is installed as part of create cluster and create nodegroup when an EKS-Optimized Accelerated AMI with a GPU-enabled instance type is used. If your ClusterConfig contains a single nodegroup that matches this criterion, then eksctl can apply the taints from that nodegroup as tolerations to the device plugin, and if the ClusterConfig contains multiple such nodegroups that do not all have the same taints config, eksctl can combine the set of taints config from all such nodegroups and apply them as tolerations.
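As a rough illustration of the behavior the quoted passage describes, the sketch below merges the taints of several nodegroups into one de-duplicated set of tolerations. It is not eksctl's implementation: the `Taint`, `Toleration`, and `NodeGroup` types and the `collectTolerations` function are hypothetical stand-ins, the nodegroups are assumed to have already passed the GPU/AMI check, and mapping every taint to an `Equal` toleration is an assumption.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes core/v1
// types, defined here only to keep the sketch self-contained.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// NodeGroup is a hypothetical, trimmed-down nodegroup; it is not eksctl's type.
type NodeGroup struct {
	Name   string
	Taints []Taint
}

// collectTolerations combines the taints of all given nodegroups, drops
// duplicates, and converts each remaining taint into a matching toleration.
// Assumption: the caller passes only nodegroups that met the selection criterion.
func collectTolerations(ngs []NodeGroup) []Toleration {
	seen := make(map[Taint]bool)
	var out []Toleration
	for _, ng := range ngs {
		for _, t := range ng.Taints {
			if seen[t] {
				continue // the same taint was already contributed by another nodegroup
			}
			seen[t] = true
			out = append(out, Toleration{
				Key:      t.Key,
				Operator: "Equal", // assumption: match key and value exactly
				Value:    t.Value,
				Effect:   t.Effect,
			})
		}
	}
	return out
}

func main() {
	gpuTaint := Taint{Key: "nvidia.com/gpu", Value: "true", Effect: "NoSchedule"}
	ngs := []NodeGroup{
		{Name: "gpu-a", Taints: []Taint{gpuTaint}},
		{Name: "gpu-b", Taints: []Taint{gpuTaint, {Key: "team", Value: "ml", Effect: "NoExecute"}}},
	}
	for _, tol := range collectTolerations(ngs) {
		fmt.Printf("%+v\n", tol)
	}
}
```

Using the taint struct itself as the map key keeps the de-duplication order-preserving, so the resulting toleration list is stable across runs for the same config.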

Contributor

@bryantbiggs bryantbiggs Dec 11, 2025

The NVIDIA device plugin should only go on the AL2023 NVIDIA accelerated AMIs at this point: the AL2 GPU AMI is reaching EOL, and Bottlerocket NVIDIA variants already provide the device plugin in the AMI.

The NVIDIA device plugin isn't relevant to any AMIs other than those listed above.

Contributor Author

Thanks for the extra info! EKS does still allow GPU-enabled nodegroups to be created with AL2. Given this, we probably shouldn't remove AL2, to avoid breaking anyone still on that AMI even if it is out of support.
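The position reached in this thread could be expressed as a small predicate, sketched here under stated assumptions: AL2 stays in for backward compatibility, AL2023 is added, and Bottlerocket is left out because its NVIDIA variants already ship the device plugin. The function `needsDevicePluginTolerations` and the family name strings are hypothetical and do not reproduce the actual check in the diff.

```go
package main

import "fmt"

// Assumed AMI family names for illustration; eksctl's real constants may differ.
const (
	familyAL2          = "AmazonLinux2"
	familyAL2023       = "AmazonLinux2023"
	familyBottlerocket = "Bottlerocket"
)

// needsDevicePluginTolerations reports whether a nodegroup's taints should be
// collected as tolerations for the device plugin daemonset.
// isNvidia is assumed to be computed elsewhere from the instance type.
func needsDevicePluginTolerations(amiFamily string, isNvidia bool) bool {
	if !isNvidia {
		return false
	}
	switch amiFamily {
	case familyAL2, familyAL2023:
		// AL2 is kept so existing users are not broken; AL2023 is the new case.
		return true
	default:
		// For example, Bottlerocket NVIDIA variants already include the plugin.
		return false
	}
}

func main() {
	for _, family := range []string{familyAL2, familyAL2023, familyBottlerocket} {
		fmt.Printf("%s: %v\n", family, needsDevicePluginTolerations(family, true))
	}
}
```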

@NicholasBlaskey NicholasBlaskey merged commit efcb779 into eksctl-io:main Dec 11, 2025
8 of 13 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] AL2023 AMI breaks toleration support

3 participants