-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add taints for AL2023 NodeGroups as tolerations for Nvidia device plugin daemonset #8627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add taints for AL2023 NodeGroups as tolerations for Nvidia device plugin daemonset #8627
Conversation
…monset to include AL2023 AMIs
|
Hello knottnt 👋 Thank you for opening a Pull Request in |
NicholasBlaskey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, left a question
| } | ||
| } | ||
| for _, ng := range n.spec.ManagedNodeGroups { | ||
| if api.HasInstanceTypeManaged(ng, instance.IsNvidiaInstanceType) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is weird to me that we even check the AMIFamily
Are there AMI families we wouldn't want to apply to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the Github Issue it looks like the original intent was to target NodeGroups with an EKS optimized AMI.
The device plugin is installed as part of create cluster and create nodegroup when an EKS-Optimized Accelerated AMI with a GPU-enabled instance type is used. If your ClusterConfig contains a single nodegroup that matches this criterion, then eksctl can apply the taints from that nodegroup as tolerations to the device plugin, and if the ClusterConfig contains multiple such nodegroups that do not all have the same taints config, eksctl can combine the set of taints config from all such nodegroups and apply them as tolerations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVIDIA device plugin should only go on the AL2023 NVIDIA accelerated AMIs at this point - the AL2 GPU AMI is reaching EOL, and Bottlerocket NVIDIA variants already provide the device plugin in the AMI
The NVIDIA device plugin isn't relevant to any other AMIs than those listed above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra info! EKS still does allow GPU enabled nodegroups to be created with AL2. Given, this we probably shouldn't remove AL2 to avoid breaking anyone still on that AMI even if it is out of support.
Description
Updated the checks against AMI family for NodeGroups and Managed NodeGroups when collecting taints to apply as tolerations to the Nvidia device plugin daemonset to include the AL2023 AMI family
Fixes: #8550
Checklist
README.md, or theuserdocsdirectory)area/nodegroup) and kind (e.g.kind/improvement)BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯