-
Notifications
You must be signed in to change notification settings - Fork 24
Add gpu-taint-management bundle and resources #748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
echo "No GPU product label found on node $NODE_NAME, skipping tainting" | ||
fi | ||
sleep 3600 | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an opinion, not a request for a change:
I like to put scripts in separate files and the build the configmap using a configMapGenerator
. By having the script in a file by itself, rather than embedded in a YAML document, your editor/IDE can apply appropriate syntax checking and formatting.
configMapGenerator:
- name: gpu-product-taint-script
options:
disableNameSuffixHash: true
files:
- files/taint.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if instead of using a Daemonset with a script running for each node if it would make more sense to have a single instance using a kubernetes watch, something like:
#!/bin/bash
while true; do
# watch for new/modified nodes
kubectl get node -l node-role.kubernetes.io/worker -o name -w | while read NODE_NAME; do
echo "Checking node: $NODE_NAME"
PRODUCT=$(kubectl get "$NODE_NAME" -o jsonpath="{.metadata.labels['nvidia\.com/gpu\.product']}")
if [ -n "$PRODUCT" ]; then
echo "Found GPU product label: $PRODUCT on $NODE_NAME"
kubectl taint "$NODE_NAME" "nvidia.com/gpu.product=$PRODUCT:NoSchedule"
else
echo "No GPU product label found on node $NODE_NAME, skipping tainting"
fi
done
# watch expired, pause before restarting
sleep 10
done
This will trigger whenever a node changes, so it will respond immediately either to new or modified nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larsks I think this is a better approach, it ensures we aren't having any awkward time between when the new nodes are tainted and when the daemonset has the opportunity to taint the node. Plus it's a single instance
d353735
to
2648266
Compare
@larsks I used your logic with the watch and included an initialization loop, which runs once at runtime to ensure all gpu nodes are labeled at runtime. I also tweaked the logic inside the watch to be much more efficient (the taint was firing way to frequently on ALL node updates). The loop now has memory of the prior gpu product label and fires the taint op only if the state has changed |
Addresses this issue: nerc-project/operations#1190 This pr ONLY creates a bundle and cluster scope resources for a gpu-taint-management deployment in the nerc-ocp-config repo, it does not deploy anyhting to any cluster
Addresses issue: nerc-project/operations#1190 This commit will require a subsequent PR in nerc-ocp-apps to create an argoCD app to deploy things to the cluster
Ignore me - missed the -w
option being used in the script's loop.
Addresses issue: nerc-project/operations#1190
Added bundle to nerc-ocp-test cluster deployment for testing.