Add gpu-taint-management bundle and resources #748

dystewart · 2025-08-13T16:14:51Z · edited by jtriley

Addresses issue: nerc-project/operations#1190

Added bundle to nerc-ocp-test cluster deployment for testing.

gpu-taint-management/base/daemonset.yaml

larsks · 2025-08-14T14:10:30Z

gpu-taint-management/base/configmap.yaml

+        echo "No GPU product label found on node $NODE_NAME, skipping tainting"
+      fi
+      sleep 3600
+    done


Just an opinion, not a request for a change:

I like to put scripts in separate files and then build the ConfigMap using a configMapGenerator. With the script in a file by itself, rather than embedded in a YAML document, your editor/IDE can apply appropriate syntax checking and formatting.

    configMapGenerator:
      - name: gpu-product-taint-script
        options:
          disableNameSuffixHash: true
        files:
          - files/taint.sh

I wonder whether, instead of using a DaemonSet with a script running on each node, it would make more sense to have a single instance using a Kubernetes watch, something like:

    #!/bin/bash

    while true; do
      # watch for new/modified nodes
      kubectl get node -l node-role.kubernetes.io/worker -o name -w |
        while read NODE_NAME; do
          echo "Checking node: $NODE_NAME"
          PRODUCT=$(kubectl get "$NODE_NAME" -o jsonpath="{.metadata.labels['nvidia\.com/gpu\.product']}")
          if [ -n "$PRODUCT" ]; then
            echo "Found GPU product label: $PRODUCT on $NODE_NAME"
            kubectl taint "$NODE_NAME" "nvidia.com/gpu.product=$PRODUCT:NoSchedule"
          else
            echo "No GPU product label found on node $NODE_NAME, skipping tainting"
          fi
        done

      # watch expired, pause before restarting
      sleep 10
    done

This will trigger whenever a node changes, so it will respond immediately either to new or modified nodes.

dystewart · 2025-08-14

@larsks I think this is a better approach. It avoids any awkward gap between when a new node is added and when the DaemonSet has the opportunity to taint it. Plus, it's a single instance.

dystewart · 2025-08-19T19:58:53Z

@larsks I used your logic with the watch and included an initialization loop, which runs once at startup to ensure all GPU nodes are tainted. I also tweaked the logic inside the watch to be much more efficient (the taint was firing far too frequently, on ALL node updates). The loop now remembers the prior GPU product label and fires the taint operation only if the state has changed.
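For readers following along, here is a rough sketch of what "remember the prior label and only taint on change" could look like. It is not the taint.sh from this PR. The function names (`product_changed`, `reconcile_node`, `watch_nodes`), the `--run` argument, and the use of `--overwrite` are assumptions made for illustration. It needs bash 4+ for associative arrays, and it reads the watch through process substitution so the remembered state is not lost in a pipeline subshell.

```shell
#!/bin/bash
# Illustrative sketch only; names below are hypothetical, not from the PR.

# node name -> last GPU product label we acted on (bash 4+ associative array)
declare -A LAST_PRODUCT

# Return 0 if the label differs from the one last recorded for this node
# (or if the node has never been seen), and record the new value.
# Return 1 if nothing changed.
product_changed() {
  local node="$1" product="$2"
  if [[ "${LAST_PRODUCT[$node]-__unset__}" == "$product" ]]; then
    return 1
  fi
  LAST_PRODUCT[$node]="$product"
  return 0
}

# Look up the node's GPU product label and taint only when it changed.
reconcile_node() {
  local node="$1" product
  product=$(kubectl get "$node" -o jsonpath="{.metadata.labels['nvidia\.com/gpu\.product']}")
  product_changed "$node" "$product" || return 0
  if [[ -n "$product" ]]; then
    echo "GPU product label on $node is now: $product"
    # --overwrite is an assumption here: it lets an existing taint with the
    # same key and effect be replaced instead of rejected.
    kubectl taint --overwrite "$node" "nvidia.com/gpu.product=$product:NoSchedule"
  else
    echo "No GPU product label found on $node, skipping tainting"
  fi
}

watch_nodes() {
  local node
  while true; do
    # kubectl get -w lists the current nodes first, then streams changes.
    # Process substitution keeps this loop in the current shell, so
    # LAST_PRODUCT survives across events and across watch restarts.
    while read -r node; do
      reconcile_node "$node"
    done < <(kubectl get node -l node-role.kubernetes.io/worker -o name -w)
    # watch expired, pause before restarting
    sleep 10
  done
}

# Only start watching when explicitly asked (hypothetical flag).
if [[ "${1:-}" == "--run" ]]; then
  watch_nodes
fi
```

With this shape, a node update that leaves the label untouched costs one `kubectl get` and no `kubectl taint`. Whether a removed label should also remove the taint is left open here, since the thread does not say.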

gpu-taint-management/base/taint.sh

Ignore me - missed the -w option being used in the script's loop.

dystewart requested review from jtriley, Milstein, tssala23, larsks and joachimweyl August 13, 2025 17:23

larsks reviewed Aug 14, 2025

dystewart force-pushed the taint branch 5 times, most recently from d353735 to 2648266 August 19, 2025 19:56

dystewart requested a review from larsks August 19, 2025 19:58

dystewart force-pushed the taint branch from 2648266 to f34382c August 19, 2025 20:11

jtriley previously requested changes Aug 27, 2025

gpu-taint-management/base/taint.sh (resolved)

jtriley self-requested a review August 27, 2025 13:38

dystewart added 2 commits August 27, 2025 09:39

Add gpu-taint-management bundle and resources

c4bd6b3

Addresses this issue: nerc-project/operations#1190. This PR ONLY creates a bundle and cluster-scoped resources for a gpu-taint-management deployment in the nerc-ocp-config repo; it does not deploy anything to any cluster.

ocp-test: add gpu-taint-management overlay for test cluster

ecb11e9

Addresses issue: nerc-project/operations#1190. This commit will require a subsequent PR in nerc-ocp-apps to create an Argo CD app to deploy things to the cluster.

jtriley force-pushed the taint branch from f34382c to ecb11e9 August 27, 2025 13:39

jtriley changed the title ~~Add gpu-taint-management bundle and resources and create overlay for nerc-ocp-test cluster deployment~~ Add gpu-taint-management bundle and resources Aug 27, 2025
