Skip to content

Conversation

dystewart
Copy link
Contributor

@dystewart dystewart commented Aug 13, 2025

Addresses issue: nerc-project/operations#1190

Added bundle to nerc-ocp-test cluster deployment for testing.

echo "No GPU product label found on node $NODE_NAME, skipping tainting"
fi
sleep 3600
done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an opinion, not a request for a change:

I like to put scripts in separate files and the build the configmap using a configMapGenerator. By having the script in a file by itself, rather than embedded in a YAML document, your editor/IDE can apply appropriate syntax checking and formatting.

configMapGenerator:
- name: gpu-product-taint-script
  options:
    disableNameSuffixHash: true
  files:
  - files/taint.sh

Copy link
Member

@larsks larsks Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of using a Daemonset with a script running for each node if it would make more sense to have a single instance using a kubernetes watch, something like:

#!/bin/bash

while true; do

  # watch for new/modified nodes
  kubectl get node -l node-role.kubernetes.io/worker -o name -w | while read NODE_NAME; do
    echo "Checking node: $NODE_NAME"
    PRODUCT=$(kubectl get "$NODE_NAME" -o jsonpath="{.metadata.labels['nvidia\.com/gpu\.product']}")
    if [ -n "$PRODUCT" ]; then
      echo "Found GPU product label: $PRODUCT on $NODE_NAME"
      kubectl taint "$NODE_NAME" "nvidia.com/gpu.product=$PRODUCT:NoSchedule"
    else
      echo "No GPU product label found on node $NODE_NAME, skipping tainting"
    fi
  done

  # watch expired, pause before restarting
  sleep 10
done

This will trigger whenever a node changes, so it will respond immediately either to new or modified nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsks I think this is a better approach, it ensures we aren't having any awkward time between when the new nodes are tainted and when the daemonset has the opportunity to taint the node. Plus it's a single instance

@dystewart dystewart force-pushed the taint branch 5 times, most recently from d353735 to 2648266 Compare August 19, 2025 19:56
@dystewart
Copy link
Contributor Author

@larsks I used your logic with the watch and included an initialization loop, which runs once at runtime to ensure all gpu nodes are labeled at runtime. I also tweaked the logic inside the watch to be much more efficient (the taint was firing way to frequently on ALL node updates). The loop now has memory of the prior gpu product label and fires the taint op only if the state has changed

@jtriley jtriley self-requested a review August 27, 2025 13:38
Addresses this issue: nerc-project/operations#1190

This pr ONLY creates a bundle and cluster scope resources for a gpu-taint-management deployment in the nerc-ocp-config repo, it does not deploy anyhting to any cluster
Addresses issue: nerc-project/operations#1190
This commit will require a subsequent PR in nerc-ocp-apps to create an argoCD app to deploy things to the cluster
@jtriley jtriley changed the title Add gpu-taint-management bundle and resources and create overlay for nerc-ocp-test cluster deployment Add gpu-taint-management bundle and resources Aug 27, 2025
@jtriley jtriley dismissed their stale review August 27, 2025 13:45

Ignore me - missed the -w option being used in the script's loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants