CA: refactor utils related to NodeInfos #7479
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/assign @MaciekPytel
Force-pushed from f013ec0 to 5a9ff97
Force-pushed from 5a9ff97 to 3a7c44f
/assign @BigDarkClown
Force-pushed from 3a7c44f to e0024b0
Force-pushed from e0024b0 to a7eea5e
@@ -34,7 +34,7 @@ import (
    "k8s.io/autoscaler/cluster-autoscaler/core/scaledown/planner"
    scaledownstatus "k8s.io/autoscaler/cluster-autoscaler/core/scaledown/status"
    "k8s.io/autoscaler/cluster-autoscaler/core/scaleup"
    orchestrator "k8s.io/autoscaler/cluster-autoscaler/core/scaleup/orchestrator"
yikes
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate, and scheduler_utils.DeepCopyTemplateNode all had very similar logic for sanitizing and copying NodeInfos. They're all consolidated into one file in simulator, sharing common logic. DeepCopyNodeInfo is changed to be a framework.NodeInfo method. MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to correlate Nodes to scheduled pods, instead of using a live Pod lister. This means that the snapshot now has to be properly initialized in a number of tests.
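A rough sketch (not the PR's actual code) of what moving the deep-copy logic onto framework.NodeInfo could look like, based only on the hunks quoted further down in this review; deepCopyPods and the resourceapi alias are stand-ins assumed here:

// DeepCopy returns an independent copy of this NodeInfo: the Node object, any
// node-local DRA ResourceSlices, and the pods scheduled on the node.
// Sketch only - pod copying is hidden behind a hypothetical helper.
func (n *NodeInfo) DeepCopy() *NodeInfo {
    newPods := deepCopyPods(n) // hypothetical helper, not from the PR
    var newSlices []*resourceapi.ResourceSlice
    for _, slice := range n.LocalResourceSlices {
        newSlices = append(newSlices, slice.DeepCopy())
    }
    return NewNodeInfo(n.Node().DeepCopy(), newSlices, newPods...)
}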
Force-pushed from a7eea5e to 89a5259
}
nodeInfo, err := simulator.BuildNodeInfoForNode(sanitizedNode, podsForNodes[node.Name], daemonsets, p.forceDaemonSets)
templateNodeInfo, caErr := simulator.TemplateNodeInfoFromExampleNodeInfo(nodeInfo, id, daemonsets, p.forceDaemonSets, taintConfig)
if err != nil {
Should this be if caErr != nil { here? (Also, do we need to define a new caErr variable here instead of just re-using err?)
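A sketch of that call site with the check pointed at the variable the call actually assigns; the return statement's exact shape is assumed, since the enclosing function isn't shown in the diff:

templateNodeInfo, caErr := simulator.TemplateNodeInfoFromExampleNodeInfo(nodeInfo, id, daemonsets, p.forceDaemonSets, taintConfig)
if caErr != nil {
    // Propagate the error from the template-building step rather than
    // re-checking the earlier err variable.
    return nil, caErr
}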
for _, slice := range n.LocalResourceSlices {
    newSlices = append(newSlices, slice.DeepCopy())
}
return NewNodeInfo(n.Node().DeepCopy(), newSlices, newPods...)
Because the NewNodeInfo constructor only sets a node object if the passed-in node is not nil:
if node != nil {
    result.schedNodeInfo.SetNode(node)
}
...invoking n.Node().DeepCopy() inline like this might be (theoretically) subject to a nil pointer exception.
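A nil-safe variant of that return could look roughly like this (sketch; apiv1 is assumed to be the usual alias for k8s.io/api/core/v1, and whether n.Node() can actually be nil at this point isn't settled in this thread):

// Guard the Node deep copy so a NodeInfo without a Node doesn't panic.
var nodeCopy *apiv1.Node
if n.Node() != nil {
    nodeCopy = n.Node().DeepCopy()
}
return NewNodeInfo(nodeCopy, newSlices, newPods...)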
id := nodeGroup.Id()
baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
if err != nil {
    return nil, errors.ToAutoscalerError(errors.CloudProviderError, err)
is this error response too generic?
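One way to make it more specific (sketch; plain fmt error wrapping, not something this PR adds) would be to include the node group id in the wrapped error:

baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
if err != nil {
    // Keep the CloudProviderError type, but say which node group's template failed.
    return nil, errors.ToAutoscalerError(errors.CloudProviderError, fmt.Errorf("failed to get template NodeInfo for node group %q: %w", nodeGroup.Id(), err))
}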
// TemplateNodeInfoFromNodeGroupTemplate returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods).
func TemplateNodeInfoFromNodeGroupTemplate(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
    id := nodeGroup.Id()
I don't think we need to assign this to a var
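For illustration, the whole function could drop the local and inline the call (sketch assembled from the hunks shown in this review):

func TemplateNodeInfoFromNodeGroupTemplate(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
    baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
    if err != nil {
        return nil, errors.ToAutoscalerError(errors.CloudProviderError, err)
    }
    // Pass the node group id directly instead of assigning it to a local first.
    return TemplateNodeInfoFromExampleNodeInfo(baseNodeInfo, nodeGroup.Id(), daemonsets, true, taintConfig)
}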
    return TemplateNodeInfoFromExampleNodeInfo(baseNodeInfo, id, daemonsets, true, taintConfig)
}

// TemplateNodeInfoFromExampleNodeInfo returns a template NodeInfo object based on a real example NodeInfo from the cluster. The template is sanitized, and only
Not sure if I love the term "example" here. Would TemplateNodeInfoFromNode, TemplateNodeInfoFromRealNode, or TemplateNodeInfoFromRealNodeInfo work? Then we'd document it like this:

// TemplateNodeInfoFromNode returns a template NodeInfo object based on a NodeInfo from a real node on the cluster. The template is sanitized, and only
// We need to sanitize the node before determining the DS pods, since taints are checked there, and
// we might need to filter some out during sanitization.
sanitizedNode := sanitizeNodeInfo(realNode, newNodeNameBase, randSuffix, &taintConfig)
// No need to sanitize the expected pods again - they either come from sanitizedNode and were sanitized above,
etc.

I think my observation is that the word "example" suggests something non-real, a mock object, something like that.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. There were multiple very similar utils related to copying and sanitizing NodeInfos scattered around the CA codebase. Instead of adding similar DRA handling to all of them separately, they're consolidated into a single location that will be later adapted to handle DRA.
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
The first commit in the PR is just a squash of #7466, and it shouldn't be a part of this review. The PR will be rebased on top of master after #7466 is merged.
This is intended to be a no-op refactor. It was extracted from #7350 after #7447 and #7466.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: