Add support for Preservation of Machines and Backing nodes by thiyyakat · Pull Request #1059 · gardener/machine-controller-manager

thiyyakat · 2025-12-10T09:32:40Z

What this PR does / why we need it:

This PR introduces a feature that allows operators and end users to preserve a machine/node and the backing VM for diagnostic purposes.

The expected behaviour, use cases and usage are detailed in the proposal that can be found here

Which issue(s) this PR fixes:
Fixes #1008

Special notes for your reviewer:

The following tests were carried out serially with the machine-controller-manager-provider-virtual: #1059 (comment)

Please also take a look at the questions asked here.

Release note:

Introduce support for preservation of machines (both Running and Failed), and the backing node (if it exists).

gardener-robot · 2025-12-10T09:32:56Z

@thiyyakat You need to rebase this pull request onto the latest master branch. Please check.

thiyyakat · 2025-12-11T06:59:24Z

Questions that remain unanswered:

1. On recovery of a preserved machine, it transitions from Failed to Running. However, if the preserve annotation was when-failed, the node continues to be preserved in Running even though the annotation says when-failed. Is that okay? The node needs to be preserved so that pods can get scheduled onto it without CA scaling it down.
   Update: We allow the annotation to stay, but we clear PreserveExpiryTime and set the node condition to false. The CA annotation remains until it is manually removed from the node.
2. The drain timeout is currently checked by calculating the time elapsed from LastUpdateTime (when the machine moved to Failed) to now. Is there a better way to do it?
   timeOutOccurred = utiltime.HasTimeOutOccurred(machine.Status.CurrentStatus.LastUpdateTime, timeOutDuration)
   In the normal drain, it is checked with respect to DeletionTimestamp.
3. In some parts of the code, checks are performed to see if the returned error is due to a Conflict, and ConflictRetry rather than ShortRetry is returned. When should these checks be performed? The preservation flow has a lot of update calls.
   Update: Addressed. Use ConflictRetry when appropriate.

thiyyakat

Note: A review meeting was held today for this PR. The comments were given during the meeting.

During the meeting, we revisited the decision to perform the drain in the Failed state for preserved machines. The reason discussed previously was that it does not make sense semantically to move the machine to Terminating and then drain it, because the machine may still recover. Since Terminating is a final state, the drain (separate from the drain in triggerDeletionFlow) will be performed in the Failed phase. No change was proposed during the meeting; the design decision was only reconfirmed.

pkg/util/provider/machinecontroller/machine.go

pkg/util/provider/machinecontroller/machine_util.go

pkg/controller/machineset.go

machine-controller-manager

takoverflow

Have only gone through half of the PR, have some suggestions PTAL.

pkg/apis/machine/v1alpha1/machine_types.go

pkg/controller/deployment_machineset_util.go

pkg/controller/machineset.go

pkg/util/provider/machinecontroller/machine_util.go

takoverflow · 2025-12-18T09:40:52Z

pkg/util/provider/machinecontroller/machine_util.go

+		err := nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, preservedCondition)
+		if err != nil {
+			return err
+		}
+		// Step 2: remove CA's scale-down disabled annotations to allow CA to scale down node if needed
+		CAAnnotations := make(map[string]string)
+		CAAnnotations[autoscaler.ClusterAutoscalerScaleDownDisabledAnnotationKey] = ""
+		latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+		if err != nil {
+			klog.Errorf("error trying to get backing node %q for machine %s. Retrying, error: %v", nodeName, machine.Name, err)
+			return err
+		}
+		latestNodeCopy := latestNode.DeepCopy()
+		latestNodeCopy, _, _ = annotations.RemoveAnnotation(latestNodeCopy, CAAnnotations) // error can be ignored, always returns nil
+		_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, latestNodeCopy, metav1.UpdateOptions{})
+		if err != nil {
+			klog.Errorf("Node UPDATE failed for node %q of machine %q. Retrying, error: %s", nodeName, machine.Name, err)
+			return err
+		}


Is there a reason why two get and two update calls are made for a node? Can these not be combined into a single atomic node object update?

I know this is not part of your PR, but can we update this RemoveAnnotation function? It's needlessly complicated.
All you have to do after fetching the object and checking that annotations are non-nil is

delete(obj.Annotations, annotationKey)

Creating a dummy annotation map, passing it in, and then building a new map that doesn't have the key is complication that can be avoided.

By 2 Get() calls are you referring to the call within AddOrUpdateConditionsOnNode and the following Get() here:
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})?

The first one can be avoided if we didn't use the function. The second one is required because step 1 adds conditions to the node object, and the function does not return the updated node object. Fetching from the cache doesn't guarantee an up-to-date node object (tested this out empirically). I could potentially avoid fetching the objects if I didn't use the function. Will test it out.

The two update calls cannot be combined, since step 1 changes the node status and requires an UpdateStatus() call, while step 2 changes the node's annotations (object metadata) and requires an Update() call.

I will update the RemoveAnnotation function as recommended by you.

Edit: The RemoveAnnotation function returns a boolean indicating whether or not an update is needed. This value is being used in other usages of the function. The function cannot be updated. I will use your suggestion instead of using the function since the boolean value is not required in this case.
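
The two requirements discussed above (a plain delete on the annotations map, and a boolean telling the caller whether an update is needed) are not mutually exclusive. The sketch below shows one way to get both. It is not the existing RemoveAnnotation function: the object type is a hypothetical stand-in for a Kubernetes object such as a Node, and the annotation keys are invented for illustration.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a Kubernetes object such as corev1.Node;
// only the annotations map matters for this sketch.
type object struct {
	Annotations map[string]string
}

// removeAnnotation deletes key from obj's annotations in place and reports
// whether anything changed, so callers can skip a needless update call.
func removeAnnotation(obj *object, key string) bool {
	if obj == nil {
		return false
	}
	// Reading from a nil map is safe in Go and simply reports "not found".
	if _, ok := obj.Annotations[key]; !ok {
		return false
	}
	delete(obj.Annotations, key)
	return true
}

func main() {
	// The keys below are illustrative, not the real CA annotation key.
	node := &object{Annotations: map[string]string{
		"example.com/scale-down-disabled": "true",
		"example.com/other":               "keep",
	}}

	fmt.Println(removeAnnotation(node, "example.com/scale-down-disabled")) // true
	fmt.Println(removeAnnotation(node, "example.com/scale-down-disabled")) // false
	fmt.Println(len(node.Annotations))                                     // 1
}
```

Because the map is mutated in place, a caller working on an object from a shared cache would still need to DeepCopy it first, as the snippet under review already does.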

pkg/util/provider/machinecontroller/machine_util.go

pkg/util/provider/machinecontroller/machine.go

pkg/apis/machine/v1alpha1/machine_types.go

pkg/apis/machine/v1alpha1/machinedeployment_types.go

pkg/util/provider/machineutils/utils.go

aaronfern

Thanks for the PR @thiyyakat!
A few questions/nits from me, please address them

pkg/apis/machine/types.go

pkg/controller/deployment_machineset_util.go

pkg/controller/machineset.go

pkg/util/provider/machinecontroller/machine.go

pkg/util/provider/machineutils/utils.go

pkg/util/provider/machinecontroller/machine_util.go

docs/documents/apis.md

elankath · 2026-01-13T02:58:34Z

pkg/apis/machine/v1alpha1/machine_types.go

 	UpdateFailed string = "UpdateFailed"
 )

+const (


These condition constants feel like they are in the wrong place, as we already have conditions at pkg/apis/machine/types.go. Also, I don't think the Node prefix should be used for the condition constant names, as they are used in Machine objects too. @unmarshall should these even be exposed in the API?

I've added them here after seeing the constants for InPlaceUpdates added just above:

machine-controller-manager/pkg/apis/machine/v1alpha1/machine_types.go

Line 226 in 219d435

NodeInPlaceUpdate corev1.NodeConditionType = "InPlaceUpdate"

The NodeCondition for InPlace is named NodeInPlaceUpdate, and I've followed the same.

@elankath, @unmarshall, please let me know what change you would like me to make.

@thiyyakat Ok, but constants like NodePreservedByMCM, etc. should just be PreservedByMCM; that is also the convention followed by the in-place update constants.

PreservedNodeDrainSuccessful -> DrainSuccessful

Will make the change to the other constant names and shorten them.

For this one (PreservedNodeDrainSuccessful -> DrainSuccessful), I am unsure what to do. DrainSuccessful is used as a Reason for InPlaceUpdate, and the comment indicates the same. Is it okay to reuse it for a Message?
Ref:

machine-controller-manager/pkg/apis/machine/v1alpha1/machine_types.go

Line 237 in 219d435

// DrainSuccessful is a constant for reason in condition that indicates node drain is successful

pkg/apis/machine/v1alpha1/machineset_types.go

pkg/controller/machineset.go

pkg/util/provider/machinecontroller/machine.go

* additionally, add tests for isFailedMachineCandidateForPreservation()

* remove auto preservation logic from manageReplicas()
* rename constants
* simplify preserved Running machine switch from preserve=now to preserve=when-failed
* update tests

…e accessing maps.

- Modify sort function to de-prioritize preserve machines
- Add test for the same
- Improve logging
- Fix bug in stopMachinePreservationIfPreserved when node is not found
- Update default MachinePreserveTimeout to 3 days as per doc

- Reuse function to write annotation on machine
- Minor refactoring

- Make changes to add auto-preserve-stopped on recovered, auto-preserved previously failed machines.
- Change stopMachinePreservationIfPreserved to remove CA annotation when preserve=false on a recovered failed, preserved machine

…erveExpiryTime

…value

* remove stop annotation value
* remove CA scale-down annotation when preservation stops
* change preservation annotation handling semantics for machine and node
* remove auto-preserve-stopped annotation value
* Add preserveExpiryTime to NodeCondition.Message
* modify test cases

…eserved machines if autoPreservedFailedMachineMax is decreased in the shoot spec.

…liedNodePreserveValue for persisting node annotation values that have been applied.

…th bugfix on master

gardener-robot added the kind/api-change (API change with impact on API users), needs/second-opinion (needs second review by someone else), and needs/rebase (needs git rebase) labels Dec 10, 2025

gardener-robot added the needs/review (needs review) and size/XL (a PR that changes 500-999 lines, ignoring generated files) labels Dec 10, 2025

thiyyakat force-pushed the feat/preserve-machine branch 2 times, most recently from 06ecf58 to 89f2900 Compare December 10, 2025 12:06

thiyyakat commented Dec 11, 2025

takoverflow reviewed Dec 12, 2025

takoverflow requested changes Dec 18, 2025

gardener-robot added the needs/changes Needs (more) changes label Dec 18, 2025

thiyyakat force-pushed the feat/preserve-machine branch from 22c646e to 7c062b5 Compare December 19, 2025 08:30

thiyyakat force-pushed the feat/preserve-machine branch from e2a7ea7 to 74603a4 Compare December 31, 2025 09:56

thiyyakat marked this pull request as ready for review January 6, 2026 05:56

thiyyakat requested a review from a team as a code owner January 6, 2026 05:56

r4mek reviewed Jan 7, 2026

r4mek reviewed Jan 7, 2026

r4mek reviewed Jan 8, 2026

thiyyakat force-pushed the feat/preserve-machine branch 2 times, most recently from a487a18 to 508b1ba Compare January 12, 2026 04:24

aaronfern reviewed Jan 12, 2026