Skip to content

Conversation

Pacho20
Copy link

@Pacho20 Pacho20 commented Jul 24, 2025

Introduce support for DaemonSet-based deployment mode for clusters running without the Machine Config Operator (MCO).

This feature enables an alternative deployment path by detecting the absence of the MachineConfig CRD and activating DaemonSet mode via a dedicated feature gate. It includes logic for deploying and managing DaemonSets on targeted nodes, using labels to control execution and track installation status.

There are still many TODOs in the implementation, and the list will be updated as development progresses.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 24, 2025
Copy link

openshift-ci bot commented Jul 24, 2025

Hi @Pacho20. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@spotlesstofu
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 28, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2025
@spotlesstofu
Copy link
Contributor

/ok-to-test

Copy link
Contributor

@spotlesstofu spotlesstofu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Patrik for working on this new feature.

I added a couple of inline comments about the logic.

Please move the shell scripts to a separate file and include them, or mount them as an immutable ConfigMap.

Nice to have: deduplicate code.

Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Pacho20, thanks for this. Appreciated.

I left a few comments for you, but my first suggestion here, is to squash the commits that are refactoring and changing the same code base, so the commit-by-commit review would be easier.

@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch from c5bacd6 to a1108ca Compare August 7, 2025 12:18
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2025
@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch 3 times, most recently from 451ec60 to f0973fa Compare August 14, 2025 16:26
@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch from f0973fa to 83c834a Compare August 18, 2025 12:57
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 18, 2025
@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch from 83c834a to 121a2db Compare August 18, 2025 13:08
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 18, 2025
@Pacho20 Pacho20 marked this pull request as ready for review August 18, 2025 14:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2025
@openshift-ci openshift-ci bot requested review from gkurz and wainersm August 18, 2025 14:17
@littlejawa
Copy link
Contributor

/ok-to-test

@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch 2 times, most recently from 528a8e9 to e48f315 Compare August 22, 2025 15:38
Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Pacho20, thanks for keeping pushing this PR. Appreciated.

I left a few more comments for you, but also lets see if others join the party.

Privileged: &runPrivileged,
RunAsUser: &runAsUser,
},
Command: []string{"/bin/bash", "/scripts/osc-configs-script.sh"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//
// It returns false if no ConfigMap was created due to an error or
// if the existing ConfigMap is identical to the provided YAML data.
func (r *KataConfigOpenShiftReconciler) addImmutableConfigMap(path, yaml string) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benefit of using the script from the configmap vs adding the script in the container image and run it through the usual static checks etc ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had suggested this from experience that yaml inside yaml can often have issues in ocp. The line wraps start to be interpreted by ocp when they should be running it verbatim inside the container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing this as it was removed from this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong, thread, reopening.

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Pacho20 !

Thanks for submitting this. I could only partially review the first commit so far, I'll try to devote some time everyday to make progress. Anyway, I've already some comments to share. PTAL.

return false, err
}

return true, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MachineConfig CRD is also present in this case but it won't print the message... :-\

What about simplifying the return branches ? Something like :

	err := r.Client.Get(context.Background(), client.ObjectKey{Name: "foo"}, mc)
	if err == nil || k8serrors.IsNotFound(err) {
		r.Log.Info("MachineConfig CRD is present")
		return true, nil
	}

	if meta.IsNoMatchError(err) {
		r.Log.Info("MachineConfig CRD not found")
		return false, nil
	}

	return false, err

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkurz we think we resolved this. Can you confirm?

}

if mode == DaemonSetFallbackOption && !mcAvailable {
r.Log.Info("MC is not available, deployment mode will be set to DaemonSet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.Log.Info("MC is not available, deployment mode will be set to DaemonSet")
r.Log.Info("MachineConfig is not available, deployment mode will be set to DaemonSet")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkurz this should now be resolved. Can you confirm?

return ctrl.Result{}, updateErr
}
return res, err
res, err := r.processKataConfigDeleteRequestDaemonSet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space damage. Be sure to run make fmt for each individual commit. It is also a good practice to run make vet as well.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think was fixed when it moved to here #950 the commit doesn't exist that has this error now.

}

cfgMap := &corev1.ConfigMap{}
err := r.Client.Get(context.TODO(), types.NamespacedName{Name: FgConfigMapName,
Namespace: OperatorNamespace}, cfgMap)
if err == nil {
for feature, value := range cfgMap.Data {
fgStatus.FeatureGates[feature] = value == "true"
if value, ok := cfgMap.Data[ConfidentialFeatureGate]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the FeatureGateStatus from containing a map to containing explicit values is what forces you to copy-paste the same code again and again here. This scales OK with three fields, but if we add more, that function will need to be revisited every time. Do we want that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c3d, right, we asked him to add DeploymentModeOption over Slack to cover what the code was already doing and to support the "force mode". Since we don’t have just bools anymore, the extra code here is expected.

We’re also moving from feature gates to config options. I don’t love the duplication either, but for now I’d stick with this. If we add more gates later, we can always revisit with a table-driven/helper pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this thread, feel free to reopen if we came up with a better idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a change should really happen in a preceding commit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created a new commit. Could you check if it's acceptable for you?

Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Pacho20, thanks again for this work.

The comments shows the importance of your work, This is a big feature and you’re making solid progress. Keep going.

// The DaemonSet's Pod contains three containers: one installs the binaries, one modifies the node labels, and one sets the runtime log level.
func (r *KataConfigOpenShiftReconciler) daemonSetForKataInstall(action KataDsAction) (*appsv1.DaemonSet, error) {
// This image contains the binaries for kata-containers
extensionImageString, err := r.GetExtensionImage()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should stay aligned with how we deploy across the project to avoid mixing methods. Since there’s already an effort to move the crio/kata config into the kata RPM, this DS could follow the same path, otherwise it risks becoming a separate “thing” that’s harder to maintain and ship in future versions. I’d also avoid introducing extra config maps, even if used only internally, since they tend to add confusion.

Now, about the scripts, we already have scripts/install-helpers/, so the needed files could go there, and the image could consume them at build time as other images do.

If we have time to deploy this image in our pipeline, that’s the option I’d vote for (unless we have a better alternative).

@littlejawa, how hard do you think it would be to add this new component/image in Konflux? From what I see, hermetic builds shouldn’t be a problem, but do you see any other enterprise contract breaking here?

return nil
}

func (r *KataConfigOpenShiftReconciler) GetExtensionImage() (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just missed the reason we have GetExentionImage() and GetCliImage() methods. Can't we avoid both methods and calling directly the GetImageForComponent ?

Copy link
Author

@Pacho20 Pacho20 Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need the GetCliImage() function. I initially created both functions to allow for additional logic if needed. As you can see, GetExtensionImage() has a TODO comment because I wasn't sure whether this is the only extension image we need. Other extension images exist for different distros. If RHEL and CoreOS are sufficient, I can remove both functions and call GetImageForComponent() directly.

case MachineConfigMode:
res, err = r.processKataConfigDeleteRequest()
case DaemonSetMode:
res, err = r.processKataConfigDeleteRequestDaemonSet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the DaemonSetFallback ?

Copy link

@gcoon151 gcoon151 Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the FG is set to DaemonSetFallback, this field is automatically set (L87 deployment_mode_handler.go) to either MachineConfigMode or DaemonSetMode depending on what is there. What's the suggestion?

case MachineConfigMode:
res, err = r.processKataConfigInstallRequest()
case DaemonSetMode:
res, err = r.processKataConfigInstallRequestDaemonSet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, what about the DaemonSetFallback?

Copy link

@gcoon151 gcoon151 Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the FG is set to DaemonSetFallback, this field is automatically set (L87 deployment_mode_handler.go) to either MachineConfigMode or DaemonSetMode depending on what is there. What's the suggestion?

r.Log.Info("Kata PODs are present. Requeue for reconciliation ")
return ctrl.Result{Requeue: true, RequeueAfter: 15 * time.Second}, err
}
res, err := r.checkDeletionEligibility()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice move!

err = r.deletePeedPodsConfigDaemonSet()
if err != nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the fallback mode here ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. This is handled in deployment_mode_handler.go

@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch from 85c2055 to cb2cb22 Compare September 9, 2025 14:55

// Attempt to GET any MachineConfig to verify if the GVK is known.
// If the CRD isn’t installed, it will return a NoMatchError.
err := r.Client.Get(context.Background(), client.ObjectKey{Name: machineConfigName}, mc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only sensible user for machineConfigName. No deep to define a constant for that. Let's use the literal string "dummy-machine-config" directly here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Prepare feature gate for multi-value support by replacing map[string]bool with struct

Signed-off-by: Patrik Fodor <[email protected]>
Introduce a new feature gate to support running without the Machine Config Operator (MCO),
using a DaemonSet-based deployment instead. This commit adds only the feature gate
definition and detection logic.

The feature gate supports three options, which determine whether DaemonSet mode is activated.

Signed-off-by: Patrik Fodor <[email protected]>
Introduced a helper function to fetch component images based on the cluster version.
Also added a function to label nodes.

Signed-off-by: Patrik Fodor <[email protected]>
This commit builds on the previously introduced
feature gate for DaemonSet-based deployment.
Adds support for deploying and managing the Kata and CAA installation using DaemonSets
without relying on the Machine Config Operator (MCO).

Signed-off-by: Patrik Fodor <[email protected]>
@Pacho20 Pacho20 force-pushed the daemonset-deployment-mode branch from cb2cb22 to c8dac15 Compare September 10, 2025 10:55
Copy link

openshift-ci bot commented Sep 10, 2025

@Pacho20: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants