
Conversation

@starbops
Member

@starbops starbops commented Dec 31, 2025

Problem:

A race condition was found between webhook readiness and Configuration creation during chart installation.

Solution:

Move the default Configuration CR into a ConfigMap and use a post-hook Job to apply the default Configuration from the ConfigMap volume.

Since the default Configuration CR does not need to be processed by the mutating webhook (we've already crafted it with sane defaults suitable for Harvester use), we can make the MutatingWebhookConfiguration resource a post-install hook. That way, the default Configuration CR can bypass the webhook upon creation.
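
As an illustration, here is a minimal sketch of that hook change. Only the webhook name mconfiguration-v1.kb.io is taken from the error output below; the resource name and everything else are assumptions, and the hook annotations are the substantive part:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: kubeovn-operator-mutating-webhook-configuration  # assumed name
  annotations:
    # Registering the webhook only after the release is installed means the
    # default Configuration CR created during installation never hits it.
    "helm.sh/hook": post-install,post-upgrade  # post-upgrade is an assumption
webhooks:
  - name: mconfiguration-v1.kb.io
    # clientConfig, rules, failurePolicy, etc. unchanged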

Related Issue(s):

harvester/harvester#9786

Test plan:

  1. Prepare a v1.7.0 Harvester cluster
  2. Clone the harvester/charts repository, checkout to the branch containing the fix, and install the kubeovn-operator chart directly from the local directory:
    helm upgrade --install kubeovn-operator ./charts/kubeovn-operator --namespace=kube-system --create-namespace --set=enableConfiguration=true
  3. The Helm chart installation should succeed

Additional documentation or context

Previously, installing the chart directly from the master branch failed with the following error because of the race condition described above:

$ helm upgrade --install kubeovn-operator ./charts/kubeovn-operator --namespace=kube-system --create-namespace --set=enableConfiguration=true
Release "kubeovn-operator" does not exist. Installing it now.
Error: Internal error occurred: failed calling webhook "mconfiguration-v1.kb.io": failed to call webhook: Post "https://kubeovn-operator-webhook-service.kube-system.svc:443/mutate-kubeovn-io-v1-configuration?timeout=10s": no endpoints available for service "kubeovn-operator-webhook-service"

@starbops starbops marked this pull request as ready for review December 31, 2025 09:15

Copilot AI left a comment


Pull request overview

This PR fixes a race condition between webhook readiness and Configuration resource creation during Helm chart installation by converting the Configuration custom resource to a ConfigMap and using a Helm post-hook Job to apply it after installation completes.

Key changes:

  • Introduces a post-install/post-upgrade Job that applies the Configuration from a ConfigMap after the webhook is ready
  • Converts the Configuration resource template to a ConfigMap containing the Configuration YAML
  • Adds kubectl hook image configuration to values.yaml for running the post-hook Job

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • charts/kubeovn-operator/values.yaml: Adds hookImage configuration for the kubectl container used in the post-hook Job
  • charts/kubeovn-operator/templates/post-job.yaml: New Helm post-hook Job that applies the Configuration resource from the ConfigMap after installation/upgrade
  • charts/kubeovn-operator/templates/configuration.yaml: Transforms the Configuration custom resource into a ConfigMap that stores the Configuration YAML as data
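
For orientation, here is a hedged sketch of the two templates, assembled from the fragments quoted in the review comments below. The hook annotations, kubectl arguments, mount path, and securityContext appear verbatim in the diff; resource names, the restart policy, and the hookImage value shape are assumptions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeovn-operator-configuration      # assumed name
  namespace: {{ .Release.Namespace }}
data:
  config.yaml: |
    apiVersion: kubeovn.io/v1
    kind: Configuration
    metadata:
      name: kubeovn                          # name taken from the Job log below
      namespace: {{ .Release.Namespace }}
    spec:
      # rendered from .Values.configurationSpec; see the ipv4 fragment below
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kubeovn-operator-post-job            # assumed name
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure               # assumed; retries until the webhook is ready
      containers:
        - name: kubectl
          image: {{ .Values.hookImage }}     # assumed value shape
          args:
            - apply
            - -f
            - /opt/kubeovn-operator/config.yaml
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: config
              mountPath: /opt/kubeovn-operator
      volumes:
        - name: config
          configMap:
            name: kubeovn-operator-configuration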


pingerExternalDomain: {{ .Values.configurationSpec.ipv4.pingerExternalDomain | default "google.com" }}
podCIDR: {{ .Values.configurationSpec.ipv4.podCIDR | default "10.52.0.0/16" }}
podGateway: {{ .Values.configurationSpec.ipv4.podGateway | default "10.52.0.1" }}
serviceCIDR: {{ .Values.configurationSpec.ipv4.podGateway | default "10.53.0.1" }}

Copilot AI Dec 31, 2025


The serviceCIDR field is incorrectly referencing '.Values.configurationSpec.ipv4.podGateway' instead of '.Values.configurationSpec.ipv4.serviceCIDR'. This will cause the serviceCIDR to be set to the pod gateway value (10.52.0.1) instead of the intended service CIDR value (10.53.0.1), which could lead to network routing issues.

Suggested change
serviceCIDR: {{ .Values.configurationSpec.ipv4.podGateway | default "10.53.0.1" }}
serviceCIDR: {{ .Values.configurationSpec.ipv4.serviceCIDR | default "10.53.0.1" }}

- -f
- /opt/kubeovn-operator/config.yaml
securityContext:
allowPrivilegeEscalation: false

Copilot AI Dec 31, 2025


The Job template is missing readOnlyRootFilesystem security setting. For enhanced security in a restricted environment, consider adding 'readOnlyRootFilesystem: true' to the securityContext to prevent writes to the container's filesystem.

Suggested change
allowPrivilegeEscalation: false
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true

"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:

Copilot AI Dec 31, 2025


The Job is missing a backoffLimit specification. Without this, the Job will use the default backoffLimit of 6, which may lead to multiple retry attempts if the Configuration creation fails. Consider adding 'backoffLimit: 0' or a low value to the Job spec to fail fast if there's an issue with applying the configuration, making debugging easier.

Suggested change
spec:
spec:
backoffLimit: 0

spec:
template:
metadata:
namespace: {{ .Release.Namespace }}

Copilot AI Dec 31, 2025


The pod template metadata includes a 'namespace' field, which is not a valid field for pod templates within Job specs. The namespace is inherited from the Job itself and shouldn't be specified in spec.template.metadata. This field should be removed as it may cause validation errors or be ignored by Kubernetes.

Suggested change
namespace: {{ .Release.Namespace }}

@Yu-Jack
Contributor

Yu-Jack commented Jan 8, 2026

I got this log message in kube-system/kubeovn-operator-xxxxx:post-job. Is that expected?

Error from server (InternalError): error when applying patch:
{"spec":{"ovnCentral":{"limits":{"cpu":3}},"ovsOVN":{"limits":{"cpu":2}}}}
to:
Resource: "kubeovn.io/v1, Resource=configurations", GroupVersionKind: "kubeovn.io/v1, Kind=Configuration"
Name: "kubeovn", Namespace: "kube-system"
for: "/opt/kubeovn-operator/config.yaml": error when patching "/opt/kubeovn-operator/config.yaml": Internal error occurred: failed calling webhook "mconfiguration-v1.kb.io": failed to call webhook: Post "https://kubeovn-operator-webhook-service.kube-system.svc:443/mutate-kubeovn-io-v1-configuration?timeout=10s": no endpoints available for service "kubeovn-operator-webhook-service"
stream closed EOF for kube-system/kubeovn-operator-7k5m7 (post-job)

@starbops
Member Author

starbops commented Jan 8, 2026

I got this log message in kube-system/kubeovn-operator-xxxxx:post-job. Is that expected?

Error from server (InternalError): error when applying patch:
{"spec":{"ovnCentral":{"limits":{"cpu":3}},"ovsOVN":{"limits":{"cpu":2}}}}
to:
Resource: "kubeovn.io/v1, Resource=configurations", GroupVersionKind: "kubeovn.io/v1, Kind=Configuration"
Name: "kubeovn", Namespace: "kube-system"
for: "/opt/kubeovn-operator/config.yaml": error when patching "/opt/kubeovn-operator/config.yaml": Internal error occurred: failed calling webhook "mconfiguration-v1.kb.io": failed to call webhook: Post "https://kubeovn-operator-webhook-service.kube-system.svc:443/mutate-kubeovn-io-v1-configuration?timeout=10s": no endpoints available for service "kubeovn-operator-webhook-service"
stream closed EOF for kube-system/kubeovn-operator-7k5m7 (post-job)

Did the job succeed eventually? It will undoubtedly encounter the error, but hopefully it can finish thanks to the job retry mechanism.

@Yu-Jack
Contributor

Yu-Jack commented Jan 9, 2026

Did the job succeed eventually? It will undoubtedly encounter the error, but hopefully it can finish thanks to the job retry mechanism.

Yes. It ended up with a successful installation. Okay, so it seems fine in this case. Thanks!

"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
@ihcsim
Contributor


Instead of relying on a post-install job, can the operator mount the config map and then create the embedded configuration if it's missing? Then the config map can remain part of the normal install/upgrade path. If desired, it also gives the operator the ability to revert any drift found in the CR back to the original found in the config map.

See relevant src here.
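
A hypothetical sketch of the deployment-side half of that suggestion; names are illustrative, and the create-if-missing logic would live in the operator's reconcile loop, per the linked source:

containers:
  - name: kubeovn-operator
    # The operator would read the default Configuration from the mounted file
    # and create the CR if it is missing (reconcile logic not shown).
    volumeMounts:
      - name: default-configuration
        mountPath: /opt/kubeovn-operator
        readOnly: true
volumes:
  - name: default-configuration
    configMap:
      name: kubeovn-operator-configuration  # hypothetical name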

@starbops
Member Author


Thanks, @ihcsim. I like the idea. But given the tight schedule for v1.7.1, we may want a more straightforward fix here. After discussing with @ibrokethecloud, it turns out the default Configuration CR does not have to be processed by the mutating webhook. A simpler option than the one I proposed earlier (the post-install job) is to make MutatingWebhookConfiguration a post-install-hook resource instead of a pre-install-hook resource. WDYT?

Contributor

@Yu-Jack Yu-Jack left a comment


Thanks. It works.

@starbops starbops merged commit f1df6bf into harvester:master Jan 13, 2026
5 checks passed
starbops added a commit to starbops/harvester-charts that referenced this pull request Jan 14, 2026
starbops added a commit that referenced this pull request Jan 16, 2026