
After upgrading the operator, the default auto-instrumentation image in Instrumentation is not upgraded #3468

Open
chinaran opened this issue Nov 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@chinaran
Contributor

Component(s)

auto-instrumentation

What happened?

Description

After upgrading the operator, the default auto-instrumentation image in the Instrumentation resource was not upgraded.

Steps to Reproduce

Operator v0.107.0 was originally deployed with the Helm chart and was to be migrated to an OLM deployment and upgraded to v0.113.0; the default-auto-instrumentation-java-image version should be upgraded along with it.
When v0.113.0 was deployed after removing v0.107.0, the mutating webhook call failed during the upgrade of the Instrumentation resources because the webhook service was not ready yet.
As a result, auto-instrumentation-java-image was not upgraded.
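
The Expected/Actual results below suggest the upgrader tracks the injected default via the default-auto-instrumentation-java-image annotation and only rewrites spec.java.image when it still matches a previous default. A minimal sketch of that idea; the types, field names, and constant are illustrative, not the operator's actual code:

```go
package upgrade

// Illustrative stand-in for the operator's Instrumentation API; only the
// fields relevant to the java default-image upgrade are shown.
type Instrumentation struct {
	Annotations map[string]string
	JavaImage   string
}

// Annotation seen in the issue; the constant name in the operator may differ.
const annotationDefaultJava = "instrumentation.opentelemetry.io/default-auto-instrumentation-java-image"

// upgradeJavaImage bumps spec.java.image to the new default, but only if the
// recorded annotation shows the image was set by a previous operator default
// (i.e. the user did not override it manually).
func upgradeJavaImage(inst *Instrumentation, newDefault string) bool {
	prevDefault, ok := inst.Annotations[annotationDefaultJava]
	if !ok || inst.JavaImage != prevDefault || prevDefault == newDefault {
		return false // user-managed image or already up to date
	}
	inst.JavaImage = newDefault
	inst.Annotations[annotationDefaultJava] = newDefault
	return true
}
```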

Expected Result

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  annotations:
    instrumentation.opentelemetry.io/default-auto-instrumentation-java-image: xxx-v2
spec:
  java:
    image: xxx-v2

Actual Result

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  annotations:
    instrumentation.opentelemetry.io/default-auto-instrumentation-java-image: xxx-v1
spec:
  java:
    image: xxx-v1

Kubernetes Version

v1.29.2

Operator version

v0.113.0

Collector version

v0.108.1

Environment information

No response

Log output

{"level":"ERROR","timestamp":"2024-11-17T03:30:27.899239101Z","logger":"instrumentation-upgrade","message":"failed to apply changes to instance","name":"xxx-java","namespace":"cpaas-system","error":"Internal error occurred: failed calling webhook \"minstrumentation.kb.io\": failed to call webhook: Post \"https://opentelemetry-operator-controller-manager-service.xxx-ns.svc:443/mutate-opentelemetry-io-v1alpha1-instrumentation?timeout=10s\": dial tcp xxx.xxx.xxx.xxx:443: connect: connection refused","stacktrace":
"github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation/upgrade.(*InstrumentationUpgrade).ManagedInstances
/workspace/source/pkg/instrumentation/upgrade/upgrade.go:99
main.addDependencies.func2
/workspace/source/main.go:473
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
/workspace/cache/gomod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:307
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/cache/gomod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}

Additional context

Should I add retry logic to the Instrumentation upgrade?
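
One possible shape for such a retry, purely as a sketch: the backoff values and the webhook-availability check are assumptions for illustration, not anything the operator currently does.

```go
package upgrade

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isWebhookNotReady is a rough heuristic for the "failed calling webhook /
// connection refused" error in the log output above; a real implementation
// would likely match the error more precisely.
func isWebhookNotReady(err error) bool {
	return err != nil && strings.Contains(err.Error(), "failed calling webhook")
}

// applyWithRetry retries the Instrumentation update while the mutating
// webhook endpoint is still coming up, backing off between attempts.
func applyWithRetry(apply func() error) error {
	backoff := wait.Backoff{
		Duration: 2 * time.Second,
		Factor:   2.0,
		Steps:    5,
	}
	return retry.OnError(backoff, isWebhookNotReady, apply)
}
```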

@chinaran added the bug and needs triage labels on Nov 17, 2024
@swiatekm
Contributor

I don't think controller-runtime lets us specify dependencies between runnables, so that won't help us here. It does sound like we have a race condition at startup: the upgraders depend on the webhooks being up, but we don't guarantee that the webhooks are started first.

Maybe the solution is to just run the upgraders periodically, forever? Running them only at startup doesn't make much sense to me. @pavolloffay @jaronoff97 wdyt?
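
If the upgraders were made periodic, the wiring could look roughly like this; the interval, function names, and error handling are placeholders for illustration, not a proposed API:

```go
package periodicupgrade

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// addPeriodicUpgrade registers a runnable that re-applies the Instrumentation
// upgrade on a fixed interval instead of only once at startup, so an upgrade
// that raced a not-yet-ready webhook is retried on a later tick.
func addPeriodicUpgrade(mgr manager.Manager, upgrade func(context.Context) error, interval time.Duration) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			// Errors are intentionally non-fatal: the next tick retries.
			_ = upgrade(ctx)
			select {
			case <-ctx.Done():
				return nil
			case <-ticker.C:
			}
		}
	}))
}
```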
