OTA-542: pkg/cvo/internal: Do not block on Degraded=True ClusterOperator #482

wking · 2020-11-24T23:57:04Z

We have been blocking on this condition since 545c342 (#31) when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (#136), motivated by install speed. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperators tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).
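As a rough illustration of the behavior change described above (this is not the actual pkg/cvo/internal code), the sketch below shows a gate that still blocks when an operator is not Available=True but only reports Degraded=True as a non-blocking warning. The `Condition` type and the `findStatus` and `checkOperator` helpers are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified, hypothetical stand-in for a ClusterOperator
// status condition; the real API type carries more fields.
type Condition struct {
	Type   string // e.g. "Available" or "Degraded"
	Status string // "True", "False", or "Unknown"
}

// findStatus returns the status of the named condition type, or
// "Unknown" when the condition is absent.
func findStatus(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return "Unknown"
}

// checkOperator sketches the proposed gate: anything other than
// Available=True blocks, while Degraded=True is returned as a warning
// and does not block.
func checkOperator(name string, conds []Condition) (warning string, err error) {
	if findStatus(conds, "Available") != "True" {
		return "", errors.New("cluster operator " + name + " is not available")
	}
	if findStatus(conds, "Degraded") == "True" {
		warning = "cluster operator " + name + " is degraded"
	}
	return warning, nil
}

func main() {
	conds := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Degraded", Status: "True"},
	}
	warning, err := checkOperator("example", conds)
	fmt.Printf("warning=%q err=%v\n", warning, err)
}
```

In this sketch a caller would keep applying later manifests whenever `err` is nil and could pass `warning` along to whatever status reporting it uses.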

CC @abhinavdahiya, @deads2k, @smarterclayton as folks who were involved in the logic I'm removing here.

openshift-ci-robot · 2020-11-24T23:57:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2020-11-24T23:58:06Z

/hold

Big change; we want to give folks time to weigh in.

sdodson · 2020-11-25T17:14:03Z

I would personally prefer that we tighten up the definition of what degraded means such that it's only used when things are critical enough that they should block the forward progress of upgrades.

wking · 2020-11-25T17:30:30Z

> I would personally prefer that we tighten up the definition of what degraded means...

The current Degraded definition hinges on quality of service. The Available definition requires the operand to be "functional and available". Can you float an example of component behavior that would be Available=True but still sufficiently severe to need to block later-manifest reconciliation?

sdodson · 2020-11-25T18:15:39Z

Thanks for referencing that. I'd say my view was mostly in line with the statement "A service should not report Degraded during the course of a normal upgrade" rather than a deep consideration of Available versus Degraded. I'm concerned that we're being too eager to move into a degraded state during normal upgrade behavior. I'd prefer those operators tune themselves to align more closely with real-world cluster behavior; I fear their timeouts are tuned based on behaviors observed from unloaded or CI-only clusters.

smarterclayton · 2020-12-02T17:50:40Z

I agree with Scott (I think) - I don’t believe degraded is an acceptable state for operators ever, especially during upgrade, as we have historically defined it in practice.

wking · 2020-12-05T00:45:18Z

> I don’t believe degraded is an acceptable state for operators ever, especially during upgrade, as we have historically defined it in practice.

I'm not arguing for it to be acceptable, I'm arguing about it being non-blocking. We will still alert if operators go Degraded=True in the wild. We can build CI to fail operators that go Degraded=True during a run, if we don't do that already. But if an operator goes Degraded=True in a customer cluster, it's not clear to me how sticking mid-update is helping the customer resolve that situation, vs. pushing through with the rest of the update, as long as the operator is Available=True, and letting admins sort out the degrading issue orthogonally.
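To make the "build CI to fail operators that go Degraded=True during a run" idea concrete, here is a minimal sketch of such a check. It assumes the CI job has already collected samples of each operator's Degraded status over the run; the `Observation` type, the `everDegraded` helper, and the operator names in `main` are hypothetical example data, not an existing test-suite API.

```go
package main

import (
	"fmt"
	"sort"
)

// Observation is a hypothetical record of one ClusterOperator's Degraded
// status sampled at some point during a CI run.
type Observation struct {
	Operator string
	Degraded string // "True", "False", or "Unknown"
}

// everDegraded returns the sorted, de-duplicated names of operators that
// reported Degraded=True in at least one observation.
func everDegraded(observations []Observation) []string {
	seen := map[string]bool{}
	for _, o := range observations {
		if o.Degraded == "True" {
			seen[o.Operator] = true
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Example data only: one operator briefly goes Degraded=True.
	run := []Observation{
		{Operator: "ingress", Degraded: "False"},
		{Operator: "network", Degraded: "True"},
		{Operator: "network", Degraded: "False"},
	}
	if failed := everDegraded(run); len(failed) > 0 {
		fmt.Println("CI check failed; operators went Degraded=True:", failed)
	}
}
```

A check like this keeps Degraded=True unacceptable in CI while leaving the update path in customer clusters non-blocking.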

openshift-merge-robot · 2020-12-09T19:30:53Z

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/integration | `02fad45` | link | `/test integration` |
| ci/prow/e2e-agnostic-operator | `02fad45` | link | `/test e2e-agnostic-operator` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2021-03-09T20:04:46Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2021-04-08T21:58:40Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2021-05-09T01:01:59Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2021-05-09T01:02:05Z

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-05-09T01:02:13Z

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2024-11-25T20:18:10Z

Picking this one back up for more discussion.

openshift-ci · 2024-11-25T20:19:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-11-25T20:28:08Z

@wking: This pull request references OTA-542 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

We have been blocking on this condition since 545c342 (#31) when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (#136), motivated by install speed. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperators tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

CC @abhinavdahiya, @deads2k, @smarterclayton as folks who were involved in the logic I'm removing here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

We have been blocking on this condition since 545c342 (api: make status substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136), motivated by install speed [1]. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperators tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)

petr-muller · 2024-11-26T12:31:16Z

/cc
/remove-lifecycle rotten

petr-muller · 2025-03-11T12:45:33Z

/uncc @smarterclayton @LalatenduMohanty

openshift-merge-robot · 2025-04-30T22:06:55Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2025-05-27T18:31:04Z

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `b198db0` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/unit | `b198db0` | link | true | `/test unit` |
| ci/prow/verify-deps | `b198db0` | link | true | `/test verify-deps` |
| ci/prow/verify-update | `b198db0` | link | true | `/test verify-update` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2025-08-26T01:00:22Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot requested review from LalatenduMohanty and smarterclayton November 24, 2020 23:57

openshift-ci-robot added the approved label Nov 24, 2020

openshift-ci-robot added the do-not-merge/hold label Nov 24, 2020

wking mentioned this pull request Jan 20, 2021

Bug 1884334: UpdateError: enhance for ability to determine when upgrade failing #486

Merged

openshift-ci-robot added the lifecycle/stale label Mar 9, 2021

openshift-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Apr 8, 2021

openshift-ci bot added the needs-rebase label May 9, 2021

openshift-ci bot closed this May 9, 2021

wking reopened this Nov 25, 2024

wking force-pushed the ignore-operator-degraded branch from 02fad45 to b0a2a4b November 25, 2024 20:27

openshift-merge-robot removed the needs-rebase label Nov 25, 2024

wking changed the title ~~pkg/cvo/internal: Do not block on Degraded=True ClusterOperator~~ OTA-542: pkg/cvo/internal: Do not block on Degraded=True ClusterOperator Nov 25, 2024

openshift-ci-robot added the jira/valid-reference label Nov 25, 2024

wking force-pushed the ignore-operator-degraded branch from b0a2a4b to cf0f1e3 November 25, 2024 20:54

wking mentioned this pull request Nov 25, 2024

OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal openshift/enhancements#1719

Open

openshift-ci bot requested a review from petr-muller November 26, 2024 12:31

openshift-ci bot removed the lifecycle/rotten label Nov 26, 2024

wking force-pushed the ignore-operator-degraded branch from cf0f1e3 to b198db0 November 27, 2024 05:49

openshift-ci bot removed request for smarterclayton and LalatenduMohanty March 11, 2025 12:45

openshift-merge-robot added the needs-rebase label Apr 30, 2025

openshift-ci bot added the lifecycle/stale label Aug 26, 2025
