Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTA-542: pkg/cvo/internal: Do not block on Degraded=True ClusterOperator #482

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wking
Copy link
Member

@wking wking commented Nov 24, 2020

We have blocking on this condition since 545c342 (#31) when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (#136), motivated by install speed. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block depndent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up when an operator goes Degraded=True for a while, we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperator tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

CC @abhinavdahiya, @deads2k, @smarterclayton as folks who were involved in the logic I'm removing here.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 24, 2020
@wking
Copy link
Member Author

wking commented Nov 24, 2020

/hold

Big change; we want to give folks time to weigh in.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2020
@sdodson
Copy link
Member

sdodson commented Nov 25, 2020

I would personally prefer that we tighten up the definition of what degraded means such that it's only used when things are critical enough that they should block the forward progress of upgrades.

@wking
Copy link
Member Author

wking commented Nov 25, 2020

I would personally prefer that we tighten up the definition of what degraded means...

Current definition hinges on quality-of-service. Available definition requires the operand to be "functional and available". Can you float an example of component behavior that would be Available=True but still sufficiently severe to need to block later-manifest reconciliation?

@sdodson
Copy link
Member

sdodson commented Nov 25, 2020

Thanks for referencing that, I'd say my view was mostly inline with the statement A service should not report Degraded during the course of a normal upgrade. rather than a deep consideration of available versus degraded. I'm concerned that we're being too eager to move into a degraded state during normal upgrade behavior. I'd prefer those operators tune themselves to align more closely with real world cluster behavior, I fear their timeouts are tuned based on behaviors observed from unloaded or only CI clusters.

@smarterclayton
Copy link
Contributor

I agree with Scott (I think) - I don’t believe degraded is an acceptable state for operators ever, especially during upgrade, as we have historically defined it in practice.

@wking
Copy link
Member Author

wking commented Dec 5, 2020

I don’t believe degraded is an acceptable state for operators ever, especially during upgrade, as we have historically defined it in practice.

I'm not arguing for it to be acceptable, I'm arguing about it being non-blocking. We will still alert if operators go Degraded=True in the wild. We can build CI to fail operators that go Degraded=True during a run, if we don't do that already. But if an operator goes Degraded=True in a customer cluster, it's not clear to me how sticking mid-update is helping the customer resolve that situation, vs. pushing through with the rest of the update, as long as the operator is Available=True, and letting admins sort out the degrading issue orthogonally.

@openshift-merge-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/integration 02fad45 link /test integration
ci/prow/e2e-agnostic-operator 02fad45 link /test e2e-agnostic-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2021
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 8, 2021
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 9, 2021

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2021
@openshift-ci openshift-ci bot closed this May 9, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 9, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member Author

wking commented Nov 25, 2024

Picking this one back up for more discussion.

@wking wking reopened this Nov 25, 2024
Copy link
Contributor

openshift-ci bot commented Nov 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking wking force-pushed the ignore-operator-degraded branch from 02fad45 to b0a2a4b Compare November 25, 2024 20:27
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 25, 2024
@wking wking changed the title pkg/cvo/internal: Do not block on Degraded=True ClusterOperator OTA-542: pkg/cvo/internal: Do not block on Degraded=True ClusterOperator Nov 25, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 25, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 25, 2024

@wking: This pull request references OTA-542 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

We have blocking on this condition since 545c342 (#31) when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (#136), motivated by install speed. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block depndent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up when an operator goes Degraded=True for a while, we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperator tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

CC @abhinavdahiya, @deads2k, @smarterclayton as folks who were involved in the logic I'm removing here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

We have blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block depndent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins
when an operator goes Degraded=True for a while, we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
@petr-muller
Copy link
Member

/cc
/remove-lifecycle rotten

@openshift-ci openshift-ci bot requested a review from petr-muller November 26, 2024 12:31
@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 26, 2024
@wking wking force-pushed the ignore-operator-degraded branch from cf0f1e3 to b198db0 Compare November 27, 2024 05:49
Copy link
Contributor

openshift-ci bot commented Dec 14, 2024

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn b198db0 link false /test okd-scos-e2e-aws-ovn
ci/prow/unit b198db0 link true /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants