
pkg/cvo/status: Failing is not more serious than Degraded#905

Closed
wking wants to merge 1 commit into openshift:master from wking:degraded-vs-failing

Conversation

@wking
Member

@wking wking commented Feb 23, 2023

The outgoing text goes way back to the local Failing type in 7f5b7f4 (conditions: Use a consistent constant for the Failing condition, 2019-05-19, #191). But ClusterVersion doesn't include Degraded, and ClusterOperators don't set Failing, so we don't need a relative-seriousness ranking. In practice, a Degraded=True ClusterOperator is one of several issues that could lead to a Failing=True ClusterVersion, and when that's the only issue going on, they clearly have the same severity. When an Available=False ClusterOperator feeds a Failing=True ClusterVersion, that would be worse than a Degraded=True, Available=True ClusterOperator. And there may also be issues, like the CVO failing to reconcile a peripheral change such as an alert rule, where ClusterVersion is Failing=True despite the issue being less severe than many Degraded=True ClusterOperator situations.

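The aggregation the description talks about can be sketched in Go. This is an illustrative model only: the `Operator` type and `clusterFailing` helper are hypothetical names invented for this sketch, and the real CVO aggregation in pkg/cvo/status.go is considerably more involved.

```go
// Hypothetical sketch: how either an Available=False or a Degraded=True
// ClusterOperator condition could feed a Failing=True ClusterVersion,
// without implying any relative-seriousness ranking between the two.
package main

import "fmt"

// Operator models just the two ClusterOperator conditions discussed here.
type Operator struct {
	Name      string
	Available bool
	Degraded  bool
}

// clusterFailing reports whether any operator is unavailable or degraded;
// either situation can surface as Failing=True on ClusterVersion.
func clusterFailing(ops []Operator) bool {
	for _, op := range ops {
		if !op.Available || op.Degraded {
			return true
		}
	}
	return false
}

func main() {
	ops := []Operator{
		{Name: "etcd", Available: true, Degraded: false},
		{Name: "network", Available: true, Degraded: true},
	}
	// A single Degraded=True operator is enough for Failing=True.
	fmt.Println(clusterFailing(ops))
}
```

Note that the helper deliberately collapses both inputs into the same boolean: the sketch cannot tell "worse than Degraded" from "same as Degraded", which is exactly the PR's point.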
@openshift-ci
Contributor

openshift-ci Bot commented Feb 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2023
Comment thread pkg/cvo/status.go

  // ClusterStatusFailing is set on the ClusterVersion status when a cluster
- // cannot reach the desired state. It is considered more serious than Degraded
- // and indicates the cluster is not healthy.
+ // cannot reach the desired state. It indicates the cluster is not healthy.
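For context, a minimal sketch of how the revised comment would sit on the declaration. The "Failing" condition name is real, but the plain string constant here is a simplification (the repository uses a typed constant from the config API), and the wording reflects this PR's proposal rather than merged code.

```go
// Sketch only: simplified stand-in for the declaration in pkg/cvo/status.go.
package main

import "fmt"

// ClusterStatusFailing is set on the ClusterVersion status when a cluster
// cannot reach the desired state. It indicates the cluster is not healthy.
const ClusterStatusFailing = "Failing"

func main() {
	fmt.Println(ClusterStatusFailing)
}
```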
Member


Why can't we say that it probably means one or more operators are in a degraded state, rather than saying "not healthy"?

Member Author


Because there could be other reasons for Failing=True besides unavailable/degraded operators. Although without #867 in place, it's hard to get Telemetry stats on how frequent the various modes are.

Member


Though this is not customer-facing documentation, it still makes me nervous to say the cluster is not healthy while also saying we don't consider this more serious than operators' Degraded condition.

Member Author


Failing may be more serious than Degraded (e.g. it may be because a ClusterOperator is Available=False). Failing may be less serious than Degraded (e.g. we may be having trouble rolling out a peripheral alert rule). I'm not saying Failing is not serious. I'm just dropping the apples-to-oranges Degraded comparison.

Contributor


Even so, Failing seems more serious, given that it indicates a cluster cannot reach its desired state, is unhealthy, and requires an administrator to intervene. Degraded's consequences may vary with the specific cluster operator; Degraded is only an indication that something may need investigation and adjustment. As long as the operator is available, the Degraded condition does not cause user workload failure or application downtime. Failing does sound scarier in the documentation, I am not going to lie, and a failing cluster sounds more serious than a degraded operator.

I agree with Trevor's statement:

dropping the apples-to-oranges Degraded comparison

The comparison seems to depend on the specific reasons for the conditions, and since we can't tell which reasons are more frequent (https://github.com/openshift/cluster-version-operator/pull/905/files#r1116077435), we can drop the comparison. I would simply say Failing is for reporting one group of things, and Degraded is for reporting another group of things, and both can be reported due to more or less serious issues, and thus the comparison can be dropped.


It's also worth pointing out that if we end up modifying the comment, we can also modify the comment in the openshift/oc repository (https://github.com/openshift/oc/blob/master/pkg/cli/admin/upgrade/upgrade.go#L32-L35).
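A small, purely illustrative sketch of the scenarios weighed in this thread. The severity labels are editorial judgments from the discussion above, not anything defined by the API, and every scenario surfaces as the same Failing=True condition, which is why no fixed ranking holds.

```go
// Illustrative only: three causes of Failing=True with incomparable
// severities relative to a typical Degraded=True operator.
package main

import "fmt"

type scenario struct {
	cause    string
	severity string // relative to a typical Degraded=True operator
}

// Every scenario below can surface as Failing=True on ClusterVersion.
var scenarios = []scenario{
	{"ClusterOperator Available=False", "worse"},
	{"ClusterOperator Degraded=True, Available=True", "about the same"},
	{"CVO cannot reconcile a peripheral alert rule", "possibly less severe"},
}

func main() {
	for _, s := range scenarios {
		fmt.Printf("%-46s -> Failing=True (%s)\n", s.cause, s.severity)
	}
}
```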

@openshift-ci
Contributor

openshift-ci Bot commented Mar 10, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-agnostic-upgrade-into-change | 06ef64e | link | true | /test e2e-agnostic-upgrade-into-change |
| ci/prow/e2e-agnostic-upgrade-out-of-change | 06ef64e | link | true | /test e2e-agnostic-upgrade-out-of-change |
| ci/prow/e2e-agnostic-ovn-upgrade-out-of-change | 06ef64e | link | true | /test e2e-agnostic-ovn-upgrade-out-of-change |
| ci/prow/e2e-agnostic-ovn-upgrade-into-change | 06ef64e | link | true | /test e2e-agnostic-ovn-upgrade-into-change |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2023
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 13, 2023
@petr-muller
Member

/uncc

@openshift-ci openshift-ci Bot removed the request for review from petr-muller August 2, 2023 10:05
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci Bot closed this Sep 2, 2023
@openshift-ci
Contributor

openshift-ci Bot commented Sep 2, 2023

@openshift-bot: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

5 participants