docs: degraded condition by derekwaynecarr · Pull Request #161 · openshift/cluster-version-operator

derekwaynecarr · 2019-04-11T20:31:13Z

No description provided.

sjenning · 2019-04-11T20:36:15Z

+   change is happening that does not require a roll-out of new pods.  If your
+   rolling deployment bursts above its target replica count, you may be
+   `Progressing` but not `Degraded` because you have your desired number of pods
+   running to meet your service demand.  If  a component remains degraded for an


I would think another indicator of a persistent Degraded state is Degraded=true and Progressing=false.

I still wish there was a way to indicate the notion of "external intervention is needed to get out of Degraded=true"
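
The Degraded=true / Progressing=false heuristic mentioned above can be sketched as a small predicate. This is only an illustration, not code from the operator: the function name `looks_persistently_degraded` and the plain mapping of condition type to status string are assumptions made for the example.

```python
from typing import Mapping


def looks_persistently_degraded(conditions: Mapping[str, str]) -> bool:
    """Return True when Degraded is "True" and Progressing is "False".

    `conditions` maps a condition type to its status string
    ("True", "False", or "Unknown"). This shape is hypothetical and is
    used here only to illustrate the heuristic from the comment.
    Missing conditions are treated as "Unknown".
    """
    degraded = conditions.get("Degraded", "Unknown")
    progressing = conditions.get("Progressing", "Unknown")
    return degraded == "True" and progressing == "False"
```

An operator that is still Progressing, or whose Progressing status is unknown, would not be flagged by this check, which matches the idea that a degraded state with no rollout under way is unlikely to resolve on its own.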

smarterclayton · 2019-04-11T20:54:10Z

+   service is in a `Degraded` state.  A component may be `Degraded` while it is
+   `Progressing` to a new desired state; for example, if only 2 of the 3 desired
+   replicas are achieved.  As a result, it may be normal for a operator during
+   upgrade to temporarily report `Degraded`.  A component may be `Progressing`


I'm still not 100% on board with this. I don't want operators flashing between degraded and not degraded. There needs to be a rule along the lines of "if the condition isn't resolved within a minute or so, the operator should report degraded". It's acceptable to start by flashing degraded, but I'm going to keep opening bugs against you until you stop doing so during normal upgrades.

I don't think being 2/3 during a rolling upgrade is degraded. I think being 2/3 because the upgrade can't make progress is degraded.

We need to be very careful that people don't interpret Degraded as "normal operation but change is happening". Degraded is "you either have no idea what is going on" (which is something a development team is expected to fix and bugs will be opened against you) or "something is legitimately failing".

I don't want to enshrine "shrug" as degraded.
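
One way to read the "within a minute or so" requirement above is as a debounce on the Degraded condition. The sketch below is a minimal illustration under that assumption; the class name `DegradedDebouncer`, the 60-second default, and the caller-supplied clock are all hypothetical and do not come from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DegradedDebouncer:
    """Report Degraded only after the problem has persisted for a grace period."""

    grace_seconds: float = 60.0  # assumed value, mirroring "a minute or so"
    unhealthy_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health observation and return whether to report Degraded.

        `now` is a timestamp in seconds supplied by the caller, which keeps
        the sketch deterministic and easy to test.
        """
        if healthy:
            # Any healthy observation resets the timer.
            self.unhealthy_since = None
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        return now - self.unhealthy_since >= self.grace_seconds
```

With this shape, a brief dip during a normal rolling upgrade never surfaces as Degraded, while a problem that outlives the grace period does.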

abhinavdahiya · 2019-04-11T23:40:40Z

+1. A operator doesn't report the `Available` status condition the first time
+   until they are completely rolled out (or within some reasonable percentage if
+   the component must be installed to all nodes)
+2. An operator reports `Degraded` when its current state does not match its


An operator reports Degraded when its current state does not match its
desired state resulting in a lower quality of service over a period of time.

nit: "An operator reports Degraded when its current state does not match its desired state over a period of time, resulting in a lower quality of service." makes it much clearer that operators mark Degraded when they have been trying to reach the desired state but haven't achieved it for a period of time.

derekwaynecarr · 2019-04-12

Agreed. Updated phrasing to match.

michaelgugino

I'm against this change. This will add a good amount of complexity for having to deal with transient issues. The amount of built-in requirements is already getting quite tedious, and I don't think there's room for another condition.

crawford · 2019-04-12T18:29:14Z

@michaelgugino is your concern that the over-a-period-of-time requirement is going to mask issues? It sounds like you'd rather the operators fail fast than hem and haw. (Just want to understand your position)

eparis · 2019-04-17T15:10:09Z

/retest
/lgtm
we can iterate on this definition over time, but we merged the name change, so we should talk some more about it :)

eparis · 2019-04-17T15:23:37Z

see: openshift/api#287

abhinavdahiya · 2019-04-17T16:53:57Z

/retest
/lgtm
we can iterate on this definition over time, but we merged the name change, so we should talk some more about it :)

/approve

openshift-ci-robot · 2019-04-17T16:54:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, derekwaynecarr, eparis

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhinavdahiya · 2019-04-17T16:54:03Z

/retest

openshift-bot · 2019-04-17T17:27:26Z

/retest

Please review the full test history for this PR and help us cut down flakes.

danwinship · 2019-04-17T18:06:49Z

@eparis So are we changing the definition or not? You implied in the other PRs that this was just a rename for now, but the docs change here also changes the semantics.

eparis · 2019-04-17T21:08:51Z

the name change is required before GA, the semantic change is aspirational.

runcom · 2019-04-19T08:30:37Z

+   pod is crash-looping. The service is `Available` but `Degraded` because it
+   may have a lower quality of service.  A component may be `Progressing` but
+   not `Degraded` because the transition from one state to another does not
+   persist over a long enough period to report `Degraded`.  A service should not


Why can't an operator be Progressing=True and Degraded=True? Today, if progressing towards a new version fails, we flip Failing=True but keep Progressing=True (for instance, if the master pool in the MCO doesn't become ready after an upgrade; that may be temporary until all nodes roll out, or persistent over a period of time, which I guess we can try to measure and act on). Besides, why can't we be Degraded while Progressing?
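
To make the question concrete, the two conditions can be computed independently, so a stalled rollout reports both at once while a healthy rollout reports only Progressing. This is a rough sketch under stated assumptions: the `Rollout` fields, the `rollout_conditions` name, and the 600-second stall threshold are hypothetical and are not how the MCO or any other operator actually computes its status.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Rollout:
    desired: int                   # replicas wanted
    ready: int                     # replicas currently serving
    updated: int                   # replicas already on the new version
    seconds_since_progress: float  # time since `updated` or `ready` last changed


def rollout_conditions(r: Rollout, stall_seconds: float = 600.0) -> Dict[str, bool]:
    """Compute Progressing and Degraded independently of each other.

    Progressing: the rollout has not yet reached the desired version everywhere.
    Degraded: fewer replicas are ready than desired AND nothing has changed
    for at least `stall_seconds` (an assumed threshold).
    """
    progressing = r.updated < r.desired
    degraded = r.ready < r.desired and r.seconds_since_progress >= stall_seconds
    return {"Progressing": progressing, "Degraded": degraded}
```

Under these assumptions, 2 of 3 replicas during an active rollout yields Progressing only, while 2 of 3 replicas with no movement past the threshold yields both Progressing and Degraded.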

This was still talking about the old Failing. The content has since been replaced by the discussion that came in with the Degraded docs in 8402d21 (docs: degraded condition, 2019-04-11, openshift#161).

openshift-ci-robot requested review from crawford and wking April 11, 2019 20:31

openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Apr 11, 2019

derekwaynecarr force-pushed the degraded-condition branch from f1b1946 to 01a9a14 Compare April 11, 2019 20:34

openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Apr 11, 2019

sjenning reviewed Apr 11, 2019

derekwaynecarr force-pushed the degraded-condition branch 2 times, most recently from bfff2c8 to 6b8b4a1 Compare April 11, 2019 20:47

derekwaynecarr mentioned this pull request Apr 11, 2019

introduce degraded condition openshift/api#286

Merged

smarterclayton reviewed Apr 11, 2019

derekwaynecarr force-pushed the degraded-condition branch from 6b8b4a1 to a43c196 Compare April 11, 2019 21:48

openshift-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Apr 11, 2019

abhinavdahiya reviewed Apr 11, 2019

michaelgugino suggested changes Apr 12, 2019

docs: degraded condition

8402d21

derekwaynecarr force-pushed the degraded-condition branch from a43c196 to 8402d21 Compare April 12, 2019 18:05

This was referenced Apr 17, 2019

Change Failed to Degraded #165

Merged

Change ClusterOperatorStatus Failed to Degraded openshift/library-go#354

Merged

openshift-ci-robot assigned eparis Apr 17, 2019

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Apr 17, 2019

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Apr 17, 2019

openshift-merge-robot merged commit 5d518b9 into openshift:master Apr 17, 2019

danwinship mentioned this pull request Apr 18, 2019

Change "Failing" to "Degraded" openshift/cluster-network-operator#147

Merged

runcom reviewed Apr 19, 2019

wking mentioned this pull request May 10, 2019

docs/dev/clusteroperator: Remove obsolete "Conditions" section #184

Closed

12 participants
