docs: degraded condition#161
Conversation
f1b1946 to
01a9a14
Compare
| change is happening that does not require a roll-out of new pods. If your | ||
| rolling deployment bursts above its target replica count, you may be | ||
| `Progressing` but not `Degraded` because you have your desired number of pods | ||
| running to meet your service demand. If a component remains degraded for an |
There was a problem hiding this comment.
I would think another indicator of a persistent Degraded state is Degraded=true and Progressing=false.
I still wish there was a way to indicate the notion of "external intervention is needed to get out of Degraded=true"
bfff2c8 to
6b8b4a1
Compare
| service is in a `Degraded` state. A component may be `Degraded` while it is | ||
| `Progressing` to a new desired state; for example, if only 2 of the 3 desired | ||
| replicas are achieved. As a result, it may be normal for a operator during | ||
| upgrade to temporarily report `Degraded`. A component may be `Progressing` |
There was a problem hiding this comment.
I'm still not 100% on board with this. I don't want operators flashing between degraded and not degraded. There needs to be a "within a minute or so, if the condition isn't resolved, an operator should report degraded". It's acceptable to start by flashing degraded, but I'm going to open bugs to you until during normal upgrades you don't.
I don't think being 2/3 during a rolling upgrade is degraded. I think being 2/3 because the upgrade can't make progress is degraded.
There was a problem hiding this comment.
We need to be very careful that people don't interpret Degraded as "normal operation but change is happening". Degraded is "you either have no idea what is going on" (which is something a development team is expected to fix and bugs will be opened against you) or "something is legitimately failing".
I don't want to enshrine "shrug" as degraded.
6b8b4a1 to
a43c196
Compare
| 1. A operator doesn't report the `Available` status condition the first time | ||
| until they are completely rolled out (or within some reasonable percentage if | ||
| the component must be installed to all nodes) | ||
| 2. An operator reports `Degraded` when its current state does not match its |
There was a problem hiding this comment.
An operator reports
Degradedwhen its current state does not match its
desired state resulting in a lower quality of service over a period of time.
nit: An operator reports Degraded when its current state does not match its desired state over a period of time resulting in a lower quality of service . is much more clearer that operators mark degraded when they have been trying to achieve desired but haven't achieved it for a period of time....
There was a problem hiding this comment.
agreed. updated phrasing to match.
michaelgugino
left a comment
There was a problem hiding this comment.
I'm against this change. This will add a good amount of complexity for having to deal with transient issues. The amount of built-in requirements is already getting quite tedious, and I don't think there's room for another condition.
a43c196 to
8402d21
Compare
|
@michaelgugino is your concern that the over-a-period-of-time requirement is going to mask issues? It sounds like you'd rather the operators fail fast than hem and haw. (Just want to understand your position) |
|
/retest |
|
see: openshift/api#287 |
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, derekwaynecarr, eparis The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@eparis So are we changing the definition or not? You implied in the other PRs that this was just a rename for now, but the docs change here also changes the semantics. |
|
the name change is required before GA, the semantic change is aspirational. |
| pod is crash-looping. The service is `Available` but `Degraded` because it | ||
| may have a lower quality of service. A component may be `Progressing` but | ||
| not `Degraded` because the transition from one state to another does not | ||
| persist over a long enough period to report `Degraded`. A service should not |
There was a problem hiding this comment.
Why operator can't be Progressing=True and Degraded=True? Today, if progressing towards a new version fails we flip Failing=True but keep Progressing=True (if for instance, the master pool in MCO don't get ready after an upgrade, and that may be temporary till all nodes roll out or persistent over a period of time which I guess we can try to measure/act on). Besides, Why can't we be Degraded while Progressing?
This was still talking aobut the old Failing. And the content has been replaced by the discussion that came in with the Degraded docs in 8402d21 (docs: degraded condition, 2019-04-11, openshift#161).
No description provided.