Skip to content

docs: degraded condition#161

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
derekwaynecarr:degraded-condition
Apr 17, 2019
Merged

docs: degraded condition#161
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
derekwaynecarr:degraded-condition

Conversation

@derekwaynecarr
Copy link
Copy Markdown
Member

No description provided.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 11, 2019
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 11, 2019
Comment thread docs/dev/clusteroperator.md Outdated
change is happening that does not require a roll-out of new pods. If your
rolling deployment bursts above its target replica count, you may be
`Progressing` but not `Degraded` because you have your desired number of pods
running to meet your service demand. If a component remains degraded for an
Copy link
Copy Markdown
Contributor

@sjenning sjenning Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think another indicator of a persistent Degraded state is Degraded=true and Progressing=false.

I still wish there was a way to indicate the notion of "external intervention is needed to get out of Degraded=true"

@derekwaynecarr derekwaynecarr force-pushed the degraded-condition branch 2 times, most recently from bfff2c8 to 6b8b4a1 Compare April 11, 2019 20:47
Comment thread docs/dev/clusteroperator.md Outdated
service is in a `Degraded` state. A component may be `Degraded` while it is
`Progressing` to a new desired state; for example, if only 2 of the 3 desired
replicas are achieved. As a result, it may be normal for a operator during
upgrade to temporarily report `Degraded`. A component may be `Progressing`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not 100% on board with this. I don't want operators flashing between degraded and not degraded. There needs to be a "within a minute or so, if the condition isn't resolved, an operator should report degraded". It's acceptable to start by flashing degraded, but I'm going to open bugs to you until during normal upgrades you don't.

I don't think being 2/3 during a rolling upgrade is degraded. I think being 2/3 because the upgrade can't make progress is degraded.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be very careful that people don't interpret Degraded as "normal operation but change is happening". Degraded is "you either have no idea what is going on" (which is something a development team is expected to fix and bugs will be opened against you) or "something is legitimately failing".

I don't want to enshrine "shrug" as degraded.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 11, 2019
1. A operator doesn't report the `Available` status condition the first time
until they are completely rolled out (or within some reasonable percentage if
the component must be installed to all nodes)
2. An operator reports `Degraded` when its current state does not match its
Copy link
Copy Markdown
Contributor

@abhinavdahiya abhinavdahiya Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An operator reports Degraded when its current state does not match its
desired state resulting in a lower quality of service over a period of time.

nit: An operator reports Degraded when its current state does not match its desired state over a period of time resulting in a lower quality of service . is much more clearer that operators mark degraded when they have been trying to achieve desired but haven't achieved it for a period of time....

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. updated phrasing to match.

Copy link
Copy Markdown
Contributor

@michaelgugino michaelgugino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against this change. This will add a good amount of complexity for having to deal with transient issues. The amount of built-in requirements is already getting quite tedious, and I don't think there's room for another condition.

@crawford
Copy link
Copy Markdown
Contributor

crawford commented Apr 12, 2019

@michaelgugino is your concern that the over-a-period-of-time requirement is going to mask issues? It sounds like you'd rather the operators fail fast than hem and haw. (Just want to understand your position)

@eparis
Copy link
Copy Markdown
Member

eparis commented Apr 17, 2019

/retest
/lgtm
we can iterate on this definition over time, but we merged the name change, so we should talk some more about it :)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2019
@eparis
Copy link
Copy Markdown
Member

eparis commented Apr 17, 2019

see: openshift/api#287

@abhinavdahiya
Copy link
Copy Markdown
Contributor

/retest
/lgtm
we can iterate on this definition over time, but we merged the name change, so we should talk some more about it :)

/approve

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, derekwaynecarr, eparis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2019
@abhinavdahiya
Copy link
Copy Markdown
Contributor

/retest

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@danwinship
Copy link
Copy Markdown
Contributor

@eparis So are we changing the definition or not? You implied in the other PRs that this was just a rename for now, but the docs change here also changes the semantics.

@openshift-merge-robot openshift-merge-robot merged commit 5d518b9 into openshift:master Apr 17, 2019
@eparis
Copy link
Copy Markdown
Member

eparis commented Apr 17, 2019

the name change is required before GA, the semantic change is aspirational.

pod is crash-looping. The service is `Available` but `Degraded` because it
may have a lower quality of service. A component may be `Progressing` but
not `Degraded` because the transition from one state to another does not
persist over a long enough period to report `Degraded`. A service should not
Copy link
Copy Markdown
Member

@runcom runcom Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why operator can't be Progressing=True and Degraded=True? Today, if progressing towards a new version fails we flip Failing=True but keep Progressing=True (if for instance, the master pool in MCO don't get ready after an upgrade, and that may be temporary till all nodes roll out or persistent over a period of time which I guess we can try to measure/act on). Besides, Why can't we be Degraded while Progressing?

wking added a commit to wking/cluster-version-operator that referenced this pull request May 10, 2019
This was still talking aobut the old Failing.  And the content has
been replaced by the discussion that came in with the Degraded docs in
8402d21 (docs: degraded condition, 2019-04-11, openshift#161).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.