
Conversation

@abhinavdahiya (Contributor) commented Aug 8, 2019

Adds alerts for cluster-version-operator and cluster operators

  • ClusterVersionOperatorDown
    This alert is fired when the cluster-version-operator is not providing any metrics. Severity is critical, as upgrades will not work and the cluster can drift from its expected state.
  • ClusterOperatorDegraded
    This alert is fired when a cluster operator is degraded. This is important because the cluster might be in a state that is unacceptable for a production cluster, for example using emptyDir as the storage backend for the registry. Severity is critical, as a
    degraded operator implies the operands are in a state that is not correct for the cluster.
  • ClusterOperatorDown
    This alert fires when a cluster operator is not up, i.e. cluster_operator_up is 0. This means the operator might not be reconciling its operands.
  • ClusterOperatorFlapping
    This alert fires when a cluster operator is flapping between up and down continuously because of some unexpected condition (a rough sketch of the full rule group follows this list).
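
A rough sketch of the rule group these bullets describe, assembled from the expressions quoted in the review threads below. Treat it as illustrative only: the group name, the ClusterOperatorDegraded message, the ClusterOperatorFlapping severity, and some of the for: durations are not stated in this PR and are filled in here as assumptions, and the severities themselves were still being debated in review:

groups:
- name: cluster-version
  rules:
  - alert: ClusterVersionOperatorDown
    annotations:
      message: Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
    expr: |
      absent(up{job="cluster-version-operator"} == 1)
    for: 15m
    labels:
      severity: critical
  - alert: ClusterOperatorDown
    annotations:
      message: Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
    expr: |
      cluster_operator_up{job="cluster-version-operator"} == 0
    for: 10m   # assumed from the message text; not quoted directly in the review
    labels:
      severity: critical
  - alert: ClusterOperatorDegraded
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 mins.   # wording assumed; not quoted in the review
    expr: |
      cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorFlapping
    annotations:
      message: Cluster operator {{ $labels.name }} up status is changing often. This might cause upgrades to be unstable.
    expr: |
      changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
    for: 2m
    labels:
      severity: warning   # severity not quoted for this alert; reviewers suggested warning

Note that in the actual manifest the templating appears as {{ "{{ $labels.name }}" }} because the file is rendered through a Go template before Prometheus sees it; the sketch above shows the plain Prometheus form.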

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 8, 2019
@abhinavdahiya abhinavdahiya changed the title isntall: add alerts for cluster-version-operator install: add alerts for cluster-version-operator Aug 8, 2019
rules:
- alert: ClusterVersionOperatorDown
  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery.
Contributor

Probably better to indicate why someone might care, as well as possible causes, since this is critical. “Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible”

Contributor Author (@abhinavdahiya), Aug 12, 2019

rules:
- alert: ClusterOperatorDown
  annotations:
    message: ClusterOperator {{ $labels.name}} has not been available for 10 mins.
Contributor

I would prefer to check for degraded or not available. Both are significant.

Contributor

Also explain impact here.

Member

I would prefer to check for degraded or not available.

Isn't that what the current available && !failing is doing? If you're not available, we're not setting cluster-operator-up for you. If you're available but degraded (which backs the failing variable), we're not setting it either.

Contributor Author (@abhinavdahiya), Aug 12, 2019

added ClusterOperatorDegraded in 136fb72

  for: 2m
  labels:
    severity: critical

Member

$ git show --check --oneline origin/pr/232
297f6ca isntall: add alerts for cluster-version-operator
install/0000_90_cluster-version-operator_00_servicemonitor.yaml:57: new blank line at EOF.

  annotations:
    message: ClusterOperator {{ $labels.name}} has not been available for 10 mins.
  expr: |
    cluster_operator_up{job="cluster-version-operator"}
Member

Do we actually report this? From my reading of this loop, it's just for ClusterOperator objects, and that's the operators reporting to the CVO, not the CVO itself (e.g. no cluster-version-operator) in here. Or is this "cluster_operator_up reported by the cluster-version-operator job" or some such? Can you unpack this expression for me?

Contributor Author

I'm not sure whether other actors in the cluster can also emit something with the name cluster_operator_up, so I just want to make sure it's on the time series emitted by the cluster-version-operator.

Currently there's only one CVO, so the namespace should be fine.

Member

currently there's only one cvo, so namespace should be fine.

Couldn't end users create jobs with that name? Can we rely on not scraping user namespaces?

  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
Member

This is likely to be sufficient on its own, but we may want namespace="openshift-cluster-version" in the filter as well.
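
A minimal sketch of that suggestion (assuming, as stated above, that the CVO is scraped from the openshift-cluster-version namespace):

  expr: |
    absent(up{job="cluster-version-operator", namespace="openshift-cluster-version"} == 1)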

Contributor Author

@wking (Member) commented Aug 8, 2019

/title Bug 1717632: install: add alerts for cluster-version-operator

That's what this is fixing, right?

@wking (Member) commented Aug 8, 2019

Oops:

/retitle Bug 1717632: install: add alerts for cluster-version-operator

@openshift-ci-robot openshift-ci-robot changed the title install: add alerts for cluster-version-operator Bug 1717632: install: add alerts for cluster-version-operator Aug 8, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 8, 2019
@openshift-ci-robot (Contributor)

@abhinavdahiya: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya force-pushed the alerts branch 3 times, most recently from 48237bc to 136fb72 on August 12, 2019 at 21:38
@abhinavdahiya (Contributor Author)

/test e2e-aws

@abhinavdahiya (Contributor Author)

/test e2e-aws

rules:
- alert: ClusterVersionOperatorDown
  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Would be tempted to say Cluster version operator has ... to make it flow as a message in more contexts.

Contributor Author

rules:
- alert: ClusterOperatorDown
  annotations:
    message: ClusterOperator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Probably Cluster operator {{ ... }}

Contributor Author

    severity: critical
- alert: ClusterOperatorFlapping
  annotations:
    message: ClusterOperator {{ "{{ $labels.name }}" }} changing its up status often. This might cause upgrades to be unstable.
Contributor

Might suggest ... up status is changing often. This might ....

Do pods already have an alert that fires if they're restarting? I think this list is probably sufficient.

Contributor Author

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 16, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated default and
* The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default, the CVO will continue applying the current
version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
@brancz (Contributor) left a comment

I think we need to rethink some of these strategies a bit. We're also too liberal here with applying severity: critical. I'm not saying these things are not important, but they're not important enough to wake someone up at night. All the issues that these alerts are attempting to assess can easily wait until working hours.

rules:
- alert: ClusterOperatorDown
  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Why this alert, as opposed to an alert like ClusterVersionOperatorDown for each cluster operator? As I understand it, this alert has multiple indirections, which we want to avoid at all costs in monitoring.

Contributor Author

The cluster-version operator is not an SLO; it does not have a ClusterOperator of its own. That's why the ClusterVersionOperatorDown alert is right above.

Contributor Author (@abhinavdahiya), Aug 19, 2019

As I understand this alert has multiple indirections which we want to avoid at all costs in monitoring.

Hmm, do you mean that this alert can point at many different cluster operators? Can you provide details on why that should be avoided?

Contributor

No, it's an alert on a metric that is not exposed by the process of origin. That is, the cluster-version-operator is exposing a metric on behalf of another operator and, even worse, with information provided by that other operator. This requires that everything is working properly in order to report a failure case. This is why metrics should be exposed wherever the origin of the information is.

Contributor

We are monitoring the core control loop of the cluster. We have to monitor both, not just one or the other. This alert reflects the “making progress” part of the product: operators A and B coordinate via this API, not via alerts and metrics. We should have per-operator monitoring. We must have core product control-loop monitoring and alerting.

Contributor

If we move forward with this PR, then I would want to see concrete actions as to how we will properly resolve this in the very next release (4.3), which would be the requirement for each cluster operator to:

  • identify and expose metrics that describe their SLIs
  • only be allowed to page customers (that includes our own SRE) with alerts that describe an error budget being burnt at a high rate
  • not include alerts that are meant for our engineering teams; these are to be done via the telemetry infrastructure (as in, alerting on data in Telemetry)

Contributor

Degraded operators are functionally indistinguishable from broken clusters. We have an alert because we know something is wrong.

Evidence-wise, I haven't seen more than a small percentage of degraded operators that were not error conditions that should result in a customer requesting a bug fix.

For disconnected users, on-cluster alerts are mandatory to inform them to contact Red Hat.

    cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
  for: 10m
  labels:
    severity: critical
Contributor

I understand the intent of this, but a cluster administrator is not going to know what to do here. This goes back to my general criticism about using cluster_operator_conditions for any type of alerting. Apart from the indirections it imposes, this is the wrong approach to monitoring. Instead we should be setting SLOs for each component and alerting on the failure to deliver those. Those alerts can then have clear runbooks as to what actionable things can be done by the cluster administrator about this particular component.

I think we can have alerts that go in a similar direction as this on the Telemeter side, so our engineering teams can act on this, but I don't think this is useful for customers as is. In fact I think it's harmful, as it produces alert fatigue.

Contributor Author

I think we want the user to actually be alerted when a cluster operator is degraded for that long.

And can you explain why there is a requirement for a runbook for each alert?

Contributor Author

Do we have a better suggestion for the severity? Any guidelines that we can follow?

Member

In fact I think it's harmful as it produces alert fatigue.

I'd rather avoid alert fatigue on these by not hitting these thresholds, not by not monitoring the thresholds :). All of these things seem like pretty bad "I'd want to know first thing in the morning if this was happening" things.

Contributor

Critical severity is for when you page someone and there is something concrete that must be fixed ASAP. I would suggest severity warning for something like this, but it's still too vague an alert for me to send to anyone, as it's unclear what to do when it fires.

Even if it goes against our instinct, removing or not using flaky alerts is better than causing alert fatigue; alert fatigue causes people to stop trusting our alerts and ignore all of them.

Contributor

The point of this alert is to wake up an engineer because they are broken. We are increasingly targeting alerts towards our engineering / monitoring team, not just the customers.

Alert fatigue is real but this is a critical failure of a core operator, and we have to react to it.

Contributor

We are increasingly targeting alerts towards our engineering / monitoring team, not just the customers.

Then this should be an alert evaluated by telemetry infrastructure, not in-cluster.

Contributor

I disagree - we will have customers without telemetry.

Our job is to fix bugs rapidly. Alerts can be generated by bugs, user error, and unanticipated states.

It would be irresponsible of us not to alert these conditions on cluster for all disconnected users.

  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  expr: |
    cluster_operator_up{job="cluster-version-operator"} == 0
Contributor

How is this metric determined? And I assume this metric is somehow synthetically produced by the cluster-version-operator?

Contributor Author

How is this metric determined? And I assume this metric is somehow synthetically produced by the cluster-version-operator?

What do we gain from knowing how this is generated?

It looks at each ClusterOperator and reports 1 when it is available and not degraded, else 0:

// One gauge series per ClusterOperator, labelled with its name and version.
g := m.clusterOperatorUp.WithLabelValues(op.Name, firstVersion)
failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, ClusterStatusFailing)
available := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable)
// cluster_operator_up is 1 only when the operator is Available and not failing.
if available && !failing {
	g.Set(1)
} else {
	g.Set(0)
}
ch <- g

Member (@wking), Aug 19, 2019

There's probably a hole here where a rolling outage across operators could trigger the alert without having a single operator down for the full threshold. For example:

  1. Scrape 1, operator A is down.
  2. Scrape 2, operator B is down; A is back up.
  3. Scrape 3, operator C is down; B is back up.
  4. Alert fires with "there has been at least one down operator for the last three scrapes (over 10m)"

I'm not sure how to adjust this query to avoid that, although it plays into the "many operators behind one alert" point you raise here.

Contributor

These are metrics that should be exposed by the operators themselves and not by the CVO. That would remove the indirections, and we could make the alerts concrete and actionable. As they are, these alerts, except for the very first one, are not actionable, nor do they follow the best practice of metrics coming from the origin.

Contributor

See my other comment, but the metrics are part of the product. An individual operator doesn’t report whether other operators see the result. We must see the core loop view like this.

Contributor

I understand what you mean by these metrics describing the "advancing of the CVO", and I'm not arguing that it's not important. However, it is not clear to an SRE (even our own SREs; I've talked to them), if they are paged with this alert today, what an actionable thing to troubleshoot is, or even what's broken. As is, this is at most actionable for our own engineering. As I said in the other comment, I want us to have a plan for how we will get to a state where this is clear and where it is clear what about the product is not working when it's not. This needs to be top priority for the next release(s).

  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
Contributor

The 2m range selector doesn't seem to cover sufficient failure cases; also, this behavior seems highly specific to each operator. Some operators have a very short reconciliation loop, some have a very long one. I don't think we're going to be able to find any good general-purpose value for this.
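
Purely to illustrate the tuning being discussed (neither value below was proposed in this thread), a wider window for the flapping check might look like:

  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[10m]) > 4
  for: 10m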

Contributor Author

The ClusterOperators should not be flapping between conditions. It has nothing to do with reconcile times.

https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#what-should-an-operator-report-with-clusteroperator-custom-resource

for example,

An operator reports Progressing when it is rolling out new code, propagating config changes, or otherwise moving from one steady state to another. It should not report progressing when it is reconciling a previously known state

That makes it pretty clear that reporting the ClusterOperator as Progressing while reconciling current behavior is not recommended.

Contributor (@brancz), Aug 20, 2019

It's unclear that that's what you want to test for. Also, this sounds like it belongs in an e2e test rather than an alert, as it is largely an implementation fault if this happens.

Contributor

We need to know whether it is happening in the wild. I think we've shown conclusively in 4.1 with telemetry that our internal testing of operators is insufficient to catch what is happening in the wild.

I want to prove these things aren't happening in the wild (and we have evidence that they are), and maybe we can remove the alert later if it isn't firing in the wild. But because this is a critical control loop, extra alerts are a good thing.

Contributor

To prove this isn't happening in the wild, severity warning is sufficient. I am OK with adding this to prove that it's not happening.

    message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
  for: 2m
Contributor

This seems a little short.

Contributor Author

We want to alert when the operators are flapping, as that is not normal behavior.

    message: Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
  for: 15m
Member

Why 15m here and 10m for the other operators? I don't have strong opinions, but I'm curious about the motivation and think we should record why we picked different values in the commit message to inform future devs interested in changing the defaults. Maybe "Prometheus scrapes every ${whatever} minutes, so allow the operators to be down for three scrapes before freaking out", or some such.
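
For illustration of that reasoning only (the actual scrape interval is not stated in this thread, so the interval below is a hypothetical, not a fact about the cluster):

  # Hypothetical: with a 1m scrape interval, "for: 15m" tolerates roughly 15
  # missed scrapes of the CVO before firing, while "for: 10m" on the
  # per-operator alerts tolerates roughly 10 evaluation cycles of
  # cluster_operator_up == 0.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
  for: 15m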

 }
 g := m.clusterOperatorUp.WithLabelValues(op.Name, firstVersion)
-failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, ClusterStatusFailing)
+failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorDegraded)
Contributor

Ugh

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Aug 21, 2019
Adds alerts for cluster-version-operator and cluster operators

* `ClusterVersionOperatorDown`
  This alert fires when the cluster-version-operator is not providing any metrics. Severity is critical because upgrades will not work and the cluster can drift from its expected state.
* `ClusterOperatorDegraded`
  This alert fires when a cluster operator is degraded. This is important because the cluster might be in a state that is unacceptable for production, for example using emptyDir as the storage backend for the registry. Severity is critical because a degraded operator implies its operands are not in a correct state for the cluster.
* `ClusterOperatorDown`
  This alert fires when a cluster operator is not up, i.e. `cluster_operator_up` is 0. This means the operator may not be reconciling its operands.
* `ClusterOperatorFlapping`
  This alert fires when a cluster operator is repeatedly flapping between up and down, which usually indicates an unhealthy or crash-looping operator.
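
A rough sketch of how these four rules can be written follows. The metric names `cluster_operator_up` and `cluster_operator_conditions` come from this PR's discussion; the exact expressions, durations, and messages here are illustrative and may differ from the merged manifest.

groups:
- name: cluster-version
  rules:
  # Sketch only: thresholds and wording are illustrative, not copied from the merged manifest.
  - alert: ClusterVersionOperatorDown
    annotations:
      message: Cluster version operator has disappeared from Prometheus target discovery. Upgrades will not work and the cluster may drift from its expected state.
    expr: absent(up{job="cluster-version-operator"} == 1)
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorDown
    annotations:
      message: Cluster operator {{ $labels.name }} has not been available for 10 minutes; its operands may not be reconciled.
    expr: cluster_operator_up{job="cluster-version-operator"} == 0
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorDegraded
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 minutes; the cluster may be in an unacceptable state for production.
    expr: cluster_operator_conditions{job="cluster-version-operator",condition="Degraded"} == 1
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorFlapping
    annotations:
      message: Cluster operator {{ $labels.name }} up status is changing often; the operator may be crash-looping.
    expr: changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
    for: 10m
    labels:
      severity: warning

Keeping the flapping rule at a lower severity than the others reflects that flapping is a symptom to investigate rather than a confirmed outage; the severities actually shipped may differ.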
@smarterclayton
Contributor

/lgtm

/cherrypick release-4.1

@openshift-cherrypick-robot

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.


In response to this:

/lgtm

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 22, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@abhinavdahiya: This pull request references Bugzilla bug 1717632, which is valid.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 817a2d5 into openshift:master Aug 22, 2019
@openshift-ci-robot
Contributor

@abhinavdahiya: All pull requests linked via external trackers have merged. Bugzilla bug 1717632 has been moved to the MODIFIED state.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@smarterclayton: #232 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/cvo/metrics.go
M	pkg/cvo/metrics_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/metrics_test.go
Auto-merging pkg/cvo/metrics.go
CONFLICT (content): Merge conflict in pkg/cvo/metrics.go
Patch failed at 0001 pkg/cvo/metrics/go: the cluster operators report Degraded and not Failing


In response to this:

/lgtm

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya abhinavdahiya deleted the alerts branch August 22, 2019 20:45
wking pushed a commit to wking/cluster-version-operator that referenced this pull request Oct 17, 2019
…ling

Cherry-picked from fad0688 (pkg/cvo/metrics/go: the cluster
operators report Degraded and not Failing, 2019-08-07, openshift#232) and
manually removed the import shuffling from metrics.go.
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 27, 2019
cde8d50 (docs/user/reconciliation: Document release-image
application, 2019-06-10, openshift#201) just landed, so drop some of the
reconciliation docs from the parallel 105ea1f (document CVO
behavior w/ respect to cluster operator conditions, 2019-07-12, openshift#222)
and similar in favor of a link to the new reconciliation location.

Also move the file-extension note over to the reconciliation side,
since it's more generic than ClusterOperator.  It's not really user
docs either, but that's the most generic location we have at the
moment.

Also fix a few more Failing->Degraded bits to catch up with fad0688
(pkg/cvo/metrics/go: the cluster operators report Degraded and not
Failing, 2019-08-07, openshift#232) and similar.
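
For context on the Failing-to-Degraded rename above, cluster operators publish their state through ClusterOperator status conditions; a minimal, illustrative status fragment (values hypothetical) looks like:

status:
  conditions:
  - type: Available
    status: "True"
  - type: Progressing
    status: "False"
  # Previously surfaced as Failing; the CVO's metrics and docs now key off Degraded.
  - type: Degraded
    status: "False"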
wking added a commit to wking/cluster-version-operator that referenced this pull request May 20, 2020
…table" from ClusterOperatorDegraded

If we considered Degraded=True something that impacted update
stability, I expect we'd want a guard against it that blocked new
updates until admins coaxed the operator back into happiness.  This
commit softens wording from af62f04 (install: add alerts for
cluster-version-operator, 2019-08-07, openshift#232).
@wking wking mentioned this pull request Feb 18, 2021