
Conversation

@abhinavdahiya (Contributor) commented Aug 8, 2019

Adds alerts for cluster-version-operator and cluster operators

  • ClusterVersionOperatorDown
    This alert is fired when the cluster-version-operator is not providing any metrics. Severity is critical, as upgrades will not work and the cluster can drift from its expected state.
  • ClusterOperatorDegraded
    This alert is fired when a cluster operator is degraded. This is important because the cluster might be in a state that is unacceptable for a production cluster, for example using emptyDir as the storage backend for the registry. Severity is critical, as a
    degraded operator implies the operands are in a state that is not correct for the cluster.
  • ClusterOperatorDown
    This alert fires when a cluster operator is not up, i.e. cluster_operator_up is 0. This means the operator might not be reconciling its operands.
  • ClusterOperatorFlapping
    This alert fires when a cluster operator is flapping between up and down continuously because of some unexpected condition (a rough sketch of the full rule group follows this list).
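
A rough sketch of the rule group these bullets describe, assembled from the expressions quoted in the review threads below. Treat it as illustrative only: the group name, the ClusterOperatorDegraded message, the ClusterOperatorFlapping severity, and some of the for: durations are not stated in this PR and are filled in here as assumptions, and the severities themselves were still being debated in review:

groups:
- name: cluster-version
  rules:
  - alert: ClusterVersionOperatorDown
    annotations:
      message: Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
    expr: |
      absent(up{job="cluster-version-operator"} == 1)
    for: 15m
    labels:
      severity: critical
  - alert: ClusterOperatorDown
    annotations:
      message: Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
    expr: |
      cluster_operator_up{job="cluster-version-operator"} == 0
    for: 10m   # assumed from the message text; not quoted directly in the review
    labels:
      severity: critical
  - alert: ClusterOperatorDegraded
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 mins.   # wording assumed; not quoted in the review
    expr: |
      cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorFlapping
    annotations:
      message: Cluster operator {{ $labels.name }} up status is changing often. This might cause upgrades to be unstable.
    expr: |
      changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
    for: 2m
    labels:
      severity: warning   # severity not quoted for this alert; reviewers suggested warning

Note that in the actual manifest the templating appears as {{ "{{ $labels.name }}" }} because the file is rendered through a Go template before Prometheus sees it; the sketch above shows the plain Prometheus form.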

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 8, 2019
@abhinavdahiya abhinavdahiya changed the title isntall: add alerts for cluster-version-operator install: add alerts for cluster-version-operator Aug 8, 2019
rules:
- alert: ClusterVersionOperatorDown
  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery.
Contributor

Probably better to indicate why someone might care, as well as possible causes, since this is critical. “Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible”

Contributor Author (@abhinavdahiya), Aug 12, 2019

rules:
- alert: ClusterOperatorDown
  annotations:
    message: ClusterOperator {{ $labels.name}} has not been available for 10 mins.
Contributor

I would prefer to check for degraded or not available. Both are significant.

Contributor

Also explain impact here.

Member

I would prefer to check for degraded or not available.

Isn't that what the current available && !failing is doing? If you're not available, we're not setting cluster-operator-up for you. If you're available but degraded (which backs the failing variable), we're not setting it either.

Contributor Author (@abhinavdahiya), Aug 12, 2019

added ClusterOperatorDegraded in 136fb72

  for: 2m
  labels:
    severity: critical

Member

$ git show --check --oneline origin/pr/232
297f6ca isntall: add alerts for cluster-version-operator
install/0000_90_cluster-version-operator_00_servicemonitor.yaml:57: new blank line at EOF.

  annotations:
    message: ClusterOperator {{ $labels.name}} has not been available for 10 mins.
  expr: |
    cluster_operator_up{job="cluster-version-operator"}
Member

Do we actually report this? From my reading of this loop, it's just for ClusterOperator objects, and that's the operators reporting to the CVO, not the CVO itself (e.g. no cluster-version-operator) in here. Or is this "cluster_operator_up reported by the cluster-version-operator job" or some such? Can you unpack this expression for me?

Contributor Author

I'm not sure whether other actors in the cluster can also emit something with the name cluster_operator_up, so I just want to make sure it's on the time series emitted by the cluster-version-operator.

Currently there's only one CVO, so the namespace should be fine.

Member

currently there's only one cvo, so namespace should be fine.

Couldn't end users create jobs with that name? Can we rely on not scraping user namespaces?

  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
Member

This is likely to be sufficient on its own, but we may want namespace="openshift-cluster-version" in the filter as well.
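
A minimal sketch of that suggestion (assuming, as stated above, that the CVO is scraped from the openshift-cluster-version namespace):

  expr: |
    absent(up{job="cluster-version-operator", namespace="openshift-cluster-version"} == 1)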

Contributor Author

@wking (Member) commented Aug 8, 2019

/title Bug 1717632: install: add alerts for cluster-version-operator

That's what this is fixing, right?

@wking (Member) commented Aug 8, 2019

Oops:

/retitle Bug 1717632: install: add alerts for cluster-version-operator

@openshift-ci-robot openshift-ci-robot changed the title install: add alerts for cluster-version-operator Bug 1717632: install: add alerts for cluster-version-operator Aug 8, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 8, 2019
@openshift-ci-robot (Contributor)

@abhinavdahiya: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya force-pushed the alerts branch 3 times, most recently from 48237bc to 136fb72 on August 12, 2019 at 21:38
@abhinavdahiya (Contributor Author)

/test e2e-aws

@abhinavdahiya (Contributor Author)

/test e2e-aws

rules:
- alert: ClusterVersionOperatorDown
  annotations:
    message: ClusterVersionOperator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Would be tempted to say Cluster version operator has ... to make it flow as a message in more contexts.

Contributor Author

rules:
- alert: ClusterOperatorDown
  annotations:
    message: ClusterOperator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Probably Cluster operator {{ ... }}

Contributor Author

    severity: critical
- alert: ClusterOperatorFlapping
  annotations:
    message: ClusterOperator {{ "{{ $labels.name }}" }} changing its up status often. This might cause upgrades to be unstable.
Contributor

Might suggest ... up status is changing often. This might ....

Do pods already have an alert that fires if they're restarting? I think this list is probably sufficient.

Contributor Author

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 16, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated default and
* The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default, the CVO will continue applying the current
version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
@brancz (Contributor) left a comment

I think we need to rethink some of these strategies a bit. We're also too liberal here with applying severity: critical. I'm not saying these things are not important, but they're not important enough to wake someone up at night. All the issues that these alerts are attempting to assess can easily wait until working hours.

rules:
- alert: ClusterOperatorDown
  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
Contributor

Why this alert, as opposed to an alert like ClusterVersionOperatorDown for each cluster operator? As I understand it, this alert has multiple indirections, which we want to avoid at all costs in monitoring.

Contributor Author

The cluster-version operator is not an SLO; it does not have a ClusterOperator of its own. That's why the ClusterVersionOperatorDown alert is right above.

Contributor Author (@abhinavdahiya), Aug 19, 2019

As I understand this alert has multiple indirections which we want to avoid at all costs in monitoring.

Hmm, do you mean that this alert can point at many different cluster operators? Can you provide details on why that should be avoided?

Contributor

No, it's an alert on a metric that is not exposed by the process of origin. That is, the cluster-version-operator is exposing a metric on behalf of another operator and, even worse, with information provided by that other operator. This requires that everything is working properly in order to report a failure case. This is why metrics should be exposed wherever the origin of the information is.

Contributor

We are monitoring the core control loop of the cluster. We have to monitor both, not just one or the other. This alert reflects the “making progress” part of the product: operators A and B coordinate via this API, not via alerts and metrics. We should have per-operator monitoring. We must have core product control-loop monitoring and alerting.

Contributor

If we move forward with this PR, then I would want to see concrete actions as to how we will properly resolve this in the very next release (4.3), which would be the requirement for each cluster operator to:

  • identify and expose metrics that describe their SLIs
  • only be allowed to page customers (that includes our own SRE) with alerts that describe an error budget being burnt at a high rate
  • not include alerts that are meant for our engineering teams; these are to be done via the telemetry infrastructure (as in, alerting on data in Telemetry)

Contributor

Degraded operators are functionally indistinguishable from broken clusters. We have an alert because we know something is wrong.

Evidence-wise, I haven't seen more than a small percentage of degraded operators that were not error conditions that should result in a customer requesting a bug fix.

For disconnected users, on-cluster alerts are mandatory to inform them to contact Red Hat.

    cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
  for: 10m
  labels:
    severity: critical
Contributor

I understand the intent of this, but a cluster administrator is not going to know what to do here. This goes back to my general criticism about using cluster_operator_conditions for any type of alerting. Apart from the indirections it imposes, this is the wrong approach to monitoring. Instead we should be setting SLOs for each component and alerting on the failure to deliver those. Those alerts can then have clear runbooks as to what actionable things can be done by the cluster administrator about this particular component.

I think we can have alerts that go in a similar direction as this on the Telemeter side, so our engineering teams can act on this, but I don't think this is useful for customers as is. In fact I think it's harmful, as it produces alert fatigue.

Contributor Author

I think we want the user to actually be alerted when a cluster operator is degraded for that long.

And can you explain why there is a requirement for a runbook for each alert?

Contributor Author

Do we have a better suggestion for the severity? Any guidelines that we can follow?

Member

In fact I think it's harmful as it produces alert fatigue.

I'd rather avoid alert fatigue on these by not hitting these thresholds, not by not monitoring the thresholds :). All of these things seem like pretty bad "I'd want to know first thing in the morning if this was happening" things.

Contributor

Critical severity is for when you page someone and there is something concrete that must be fixed ASAP. I would suggest severity warning for something like this, but it's still too vague an alert for me to send to anyone, as it's unclear what to do when it fires.

Even if it goes against our instinct, removing or not using flaky alerts is better than causing alert fatigue; alert fatigue causes people to stop trusting our alerts and ignore all of them.

Contributor

The point of this alert is to wake up an engineer because they are broken. We are increasingly targeting alerts towards our engineering / monitoring team, not just the customers.

Alert fatigue is real but this is a critical failure of a core operator, and we have to react to it.

Contributor

We are increasingly targeting alerts towards our engineering / monitoring team, not just the customers.

Then this should be an alert evaluated by telemetry infrastructure, not in-cluster.

Contributor

I disagree - we will have customers without telemetry.

Our job is to fix bugs rapidly. Alerts can be generated by bugs, user error, and unanticipated states.

It would be irresponsible of us not to alert these conditions on cluster for all disconnected users.

  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  expr: |
    cluster_operator_up{job="cluster-version-operator"} == 0
Contributor

How is this metric determined? And I assume this metric is somehow synthetically produced by the cluster-version-operator?

Contributor Author

How is this metric determined? And I assume this metric is somehow synthetically produced by the cluster-version-operator?

What do we gain from knowing how this is generated?

It looks at each ClusterOperator and reports 1 when it is available and not degraded, else 0:

// One gauge series per ClusterOperator, labelled with its name and version.
g := m.clusterOperatorUp.WithLabelValues(op.Name, firstVersion)
failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, ClusterStatusFailing)
available := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable)
// cluster_operator_up is 1 only when the operator is Available and not failing.
if available && !failing {
	g.Set(1)
} else {
	g.Set(0)
}
ch <- g

Member (@wking), Aug 19, 2019

There's probably a hole here where a rolling outage across operators could trigger the alert without having a single operator down for the full threshold. For example:

  1. Scrape 1, operator A is down.
  2. Scrape 2, operator B is down; A is back up.
  3. Scrape 3, operator C is down; B is back up.
  4. Alert fires with "there has been at least one down operator for the last three scrapes (over 10m)"

I'm not sure how to adjust this query to avoid that, although it plays into the "many operators behind one alert" point you raise here.

Contributor

These are metrics that should be exposed by the operators themselves and not by the CVO. That would remove the indirections, and we could make the alerts concrete and actionable. As they are, these alerts, except for the very first one, are not actionable, nor do they follow the best practice of metrics coming from the origin.

Contributor

See my other comment, but the metrics are part of the product. An individual operator doesn’t report whether other operators see the result. We must see the core loop view like this.

Contributor

I understand what you mean by these metrics describing the "advancing of the CVO", and I'm not arguing that it's not important. However, it is not clear to an SRE (even our own SREs; I've talked to them), if they are paged with this alert today, what an actionable thing to troubleshoot is, or even what's broken. As is, this is at most actionable for our own engineering. As I said in the other comment, I want us to have a plan for how we will get to a state where this is clear and where it is clear what about the product is not working when it's not. This needs to be top priority for the next release(s).

  annotations:
    message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
Contributor

The 2m range selector doesn't seem to cover sufficient failure cases; also, this behavior seems highly specific to each operator. Some operators have a very short reconciliation loop, some have a very long one. I don't think we're going to be able to find any good general-purpose value for this.
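
Purely to illustrate the tuning being discussed (neither value below was proposed in this thread), a wider window for the flapping check might look like:

  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[10m]) > 4
  for: 10m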

Contributor Author

The ClusterOperators should not be flapping between conditions. It has nothing to do with reconcile times.

https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#what-should-an-operator-report-with-clusteroperator-custom-resource

for example,

An operator reports Progressing when it is rolling out new code, propagating config changes, or otherwise moving from one steady state to another. It should not report progressing when it is reconciling a previously known state

That makes it pretty clear that reporting the ClusterOperator as Progressing while reconciling current behavior is not recommended.

Contributor (@brancz), Aug 20, 2019

It's unclear that that's what you want to test for. Also, this sounds like it belongs in an e2e test rather than an alert, as it is largely an implementation fault if this happens.

Contributor

We need to know whether it is happening in the wild. I think we've shown conclusively in 4.1 with telemetry that our internal testing of operators is insufficient to catch what is happening in the wild.

I want to prove these things aren't happening in the wild (and we have evidence that they are), and maybe we can remove the alert later if it isn't firing in the wild. But because this is a critical control loop, extra alerts are a good thing.

Contributor

To prove this isn't happening in the wild, severity warning is sufficient. I am OK with adding this to prove that it's not happening.

    message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
  expr: |
    changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
  for: 2m
Contributor

This seems a little short.

Contributor Author

We want to alert when the operators are flapping, as that is not normal behavior.

    message: Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
  for: 15m
Member

Why 15m here and 10m for the other operators? I don't have strong opinions, but I'm curious about the motivation and think we should record why we picked different values in the commit message to inform future devs interested in changing the defaults. Maybe "Prometheus scrapes every ${whatever} minutes, so allow the operators to be down for three scrapes before freaking out", or some such.
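
For illustration of that reasoning only (the actual scrape interval is not stated in this thread, so the interval below is a hypothetical, not a fact about the cluster):

  # Hypothetical: with a 1m scrape interval, "for: 15m" tolerates roughly 15
  # missed scrapes of the CVO before firing, while "for: 10m" on the
  # per-operator alerts tolerates roughly 10 evaluation cycles of
  # cluster_operator_up == 0.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
  for: 15m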

 }
 g := m.clusterOperatorUp.WithLabelValues(op.Name, firstVersion)
-failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, ClusterStatusFailing)
+failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorDegraded)
Contributor

Ugh

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Aug 21, 2019
Adds alerts for cluster-version-operator and cluster operators

* `ClusterVersionOperatorDown`
  This alert fires when the cluster-version-operator is not providing any metrics. Severity is critical because upgrades will not work and the cluster can drift from its expected state.
* `ClusterOperatorDegraded`
  This alert fires when a cluster operator is degraded. This is important because the cluster might be in a state that is unacceptable for production, for example using emptyDir as the storage backend for the registry. Severity is critical because a degraded operator implies its operands are not in a correct state for the cluster.
* `ClusterOperatorDown`
  This alert fires when a cluster operator is not up, i.e. `cluster_operator_up` is 0. This means the operator may not be reconciling its operands.
* `ClusterOperatorFlapping`
  This alert fires when a cluster operator is repeatedly flapping between up and down, which usually indicates an unhealthy or crash-looping operator.
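
A rough sketch of how these four rules can be written follows. The metric names `cluster_operator_up` and `cluster_operator_conditions` come from this PR's discussion; the exact expressions, durations, and messages here are illustrative and may differ from the merged manifest.

groups:
- name: cluster-version
  rules:
  # Sketch only: thresholds and wording are illustrative, not copied from the merged manifest.
  - alert: ClusterVersionOperatorDown
    annotations:
      message: Cluster version operator has disappeared from Prometheus target discovery. Upgrades will not work and the cluster may drift from its expected state.
    expr: absent(up{job="cluster-version-operator"} == 1)
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorDown
    annotations:
      message: Cluster operator {{ $labels.name }} has not been available for 10 minutes; its operands may not be reconciled.
    expr: cluster_operator_up{job="cluster-version-operator"} == 0
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorDegraded
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 minutes; the cluster may be in an unacceptable state for production.
    expr: cluster_operator_conditions{job="cluster-version-operator",condition="Degraded"} == 1
    for: 10m
    labels:
      severity: critical
  - alert: ClusterOperatorFlapping
    annotations:
      message: Cluster operator {{ $labels.name }} up status is changing often; the operator may be crash-looping.
    expr: changes(cluster_operator_up{job="cluster-version-operator"}[2m]) > 2
    for: 10m
    labels:
      severity: warning

Keeping the flapping rule at a lower severity than the others reflects that flapping is a symptom to investigate rather than a confirmed outage; the severities actually shipped may differ.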
@smarterclayton
Contributor

/lgtm

/cherrypick release-4.1

@openshift-cherrypick-robot

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.


In response to this:

/lgtm

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 22, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@abhinavdahiya: This pull request references Bugzilla bug 1717632, which is valid.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 817a2d5 into openshift:master Aug 22, 2019
@openshift-ci-robot
Contributor

@abhinavdahiya: All pull requests linked via external trackers have merged. Bugzilla bug 1717632 has been moved to the MODIFIED state.


In response to this:

Bug 1717632: install: add alerts for cluster-version-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@smarterclayton: #232 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/cvo/metrics.go
M	pkg/cvo/metrics_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/metrics_test.go
Auto-merging pkg/cvo/metrics.go
CONFLICT (content): Merge conflict in pkg/cvo/metrics.go
Patch failed at 0001 pkg/cvo/metrics/go: the cluster operators report Degraded and not Failing


In response to this:

/lgtm

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya abhinavdahiya deleted the alerts branch August 22, 2019 20:45
wking pushed a commit to wking/cluster-version-operator that referenced this pull request Oct 17, 2019
…ling

Cherry-picked from fad0688 (pkg/cvo/metrics/go: the cluster
operators report Degraded and not Failing, 2019-08-07, openshift#232) and
manually removed the import shuffling from metrics.go.
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 27, 2019
cde8d50 (docs/user/reconciliation: Document release-image
application, 2019-06-10, openshift#201) just landed, so drop some of the
reconciliation docs from the parallel 105ea1f (document CVO
behavior w/ respect to cluster operator conditions, 2019-07-12, openshift#222)
and similar in favor of a link to the new reconciliation location.

Also move the file-extension note over to the reconciliation side,
since it's more generic than ClusterOperator.  It's not really user
docs either, but that's the most generic location we have at the
moment.

Also fix a few more Failing->Degraded bits to catch up with fad0688
(pkg/cvo/metrics/go: the cluster operators report Degraded and not
Failing, 2019-08-07, openshift#232) and similar.
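
For context on the Failing-to-Degraded rename above, cluster operators publish their state through ClusterOperator status conditions; a minimal, illustrative status fragment (values hypothetical) looks like:

status:
  conditions:
  - type: Available
    status: "True"
  - type: Progressing
    status: "False"
  # Previously surfaced as Failing; the CVO's metrics and docs now key off Degraded.
  - type: Degraded
    status: "False"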
wking added a commit to wking/cluster-version-operator that referenced this pull request May 20, 2020
…table" from ClusterOperatorDegraded

If we considered Degraded=True something that impacted update
stability, I expect we'd want a guard against it that blocked new
updates until admins coaxed the operator back into happiness.  This
commit softens wording from af62f04 (install: add alerts for
cluster-version-operator, 2019-08-07, openshift#232).
@wking wking mentioned this pull request Feb 18, 2021