From fb5257d4be8e1b18a80a171a24ba6e8386026b94 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 6 May 2021 16:57:40 -0700
Subject: [PATCH] install/0000_90_cluster-version-operator_02_servicemonitor:
 Soften ClusterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource
leveled?" since way back in b0b4902fce (clusteroperator: Don't block
on failing during initialization, 2019-03-11, #136).  That can lead
to ClusterOperatorDown and ClusterOperatorDegraded firing during
install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.

For ClusterOperatorDegraded, the degraded condition should not be
particularly urgent [4], so we should be fine bumping it to 'warning'
and using 'for: 30m' or something more relaxed than the current 10m.
In the run above, a 30m hold would have kept the alert pending until
well after the operator recovered at 5:10:23Z, so it would not have
fired at all.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: https://github.com/openshift/api/pull/916
---
 .../0000_90_cluster-version-operator_02_servicemonitor.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/install/0000_90_cluster-version-operator_02_servicemonitor.yaml b/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
index edf1b0a1df..44a52aee87 100644
--- a/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
+++ b/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
@@ -87,9 +87,9 @@ spec:
         or on (name) group by (name) (cluster_operator_up{job="cluster-version-operator"})
         ) == 1
-      for: 10m
+      for: 30m
       labels:
-        severity: critical
+        severity: warning
     - alert: ClusterOperatorFlapping
       annotations:
         message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
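
(Not part of the patch, just a verification sketch: the query below is
one way to see how long ClusterOperatorDegraded has been active in a
live cluster, to judge whether the relaxed 'for: 30m' window
comfortably covers install.  It leans on Prometheus's built-in
ALERTS_FOR_STATE series, whose value is the timestamp at which a
pending or firing alert became active.)

    # Seconds each ClusterOperatorDegraded alert has been pending or
    # firing; only alerts active longer than 1800s would still fire
    # under the new 'for: 30m'.
    time() - ALERTS_FOR_STATE{alertname="ClusterOperatorDegraded"}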