From fb5257d4be8e1b18a80a171a24ba6e8386026b94 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 6 May 2021 16:57:40 -0700
Subject: [PATCH] install/0000_90_cluster-version-operator_02_servicemonitor:
 Soften ClusterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource
leveled?" since way back in b0b4902fce (clusteroperator: Don't block
on failing during initialization, 2019-03-11, #136).  That can lead
to ClusterOperatorDown and ClusterOperatorDegraded firing during
install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.

For ClusterOperatorDegraded, the degraded condition should not be
particularly urgent [4], so we should be fine bumping it to 'warning'
and using 'for: 30m' or something more relaxed than the current 10m.
In the run above, a 30m hold would have kept the alert pending until
well after the operator recovered at 5:10:23Z, so it would not have
fired at all.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: https://github.com/openshift/api/pull/916
---
 .../0000_90_cluster-version-operator_02_servicemonitor.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/install/0000_90_cluster-version-operator_02_servicemonitor.yaml b/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
index edf1b0a1df..44a52aee87 100644
--- a/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
+++ b/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
@@ -87,9 +87,9 @@ spec:
         or on (name) group by (name) (cluster_operator_up{job="cluster-version-operator"})
         ) == 1
-      for: 10m
+      for: 30m
       labels:
-        severity: critical
+        severity: warning
     - alert: ClusterOperatorFlapping
       annotations:
         message: Cluster operator {{ "{{ $labels.name }}" }} up status is changing often. This might cause upgrades to be unstable.
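
(Not part of the patch, just a verification sketch: the query below is
one way to see how long ClusterOperatorDegraded has been active in a
live cluster, to judge whether the relaxed 'for: 30m' window
comfortably covers install.  It leans on Prometheus's built-in
ALERTS_FOR_STATE series, whose value is the timestamp at which a
pending or firing alert became active.)

    # Seconds each ClusterOperatorDegraded alert has been pending or
    # firing; only alerts active longer than 1800s would still fire
    # under the new 'for: 30m'.
    time() - ALERTS_FOR_STATE{alertname="ClusterOperatorDegraded"}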