From e2b2193e9cd26615202f16ac8ee58f6a5d18cd43 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 29 Apr 2021 16:20:24 -0700
Subject: [PATCH] config/v1/types_cluster_operator: Clarify Available and Degraded severity

Available=False is really bad.  Possibly a page-at-midnight thing.
"Hey, your registry is down, so any new pods based on local images
will fail to launch" or "ingress is down, so your users cannot reach
you".  If it's not a page-at-midnight thing, it's at least going to be
the first batch of things admins should look at when they get the
from-the-spout alerts [1].

Degraded=True is not great, but you should be able to survive with
reduced quality-of-service until an admin wakes up in the morning.

[1]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic#h.rvrk9gcasjzh
     Rob Ewaschuk, My Philosophy on Alerting
---
 config/v1/types_cluster_operator.go | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/config/v1/types_cluster_operator.go b/config/v1/types_cluster_operator.go
index 299adb1c9f0..92f500dfd71 100644
--- a/config/v1/types_cluster_operator.go
+++ b/config/v1/types_cluster_operator.go
@@ -142,6 +142,8 @@ type ClusterStatusConditionType string
 const (
 	// Available indicates that the operand (eg: openshift-apiserver for the
 	// openshift-apiserver-operator), is functional and available in the cluster.
+	// Available=False means at least part of the component is non-functional,
+	// and that the condition requires immediate administrator intervention.
 	OperatorAvailable ClusterStatusConditionType = "Available"
 
 	// Progressing indicates that the operator is actively rolling out new code,
@@ -162,10 +164,10 @@ const (
 	// persist over a long enough period to report Degraded. A service should not
 	// report Degraded during the course of a normal upgrade. A service may report
 	// Degraded in response to a persistent infrastructure failure that requires
-	// administrator intervention. For example, if a control plane host is unhealthy
-	// and must be replaced. An operator should report Degraded if unexpected
-	// errors occur over a period, but the expectation is that all unexpected errors
-	// are handled as operators mature.
+	// eventual administrator intervention. For example, if a control plane host
+	// is unhealthy and must be replaced. An operator should report Degraded if
+	// unexpected errors occur over a period, but the expectation is that all
+	// unexpected errors are handled as operators mature.
 	OperatorDegraded ClusterStatusConditionType = "Degraded"
 
 	// Upgradeable indicates whether the operator is in a state that is safe to upgrade. When status is `False`