Bug 1762920: install: add alerts for cluster-version-operator #261
@abhinavdahiya: This pull request references Bugzilla bug 1762920, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.
The `manifests` directory on the bootstrap is used by cluster-bootstrap to push manifests to the cluster. A `servicemonitor` for the CVO was added by openshift#214. The `servicemonitor` API is created by the cluster-monitoring-operator, so including that manifest causes bootstrapping to get stuck until the monitoring operator is running. This change skips the `servicemonitor` in the bootstrap render, as it is not required for the bootstrap CVO pod.
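The skip described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `skipForBootstrap` and `filterBootstrapManifests`, the example file names, and the assumption that ServiceMonitor manifests can be recognised by file name are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// skipForBootstrap reports whether a manifest should be left out of the
// bootstrap render. Assumption (hypothetical): ServiceMonitor manifests are
// identifiable by "servicemonitor" appearing in the file name.
func skipForBootstrap(name string) bool {
	return strings.Contains(strings.ToLower(name), "servicemonitor")
}

// filterBootstrapManifests returns the manifests that are safe to push before
// the cluster-monitoring-operator has created the ServiceMonitor API.
func filterBootstrapManifests(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if skipForBootstrap(n) {
			continue // the API for this kind does not exist yet during bootstrap
		}
		kept = append(kept, n)
	}
	return kept
}

func main() {
	// File names are made up for the example.
	manifests := []string{
		"namespace.yaml",
		"deployment.yaml",
		"servicemonitor.yaml",
	}
	fmt.Println(filterBootstrapManifests(manifests))
}
```

Filtering at render time keeps the manifest in the release payload, so the regular CVO can still apply it once the monitoring operator is running.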
Adds alerts for cluster-version-operator and cluster operators

* `ClusterVersionOperatorDown`
  This alert fires when cluster-version-operator is not providing any metrics. Severity is critical, as upgrades will not work and clusters can drift from the expected state.
* `ClusterOperatorDegraded`
  This alert fires when a cluster operator is degraded. This is important because the cluster might be in a state that is unacceptable for a production cluster, for example, using emptyDir as the storage backend for the registry. Severity is critical, as a degraded operator implies the operands are in a state that is not correct for the cluster.
* `ClusterOperatorDown`
  This alert fires when a cluster operator is not up, i.e. `cluster_operator_up` is 0. This means that the operator might not be reconciling the operands.
* `ClusterOperatorFlapping`
  This alert fires when a cluster operator is continuously flapping between up and down because of some weird condition.
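To make the "down" and "flapping" conditions concrete, here is a small sketch that evaluates them over a window of `cluster_operator_up` samples. It is only an illustration of the idea, not the alert rules from this PR: the function names, the window, and the change threshold are assumptions, and `countChanges` only loosely mirrors what PromQL's `changes()` does over a range.

```go
package main

import "fmt"

// countChanges returns how many times consecutive samples differ, similar in
// spirit to PromQL's changes() over a range vector.
func countChanges(samples []int) int {
	changes := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			changes++
		}
	}
	return changes
}

// isDown reports whether the most recent cluster_operator_up sample is 0.
func isDown(samples []int) bool {
	return len(samples) > 0 && samples[len(samples)-1] == 0
}

// isFlapping reports whether the operator changed state more than maxChanges
// times in the window. maxChanges is a hypothetical threshold.
func isFlapping(samples []int, maxChanges int) bool {
	return countChanges(samples) > maxChanges
}

func main() {
	// Made-up samples: the operator alternates between up (1) and down (0).
	window := []int{1, 0, 1, 0, 1}
	fmt.Println(isDown(window), isFlapping(window, 2)) // prints "false true"
}
```

The example window shows why a separate flapping alert is useful: the operator is currently up, so a plain "down" check stays silent even though the state changed four times.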
f2ae221 to 2125d2a
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, wking
/bugzilla refresh
Backports #232 to release-4.1. Also pulls in the precursor commits from #221 and #214.
This is a smaller cherry-pick of #253, because that one needs changes in the CMO to move the servicemonitor to CVO management.
This only moves the alerts using `PrometheusRule`, skipping the `ServiceMonitor`.
This is not a clean cherry-pick and definitely needs review.
/cc @wking @smarterclayton