Convert PromQL risks to Always risks in releases older than 8 weeks #2968
This works around a CVO bug/design decision where we only evaluate one newly enumerated risk every 10 minutes, in an effort to avoid overwhelming the monitoring stack with requests. However, this creates a bad UX: if there are N risks to evaluate in the set of available update paths, it could be (N-1) * 10 minutes before the set of recommended updates is computed. This preserves the notification of the issue while largely being a no-op, because clusters have, currently, had better update paths for at least 12 weeks. We intend to fix the CVO bug, but that won't fix the issue in the deployed fleet. See: https://issues.redhat.com/browse/OCPBUGS-5469
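To make the delay estimate in the description concrete, here is a minimal sketch of the arithmetic. It assumes the first risk is evaluated immediately and each further risk waits one full interval; the function name `worst_case_delay` is hypothetical and is not part of the CVO.

```python
from datetime import timedelta

# Assumption taken from the description: one newly enumerated risk
# is evaluated per 10-minute interval.
EVALUATION_INTERVAL = timedelta(minutes=10)


def worst_case_delay(risk_count: int,
                     interval: timedelta = EVALUATION_INTERVAL) -> timedelta:
    """Time until the last of `risk_count` new risks has been evaluated.

    The first risk is assumed to be evaluated right away, so the wait is
    (N-1) intervals, matching the (N-1) * 10 minutes estimate.
    """
    if risk_count < 0:
        raise ValueError("risk_count must be non-negative")
    return max(risk_count - 1, 0) * interval


if __name__ == "__main__":
    for n in (1, 3, 7):
        print(n, "risks:", worst_case_delay(n))
```

Under these assumptions, seven distinct PromQL risks across the available update paths would leave the recommended set incomplete for about an hour, which is the UX problem this change sidesteps.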
    topk(1,
      label_replace(group(ceph_health_status), "ceph", "yes", "", "")
      or
      label_replace(0 * group(cluster_version), "ceph", "no", "", "")
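For readers less familiar with this PromQL idiom, the following is a rough emulation of what the visible part of the query selects. The excerpt is truncated, so the enclosing expression is unknown; the function `ceph_exposure` and its input (a set of metric names present on the cluster) are illustrative assumptions only.

```python
from typing import Optional


def ceph_exposure(metric_names: set[str]) -> Optional[tuple[dict, float]]:
    """Emulate the sample that topk(1, ... or ...) would select.

    - group(ceph_health_status) yields value 1 when that metric exists;
      label_replace then tags it ceph="yes".
    - 0 * group(cluster_version) yields value 0, tagged ceph="no".
    - `or` keeps both samples because their label sets differ, and
      topk(1, ...) keeps the one with the highest value.
    """
    candidates = []
    if "ceph_health_status" in metric_names:
        candidates.append(({"ceph": "yes"}, 1.0))
    if "cluster_version" in metric_names:
        candidates.append(({"ceph": "no"}, 0.0))
    if not candidates:
        return None  # neither metric present: the query returns nothing
    return max(candidates, key=lambda sample: sample[1])


if __name__ == "__main__":
    print(ceph_exposure({"ceph_health_status", "cluster_version"}))
    print(ceph_exposure({"cluster_version"}))
```

The second branch acts as a fallback so that the query still returns a definite 0 on clusters without Ceph, rather than an empty result.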
Possibly shift this PromQL into the message? Or into the linked bug (although this one links https://bugzilla.redhat.com/show_bug.cgi?id=2076312#c9, which seems to be private)? The current message is phrased as if we know (or suspect) the cluster is exposed, while with Always this will also trip for clusters we know are not exposed. Or 🤷, maybe that's more polish than we care about for 4.10.z targets as old as these.
I'd prefer to just leave it as is. Only 17% of the 4.10 fleet has upgrades to < 4.10.17 and those all have paths to better edges listed more prominently.
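The trade-off discussed in this thread can be sketched as follows. This is a hedged illustration, not the tooling used in the PR: the dictionary keys (`matchingRules`, `type`) are assumed to mirror the risk declarations, the helper names are hypothetical, and treating a PromQL result of 1 as "exposed" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(weeks=8)  # threshold taken from the PR title


def convert_old_promql_rules(risk: dict, released: datetime,
                             now: datetime) -> dict:
    """Return a copy of `risk` with PromQL rules swapped for Always rules
    when the target release is older than MAX_AGE; otherwise return it as-is."""
    if now - released <= MAX_AGE:
        return risk
    rules = [
        {"type": "Always"} if rule.get("type") == "PromQL" else rule
        for rule in risk.get("matchingRules", [])
    ]
    return {**risk, "matchingRules": rules}


def matches(rule: dict, promql_result: Optional[float]) -> bool:
    """Always matches every cluster; PromQL matches only when the query
    result is 1 (assumed convention for an exposed cluster)."""
    if rule["type"] == "Always":
        return True
    if rule["type"] == "PromQL":
        return promql_result == 1
    raise ValueError(f"unknown rule type: {rule['type']}")


if __name__ == "__main__":
    now = datetime(2023, 1, 1, tzinfo=timezone.utc)
    risk = {"name": "ExampleRisk",
            "matchingRules": [{"type": "PromQL",
                               "promql": {"promql": "vector(0)"}}]}
    converted = convert_old_promql_rules(risk, now - timedelta(weeks=9), now)
    # A cluster whose query returned 0 is not exposed under PromQL,
    # but the converted Always rule still flags it.
    print(matches(risk["matchingRules"][0], 0.0),
          matches(converted["matchingRules"][0], 0.0))
```

This shows the point raised above: after conversion, the risk message is shown to clusters that the PromQL rule would have cleared, which is accepted here because those old targets already have better update paths.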
@sdodson: all tests passed!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: sdodson, wking