feat: add experimental cluster health rules #2107
base: main
Changes from all commits
986dacf
04259d0
f13728e
b2e1557
5bdd9d6
26e9903
templates/uds-rules.yaml

@@ -0,0 +1,55 @@
```yaml
# Copyright 2025 Defense Unicorns
# SPDX-License-Identifier: AGPL-3.0-or-later OR LicenseRef-Defense-Unicorns-Commercial
{{- if .Values.clusterHealthRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "{{ .Release.Name }}-uds-cluster-health"
  namespace: {{ .Release.Namespace }}
  labels:
    release: {{ .Release.Name }}
spec:
  groups:
    - name: "{{ .Release.Name }}-uds-cluster-health.rules"
      interval: 1m
      rules:
        - record: uds_cluster_health_score
          expr: |
            (
              # 25% weight for API Server health
              (avg(up{job="apiserver"}) or vector(1)) * 25
```
Contributor: Should we default to vector(0) here instead? What happens if the scrape itself is broken and no metric exists? With vector(1) I think it would read as if everything is healthy. Same question for all the others.

Author: I erred towards failing open with vector(1) in this first pass. It is a valid point of consideration, though; the main blocker right now is that with this simplistic expression we may have UDS Core deployed without some of the functional layers.
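To make the trade-off concrete, here is a rough PromQL comparison of the two fallbacks when a job's `up` series is entirely absent (for example, a missing or broken ServiceMonitor). This is a sketch for the discussion, not part of the PR:

```promql
# Fail open: if no up{job="apiserver"} series exists at all, `or vector(1)`
# fills in 1, so the component still contributes its full 25 points.
(avg(up{job="apiserver"}) or vector(1)) * 25

# Fail closed: `or vector(0)` scores the same missing-scrape scenario as 0,
# dragging the overall score down and surfacing the broken scrape as an alert.
(avg(up{job="apiserver"}) or vector(0)) * 25

# Note: when targets exist but are unreachable, up is 0 (not absent), so
# avg(up{...}) already reflects the outage in both variants; the fallback only
# matters when the series is missing entirely.
```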
```yaml
              +
              # 25% weight for Kubelet health
              (avg(up{job="kubelet"}) or vector(1)) * 25
              +
              # 15% weight for CoreDNS health
              (avg(up{job="coredns"}) or vector(1)) * 15
              +
              # 15% weight for Pepr health
              (avg(up{job="pepr-uds-core"}) or vector(1)) * 15
```
Contributor: pepr also has a scrape target for
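If Pepr does expose more than one scrape job, one option is a regex matcher so every Pepr target factors into the same 15% weight. The prefix used here is an assumption, since the exact job names aren't confirmed in this thread:

```promql
# Average `up` across all Pepr scrape jobs (job name pattern is an assumption).
(avg(up{job=~"pepr-.*"}) or vector(1)) * 15
```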
```yaml
              +
              # 10% weight for Keycloak health
              (avg(up{job="keycloak-http"}) or vector(1)) * 10
              +
              # 10% weight for Prometheus health
              (avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
            )
```
Contributor: Any reason you didn't include metrics-server, Grafana, Loki, Falco, or Velero? Fine keeping it simple for now to get iterations in, but either way, before we make this metric non-experimental I think we should have a published definition of what it captures and how we decided on the weight for each of the services.

Author: I just haven't invested the cycles yet to test and include them (our multi-cluster bundles only have a few functional layers, in my defense). Imo, before merging we should probably spend the cycles to include those so the score doesn't feel half baked.

Contributor: Totally fine for this as a first iteration, just wanted to ask the question.
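For context on how the weighting plays out, a worked example with hypothetical numbers: if every other component is fully up and one of two CoreDNS targets goes down, that term halves and the score lands in the warning band.

```promql
# Hypothetical snapshot: 1 of 2 CoreDNS targets down, everything else fully up.
# Substituting the per-job averages into the recording rule:
(1.0 * 25) + (1.0 * 25) + (0.5 * 15) + (1.0 * 15) + (1.0 * 10) + (1.0 * 10)
# = 92.5 -> fires UDSClusterHealthScoreWarning (< 95) but not the critical alert (< 90)
```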
```yaml
        - alert: UDSClusterHealthScoreWarning
          expr: uds_cluster_health_score < 95
```
Contributor: I think this will also fire when the score is below 90%, creating duplicate alerts. Consider bounding the expression so the warning and critical alerts don't overlap.
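One way to bound the warning so it stops firing once the critical threshold is crossed (a sketch of the idea, not necessarily the reviewer's exact suggestion):

```promql
# Warning band only: fires between 90 (inclusive) and 95 (exclusive),
# leaving scores below 90 to the critical alert alone.
uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
```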
```yaml
          for: 1m
          labels:
            severity: warning
            reason: "UDS cluster health score is below 95%"
          annotations:
            summary: "UDS Cluster Health Score Warning"
            description: "The UDS cluster health score has been below 95% for more than 1 minute."
        - alert: UDSClusterHealthScoreCritical
          expr: uds_cluster_health_score < 90
          for: 1m
          labels:
            severity: critical
            reason: "UDS cluster health score is below 90%"
          annotations:
            summary: "UDS Cluster Health Score Critical"
            description: "The UDS cluster health score has been below 90% for more than 1 minute."
{{- end -}}
```
@@ -0,0 +1,40 @@
```yaml
# Copyright 2025 Defense Unicorns
# SPDX-License-Identifier: AGPL-3.0-or-later OR LicenseRef-Defense-Unicorns-Commercial

# yaml-language-server: $schema=https://raw.githubusercontent.com/helm-unittest/helm-unittest/main/schema/helm-testsuite.json

suite: UDS Rules
templates:
  - templates/uds-rules.yaml

tests:
  - it: should include PrometheusRule when clusterHealthRules.enabled is true
    set:
      clusterHealthRules.enabled: true
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
  - it: should NOT include PrometheusRule when clusterHealthRules.enabled is false
    set:
      clusterHealthRules.enabled: false
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
        not: true
  - it: should NOT include PrometheusRule given default clusterHealthRules.enabled value of false
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
        not: true
```
Just to spell it out explicitly and think out loud here: this metric essentially checks the health of the Prometheus scrape targets for each of these services, not necessarily that the services are functional. And technically, if Prometheus is down, all of these metrics will be down, so does it even make sense to include Prometheus in the check/weight?

I still think this is a sane heuristic for now to determine whether a service is up. The uptime metrics we introduce at a later point may be a better option for determining whether a component is healthy.
Totally agreed, this is pretty rudimentary in general, and looking at synthetic monitoring data would be better, or at least a good addition. @soltysh suggested swapping Grafana out for Prometheus, but of course if Prometheus is partially or fully down, none of this will work. For partial Prometheus failure scenarios (where metrics scraping and remote write are still up), it doesn't hurt to include imo.
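On the "Prometheus fully down" case: since a down Prometheus cannot evaluate its own rules, the usual complement is an always-firing deadman alert (kube-prometheus-stack ships one named Watchdog) whose absence is detected by something outside the cluster, e.g. Alertmanager routing to an external dead man's switch. A minimal sketch of the query side:

```promql
# Deadman / Watchdog expression: always firing. Its *absence* at an external
# receiver signals that Prometheus or Alertmanager is down.
vector(1)

# Alternatively, a second Prometheus (or remote-write receiver) can watch for
# the primary going silent with an absent() check on a scraped series
# (where that check would run is an assumption, the job label matches this PR):
absent(up{job="kube-prometheus-stack-prometheus"})
```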