Conversation

@jsevedge commented Nov 4, 2025

Description

Adds support for configuring a UDS Cluster Health Score metric via Prometheus recording rules, which can be optionally enabled.

Related Issue

Not related to any existing uds-core issue.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Steps to Validate

  • Added Helm unit tests to validate the all-or-nothing conditional logic for the new PrometheusRule (see the sketch below)
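
As a rough illustration of what those tests cover (a sketch only; the values key, file layout, and metadata below are assumptions, not the chart's actual contents), the all-or-nothing behavior amounts to wrapping the entire PrometheusRule in a single conditional, e.g. a hypothetical clusterHealthRules.enabled flag:

# Hypothetical sketch: values key and structure are assumptions, not taken from this PR
{{- if .Values.clusterHealthRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "{{ .Release.Name }}-uds-cluster-health"
spec:
  groups:
    - name: "{{ .Release.Name }}-uds-cluster-health.rules"
      interval: 1m
      rules:
        - record: uds_cluster_health_score
          expr: |
            (avg(up{job="apiserver"}) or vector(1)) * 25
            # ...remaining weighted components omitted in this sketch
{{- end }}

The unit tests then only need to assert that the rule renders in full when the flag is set and not at all when it is unset.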

Checklist before merging

@jsevedge changed the title from "Sevedge/add cluster health rules" to "feat: add experimental cluster health rules" on Nov 4, 2025
@jsevedge self-assigned this on Nov 4, 2025

@joelmccoy (Contributor) left a comment

Generally think this is a good approach. Left some comments/questions for consideration. Before we decide to make this metric not experimental, I would like to have a published definition (maybe in the docs somewhere) of what this metric is and how we loosely determined the weight for each component in the score.

- name: "{{ .Release.Name }}-uds-cluster-health.rules"
interval: 1m
rules:
- record: uds_cluster_health_score

Contributor

Just to spell out explicitly and think out loud here... This metric essentially checks the health of the prometheus scrape targets for each of these services (not necessarily that the service is functional). I think technically if prometheus is down, all these metrics will be down?? Does it make sense to even include prometheus in the check/weight if that is the case?

Still think this is a sane heuristic for now to determine if a service is up. I think the uptime metrics we introduce at a later point may be a better option to determine if a component is healthy.

@jsevedge (Author) commented Nov 5, 2025

Totally agreed, this is pretty rudimentary in general, and looking at synthetic monitoring data would be better (or at least a good addition). @soltysh suggested swapping grafana out for prometheus, but of course if prometheus is partially or fully down none of this will work. For partial prometheus failure scenarios (where metrics scraping and remote write capabilities are still up), it doesn't hurt to include it imo.

(avg(up{job="coredns"}) or vector(1)) * 15
+
# 15% weight for Pepr health
(avg(up{job="pepr-uds-core"}) or vector(1)) * 15

Contributor

Pepr also has a scrape target for job=pepr-uds-core-watcher. As written, this metric only covers the admission service; you will likely want to include that job as well so the watcher service is captured too.
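
For illustration only (the job label values come from the comment above, and the 15% weight is just the existing one), a regex matcher could average the up series across both Pepr scrape jobs:

# Covers both pepr-uds-core (admission) and pepr-uds-core-watcher (watcher) scrape targets
(avg(up{job=~"pepr-uds-core(-watcher)?"}) or vector(1)) * 15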

+
# 10% weight for Prometheus health
(avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
)

Contributor

Any reason you didn't include metrics-server, Grafana, Loki, Falco, or Velero? Fine keeping it simple for now to get iterations in... But either way, before we make this metric not experimental anymore, I think we should have a published definition of what this metric captures and how we decided on the weight for each of the services.

@jsevedge (Author)

Just haven't invested the cycles yet to test and include them (in my defense, our multi-cluster bundles only have a few functional layers included). Imo, before merging we should probably spend the cycles to include those so the score doesn't feel half-baked.
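
Purely as a sketch of the shape this could take (the job label values and weights below are assumptions that would need to be verified against the actual scrape configs; they are not part of this PR), each additional component becomes another weighted term, with the existing weights rebalanced so the total stays at 100:

# Hypothetical additional terms: verify job labels and agree on weights before adopting
+ (avg(up{job="loki"}) or vector(1)) * 10
+ (avg(up{job="grafana"}) or vector(1)) * 5
+ (avg(up{job="metrics-server"}) or vector(1)) * 5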

Contributor

Totally fine for this as a first iteration, just wanted to ask the question.

expr: |
  (
    # 25% weight for API Server health
    (avg(up{job="apiserver"}) or vector(1)) * 25

Contributor

Should we default to vector(0) here instead? What happens if the scrape itself is broken and no metric exists? I think with vector(1) it would read as if everything is healthy. Same question for all the others.

@jsevedge (Author) commented Nov 5, 2025

I erred towards failing open in this first pass with vector(1). It is a valid point of consideration though... the main blocker right now is that, with this simplistic expression, we may have UDS Core deployed without some of the functional layers.
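
To make the trade-off concrete (a sketch only, reusing the apiserver term quoted above): with or vector(1) a job that has no up series at all silently contributes full marks, whereas or vector(0) treats a missing scrape target as fully unhealthy, which also penalizes clusters that legitimately don't deploy a given functional layer:

# Fail open: if no up series exists for the job, the component counts as healthy
(avg(up{job="apiserver"}) or vector(1)) * 25

# Fail closed: a missing scrape target counts as fully unhealthy
(avg(up{job="apiserver"}) or vector(0)) * 25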

(avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
)
- alert: UDSClusterHealthScoreWarning
expr: uds_cluster_health_score < 95

Contributor

I think this will also fire when the score is below 90%, creating duplicate alerts. Consider making the expression:

Suggested change
- expr: uds_cluster_health_score < 95
+ expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
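
A sketch of how the two tiers might be bounded so only one alert fires at a time (the Critical alert and its 90 threshold are assumed here for illustration; they are not shown in the quoted diff):

- alert: UDSClusterHealthScoreWarning
  expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
- alert: UDSClusterHealthScoreCritical
  expr: uds_cluster_health_score < 90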
