Conversation

@jsevedge commented Nov 4, 2025

Description

Adds support for configuring a UDS Cluster Health Score metric via Prometheus recording rules, which can be optionally enabled.

Related Issue

Not related to any existing uds-core issue.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Steps to Validate

  • Added Helm unit tests to validate the all-or-nothing conditional logic for the new PrometheusRule (see the sketch below)
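
As a rough illustration of what those tests cover (a sketch only; the values key, file layout, and metadata below are assumptions, not the chart's actual contents), the all-or-nothing behavior amounts to wrapping the entire PrometheusRule in a single conditional, e.g. a hypothetical clusterHealthRules.enabled flag:

# Hypothetical sketch: values key and structure are assumptions, not taken from this PR
{{- if .Values.clusterHealthRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "{{ .Release.Name }}-uds-cluster-health"
spec:
  groups:
    - name: "{{ .Release.Name }}-uds-cluster-health.rules"
      interval: 1m
      rules:
        - record: uds_cluster_health_score
          expr: |
            (avg(up{job="apiserver"}) or vector(1)) * 25
            # ...remaining weighted components omitted in this sketch
{{- end }}

The unit tests then only need to assert that the rule renders in full when the flag is set and not at all when it is unset.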

Checklist before merging

@jsevedge changed the title from "Sevedge/add cluster health rules" to "feat: add experimental cluster health rules" on Nov 4, 2025
@jsevedge self-assigned this on Nov 4, 2025

@joelmccoy (Contributor) left a comment

Generally think this is a good approach. Left some comments/questions for consideration. Before we decide to make this metric not experimental, I would like to have a published definition (maybe in the docs somewhere) of what this metric is and how we loosely determined the weight for each component in the score.

- name: "{{ .Release.Name }}-uds-cluster-health.rules"
interval: 1m
rules:
- record: uds_cluster_health_score

Contributor

Just to spell out explicitly and think out loud here... This metric essentially checks the health of the prometheus scrape targets for each of these services (not necessarily that the service is functional). I think technically if prometheus is down, all these metrics will be down?? Does it make sense to even include prometheus in the check/weight if that is the case?

Still think this is a sane heuristic for now to determine if a service is up. I think the uptime metrics we introduce at a later point may be a better option to determine if a component is healthy.

@jsevedge (Author) commented Nov 5, 2025

Totally agreed, this is pretty rudimentary in general, and looking at synthetic monitoring data would be better (or at least a good addition). @soltysh suggested swapping grafana out for prometheus, but of course if prometheus is partially or fully down none of this will work. For partial prometheus failure scenarios (where metrics scraping and remote write capabilities are still up), it doesn't hurt to include it imo.

(avg(up{job="coredns"}) or vector(1)) * 15
+
# 15% weight for Pepr health
(avg(up{job="pepr-uds-core"}) or vector(1)) * 15

Contributor

Pepr also has a scrape target for job=pepr-uds-core-watcher. As written, this metric only covers the admission service; you will likely want to include that job as well so the watcher service is captured too.
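
For illustration only (the job label values come from the comment above, and the 15% weight is just the existing one), a regex matcher could average the up series across both Pepr scrape jobs:

# Covers both pepr-uds-core (admission) and pepr-uds-core-watcher (watcher) scrape targets
(avg(up{job=~"pepr-uds-core(-watcher)?"}) or vector(1)) * 15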

+
# 10% weight for Prometheus health
(avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
)

Contributor

Any reason you didn't include metrics-server, Grafana, Loki, Falco, or Velero? Fine keeping it simple for now to get iterations in... But either way, before we make this metric not experimental anymore, I think we should have a published definition of what this metric captures and how we decided on the weight for each of the services.

@jsevedge (Author)

Just haven't invested the cycles yet to test and include them (in my defense, our multi-cluster bundles only have a few functional layers included). Imo, before merging we should probably spend the cycles to include those so the score doesn't feel half-baked.
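
Purely as a sketch of the shape this could take (the job label values and weights below are assumptions that would need to be verified against the actual scrape configs; they are not part of this PR), each additional component becomes another weighted term, with the existing weights rebalanced so the total stays at 100:

# Hypothetical additional terms: verify job labels and agree on weights before adopting
+ (avg(up{job="loki"}) or vector(1)) * 10
+ (avg(up{job="grafana"}) or vector(1)) * 5
+ (avg(up{job="metrics-server"}) or vector(1)) * 5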

Contributor

Totally fine for this as a first iteration, just wanted to ask the question.

expr: |
  (
    # 25% weight for API Server health
    (avg(up{job="apiserver"}) or vector(1)) * 25

Contributor

Should we default to vector(0) here instead? What happens if the scrape itself is broken and no metric exists? I think with vector(1) it would read as if everything is healthy. Same question for all the others.

@jsevedge (Author) commented Nov 5, 2025

I erred towards failing open in this first pass with vector(1). It is a valid point of consideration though... the main blocker right now is that, with this simplistic expression, we may have UDS Core deployed without some of the functional layers.
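
To make the trade-off concrete (a sketch only, reusing the apiserver term quoted above): with or vector(1) a job that has no up series at all silently contributes full marks, whereas or vector(0) treats a missing scrape target as fully unhealthy, which also penalizes clusters that legitimately don't deploy a given functional layer:

# Fail open: if no up series exists for the job, the component counts as healthy
(avg(up{job="apiserver"}) or vector(1)) * 25

# Fail closed: a missing scrape target counts as fully unhealthy
(avg(up{job="apiserver"}) or vector(0)) * 25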

(avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
)
- alert: UDSClusterHealthScoreWarning
expr: uds_cluster_health_score < 95

Contributor

I think this will also fire when the score is below 90%, creating duplicate alerts. Consider making the expression:

Suggested change
- expr: uds_cluster_health_score < 95
+ expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
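
A sketch of how the two tiers might be bounded so only one alert fires at a time (the Critical alert and its 90 threshold are assumed here for illustration; they are not shown in the quoted diff):

- alert: UDSClusterHealthScoreWarning
  expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
- alert: UDSClusterHealthScoreCritical
  expr: uds_cluster_health_score < 90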
