src/prometheus-stack/chart/templates/uds-rules.yaml (55 additions, 0 deletions)
@@ -0,0 +1,55 @@
# Copyright 2025 Defense Unicorns
# SPDX-License-Identifier: AGPL-3.0-or-later OR LicenseRef-Defense-Unicorns-Commercial
{{- if .Values.clusterHealthRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: "{{ .Release.Name }}-uds-cluster-health"
namespace: {{ .Release.Namespace }}
labels:
release: {{ .Release.Name }}
spec:
groups:
- name: "{{ .Release.Name }}-uds-cluster-health.rules"
interval: 1m
rules:
- record: uds_cluster_health_score

Contributor:
Just to spell out explicitly and think out loud here... this metric essentially checks the health of the Prometheus scrape targets for each of these services (not necessarily that the service is functional). Technically, if Prometheus is down, all of these metrics will be down, so does it make sense to even include Prometheus in the check/weight in that case?

I still think this is a sane heuristic for now to determine whether a service is up. The uptime metrics we introduce at a later point may be a better option for determining whether a component is healthy.

Author (@jsevedge), Nov 5, 2025:
Totally agreed, this is pretty rudimentary in general, and looking at synthetic monitoring data would be better (or a useful addition). @soltysh suggested swapping Grafana out for Prometheus, but of course if Prometheus is partially or fully down, none of this will work. For partial Prometheus failure scenarios (where metrics scraping and remote write capabilities are still up), it doesn't hurt to include it, imo.

          expr: |
            (
              # 25% weight for API Server health
              (avg(up{job="apiserver"}) or vector(1)) * 25

Contributor:
Should we default to vector(0) here instead? What happens if the scrape itself is broken and no metric exists? I think with vector(1) this would read as everything being healthy. Same for all the others.

Author (@jsevedge), Nov 5, 2025:
I erred towards fail-open in this first pass with vector(1). It is a valid point of consideration, though... the main blocker right now is that, with this simplistic expression, we may have UDS Core deployed without some of the functional layers (in which case a missing metric shouldn't drag the score down).
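
For illustration, the fail-closed variant under discussion would simply swap the default:

    (avg(up{job="apiserver"}) or vector(0)) * 25

With vector(0), an absent scrape target counts fully against the score, which is safer when the target is expected to exist but penalizes layers that are intentionally not deployed.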

              +
              # 25% weight for Kubelet health
              (avg(up{job="kubelet"}) or vector(1)) * 25
              +
              # 15% weight for CoreDNS health
              (avg(up{job="coredns"}) or vector(1)) * 15
              +
              # 15% weight for Pepr health
              (avg(up{job="pepr-uds-core"}) or vector(1)) * 15

Contributor:
Pepr also has a scrape target for job=pepr-uds-core-watcher. As written, this term only covers the admission service, so you will likely want to include that job to capture the watcher service as well.
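
One way to cover both Pepr targets in a single term would be a regex matcher on the job label (a sketch, not from the PR):

    (avg(up{job=~"pepr-uds-core(-watcher)?"}) or vector(1)) * 15

This averages up across both the admission and watcher scrape targets under the same 15% weight.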

              +
              # 10% weight for Keycloak health
              (avg(up{job="keycloak-http"}) or vector(1)) * 10
              +
              # 10% weight for Prometheus health
              (avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
            )
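
The weights sum to 100, so the recorded score reads as a percentage. As a worked illustration (not part of the PR): if every CoreDNS target is down and everything else is fully up,

    25*1 + 25*1 + 15*0 + 15*1 + 10*1 + 10*1 = 85

which lands below the critical threshold of 90 defined in the alerts below.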

Contributor:
Any reason you didn't include metrics-server, Grafana, Loki, Falco, or Velero? Fine keeping it simple for now to get iterations in... but either way, before we make this metric non-experimental, I think we should have a published definition of what it captures and how we decided on the weight for each of the services.

Author (@jsevedge):
I just haven't invested the cycles yet to test and include them (in my defense, our multi-cluster bundles only include a few functional layers). Imo, before merging we should probably spend the cycles to include those so the score doesn't feel half-baked.

Contributor:
Totally fine for this as a first iteration, just wanted to ask the question.
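
If those services are added later, each would get its own weighted term, with the existing weights rebalanced so the total stays at 100. A hypothetical extra term (the Loki job name is assumed, not taken from this chart):

    +
    # e.g. 5% weight for Loki health (job name assumed; verify against the chart's ServiceMonitor)
    (avg(up{job="loki"}) or vector(1)) * 5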

        - alert: UDSClusterHealthScoreWarning
          expr: uds_cluster_health_score < 95

Contributor:
I think this will also fire when the score is below 90%, creating duplicate alerts. Consider making the expression:

Suggested change:

    -          expr: uds_cluster_health_score < 95
    +          expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
          for: 1m
          labels:
            severity: warning
            reason: "UDS cluster health score is below 95%"
          annotations:
            summary: "UDS Cluster Health Score Warning"
            description: "The UDS cluster health score has been below 95% for more than 1 minute."
        - alert: UDSClusterHealthScoreCritical
          expr: uds_cluster_health_score < 90
          for: 1m
          labels:
            severity: critical
            reason: "UDS cluster health score is below 90%"
          annotations:
            summary: "UDS Cluster Health Score Critical"
            description: "The UDS cluster health score has been below 90% for more than 1 minute."
{{- end -}}
src/prometheus-stack/chart/tests/uds_rules_test.yaml (40 additions, 0 deletions)
@@ -0,0 +1,40 @@
# Copyright 2025 Defense Unicorns
# SPDX-License-Identifier: AGPL-3.0-or-later OR LicenseRef-Defense-Unicorns-Commercial

# yaml-language-server: $schema=https://raw.githubusercontent.com/helm-unittest/helm-unittest/main/schema/helm-testsuite.json

suite: UDS Rules
templates:
  - templates/uds-rules.yaml

tests:
  - it: should include PrometheusRule when clusterHealthRules.enabled is true
    set:
      clusterHealthRules.enabled: true
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
  - it: should NOT include PrometheusRule when clusterHealthRules.enabled is false
    set:
      clusterHealthRules.enabled: false
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
        not: true
  - it: should NOT include PrometheusRule given default clusterHealthRules.enabled value of false
    release:
      name: monitoring
    asserts:
      - containsDocument:
          apiVersion: monitoring.coreos.com/v1
          kind: PrometheusRule
          name: monitoring-uds-cluster-health
        not: true
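
Assuming the helm-unittest plugin is installed, the suite can be run locally from the repo root with something like:

    helm plugin install https://github.com/helm-unittest/helm-unittest
    helm unittest src/prometheus-stack/chart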
src/prometheus-stack/chart/values.yaml (5 additions, 0 deletions)
@@ -10,5 +10,10 @@ additionalNetworkAllow: []
# remoteGenerated: Anywhere
# description: "from alertmanager to anywhere"
# port: 443

# NOTE: Enabling cluster health rules is currently an experimental feature
clusterHealthRules:
  enabled: false

rke2CorednsNetpol:
  enabled: false
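
To sanity-check the rendered PrometheusRule with the flag enabled, one option (run from the repo root) is:

    helm template monitoring src/prometheus-stack/chart --set clusterHealthRules.enabled=true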