feat: add experimental cluster health rules #2107
base: main
Conversation
joelmccoy left a comment
Generally think this is a good approach. Left some comments/questions for consideration. Before we decide to make this metric not experimental, I would like to have a published definition (maybe in the docs somewhere) of what this metric is and how we loosely determined the weight for each component in the score.
- name: "{{ .Release.Name }}-uds-cluster-health.rules"
  interval: 1m
  rules:
    - record: uds_cluster_health_score
Just to spell out explicitly and think out loud here... This metric essentially checks the health of the prometheus scrape targets for each of these services (not necessarily that the service is functional). I think technically if prometheus is down, all these metrics will be down?? Does it make sense to even include prometheus in the check/weight if that is the case?
Still think this is a sane heuristic for now to determine if a service is up. I think the uptime metrics we introduce at a later point may be a better option to determine if a component is healthy.
Totally agreed, this is pretty rudimentary in general and looking at synthetic monitoring data would be better / in addition. @soltysh suggested swapping grafana out for prometheus, but of course if prometheus is partially or fully down none of this will work. For partial prometheus failure scenarios (where metrics scraping and remote write capabilities are still up), it doesn't hurt to include imo.
          (avg(up{job="coredns"}) or vector(1)) * 15
          +
          # 15% weight for Pepr health
          (avg(up{job="pepr-uds-core"}) or vector(1)) * 15
Pepr also has a scrape target for job=pepr-uds-core-watcher. As written, this metric only checks the admission service, so you will likely want to include that job to cover the watcher service as well.
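If that is desirable, one option (a sketch only, assuming both jobs expose an up series and that a combined 15% weight is still appropriate) would be a regex job matcher so a single term covers both Pepr scrape targets:

  # Sketch: averages `up` across both Pepr jobs (admission controller and watcher).
  # Assumes the job labels "pepr-uds-core" and "pepr-uds-core-watcher" named in this thread.
  (avg(up{job=~"pepr-uds-core|pepr-uds-core-watcher"}) or vector(1)) * 15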
          +
          # 10% weight for Prometheus health
          (avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
        )
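For context, assembling the fragments quoted in this review, the recording rule looks roughly like the sketch below. Only the components visible in the diff snippets are shown (their weights sum to 65 of the 100 points); the remaining terms are elided rather than guessed:

  - name: "{{ .Release.Name }}-uds-cluster-health.rules"
    interval: 1m
    rules:
      - record: uds_cluster_health_score
        expr: |
          (
            # 25% weight for API Server health
            (avg(up{job="apiserver"}) or vector(1)) * 25
            +
            # weight for CoreDNS health
            (avg(up{job="coredns"}) or vector(1)) * 15
            +
            # 15% weight for Pepr health
            (avg(up{job="pepr-uds-core"}) or vector(1)) * 15
            +
            # ... additional weighted components not visible in the quoted fragments ...
            # 10% weight for Prometheus health
            (avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
          )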
Any reason you didn't include metrics-server, Grafana, Loki, Falco, or Velero? Fine keeping it simple for now to get iterations in... But either way, before we make this metric not experimental anymore, I think we should have a published definition of what this metric captures and how we decided on the weight for each of the services.
Just haven't invested the cycles to test and include them yet (in my defense, our multi-cluster bundles only have a few functional layers included). Imo before merging we should probably spend the cycles to include those so the score doesn't feel half baked.
Totally fine for this as a first iteration, just wanted to ask the question.
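As a rough illustration of how additional components could be folded in later: the job labels and weights below are hypothetical (not taken from this PR) and would need to be verified against the actual ServiceMonitors/scrape configs in each functional layer.

  # Hypothetical job labels and weights -- verify against the real scrape configs before use.
  + (avg(up{job="metrics-server"}) or vector(1)) * 5
  + (avg(up{job="grafana"}) or vector(1)) * 5
  + (avg(up{job="loki"}) or vector(1)) * 5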
        expr: |
          (
            # 25% weight for API Server health
            (avg(up{job="apiserver"}) or vector(1)) * 25
Should we default to vector(0) here instead? What happens if the scrape itself is broken and no metric exists? I think with vector(1) it would read as though everything is healthy. Same concern applies to all the others.
I erred towards failing open in this first pass with vector(1). It is a valid point of consideration though... the main blocker right now is that, with this simplistic expression, UDS Core may be deployed without some of the functional layers, so a missing job doesn't necessarily mean the component is unhealthy.
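To make the trade-off concrete, here is a sketch of the two behaviors for a single term (the same pattern applies to the other components):

  # Fail open (current draft): if the apiserver job has no `up` series at all,
  # the `or vector(1)` fallback counts it as fully healthy.
  (avg(up{job="apiserver"}) or vector(1)) * 25

  # Fail closed (alternative): a missing job contributes 0 and drags the score down,
  # which also penalizes clusters that simply do not deploy that optional layer.
  (avg(up{job="apiserver"}) or vector(0)) * 25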
            (avg(up{job="kube-prometheus-stack-prometheus"}) or vector(1)) * 10
          )
      - alert: UDSClusterHealthScoreWarning
        expr: uds_cluster_health_score < 95
I think this will also fire when the score is below 90%, creating duplicate alerts. Consider making the expression:
  - expr: uds_cluster_health_score < 95
  + expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
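A sketch of how the bounded expression could pair with a lower-threshold alert so only one fires at a time; the critical rule and its threshold of 90 are assumed from this thread rather than taken from the quoted diff:

  - alert: UDSClusterHealthScoreWarning
    expr: uds_cluster_health_score < 95 and uds_cluster_health_score >= 90
  - alert: UDSClusterHealthScoreCritical
    expr: uds_cluster_health_score < 90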
Description
Add support for configuring a UDS Cluster Health Score metric via recording rules, optionally enabled.
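Since the rules are optional, enabling them would presumably be a chart value. The key below is hypothetical and only illustrates the shape of such a toggle, not the actual flag name or path in uds-core:

  # Hypothetical values.yaml override -- the real key/path in the chart may differ.
  clusterHealthRules:
    enabled: true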
Related Issue
Not related to any existing uds-core issue.
Type of change
Steps to Validate
PrometheusRule
Checklist before merging