[mixins] Alertmanager Overview dashboard by ArthurSens · Pull Request #2540 · prometheus/alertmanager

ArthurSens · 2021-04-11T14:59:28Z

The dashboard aims to show an overview of the overall health of Alertmanager.

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

ArthurSens · 2021-05-07T20:27:36Z

Hi @beorn7, apologies for the delay!

What do you think of the result? Anything you'd like me to change?

beorn7

Just some formatting nits in the comments.

doc/alertmanager-mixin/.gitignore

doc/alertmanager-mixin/config.libsonnet

doc/alertmanager-mixin/dashboards/overview.libsonnet

doc/alertmanager-mixin/jsonnetfile.json

beorn7 · 2021-05-10T21:09:07Z

How about providing screenshots here so that reviewers can more easily see the results of this?

beorn7

Now "real" comments about the dashboard.

Bottom line: Let's focus on the first half of the dashboards first (which contain the meat, I'd say) and let's get them right. With the suggested break-out by instance and AM cluster (rather than K8s cluster), those will be complex enough to get right. It might already become quite a rich dashboard if we have separate panels per integration.

beorn7 · 2021-05-10T21:23:43Z

doc/alertmanager-mixin/dashboards/overview.libsonnet

+      template.new(
+        name='cluster',
+        datasource='$datasource',
+        query='label_values(alertmanager_alerts, %s)' % $._config.clusterLabel,


I'm not quite sure if this is doing what you want.

An Alertmanager cluster is a different thing from a Kubernetes cluster (or generally a cluster of nodes you run services on). In fact, for HA, an Alertmanager cluster will commonly span multiple K8s clusters. Ideally, you have only one (global) Alertmanager cluster in your org, but should you have multiple (e.g. you have a dev cluster, or you have strictly separate network partitions, each of which needs its own Alertmanager cluster), then you want your dashboard to switch between those, and when looking at a particular Alertmanager cluster, you want to see all instances included there, even if they run in different K8s clusters.

In fact, the mixin already has this concept of an Alertmanager cluster, see alertmanagerClusterLabels and alertmanagerClusterName in config.libsonnet. So ideally, your multi-cluster support utilizes those. The mixin allows for multiple labels to define the Alertmanager cluster, which makes the templating here hard. But I think it's possible.

ArthurSens · 2021-05-13

In particular, my setup consists of multiple K8s clusters with one Alertmanager per cluster, and they do not communicate with each other. I understand that my setup is not the usual Alertmanager HA setup that we all should aim for, but I don't think I'll be able to test the filter using alertmanagerClusterLabels 😬

beorn7 · 2021-05-17

In your case, you have "one-instance clusters", and the alertmanagerClusterLabels could indeed be just cluster.

Maybe the only problem is that the mixin as-is allows multiple labels for alertmanagerClusterLabels (which I introduced because it was easy at the time, and I think it's needed in some use cases). In fact, here at GL, we use job, namespace as the alertmanagerClusterLabels. job is usually something like global-alertmanager while namespace can be used to, for example, have a production global AM cluster and a separate test cluster.

I think this should all work, just that you need to jump through some jsonnet hoops to iterate through all the labels in alertmanagerClusterLabels and dynamically create the corresponding template variables for Grafana. (This will be easier if alertmanagerClusterLabels is a list. But we can totally make it one.)

The Grafana query_result(query) might also help here. See https://grafana.com/docs/grafana/latest/datasources/prometheus/#query-variable
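The "jsonnet hoops" described above could look roughly like the following. This is only a sketch, not the code that was merged: it assumes alertmanagerClusterLabels is a comma-separated string without spaces, and it emits plain objects where the real dashboard would call grafonnet's template.new. The `config` object and the `clusterSelector` field name are made up for illustration.

```jsonnet
// Sketch only: plain objects stand in for grafonnet template variables.
local config = {
  // Assumed value, mirroring the job/namespace example from the discussion.
  alertmanagerClusterLabels: 'job,namespace',
};

// Split the comma-separated string into a list of label names.
local clusterLabels = std.split(config.alertmanagerClusterLabels, ',');

{
  // One query variable per cluster label, i.e. $job and $namespace here.
  templates: [
    {
      name: label,
      type: 'query',
      datasource: '$datasource',
      query: 'label_values(alertmanager_alerts, %s)' % label,
    }
    for label in clusterLabels
  ],

  // Hypothetical helper: label matchers that bind each label to its variable,
  // ready to be spliced into a PromQL selector.
  clusterSelector: std.join(',', ['%s=~"$%s"' % [label, label] for label in clusterLabels]),
}
```

With the assumed value above, `clusterSelector` evaluates to `job=~"$job",namespace=~"$namespace"`, so a dashboard user picks an Alertmanager cluster by choosing a value for every label that defines it.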

beorn7 · 2021-05-10T21:33:13Z

doc/alertmanager-mixin/dashboards/overview.libsonnet

+          fill=1,
+          legend_show=true,
+        )
+        .addTarget(prometheus.target('sum(rate(alertmanager_notifications_failed_total{%(alertmanagerSelector)s, %(clusterLabel)s="$cluster"}[5m])) by (integration)' % $._config, legendFormat='{{integration}}'));


Same as above, I think it makes more sense to have it broken up per instance.
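One way to read the per-instance suggestion is to add `instance` to the grouping and to the legend of the target quoted above. The snippet below is a hedged sketch of that reading, not the final dashboard code; the `config` values are placeholders chosen for illustration.

```jsonnet
// Sketch only: shows the query string and legend, without grafonnet.
local config = {
  alertmanagerSelector: 'job="alertmanager"',  // assumed selector
  clusterLabel: 'cluster',  // assumed label name
};

{
  // Same failed-notifications rate as in the diff, but grouped by instance too.
  expr: 'sum(rate(alertmanager_notifications_failed_total{%(alertmanagerSelector)s, %(clusterLabel)s="$cluster"}[5m])) by (integration, instance)' % config,
  // Include the instance in the legend so each series can be told apart.
  legendFormat: '{{instance}} {{integration}}',
}
```

Grouping by both labels yields one series per instance and integration, which makes a single misbehaving Alertmanager instance visible instead of hiding it in a cluster-wide sum.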

beorn7 · 2021-05-10T21:58:25Z

Tests were flaky, I re-ran them, and they succeeded.

doc/alertmanager-mixin/Makefile

ArthurSens · 2021-05-13T22:49:26Z

Thanks a lot for the detailed review!

I think the main problem with my first implementation is that I don't use an HA setup for my Alertmanagers. It also made me think about why I don't have an HA setup in the first place... 😅

I'll remove all the low-level metrics added and focus on the "meat", i.e. alert delivery.

ArthurSens · 2021-05-14T00:02:30Z

I've cleaned up all the low-level metrics, leaving only Alerts received and Notifications sent.

The panels repeat over an integration variable that is filtered by alertmanagerCriticalIntegrationsRegEx in _config, similar to what is done in the alerts.

There is not much to see if we use just a few integrations, but it can become quite packed if using all integrations...
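The repeat-over-integration idea can be sketched in terms of Grafana's dashboard JSON model. This is an illustration under assumptions rather than the PR's actual code: the regex value, the panel title, and the choice to apply the filter through the variable's `regex` field are all guesses made for the example.

```jsonnet
// Sketch only: a minimal dashboard fragment with a repeated panel.
local config = {
  // Assumed value for illustration; the mixin's real default may differ.
  alertmanagerCriticalIntegrationsRegEx: 'pagerduty|slack',
};

{
  templating: {
    list: [
      {
        // Variable listing the integrations seen in the notification metric.
        name: 'integration',
        type: 'query',
        datasource: '$datasource',
        query: 'label_values(alertmanager_notifications_total, integration)',
        // Keep only the integrations matching the configured regex.
        regex: config.alertmanagerCriticalIntegrationsRegEx,
        includeAll: true,
        multi: true,
      },
    ],
  },
  panels: [
    {
      // Grafana clones this panel once per selected value of $integration.
      title: '$integration: notifications sent',
      type: 'graph',
      repeat: 'integration',
      targets: [
        {
          expr: 'sum(rate(alertmanager_notifications_total{integration="$integration"}[5m]))',
          legendFormat: '{{integration}}',
        },
      ],
    },
  ],
}
```

Narrowing the regex keeps the number of repeated panels manageable when many integrations are configured.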

ArthurSens · 2021-05-14T00:04:56Z

Turning the PR into a draft until I think of a good way to show HA setups...

ArthurSens · 2021-05-24T20:14:19Z

Hi @beorn7 and @paulfantom,

I've refactored the PR to support multiple Alertmanager HA-clusters instead of multiple k8s clusters. Please take another look when possible 🙂

I don't think I've made any breaking changes to the mixin, but please pay extra attention to the changes in config.libsonnet. Thanks to the flexibility in identifying Alertmanager clusters, I focused on identifying label names and templating those labels into Grafana queries, legends, and Prometheus alerting rules.

Here is an example dashboard:

beorn7 · 2021-05-25T20:01:24Z

Thanks, @ArthurSens. I'll have a look as soon as possible (which might not be very soon – very sorry, too much backlog).

@paulfantom, your feedback would be much appreciated.

beorn7

I'm not a real expert in jsonnet, mixins, and dashboards, so I don't feel qualified to improve details on the dashboard as it is now. I think it's a good starting point, and we can iterate on it based on results from people actually using it in different scenarios. (But if any of the experts have something to improve right now, please go ahead.)

Below are just a few comments on the _config fields. I realize that some of my suggestions will break people who have set alertmanagerName. But since we are now using parallel naming for dashboards, with slightly different generation, I'd say it's better to break them noticeably than to cause a surprise when the names on the dashboard suddenly look different from the ones in the alerts.

.gitpod.yml

doc/alertmanager-mixin/config.libsonnet

ArthurSens · 2021-06-01T18:37:12Z

Thanks again for the review, Beorn. I've moved the mentioned variables from config to the dashboard libsonnet.

beorn7

Just a few more nits.

.gitignore

beorn7 · 2021-06-03T15:16:21Z

doc/alertmanager-mixin/config.libsonnet

+    // alertmanagerName is an identifier for alerts that is built from 'alertmanagerNameLabels'
+    alertmanagerName: std.join('/', ['{{$labels.%s}}' % [label] for label in std.split(c.alertmanagerNameLabels, ',')]),


I would either move this out of _config, too, or change the comment to "alertmanagerName is an identifier for alerts. By default, it is built from 'alertmanagerNameLabels'."

ArthurSens · 2021-06-03

I'd like to leave this in config just for backward compatibility. PrometheusRules created by this mixin use this field in alert descriptions.
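To make the discussion concrete, here is what the alertmanagerName expression from the diff produces for one sample setting. The value given to alertmanagerNameLabels is an assumption for the example, and `c` is reduced to a local object holding only that field.

```jsonnet
// Sketch only: evaluates the expression from the diff in isolation.
local c = {
  alertmanagerNameLabels: 'namespace,pod',  // assumed value for illustration
};

{
  // Joins one '{{$labels.<name>}}' template per label with '/'.
  alertmanagerName: std.join('/', ['{{$labels.%s}}' % [label] for label in std.split(c.alertmanagerNameLabels, ',')]),
}
```

With that input, `alertmanagerName` evaluates to `{{$labels.namespace}}/{{$labels.pod}}`, which Prometheus then expands per alert when it renders the alert description.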

ArthurSens · 2021-06-03T21:06:43Z

Thanks, Beorn 🙂

I think the PR is nearly done. Would you like me to squash the commits?

beorn7

I think the backwards compatibility is an issue anyway because those that used to override alertmanagerName now probably need to set alertmanagerNameLabels. I'd say better break in a noticeable way.

Furthermore, having alertmanagerName in alert descriptions doesn't really require it to be a configurable variable.

But on the other hand, users might ask for it anyway. The whole mixin is experimental anyway, so let's not try too hard to get it perfect.

I'll squash and merge, and we'll see what feedback we get.

paulfantom · 2021-06-08T11:06:58Z

Sorry, this was completely lost in an influx of GitHub notifications (I really need to fix my setup for this). But I am happy you managed to finalize this. Let's include it in kube-prometheus and gather community feedback.

@ArthurSens would you mind bringing this into kube-prometheus?

ArthurSens · 2021-06-08T11:22:24Z

Yep, I'll try to open the PR on kube-prometheus today or tomorrow 🙂

Implements a Grafana dashboard to the mixin.

2729314

The dashboard aims to show an overview of the overall health of Alertmanager. Signed-off-by: ArthurSens <arthursens2005@gmail.com>

ArthurSens force-pushed the as/dashboard-mixins branch from a5e0f1a to 2729314 May 7, 2021 20:16

ArthurSens added 2 commits May 7, 2021 20:18

Install jsonnet-bundler in CI

18be2e3

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

Add UID to Alertmanager / Overview dashboard

904d455

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

ArthurSens marked this pull request as ready for review May 7, 2021 20:22

Merge branch 'master' into as/dashboard-mixins

7704475

beorn7 requested changes May 10, 2021

paulfantom reviewed May 11, 2021

doc/alertmanager-mixin/Makefile (resolved)

Change jsonnetfmt max-lines to 1

5360016

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

Clean-up low level metrics

b05360b

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

ArthurSens marked this pull request as draft May 13, 2021 23:58

ArthurSens added 3 commits May 24, 2021 16:34

wip

9b7d3b8

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

Add HA cluster support

216e2cc

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

Small fixes

b36beb8

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

ArthurSens marked this pull request as ready for review May 24, 2021 20:04

beorn7 requested changes May 31, 2021

.gitpod.yml (outdated, resolved)

doc/alertmanager-mixin/config.libsonnet (outdated, resolved)

doc/alertmanager-mixin/config.libsonnet (outdated, resolved)

doc/alertmanager-mixin/config.libsonnet (outdated, resolved)

Move variables from config to dashboard lib

43671c7

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

beorn7 requested changes Jun 3, 2021

Address comments

e7e574e

Signed-off-by: ArthurSens <arthursens2005@gmail.com>

beorn7 approved these changes Jun 7, 2021

beorn7 merged commit 8598683 into prometheus:master Jun 7, 2021

ArthurSens deleted the as/dashboard-mixins branch June 7, 2021 21:47

ArthurSens mentioned this pull request Jun 8, 2021

Update alertmanager mixin prometheus-operator/kube-prometheus#1193

Merged
