Improve notification instrumentation #1335
Conversation
- Add the groupName on notification success/failure counter, to help users debug quicker which group is failing (in case they use different configs)
- Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
50ccfa3 to 39d7737
simonpasquier left a comment
It should fix #1241, right?
notify/notify.go
Outdated
	retry, err := r.integration.Notify(ctx, alerts...)
	notificationLatencySeconds.WithLabelValues(r.integration.name).Observe(time.Since(now).Seconds())
	if err != nil {
		numFailedNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
The definition of numFailedNotifications hasn't been updated with the new label.
notify/notify.go
Outdated
	iErr = err
} else {
-	numNotifications.WithLabelValues(r.integration.name).Inc()
+	numNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
There can be thousands of group names inside a single Alertmanager, which is too high-cardinality to use in a label. That's why we currently only break out by notifier.
Right. And if a global configuration were broken, it would potentially be massive. I guess folks will have to check for failed sends and then grep the logs, since those do include the specific group name. I just wish there were a way to handle this without resorting to grepping logs :)
A mis-configured global flag could potentially create thousands of unique timeseries.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
c98e39c to c6f43c7
@simonpasquier could you take a look?

👍
- Add the groupName on notification success/failure counter, to help users debug quicker which group is failing (in case they use different configs)
- Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification.
The buckets are very coarse. The current default delay is 15s, set by --cluster.peer-timeout, and I'm mostly interested in seeing if this threshold is causing the duplicate messages that have been reported.

I wanted to add the groupName to the success/failure counter for integrations, but now this makes it difficult to initialize the metrics with values. I'm not sure if this is worth it (I think it will help reduce the time taken to fix a broken receiver's config), but if so then I'm not sure what would be the best way to handle initialization.

EDIT:
Addresses #1241