Skip to content

Improve notification instrumentation#1335

Merged
stuartnelson3 merged 2 commits intomasterfrom
stn/improve-notification-instrumentation
Apr 23, 2018
Merged

Improve notification instrumentation#1335
stuartnelson3 merged 2 commits intomasterfrom
stn/improve-notification-instrumentation

Conversation

@stuartnelson3
Copy link
Contributor

@stuartnelson3 stuartnelson3 commented Apr 20, 2018

  • Add the groupName on notification
    success/failure counter, to help users debug
    quicker which group is failing (in case they use
    different configs)
  • Add notificationLatencySeconds histogram to
    debug duplicate messages. This can help rule out
    if duplicate messages are being caused by
    excessive latency when sending a notification.

The buckets are very coarse. The current default delay is 15s, set by --cluster.peer-timeout, and I'm mostly interested in seeing if this threshhold is causing duplicate messages that have been reported.

I wanted to add the groupName to the success/failure counter for integrations, but now this makes it difficult to initialize the metrics with values. I'm not sure if this is worth it (I think it will help reduce the time taken to fix a broken receiver's config), but if so then I'm not sure what would be the best way to handle initialization.

EDIT:
Addresses #1241

- Add the groupName on notification
success/failure counter, to help users debug
quicker which group is failing (in case they use
different configs)
- Add notificationLatencySeconds histogram to
debug duplicate messages. This can help rule out
if duplicate messages are being caused by
excessive latency when sending a notification.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/improve-notification-instrumentation branch from 50ccfa3 to 39d7737 Compare April 20, 2018 14:47
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should fix #1241, right?

notify/notify.go Outdated
retry, err := r.integration.Notify(ctx, alerts...)
notificationLatencySeconds.WithLabelValues(r.integration.name).Observe(time.Since(now).Seconds())
if err != nil {
numFailedNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of numFailedNotifications hasn't been updated with the new label.

notify/notify.go Outdated
iErr = err
} else {
numNotifications.WithLabelValues(r.integration.name).Inc()
numNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@brian-brazil
Copy link
Contributor

There can be thousands of groupnames inside a single alertmanager, which is too high cardinality wise to use in a label. That's why we only break out by notifier currently.

@stuartnelson3
Copy link
Contributor Author

There can be thousands of groupnames inside a single alertmanager, which is too high cardinality wise to use in a label. That's why we only break out by notifier currently.

Right. And if a global configuration were broken it would potentially be massive. I guess folks will have to check for failed sends and then grep logs, since that does include the specific group name. I just wish there were a way to handle this without resorting to grepping logs :)

A mis-configured global flag could potentially
create thousands of unique timeseries.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/improve-notification-instrumentation branch from c98e39c to c6f43c7 Compare April 23, 2018 09:41
@stuartnelson3
Copy link
Contributor Author

@simonpasquier could you take a look?

@simonpasquier
Copy link
Member

👍

@stuartnelson3 stuartnelson3 merged commit bc263d3 into master Apr 23, 2018
@stuartnelson3 stuartnelson3 deleted the stn/improve-notification-instrumentation branch April 23, 2018 12:23
hh pushed a commit to ii/alertmanager that referenced this pull request May 7, 2019
Updates for procfs refactoring

Signed-off-by: Paul Gier <pgier@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants