Improve notification instrumentation #1335
Conversation
- Add the groupName on notification success/failure counter, to help users debug quicker which group is failing (in case they use different configs)
- Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
50ccfa3 to 39d7737
simonpasquier left a comment
It should fix #1241, right?
notify/notify.go
Outdated
	retry, err := r.integration.Notify(ctx, alerts...)
	notificationLatencySeconds.WithLabelValues(r.integration.name).Observe(time.Since(now).Seconds())
	if err != nil {
		numFailedNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
The definition of numFailedNotifications hasn't been updated with the new label.
notify/notify.go
Outdated
	iErr = err
} else {
-	numNotifications.WithLabelValues(r.integration.name).Inc()
+	numNotifications.WithLabelValues(r.integration.name, r.groupName).Inc()
There can be thousands of group names inside a single Alertmanager, which is too high-cardinality to use in a label. That's why we currently only break out by notifier.
Right. And if a global configuration were broken, it would potentially be massive. I guess folks will have to check for failed sends and then grep the logs, since those do include the specific group name. I just wish there were a way to handle this without resorting to grepping logs :)
A mis-configured global flag could potentially create thousands of unique timeseries.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
c98e39c to c6f43c7
@simonpasquier could you take a look?

👍
- Add the groupName on notification success/failure counter, to help users debug quicker which group is failing (in case they use different configs)
- Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification.
The buckets are very coarse. The current default delay is 15s, set by --cluster.peer-timeout, and I'm mostly interested in seeing if this threshold is causing the duplicate messages that have been reported.

I wanted to add the groupName to the success/failure counter for integrations, but now this makes it difficult to initialize the metrics with values. I'm not sure if this is worth it (I think it will help reduce the time taken to fix a broken receiver's config), but if so then I'm not sure what would be the best way to handle initialization.

EDIT:
Addresses #1241