Add AlertLifeCycleObserver that allows consumers to hook into Alert life cycle #3461
emanlodovice wants to merge 1 commit into prometheus:main
Conversation
api/v1/api.go
Outdated
```go
validationErrs.Add(err)
api.m.Invalid().Inc()
if api.alertLCObserver != nil {
	api.alertLCObserver.Rejected("Invalid", a)
```
Can we change the "Invalid" string to the actual error?
api/v1/api.go
Outdated
```go
	err: err,
}, nil)
if api.alertLCObserver != nil {
	api.alertLCObserver.Rejected("Failed to create", validAlerts...)
```
This is for when alerts.Put fails. Since we don't end up recording the alert, I considered it rejected.
api/v1/api_test.go
Outdated
```go
require.Equal(t, tc.code, w.Code, fmt.Sprintf("test case: %d, StartsAt %v, EndsAt %v, Response: %s", i, tc.start, tc.end, string(body)))

observer := alertobserver.NewFakeAlertLifeCycleObserver()
```
nit: maybe create a separate test case?
dispatch/dispatch_test.go
Outdated
```go
}
require.Equal(t, 1, len(recorder.Alerts()))
require.Equal(t, inputAlerts[0].Fingerprint(), observer.AggrGroupAlerts[0].Fingerprint())
o, ok := notify.AlertLCObserver(dispatcher.ctx)
```
Can we create a fake observer that, for example, increments a counter, and then verify that the observer's functions get called?
Yes, we already do that. In line 598 we create a fake observer, and in line 616 we verify that the function was called by checking the recorded alert.
dispatch/dispatch.go
Outdated
```go
d.ctx, d.cancel = context.WithCancel(context.Background())
ctx := context.Background()
if d.alertLCObserver != nil {
	ctx = notify.WithAlertLCObserver(ctx, d.alertLCObserver)
```
should we put the observer into the stages rather than in ctx?
You mean pass it as one of the arguments in the Exec call instead of adding it to the context?
This is great! I've been thinking about doing something similar, for the exact reasons mentioned:
alertobserver/alertobserver.go
Outdated
```go
	"github.com/prometheus/alertmanager/types"
)

type AlertLifeCycleObserver interface {
```
Instead of having a large interface with a method per event, have you considered having a generic Observe method that accepts metadata?
For example:
```go
type LifeCycleObserver interface {
	Observe(event string, alerts []*types.Alert, meta Metadata)
}
```

The metadata could be something as simple as:

```go
type Metadata map[string]interface{}
```
Agreed, I'm not a fan of large interfaces either.
Sure, I can update the code as suggested. Thanks for checking 🙇
simonpasquier left a comment:
I'm not 100% sure to understand how it would be used outside of prometheus/alertmanager. Can you share some code?
Also, though not exactly the same, I wonder if we shouldn't implement tracing inside Alertmanager to provide this visibility about "where's my alert?".
The use case we are thinking of is just adding logs for these events. It sort of becomes an alert history that we can query when a customer comes in. We would like the flexibility to implement how we collect and format the logs and how we store them.
```go
// function, to make sure that when the run() will be executed the 1st
// alert is already there.
ag.insert(alert)
if d.alertLCObserver != nil {
```
do we need an event at d.metrics.aggrGroupLimitReached.Inc()?
notify/notify.go
Outdated
```go
m := alertobserver.AlertEventMeta{
	"ctx":         ctx,
	"msg":         "Unrecoverable error",
	"integration": r.integration.Name(),
```
Do we care about each retry? Should we just record the final failure or success in func (r RetryStage) Exec()?
I don't think we should care about retries here; currently we only record the final success/failure, hence the if !retry.
I mean, why not put this into the Exec function at line 758?
I updated the code to log the sent alerts instead, because that is the correct list of alerts that was sent. Since Exec doesn't return the sent alerts, I think we have to keep the code where it currently is.
Just some nits but overall looks good!
@grobinson-grafana @simonpasquier could you have a look at this PR when you have time? Thank you
Rebased the PR and fixed conflicts.
@simonpasquier this draft PR in Cortex gives the general idea of our use case for this feature: https://github.com/cortexproject/cortex/pull/5602/commits
…ife cycle
Signed-off-by: Emmanuel Lodovice <lodovice@amazon.com>
@gotjosh good day. Can you take a look at this one?
API v1 is deprecated; if this is still wanted, please reimplement it on top of API v2.
I can take a look at this.
Checking this again, I think we can close it:
SoloJacobs left a comment:
I will go ahead and close this, since I agree with Siavash here. In particular, I don't think we will accept changes to the v1 and v2 APIs that are not strict bug fixes.
I don't want to dismiss the use cases that motivated this PR. If you have specific needs or requirements, I would suggest adding them to these issues:
Thank you!
What this pull request does
This pull request introduces a new AlertLifeCycleObserver interface that is accepted by the API, the Dispatcher, and the notification pipeline. The interface contains methods that allow tracking what happens to an alert inside Alertmanager.

Motivation
Currently, when a customer complains "I think my alert is delayed", we have no straightforward way to troubleshoot. At minimum, we should be able to quickly identify whether the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.
By introducing a new interface that allows hooking into the alert life cycle, consumers of the alertmanager package can implement whatever observability solution works best for them.