Add AlertLifeCycleObserver that allows consumers to hook into Alert life cycle #3461
emanlodovice wants to merge 1 commit into prometheus:main
Conversation
api/v1/api.go
Outdated
```go
validationErrs.Add(err)
api.m.Invalid().Inc()
if api.alertLCObserver != nil {
	api.alertLCObserver.Rejected("Invalid", a)
```
Can we change the "Invalid" string to the actual error?
api/v1/api.go
Outdated
```go
	err: err,
}, nil)
if api.alertLCObserver != nil {
	api.alertLCObserver.Rejected("Failed to create", validAlerts...)
```
This is for when alerts.Put fails. Since we don't end up recording the alert, I considered it rejected.
api/v1/api_test.go
Outdated
```go
require.Equal(t, tc.code, w.Code, fmt.Sprintf("test case: %d, StartsAt %v, EndsAt %v, Response: %s", i, tc.start, tc.end, string(body)))

observer := alertobserver.NewFakeAlertLifeCycleObserver()
```
nit: maybe create a separate test case?
dispatch/dispatch_test.go
Outdated
```go
}
require.Equal(t, 1, len(recorder.Alerts()))
require.Equal(t, inputAlerts[0].Fingerprint(), observer.AggrGroupAlerts[0].Fingerprint())
o, ok := notify.AlertLCObserver(dispatcher.ctx)
```
Can we create a fake observer that, for example, increments a counter, and then verify that the observer's functions get called?
Yes, we already do that. In line 598 we create a fake observer, and in line 616 we verify that the function was called by checking the recorded alert.
dispatch/dispatch.go
Outdated
```go
d.ctx, d.cancel = context.WithCancel(context.Background())
ctx := context.Background()
if d.alertLCObserver != nil {
	ctx = notify.WithAlertLCObserver(ctx, d.alertLCObserver)
```
should we put the observer into the stages rather than in ctx?
You mean pass it as one of the arguments in the Exec call instead of adding it to the context?
This is great! I've been thinking about doing something similar, for the exact reasons mentioned:
alertobserver/alertobserver.go
Outdated
```go
	"github.com/prometheus/alertmanager/types"
)

type AlertLifeCycleObserver interface {
```
Instead of having a large interface with a method per event, have you considered having a generic Observe method that accepts metadata?
For example:
```go
type LifeCycleObserver interface {
	Observe(event string, alerts []*types.Alert, meta Metadata)
}
```

The metadata could be something as simple as:

```go
type Metadata map[string]interface{}
```
Agreed, I'm not a fan of large interfaces either.
Sure, I can update the code as suggested. Thanks for checking 🙇
simonpasquier left a comment:
I'm not 100% sure to understand how it would be used outside of prometheus/alertmanager. Can you share some code?
Also, though not exactly the same, I wonder if we shouldn't implement tracing inside Alertmanager to provide this visibility about "where's my alert?".
The use case we are thinking of is just adding logs for these events. It sort of becomes an alert history that we can query when a customer comes in. We would like the flexibility to implement how we collect and format the logs and how we store them.
```go
// function, to make sure that when the run() will be executed the 1st
// alert is already there.
ag.insert(alert)
if d.alertLCObserver != nil {
```
do we need an event at d.metrics.aggrGroupLimitReached.Inc()?
notify/notify.go
Outdated
```go
m := alertobserver.AlertEventMeta{
	"ctx":         ctx,
	"msg":         "Unrecoverable error",
	"integration": r.integration.Name(),
```
Do we care about each retry? Should we just record the final failure or success in func (r RetryStage) Exec()?
I don't think we should care about retries here; currently we only record the final success/failure, hence the if !retry.
I mean, why not put this into the Exec function at line 758?
I updated the code to log the sent alerts instead, because that is the correct list of alerts that was sent. Since Exec doesn't return the sent alerts, I think we have to keep the code where it currently is.
Just some nits but overall looks good!
@grobinson-grafana @simonpasquier could you have a look at this PR when you have time? Thank you
Rebased the PR and fixed conflicts.
@simonpasquier this draft PR in Cortex gives the general idea of our use case for this feature: https://github.com/cortexproject/cortex/pull/5602/commits
…ife cycle
Signed-off-by: Emmanuel Lodovice <lodovice@amazon.com>
@gotjosh good day. Can you take a look at this one?
API v1 is deprecated; if this is still wanted, please reimplement it on top of API v2.
I can take a look at this.
Checking this again, I think we can close it:
SoloJacobs left a comment:
I will go ahead and close this, since I agree with Siavash here. In particular, I don't think we will accept changes to the v1 and v2 APIs that are not strict bug fixes.
I don't want to dismiss the use cases that motivated this PR. If you have specific needs or requirements, I would suggest adding them to these issues:
Thank you!
What this pull request does
This pull request introduces a new AlertLifeCycleObserver interface that is accepted by the API, the Dispatcher, and the notification pipeline. The interface contains methods that allow tracking what happens to an alert inside Alertmanager.

Motivation
Currently, when a customer complains "I think my alert is delayed", we have no straightforward way to troubleshoot. At minimum, we should be able to quickly identify whether the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.
By introducing a new interface that allows hooking into the alert life cycle, consumers of the alertmanager package can implement whatever observability solution works best for them.