eventrecorder: add structured event recorder #5072

jackrosenthal wants to merge 1 commit into prometheus:main
Conversation

Thank you @jackrosenthal

Addressed findings from Devin review:

Acking from Devin (no action needed):
eventlog/eventlog.go (Outdated)

```go
func MatchersAsProto(matchers labels.Matchers) []*eventlogpb.Matcher {
	result := make([]*eventlogpb.Matcher, len(matchers))
	for i, m := range matchers {
```

Should we have a SingleMatcherAsProto func called here, and also use it instead of convertMatcherForEvent() in inhibit/inhibit.go:428-445, so we avoid duplicating all of this?
```diff
-	return s.setSilence(msil, now)
+	if err := s.setSilence(msil, now); err != nil {
+		return err
+	}
```

Seeing this flow here makes me think: we are relying on setSilence not emitting the event itself, on the assumption that if canUpdate succeeds, the merge will succeed. Is that always guaranteed? Should we at least log or track if we call setSilence expecting no new silence and one happens against all odds, or is that out of scope? Also, the pattern of recording the event inside setSilence on one path but outside it on the other is a bit unbalanced. I wonder if we can always record inside setSilence, or always outside.

I think it is guaranteed: the checks before this point verify the id matches, so even if canUpdate is wrong we'll still end up overwriting an existing silence when we call setSilence.
silence/silence.go (Outdated)

```go
activeIDs = append(activeIDs, sil.Id)
allIDs = append(allIDs, sil.Id)

recorder.RecordEvent(&eventlogpb.EventData{
```

Would it make sense to emit just one event with all the silences that mute the alert instead? WDYT?

I considered this, but I think per-silence events are more useful for downstream consumers. Each event is self-contained and can be filtered and routed independently, and the schema stays simple (a single silence per event rather than a repeated field). If an alert is muted by 3 silences, 3 separate events are a clearer audit trail than one batched event where the consumer has to iterate.
eventlog/webhook.go (Outdated)

```go
}

// WriteEvent POSTs the pre-serialized JSON bytes to the webhook URL.
func (wo *WebhookOutput) WriteEvent(data []byte) error {
```

Should we have a goroutine per POST, or a bounded number of workers, to avoid a slow or temporarily down webhook from stalling us, and/or a small number of retries before dropping? And should we have metrics for retried events, dropped events, and the length of the request queue?
eventlog/eventlog.go (Outdated)

```go
for i, out := range outputs {
	idx := strconv.Itoa(i)
```

This index could change on config reload, right? Should we use file:path or webhook:url for a more stable long-term meaning of the output label?
eventlog/eventlog.go (Outdated)

```go
// placed on a bounded in-memory queue. A background goroutine
// drains the queue and writes to outputs. If the queue is full,
// events are dropped and a metric is incremented.
package eventlog
```

Should we add any tests for the eventlog package?
eventlog/eventlog.go (Outdated)

```go
// Skip if the config hasn't changed to avoid closing and
// recreating identical outputs.
r.core.mu.Lock()
if reflect.DeepEqual(cfg, r.core.currentCfg) {
```

Should we use a different comparison here, restricted to what we care about, instead of a full DeepEqual on a struct that is quite complex and might have unexported fields? We could, for example, keep a hash of the YAML that we used to build the config and only rebuild if the new YAML has a different hash.
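The YAML-hash idea can be sketched like this. It is a minimal illustration under stated assumptions: `configFingerprint` is an invented name, and the reload path is assumed to still have the raw config bytes available.

```go
// Sketch: fingerprint the raw YAML that produced the outputs and rebuild
// only when the fingerprint changes, avoiding reflect.DeepEqual on a
// complex struct with unexported fields.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configFingerprint hashes the raw config bytes into a short stable token.
func configFingerprint(rawYAML []byte) string {
	sum := sha256.Sum256(rawYAML)
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := []byte("outputs:\n  - file: /var/log/events.jsonl\n")
	b := []byte("outputs:\n  - webhook: https://example.com/hook\n")
	fmt.Println(configFingerprint(a) == configFingerprint(a)) // unchanged config: skip rebuild
	fmt.Println(configFingerprint(a) == configFingerprint(b)) // changed config: rebuild outputs
}
```

The trade-off is that whitespace-only edits to the YAML would also trigger a rebuild, which is harmless here since rebuilding identical outputs is just slightly wasteful.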
eventlog/eventlog.go (Outdated)

```go
	return
}

newOutputs := buildOutputs(cfg.Outputs, r.core.logger)
```

Is there any way to build these without holding the lock, then swap them in? Or, for example, we could keep an atomic pointer to the config instead of a lock.
cmd/alertmanager/main.go (Outdated)

```diff
-	inhibitor.Load().Mutes(ctx, labels)
-	silencer.Mutes(ctx, labels)
+	inhibitor.Load().MutesNoRecord(ctx, labels)
+	silencer.MutesNoRecord(ctx, labels)
```

I am wondering if, instead of a separate MutesNoRecord call in both modules, we could pass a context that disables the event recorder for the call?
eventlog/file.go (Outdated)

```go
"log/slog"
"time"

"github.com/pengsrc/go-shared/reopen"
```

This is a library that hasn't been updated in years (8 years, not 2 :). At this point, rather than taking the dependency, it might be better to implement the minimum of what we need ourselves. Also, should we consider using https://github.com/fsnotify/fsnotify to detect when we need to reopen, instead of blindly reopening every 5 minutes?
Addressed review feedback:
inhibit/inhibit.go (Outdated)

```go
ih.recorder.RecordEvent(ctx, &eventlogpb.EventData{
	EventType: &eventlogpb.EventData_InhibitionMutedAlert{
		InhibitionMutedAlert: &eventlogpb.InhibitionMutedAlertEvent{
			InhibitRules: []*eventlogpb.InhibitRule{convertInhibitRuleForEvent(r)},
			MutedAlert: &eventlogpb.MutedAlert{
				Fingerprint: uint64(fp),
				Labels:      eventlog.LabelSetAsProto(lset),
			},
			InhibitingFingerprints: []uint64{uint64(inhibitedByFP)},
		},
	},
})
```

Can we add some helper functions to make constructing events less verbose on the emitter side? For example:

```diff
-	ih.recorder.RecordEvent(ctx, &eventlogpb.EventData{
-		EventType: &eventlogpb.EventData_InhibitionMutedAlert{
-			InhibitionMutedAlert: &eventlogpb.InhibitionMutedAlertEvent{
-				InhibitRules: []*eventlogpb.InhibitRule{convertInhibitRuleForEvent(r)},
-				MutedAlert: &eventlogpb.MutedAlert{
-					Fingerprint: uint64(fp),
-					Labels:      eventlog.LabelSetAsProto(lset),
-				},
-				InhibitingFingerprints: []uint64{uint64(inhibitedByFP)},
-			},
-		},
-	})
+	ih.recorder.RecordEvent(ctx, eventlog.NewInhibitionMutedAlertEvent(r, fp, lset, inhibitedBy))
```
inhibit/inhibit.go (Outdated)

```go
func convertInhibitRuleForEvent(r *InhibitRule) *eventlogpb.InhibitRule {
	sourceMatchers := make([]*eventlogpb.Matcher, len(r.SourceMatchers))
	for i, m := range r.SourceMatchers {
		sourceMatchers[i] = eventlog.MatcherAsProto(m)
	}

	targetMatchers := make([]*eventlogpb.Matcher, len(r.TargetMatchers))
	for i, m := range r.TargetMatchers {
		targetMatchers[i] = eventlog.MatcherAsProto(m)
	}

	equalLabels := make([]string, 0, len(r.Equal))
	for label := range r.Equal {
		equalLabels = append(equalLabels, string(label))
	}

	return &eventlogpb.InhibitRule{
		SourceMatchers: sourceMatchers,
		TargetMatchers: targetMatchers,
		EqualLabels:    equalLabels,
	}
}
```

Following my previous comment, this should live in the eventlog package and be used internally there.

Done. Moved to eventrecorder.InhibitRuleAsProto.

I think the package name
eventlog/file.go (Outdated)

```go
if event.Op&(fsnotify.Rename|fsnotify.Remove) != 0 {
	fo.reopen()
	// Re-add the watch since the original inode is gone.
	_ = watcher.Add(fo.path)
```

In case of error here we may miss future notifications. Would it make sense to watch the directory instead, only once, and filter for the filename we care about in the notifications we get from changes there?

Done. Changed to watch the parent directory with fsnotify, filtering by the target file path.
eventlog/webhook.go (Outdated)

```go
maxRetries:   maxRetries,
retryBackoff: retryBackoff,
logger:       logger,
work:         make(chan []byte, workers),
```

This channel only has capacity for `workers` (4) items, so when below you do `wo.work <- data`, it will block once all 4 workers are busy and 4 events are enqueued. Should we have a larger channel here? And maybe, below, drop events and log/increment a metric if the big channel is completely full?

Done. Increased the queue to 1024. SendEvent is now non-blocking and drops the event when the queue is full, incrementing alertmanager_event_recorder_webhook_drops_total.
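The non-blocking drop-on-full behavior described above can be sketched as follows. This is an illustration, not the PR's code: the `webhookQueue` type is invented, and a plain counter stands in for the `alertmanager_event_recorder_webhook_drops_total` metric.

```go
// Sketch: enqueue if there is room, otherwise drop and count, so a full
// queue never blocks the caller.
package main

import (
	"fmt"
	"sync/atomic"
)

type webhookQueue struct {
	work  chan []byte
	drops atomic.Int64
}

// SendEvent never blocks: when the queue is full, the event is dropped
// and the drop counter is incremented instead.
func (q *webhookQueue) SendEvent(data []byte) {
	select {
	case q.work <- data:
	default:
		q.drops.Add(1)
	}
}

func main() {
	q := &webhookQueue{work: make(chan []byte, 2)}
	for i := 0; i < 5; i++ {
		q.SendEvent([]byte("event"))
	}
	// With no consumer running, 2 events fit and 3 are dropped.
	fmt.Println(len(q.work), q.drops.Load()) // prints "2 3"
}
```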
```go
// FileOutput writes pre-serialized JSON event bytes to a JSONL file.
// The file is reopened when fsnotify detects a rename or remove (e.g.
// logrotate).
type FileOutput struct {
```

Can we have some tests for FileOutput and the relevant reopener/notification handling, to check it all works as we expect?

Done. Added five tests: TestFileOutput_SendEvent, TestFileOutput_ReopenAfterRename, TestFileOutput_ReopenAfterRemove, TestFileOutput_Close, and TestFileOutput_InvalidPath.
eventlog/eventlog.go (Outdated)

```go
newOutputs := buildOutputs(cfg.Outputs, r.core.logger)

// Swap under the lock.
r.core.mu.Lock()
```

Here we should check that oldOutputs is unchanged since we unlocked and re-locked. I know it's unlikely, but still. If it were an atomic pointer we would use (*Pointer[T]).CompareAndSwap; otherwise we could use a version number stored inside, or a hash, or re-compare with eventLogConfigEqual, though that seems less ideal.

This is no longer applicable. The concurrency refactor replaced all locks with message passing, so writeLoop exclusively owns the outputs and there is no unlock/relock sequence.
eventlog/webhook.go (Outdated)

```go
	}
	wo.logger.Warn("Event log webhook POST failed", "url", wo.url, "attempt", attempt+1, "err", err)
	if attempt < wo.maxRetries-1 {
		backoff := time.Duration(float64(wo.retryBackoff) * math.Pow(2, float64(attempt)))
```

Should we have a max backoff here? The largest one with the default values is 256 seconds. And should we have a metric for dropped values? Also, do we need to bring in math.Pow here? (Though it shouldn't matter too much, fair enough.)

Done. Added a max backoff cap of 30 seconds and simplified the calculation to min(retryBackoff<<attempt, maxBackoff), removing the math import. Drop metric added as noted above.
eventlog/eventlog.go (Outdated)

```go
select {
case req := <-c.events:
	c.writeToOutputs(req)
case <-c.done:
```

Here, while we are draining c.events, the workers may have already found an empty channel and quit. That is: done is closed, all workers quit because their channel is empty, and then we get scheduled and start adding events which nobody handles. :( We should first tell writeLoop to drain-and-return, and only after it's done tell the workers to drain-and-return.

Fixed by the concurrency refactor. writeLoop handles its own drain in a defer when <-done fires, so there is no race between workers quitting and events being enqueued.
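The drain-in-defer pattern from the resolution can be sketched as follows; the names are illustrative, and the key property is that the drain runs exactly once, after the main loop has already decided to stop.

```go
// Sketch: when done fires, writeLoop exits its loop and then drains
// whatever is still queued before returning, so queued events are never
// left behind by the shutdown race described above.
package main

import "fmt"

func writeLoop(events <-chan []byte, done <-chan struct{}, write func([]byte)) {
	// Drain remaining events after the loop exits, for any exit path.
	defer func() {
		for {
			select {
			case e := <-events:
				write(e)
			default:
				return
			}
		}
	}()
	for {
		select {
		case e := <-events:
			write(e)
		case <-done:
			return
		}
	}
}

func main() {
	events := make(chan []byte, 8)
	for i := 0; i < 5; i++ {
		events <- []byte("event")
	}
	done := make(chan struct{})
	close(done) // shutdown requested with events still queued
	written := 0
	writeLoop(events, done, func([]byte) { written++ })
	fmt.Println(written) // prints "5": all queued events still get written
}
```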
eventlog/eventlog_test.go (Outdated)

```go
// Apply a new config. Since we can't easily create real outputs
// in a test, we manually swap outputs. But we can verify that an
// unchanged config is a no-op.
```

Hmm, it does seem that we should also verify that the config changes and new events go to the new output correctly.

Verified manually with a live instance. Started with a file-only config, sent SIGHUP with a file+webhook config, and confirmed new events appeared on the webhook while the file continued receiving events.
eventlog/eventlog.go (Outdated)

```go
// Close signals the background goroutine to drain remaining events
// and stop, then closes all outputs.
func (r Recorder) Close() error {
```

Shall we test that Close drains all events properly?

TestWebhookOutput_CloseFlushesQueue and TestWebhookOutput_MultipleWorkers both verify that Close drains all queued events.
eventlog/eventlog.go (Outdated)

```go
case r.core.events <- writeRequest{data: data, eventType: eventType}:
default:
	// Queue full; drop event to avoid blocking alertmanager.
	r.core.metrics.eventsDropped.WithLabelValues(eventType).Inc()
```

Should we add a test for this unhappy path, and potentially other unhappy paths?

Added TestFileOutput_InvalidPath and TestWebhookOutput_DropsAfterMaxRetries.
silence/silence.go (Outdated)

```go
func (s *Silences) convertSilenceForEvent(sil *pb.Silence) *eventlogpb.Silence {
	matcherSets := make([]*eventlogpb.MatcherSet, len(sil.MatcherSets))
	for i, ms := range sil.MatcherSets {
		matcherSet := &eventlogpb.MatcherSet{
			Matchers: make([]*eventlogpb.Matcher, len(ms.Matchers)),
		}
		for j, m := range ms.Matchers {
			matcherSet.Matchers[j] = &eventlogpb.Matcher{
				Type:     convertMatcherType(m.Type),
				Name:     m.Name,
				Pattern:  m.Pattern,
				Rendered: renderMatcher(m),
			}
		}
		matcherSets[i] = matcherSet
	}

	var matchers []*eventlogpb.Matcher
	if len(matcherSets) > 0 {
		matchers = matcherSets[0].Matchers
	}

	return &eventlogpb.Silence{
		Id:          sil.Id,
		Matchers:    matchers,
		MatcherSets: matcherSets,
		StartsAt:    sil.StartsAt,
		EndsAt:      sil.EndsAt,
		UpdatedAt:   sil.UpdatedAt,
		CreatedBy:   sil.CreatedBy,
		Comment:     sil.Comment,
```

This does not access anything from the Silences instance s and should not be a method on it. Since both sides of the conversion are proto types, can we use generated code to handle this conversion? This should be a helper function in the eventlog (whatever we name it) package.

Done. Moved to eventrecorder.SilenceAsProto and eventrecorder.SilenceMatcherAsProto.
eventlog/eventlog.go (Outdated)

```go
type recordingDisabledContextKey struct{}

// WithRecordingDisabled returns a context that suppresses event recording.
func WithRecordingDisabled(ctx context.Context) context.Context {
	return context.WithValue(ctx, recordingDisabledContextKey{}, true)
}

// RecordingDisabled reports whether event recording has been suppressed
// in the given context via WithRecordingDisabled.
func RecordingDisabled(ctx context.Context) bool {
	v, _ := ctx.Value(recordingDisabledContextKey{}).(bool)
	return v
}
```

This reads very oddly; we should have recording disabled by default and then have an option to enable it, for example WithEventRecording().

Done. Renamed to WithEventRecording. Recording is disabled by default; callers opt in.
eventlog/eventlog.go (Outdated)

```go
// recorderCore holds the mutable state shared by all copies of a
// Recorder value. Access is protected by mu.
type recorderCore struct {
```

Maybe sharedRecorder would be a better name?
eventlog/eventlog.go (Outdated)

```go
// Output is a single event log destination. Implementations receive
// pre-serialized JSON bytes and are responsible for delivering them.
type Output interface {
	// Name returns a stable identifier for this output, suitable for
	// use as a Prometheus label value (e.g. "file:/var/log/events.jsonl"
	// or "webhook:https://example.com/hook").
	Name() string
	WriteEvent(data []byte) error
	io.Closer
}
```

Maybe call it Destination with a SendEvent() method?
SoloJacobs left a comment

General remark: it would be easier to review and merge this if we did the file.go and webhook.go implementations in a separate PR, but here we go.

a) Some of these events are extremely detailed, e.g. InhibitionMutedAlertEvent. We could construct all the relevant information in Alertmanager just from the silences, the configuration, and the alerts. If we precompute results, then this isn't really event sourcing? We already have tracing for adding observability to Alertmanager.

b) I don't think we should need any locks in an event recorder; can we just use message passing? Below is the AI's attempt (sorry, I let the AI do this; I otherwise would have had to delay this review even more):

- Locking for the cluster.peer is weird; we should simply emit an event with the position whenever peer.Position() is called. We could also keep track of that position inside the EventRecorder, but definitely not from
```go
func (r Recorder) UpdateClusterPosition(pos int) {
	if r.ch == nil {
		return
	}
	p := uint32(pos)
	select {
	case r.ch <- logRequest{clusterPos: &p}:
	default:
		// Queue full. For state updates, we might want to retry or block briefly,
		// but dropping is standard for this non-blocking design.
	}
}
```
- The EventRecorder should be able to handle config reloads just by message passing:
```go
func (c *recorderCore) writeLoop(reqs <-chan logRequest) {
	var currentClusterPos uint32 = 0
	var outputs []Output // <--- local state, completely lock-free

	// Ensure we clean up whatever is active when the process finally exits.
	defer func() {
		for _, out := range outputs {
			out.Close()
		}
	}()

	for req := range reqs {
		// --- CONTROL MESSAGE: CONFIG RELOAD ---
		if req.newOutputs != nil {
			c.logger.Info("Consumer applying new event log configuration")
			// 1. Gracefully close the old outputs.
			for _, out := range outputs {
				if err := out.Close(); err != nil {
					c.logger.Error("Failed to close old output", "err", err)
				}
			}
			// 2. Swap to the new outputs.
			outputs = req.newOutputs
			continue
		}

		// --- CONTROL MESSAGE: CLUSTER POSITION ---
		if req.clusterPos != nil {
			currentClusterPos = *req.clusterPos
			continue
		}

		// --- STANDARD DATA MESSAGE: EVENT ---
		if req.event != nil {
			// (Wrap the event, stamp the cluster position, marshal to JSON...)
			data := marshalEvent(req.event, currentClusterPos)
			// Write to the currently active outputs.
			for _, out := range outputs {
				out.WriteEvent(data)
			}
		}
	}
}
```
```go
// WebhookOutput POSTs each event as a JSON body to a configured URL.
// Events are processed by a bounded worker pool so that a slow or
// temporarily unavailable webhook does not block the event log queue.
type WebhookOutput struct {
```

Should we really call this Webhook? The name is already taken by the notifier.

While I do think it's a little confusing, I also think it's OK since it's in another package. IMO it's fairly conventional to have this kind of name duplication in Go, since the package name always prefixes the symbol name.

I agree with @Spaceman1701 that this is fine since the type is in a separate package (eventrecorder.WebhookOutput vs webhook.Notifier).
cmd/alertmanager/main.go (Outdated)

```go
	},
})
defer func() {
	eventRecorder.RecordEvent(context.Background(), &eventlogpb.EventData{
```

Can we put this explicitly at the end of the function?

So the event log always records all events, while tracing is for sampling. Technically one could sample traces at 100%, but tracing is less efficient than event recording. Event recording is more about analytics than observability.

But we could very easily address the sampling issue by using our existing logger, correct? It wasn't clear from my first review, but the actual code changes I requested were for item b). For item a), what I really need is a coherent strategy for our different logging mechanisms: when do we pick which emitter (logs, metrics, traces, event log)? How do we deal with duplicate events? This feels like the part that is not explicit here. As far as I can see, we can't really use the EventRecorder for event sourcing. That is fine, but we really need to be clear about what capabilities this feature gives users. Currently, the PR just describes what events are added, but not why.

So some background: I agree the description of the PR can be better, or be linked to issue #4948.
Spaceman1701 left a comment

Thanks for opening this PR! I agree with the general requests around cleanup (e.g. moving event construction into the eventlog package), but overall the code and behavior look good (and familiar 😅).

I also wanted to mention that we discussed this change during the last Alertmanager working group meeting. I think the biggest takeaway is that we should put this change behind an experimental feature flag now: the eventlog protos will be the first "public" protos in Alertmanager, so it would be nice to have some time to iterate on them before we have to support them as part of the stable API.
To echo @siavashs a little, the way we use the event recorder in HRT is as a separate structured event reporting system. It's read by machines to generate historical information that drives both analytics and user workflows.

The goal is to get enough events that we can do event sourcing, but I agree there are some things missing. Over time we can improve this system to export more events and provide more detail. This is also why the events sometimes include "extra" data like the peer position: while a perfect event sourcing system wouldn't need this, we practically:

(2) is the most important one, I think: in order to do "true" event sourcing to replicate the state of Alertmanager, you would need infrastructure that can guarantee you never drop an event. This is not really practical in reality, so (in my opinion) it's helpful to "duplicate" some information in each event message. I don't think having extra events breaks the model, really. If you were implementing an event sourcing system, you would just ignore the events that don't contain useful data for you. Perhaps in the future we could even support some filtering directly from the Alertmanager configuration.
Force-pushed 33a5b39 to 65f596e
📝 Walkthrough

Adds an opt-in event recording subsystem: protobuf schema, Recorder with async queue and file/webhook outputs, config and feature-flag support, and instrumentation wired into alerts, silences, inhibitions, dispatcher, and notification pipeline; startup/shutdown events and API tracing context wrapping are also added.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Subsystem as Subsystem
    participant Recorder as Recorder
    participant WriteLoop as WriteLoop
    participant Destination as Destination
    Subsystem->>Recorder: RecordEvent(ctx, EventData)
    activate Recorder
    Recorder->>Recorder: extract event_type, wrap with timestamp/instance
    Recorder->>Recorder: protojson.Marshal + append newline
    Recorder->>Recorder: enqueue bytes (non-blocking)
    deactivate Recorder
    WriteLoop->>Recorder: drain queue
    activate WriteLoop
    WriteLoop->>Destination: SendEvent(serialized bytes)
    activate Destination
    Destination->>Destination: write/POST (file or HTTP), handle retries/reopen/drops
    Destination-->>WriteLoop: success / error
    deactivate Destination
    WriteLoop->>WriteLoop: update metrics, apply config updates
    deactivate WriteLoop
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~50 minutes
Actionable comments posted: 12

Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
dispatch/dispatch.go (2)

Lines 819-838 (⚠️ Potential issue, 🟠 Major): Record resolved events only after the delete confirms the transition.

resolvedSlice is only tentative here. If an alert fires again during the notification flush, DeleteIfNotModified will keep it in the store, but the current ordering has already emitted AlertResolvedEvent. Please move recording after the delete and only emit for fingerprints that are actually gone.

Possible fix:

```diff
 if notify(alertsSlice...) {
-	ag.recordResolvedEvents(resolvedSlice)
-
 	// Delete all resolved alerts as we just sent a notification for them,
 	// and we don't want to send another one. However, we need to make sure
 	// that each resolved alert has not fired again during the flush as then
 	// we would delete an active alert thinking it was resolved.
 	// Since we are passing DestroyIfEmpty=true the group will be marked as
 	// destroyed if there are no more alerts after the deletion.
 	if err := ag.alerts.DeleteIfNotModified(resolvedSlice, true); err != nil {
 		ag.logger.Error("error on delete alerts", "err", err)
 	} else {
+		actuallyResolved := resolvedSlice[:0]
 		// Delete markers for resolved alerts that are not in the store.
 		for _, alert := range resolvedSlice {
 			_, err := ag.alerts.Get(alert.Fingerprint())
 			if errors.Is(err, store.ErrNotFound) {
 				ag.marker.Delete(alert.Fingerprint())
+				actuallyResolved = append(actuallyResolved, alert)
 			}
 		}
+		ag.recordResolvedEvents(actuallyResolved)
 	}
 }
```
Lines 771-781 (⚠️ Potential issue, 🟠 Major): Only emit grouped events on first membership.

ag.alerts.Set also succeeds when an existing alert is refreshed in the same aggregation group, so this will emit AlertGroupedEvent for every update of a long-lived alert. The event needs a real "new to group" check, not just "Set returned nil".

inhibit/inhibit.go (1)
Lines 210-221 (⚠️ Potential issue, 🟠 Major): Only emit inhibition events on the state transition.

Mutes is re-evaluated on every notification attempt, so this unconditional RecordEvent will enqueue the same InhibitionMutedAlertEvent on every flush while the alert stays inhibited. Please gate recording on the marker actually changing to the inhibited state instead of every successful lookup.
🧹 Nitpick comments (2)

silence/silence_test.go (1)

Lines 646-647: Assert the added result in TestSilencesSetSilence. This test validates a create path; it should assert added == true so a non-error no-op merge doesn't pass unnoticed.

✅ Suggested assertion:

```diff
-	_, err := s.setSilence(s.toMeshSilence(sil), nowpb)
+	added, err := s.setSilence(s.toMeshSilence(sil), nowpb)
 	require.NoError(t, err)
+	require.True(t, added)
```
280-282: Guard muted-alert event construction on the recording flag.
RecordEventbails out when recording is disabled, but by this point the silence and label payloads are already fully materialized.Silencer.Mutesis on a hot path, so a cheapif eventrecorder.EventRecordingEnabled(ctx)around this constructor would avoid unnecessary allocations when the recorder is off.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@silence/silence.go` around lines 280 - 282, Wrap the construction and call to s.recorder.RecordEvent(...) with a cheap guard using eventrecorder.EventRecordingEnabled(ctx) to avoid allocating the silence and label payloads when recording is disabled; specifically, in the Silencer.Mutes hot path, check eventrecorder.EventRecordingEnabled(ctx) before calling eventrecorder.NewSilenceMutedAlertEvent(...) and s.recorder.RecordEvent(ctx, ...), so the NewSilenceMutedAlertEvent(...) invocation and its arguments (including eventrecorder.SilenceAsProto(sil), fp, lset) are not created unless recording is enabled.
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In api/v2/api.go (around lines 332-333): deleteSilenceHandler currently starts its trace span from the raw request context, so event recording is skipped. Update the span creation to use eventrecorder.WithEventRecording(params.HTTPRequest.Context()), just like postAlertsHandler and postSilencesHandler do (i.e., call tracer.Start(eventrecorder.WithEventRecording(params.HTTPRequest.Context()), "api.deleteSilenceHandler") and defer span.End()), so API-driven silence deletions participate in event recording.

In config/config.go (around lines 293-296): the URL field in the webhook output struct currently uses *commoncfg.URL, which can expose credentials during serialization. Change the field to the secret wrapper (e.g., *commoncfg.SecretURL or the project's equivalent secret URL type) and keep the existing field tags (yaml:"url,omitempty" json:"url,omitempty") so that serialization masks or omits sensitive parts. Also search for usages of the URL field (constructors, parsers, validators) and update any places that instantiate or read it to use the SecretURL accessor methods (unwrap/secret getters) so runtime behavior remains the same.

In config/coordinator.go (around lines 150-170): in ApplyConfig, validate that the incoming conf is not nil at the top of the function and return a clear error instead of proceeding. Specifically, before assigning c.config = conf or calling c.notifySubscribers(), check if conf == nil, log an error (including context like c.configFilePath), set c.configSuccessMetric.Set(0), and return an error to avoid the subsequent panic when reading c.config.original. Keep the existing notifySubscribers and success-metric updates unchanged otherwise.

In eventrecorder/eventrecorder.go (around lines 341-376): the current eventRecorderConfigEqual function omits HTTPConfig, so changes to TLS/auth/proxy/headers don't trigger rebuilds. Update eventRecorderConfigEqual (used to compare config.EventRecorderConfig Outputs) to include a comparison of each output's HTTPConfig, either via a deep/equality check (e.g., reflect.DeepEqual on oa.HTTPConfig vs ob.HTTPConfig) or by
explicitly comparing relevant fields (CA/cert/key, client cert/key, bearer
token/basic auth, proxy URL, custom headers, insecure skip verify, timeouts)
while handling nil pointers, and return false if any HTTPConfig differences are
found.
- Around line 425-430: The iteration over a Go map in LabelSetAsProto yields
non-deterministic ordering; fix by collecting all keys from ls,
sort.Strings(keys), then iterate the sorted keys to build pairs so Labels are
consistently ordered; apply the same pattern to the corresponding proto->model
helper around lines 557-566 (the LabelSet proto conversion function) so both
serialization and deserialization preserve deterministic ordering.
In `@eventrecorder/file.go`:
- Around line 58-61: NewFileOutput currently starts watchLoop(ready) and waits
on ready, but watchLoop swallows startup failures; modify watchLoop to report
startup errors back (e.g., change ready to a struct or use a separate error
channel like readyErr chan error) and have watchLoop send nil on success or an
error on failure; in NewFileOutput, wait for that response and return (nil, err)
when watchLoop reports a startup error (or the error from directory
registration/abs path resolution), ensuring NewFileOutput fails fast instead of
returning a healthy-looking fo; update any other similar call sites (lines
83-110 behavior) to use the same ready/readyErr pattern so watcher-start errors
are propagated.
- Around line 145-150: Close currently signals fo.done but doesn't wait for the
watcher goroutine (watchLoop) to exit, allowing reopen to run after Close and
leak/keep the file open; fix by adding a sync.WaitGroup (or similar) to
FileOutput, call wg.Add(1) when spawning watchLoop and defer wg.Done() inside
it, then change FileOutput.Close to close(fo.done) followed by wg.Wait() before
acquiring fo.mu and closing fo.f; additionally guard reopen so it no-ops if
Close has completed (e.g., set a mu-protected closed bool or use sync.Once) to
ensure reopen cannot run after Close has finished.
In `@eventrecorder/webhook.go`:
- Around line 91-100: The code uses cfg.URL.String() (assigned to urlStr) as the
WebhookOutput.url and in labels/logs (e.g., dropsCounter.WithLabelValues and
Name()), which can leak secrets; replace use of the raw URL with a sanitized or
opaque identifier: compute a safeID (for example, a host-only or hashed
representation derived from cfg.URL without userinfo, path, or query) and set
WebhookOutput.url (or add a new field like id) to that safeID, then use that
safeID in dropsCounter.WithLabelValues(fmt.Sprintf("webhook:%s", safeID)), in
Name(), and in any warning logs; ensure the original cfg.URL is only used
internally for the actual request and never logged or used in metrics.
- Around line 150-163: postWithRetry is blocking shutdown because retries (calls
to post, client.Timeout waits and time.Sleep backoffs) don't observe the
WebhookOutput shutdown signal; modify postWithRetry (and post) to accept and use
a cancelable context (or watch the existing done channel) so each attempt and
any backoff sleeps select on ctx.Done() (or done) and return early; ensure
Close/ApplyConfig cancels that context (or closes done) and that any flush path
enforces an overall deadline so waits won't block shutdown indefinitely; update
references to postWithRetry, post, done, maxRetries, retryBackoff, and
maxBackoff accordingly.
- Around line 120-127: SendEvent currently always returns nil even when the
non-blocking send to the wo.work channel fails, causing upstream sendToOutputs
to treat dropped events as successful; change SendEvent to return a non-nil
error when the default branch is taken (i.e., when the queue is full).
Specifically, in SendEvent (function name: SendEvent, channel: wo.work, metric:
wo.drops, logger: wo.logger) increment wo.drops and log the drop as you do now,
but also return an error (e.g., errors.New("webhook queue full")) so
sendToOutputs sees the failure and records the write as unsuccessful. Ensure the
successful enqueue path still returns nil.
In `@provider/mem/mem.go`:
- Around line 126-127: NewAlerts currently doesn't ensure the Recorder field is
non-nil, and calls to a.recorder.RecordEvent (e.g., in Alerts methods referenced
by NewAlerts and at the call site a.recorder.RecordEvent) will panic if nil;
update NewAlerts to guard/default the recorder (e.g., accept a nil recorder and
replace it with a no-op/fake recorder) or add nil checks before calling
a.recorder.RecordEvent in the Alerts methods so they skip recording when
recorder == nil, referencing NewAlerts and the a.recorder.RecordEvent calls to
locate where to apply the fix.
In `@silence/silence.go`:
- Around line 817-828: The setSilence function currently ignores the first
return value from s.st.merge(msil, now) so callers can't know if the merge
actually changed state; change the call to capture the "changed" (or "merged")
boolean returned by s.st.merge and use that flag to decide whether to run
s.indexSilence(msil.Silence), s.updateSizeMetrics(), and s.broadcast(b), then
return that changed flag (or include it in the returned added/changed values)
from setSilence; apply the same pattern to the analogous code around lines
861-867 so event emission and broadcasts only occur when s.st.merge reports a
real change.
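The HTTPConfig comparison suggested for `eventrecorder.go` above can be sketched with a deep-equality check. This is a minimal, self-contained illustration: the `outputConfig` and `httpConfig` types and their fields are hypothetical stand-ins, not Alertmanager's actual structs.

```go
package main

import (
	"fmt"
	"reflect"
)

// httpConfig is an illustrative stand-in for an output's HTTP client settings.
type httpConfig struct {
	BearerToken        string
	ProxyURL           string
	InsecureSkipVerify bool
}

// outputConfig is an illustrative stand-in for one event recorder output.
type outputConfig struct {
	Type       string
	URL        string
	HTTPConfig *httpConfig
}

// outputsEqual compares two output slices including HTTPConfig.
// reflect.DeepEqual treats two nil pointers as equal and otherwise
// compares the pointed-to values field by field, so nil handling is free.
func outputsEqual(a, b []outputConfig) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Type != b[i].Type || a[i].URL != b[i].URL {
			return false
		}
		if !reflect.DeepEqual(a[i].HTTPConfig, b[i].HTTPConfig) {
			return false
		}
	}
	return true
}

func main() {
	a := []outputConfig{{Type: "webhook", URL: "http://x", HTTPConfig: &httpConfig{BearerToken: "t1"}}}
	b := []outputConfig{{Type: "webhook", URL: "http://x", HTTPConfig: &httpConfig{BearerToken: "t2"}}}
	fmt.Println(outputsEqual(a, a)) // true
	fmt.Println(outputsEqual(a, b)) // false: bearer token differs
}
```

With this shape, a changed TLS/auth/proxy setting makes the comparison return false, which is exactly what triggers the recorder rebuild the review asks for.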
---
Outside diff comments:
In `@dispatch/dispatch.go`:
- Around line 819-838: The current code calls
ag.recordResolvedEvents(resolvedSlice) before confirming the alerts were
actually removed; move the record/emit step to after the call to
ag.alerts.DeleteIfNotModified and only record/emit events for those fingerprints
that were confirmed deleted (i.e., for each alert in resolvedSlice, call
ag.alerts.Get(alert.Fingerprint()) and if errors.Is(err, store.ErrNotFound) then
call ag.recordResolvedEvents (or record per-fingerprint) and ag.marker.Delete);
adjust notify/DeleteIfNotModified handling so you do not emit resolved events
for alerts that remained in the store.
- Around line 771-781: ag.alerts.Set currently succeeds for both new and
refreshed alerts, so change the logic around ag.alerts.Set in dispatch.go to
only call ag.recorder.RecordEvent(notify.NewAlertGroupedEvent(...)) when the
alert is actually new to the aggregation group: perform a pre- or post-check of
membership (e.g., call a membership helper such as ag.alerts.Exists/IsMember or
change Set to return a boolean like isNew) and only call
ag.recorder.RecordEvent(ctx, notify.NewAlertGroupedEvent(ag.alertGroupInfo(),
alert)) when that check indicates the alert was newly added to the group; leave
the error handling path and span logging unchanged.
In `@inhibit/inhibit.go`:
- Around line 210-221: Only record the InhibitionMutedAlertEvent when the marker
actually transitions to inhibited: change the logic around
ih.marker.SetInhibited(fp, ...) so you detect a state change (either by using a
returned bool from SetInhibited indicating it newly became inhibited, or by
checking prior state with ih.marker.IsInhibited(fp) and then calling
SetInhibited) and only call ih.recorder.RecordEvent(...
eventrecorder.NewInhibitionMutedAlertEvent(...)) when that transition occurred;
keep the span.AddEvent behavior as needed but gate the recorder.RecordEvent
behind the transition check.
---
Nitpick comments:
In `@silence/silence_test.go`:
- Around line 646-647: The test call to s.setSilence ignores the boolean that
indicates whether the silence was added; update the call in
TestSilencesSetSilence to capture the returned added value from
s.setSilence(s.toMeshSilence(sil), nowpb) and add an assertion that added is
true (e.g., require.True(t, added) or require.Equal(t, true, added)) so the
create path is verified and a no-op merge won't be mistaken for a successful
add.
In `@silence/silence.go`:
- Around line 280-282: Wrap the construction and call to
s.recorder.RecordEvent(...) with a cheap guard using
eventrecorder.EventRecordingEnabled(ctx) to avoid allocating the silence and
label payloads when recording is disabled; specifically, in the Silencer.Mutes
hot path, check eventrecorder.EventRecordingEnabled(ctx) before calling
eventrecorder.NewSilenceMutedAlertEvent(...) and s.recorder.RecordEvent(ctx,
...), so the NewSilenceMutedAlertEvent(...) invocation and its arguments
(including eventrecorder.SilenceAsProto(sil), fp, lset) are not created unless
recording is enabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc9b1c06-0883-4e86-8b27-cc9ea664bc9e
⛔ Files ignored due to path filters (2)
- eventrecorder/eventrecorderpb/eventrecorder.pb.go is excluded by !**/*.pb.go
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
api/v2/api.go, buf.yaml, cmd/alertmanager/main.go, config/config.go, config/coordinator.go, dispatch/dispatch.go, dispatch/dispatch_bench_test.go, dispatch/dispatch_test.go, eventrecorder/eventrecorder.go, eventrecorder/eventrecorder_test.go, eventrecorder/eventrecorderpb/eventrecorder.proto, eventrecorder/file.go, eventrecorder/file_test.go, eventrecorder/webhook.go, eventrecorder/webhook_test.go, featurecontrol/featurecontrol.go, go.mod, inhibit/inhibit.go, inhibit/inhibit_bench_test.go, inhibit/inhibit_test.go, notify/event.go, notify/mute_test.go, notify/notify.go, notify/notify_test.go, provider/mem/mem.go, provider/mem/mem_test.go, silence/silence.go, silence/silence_bench_test.go, silence/silence_test.go
func (wo *WebhookOutput) SendEvent(data []byte) error {
	select {
	case wo.work <- data:
	default:
		wo.drops.Inc()
		wo.logger.Warn("Event recorder webhook queue full, dropping event", "url", wo.url)
	}
	return nil
}
Webhook drops are reported upstream as successful writes.
This returns nil immediately—even when the queue is full, and before any HTTP delivery attempt. sendToOutputs in eventrecorder/eventrecorder.go treats that as "success" and increments alertmanager_event_recorder_bytes_written_total, so local drops and later retry exhaustion are invisible in the recorder's main metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/webhook.go` around lines 120 - 127, SendEvent currently always
returns nil even when the non-blocking send to the wo.work channel fails,
causing upstream sendToOutputs to treat dropped events as successful; change
SendEvent to return a non-nil error when the default branch is taken (i.e., when
the queue is full). Specifically, in SendEvent (function name: SendEvent,
channel: wo.work, metric: wo.drops, logger: wo.logger) increment wo.drops and
log the drop as you do now, but also return an error (e.g., errors.New("webhook
queue full")) so sendToOutputs sees the failure and records the write as
unsuccessful. Ensure the successful enqueue path still returns nil.
func (wo *WebhookOutput) postWithRetry(data []byte) {
	for attempt := range wo.maxRetries {
		err := wo.post(data)
		if err == nil {
			return
		}
		wo.logger.Warn("Event recorder webhook POST failed", "url", wo.url, "attempt", attempt+1, "err", err)
		if attempt < wo.maxRetries-1 {
			backoff := min(wo.retryBackoff<<attempt, wo.maxBackoff)
			time.Sleep(backoff)
		}
	}
	wo.logger.Error("Event recorder webhook POST failed after retries, dropping event", "url", wo.url, "retries", wo.maxRetries)
}
Close can hang reloads and shutdown behind retry backlog.
After Line 183 closes done, workers still run postWithRetry, including client.Timeout waits and time.Sleep backoff, until the queue is drained. With a full queue, Recorder.ApplyConfig or process shutdown can stall for a very long time. Please make retries cancelable on close or bound the flush deadline.
Also applies to: 181-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/webhook.go` around lines 150 - 163, postWithRetry is blocking
shutdown because retries (calls to post, client.Timeout waits and time.Sleep
backoffs) don't observe the WebhookOutput shutdown signal; modify postWithRetry
(and post) to accept and use a cancelable context (or watch the existing done
channel) so each attempt and any backoff sleeps select on ctx.Done() (or done)
and return early; ensure Close/ApplyConfig cancels that context (or closes done)
and that any flush path enforces an overall deadline so waits won't block
shutdown indefinitely; update references to postWithRetry, post, done,
maxRetries, retryBackoff, and maxBackoff accordingly.
recorder eventrecorder.Recorder,
r prometheus.Registerer,
Guard against nil recorder to prevent runtime panic.
a.recorder.RecordEvent(...) (Line 356) assumes a non-nil recorder, but NewAlerts(...) does not enforce/default it. A nil value will panic on first new alert.
🛡️ Suggested fix
func NewAlerts(
ctx context.Context,
m types.AlertMarker,
intervalGC time.Duration,
perAlertNameLimit int,
alertCallback AlertStoreCallback,
l *slog.Logger,
recorder eventrecorder.Recorder,
r prometheus.Registerer,
flagger featurecontrol.Flagger,
) (*Alerts, error) {
+ if recorder == nil {
+ recorder = eventrecorder.NopRecorder()
+ }
+
if alertCallback == nil {
alertCallback = noopCallback{}
}Also applies to: 151-152, 355-357
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@provider/mem/mem.go` around lines 126 - 127, NewAlerts currently doesn't
ensure the Recorder field is non-nil, and calls to a.recorder.RecordEvent (e.g.,
in Alerts methods referenced by NewAlerts and at the call site
a.recorder.RecordEvent) will panic if nil; update NewAlerts to guard/default the
recorder (e.g., accept a nil recorder and replace it with a no-op/fake recorder)
or add nil checks before calling a.recorder.RecordEvent in the Alerts methods so
they skip recording when recorder == nil, referencing NewAlerts and the
a.recorder.RecordEvent calls to locate where to apply the fix.
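The nil-guard pattern from the suggested fix can be sketched in isolation. The `Recorder` interface and `nopRecorder` type here are minimal stand-ins for the eventrecorder API, under the assumption that a no-op implementation exists (as `eventrecorder.NopRecorder()` in the suggested fix).

```go
package main

import "fmt"

// Recorder is a minimal stand-in for the event recorder interface.
type Recorder interface {
	RecordEvent(event string)
}

// nopRecorder discards every event; a safe default when recording is off.
type nopRecorder struct{}

func (nopRecorder) RecordEvent(string) {}

// newAlerts sketches the constructor guard: a nil recorder is replaced
// with a no-op so later RecordEvent calls cannot panic.
func newAlerts(recorder Recorder) Recorder {
	if recorder == nil {
		recorder = nopRecorder{}
	}
	return recorder
}

func main() {
	r := newAlerts(nil)
	r.RecordEvent("first alert") // no panic, silently dropped
	fmt.Println("ok")
}
```

Defaulting once in the constructor is preferable to sprinkling `if recorder != nil` at every call site: the hot paths stay branch-free and the invariant lives in one place.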
Addressed all review feedback since last round:
Renamed the entire feature to
Done. Adopted the message passing approach you suggested.
Done. Gated behind
Force-pushed 65f596e to ac7b8f5.
Addressed feedback from @coderabbitai:
Actionable comments posted: 2
🧹 Nitpick comments (6)
eventrecorder/eventrecorder_test.go (1)
111-121: Cover the rest of the zero-value recorder API.
`cmd/alertmanager/main.go` now also calls `ApplyConfig(...)` and `SetClusterPeer(...)` on a disabled recorder, but the no-op contract is only locked in for `RecordEvent(...)` and `Close()` here. Adding explicit smoke tests for those methods would catch startup regressions when the feature flag is off.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder_test.go` around lines 111 - 121, extend the tests to exercise the rest of the no-op recorder API by calling ApplyConfig(...) and SetClusterPeer(...) on both the NopRecorder() instance and the zero-value Recorder variable; ensure these calls do not panic and, where applicable, return nil/error as expected, so add assertions in TestNopRecorderDoesNotPanic and TestZeroRecorderDoesNotPanic that invoke ApplyConfig and SetClusterPeer with representative arguments and assert safe no-op behavior.
eventrecorder/eventrecorder.go (1)
91-91: Fix formatting issue.
Static analysis (gofumpt) reports improper formatting at this line. This is likely a whitespace or blank line issue within the struct definition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder.go` at line 91, the struct field declaration for the instance field has improper spacing; locate the struct containing the field named "instance" in eventrecorder.go and normalize the spacing so the field reads with a single space between name and type (e.g., "instance string"), then run gofmt/gofumpt to ensure formatting passes static analysis.
eventrecorder/webhook_test.go (1)
80-82: Use `require.JSONEq` for comparing JSON payloads.
String equality comparison is fragile for JSON because field ordering may vary. The static analysis tool correctly identifies this.
🔧 Proposed fix
  mu.Lock()
- require.Equal(t, `{"test":"data"}`, string(received[0]))
+ require.JSONEq(t, `{"test":"data"}`, string(received[0]))
  mu.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/webhook_test.go` around lines 80 - 82, replace the fragile string equality assertion with a JSON-aware comparison: instead of require.Equal(t, `{"test":"data"}`, string(received[0])) use require.JSONEq(t, `{"test":"data"}`, string(received[0])) so the test (in webhook_test.go where mu and received are used) validates JSON semantics regardless of key ordering; keep the existing mu.Lock()/mu.Unlock() around the read of received[0].
silence/silence.go (1)
833-833: Remove extra blank line.
Static analysis (gofumpt) reports improper formatting. There's an extra blank line here.
🔧 Proposed fix
  return changed, added, nil
}
-
// Set the specified silence. If a silence with the ID already exists and the modification
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@silence/silence.go` at line 833, remove the extraneous blank line reported by gofumpt by deleting the empty line so the surrounding declaration/block is compact, ensuring the enclosing function or declaration follows gofumpt style.
eventrecorder/webhook.go (1)
140-148: `SendEvent` returns `nil` even when dropping events.
When the queue is full, `SendEvent` increments the drop counter and logs a warning but returns `nil`. This means `sendToOutputs` in `eventrecorder.go` will count the event as successfully recorded.
This appears to be intentional: the webhook has its own drop metric (`alertmanager_event_webhook_drops_total`), and the main recorder metrics track queue acceptance rather than final delivery. If you want the main metrics to reflect webhook queue saturation, consider returning an error here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/webhook.go` around lines 140 - 148, SendEvent currently swallows queue-full drops by always returning nil; change it so the default branch returns an error when the event is dropped (instead of nil) so callers like sendToOutputs can detect rejection. Locate SendEvent (method on WebhookOutput) and in the select's default case, after incrementing wo.drops and logging via wo.logger.Warn, return a descriptive error (e.g., fmt.Errorf or a package-level sentinel error) indicating the webhook queue is full; keep the success path returning nil when sending to wo.work succeeds.
eventrecorder/eventrecorderpb/eventrecorder.proto (1)
1-8: Address protobuf package versioning across all proto files.
Buf lint (STANDARD rules) reports that all protobuf packages should include version suffixes (e.g., `eventrecorder.v1`). This is a best practice for API evolution and applies repo-wide:
- eventrecorder → eventrecorder.v1
- clusterpb → clusterpb.v1
- nflogpb → nflogpb.v1
- silencepb → silencepb.v1
Additionally, buf reports directory mismatches for all packages, though these may be reconcilable with the current buf.yaml module configuration.
If adopting version suffixes, coordinate across all four proto modules in a single change. If these are intentional project decisions, add lint rule exceptions to `buf.yaml` to document the rationale.
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorderpb/eventrecorder.proto` around lines 1 - 8, Update the protobuf package naming to include a version suffix and keep Go package mapping in sync: change package eventrecorder to eventrecorder.v1 in eventrecorder.proto and update option go_package to include the v1 suffix (e.g., github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb/v1 or matching repo convention); ensure all other proto modules (clusterpb, nflogpb, silencepb) are changed together to use their .v1 package names and corresponding go_package paths, and update any inter-proto import references to the new package paths; if you intentionally do not want versioned packages, add explicit lint exceptions in buf.yaml documenting the decision instead of changing proto packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/config.go`:
- Around line 293-296: The struct field alignment in config.go is misformatted;
run gofumpt (or remove the extra spaces) to fix spacing so fields align with
gofumpt style—specifically adjust the fields Type, Path, URL, and HTTPConfig in
the output struct so the `URL` line does not contain extra alignment spaces and
matches the formatting of the other tags.
In `@notify/notify.go`:
- Line 1043: The code eagerly constructs a NotificationEvent via
NewNotificationEvent(ctx, sent, r.integration) even when recording is disabled;
change the call site in the emitter where r.recorder.RecordEvent(ctx,
NewNotificationEvent(...)) is used to first check r.recorder.Enabled(ctx) (or a
similar cheap boolean) and only call NewNotificationEvent when enabled, or wrap
event creation in a lazy helper like buildNotificationEventIfEnabled(ctx, sent)
that returns nil when recording is off; apply the same pattern to the other
emitters that call r.recorder.RecordEvent so we avoid allocating/copying the
protobuf payload on the default (disabled) path.
---
Nitpick comments:
In `@eventrecorder/eventrecorder_test.go`:
- Around line 111-121: Extend the tests to exercise the rest of the no-op
recorder API by calling ApplyConfig(...) and SetClusterPeer(...) on both the
NopRecorder() instance and the zero-value Recorder variable; ensure these calls
do not panic and, where applicable, return nil/error as expected (e.g., Validate
that calling rec.ApplyConfig(... ) and rec.SetClusterPeer(...) on Recorder and
on NopRecorder() completes without error or panic), so add assertions in
TestNopRecorderDoesNotPanic and TestZeroRecorderDoesNotPanic that invoke
ApplyConfig and SetClusterPeer with representative arguments and assert safe
no-op behavior.
In `@eventrecorder/eventrecorder.go`:
- Line 91: The struct field declaration for the instance field has improper
spacing; locate the struct containing the field named "instance" in
eventrecorder.go and normalize the spacing so the field reads with a single
space between name and type (e.g., "instance string"), then run gofmt/gofumpt to
ensure formatting passes static analysis.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 1-8: Update the protobuf package naming to include a version
suffix and keep Go package mapping in sync: change package eventrecorder to
eventrecorder.v1 in eventrecorder.proto and update option go_package to include
the v1 suffix (e.g.,
github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb/v1 or matching
repo convention); ensure all other proto modules (clusterpb, nflogpb, silencepb)
are changed together to use their .v1 package names and corresponding go_package
paths, and update any inter-proto import references to the new package paths; if
you intentionally do not want versioned packages, add explicit lint exceptions
in buf.yaml documenting the decision instead of changing proto packages.
In `@eventrecorder/webhook_test.go`:
- Around line 80-82: Replace the fragile string equality assertion with a
JSON-aware comparison: instead of require.Equal(t, `{"test":"data"}`,
string(received[0])) use require.JSONEq(t, `{"test":"data"}`,
string(received[0])) so the test (in webhook_test.go where mu and received are
used) validates JSON semantics regardless of key ordering; keep the existing
mu.Lock()/mu.Unlock() around the read of received[0].
In `@eventrecorder/webhook.go`:
- Around line 140-148: SendEvent currently swallows queue-full drops by always
returning nil; change it so the default branch returns an error when the event
is dropped (instead of nil) so callers like sendToOutputs can detect rejection.
Locate SendEvent (method on WebhookOutput) and in the select's default case,
after incrementing wo.drops and logging via wo.logger.Warn, return a descriptive
error (e.g., fmt.Errorf or a package-level sentinel error) indicating the
webhook queue is full; keep the success path returning nil when sending to
wo.work succeeds.
In `@silence/silence.go`:
- Line 833: Remove the extraneous blank line reported by gofumpt in silence.go
by deleting the empty line so the surrounding declaration/block is compact
(i.e., remove the extra blank line between the preceding statement and the next
token in the function or declaration where the report appears), ensuring the
function/method or declaration formatting (the enclosing function/block) follows
gofumpt style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dfc390ab-cce4-4be5-911e-dcac8a620540
⛔ Files ignored due to path filters (2)
- eventrecorder/eventrecorderpb/eventrecorder.pb.go is excluded by !**/*.pb.go
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
api/v2/api.go, buf.yaml, cmd/alertmanager/main.go, config/config.go, config/coordinator.go, dispatch/dispatch.go, dispatch/dispatch_bench_test.go, dispatch/dispatch_test.go, eventrecorder/eventrecorder.go, eventrecorder/eventrecorder_test.go, eventrecorder/eventrecorderpb/eventrecorder.proto, eventrecorder/file.go, eventrecorder/file_test.go, eventrecorder/webhook.go, eventrecorder/webhook_test.go, featurecontrol/featurecontrol.go, go.mod, inhibit/inhibit.go, inhibit/inhibit_bench_test.go, inhibit/inhibit_test.go, notify/event.go, notify/mute_test.go, notify/notify.go, notify/notify_test.go, provider/mem/mem.go, provider/mem/mem_test.go, silence/silence.go, silence/silence_bench_test.go, silence/silence_test.go
✅ Files skipped from review due to trivial changes (4)
- buf.yaml
- go.mod
- dispatch/dispatch_bench_test.go
- api/v2/api.go
🚧 Files skipped from review as they are similar to previous changes (9)
- notify/mute_test.go
- inhibit/inhibit_bench_test.go
- silence/silence_bench_test.go
- eventrecorder/file_test.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit_test.go
- eventrecorder/file.go
- dispatch/dispatch_test.go
- provider/mem/mem.go
	l.Info("Notify success")
}

r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration))
Avoid eager NotificationEvent construction on the default path.
NewNotificationEvent(ctx, sent, r.integration) is built before RecordEvent(...) gets a chance to notice that recording is disabled. Because cmd/alertmanager/main.go threads a zero-value recorder through notify.NewPipelineBuilder(...) when the feature flag is off, every successful notification still pays to allocate/copy the protobuf payload even though it will be dropped. Please add a cheap enabled check or a lazy builder helper before constructing the event, and reuse that pattern in the other emitters following the same shape.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notify/notify.go` at line 1043, The code eagerly constructs a
NotificationEvent via NewNotificationEvent(ctx, sent, r.integration) even when
recording is disabled; change the call site in the emitter where
r.recorder.RecordEvent(ctx, NewNotificationEvent(...)) is used to first check
r.recorder.Enabled(ctx) (or a similar cheap boolean) and only call
NewNotificationEvent when enabled, or wrap event creation in a lazy helper like
buildNotificationEventIfEnabled(ctx, sent) that returns nil when recording is
off; apply the same pattern to the other emitters that call
r.recorder.RecordEvent so we avoid allocating/copying the protobuf payload on
the default (disabled) path.
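The lazy-construction pattern the review asks for can be sketched as a cheap enabled check that gates event allocation. All names here (`recorder`, `emit`, `buildEvent`) are illustrative stand-ins; the real code would check the recorder/feature flag and call `NewNotificationEvent` only when recording is on.

```go
package main

import "fmt"

type event struct{ payload string }

type recorder struct{ enabled bool }

func (r *recorder) Enabled() bool { return r.enabled }

func (r *recorder) RecordEvent(e *event) {}

// builds counts payload constructions so the test below can prove the
// disabled path allocates nothing.
var builds int

// buildEvent stands in for NewNotificationEvent: an expensive copy into
// a protobuf-like payload.
func buildEvent() *event {
	builds++
	return &event{payload: "expensive protobuf copy"}
}

// emit guards construction behind the cheap boolean, so the default
// (recording disabled) path never pays for the payload.
func emit(r *recorder) {
	if !r.Enabled() {
		return
	}
	r.RecordEvent(buildEvent())
}

func main() {
	emit(&recorder{enabled: false})
	fmt.Println(builds) // 0: nothing allocated on the disabled path
	emit(&recorder{enabled: true})
	fmt.Println(builds) // 1
}
```

An equivalent shape is to pass `RecordEvent` a `func() *event` builder and let the recorder decide whether to invoke it; either way the allocation moves behind the flag.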
Force-pushed ac7b8f5 to fad8b84.
Addressed feedback from @coderabbitai:
Accepted — all items look good:
No remaining blockers from my side. The PR looks well-structured and the implementation is solid.
Actionable comments posted: 4
🧹 Nitpick comments (3)
eventrecorder/file.go (1)
146-151: Consider checking `fo.closed` before writing in `SendEvent`.
`SendEvent` doesn't check `fo.closed` before attempting to write. If called after `Close()`, it will return an error from `fo.f.Write()` on the closed file handle. While this works (the error is returned), checking `fo.closed` first would provide a clearer error message like "output closed" rather than a low-level file error.
♻️ Optional: Add closed check
  func (fo *FileOutput) SendEvent(data []byte) error {
  	fo.mu.Lock()
  	defer fo.mu.Unlock()
+ 	if fo.closed {
+ 		return errors.New("file output closed")
+ 	}
  	_, err := fo.f.Write(data)
  	return err
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/file.go` around lines 146 - 151, SendEvent should check the FileOutput.closed flag under the same mutex before attempting to write so callers get a clear "output closed" error instead of a low-level file error; inside SendEvent (which already locks fo.mu), if fo.closed is true return a descriptive error (or a package-level var like ErrOutputClosed) and only call fo.f.Write when not closed. Ensure the error type/message matches any existing Close() behavior and tests.
notify/event.go (1)
66-73: Sort muted alert hashes before appending them.
MutedAlerts(ctx)is a map, so this produces nondeterministicMutedAlertsordering. Two otherwise identical notification events can end up with different JSON, which makes downstream diffing and deduplication harder.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/event.go` around lines 66 - 73, extractMutedGroupedAlerts currently iterates over the map returned by MutedAlerts(ctx) yielding nondeterministic order; fix by collecting the map keys into a slice, sort them (e.g., sort.Strings), then iterate the sorted slice to append &eventrecorderpb.GroupedAlert{Hash: hash} to the result so the output is deterministic; update extractMutedGroupedAlerts to use this sorted-key approach while keeping the same return type.eventrecorder/eventrecorder_test.go (1)
201-212: Assert the sorted order directly here.
LabelSetAsProtopromises deterministic ordering, but this test collapses the result back into a map, so it would still pass if the helper regressed to Go's map iteration order. Please assertproto.Labelsin order.♻️ Possible assertion update
require.Len(t, proto.Labels, 2) - found := map[string]string{} - for _, lp := range proto.Labels { - found[lp.Key] = lp.Value - } - require.Equal(t, "bar", found["foo"]) - require.Equal(t, "qux", found["baz"]) + require.Equal(t, "baz", proto.Labels[0].Key) + require.Equal(t, "qux", proto.Labels[0].Value) + require.Equal(t, "foo", proto.Labels[1].Key) + require.Equal(t, "bar", proto.Labels[1].Value)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder_test.go` around lines 201 - 212, The test collapses proto.Labels into a map which masks ordering regressions; update TestLabelSetAsProto to assert the deterministic ordering promised by LabelSetAsProto by checking proto.Labels directly (e.g., verify proto.Labels[0].Key/Value and proto.Labels[1].Key/Value or compare the slice to an expected ordered slice) instead of converting to a map, referencing LabelSetAsProto and proto.Labels for locating the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@eventrecorder/eventrecorder.go`:
- Around line 252-260: The code currently closes the live outputs before
building the replacement, risking loss of all outputs if buildOutputs fails;
change the logic in the reload branch (where
eventRecorderConfigEqual(update.cfg, currentCfg) is checked) to first call
buildOutputs(update.cfg.Outputs, c.metrics, c.logger) into a newOutputs slice,
validate newOutputs (e.g., ensure len(newOutputs) > 0 or otherwise decide
fallback), and only after confirming the replacement is acceptable: close the
existing outputs (calling Close on each element), set outputs = newOutputs,
update currentCfg = update.cfg, and then log the reload via c.logger.Info; if
newOutputs is not acceptable, keep the old outputs and log the failure instead
of swapping/closing first.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 1-8: The proto package declaration is inconsistent: change the
line declaring "package eventrecorder;" to "package eventrecorderpb;" so the
proto package matches the directory/name pattern used elsewhere; ensure any
generated Go package references (e.g., option go_package =
"github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb";) remain
correct after this rename and update any imports that reference eventrecorder to
use eventrecorderpb.
In `@notify/event.go`:
- Around line 37-53: extractAlertGroupInfo currently rebuilds the group key from
a LabelSet (using groupKey.String() / groupKey.Hash()), which yields empty-set
metadata; instead pull and use the raw aggregation group key stored in context
by dispatch (the value obtained via AggrGroupID/its corresponding context entry)
when populating AlertGroupInfo. Update extractAlertGroupInfo to read the raw
group key from context (the stored aggregation key placed by dispatch), set
AlertGroupInfo.GroupKey to that raw string and compute GroupId from that raw key
(rather than calling groupKey.String() / groupKey.Hash()), keeping ReceiverName,
Matchers, RouteLabels and GroupUuid (aggrGroupID) as before.
In `@notify/notify.go`:
- Around line 283-303: WithRouteLabels and WithMutedAlerts are defined but never
invoked, so RouteLabels(ctx) and MutedAlerts(ctx) always return empty values
when notify/event.go builds events; update the context initialization in
dispatch/dispatch.go where the notification context is created to call
WithRouteLabels(ctx, <routeLabelSet>) and WithMutedAlerts(ctx, <mutedAlertSet>)
using the route's label set and the current muted alerts collection (or, if this
feature is unwanted, remove WithRouteLabels/WithMutedAlerts plus the
RouteLabels/MutedAlerts getters and their corresponding proto fields); locate
and modify the context-building code around dispatch logic to inject these
values so notify/event.go receives populated data.
---
Nitpick comments:
In `@eventrecorder/eventrecorder_test.go`:
- Around line 201-212: The test collapses proto.Labels into a map which masks
ordering regressions; update TestLabelSetAsProto to assert the deterministic
ordering promised by LabelSetAsProto by checking proto.Labels directly (e.g.,
verify proto.Labels[0].Key/Value and proto.Labels[1].Key/Value or compare the
slice to an expected ordered slice) instead of converting to a map, referencing
LabelSetAsProto and proto.Labels for locating the code.
In `@eventrecorder/file.go`:
- Around line 146-151: SendEvent should check the FileOutput.closed flag under
the same mutex before attempting to write so callers get a clear "output closed"
error instead of a low-level file error; inside SendEvent (which already locks
fo.mu), if fo.closed is true return a descriptive error (or a package-level var
like ErrOutputClosed) and only call fo.f.Write when not closed. Ensure the error
type/message matches any existing Close() behavior and tests.
In `@notify/event.go`:
- Around line 66-73: extractMutedGroupedAlerts currently iterates over the map
returned by MutedAlerts(ctx) yielding nondeterministic order; fix by collecting
the map keys into a slice, sort them (e.g., sort.Strings), then iterate the
sorted slice to append &eventrecorderpb.GroupedAlert{Hash: hash} to the result
so the output is deterministic; update extractMutedGroupedAlerts to use this
sorted-key approach while keeping the same return type.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e5468fb6-0156-4fb7-a5ec-01c9e0dfc092
⛔ Files ignored due to path filters (2)
- `eventrecorder/eventrecorderpb/eventrecorder.pb.go` is excluded by `!**/*.pb.go`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (29)
- api/v2/api.go
- buf.yaml
- cmd/alertmanager/main.go
- config/config.go
- config/coordinator.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- eventrecorder/eventrecorder.go
- eventrecorder/eventrecorder_test.go
- eventrecorder/eventrecorderpb/eventrecorder.proto
- eventrecorder/file.go
- eventrecorder/file_test.go
- eventrecorder/webhook.go
- eventrecorder/webhook_test.go
- featurecontrol/featurecontrol.go
- go.mod
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/event.go
- notify/mute_test.go
- notify/notify.go
- notify/notify_test.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
✅ Files skipped from review due to trivial changes (8)
- dispatch/dispatch_bench_test.go
- provider/mem/mem_test.go
- api/v2/api.go
- go.mod
- buf.yaml
- notify/mute_test.go
- eventrecorder/file_test.go
- dispatch/dispatch.go
🚧 Files skipped from review as they are similar to previous changes (9)
- notify/notify_test.go
- inhibit/inhibit_bench_test.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit.go
- silence/silence_test.go
- silence/silence_bench_test.go
- cmd/alertmanager/main.go
- provider/mem/mem.go
- config/config.go
notify/notify.go
Outdated
```go
// WithMutedAlerts populates a context with a set of muted alert hashes.
func WithMutedAlerts(ctx context.Context, alerts map[uint64]struct{}) context.Context {
	return context.WithValue(ctx, keyMutedAlerts, alerts)
}

// MutedAlerts extracts a set of muted alert hashes from the context.
func MutedAlerts(ctx context.Context) (map[uint64]struct{}, bool) {
	v, ok := ctx.Value(keyMutedAlerts).(map[uint64]struct{})
	return v, ok
}

// WithRouteLabels populates a context with route-level labels.
func WithRouteLabels(ctx context.Context, rl model.LabelSet) context.Context {
	return context.WithValue(ctx, keyRouteLabels, rl)
}

// RouteLabels extracts route-level labels from the context.
func RouteLabels(ctx context.Context) (model.LabelSet, bool) {
	v, ok := ctx.Value(keyRouteLabels).(model.LabelSet)
	return v, ok
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for usages of WithRouteLabels and WithMutedAlerts
echo "=== WithRouteLabels usages ==="
rg -n 'WithRouteLabels\(' --type=go
echo ""
echo "=== WithMutedAlerts usages ==="
rg -n 'WithMutedAlerts\(' --type=go

Repository: prometheus/alertmanager
Length of output: 338
🏁 Script executed:
#!/bin/bash
# Search for usages of RouteLabels and MutedAlerts getter functions
echo "=== RouteLabels() usages ==="
rg -n 'RouteLabels\(' --type=go -A 2
echo ""
echo "=== MutedAlerts() usages ==="
rg -n 'MutedAlerts\(' --type=go -A 2

Repository: prometheus/alertmanager
Length of output: 1787
🏁 Script executed:
#!/bin/bash
# Read the relevant sections of notify/event.go
echo "=== notify/event.go around RouteLabels (line 42) ==="
sed -n '35,55p' notify/event.go
echo ""
echo "=== notify/event.go around MutedAlerts (line 68) ==="
sed -n '60,80p' notify/event.go

Repository: prometheus/alertmanager
Length of output: 1452
🏁 Script executed:
#!/bin/bash
# Check the dispatch/dispatch.go context setup to see where setters should be called
echo "=== dispatch/dispatch.go context setup ==="
sed -n '663,716p' dispatch/dispatch.go | head -60

Repository: prometheus/alertmanager
Length of output: 1836
WithRouteLabels and WithMutedAlerts are defined but never populated.
These context setters are never called anywhere in the codebase. However, their corresponding getters—RouteLabels() and MutedAlerts()—are actively used in notify/event.go (lines 42 and 68) to populate the notification event. This means RouteLabels and MutedAlerts will always be empty in the resulting event records. Either these setters should be called during context initialization in dispatch/dispatch.go, or the getters and proto fields should be removed if this feature is not intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notify/notify.go` around lines 283 - 303, WithRouteLabels and WithMutedAlerts
are defined but never invoked, so RouteLabels(ctx) and MutedAlerts(ctx) always
return empty values when notify/event.go builds events; update the context
initialization in dispatch/dispatch.go where the notification context is created
to call WithRouteLabels(ctx, <routeLabelSet>) and WithMutedAlerts(ctx,
<mutedAlertSet>) using the route's label set and the current muted alerts
collection (or, if this feature is unwanted, remove
WithRouteLabels/WithMutedAlerts plus the RouteLabels/MutedAlerts getters and
their corresponding proto fields); locate and modify the context-building code
around dispatch logic to inject these values so notify/event.go receives
populated data.
This commit adds a structured event recorder to Alertmanager.
Events are defined in a protobuf schema and fanned
out to one or more configured outputs (JSONL file, webhook). The
event recorder provides an audit trail for observability and
analytics, answering questions like "why did this alert fire?", "was
it silenced?", "which receiver got notified?", and "when did this
inhibition rule suppress it?".
The event recorder is gated behind --enable-feature=event-recorder.
When not enabled, a zero-value Recorder is used and all methods are
no-ops.
At this time, there is no filtering of events or event query API.
Downstream software is intended to consume the event recorder via the
webhook or file output and filter and store events in an appropriate
database as desired; Alertmanager itself stays out of this logic.
Ten event types are defined, covering the full alert lifecycle:
| Event | Emitted When |
|---|---|
| `AlertmanagerStartupEvent` | Alertmanager process starts |
| `AlertmanagerShutdownEvent` | Alertmanager process receives SIGTERM |
| `AlertCreatedEvent` | A new alert is inserted via the API |
| `AlertGroupedEvent` | An alert is added to an aggregation group |
| `AlertResolvedEvent` | An alert's end time passes during a flush |
| `NotificationEvent` | A notification is successfully delivered |
| `SilenceCreatedEvent` | A new silence is created |
| `SilenceUpdatedEvent` | An existing silence is updated in place |
| `SilenceMutedAlertEvent` | A silence matches and mutes an alert |
| `InhibitionMutedAlertEvent` | An inhibition rule suppresses an alert |
Each event carries a full protobuf-defined payload with alert details,
labels, annotations, fingerprints, group metadata, silence/inhibition
rule details, and receiver information as applicable.
The event recorder is configured via a new top-level `event_recorder`
section in `alertmanager.yml`, supporting multiple simultaneous
outputs:
```yaml
event_recorder:
outputs:
- type: file
path: /var/log/alertmanager/events.jsonl
- type: webhook
url: https://example.com/events
http_config: # optional, same as receiver http_config
tls_config:
cert_file: /path/to/cert
timeout: 10s # optional, default 10s
workers: 4 # optional, default 4
max_retries: 3 # optional, default 3
retry_backoff: 500ms # optional, default 500ms
```
The configuration is hot-reloadable: sending SIGHUP swaps outputs
atomically (old outputs are closed, new ones created).
The file output writes JSONL to a file. It uses fsnotify to watch the parent directory
for file renames and removes (e.g. logrotate), triggering an immediate
reopen. The constructor blocks until the watcher is ready, avoiding
races in tests.
The webhook output POSTs each event as a JSON body to a configured URL. A bounded worker
pool (configurable via `workers`) processes deliveries concurrently so
that a slow webhook does not block the event queue. Failed deliveries
are retried with exponential backoff (configurable via `max_retries`
and `retry_backoff`), capped at 30 seconds. When the internal queue
(capacity 1024) is full, events are dropped and counted via the
`alertmanager_event_recorder_webhook_drops_total` Prometheus counter.
`RecordEvent` never blocks the caller. Events are serialized with
`protojson.Marshal` and placed on a bounded in-memory channel (capacity
8,192, capping memory at roughly 4 MB). A background goroutine drains
the channel and writes to all configured outputs. If the channel is
full, the event is dropped and a metric is incremented. The event
recorder must never slow down alert processing.
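The never-block enqueue is the standard `select`-with-`default` pattern on a bounded channel. A minimal sketch, with illustrative names (`recorder`, `record`, `dropped`) rather than the PR's actual API:

```go
package main

import "fmt"

// recorder holds a bounded channel drained by a background writer.
type recorder struct {
	ch      chan []byte
	dropped int
}

// record never blocks the caller: if the channel is full, the event is
// dropped and a counter is incremented (a Prometheus counter in the
// real code) instead of waiting for the writer to catch up.
func (r *recorder) record(ev []byte) bool {
	select {
	case r.ch <- ev:
		return true
	default:
		r.dropped++
		return false
	}
}

func main() {
	// Capacity 2 stands in for the real capacity of 8,192; with no
	// drainer running, the third enqueue overflows and is dropped.
	r := &recorder{ch: make(chan []byte, 2)}
	for i := 0; i < 3; i++ {
		r.record([]byte("event"))
	}
	fmt.Println(r.dropped)
}
```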
Concurrency is handled entirely via message passing — no mutexes. The
`writeLoop` goroutine exclusively owns output state and config.
Config updates arrive via a `cfgUpdate` channel with an ack signal.
`cluster.Peer` uses `atomic.Pointer`. `Close` uses `sync.Once`.
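The ack'd config-update handoff can be sketched like this: the loop goroutine exclusively owns the config, and callers send an update plus a done channel, then wait for the ack. The names follow the description above (`cfgUpdate`, `writeLoop`) but the code is illustrative, not the PR's implementation.

```go
package main

import "fmt"

// cfgUpdate carries a new config plus an ack channel the loop closes
// once the swap is complete.
type cfgUpdate struct {
	cfg  string
	done chan struct{}
}

func writeLoop(updates <-chan cfgUpdate, stop <-chan struct{}, finalCfg chan<- string) {
	cfg := "initial"
	for {
		select {
		case u := <-updates:
			cfg = u.cfg // only this goroutine ever touches cfg: no mutex needed
			close(u.done)
		case <-stop:
			finalCfg <- cfg
			return
		}
	}
}

func main() {
	updates := make(chan cfgUpdate)
	stop := make(chan struct{})
	finalCfg := make(chan string, 1)
	go writeLoop(updates, stop, finalCfg)

	u := cfgUpdate{cfg: "reloaded", done: make(chan struct{})}
	updates <- u
	<-u.done // the caller returns only after the loop has applied the config
	close(stop)
	fmt.Println(<-finalCfg)
}
```

Because the ack is a channel close, the caller observes the swap with a plain receive and no shared state is read outside the loop goroutine.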
`RecordEvent` accepts a `context.Context`. Recording is opt-in:
callers must pass `eventrecorder.WithEventRecording(ctx)` to enable
recording. The API update path omits this to avoid flooding the event
queue when checking silences and inhibitions for status display.
Event construction and protobuf conversion helpers live in the
`eventrecorder` package: `NewAlertCreatedEvent`,
`NewSilenceMutedAlertEvent`, `NewSilenceCreatedEvent`,
`NewSilenceUpdatedEvent`, `NewInhibitionMutedAlertEvent`,
`SilenceAsProto`, `InhibitRuleAsProto`, `SilenceMatcherAsProto`,
`MatcherAsProto`, `MatchersAsProto`, `LabelSetAsProto`. Call sites
in `inhibit`, `silence`, and `provider/mem` are one-liners.
`extractEventType` uses proto reflection (`WhichOneof`) to derive the
event type name from the oneof field, avoiding a manual type switch.
Five Prometheus metrics are exposed:
- `alertmanager_events_recorded_total` (labels: `event_type`, `output`, `result`)
- `alertmanager_event_recorder_bytes_written_total` (labels: `event_type`, `output`)
- `alertmanager_events_dropped_total` (labels: `event_type`)
- `alertmanager_event_serialize_errors_total` (labels: `event_type`)
- `alertmanager_event_recorder_queue_length` (gauge)
- `alertmanager_event_recorder_webhook_drops_total` (labels: `output`)
The `output` label uses a stable identifier (e.g. `file:/path` or
`webhook:https://host/path`) rather than a positional index, so it
remains meaningful across config reloads. Webhook URLs are sanitized
to strip userinfo and query parameters before use in identifiers,
metrics labels, and log messages.
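Stripping userinfo and query parameters from a URL is a few lines with `net/url`. A sketch of the sanitization step, with an illustrative helper name:

```go
package main

import (
	"fmt"
	"net/url"
)

// sanitizeWebhookURL drops credentials and query parameters so the URL
// is safe to embed in metric labels and log messages.
func sanitizeWebhookURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.User = nil    // drop user:password
	u.RawQuery = "" // drop ?token=...
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	id, err := sanitizeWebhookURL("https://user:secret@example.com/events?token=abc")
	if err != nil {
		panic(err)
	}
	fmt.Println("webhook:" + id) // stable identifier across config reloads
}
```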
Both `SilenceCreated` and `SilenceUpdated` events are recorded at the
same level in the `Set` method, rather than split between `setSilence`
(for create) and `Set` (for update). `setSilence` returns
`(changed, added bool, err error)` so the caller can decide what to
record and whether to broadcast.
Several new context keys were added to `notify/notify.go` to thread
dispatch-time metadata through the notification pipeline:
- `WithGroupMatchers` / `GroupMatchers`: route-level matchers
- `WithRouteLabels` / `RouteLabels`: route-level labels
- `WithFlushId` / `FlushId`: per-flush monotonic counter
- `WithAggrGroupId` / `AggrGroupId`: aggregation group UUID
- `WithMutedAlerts` / `MutedAlerts`: muted alert fingerprints
- `WithNotificationReason` / `NotificationReason`: trigger reason
These are populated in `dispatch/dispatch.go` where the aggregation
group context is set up, and consumed in `notify/event.go` to build
rich `NotificationEvent` and `AlertGroupedEvent` payloads.
Unit tests cover the eventrecorder package:
- `eventrecorder/eventrecorder_test.go`: RecordEvent delivery,
NopRecorder, zero-value Recorder, WithEventRecording,
ApplyConfig no-op, extractEventType, LabelSetAsProto,
MatcherAsProto, MatchersAsProto, eventRecorderConfigEqual.
- `eventrecorder/file_test.go`: SendEvent, reopen-after-rename,
reopen-after-remove, Close, invalid path.
- `eventrecorder/webhook_test.go`: webhook delivery via
httptest.Server, concurrent workers, retry on failure, drop after
max retries, and graceful close draining.
Additionally, tested with a live Alertmanager instance configured with
JSONL file output, a webhook endpoint, inhibition rules, and silences.
All 10 event types were verified:
1. Started Alertmanager with `--enable-feature=event-recorder` and
`event_recorder` configured. Verified
`alertmanagerStartupEvent`.
2. Fired test alerts. Verified `alertCreated` and `alertGrouped`.
3. Fired `CriticalAlert` (severity=critical) and `WarningAlert`
(severity=warning) with an inhibition rule suppressing warnings
when a matching critical alert exists. Verified
`inhibitionMutedAlert` with correct rule details and source
fingerprint.
4. Created a silence matching `SilencedAlert`, then fired
`SilencedAlert`. Verified `silenceCreated` and
`silenceMutedAlert` with silence ID and matchers.
5. Updated an active silence via the v2 API, preserving matchers and
extending the end time (satisfying `canUpdate`). Verified
`silenceUpdated` with the updated silence details and same ID.
6. Waited for notification delivery. Verified `notification` events
with receiver name, integration type, and alert details.
7. Fired a short-lived alert (5-second TTL), waited for flush.
Verified `alertResolved`.
8. Sent SIGTERM. Verified `alertmanagerShutdownEvent`.
9. Tested hot-reload: started with file-only config, sent SIGHUP with
file+webhook config, verified new events appeared on the webhook
and old file continued receiving events.
10. Verified feature flag gating: started without
`--enable-feature=event-recorder`, confirmed no event file was
created and no events were recorded.
Signed-off-by: Jack Rosenthal <jack@rosenth.al>
fad8b84 to 9057f64
Addressed feedback from @coderabbitai:
Accepted — all items look good:
No remaining blockers from my side. The PR is in great shape.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inhibit/inhibit.go (1)
210-223: ⚠️ Potential issue | 🟠 Major — Record all inhibiting matches, not just the first one.

`hasEqual()` only returns a single source fingerprint, and this branch emits/returns on the first matching rule. If the target is suppressed by multiple rules or multiple source alerts, the `InhibitionMutedAlertEvent` will under-report the inhibition context even though the schema is plural. Please either collect all matches before returning or make the event schema singular.
🧹 Nitpick comments (3)
notify/mute.go (1)

97-106: Sort muted alert hashes for deterministic ordering.

This stores hashes in a map, while `extractMutedGroupedAlerts` in `notify/event.go:67` ranges that map directly. Go map iteration order is undefined, causing `NotificationEvent.muted_alerts` order to vary between runs. Unlike `firing_alerts` and `resolved_alerts`, which use ordered slices (lines 56–58), muted alerts lose ordering. Since `muted_alerts` is a `repeated GroupedAlert` field in the proto, serialization order matters for deterministic event logs and webhooks.

Either sort the hashes before building the result or use an ordered slice alongside the set.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notify/mute.go` around lines 97 - 106, The map of muted alert hashes created in MutedAlerts/WithMutedAlerts (using hashAlert) produces nondeterministic iteration order later in extractMutedGroupedAlerts and causes NotificationEvent.muted_alerts to be emitted in variable order; fix by preserving deterministic ordering: alongside the set/map stored in ctx, also store an ordered slice of uint64 hash keys (or when reading the map in extractMutedGroupedAlerts, collect keys into a slice and sort them) so that the construction path that builds NotificationEvent.muted_alerts (referenced by extractMutedGroupedAlerts and NotificationEvent.muted_alerts) iterates a sorted list rather than ranging the map. Ensure the change updates both the storage site (WithMutedAlerts / MutedAlerts usage where mutedHashes are created) and the consumer (extractMutedGroupedAlerts) to use the ordered slice or sorted keys.

eventrecorder/file.go (1)

81-83: Consider logging the close error in reopen.

The error from closing the old file handle is silently ignored. While not critical (the new file is opened regardless), logging it could help diagnose issues like ENOSPC when flushing buffered data.

🔧 Optional improvement

```diff
 if fo.f != nil {
-	fo.f.Close()
+	if err := fo.f.Close(); err != nil {
+		fo.logger.Warn("Failed to close old event recorder file", "path", fo.path, "err", err)
+	}
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@eventrecorder/file.go` around lines 81 - 83, The call to fo.f.Close() in the reopen code path currently ignores any returned error; change it to capture the error (err := fo.f.Close()) and log it instead of discarding it. In the reopen function (the block calling fo.f.Close()) emit a descriptive log entry including context like "closing old file" and the error value using the package/logger already used by eventrecorder (e.g., the existing logger or fmt/log package) so ENOSPC or flush failures are recorded for diagnostics.

notify/notify.go (1)

1035-1035: Add early check to avoid allocating event payload when recording is disabled.

`NewNotificationEvent(ctx, sent, r.integration)` allocates memory before `RecordEvent` checks if recording is enabled. Use the existing `EventRecordingEnabled(ctx)` helper to skip construction when recording is disabled.

♻️ Suggested optimization

```diff
+if eventrecorder.EventRecordingEnabled(ctx) {
 	r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration))
+}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notify/notify.go` at line 1035, Add an early guard using EventRecordingEnabled(ctx) before constructing the event to avoid allocating NewNotificationEvent(ctx, sent, r.integration) when recording is disabled; i.e., check EventRecordingEnabled(ctx) and only call r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration)) when that helper returns true so the event payload is not allocated unnecessarily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 3-8: The proto package declaration "package eventrecorderpb"
violates Buf layout/version rules; either move this file into a top-level
directory named "eventrecorderpb" to match the package, or update the proto to a
versioned package (e.g., change the package to "eventrecorderpb.v1") and adjust
the accompanying option go_package and any imports/refs accordingly so symbols
(package eventrecorderpb -> eventrecorderpb.v1 and option go_package) and
client/server code point to the new package; alternatively if you intentionally
want to bypass the rule, add a buflint disable directive for layout/version at
the top of the file before merging.
In `@eventrecorder/file.go`:
- Around line 126-130: The code currently ignores the error from
filepath.Abs(event.Name) which can produce an empty/malformed evPath and
silently skip real rotations; update the block that computes evPath (where
filepath.Abs(event.Name) is called and evPath/absPath are compared) to handle
the error: if filepath.Abs returns an error, fall back to using a cleaned
event.Name (e.g., filepath.Clean(event.Name)) as evPath (or otherwise derive an
absolute path fallback) instead of ignoring the event, and optionally log the
Abs error so failures are visible; ensure the subsequent comparison against
absPath still uses the normalized evPath.
---
Nitpick comments:
In `@eventrecorder/file.go`:
- Around line 81-83: The call to fo.f.Close() in the reopen code path currently
ignores any returned error; change it to capture the error (err := fo.f.Close())
and log it instead of discarding it. In the reopen function (the block calling
fo.f.Close()) emit a descriptive log entry including context like "closing old
file" and the error value using the package/logger already used by eventrecorder
(e.g., the existing logger or fmt/log package) so ENOSPC or flush failures are
recorded for diagnostics.
In `@notify/mute.go`:
- Around line 97-106: The map of muted alert hashes created in
MutedAlerts/WithMutedAlerts (using hashAlert) produces nondeterministic
iteration order later in extractMutedGroupedAlerts and causes
NotificationEvent.muted_alerts to be emitted in variable order; fix by
preserving deterministic ordering: alongside the set/map stored in ctx, also
store an ordered slice of uint64 hash keys (or when reading the map in
extractMutedGroupedAlerts, collect keys into a slice and sort them) so that the
construction path that builds NotificationEvent.muted_alerts (referenced by
extractMutedGroupedAlerts and NotificationEvent.muted_alerts) iterates a sorted
list rather than ranging the map. Ensure the change updates both the storage
site (WithMutedAlerts / MutedAlerts usage where mutedHashes are created) and the
consumer (extractMutedGroupedAlerts) to use the ordered slice or sorted keys.
In `@notify/notify.go`:
- Line 1035: Add an early guard using EventRecordingEnabled(ctx) before
constructing the event to avoid allocating NewNotificationEvent(ctx, sent,
r.integration) when recording is disabled; i.e., check
EventRecordingEnabled(ctx) and only call r.recorder.RecordEvent(ctx,
NewNotificationEvent(ctx, sent, r.integration)) when that helper returns true so
the event payload is not allocated unnecessarily.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4339101e-a846-405f-8428-348194ae0d09
⛔ Files ignored due to path filters (2)
- `eventrecorder/eventrecorderpb/eventrecorder.pb.go` is excluded by `!**/*.pb.go`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (30)
- api/v2/api.go
- buf.yaml
- cmd/alertmanager/main.go
- config/config.go
- config/coordinator.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- eventrecorder/eventrecorder.go
- eventrecorder/eventrecorder_test.go
- eventrecorder/eventrecorderpb/eventrecorder.proto
- eventrecorder/file.go
- eventrecorder/file_test.go
- eventrecorder/webhook.go
- eventrecorder/webhook_test.go
- featurecontrol/featurecontrol.go
- go.mod
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/event.go
- notify/mute.go
- notify/mute_test.go
- notify/notify.go
- notify/notify_test.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
✅ Files skipped from review due to trivial changes (7)
- go.mod
- dispatch/dispatch_bench_test.go
- api/v2/api.go
- notify/mute_test.go
- provider/mem/mem_test.go
- eventrecorder/file_test.go
- notify/event.go
🚧 Files skipped from review as they are similar to previous changes (10)
- buf.yaml
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/notify_test.go
- dispatch/dispatch_test.go
- silence/silence_test.go
- provider/mem/mem.go
- dispatch/dispatch.go
- silence/silence_bench_test.go
- eventrecorder/eventrecorder.go
```proto
package eventrecorderpb;

import "google/protobuf/duration.proto";
import "google/protobuf/timestamp.proto";

option go_package = "github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb";
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
buf lint eventrecorder/eventrecorderpb/eventrecorder.proto

Repository: prometheus/alertmanager
Length of output: 415
Proto package fails Buf's layout and version lint rules.
The file has two blocking lint failures:
- Package "eventrecorderpb" must be in a directory named "eventrecorderpb" relative to root, but it's in a subdirectory.
- Package name should be versioned, e.g., "eventrecorderpb.v1".
Either restructure the package to match Buf's directory layout rules or explicitly disable these lint rules before merging.
🧰 Tools
🪛 Buf (1.66.1)
[error] 3-3: Files with package "eventrecorderpb" must be within a directory "eventrecorderpb" relative to root but were in directory ".".
(PACKAGE_DIRECTORY_MATCH)
[error] 3-3: Package name "eventrecorderpb" should be suffixed with a correctly formed version, such as "eventrecorderpb.v1".
(PACKAGE_VERSION_SUFFIX)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto` around lines 3 - 8, The
proto package declaration "package eventrecorderpb" violates Buf layout/version
rules; either move this file into a top-level directory named "eventrecorderpb"
to match the package, or update the proto to a versioned package (e.g., change
the package to "eventrecorderpb.v1") and adjust the accompanying option
go_package and any imports/refs accordingly so symbols (package eventrecorderpb
-> eventrecorderpb.v1 and option go_package) and client/server code point to the
new package; alternatively if you intentionally want to bypass the rule, add a
buflint disable directive for layout/version at the top of the file before
merging.
```go
// Only react to events on our target file.
evPath, _ := filepath.Abs(event.Name)
if evPath != absPath {
	continue
}
```
Ignoring filepath.Abs error may cause missed file rotation events.
If filepath.Abs(event.Name) fails, evPath will be empty or malformed, and the comparison with absPath will never match—silently skipping legitimate rotation events for the target file.
🛠️ Suggested fix

```diff
 // Only react to events on our target file.
-evPath, _ := filepath.Abs(event.Name)
+evPath, err := filepath.Abs(event.Name)
+if err != nil {
+	continue
+}
 if evPath != absPath {
 	continue
 }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// Only react to events on our target file.
evPath, err := filepath.Abs(event.Name)
if err != nil {
	continue
}
if evPath != absPath {
	continue
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/file.go` around lines 126 - 130, The code currently ignores the
error from filepath.Abs(event.Name) which can produce an empty/malformed evPath
and silently skip real rotations; update the block that computes evPath (where
filepath.Abs(event.Name) is called and evPath/absPath are compared) to handle
the error: if filepath.Abs returns an error, fall back to using a cleaned
event.Name (e.g., filepath.Clean(event.Name)) as evPath (or otherwise derive an
absolute path fallback) instead of ignoring the event, and optionally log the
Abs error so failures are visible; ensure the subsequent comparison against
absPath still uses the normalized evPath.
Sorry for the delay in reviews, I was away and kinda busy but I'm doing another review now.
This commit adds a structured event recorder to Alertmanager that
records events. Events are defined in a protobuf schema and fanned
out to one or more configured outputs (JSONL file, webhook). The
event recorder provides an audit trail for observability and
analytics, answering questions like "why did this alert fire?", "was
it silenced?", "which receiver got notified?", and "when did this
inhibition rule suppress it?".
The event recorder is gated behind --enable-feature=event-recorder.
When not enabled, a zero-value Recorder is used and all methods are
no-ops.
At this time, there is no filtering of events or event query API.
Downstream software is intended to consume the event recorder via the
webhook or file output and filter and store events in an appropriate
database as desired; Alertmanager itself stays out of this logic.
Ten event types are defined, covering the full alert lifecycle:
- `AlertmanagerStartupEvent`
- `AlertmanagerShutdownEvent`
- `AlertCreatedEvent`
- `AlertGroupedEvent`
- `AlertResolvedEvent`
- `NotificationEvent`
- `SilenceCreatedEvent`
- `SilenceUpdatedEvent`
- `SilenceMutedAlertEvent`
- `InhibitionMutedAlertEvent`

Each event carries a full protobuf-defined payload with alert details, labels, annotations, fingerprints, group metadata, silence/inhibition rule details, and receiver information as applicable.
The event recorder is configured via a new top-level `event_recorder` section in `alertmanager.yml`, supporting multiple simultaneous outputs.
The configuration is hot-reloadable: sending SIGHUP swaps outputs
atomically (old outputs are closed, new ones created).
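For illustration only, a configuration along these lines would exercise both output kinds (the key names are inferred from the options described in this PR; the actual schema may differ):

```yaml
# Hypothetical sketch of the event_recorder section; exact keys may differ.
event_recorder:
  outputs:
    - file:
        path: /var/log/alertmanager/events.jsonl
    - webhook:
        url: https://events.example.com/alertmanager
        workers: 4
        max_retries: 5
        retry_backoff: 1s
```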
The file output writes JSONL to a file. It uses fsnotify to watch the parent directory for file renames and removes (e.g. logrotate), triggering an immediate reopen. The constructor blocks until the watcher is ready, avoiding races in tests.
The webhook output POSTs each event as a JSON body to a configured URL. A bounded worker pool (configurable via `workers`) processes deliveries concurrently so that a slow webhook does not block the event queue. Failed deliveries are retried with exponential backoff (configurable via `max_retries` and `retry_backoff`), capped at 30 seconds. When the internal queue (capacity 1024) is full, events are dropped and counted via the `alertmanager_event_recorder_webhook_drops_total` Prometheus counter.

`RecordEvent` never blocks the caller. Events are serialized with `protojson.Marshal` and placed on a bounded in-memory channel (capacity
the channel and writes to all configured outputs. If the channel is
full, the event is dropped and a metric is incremented. The event
recorder must never slow down alert processing.
Concurrency is handled entirely via message passing; there are no mutexes. The `writeLoop` goroutine exclusively owns output state and config. Config updates arrive via a `cfgUpdate` channel with an ack signal. `cluster.Peer` uses `atomic.Pointer`. `Close` uses `sync.Once`.

`RecordEvent` accepts a `context.Context`. Recording is opt-in: callers must pass `eventrecorder.WithEventRecording(ctx)` to enable recording. The API update path omits this to avoid flooding the event queue when checking silences and inhibitions for status display.
Event construction and protobuf conversion helpers live in the `eventrecorder` package: `NewAlertCreatedEvent`, `NewSilenceMutedAlertEvent`, `NewSilenceCreatedEvent`, `NewSilenceUpdatedEvent`, `NewInhibitionMutedAlertEvent`, `SilenceAsProto`, `InhibitRuleAsProto`, `SilenceMatcherAsProto`, `MatcherAsProto`, `MatchersAsProto`, `LabelSetAsProto`. Call sites in `inhibit`, `silence`, and `provider/mem` are one-liners. `extractEventType` uses proto reflection (`WhichOneof`) to derive the event type name from the oneof field, avoiding a manual type switch.
Six Prometheus metrics are exposed:

- `alertmanager_events_recorded_total` (labels: `event_type`, `output`, `result`)
- `alertmanager_event_recorder_bytes_written_total` (labels: `event_type`, `output`)
- `alertmanager_events_dropped_total` (labels: `event_type`)
- `alertmanager_event_serialize_errors_total` (labels: `event_type`)
- `alertmanager_event_recorder_queue_length` (gauge)
- `alertmanager_event_recorder_webhook_drops_total` (labels: `output`)

The `output` label uses a stable identifier (e.g. `file:/path` or `webhook:https://host/path`) rather than a positional index, so it remains meaningful across config reloads. Webhook URLs are sanitized to strip userinfo and query parameters before use in identifiers, metrics labels, and log messages.
Both `SilenceCreated` and `SilenceUpdated` events are recorded at the same level in the `Set` method, rather than split between `setSilence` (for create) and `Set` (for update). `setSilence` returns `(changed, added bool, err error)` so the caller can decide what to record and whether to broadcast.
Several new context keys were added to `notify/notify.go` to thread dispatch-time metadata through the notification pipeline:

- `WithGroupMatchers`/`GroupMatchers`: route-level matchers
- `WithRouteLabels`/`RouteLabels`: route-level labels
- `WithFlushId`/`FlushId`: per-flush monotonic counter
- `WithAggrGroupId`/`AggrGroupId`: aggregation group UUID
- `WithMutedAlerts`/`MutedAlerts`: muted alert fingerprints
- `WithNotificationReason`/`NotificationReason`: trigger reason

These are populated in `dispatch/dispatch.go` where the aggregation group context is set up, and consumed in `notify/event.go` to build rich `NotificationEvent` and `AlertGroupedEvent` payloads.

Unit tests cover the eventrecorder package:
- `eventrecorder/eventrecorder_test.go`: RecordEvent delivery, NopRecorder, zero-value Recorder, WithEventRecording, ApplyConfig no-op, extractEventType, LabelSetAsProto, MatcherAsProto, MatchersAsProto, eventRecorderConfigEqual.
- `eventrecorder/file_test.go`: SendEvent, reopen-after-rename, reopen-after-remove, Close, invalid path.
- `eventrecorder/webhook_test.go`: webhook delivery via httptest.Server, concurrent workers, retry on failure, drop after max retries, and graceful close draining.
Additionally, tested with a live Alertmanager instance configured with
JSONL file output, a webhook endpoint, inhibition rules, and silences.
All 10 event types were verified:

- Started with `--enable-feature=event-recorder` and `event_recorder` configured. Verified `alertmanagerStartupEvent`.
- Fired alerts and verified `alertCreated` and `alertGrouped`.
- Fired `CriticalAlert` (severity=critical) and `WarningAlert` (severity=warning) with an inhibition rule suppressing warnings when a matching critical alert exists. Verified `inhibitionMutedAlert` with correct rule details and source fingerprint.
- Created a silence matching `SilencedAlert`, then fired `SilencedAlert`. Verified `silenceCreated` and `silenceMutedAlert` with silence ID and matchers.
- Updated the silence, extending the end time (satisfying `canUpdate`). Verified `silenceUpdated` with the updated silence details and same ID.
- Verified `notification` events with receiver name, integration type, and alert details.
- Verified `alertResolved` on resolution and `alertmanagerShutdownEvent` on shutdown.
- Reloaded with a `file`+`webhook` config, verified new events appeared on the webhook and old file continued receiving events.
- Ran without `--enable-feature=event-recorder`, confirmed no event file was created and no events were recorded.