eventrecorder: add structured event recorder #5072

jackrosenthal wants to merge 1 commit into prometheus:main
Conversation

Thank you @jackrosenthal

Addressed findings from Devin review:

Acking from Devin (no action needed):
eventlog/eventlog.go (Outdated)

```go
func MatchersAsProto(matchers labels.Matchers) []*eventlogpb.Matcher {
	result := make([]*eventlogpb.Matcher, len(matchers))
	for i, m := range matchers {
```

Should we have a SingleMatcherAsProto func called here, and also use it instead of convertMatcherForEvent() in inhibit/inhibit.go:428-445, so we avoid duplicating all of this?
```diff
-	return s.setSilence(msil, now)
+	if err := s.setSilence(msil, now); err != nil {
+		return err
+	}
```

Seeing this flow here makes me think: we are relying on setSilence not emitting the event itself, on the assumption that if canUpdate succeeds, the merge will succeed. Is that always guaranteed? Should we at least log or track if we call setSilence expecting no new silence and one happens against all odds, or is that out of scope? Also, the pattern of recording the event inside setSilence on one path but outside it on the other is a bit unbalanced. I wonder if we can always record inside setSilence, or always outside.

I think it is guaranteed: the checks before this point verify the id matches, so even if canUpdate is wrong we'll still end up overwriting an existing silence when we call setSilence.
silence/silence.go (Outdated)

```go
activeIDs = append(activeIDs, sil.Id)
allIDs = append(allIDs, sil.Id)

recorder.RecordEvent(&eventlogpb.EventData{
```

Would it make sense to emit just one event with all the silences that mute the alert instead? WDYT?

I considered this, but I think per-silence events are more useful for downstream consumers. Each event is self-contained and can be filtered and routed independently, and the schema stays simple (a single silence per event rather than a repeated field). If an alert is muted by 3 silences, 3 separate events are a clearer audit trail than one batched event where the consumer has to iterate.
eventlog/webhook.go (Outdated)

```go
}

// WriteEvent POSTs the pre-serialized JSON bytes to the webhook URL.
func (wo *WebhookOutput) WriteEvent(data []byte) error {
```

Should we have a goroutine per POST, or a bounded number of workers, to avoid a slow or temporarily down webhook from stalling us, and/or a small number of retries before dropping? And should we have metrics for retried events, dropped events, and the length of the request queue?
eventlog/eventlog.go (Outdated)

```go
for i, out := range outputs {
	idx := strconv.Itoa(i)
```

This index could change on config reload, right? Should we use file:path or webhook:url for a more stable long-term meaning of the output label?
eventlog/eventlog.go (Outdated)

```go
// placed on a bounded in-memory queue. A background goroutine
// drains the queue and writes to outputs. If the queue is full,
// events are dropped and a metric is incremented.
package eventlog
```

Should we add any tests for the eventlog package?
eventlog/eventlog.go (Outdated)

```go
// Skip if the config hasn't changed to avoid closing and
// recreating identical outputs.
r.core.mu.Lock()
if reflect.DeepEqual(cfg, r.core.currentCfg) {
```

Should we use a different comparison here, restricted to what we care about, instead of a full DeepEqual on a struct that is quite complex and might have unexported fields? We could, for example, keep a hash of the YAML that we used to build the config and only rebuild if the new YAML has a different hash.
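The YAML-hash idea can be sketched like this. It is a minimal illustration under stated assumptions: `configFingerprint` is an invented name, and the reload path is assumed to still have the raw config bytes available.

```go
// Sketch: fingerprint the raw YAML that produced the outputs and rebuild
// only when the fingerprint changes, avoiding reflect.DeepEqual on a
// complex struct with unexported fields.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configFingerprint hashes the raw config bytes into a short stable token.
func configFingerprint(rawYAML []byte) string {
	sum := sha256.Sum256(rawYAML)
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := []byte("outputs:\n  - file: /var/log/events.jsonl\n")
	b := []byte("outputs:\n  - webhook: https://example.com/hook\n")
	fmt.Println(configFingerprint(a) == configFingerprint(a)) // unchanged config: skip rebuild
	fmt.Println(configFingerprint(a) == configFingerprint(b)) // changed config: rebuild outputs
}
```

The trade-off is that whitespace-only edits to the YAML would also trigger a rebuild, which is harmless here since rebuilding identical outputs is just slightly wasteful.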
eventlog/eventlog.go (Outdated)

```go
	return
}

newOutputs := buildOutputs(cfg.Outputs, r.core.logger)
```

Is there any way to build these without holding the lock, then swap them in? Or, for example, we could keep an atomic pointer to the config instead of a lock.
cmd/alertmanager/main.go (Outdated)

```diff
-	inhibitor.Load().Mutes(ctx, labels)
-	silencer.Mutes(ctx, labels)
+	inhibitor.Load().MutesNoRecord(ctx, labels)
+	silencer.MutesNoRecord(ctx, labels)
```

I am wondering if, instead of a separate MutesNoRecord call in both modules, we could pass a context that disables the event recorder for the call?
eventlog/file.go (Outdated)

```go
"log/slog"
"time"

"github.com/pengsrc/go-shared/reopen"
```

This is a library that hasn't been updated in years (8 years, not 2 :). At this point, rather than taking the dependency, it might be better to implement the minimum of what we need ourselves. Also, should we consider using https://github.com/fsnotify/fsnotify to detect when we need to reopen, instead of blindly reopening every 5 minutes?
Addressed review feedback:
inhibit/inhibit.go (Outdated)

```go
ih.recorder.RecordEvent(ctx, &eventlogpb.EventData{
	EventType: &eventlogpb.EventData_InhibitionMutedAlert{
		InhibitionMutedAlert: &eventlogpb.InhibitionMutedAlertEvent{
			InhibitRules: []*eventlogpb.InhibitRule{convertInhibitRuleForEvent(r)},
			MutedAlert: &eventlogpb.MutedAlert{
				Fingerprint: uint64(fp),
				Labels:      eventlog.LabelSetAsProto(lset),
			},
			InhibitingFingerprints: []uint64{uint64(inhibitedByFP)},
		},
	},
})
```

Can we add some helper functions to make constructing events less verbose on the emitter side? For example:

```diff
-	ih.recorder.RecordEvent(ctx, &eventlogpb.EventData{
-		EventType: &eventlogpb.EventData_InhibitionMutedAlert{
-			InhibitionMutedAlert: &eventlogpb.InhibitionMutedAlertEvent{
-				InhibitRules: []*eventlogpb.InhibitRule{convertInhibitRuleForEvent(r)},
-				MutedAlert: &eventlogpb.MutedAlert{
-					Fingerprint: uint64(fp),
-					Labels:      eventlog.LabelSetAsProto(lset),
-				},
-				InhibitingFingerprints: []uint64{uint64(inhibitedByFP)},
-			},
-		},
-	})
+	ih.recorder.RecordEvent(ctx, eventlog.NewInhibitionMutedAlertEvent(r, fp, lset, inhibitedBy))
```
inhibit/inhibit.go (Outdated)

```go
func convertInhibitRuleForEvent(r *InhibitRule) *eventlogpb.InhibitRule {
	sourceMatchers := make([]*eventlogpb.Matcher, len(r.SourceMatchers))
	for i, m := range r.SourceMatchers {
		sourceMatchers[i] = eventlog.MatcherAsProto(m)
	}

	targetMatchers := make([]*eventlogpb.Matcher, len(r.TargetMatchers))
	for i, m := range r.TargetMatchers {
		targetMatchers[i] = eventlog.MatcherAsProto(m)
	}

	equalLabels := make([]string, 0, len(r.Equal))
	for label := range r.Equal {
		equalLabels = append(equalLabels, string(label))
	}

	return &eventlogpb.InhibitRule{
		SourceMatchers: sourceMatchers,
		TargetMatchers: targetMatchers,
		EqualLabels:    equalLabels,
	}
}
```

Following my previous comment, this should live in the eventlog package and be used internally there.

Done. Moved to eventrecorder.InhibitRuleAsProto.

I think the package name
eventlog/file.go (Outdated)

```go
if event.Op&(fsnotify.Rename|fsnotify.Remove) != 0 {
	fo.reopen()
	// Re-add the watch since the original inode is gone.
	_ = watcher.Add(fo.path)
```

In case of error here we may miss future notifications. Would it make sense to watch the directory instead, only once, and filter for the filename we care about in the notifications we get from changes there?

Done. Changed to watch the parent directory with fsnotify, filtering by the target file path.
eventlog/webhook.go (Outdated)

```go
maxRetries:   maxRetries,
retryBackoff: retryBackoff,
logger:       logger,
work:         make(chan []byte, workers),
```

This channel only has capacity for `workers` (4) items, so when below you do `wo.work <- data`, it will block once all 4 workers are busy and 4 events are enqueued. Should we have a larger channel here? And maybe, below, drop events and log/increment a metric if the big channel is completely full?

Done. Increased the queue to 1024. SendEvent is now non-blocking and drops the event when the queue is full, incrementing alertmanager_event_recorder_webhook_drops_total.
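The non-blocking drop-on-full behavior described above can be sketched as follows. This is an illustration, not the PR's code: the `webhookQueue` type is invented, and a plain counter stands in for the `alertmanager_event_recorder_webhook_drops_total` metric.

```go
// Sketch: enqueue if there is room, otherwise drop and count, so a full
// queue never blocks the caller.
package main

import (
	"fmt"
	"sync/atomic"
)

type webhookQueue struct {
	work  chan []byte
	drops atomic.Int64
}

// SendEvent never blocks: when the queue is full, the event is dropped
// and the drop counter is incremented instead.
func (q *webhookQueue) SendEvent(data []byte) {
	select {
	case q.work <- data:
	default:
		q.drops.Add(1)
	}
}

func main() {
	q := &webhookQueue{work: make(chan []byte, 2)}
	for i := 0; i < 5; i++ {
		q.SendEvent([]byte("event"))
	}
	// With no consumer running, 2 events fit and 3 are dropped.
	fmt.Println(len(q.work), q.drops.Load()) // prints "2 3"
}
```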
```go
// FileOutput writes pre-serialized JSON event bytes to a JSONL file.
// The file is reopened when fsnotify detects a rename or remove (e.g.
// logrotate).
type FileOutput struct {
```

Can we have some tests for FileOutput and the relevant reopener/notification handling, to check it all works as we expect?

Done. Added five tests: TestFileOutput_SendEvent, TestFileOutput_ReopenAfterRename, TestFileOutput_ReopenAfterRemove, TestFileOutput_Close, and TestFileOutput_InvalidPath.
eventlog/eventlog.go (Outdated)

```go
newOutputs := buildOutputs(cfg.Outputs, r.core.logger)

// Swap under the lock.
r.core.mu.Lock()
```

Here we should check that oldOutputs is unchanged since we unlocked and re-locked. I know it's unlikely, but still. If it were an atomic pointer we would use (*Pointer[T]).CompareAndSwap; otherwise we could use a version number stored inside, or a hash, or re-compare with eventLogConfigEqual, though that seems less ideal.

This is no longer applicable. The concurrency refactor replaced all locks with message passing, so writeLoop exclusively owns the outputs and there is no unlock/relock sequence.
eventlog/webhook.go (Outdated)

```go
	}
	wo.logger.Warn("Event log webhook POST failed", "url", wo.url, "attempt", attempt+1, "err", err)
	if attempt < wo.maxRetries-1 {
		backoff := time.Duration(float64(wo.retryBackoff) * math.Pow(2, float64(attempt)))
```

Should we have a max backoff here? The largest one with the default values is 256 seconds. And should we have a metric for dropped values? Also, do we need to bring in math.Pow here? (Though it shouldn't matter too much, fair enough.)

Done. Added a max backoff cap of 30 seconds and simplified the calculation to min(retryBackoff<<attempt, maxBackoff), removing the math import. Drop metric added as noted above.
eventlog/eventlog.go (Outdated)

```go
select {
case req := <-c.events:
	c.writeToOutputs(req)
case <-c.done:
```

Here, while we are draining c.events, the workers may have already found an empty channel and quit. That is: done is closed, all workers quit because their channel is empty, and then we get scheduled and start adding events which nobody handles. :( We should first tell writeLoop to drain-and-return, and only after it's done tell the workers to drain-and-return.

Fixed by the concurrency refactor. writeLoop handles its own drain in a defer when <-done fires, so there is no race between workers quitting and events being enqueued.
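The drain-in-defer pattern from the resolution can be sketched as follows; the names are illustrative, and the key property is that the drain runs exactly once, after the main loop has already decided to stop.

```go
// Sketch: when done fires, writeLoop exits its loop and then drains
// whatever is still queued before returning, so queued events are never
// left behind by the shutdown race described above.
package main

import "fmt"

func writeLoop(events <-chan []byte, done <-chan struct{}, write func([]byte)) {
	// Drain remaining events after the loop exits, for any exit path.
	defer func() {
		for {
			select {
			case e := <-events:
				write(e)
			default:
				return
			}
		}
	}()
	for {
		select {
		case e := <-events:
			write(e)
		case <-done:
			return
		}
	}
}

func main() {
	events := make(chan []byte, 8)
	for i := 0; i < 5; i++ {
		events <- []byte("event")
	}
	done := make(chan struct{})
	close(done) // shutdown requested with events still queued
	written := 0
	writeLoop(events, done, func([]byte) { written++ })
	fmt.Println(written) // prints "5": all queued events still get written
}
```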
eventlog/eventlog_test.go (Outdated)

```go
// Apply a new config. Since we can't easily create real outputs
// in a test, we manually swap outputs. But we can verify that an
// unchanged config is a no-op.
```

Hmm, it does seem that we should also verify that the config changes and new events go to the new output correctly.

Verified manually with a live instance. Started with a file-only config, sent SIGHUP with a file+webhook config, and confirmed new events appeared on the webhook while the file continued receiving events.
eventlog/eventlog.go (Outdated)

```go
// Close signals the background goroutine to drain remaining events
// and stop, then closes all outputs.
func (r Recorder) Close() error {
```

Shall we test that Close drains all events properly?

TestWebhookOutput_CloseFlushesQueue and TestWebhookOutput_MultipleWorkers both verify that Close drains all queued events.
eventlog/eventlog.go (Outdated)

```go
case r.core.events <- writeRequest{data: data, eventType: eventType}:
default:
	// Queue full; drop event to avoid blocking alertmanager.
	r.core.metrics.eventsDropped.WithLabelValues(eventType).Inc()
```

Should we add a test for this unhappy path, and potentially other unhappy paths?

Added TestFileOutput_InvalidPath and TestWebhookOutput_DropsAfterMaxRetries.
silence/silence.go (Outdated)

```go
func (s *Silences) convertSilenceForEvent(sil *pb.Silence) *eventlogpb.Silence {
	matcherSets := make([]*eventlogpb.MatcherSet, len(sil.MatcherSets))
	for i, ms := range sil.MatcherSets {
		matcherSet := &eventlogpb.MatcherSet{
			Matchers: make([]*eventlogpb.Matcher, len(ms.Matchers)),
		}
		for j, m := range ms.Matchers {
			matcherSet.Matchers[j] = &eventlogpb.Matcher{
				Type:     convertMatcherType(m.Type),
				Name:     m.Name,
				Pattern:  m.Pattern,
				Rendered: renderMatcher(m),
			}
		}
		matcherSets[i] = matcherSet
	}

	var matchers []*eventlogpb.Matcher
	if len(matcherSets) > 0 {
		matchers = matcherSets[0].Matchers
	}

	return &eventlogpb.Silence{
		Id:          sil.Id,
		Matchers:    matchers,
		MatcherSets: matcherSets,
		StartsAt:    sil.StartsAt,
		EndsAt:      sil.EndsAt,
		UpdatedAt:   sil.UpdatedAt,
		CreatedBy:   sil.CreatedBy,
		Comment:     sil.Comment,
```

This does not access anything from the Silences instance s and should not be a method on it. Since both sides of the conversion are proto types, can we use generated code to handle this conversion? This should be a helper function in the eventlog (whatever we name it) package.

Done. Moved to eventrecorder.SilenceAsProto and eventrecorder.SilenceMatcherAsProto.
eventlog/eventlog.go (Outdated)

```go
type recordingDisabledContextKey struct{}

// WithRecordingDisabled returns a context that suppresses event recording.
func WithRecordingDisabled(ctx context.Context) context.Context {
	return context.WithValue(ctx, recordingDisabledContextKey{}, true)
}

// RecordingDisabled reports whether event recording has been suppressed
// in the given context via WithRecordingDisabled.
func RecordingDisabled(ctx context.Context) bool {
	v, _ := ctx.Value(recordingDisabledContextKey{}).(bool)
	return v
}
```

This reads very oddly; we should have recording disabled by default and then have an option to enable it, for example WithEventRecording().

Done. Renamed to WithEventRecording. Recording is disabled by default; callers opt in.
eventlog/eventlog.go (Outdated)

```go
// recorderCore holds the mutable state shared by all copies of a
// Recorder value. Access is protected by mu.
type recorderCore struct {
```

Maybe sharedRecorder would be a better name?
eventlog/eventlog.go (Outdated)

```go
// Output is a single event log destination. Implementations receive
// pre-serialized JSON bytes and are responsible for delivering them.
type Output interface {
	// Name returns a stable identifier for this output, suitable for
	// use as a Prometheus label value (e.g. "file:/var/log/events.jsonl"
	// or "webhook:https://example.com/hook").
	Name() string
	WriteEvent(data []byte) error
	io.Closer
}
```

Maybe call it Destination with a SendEvent() method?
SoloJacobs left a comment

General remark: it would be easier to review and merge this if we did the file.go and webhook.go implementations in a separate PR, but here we go.

a) Some of these events are extremely detailed, e.g. InhibitionMutedAlertEvent. We could construct all the relevant information in Alertmanager just from the silences, the configuration, and the alerts. If we precompute results, then this isn't really event sourcing? We already have tracing for adding observability to Alertmanager.

b) I don't think we should need any locks in an event recorder; can we just use message passing? Below is the AI's attempt (sorry, I let the AI do this; I otherwise would have had to delay this review even more):

- Locking for the cluster.peer is weird; we should simply emit an event with the position whenever peer.Position() is called. We could also keep track of that position inside the EventRecorder, but definitely not from
```go
func (r Recorder) UpdateClusterPosition(pos int) {
	if r.ch == nil {
		return
	}
	p := uint32(pos)
	select {
	case r.ch <- logRequest{clusterPos: &p}:
	default:
		// Queue full. For state updates, we might want to retry or block briefly,
		// but dropping is standard for this non-blocking design.
	}
}
```
- The EventRecorder should be able to handle config reloads just by message passing:
```go
func (c *recorderCore) writeLoop(reqs <-chan logRequest) {
	var currentClusterPos uint32 = 0
	var outputs []Output // <--- local state, completely lock-free

	// Ensure we clean up whatever is active when the process finally exits.
	defer func() {
		for _, out := range outputs {
			out.Close()
		}
	}()

	for req := range reqs {
		// --- CONTROL MESSAGE: CONFIG RELOAD ---
		if req.newOutputs != nil {
			c.logger.Info("Consumer applying new event log configuration")
			// 1. Gracefully close the old outputs.
			for _, out := range outputs {
				if err := out.Close(); err != nil {
					c.logger.Error("Failed to close old output", "err", err)
				}
			}
			// 2. Swap to the new outputs.
			outputs = req.newOutputs
			continue
		}

		// --- CONTROL MESSAGE: CLUSTER POSITION ---
		if req.clusterPos != nil {
			currentClusterPos = *req.clusterPos
			continue
		}

		// --- STANDARD DATA MESSAGE: EVENT ---
		if req.event != nil {
			// (Wrap the event, stamp the cluster position, marshal to JSON...)
			data := marshalEvent(req.event, currentClusterPos)
			// Write to the currently active outputs.
			for _, out := range outputs {
				out.WriteEvent(data)
			}
		}
	}
}
```
```go
// WebhookOutput POSTs each event as a JSON body to a configured URL.
// Events are processed by a bounded worker pool so that a slow or
// temporarily unavailable webhook does not block the event log queue.
type WebhookOutput struct {
```

Should we really call this Webhook? The name is already taken by the notifier.

While I do think it's a little confusing, I also think it's OK since it's in another package. IMO it's fairly conventional to have this kind of name duplication in Go, since the package name always prefixes the symbol name.

I agree with @Spaceman1701 that this is fine since the type is in a separate package (eventrecorder.WebhookOutput vs webhook.Notifier).
cmd/alertmanager/main.go (Outdated)

```go
	},
})
defer func() {
	eventRecorder.RecordEvent(context.Background(), &eventlogpb.EventData{
```

Can we put this explicitly at the end of the function?

So the event log always records all events, while tracing is for sampling. Technically one could sample traces at 100%, but tracing is less efficient than event recording. Event recording is more about analytics than observability.

But we could very easily address the sampling issue by using our existing logger, correct? It wasn't clear from my first review, but the actual code changes I requested were for item b). For item a), what I really need is a coherent strategy for our different logging mechanisms: when do we pick which emitter (logs, metrics, traces, event log)? How do we deal with duplicate events? This feels like the part that is not explicit here. As far as I can see, we can't really use the EventRecorder for event sourcing. That is fine, but we really need to be clear about what capabilities this feature gives users. Currently, the PR just describes what events are added, but not why.

So some background: I agree the description of the PR can be better, or be linked to issue #4948.
Spaceman1701 left a comment

Thanks for opening this PR! I agree with the general requests around cleanup (e.g. moving event construction into the eventlog package), but overall the code and behavior look good (and familiar 😅).

I also wanted to mention that we discussed this change during the last Alertmanager working group meeting. I think the biggest takeaway is that we should put this change behind an experimental feature flag now: the eventlog protos will be the first "public" protos in Alertmanager, so it would be nice to have some time to iterate on them before we have to support them as part of the stable API.
To echo @siavashs a little, the way we use the event recorder in HRT is as a separate structured event reporting system. It's read by machines to generate historical information that drives both analytics and user workflows.

The goal is to get enough events that we can do event sourcing, but I agree there are some things missing. Over time we can improve this system to export more events and provide more detail. This is also why the events sometimes include "extra" data like the peer position: while a perfect event sourcing system wouldn't need this, we practically:

(2) is the most important one, I think: in order to do "true" event sourcing to replicate the state of Alertmanager, you would need infrastructure that can guarantee you never drop an event. This is not really practical in reality, so (in my opinion) it's helpful to "duplicate" some information in each event message. I don't think having extra events breaks the model, really. If you were implementing an event sourcing system, you would just ignore the events that don't contain useful data for you. Perhaps in the future we could even support some filtering directly from the Alertmanager configuration.
Force-pushed 33a5b39 to 65f596e
📝 Walkthrough

Adds an opt-in event recording subsystem: protobuf schema, Recorder with async queue and file/webhook outputs, config and feature-flag support, and instrumentation wired into alerts, silences, inhibitions, dispatcher, and notification pipeline; startup/shutdown events and API tracing context wrapping are also added.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Subsystem as Subsystem
    participant Recorder as Recorder
    participant WriteLoop as WriteLoop
    participant Destination as Destination
    Subsystem->>Recorder: RecordEvent(ctx, EventData)
    activate Recorder
    Recorder->>Recorder: extract event_type, wrap with timestamp/instance
    Recorder->>Recorder: protojson.Marshal + append newline
    Recorder->>Recorder: enqueue bytes (non-blocking)
    deactivate Recorder
    WriteLoop->>Recorder: drain queue
    activate WriteLoop
    WriteLoop->>Destination: SendEvent(serialized bytes)
    activate Destination
    Destination->>Destination: write/POST (file or HTTP), handle retries/reopen/drops
    Destination-->>WriteLoop: success / error
    deactivate Destination
    WriteLoop->>WriteLoop: update metrics, apply config updates
    deactivate WriteLoop
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~50 minutes
Actionable comments posted: 12

Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
dispatch/dispatch.go (2)

Lines 819-838 (⚠️ Potential issue, 🟠 Major): Record resolved events only after the delete confirms the transition.

resolvedSlice is only tentative here. If an alert fires again during the notification flush, DeleteIfNotModified will keep it in the store, but the current ordering has already emitted AlertResolvedEvent. Please move recording after the delete and only emit for fingerprints that are actually gone.

Possible fix:

```diff
 if notify(alertsSlice...) {
-	ag.recordResolvedEvents(resolvedSlice)
-
 	// Delete all resolved alerts as we just sent a notification for them,
 	// and we don't want to send another one. However, we need to make sure
 	// that each resolved alert has not fired again during the flush as then
 	// we would delete an active alert thinking it was resolved.
 	// Since we are passing DestroyIfEmpty=true the group will be marked as
 	// destroyed if there are no more alerts after the deletion.
 	if err := ag.alerts.DeleteIfNotModified(resolvedSlice, true); err != nil {
 		ag.logger.Error("error on delete alerts", "err", err)
 	} else {
+		actuallyResolved := resolvedSlice[:0]
 		// Delete markers for resolved alerts that are not in the store.
 		for _, alert := range resolvedSlice {
 			_, err := ag.alerts.Get(alert.Fingerprint())
 			if errors.Is(err, store.ErrNotFound) {
 				ag.marker.Delete(alert.Fingerprint())
+				actuallyResolved = append(actuallyResolved, alert)
 			}
 		}
+		ag.recordResolvedEvents(actuallyResolved)
 	}
 }
```
Lines 771-781 (⚠️ Potential issue, 🟠 Major): Only emit grouped events on first membership.

ag.alerts.Set also succeeds when an existing alert is refreshed in the same aggregation group, so this will emit AlertGroupedEvent for every update of a long-lived alert. The event needs a real "new to group" check, not just "Set returned nil".

inhibit/inhibit.go (1)
Lines 210-221 (⚠️ Potential issue, 🟠 Major): Only emit inhibition events on the state transition.

Mutes is re-evaluated on every notification attempt, so this unconditional RecordEvent will enqueue the same InhibitionMutedAlertEvent on every flush while the alert stays inhibited. Please gate recording on the marker actually changing to the inhibited state instead of every successful lookup.
🧹 Nitpick comments (2)

silence/silence_test.go (1)

Lines 646-647: Assert the added result in TestSilencesSetSilence. This test validates a create path; it should assert added == true so a non-error no-op merge doesn't pass unnoticed.

✅ Suggested assertion:

```diff
-	_, err := s.setSilence(s.toMeshSilence(sil), nowpb)
+	added, err := s.setSilence(s.toMeshSilence(sil), nowpb)
 	require.NoError(t, err)
+	require.True(t, added)
```
280-282: Guard muted-alert event construction on the recording flag.
RecordEventbails out when recording is disabled, but by this point the silence and label payloads are already fully materialized.Silencer.Mutesis on a hot path, so a cheapif eventrecorder.EventRecordingEnabled(ctx)around this constructor would avoid unnecessary allocations when the recorder is off.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@silence/silence.go` around lines 280 - 282, Wrap the construction and call to s.recorder.RecordEvent(...) with a cheap guard using eventrecorder.EventRecordingEnabled(ctx) to avoid allocating the silence and label payloads when recording is disabled; specifically, in the Silencer.Mutes hot path, check eventrecorder.EventRecordingEnabled(ctx) before calling eventrecorder.NewSilenceMutedAlertEvent(...) and s.recorder.RecordEvent(ctx, ...), so the NewSilenceMutedAlertEvent(...) invocation and its arguments (including eventrecorder.SilenceAsProto(sil), fp, lset) are not created unless recording is enabled.
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In api/v2/api.go (around lines 332-333): deleteSilenceHandler currently starts its trace span from the raw request context, so event recording is skipped. Update the span creation to use eventrecorder.WithEventRecording(params.HTTPRequest.Context()), just like postAlertsHandler and postSilencesHandler do (i.e., call tracer.Start(eventrecorder.WithEventRecording(params.HTTPRequest.Context()), "api.deleteSilenceHandler") and defer span.End()), so API-driven silence deletions participate in event recording.

In config/config.go (around lines 293-296): the URL field in the webhook output struct currently uses *commoncfg.URL, which can expose credentials during serialization. Change the field to the secret wrapper (e.g., *commoncfg.SecretURL or the project's equivalent secret URL type) and keep the existing field tags (yaml:"url,omitempty" json:"url,omitempty") so that serialization masks or omits sensitive parts. Also search for usages of the URL field (constructors, parsers, validators) and update any places that instantiate or read it to use the SecretURL accessor methods (unwrap/secret getters) so runtime behavior remains the same.

In config/coordinator.go (around lines 150-170): in ApplyConfig, validate that the incoming conf is not nil at the top of the function and return a clear error instead of proceeding. Specifically, before assigning c.config = conf or calling c.notifySubscribers(), check if conf == nil, log an error (including context like c.configFilePath), set c.configSuccessMetric.Set(0), and return an error to avoid the subsequent panic when reading c.config.original. Keep the existing notifySubscribers and success-metric updates unchanged otherwise.

In eventrecorder/eventrecorder.go (around lines 341-376): the current eventRecorderConfigEqual function omits HTTPConfig, so changes to TLS/auth/proxy/headers don't trigger rebuilds. Update eventRecorderConfigEqual (used to compare config.EventRecorderConfig Outputs) to include a comparison of each output's HTTPConfig, either via a deep/equality check (e.g., reflect.DeepEqual on oa.HTTPConfig vs ob.HTTPConfig) or by
explicitly comparing relevant fields (CA/cert/key, client cert/key, bearer
token/basic auth, proxy URL, custom headers, insecure skip verify, timeouts)
while handling nil pointers, and return false if any HTTPConfig differences are
found.
- Around line 425-430: The iteration over a Go map in LabelSetAsProto yields
non-deterministic ordering; fix by collecting all keys from ls,
sort.Strings(keys), then iterate the sorted keys to build pairs so Labels are
consistently ordered; apply the same pattern to the corresponding proto->model
helper around lines 557-566 (the LabelSet proto conversion function) so both
serialization and deserialization preserve deterministic ordering.
In `@eventrecorder/file.go`:
- Around line 58-61: NewFileOutput currently starts watchLoop(ready) and waits
on ready, but watchLoop swallows startup failures; modify watchLoop to report
startup errors back (e.g., change ready to a struct or use a separate error
channel like readyErr chan error) and have watchLoop send nil on success or an
error on failure; in NewFileOutput, wait for that response and return (nil, err)
when watchLoop reports a startup error (or the error from directory
registration/abs path resolution), ensuring NewFileOutput fails fast instead of
returning a healthy-looking fo; update any other similar call sites (lines
83-110 behavior) to use the same ready/readyErr pattern so watcher-start errors
are propagated.
- Around line 145-150: Close currently signals fo.done but doesn't wait for the
watcher goroutine (watchLoop) to exit, allowing reopen to run after Close and
leak/keep the file open; fix by adding a sync.WaitGroup (or similar) to
FileOutput, call wg.Add(1) when spawning watchLoop and defer wg.Done() inside
it, then change FileOutput.Close to close(fo.done) followed by wg.Wait() before
acquiring fo.mu and closing fo.f; additionally guard reopen so it no-ops if
Close has completed (e.g., set a mu-protected closed bool or use sync.Once) to
ensure reopen cannot run after Close has finished.
In `@eventrecorder/webhook.go`:
- Around line 91-100: The code uses cfg.URL.String() (assigned to urlStr) as the
WebhookOutput.url and in labels/logs (e.g., dropsCounter.WithLabelValues and
Name()), which can leak secrets; replace use of the raw URL with a sanitized or
opaque identifier: compute a safeID (for example, a host-only or hashed
representation derived from cfg.URL without userinfo, path, or query) and set
WebhookOutput.url (or add a new field like id) to that safeID, then use that
safeID in dropsCounter.WithLabelValues(fmt.Sprintf("webhook:%s", safeID)), in
Name(), and in any warning logs; ensure the original cfg.URL is only used
internally for the actual request and never logged or used in metrics.
- Around line 150-163: postWithRetry is blocking shutdown because retries (calls
to post, client.Timeout waits and time.Sleep backoffs) don't observe the
WebhookOutput shutdown signal; modify postWithRetry (and post) to accept and use
a cancelable context (or watch the existing done channel) so each attempt and
any backoff sleeps select on ctx.Done() (or done) and return early; ensure
Close/ApplyConfig cancels that context (or closes done) and that any flush path
enforces an overall deadline so waits won't block shutdown indefinitely; update
references to postWithRetry, post, done, maxRetries, retryBackoff, and
maxBackoff accordingly.
- Around line 120-127: SendEvent currently always returns nil even when the
non-blocking send to the wo.work channel fails, causing upstream sendToOutputs
to treat dropped events as successful; change SendEvent to return a non-nil
error when the default branch is taken (i.e., when the queue is full).
Specifically, in SendEvent (function name: SendEvent, channel: wo.work, metric:
wo.drops, logger: wo.logger) increment wo.drops and log the drop as you do now,
but also return an error (e.g., errors.New("webhook queue full")) so
sendToOutputs sees the failure and records the write as unsuccessful. Ensure the
successful enqueue path still returns nil.
In `@provider/mem/mem.go`:
- Around line 126-127: NewAlerts currently doesn't ensure the Recorder field is
non-nil, and calls to a.recorder.RecordEvent (e.g., in Alerts methods referenced
by NewAlerts and at the call site a.recorder.RecordEvent) will panic if nil;
update NewAlerts to guard/default the recorder (e.g., accept a nil recorder and
replace it with a no-op/fake recorder) or add nil checks before calling
a.recorder.RecordEvent in the Alerts methods so they skip recording when
recorder == nil, referencing NewAlerts and the a.recorder.RecordEvent calls to
locate where to apply the fix.
In `@silence/silence.go`:
- Around line 817-828: The setSilence function currently ignores the first
return value from s.st.merge(msil, now) so callers can't know if the merge
actually changed state; change the call to capture the "changed" (or "merged")
boolean returned by s.st.merge and use that flag to decide whether to run
s.indexSilence(msil.Silence), s.updateSizeMetrics(), and s.broadcast(b), then
return that changed flag (or include it in the returned added/changed values)
from setSilence; apply the same pattern to the analogous code around lines
861-867 so event emission and broadcasts only occur when s.st.merge reports a
real change.
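The HTTPConfig comparison suggested for `eventrecorder.go` above can be sketched with a deep-equality check. This is a minimal, self-contained illustration: the `outputConfig` and `httpConfig` types and their fields are hypothetical stand-ins, not Alertmanager's actual structs.

```go
package main

import (
	"fmt"
	"reflect"
)

// httpConfig is an illustrative stand-in for an output's HTTP client settings.
type httpConfig struct {
	BearerToken        string
	ProxyURL           string
	InsecureSkipVerify bool
}

// outputConfig is an illustrative stand-in for one event recorder output.
type outputConfig struct {
	Type       string
	URL        string
	HTTPConfig *httpConfig
}

// outputsEqual compares two output slices including HTTPConfig.
// reflect.DeepEqual treats two nil pointers as equal and otherwise
// compares the pointed-to values field by field, so nil handling is free.
func outputsEqual(a, b []outputConfig) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Type != b[i].Type || a[i].URL != b[i].URL {
			return false
		}
		if !reflect.DeepEqual(a[i].HTTPConfig, b[i].HTTPConfig) {
			return false
		}
	}
	return true
}

func main() {
	a := []outputConfig{{Type: "webhook", URL: "http://x", HTTPConfig: &httpConfig{BearerToken: "t1"}}}
	b := []outputConfig{{Type: "webhook", URL: "http://x", HTTPConfig: &httpConfig{BearerToken: "t2"}}}
	fmt.Println(outputsEqual(a, a)) // true
	fmt.Println(outputsEqual(a, b)) // false: bearer token differs
}
```

With this shape, a changed TLS/auth/proxy setting makes the comparison return false, which is exactly what triggers the recorder rebuild the review asks for.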
---
Outside diff comments:
In `@dispatch/dispatch.go`:
- Around line 819-838: The current code calls
ag.recordResolvedEvents(resolvedSlice) before confirming the alerts were
actually removed; move the record/emit step to after the call to
ag.alerts.DeleteIfNotModified and only record/emit events for those fingerprints
that were confirmed deleted (i.e., for each alert in resolvedSlice, call
ag.alerts.Get(alert.Fingerprint()) and if errors.Is(err, store.ErrNotFound) then
call ag.recordResolvedEvents (or record per-fingerprint) and ag.marker.Delete);
adjust notify/DeleteIfNotModified handling so you do not emit resolved events
for alerts that remained in the store.
- Around line 771-781: ag.alerts.Set currently succeeds for both new and
refreshed alerts, so change the logic around ag.alerts.Set in dispatch.go to
only call ag.recorder.RecordEvent(notify.NewAlertGroupedEvent(...)) when the
alert is actually new to the aggregation group: perform a pre- or post-check of
membership (e.g., call a membership helper such as ag.alerts.Exists/IsMember or
change Set to return a boolean like isNew) and only call
ag.recorder.RecordEvent(ctx, notify.NewAlertGroupedEvent(ag.alertGroupInfo(),
alert)) when that check indicates the alert was newly added to the group; leave
the error handling path and span logging unchanged.
In `@inhibit/inhibit.go`:
- Around line 210-221: Only record the InhibitionMutedAlertEvent when the marker
actually transitions to inhibited: change the logic around
ih.marker.SetInhibited(fp, ...) so you detect a state change (either by using a
returned bool from SetInhibited indicating it newly became inhibited, or by
checking prior state with ih.marker.IsInhibited(fp) and then calling
SetInhibited) and only call ih.recorder.RecordEvent(...
eventrecorder.NewInhibitionMutedAlertEvent(...)) when that transition occurred;
keep the span.AddEvent behavior as needed but gate the recorder.RecordEvent
behind the transition check.
---
Nitpick comments:
In `@silence/silence_test.go`:
- Around line 646-647: The test call to s.setSilence ignores the boolean that
indicates whether the silence was added; update the call in
TestSilencesSetSilence to capture the returned added value from
s.setSilence(s.toMeshSilence(sil), nowpb) and add an assertion that added is
true (e.g., require.True(t, added) or require.Equal(t, true, added)) so the
create path is verified and a no-op merge won't be mistaken for a successful
add.
In `@silence/silence.go`:
- Around line 280-282: Wrap the construction and call to
s.recorder.RecordEvent(...) with a cheap guard using
eventrecorder.EventRecordingEnabled(ctx) to avoid allocating the silence and
label payloads when recording is disabled; specifically, in the Silencer.Mutes
hot path, check eventrecorder.EventRecordingEnabled(ctx) before calling
eventrecorder.NewSilenceMutedAlertEvent(...) and s.recorder.RecordEvent(ctx,
...), so the NewSilenceMutedAlertEvent(...) invocation and its arguments
(including eventrecorder.SilenceAsProto(sil), fp, lset) are not created unless
recording is enabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc9b1c06-0883-4e86-8b27-cc9ea664bc9e
⛔ Files ignored due to path filters (2)
- eventrecorder/eventrecorderpb/eventrecorder.pb.go is excluded by !**/*.pb.go
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
api/v2/api.go, buf.yaml, cmd/alertmanager/main.go, config/config.go, config/coordinator.go, dispatch/dispatch.go, dispatch/dispatch_bench_test.go, dispatch/dispatch_test.go, eventrecorder/eventrecorder.go, eventrecorder/eventrecorder_test.go, eventrecorder/eventrecorderpb/eventrecorder.proto, eventrecorder/file.go, eventrecorder/file_test.go, eventrecorder/webhook.go, eventrecorder/webhook_test.go, featurecontrol/featurecontrol.go, go.mod, inhibit/inhibit.go, inhibit/inhibit_bench_test.go, inhibit/inhibit_test.go, notify/event.go, notify/mute_test.go, notify/notify.go, notify/notify_test.go, provider/mem/mem.go, provider/mem/mem_test.go, silence/silence.go, silence/silence_bench_test.go, silence/silence_test.go
func (wo *WebhookOutput) SendEvent(data []byte) error {
	select {
	case wo.work <- data:
	default:
		wo.drops.Inc()
		wo.logger.Warn("Event recorder webhook queue full, dropping event", "url", wo.url)
	}
	return nil
}
Webhook drops are reported upstream as successful writes.
This returns nil immediately—even when the queue is full, and before any HTTP delivery attempt. sendToOutputs in eventrecorder/eventrecorder.go treats that as "success" and increments alertmanager_event_recorder_bytes_written_total, so local drops and later retry exhaustion are invisible in the recorder's main metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/webhook.go` around lines 120 - 127, SendEvent currently always
returns nil even when the non-blocking send to the wo.work channel fails,
causing upstream sendToOutputs to treat dropped events as successful; change
SendEvent to return a non-nil error when the default branch is taken (i.e., when
the queue is full). Specifically, in SendEvent (function name: SendEvent,
channel: wo.work, metric: wo.drops, logger: wo.logger) increment wo.drops and
log the drop as you do now, but also return an error (e.g., errors.New("webhook
queue full")) so sendToOutputs sees the failure and records the write as
unsuccessful. Ensure the successful enqueue path still returns nil.
func (wo *WebhookOutput) postWithRetry(data []byte) {
	for attempt := range wo.maxRetries {
		err := wo.post(data)
		if err == nil {
			return
		}
		wo.logger.Warn("Event recorder webhook POST failed", "url", wo.url, "attempt", attempt+1, "err", err)
		if attempt < wo.maxRetries-1 {
			backoff := min(wo.retryBackoff<<attempt, wo.maxBackoff)
			time.Sleep(backoff)
		}
	}
	wo.logger.Error("Event recorder webhook POST failed after retries, dropping event", "url", wo.url, "retries", wo.maxRetries)
}
Close can hang reloads and shutdown behind retry backlog.
After Line 183 closes done, workers still run postWithRetry, including client.Timeout waits and time.Sleep backoff, until the queue is drained. With a full queue, Recorder.ApplyConfig or process shutdown can stall for a very long time. Please make retries cancelable on close or bound the flush deadline.
Also applies to: 181-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/webhook.go` around lines 150 - 163, postWithRetry is blocking
shutdown because retries (calls to post, client.Timeout waits and time.Sleep
backoffs) don't observe the WebhookOutput shutdown signal; modify postWithRetry
(and post) to accept and use a cancelable context (or watch the existing done
channel) so each attempt and any backoff sleeps select on ctx.Done() (or done)
and return early; ensure Close/ApplyConfig cancels that context (or closes done)
and that any flush path enforces an overall deadline so waits won't block
shutdown indefinitely; update references to postWithRetry, post, done,
maxRetries, retryBackoff, and maxBackoff accordingly.
recorder eventrecorder.Recorder,
r prometheus.Registerer,
Guard against nil recorder to prevent runtime panic.
a.recorder.RecordEvent(...) (Line 356) assumes a non-nil recorder, but NewAlerts(...) does not enforce/default it. A nil value will panic on first new alert.
🛡️ Suggested fix
func NewAlerts(
ctx context.Context,
m types.AlertMarker,
intervalGC time.Duration,
perAlertNameLimit int,
alertCallback AlertStoreCallback,
l *slog.Logger,
recorder eventrecorder.Recorder,
r prometheus.Registerer,
flagger featurecontrol.Flagger,
) (*Alerts, error) {
+ if recorder == nil {
+ recorder = eventrecorder.NopRecorder()
+ }
+
if alertCallback == nil {
alertCallback = noopCallback{}
}Also applies to: 151-152, 355-357
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@provider/mem/mem.go` around lines 126 - 127, NewAlerts currently doesn't
ensure the Recorder field is non-nil, and calls to a.recorder.RecordEvent (e.g.,
in Alerts methods referenced by NewAlerts and at the call site
a.recorder.RecordEvent) will panic if nil; update NewAlerts to guard/default the
recorder (e.g., accept a nil recorder and replace it with a no-op/fake recorder)
or add nil checks before calling a.recorder.RecordEvent in the Alerts methods so
they skip recording when recorder == nil, referencing NewAlerts and the
a.recorder.RecordEvent calls to locate where to apply the fix.
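The nil-guard pattern from the suggested fix can be sketched in isolation. The `Recorder` interface and `nopRecorder` type here are minimal stand-ins for the eventrecorder API, under the assumption that a no-op implementation exists (as `eventrecorder.NopRecorder()` in the suggested fix).

```go
package main

import "fmt"

// Recorder is a minimal stand-in for the event recorder interface.
type Recorder interface {
	RecordEvent(event string)
}

// nopRecorder discards every event; a safe default when recording is off.
type nopRecorder struct{}

func (nopRecorder) RecordEvent(string) {}

// newAlerts sketches the constructor guard: a nil recorder is replaced
// with a no-op so later RecordEvent calls cannot panic.
func newAlerts(recorder Recorder) Recorder {
	if recorder == nil {
		recorder = nopRecorder{}
	}
	return recorder
}

func main() {
	r := newAlerts(nil)
	r.RecordEvent("first alert") // no panic, silently dropped
	fmt.Println("ok")
}
```

Defaulting once in the constructor is preferable to sprinkling `if recorder != nil` at every call site: the hot paths stay branch-free and the invariant lives in one place.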
Addressed all review feedback since last round:
Renamed the entire feature to
Done. Adopted the message passing approach you suggested.
Done. Gated behind
Force-pushed 65f596e to ac7b8f5.
Addressed feedback from @coderabbitai:
Actionable comments posted: 2
🧹 Nitpick comments (6)
eventrecorder/eventrecorder_test.go (1)
111-121: Cover the rest of the zero-value recorder API.
`cmd/alertmanager/main.go` now also calls `ApplyConfig(...)` and `SetClusterPeer(...)` on a disabled recorder, but the no-op contract is only locked in for `RecordEvent(...)` and `Close()` here. Adding explicit smoke tests for those methods would catch startup regressions when the feature flag is off.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder_test.go` around lines 111 - 121, extend the tests to exercise the rest of the no-op recorder API by calling ApplyConfig(...) and SetClusterPeer(...) on both the NopRecorder() instance and the zero-value Recorder variable; ensure these calls do not panic and, where applicable, return nil/error as expected, so add assertions in TestNopRecorderDoesNotPanic and TestZeroRecorderDoesNotPanic that invoke ApplyConfig and SetClusterPeer with representative arguments and assert safe no-op behavior.
eventrecorder/eventrecorder.go (1)
91-91: Fix formatting issue.
Static analysis (gofumpt) reports improper formatting at this line. This is likely a whitespace or blank line issue within the struct definition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder.go` at line 91, the struct field declaration for the instance field has improper spacing; locate the struct containing the field named "instance" in eventrecorder.go and normalize the spacing so the field reads with a single space between name and type (e.g., "instance string"), then run gofmt/gofumpt to ensure formatting passes static analysis.
eventrecorder/webhook_test.go (1)
80-82: Use `require.JSONEq` for comparing JSON payloads.
String equality comparison is fragile for JSON because field ordering may vary. The static analysis tool correctly identifies this.
🔧 Proposed fix
  mu.Lock()
- require.Equal(t, `{"test":"data"}`, string(received[0]))
+ require.JSONEq(t, `{"test":"data"}`, string(received[0]))
  mu.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/webhook_test.go` around lines 80 - 82, replace the fragile string equality assertion with a JSON-aware comparison: instead of require.Equal(t, `{"test":"data"}`, string(received[0])) use require.JSONEq(t, `{"test":"data"}`, string(received[0])) so the test (in webhook_test.go where mu and received are used) validates JSON semantics regardless of key ordering; keep the existing mu.Lock()/mu.Unlock() around the read of received[0].
silence/silence.go (1)
833-833: Remove extra blank line.
Static analysis (gofumpt) reports improper formatting. There's an extra blank line here.
🔧 Proposed fix
  return changed, added, nil
}
-
// Set the specified silence. If a silence with the ID already exists and the modification
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@silence/silence.go` at line 833, remove the extraneous blank line reported by gofumpt by deleting the empty line so the surrounding declaration/block is compact, ensuring the enclosing function or declaration follows gofumpt style.
eventrecorder/webhook.go (1)
140-148: `SendEvent` returns `nil` even when dropping events.
When the queue is full, `SendEvent` increments the drop counter and logs a warning but returns `nil`. This means `sendToOutputs` in `eventrecorder.go` will count the event as successfully recorded.
This appears to be intentional: the webhook has its own drop metric (`alertmanager_event_webhook_drops_total`), and the main recorder metrics track queue acceptance rather than final delivery. If you want the main metrics to reflect webhook queue saturation, consider returning an error here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/webhook.go` around lines 140 - 148, SendEvent currently swallows queue-full drops by always returning nil; change it so the default branch returns an error when the event is dropped (instead of nil) so callers like sendToOutputs can detect rejection. Locate SendEvent (method on WebhookOutput) and in the select's default case, after incrementing wo.drops and logging via wo.logger.Warn, return a descriptive error (e.g., fmt.Errorf or a package-level sentinel error) indicating the webhook queue is full; keep the success path returning nil when sending to wo.work succeeds.
eventrecorder/eventrecorderpb/eventrecorder.proto (1)
1-8: Address protobuf package versioning across all proto files.
Buf lint (STANDARD rules) reports that all protobuf packages should include version suffixes (e.g., `eventrecorder.v1`). This is a best practice for API evolution and applies repo-wide:
- eventrecorder → eventrecorder.v1
- clusterpb → clusterpb.v1
- nflogpb → nflogpb.v1
- silencepb → silencepb.v1
Additionally, buf reports directory mismatches for all packages, though these may be reconcilable with the current buf.yaml module configuration.
If adopting version suffixes, coordinate across all four proto modules in a single change. If these are intentional project decisions, add lint rule exceptions to `buf.yaml` to document the rationale.
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorderpb/eventrecorder.proto` around lines 1 - 8, Update the protobuf package naming to include a version suffix and keep Go package mapping in sync: change package eventrecorder to eventrecorder.v1 in eventrecorder.proto and update option go_package to include the v1 suffix (e.g., github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb/v1 or matching repo convention); ensure all other proto modules (clusterpb, nflogpb, silencepb) are changed together to use their .v1 package names and corresponding go_package paths, and update any inter-proto import references to the new package paths; if you intentionally do not want versioned packages, add explicit lint exceptions in buf.yaml documenting the decision instead of changing proto packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/config.go`:
- Around line 293-296: The struct field alignment in config.go is misformatted;
run gofumpt (or remove the extra spaces) to fix spacing so fields align with
gofumpt style—specifically adjust the fields Type, Path, URL, and HTTPConfig in
the output struct so the `URL` line does not contain extra alignment spaces and
matches the formatting of the other tags.
In `@notify/notify.go`:
- Line 1043: The code eagerly constructs a NotificationEvent via
NewNotificationEvent(ctx, sent, r.integration) even when recording is disabled;
change the call site in the emitter where r.recorder.RecordEvent(ctx,
NewNotificationEvent(...)) is used to first check r.recorder.Enabled(ctx) (or a
similar cheap boolean) and only call NewNotificationEvent when enabled, or wrap
event creation in a lazy helper like buildNotificationEventIfEnabled(ctx, sent)
that returns nil when recording is off; apply the same pattern to the other
emitters that call r.recorder.RecordEvent so we avoid allocating/copying the
protobuf payload on the default (disabled) path.
---
Nitpick comments:
In `@eventrecorder/eventrecorder_test.go`:
- Around line 111-121: Extend the tests to exercise the rest of the no-op
recorder API by calling ApplyConfig(...) and SetClusterPeer(...) on both the
NopRecorder() instance and the zero-value Recorder variable; ensure these calls
do not panic and, where applicable, return nil/error as expected (e.g., Validate
that calling rec.ApplyConfig(... ) and rec.SetClusterPeer(...) on Recorder and
on NopRecorder() completes without error or panic), so add assertions in
TestNopRecorderDoesNotPanic and TestZeroRecorderDoesNotPanic that invoke
ApplyConfig and SetClusterPeer with representative arguments and assert safe
no-op behavior.
In `@eventrecorder/eventrecorder.go`:
- Line 91: The struct field declaration for the instance field has improper
spacing; locate the struct containing the field named "instance" in
eventrecorder.go and normalize the spacing so the field reads with a single
space between name and type (e.g., "instance string"), then run gofmt/gofumpt to
ensure formatting passes static analysis.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 1-8: Update the protobuf package naming to include a version
suffix and keep Go package mapping in sync: change package eventrecorder to
eventrecorder.v1 in eventrecorder.proto and update option go_package to include
the v1 suffix (e.g.,
github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb/v1 or matching
repo convention); ensure all other proto modules (clusterpb, nflogpb, silencepb)
are changed together to use their .v1 package names and corresponding go_package
paths, and update any inter-proto import references to the new package paths; if
you intentionally do not want versioned packages, add explicit lint exceptions
in buf.yaml documenting the decision instead of changing proto packages.
In `@eventrecorder/webhook_test.go`:
- Around line 80-82: Replace the fragile string equality assertion with a
JSON-aware comparison: instead of require.Equal(t, `{"test":"data"}`,
string(received[0])) use require.JSONEq(t, `{"test":"data"}`,
string(received[0])) so the test (in webhook_test.go where mu and received are
used) validates JSON semantics regardless of key ordering; keep the existing
mu.Lock()/mu.Unlock() around the read of received[0].
In `@eventrecorder/webhook.go`:
- Around line 140-148: SendEvent currently swallows queue-full drops by always
returning nil; change it so the default branch returns an error when the event
is dropped (instead of nil) so callers like sendToOutputs can detect rejection.
Locate SendEvent (method on WebhookOutput) and in the select's default case,
after incrementing wo.drops and logging via wo.logger.Warn, return a descriptive
error (e.g., fmt.Errorf or a package-level sentinel error) indicating the
webhook queue is full; keep the success path returning nil when sending to
wo.work succeeds.
In `@silence/silence.go`:
- Line 833: Remove the extraneous blank line reported by gofumpt in silence.go
by deleting the empty line so the surrounding declaration/block is compact
(i.e., remove the extra blank line between the preceding statement and the next
token in the function or declaration where the report appears), ensuring the
function/method or declaration formatting (the enclosing function/block) follows
gofumpt style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dfc390ab-cce4-4be5-911e-dcac8a620540
⛔ Files ignored due to path filters (2)
- eventrecorder/eventrecorderpb/eventrecorder.pb.go is excluded by !**/*.pb.go
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
api/v2/api.go, buf.yaml, cmd/alertmanager/main.go, config/config.go, config/coordinator.go, dispatch/dispatch.go, dispatch/dispatch_bench_test.go, dispatch/dispatch_test.go, eventrecorder/eventrecorder.go, eventrecorder/eventrecorder_test.go, eventrecorder/eventrecorderpb/eventrecorder.proto, eventrecorder/file.go, eventrecorder/file_test.go, eventrecorder/webhook.go, eventrecorder/webhook_test.go, featurecontrol/featurecontrol.go, go.mod, inhibit/inhibit.go, inhibit/inhibit_bench_test.go, inhibit/inhibit_test.go, notify/event.go, notify/mute_test.go, notify/notify.go, notify/notify_test.go, provider/mem/mem.go, provider/mem/mem_test.go, silence/silence.go, silence/silence_bench_test.go, silence/silence_test.go
✅ Files skipped from review due to trivial changes (4)
- buf.yaml
- go.mod
- dispatch/dispatch_bench_test.go
- api/v2/api.go
🚧 Files skipped from review as they are similar to previous changes (9)
- notify/mute_test.go
- inhibit/inhibit_bench_test.go
- silence/silence_bench_test.go
- eventrecorder/file_test.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit_test.go
- eventrecorder/file.go
- dispatch/dispatch_test.go
- provider/mem/mem.go
	l.Info("Notify success")
}

r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration))
Avoid eager NotificationEvent construction on the default path.
NewNotificationEvent(ctx, sent, r.integration) is built before RecordEvent(...) gets a chance to notice that recording is disabled. Because cmd/alertmanager/main.go threads a zero-value recorder through notify.NewPipelineBuilder(...) when the feature flag is off, every successful notification still pays to allocate/copy the protobuf payload even though it will be dropped. Please add a cheap enabled check or a lazy builder helper before constructing the event, and reuse that pattern in the other emitters following the same shape.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notify/notify.go` at line 1043, The code eagerly constructs a
NotificationEvent via NewNotificationEvent(ctx, sent, r.integration) even when
recording is disabled; change the call site in the emitter where
r.recorder.RecordEvent(ctx, NewNotificationEvent(...)) is used to first check
r.recorder.Enabled(ctx) (or a similar cheap boolean) and only call
NewNotificationEvent when enabled, or wrap event creation in a lazy helper like
buildNotificationEventIfEnabled(ctx, sent) that returns nil when recording is
off; apply the same pattern to the other emitters that call
r.recorder.RecordEvent so we avoid allocating/copying the protobuf payload on
the default (disabled) path.
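The lazy-construction pattern the review asks for can be sketched as a cheap enabled check that gates event allocation. All names here (`recorder`, `emit`, `buildEvent`) are illustrative stand-ins; the real code would check the recorder/feature flag and call `NewNotificationEvent` only when recording is on.

```go
package main

import "fmt"

type event struct{ payload string }

type recorder struct{ enabled bool }

func (r *recorder) Enabled() bool { return r.enabled }

func (r *recorder) RecordEvent(e *event) {}

// builds counts payload constructions so the test below can prove the
// disabled path allocates nothing.
var builds int

// buildEvent stands in for NewNotificationEvent: an expensive copy into
// a protobuf-like payload.
func buildEvent() *event {
	builds++
	return &event{payload: "expensive protobuf copy"}
}

// emit guards construction behind the cheap boolean, so the default
// (recording disabled) path never pays for the payload.
func emit(r *recorder) {
	if !r.Enabled() {
		return
	}
	r.RecordEvent(buildEvent())
}

func main() {
	emit(&recorder{enabled: false})
	fmt.Println(builds) // 0: nothing allocated on the disabled path
	emit(&recorder{enabled: true})
	fmt.Println(builds) // 1
}
```

An equivalent shape is to pass `RecordEvent` a `func() *event` builder and let the recorder decide whether to invoke it; either way the allocation moves behind the flag.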
Force-pushed ac7b8f5 to fad8b84.
Addressed feedback from @coderabbitai:
Accepted — all items look good:
No remaining blockers from my side. The PR looks well-structured and the implementation is solid.
Actionable comments posted: 4
🧹 Nitpick comments (3)
eventrecorder/file.go (1)
146-151: Consider checking `fo.closed` before writing in `SendEvent`.
`SendEvent` doesn't check `fo.closed` before attempting to write. If called after `Close()`, it will return an error from `fo.f.Write()` on the closed file handle. While this works (the error is returned), checking `fo.closed` first would provide a clearer error message like "output closed" rather than a low-level file error.
♻️ Optional: Add closed check
  func (fo *FileOutput) SendEvent(data []byte) error {
  	fo.mu.Lock()
  	defer fo.mu.Unlock()
+ 	if fo.closed {
+ 		return errors.New("file output closed")
+ 	}
  	_, err := fo.f.Write(data)
  	return err
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/file.go` around lines 146 - 151, SendEvent should check the FileOutput.closed flag under the same mutex before attempting to write so callers get a clear "output closed" error instead of a low-level file error; inside SendEvent (which already locks fo.mu), if fo.closed is true return a descriptive error (or a package-level var like ErrOutputClosed) and only call fo.f.Write when not closed. Ensure the error type/message matches any existing Close() behavior and tests.
notify/event.go (1)
66-73: Sort muted alert hashes before appending them.
MutedAlerts(ctx)is a map, so this produces nondeterministicMutedAlertsordering. Two otherwise identical notification events can end up with different JSON, which makes downstream diffing and deduplication harder.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/event.go` around lines 66 - 73, extractMutedGroupedAlerts currently iterates over the map returned by MutedAlerts(ctx) yielding nondeterministic order; fix by collecting the map keys into a slice, sort them (e.g., sort.Strings), then iterate the sorted slice to append &eventrecorderpb.GroupedAlert{Hash: hash} to the result so the output is deterministic; update extractMutedGroupedAlerts to use this sorted-key approach while keeping the same return type.eventrecorder/eventrecorder_test.go (1)
201-212: Assert the sorted order directly here.
LabelSetAsProtopromises deterministic ordering, but this test collapses the result back into a map, so it would still pass if the helper regressed to Go's map iteration order. Please assertproto.Labelsin order.♻️ Possible assertion update
require.Len(t, proto.Labels, 2) - found := map[string]string{} - for _, lp := range proto.Labels { - found[lp.Key] = lp.Value - } - require.Equal(t, "bar", found["foo"]) - require.Equal(t, "qux", found["baz"]) + require.Equal(t, "baz", proto.Labels[0].Key) + require.Equal(t, "qux", proto.Labels[0].Value) + require.Equal(t, "foo", proto.Labels[1].Key) + require.Equal(t, "bar", proto.Labels[1].Value)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eventrecorder/eventrecorder_test.go` around lines 201 - 212, The test collapses proto.Labels into a map which masks ordering regressions; update TestLabelSetAsProto to assert the deterministic ordering promised by LabelSetAsProto by checking proto.Labels directly (e.g., verify proto.Labels[0].Key/Value and proto.Labels[1].Key/Value or compare the slice to an expected ordered slice) instead of converting to a map, referencing LabelSetAsProto and proto.Labels for locating the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@eventrecorder/eventrecorder.go`:
- Around line 252-260: The code currently closes the live outputs before
building the replacement, risking loss of all outputs if buildOutputs fails;
change the logic in the reload branch (where
eventRecorderConfigEqual(update.cfg, currentCfg) is checked) to first call
buildOutputs(update.cfg.Outputs, c.metrics, c.logger) into a newOutputs slice,
validate newOutputs (e.g., ensure len(newOutputs) > 0 or otherwise decide
fallback), and only after confirming the replacement is acceptable: close the
existing outputs (calling Close on each element), set outputs = newOutputs,
update currentCfg = update.cfg, and then log the reload via c.logger.Info; if
newOutputs is not acceptable, keep the old outputs and log the failure instead
of swapping/closing first.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 1-8: The proto package declaration is inconsistent: change the
line declaring "package eventrecorder;" to "package eventrecorderpb;" so the
proto package matches the directory/name pattern used elsewhere; ensure any
generated Go package references (e.g., option go_package =
"github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb";) remain
correct after this rename and update any imports that reference eventrecorder to
use eventrecorderpb.
In `@notify/event.go`:
- Around line 37-53: extractAlertGroupInfo currently rebuilds the group key from
a LabelSet (using groupKey.String() / groupKey.Hash()), which yields empty-set
metadata; instead pull and use the raw aggregation group key stored in context
by dispatch (the value obtained via AggrGroupID/its corresponding context entry)
when populating AlertGroupInfo. Update extractAlertGroupInfo to read the raw
group key from context (the stored aggregation key placed by dispatch), set
AlertGroupInfo.GroupKey to that raw string and compute GroupId from that raw key
(rather than calling groupKey.String() / groupKey.Hash()), keeping ReceiverName,
Matchers, RouteLabels and GroupUuid (aggrGroupID) as before.
In `@notify/notify.go`:
- Around line 283-303: WithRouteLabels and WithMutedAlerts are defined but never
invoked, so RouteLabels(ctx) and MutedAlerts(ctx) always return empty values
when notify/event.go builds events; update the context initialization in
dispatch/dispatch.go where the notification context is created to call
WithRouteLabels(ctx, <routeLabelSet>) and WithMutedAlerts(ctx, <mutedAlertSet>)
using the route's label set and the current muted alerts collection (or, if this
feature is unwanted, remove WithRouteLabels/WithMutedAlerts plus the
RouteLabels/MutedAlerts getters and their corresponding proto fields); locate
and modify the context-building code around dispatch logic to inject these
values so notify/event.go receives populated data.
---
Nitpick comments:
In `@eventrecorder/eventrecorder_test.go`:
- Around line 201-212: The test collapses proto.Labels into a map which masks
ordering regressions; update TestLabelSetAsProto to assert the deterministic
ordering promised by LabelSetAsProto by checking proto.Labels directly (e.g.,
verify proto.Labels[0].Key/Value and proto.Labels[1].Key/Value or compare the
slice to an expected ordered slice) instead of converting to a map, referencing
LabelSetAsProto and proto.Labels for locating the code.
In `@eventrecorder/file.go`:
- Around line 146-151: SendEvent should check the FileOutput.closed flag under
the same mutex before attempting to write so callers get a clear "output closed"
error instead of a low-level file error; inside SendEvent (which already locks
fo.mu), if fo.closed is true return a descriptive error (or a package-level var
like ErrOutputClosed) and only call fo.f.Write when not closed. Ensure the error
type/message matches any existing Close() behavior and tests.
In `@notify/event.go`:
- Around line 66-73: extractMutedGroupedAlerts currently iterates over the map
returned by MutedAlerts(ctx) yielding nondeterministic order; fix by collecting
the map keys into a slice, sort them (e.g., sort.Strings), then iterate the
sorted slice to append &eventrecorderpb.GroupedAlert{Hash: hash} to the result
so the output is deterministic; update extractMutedGroupedAlerts to use this
sorted-key approach while keeping the same return type.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e5468fb6-0156-4fb7-a5ec-01c9e0dfc092
⛔ Files ignored due to path filters (2)
- `eventrecorder/eventrecorderpb/eventrecorder.pb.go` is excluded by `!**/*.pb.go`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (29)
- api/v2/api.go
- buf.yaml
- cmd/alertmanager/main.go
- config/config.go
- config/coordinator.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- eventrecorder/eventrecorder.go
- eventrecorder/eventrecorder_test.go
- eventrecorder/eventrecorderpb/eventrecorder.proto
- eventrecorder/file.go
- eventrecorder/file_test.go
- eventrecorder/webhook.go
- eventrecorder/webhook_test.go
- featurecontrol/featurecontrol.go
- go.mod
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/event.go
- notify/mute_test.go
- notify/notify.go
- notify/notify_test.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
✅ Files skipped from review due to trivial changes (8)
- dispatch/dispatch_bench_test.go
- provider/mem/mem_test.go
- api/v2/api.go
- go.mod
- buf.yaml
- notify/mute_test.go
- eventrecorder/file_test.go
- dispatch/dispatch.go
🚧 Files skipped from review as they are similar to previous changes (9)
- notify/notify_test.go
- inhibit/inhibit_bench_test.go
- featurecontrol/featurecontrol.go
- inhibit/inhibit.go
- silence/silence_test.go
- silence/silence_bench_test.go
- cmd/alertmanager/main.go
- provider/mem/mem.go
- config/config.go
notify/notify.go
Outdated
```go
// WithMutedAlerts populates a context with a set of muted alert hashes.
func WithMutedAlerts(ctx context.Context, alerts map[uint64]struct{}) context.Context {
	return context.WithValue(ctx, keyMutedAlerts, alerts)
}

// MutedAlerts extracts a set of muted alert hashes from the context.
func MutedAlerts(ctx context.Context) (map[uint64]struct{}, bool) {
	v, ok := ctx.Value(keyMutedAlerts).(map[uint64]struct{})
	return v, ok
}

// WithRouteLabels populates a context with route-level labels.
func WithRouteLabels(ctx context.Context, rl model.LabelSet) context.Context {
	return context.WithValue(ctx, keyRouteLabels, rl)
}

// RouteLabels extracts route-level labels from the context.
func RouteLabels(ctx context.Context) (model.LabelSet, bool) {
	v, ok := ctx.Value(keyRouteLabels).(model.LabelSet)
	return v, ok
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for usages of WithRouteLabels and WithMutedAlerts
echo "=== WithRouteLabels usages ==="
rg -n 'WithRouteLabels\(' --type=go
echo ""
echo "=== WithMutedAlerts usages ==="
rg -n 'WithMutedAlerts\(' --type=go

Repository: prometheus/alertmanager
Length of output: 338
🏁 Script executed:
#!/bin/bash
# Search for usages of RouteLabels and MutedAlerts getter functions
echo "=== RouteLabels() usages ==="
rg -n 'RouteLabels\(' --type=go -A 2
echo ""
echo "=== MutedAlerts() usages ==="
rg -n 'MutedAlerts\(' --type=go -A 2

Repository: prometheus/alertmanager
Length of output: 1787
🏁 Script executed:
#!/bin/bash
# Read the relevant sections of notify/event.go
echo "=== notify/event.go around RouteLabels (line 42) ==="
sed -n '35,55p' notify/event.go
echo ""
echo "=== notify/event.go around MutedAlerts (line 68) ==="
sed -n '60,80p' notify/event.go

Repository: prometheus/alertmanager
Length of output: 1452
🏁 Script executed:
#!/bin/bash
# Check the dispatch/dispatch.go context setup to see where setters should be called
echo "=== dispatch/dispatch.go context setup ==="
sed -n '663,716p' dispatch/dispatch.go | head -60

Repository: prometheus/alertmanager
Length of output: 1836
WithRouteLabels and WithMutedAlerts are defined but never populated.
These context setters are never called anywhere in the codebase. However, their corresponding getters—RouteLabels() and MutedAlerts()—are actively used in notify/event.go (lines 42 and 68) to populate the notification event. This means RouteLabels and MutedAlerts will always be empty in the resulting event records. Either these setters should be called during context initialization in dispatch/dispatch.go, or the getters and proto fields should be removed if this feature is not intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notify/notify.go` around lines 283 - 303, WithRouteLabels and WithMutedAlerts
are defined but never invoked, so RouteLabels(ctx) and MutedAlerts(ctx) always
return empty values when notify/event.go builds events; update the context
initialization in dispatch/dispatch.go where the notification context is created
to call WithRouteLabels(ctx, <routeLabelSet>) and WithMutedAlerts(ctx,
<mutedAlertSet>) using the route's label set and the current muted alerts
collection (or, if this feature is unwanted, remove
WithRouteLabels/WithMutedAlerts plus the RouteLabels/MutedAlerts getters and
their corresponding proto fields); locate and modify the context-building code
around dispatch logic to inject these values so notify/event.go receives
populated data.
This commit adds a structured event recorder to Alertmanager.
Events are defined in a protobuf schema and fanned
out to one or more configured outputs (JSONL file, webhook). The
event recorder provides an audit trail for observability and
analytics, answering questions like "why did this alert fire?", "was
it silenced?", "which receiver got notified?", and "when did this
inhibition rule suppress it?".
The event recorder is gated behind --enable-feature=event-recorder.
When not enabled, a zero-value Recorder is used and all methods are
no-ops.
At this time, there is no filtering of events or event query API.
Downstream software is intended to consume the event recorder via the
webhook or file output and filter and store events in an appropriate
database as desired; Alertmanager itself stays out of this logic.
Ten event types are defined, covering the full alert lifecycle:
| Event | Emitted When |
|---|---|
| `AlertmanagerStartupEvent` | Alertmanager process starts |
| `AlertmanagerShutdownEvent` | Alertmanager process receives SIGTERM |
| `AlertCreatedEvent` | A new alert is inserted via the API |
| `AlertGroupedEvent` | An alert is added to an aggregation group |
| `AlertResolvedEvent` | An alert's end time passes during a flush |
| `NotificationEvent` | A notification is successfully delivered |
| `SilenceCreatedEvent` | A new silence is created |
| `SilenceUpdatedEvent` | An existing silence is updated in place |
| `SilenceMutedAlertEvent` | A silence matches and mutes an alert |
| `InhibitionMutedAlertEvent` | An inhibition rule suppresses an alert |
Each event carries a full protobuf-defined payload with alert details,
labels, annotations, fingerprints, group metadata, silence/inhibition
rule details, and receiver information as applicable.
The event recorder is configured via a new top-level `event_recorder`
section in `alertmanager.yml`, supporting multiple simultaneous
outputs:
```yaml
event_recorder:
outputs:
- type: file
path: /var/log/alertmanager/events.jsonl
- type: webhook
url: https://example.com/events
http_config: # optional, same as receiver http_config
tls_config:
cert_file: /path/to/cert
timeout: 10s # optional, default 10s
workers: 4 # optional, default 4
max_retries: 3 # optional, default 3
retry_backoff: 500ms # optional, default 500ms
```
The configuration is hot-reloadable: sending SIGHUP swaps outputs
atomically (old outputs are closed, new ones created).
The file output writes JSONL to a file. It uses fsnotify to watch the parent directory
for file renames and removes (e.g. logrotate), triggering an immediate
reopen. The constructor blocks until the watcher is ready, avoiding
races in tests.
The webhook output POSTs each event as a JSON body to a configured URL. A bounded worker
pool (configurable via `workers`) processes deliveries concurrently so
that a slow webhook does not block the event queue. Failed deliveries
are retried with exponential backoff (configurable via `max_retries`
and `retry_backoff`), capped at 30 seconds. When the internal queue
(capacity 1024) is full, events are dropped and counted via the
`alertmanager_event_recorder_webhook_drops_total` Prometheus counter.
`RecordEvent` never blocks the caller. Events are serialized with
`protojson.Marshal` and placed on a bounded in-memory channel (capacity
8,192, capping memory at roughly 4 MB). A background goroutine drains
the channel and writes to all configured outputs. If the channel is
full, the event is dropped and a metric is incremented. The event
recorder must never slow down alert processing.
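The never-block enqueue is the standard `select`-with-`default` pattern on a bounded channel. A minimal sketch, with illustrative names (`recorder`, `record`, `dropped`) rather than the PR's actual API:

```go
package main

import "fmt"

// recorder holds a bounded channel drained by a background writer.
type recorder struct {
	ch      chan []byte
	dropped int
}

// record never blocks the caller: if the channel is full, the event is
// dropped and a counter is incremented (a Prometheus counter in the
// real code) instead of waiting for the writer to catch up.
func (r *recorder) record(ev []byte) bool {
	select {
	case r.ch <- ev:
		return true
	default:
		r.dropped++
		return false
	}
}

func main() {
	// Capacity 2 stands in for the real capacity of 8,192; with no
	// drainer running, the third enqueue overflows and is dropped.
	r := &recorder{ch: make(chan []byte, 2)}
	for i := 0; i < 3; i++ {
		r.record([]byte("event"))
	}
	fmt.Println(r.dropped)
}
```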
Concurrency is handled entirely via message passing — no mutexes. The
`writeLoop` goroutine exclusively owns output state and config.
Config updates arrive via a `cfgUpdate` channel with an ack signal.
`cluster.Peer` uses `atomic.Pointer`. `Close` uses `sync.Once`.
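The ack'd config-update handoff can be sketched like this: the loop goroutine exclusively owns the config, and callers send an update plus a done channel, then wait for the ack. The names follow the description above (`cfgUpdate`, `writeLoop`) but the code is illustrative, not the PR's implementation.

```go
package main

import "fmt"

// cfgUpdate carries a new config plus an ack channel the loop closes
// once the swap is complete.
type cfgUpdate struct {
	cfg  string
	done chan struct{}
}

func writeLoop(updates <-chan cfgUpdate, stop <-chan struct{}, finalCfg chan<- string) {
	cfg := "initial"
	for {
		select {
		case u := <-updates:
			cfg = u.cfg // only this goroutine ever touches cfg: no mutex needed
			close(u.done)
		case <-stop:
			finalCfg <- cfg
			return
		}
	}
}

func main() {
	updates := make(chan cfgUpdate)
	stop := make(chan struct{})
	finalCfg := make(chan string, 1)
	go writeLoop(updates, stop, finalCfg)

	u := cfgUpdate{cfg: "reloaded", done: make(chan struct{})}
	updates <- u
	<-u.done // the caller returns only after the loop has applied the config
	close(stop)
	fmt.Println(<-finalCfg)
}
```

Because the ack is a channel close, the caller observes the swap with a plain receive and no shared state is read outside the loop goroutine.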
`RecordEvent` accepts a `context.Context`. Recording is opt-in:
callers must pass `eventrecorder.WithEventRecording(ctx)` to enable
recording. The API update path omits this to avoid flooding the event
queue when checking silences and inhibitions for status display.
Event construction and protobuf conversion helpers live in the
`eventrecorder` package: `NewAlertCreatedEvent`,
`NewSilenceMutedAlertEvent`, `NewSilenceCreatedEvent`,
`NewSilenceUpdatedEvent`, `NewInhibitionMutedAlertEvent`,
`SilenceAsProto`, `InhibitRuleAsProto`, `SilenceMatcherAsProto`,
`MatcherAsProto`, `MatchersAsProto`, `LabelSetAsProto`. Call sites
in `inhibit`, `silence`, and `provider/mem` are one-liners.
`extractEventType` uses proto reflection (`WhichOneof`) to derive the
event type name from the oneof field, avoiding a manual type switch.
Five Prometheus metrics are exposed:
- `alertmanager_events_recorded_total` (labels: `event_type`, `output`, `result`)
- `alertmanager_event_recorder_bytes_written_total` (labels: `event_type`, `output`)
- `alertmanager_events_dropped_total` (labels: `event_type`)
- `alertmanager_event_serialize_errors_total` (labels: `event_type`)
- `alertmanager_event_recorder_queue_length` (gauge)
- `alertmanager_event_recorder_webhook_drops_total` (labels: `output`)
The `output` label uses a stable identifier (e.g. `file:/path` or
`webhook:https://host/path`) rather than a positional index, so it
remains meaningful across config reloads. Webhook URLs are sanitized
to strip userinfo and query parameters before use in identifiers,
metrics labels, and log messages.
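Stripping userinfo and query parameters from a URL is a few lines with `net/url`. A sketch of the sanitization step, with an illustrative helper name:

```go
package main

import (
	"fmt"
	"net/url"
)

// sanitizeWebhookURL drops credentials and query parameters so the URL
// is safe to embed in metric labels and log messages.
func sanitizeWebhookURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.User = nil    // drop user:password
	u.RawQuery = "" // drop ?token=...
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	id, err := sanitizeWebhookURL("https://user:secret@example.com/events?token=abc")
	if err != nil {
		panic(err)
	}
	fmt.Println("webhook:" + id) // stable identifier across config reloads
}
```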
Both `SilenceCreated` and `SilenceUpdated` events are recorded at the
same level in the `Set` method, rather than split between `setSilence`
(for create) and `Set` (for update). `setSilence` returns
`(changed, added bool, err error)` so the caller can decide what to
record and whether to broadcast.
Several new context keys were added to `notify/notify.go` to thread
dispatch-time metadata through the notification pipeline:
- `WithGroupMatchers` / `GroupMatchers`: route-level matchers
- `WithRouteLabels` / `RouteLabels`: route-level labels
- `WithFlushId` / `FlushId`: per-flush monotonic counter
- `WithAggrGroupId` / `AggrGroupId`: aggregation group UUID
- `WithMutedAlerts` / `MutedAlerts`: muted alert fingerprints
- `WithNotificationReason` / `NotificationReason`: trigger reason
These are populated in `dispatch/dispatch.go` where the aggregation
group context is set up, and consumed in `notify/event.go` to build
rich `NotificationEvent` and `AlertGroupedEvent` payloads.
Unit tests cover the eventrecorder package:
- `eventrecorder/eventrecorder_test.go`: RecordEvent delivery,
NopRecorder, zero-value Recorder, WithEventRecording,
ApplyConfig no-op, extractEventType, LabelSetAsProto,
MatcherAsProto, MatchersAsProto, eventRecorderConfigEqual.
- `eventrecorder/file_test.go`: SendEvent, reopen-after-rename,
reopen-after-remove, Close, invalid path.
- `eventrecorder/webhook_test.go`: webhook delivery via
httptest.Server, concurrent workers, retry on failure, drop after
max retries, and graceful close draining.
Additionally, tested with a live Alertmanager instance configured with
JSONL file output, a webhook endpoint, inhibition rules, and silences.
All 10 event types were verified:
1. Started Alertmanager with `--enable-feature=event-recorder` and
`event_recorder` configured. Verified
`alertmanagerStartupEvent`.
2. Fired test alerts. Verified `alertCreated` and `alertGrouped`.
3. Fired `CriticalAlert` (severity=critical) and `WarningAlert`
(severity=warning) with an inhibition rule suppressing warnings
when a matching critical alert exists. Verified
`inhibitionMutedAlert` with correct rule details and source
fingerprint.
4. Created a silence matching `SilencedAlert`, then fired
`SilencedAlert`. Verified `silenceCreated` and
`silenceMutedAlert` with silence ID and matchers.
5. Updated an active silence via the v2 API, preserving matchers and
extending the end time (satisfying `canUpdate`). Verified
`silenceUpdated` with the updated silence details and same ID.
6. Waited for notification delivery. Verified `notification` events
with receiver name, integration type, and alert details.
7. Fired a short-lived alert (5-second TTL), waited for flush.
Verified `alertResolved`.
8. Sent SIGTERM. Verified `alertmanagerShutdownEvent`.
9. Tested hot-reload: started with file-only config, sent SIGHUP with
file+webhook config, verified new events appeared on the webhook
and old file continued receiving events.
10. Verified feature flag gating: started without
`--enable-feature=event-recorder`, confirmed no event file was
created and no events were recorded.
Signed-off-by: Jack Rosenthal <jack@rosenth.al>
fad8b84 to 9057f64
Addressed feedback from @coderabbitai:
Accepted — all items look good:
No remaining blockers from my side. The PR is in great shape.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inhibit/inhibit.go (1)
210-223: ⚠️ Potential issue | 🟠 Major — Record all inhibiting matches, not just the first one.

`hasEqual()` only returns a single source fingerprint, and this branch emits/returns on the first matching rule. If the target is suppressed by multiple rules or multiple source alerts, the `InhibitionMutedAlertEvent` will under-report the inhibition context even though the schema is plural. Please either collect all matches before returning or make the event schema singular.
🧹 Nitpick comments (3)
notify/mute.go (1)

97-106: Sort muted alert hashes for deterministic ordering.

This stores hashes in a map, while `extractMutedGroupedAlerts` in `notify/event.go:67` ranges that map directly. Go map iteration order is undefined, causing `NotificationEvent.muted_alerts` order to vary between runs. Unlike `firing_alerts` and `resolved_alerts`, which use ordered slices (lines 56–58), muted alerts lose ordering. Since `muted_alerts` is a `repeated GroupedAlert` field in the proto, serialization order matters for deterministic event logs and webhooks.

Either sort the hashes before building the result or use an ordered slice alongside the set.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notify/mute.go` around lines 97 - 106, The map of muted alert hashes created in MutedAlerts/WithMutedAlerts (using hashAlert) produces nondeterministic iteration order later in extractMutedGroupedAlerts and causes NotificationEvent.muted_alerts to be emitted in variable order; fix by preserving deterministic ordering: alongside the set/map stored in ctx, also store an ordered slice of uint64 hash keys (or when reading the map in extractMutedGroupedAlerts, collect keys into a slice and sort them) so that the construction path that builds NotificationEvent.muted_alerts (referenced by extractMutedGroupedAlerts and NotificationEvent.muted_alerts) iterates a sorted list rather than ranging the map. Ensure the change updates both the storage site (WithMutedAlerts / MutedAlerts usage where mutedHashes are created) and the consumer (extractMutedGroupedAlerts) to use the ordered slice or sorted keys.

eventrecorder/file.go (1)

81-83: Consider logging the close error in reopen.

The error from closing the old file handle is silently ignored. While not critical (the new file is opened regardless), logging it could help diagnose issues like ENOSPC when flushing buffered data.

🔧 Optional improvement

```diff
 if fo.f != nil {
-	fo.f.Close()
+	if err := fo.f.Close(); err != nil {
+		fo.logger.Warn("Failed to close old event recorder file", "path", fo.path, "err", err)
+	}
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@eventrecorder/file.go` around lines 81 - 83, The call to fo.f.Close() in the reopen code path currently ignores any returned error; change it to capture the error (err := fo.f.Close()) and log it instead of discarding it. In the reopen function (the block calling fo.f.Close()) emit a descriptive log entry including context like "closing old file" and the error value using the package/logger already used by eventrecorder (e.g., the existing logger or fmt/log package) so ENOSPC or flush failures are recorded for diagnostics.

notify/notify.go (1)

1035-1035: Add early check to avoid allocating event payload when recording is disabled.

`NewNotificationEvent(ctx, sent, r.integration)` allocates memory before `RecordEvent` checks if recording is enabled. Use the existing `EventRecordingEnabled(ctx)` helper to skip construction when recording is disabled.

♻️ Suggested optimization

```diff
+if eventrecorder.EventRecordingEnabled(ctx) {
 	r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration))
+}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notify/notify.go` at line 1035, Add an early guard using EventRecordingEnabled(ctx) before constructing the event to avoid allocating NewNotificationEvent(ctx, sent, r.integration) when recording is disabled; i.e., check EventRecordingEnabled(ctx) and only call r.recorder.RecordEvent(ctx, NewNotificationEvent(ctx, sent, r.integration)) when that helper returns true so the event payload is not allocated unnecessarily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@eventrecorder/eventrecorderpb/eventrecorder.proto`:
- Around line 3-8: The proto package declaration "package eventrecorderpb"
violates Buf layout/version rules; either move this file into a top-level
directory named "eventrecorderpb" to match the package, or update the proto to a
versioned package (e.g., change the package to "eventrecorderpb.v1") and adjust
the accompanying option go_package and any imports/refs accordingly so symbols
(package eventrecorderpb -> eventrecorderpb.v1 and option go_package) and
client/server code point to the new package; alternatively if you intentionally
want to bypass the rule, add a buflint disable directive for layout/version at
the top of the file before merging.
In `@eventrecorder/file.go`:
- Around line 126-130: The code currently ignores the error from
filepath.Abs(event.Name) which can produce an empty/malformed evPath and
silently skip real rotations; update the block that computes evPath (where
filepath.Abs(event.Name) is called and evPath/absPath are compared) to handle
the error: if filepath.Abs returns an error, fall back to using a cleaned
event.Name (e.g., filepath.Clean(event.Name)) as evPath (or otherwise derive an
absolute path fallback) instead of ignoring the event, and optionally log the
Abs error so failures are visible; ensure the subsequent comparison against
absPath still uses the normalized evPath.
---
Nitpick comments:
In `@eventrecorder/file.go`:
- Around line 81-83: The call to fo.f.Close() in the reopen code path currently
ignores any returned error; change it to capture the error (err := fo.f.Close())
and log it instead of discarding it. In the reopen function (the block calling
fo.f.Close()) emit a descriptive log entry including context like "closing old
file" and the error value using the package/logger already used by eventrecorder
(e.g., the existing logger or fmt/log package) so ENOSPC or flush failures are
recorded for diagnostics.
In `@notify/mute.go`:
- Around line 97-106: The map of muted alert hashes created in
MutedAlerts/WithMutedAlerts (using hashAlert) produces nondeterministic
iteration order later in extractMutedGroupedAlerts and causes
NotificationEvent.muted_alerts to be emitted in variable order; fix by
preserving deterministic ordering: alongside the set/map stored in ctx, also
store an ordered slice of uint64 hash keys (or when reading the map in
extractMutedGroupedAlerts, collect keys into a slice and sort them) so that the
construction path that builds NotificationEvent.muted_alerts (referenced by
extractMutedGroupedAlerts and NotificationEvent.muted_alerts) iterates a sorted
list rather than ranging the map. Ensure the change updates both the storage
site (WithMutedAlerts / MutedAlerts usage where mutedHashes are created) and the
consumer (extractMutedGroupedAlerts) to use the ordered slice or sorted keys.
In `@notify/notify.go`:
- Line 1035: Add an early guard using EventRecordingEnabled(ctx) before
constructing the event to avoid allocating NewNotificationEvent(ctx, sent,
r.integration) when recording is disabled; i.e., check
EventRecordingEnabled(ctx) and only call r.recorder.RecordEvent(ctx,
NewNotificationEvent(ctx, sent, r.integration)) when that helper returns true so
the event payload is not allocated unnecessarily.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4339101e-a846-405f-8428-348194ae0d09
⛔ Files ignored due to path filters (2)
- `eventrecorder/eventrecorderpb/eventrecorder.pb.go` is excluded by `!**/*.pb.go`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (30)
- api/v2/api.go
- buf.yaml
- cmd/alertmanager/main.go
- config/config.go
- config/coordinator.go
- dispatch/dispatch.go
- dispatch/dispatch_bench_test.go
- dispatch/dispatch_test.go
- eventrecorder/eventrecorder.go
- eventrecorder/eventrecorder_test.go
- eventrecorder/eventrecorderpb/eventrecorder.proto
- eventrecorder/file.go
- eventrecorder/file_test.go
- eventrecorder/webhook.go
- eventrecorder/webhook_test.go
- featurecontrol/featurecontrol.go
- go.mod
- inhibit/inhibit.go
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/event.go
- notify/mute.go
- notify/mute_test.go
- notify/notify.go
- notify/notify_test.go
- provider/mem/mem.go
- provider/mem/mem_test.go
- silence/silence.go
- silence/silence_bench_test.go
- silence/silence_test.go
✅ Files skipped from review due to trivial changes (7)
- go.mod
- dispatch/dispatch_bench_test.go
- api/v2/api.go
- notify/mute_test.go
- provider/mem/mem_test.go
- eventrecorder/file_test.go
- notify/event.go
🚧 Files skipped from review as they are similar to previous changes (10)
- buf.yaml
- inhibit/inhibit_bench_test.go
- inhibit/inhibit_test.go
- notify/notify_test.go
- dispatch/dispatch_test.go
- silence/silence_test.go
- provider/mem/mem.go
- dispatch/dispatch.go
- silence/silence_bench_test.go
- eventrecorder/eventrecorder.go
```proto
package eventrecorderpb;

import "google/protobuf/duration.proto";
import "google/protobuf/timestamp.proto";

option go_package = "github.com/prometheus/alertmanager/eventrecorder/eventrecorderpb";
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
buf lint eventrecorder/eventrecorderpb/eventrecorder.proto

Repository: prometheus/alertmanager
Length of output: 415
Proto package fails Buf's layout and version lint rules.
The file has two blocking lint failures:
- Package "eventrecorderpb" must be in a directory named "eventrecorderpb" relative to root, but it's in a subdirectory.
- Package name should be versioned, e.g., "eventrecorderpb.v1".
Either restructure the package to match Buf's directory layout rules or explicitly disable these lint rules before merging.
🧰 Tools
🪛 Buf (1.66.1)
[error] 3-3: Files with package "eventrecorderpb" must be within a directory "eventrecorderpb" relative to root but were in directory ".".
(PACKAGE_DIRECTORY_MATCH)
[error] 3-3: Package name "eventrecorderpb" should be suffixed with a correctly formed version, such as "eventrecorderpb.v1".
(PACKAGE_VERSION_SUFFIX)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/eventrecorderpb/eventrecorder.proto` around lines 3 - 8, The
proto package declaration "package eventrecorderpb" violates Buf layout/version
rules; either move this file into a top-level directory named "eventrecorderpb"
to match the package, or update the proto to a versioned package (e.g., change
the package to "eventrecorderpb.v1") and adjust the accompanying option
go_package and any imports/refs accordingly so symbols (package eventrecorderpb
-> eventrecorderpb.v1 and option go_package) and client/server code point to the
new package; alternatively if you intentionally want to bypass the rule, add a
buflint disable directive for layout/version at the top of the file before
merging.
```go
// Only react to events on our target file.
evPath, _ := filepath.Abs(event.Name)
if evPath != absPath {
	continue
}
```
Ignoring filepath.Abs error may cause missed file rotation events.
If filepath.Abs(event.Name) fails, evPath will be empty or malformed, and the comparison with absPath will never match—silently skipping legitimate rotation events for the target file.
🛠️ Suggested fix

```diff
 // Only react to events on our target file.
-evPath, _ := filepath.Abs(event.Name)
+evPath, err := filepath.Abs(event.Name)
+if err != nil {
+	continue
+}
 if evPath != absPath {
 	continue
 }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// Only react to events on our target file.
evPath, err := filepath.Abs(event.Name)
if err != nil {
	continue
}
if evPath != absPath {
	continue
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@eventrecorder/file.go` around lines 126 - 130, The code currently ignores the
error from filepath.Abs(event.Name) which can produce an empty/malformed evPath
and silently skip real rotations; update the block that computes evPath (where
filepath.Abs(event.Name) is called and evPath/absPath are compared) to handle
the error: if filepath.Abs returns an error, fall back to using a cleaned
event.Name (e.g., filepath.Clean(event.Name)) as evPath (or otherwise derive an
absolute path fallback) instead of ignoring the event, and optionally log the
Abs error so failures are visible; ensure the subsequent comparison against
absPath still uses the normalized evPath.
Sorry for the delay in reviews, I was away and kinda busy but I'm doing another review now.
This commit adds a structured event recorder to Alertmanager that
records events. Events are defined in a protobuf schema and fanned
out to one or more configured outputs (JSONL file, webhook). The
event recorder provides an audit trail for observability and
analytics, answering questions like "why did this alert fire?", "was
it silenced?", "which receiver got notified?", and "when did this
inhibition rule suppress it?".
The event recorder is gated behind --enable-feature=event-recorder.
When not enabled, a zero-value Recorder is used and all methods are
no-ops.
At this time, there is no filtering of events or event query API.
Downstream software is intended to consume the event recorder via the
webhook or file output and filter and store events in an appropriate
database as desired; Alertmanager itself stays out of this logic.
Ten event types are defined, covering the full alert lifecycle:
- `AlertmanagerStartupEvent`
- `AlertmanagerShutdownEvent`
- `AlertCreatedEvent`
- `AlertGroupedEvent`
- `AlertResolvedEvent`
- `NotificationEvent`
- `SilenceCreatedEvent`
- `SilenceUpdatedEvent`
- `SilenceMutedAlertEvent`
- `InhibitionMutedAlertEvent`

Each event carries a full protobuf-defined payload with alert details, labels, annotations, fingerprints, group metadata, silence/inhibition rule details, and receiver information as applicable.
The event recorder is configured via a new top-level `event_recorder` section in `alertmanager.yml`, supporting multiple simultaneous outputs.
The configuration is hot-reloadable: sending SIGHUP swaps outputs
atomically (old outputs are closed, new ones created).
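For illustration only, a configuration along these lines would exercise both output kinds (the key names are inferred from the options described in this PR; the actual schema may differ):

```yaml
# Hypothetical sketch of the event_recorder section; exact keys may differ.
event_recorder:
  outputs:
    - file:
        path: /var/log/alertmanager/events.jsonl
    - webhook:
        url: https://events.example.com/alertmanager
        workers: 4
        max_retries: 5
        retry_backoff: 1s
```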
The file output writes JSONL to a file. It uses fsnotify to watch the parent directory for file renames and removes (e.g. logrotate), triggering an immediate reopen. The constructor blocks until the watcher is ready, avoiding races in tests.
The webhook output POSTs each event as a JSON body to a configured URL. A bounded worker pool (configurable via `workers`) processes deliveries concurrently so that a slow webhook does not block the event queue. Failed deliveries are retried with exponential backoff (configurable via `max_retries` and `retry_backoff`), capped at 30 seconds. When the internal queue (capacity 1024) is full, events are dropped and counted via the `alertmanager_event_recorder_webhook_drops_total` Prometheus counter.

`RecordEvent` never blocks the caller. Events are serialized with `protojson.Marshal` and placed on a bounded in-memory channel (capacity
the channel and writes to all configured outputs. If the channel is
full, the event is dropped and a metric is incremented. The event
recorder must never slow down alert processing.
Concurrency is handled entirely via message passing; there are no mutexes. The `writeLoop` goroutine exclusively owns output state and config. Config updates arrive via a `cfgUpdate` channel with an ack signal. `cluster.Peer` uses `atomic.Pointer`. `Close` uses `sync.Once`.

`RecordEvent` accepts a `context.Context`. Recording is opt-in: callers must pass `eventrecorder.WithEventRecording(ctx)` to enable recording. The API update path omits this to avoid flooding the event queue when checking silences and inhibitions for status display.
Event construction and protobuf conversion helpers live in the `eventrecorder` package: `NewAlertCreatedEvent`, `NewSilenceMutedAlertEvent`, `NewSilenceCreatedEvent`, `NewSilenceUpdatedEvent`, `NewInhibitionMutedAlertEvent`, `SilenceAsProto`, `InhibitRuleAsProto`, `SilenceMatcherAsProto`, `MatcherAsProto`, `MatchersAsProto`, `LabelSetAsProto`. Call sites in `inhibit`, `silence`, and `provider/mem` are one-liners. `extractEventType` uses proto reflection (`WhichOneof`) to derive the event type name from the oneof field, avoiding a manual type switch.
Six Prometheus metrics are exposed:

- `alertmanager_events_recorded_total` (labels: `event_type`, `output`, `result`)
- `alertmanager_event_recorder_bytes_written_total` (labels: `event_type`, `output`)
- `alertmanager_events_dropped_total` (labels: `event_type`)
- `alertmanager_event_serialize_errors_total` (labels: `event_type`)
- `alertmanager_event_recorder_queue_length` (gauge)
- `alertmanager_event_recorder_webhook_drops_total` (labels: `output`)

The `output` label uses a stable identifier (e.g. `file:/path` or `webhook:https://host/path`) rather than a positional index, so it remains meaningful across config reloads. Webhook URLs are sanitized to strip userinfo and query parameters before use in identifiers, metrics labels, and log messages.
Both `SilenceCreated` and `SilenceUpdated` events are recorded at the same level in the `Set` method, rather than split between `setSilence` (for create) and `Set` (for update). `setSilence` returns `(changed, added bool, err error)` so the caller can decide what to record and whether to broadcast.
Several new context keys were added to `notify/notify.go` to thread dispatch-time metadata through the notification pipeline:

- `WithGroupMatchers`/`GroupMatchers`: route-level matchers
- `WithRouteLabels`/`RouteLabels`: route-level labels
- `WithFlushId`/`FlushId`: per-flush monotonic counter
- `WithAggrGroupId`/`AggrGroupId`: aggregation group UUID
- `WithMutedAlerts`/`MutedAlerts`: muted alert fingerprints
- `WithNotificationReason`/`NotificationReason`: trigger reason

These are populated in `dispatch/dispatch.go` where the aggregation group context is set up, and consumed in `notify/event.go` to build rich `NotificationEvent` and `AlertGroupedEvent` payloads.

Unit tests cover the eventrecorder package:
- `eventrecorder/eventrecorder_test.go`: RecordEvent delivery, NopRecorder, zero-value Recorder, WithEventRecording, ApplyConfig no-op, extractEventType, LabelSetAsProto, MatcherAsProto, MatchersAsProto, eventRecorderConfigEqual.
- `eventrecorder/file_test.go`: SendEvent, reopen-after-rename, reopen-after-remove, Close, invalid path.
- `eventrecorder/webhook_test.go`: webhook delivery via httptest.Server, concurrent workers, retry on failure, drop after max retries, and graceful close draining.
Additionally, tested with a live Alertmanager instance configured with
JSONL file output, a webhook endpoint, inhibition rules, and silences.
All 10 event types were verified:

- Started with `--enable-feature=event-recorder` and `event_recorder` configured. Verified `alertmanagerStartupEvent`.
- Fired alerts and verified `alertCreated` and `alertGrouped`.
- Fired `CriticalAlert` (severity=critical) and `WarningAlert` (severity=warning) with an inhibition rule suppressing warnings when a matching critical alert exists. Verified `inhibitionMutedAlert` with correct rule details and source fingerprint.
- Created a silence matching `SilencedAlert`, then fired `SilencedAlert`. Verified `silenceCreated` and `silenceMutedAlert` with silence ID and matchers.
- Updated the silence, extending the end time (satisfying `canUpdate`). Verified `silenceUpdated` with the updated silence details and same ID.
- Verified `notification` events with receiver name, integration type, and alert details.
- Verified `alertResolved` on resolution and `alertmanagerShutdownEvent` on shutdown.
- Reloaded with a `file`+`webhook` config, verified new events appeared on the webhook and old file continued receiving events.
- Ran without `--enable-feature=event-recorder`, confirmed no event file was created and no events were recorded.