feat(dispatch): add alert markers per group#5047

Open
siavashs wants to merge 6 commits into prometheus:main from siavashs:feat/group_markers

Conversation

@siavashs (Contributor) commented Feb 25, 2026

This change adds alert markers to the aggregation groups in the dispatcher.
Alert markers replace the global marker and are used to track the state of alerts within each aggregation group.

This change touches many components of Alertmanager.
Per-group alert markers are passed to the notifiers, and then to the inhibitor and silencer, via context.

The API has no breaking changes:

  • /alerts uses a temporary marker to track the state of alerts
  • /alerts/groups returns the status from group markers

The metrics are also updated to use group markers.
The alertmanager_alerts metric is moved to dispatcher.
The alertmanager_marked_alerts metric is removed.
By default the behavior is unchanged: all alerts are aggregated across groups.
Enabling the group-key-in-metrics feature flag causes the metrics to be grouped by group_key.
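
The effect of the flag on the alertmanager_alerts series can be illustrated without the Prometheus client. A hypothetical counting helper (group keys and states below are made up for illustration):

```go
package main

import "fmt"

// countByState tallies alerts per state, optionally splitting the tally by
// aggregation-group key, mimicking the group-key-in-metrics flag described
// above. With the flag off, each state yields a single aggregate series.
func countByState(groups map[string][]string, byGroupKey bool) map[string]int {
	counts := map[string]int{}
	for gk, states := range groups {
		for _, state := range states {
			key := state
			if byGroupKey {
				// One series per (state, group_key) pair.
				key = state + `{group_key="` + gk + `"}`
			}
			counts[key]++
		}
	}
	return counts
}

func main() {
	groups := map[string][]string{
		`{}/{env="prod"}`: {"active", "active", "suppressed"},
		`{}/{env="dev"}`:  {"active"},
	}
	// Flag off: active alerts from both groups collapse into one count.
	fmt.Println(countByState(groups, false)["active"]) // prints 3
}
```

Note the cardinality trade-off this implies: with the flag on, the series count scales with the number of aggregation groups.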

Pull Request Checklist

Please check all the applicable boxes.

Which user-facing changes does this PR introduce?

[FEATURE] Introduce per-aggregation-group AlertMarkers and drop the global alert marker
[CHANGE] Add `group-key-in-metrics` feature flag
[CHANGE] Remove `alertmanager_marked_alerts`
[CHANGE] Remove the following from `types` package: `MemMarker`, `AlertState*`, `AlertStatus`
[CHANGE] Move `AlertMarker`, `GroupMarker` to `marker` package

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional group-key labeling in dispatcher metrics via the group-key-in-metrics feature flag for enhanced alert group observability.
  • Bug Fixes

    • Improved alert status tracking accuracy through per-request marker-based resolution instead of callback-driven status derivation.

@siavashs changed the title from "feat(dispatch): add group markers" to "feat(dispatch): add alert markers per group" on Feb 25, 2026
@TheMeier (Contributor) commented Mar 1, 2026

Reviewed the new marker package standalone. LGTM

}

// SetInhibited implements AlertMarker.
func (m *alertMarker) SetInhibited(fp model.Fingerprint, inhibitedBy []string) {
Contributor:

Just out of curiosity: the previous SetInhibited used
func (m *MemMarker) SetInhibited(alert model.Fingerprint, ids ...string) {
and we changed it to []string. Is there a reason for the change?

Contributor (author):

Just for consistency in the interface API, for both SetInhibited and SetSilenced.
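
For illustration, the consistent shape this thread refers to might look like the following sketch. Fingerprint stands in for model.Fingerprint, and the map-backed implementation is hypothetical (locking omitted), not the PR's code:

```go
package main

import "fmt"

// Fingerprint stands in for model.Fingerprint.
type Fingerprint uint64

// AlertMarker sketches the consistent signatures discussed above: both
// setters take a []string rather than variadic ids.
type AlertMarker interface {
	SetSilenced(fp Fingerprint, silencedBy []string)
	SetInhibited(fp Fingerprint, inhibitedBy []string)
}

type memMarker struct {
	silenced  map[Fingerprint][]string
	inhibited map[Fingerprint][]string
}

func newMemMarker() *memMarker {
	return &memMarker{
		silenced:  map[Fingerprint][]string{},
		inhibited: map[Fingerprint][]string{},
	}
}

func (m *memMarker) SetSilenced(fp Fingerprint, silencedBy []string) {
	m.silenced[fp] = silencedBy
}

func (m *memMarker) SetInhibited(fp Fingerprint, inhibitedBy []string) {
	m.inhibited[fp] = inhibitedBy
}

func main() {
	var m AlertMarker = newMemMarker()
	m.SetSilenced(1, []string{"sil-1"})
	m.SetInhibited(1, []string{"inh-1", "inh-2"})
	fmt.Println("marker updated")
}
```

A slice parameter also lets callers pass nil explicitly to clear state, which a variadic signature makes less obvious.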

delete(m.groups, routeID+groupKey)
}

func (m *MemMarker) registerMetrics(r prometheus.Registerer) {
Contributor:

We should note in the release notes that we removed this metric, since it would be too expensive to calculate across all markers... unless, each time a marker changes, we were to add/remove the relevant number from the metric. But if it's not worth it or not useful, we should definitely say it's gone.

Contributor (author):

I put it in the PR description:

[CHANGE] Remove `alertmanager_marked_alerts`

Do we need to also update any MD files? cc @SoloJacobs

} else {
filtered = append(filtered, a)
}
// TODO(fabxc): increment muted alerts counter if muted.
Contributor:

Are this TODO and the one above still relevant?

Contributor (author):

I think he wanted to add/change the metrics here.
I'm not sure we still need it.
I can remove these, but my changes don't touch them.

@siavashs force-pushed the feat/group_markers branch from 724c634 to 9a5c33e on March 11, 2026 10:54
@siavashs siavashs requested a review from ultrotter March 11, 2026 16:19
@siavashs (Contributor, author) commented:

@coderabbitai review

@coderabbitai (bot) commented Mar 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai (bot) commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This pull request migrates Alertmanager's marker architecture from a global, shared marker to dedicated per-aggregation-group markers. The changes replace legacy callback-based alert status resolution with a context-propagated marker pattern, switch alert types from types.Alert to alert.Alert throughout the system, introduce a new marker package with AlertMarker and GroupMarker implementations, relocate dispatcher metrics to a new module, and remove marker dependencies from several subsystems (inhibitor, silencer, alert storage).

Changes

Cohort / File(s) Summary
API Layer Refactoring
api/api.go, api/v2/api.go, api/v2/api_test.go, api/v2/compat.go
Removed AlertStatusFunc callback from public API options. Switched alert status resolution from callback-driven to per-request marker.AlertMarker-based approach. Updated function signatures and imports to use alert.Alert instead of types.Alert. Updated test/benchmark code to match new alert types.
Marker Package (New)
marker/marker.go, marker/alert.go, marker/alert_test.go, marker/group.go, marker/group_test.go, marker/context.go, marker/status.go
Introduced new dedicated marker package with AlertMarker and GroupMarker interfaces. Implemented in-memory AlertMarker for tracking per-alert silenced/inhibited state and GroupMarker for mute intervals. Added context propagation utilities WithContext and FromContext for injecting markers into request pipelines. Comprehensive unit test coverage for all marker operations.
Dispatcher Architecture
dispatch/dispatch.go, dispatch/dispatch_test.go, dispatch/dispatch_bench_test.go
Switched from global types.GroupMarker to dedicated marker.GroupMarker. Changed aggregation groups to use per-group marker.AlertMarker created via marker.NewAlertMarker(). Updated Groups callback and alert filtering to use alert.Alert. Added AlertStatuses field to AlertGroup populated from per-group marker. Injected marker into request context via marker.WithContext(). Refactored tests to use new marker types and alert type.
Dispatcher Metrics
dispatch/metric.go, featurecontrol/featurecontrol.go
Added new dispatch/metric.go with DispatcherMetrics and NewDispatcherMetrics constructor. Implemented custom alertStateCollector that reads dispatcher state and emits alertmanager_alerts gauge with optional group-key labeling. Added EnableGroupKeyInMetrics() feature flag support. Constructor accepts registerer and feature flags, enabling/disabling metrics collection.
Notification Pipeline
notify/notify.go, notify/mute.go, notify/mute_test.go
Updated all Stage/Notifier/Integration signatures to use alert.Alert instead of types.Alert. Changed PipelineBuilder.New marker parameter from types.GroupMarker to marker.GroupMarker. Updated time-based mute stages to inject marker via context and defer SetMuted calls. Refactored tests to construct marker.NewGroupMarker() and use marker-enabled contexts.
Inhibit Module
inhibit/inhibit.go, inhibit/inhibit_bench_test.go, inhibit/inhibit_test.go
Removed marker types.AlertMarker field from Inhibitor struct. Updated NewInhibitor to omit marker parameter. Changed Mutes to retrieve marker from context via marker.FromContext() and defer SetInhibited() calls. Updated tests to use marker.NewAlertMarker() and inject via context. Removed benchmark marker initialization.
Silence Module
silence/silence.go, silence/silence_bench_test.go, silence/silence_test.go
Removed marker types.AlertMarker field from Silencer. Updated NewSilencer signature to omit marker. Changed Mutes to retrieve marker from context and defer SetSilenced() calls. Updated callback signatures to use alert.Alert. Removed benchmark marker setup and validation logic. Refactored tests to use marker.NewAlertMarker() with context injection.
Alert Storage
provider/mem/mem.go, provider/mem/mem_test.go
Removed marker types.AlertMarker field and constructor parameter from NewAlerts. Deleted alert-count-by-state collector and related metric registration logic. Simplified garbage collection to skip marker deletion. Updated AlertStoreCallback method signatures to use alert.Alert. Removed marker-dependent test assertions and cleanup logic.
Legacy Types Cleanup
types/types.go, types/types_test.go
Removed entire marker API surface from types package: deleted AlertState/AlertStatus aliases, AlertMarker/GroupMarker interfaces, NewMarker constructor, MemMarker implementation, and all marker methods. Deleted corresponding test file. Retained only deprecated alert structure aliases.
Alert Type Extension
alert/state.go
Added Compare(other AlertState) int method to AlertState defining priority ordering: suppressed > active > unprocessed. Returns -1/0/1 for lower/equal/higher priority comparisons.
Main Wiring
cmd/alertmanager/main.go
Updated marker initialization to use marker.NewGroupMarker() instead of types.NewMarker(). Removed marker propagation from Silencer, mem.NewAlerts, and Inhibitor constructors. Changed dispatcher alert-filter callback signature to use alert.Alert. Removed AlertStatusFunc from API options wiring. Updated dispatcher metrics construction to accept feature flags.

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIv2
    participant Dispatcher
    participant GroupMarker as Per-Group Marker
    participant AlertMarker as Alert Marker
    participant Context
    
    Client->>APIv2: GET /api/v2/alerts
    APIv2->>Dispatcher: Groups(ctx, alertFilter)
    Dispatcher->>GroupMarker: Create group marker
    Dispatcher->>Dispatcher: Iterate routes & groups
    Dispatcher->>AlertMarker: Create per-group alert marker
    AlertMarker->>AlertMarker: NewAlertMarker()
    Dispatcher->>Context: WithContext(ctx, alertMarker)
    Dispatcher->>APIv2: AlertGroup{Alerts, AlertStatuses}
    APIv2->>AlertMarker: Status(fp) for each alert
    AlertMarker-->>APIv2: alert.AlertStatus{State, SilencedBy, InhibitedBy}
    APIv2-->>Client: OpenAPI alerts with statuses

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 34.83%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed: The PR title 'feat(dispatch): add alert markers per group' accurately reflects the main objective: introducing per-aggregation-group AlertMarkers in the dispatcher to replace the global marker.
  • Description check ✅ Passed: The PR description is comprehensive and addresses the template requirements. It explains the feature, lists linked issues (#4953), confirms testing was added, verifies no API breaking changes, indicates sign-off, and provides detailed release notes.
  • Linked Issues check ✅ Passed: The PR fully addresses issue #4953's requirements: replaces the global marker with per-aggregation-group markers, maintains API/behavior compatibility via temporary markers in /alerts and group markers in /alerts/groups, and propagates markers through dispatcher, notifiers, inhibitor, and silencer via context.
  • Out of Scope Changes check ✅ Passed: All code changes are directly scoped to implementing per-group alert markers. Changes include the new marker package, dispatcher integration, API updates, metrics migration, and removal of legacy types, all supporting the primary objective of per-group marker tracking.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@siavashs force-pushed the feat/group_markers branch from e62a8c8 to 3708dcb on March 16, 2026 18:53
@coderabbitai (bot) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
notify/mute.go (1)

145-156: ⚠️ Potential issue | 🟠 Major

Clear stale group markers when a route has no time intervals.

cmd/alertmanager/main.go now keeps one marker.GroupMarker alive across reloads. These early returns skip SetMuted, so a group that used to be muted can keep its old marker when the route later has no active/mute intervals, and /alerts/groups will still report it as muted until the group disappears. Clear the marker before returning from the “no interval data / empty interval list” paths.

Suggested fix
 	muteTimeIntervalNames, ok := MuteTimeIntervalNames(ctx)
 	if !ok {
+		tms.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, nil
 	}
 	now, ok := Now(ctx)
 	if !ok {
+		tms.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, errors.New("missing now timestamp")
 	}
@@
 	if len(muteTimeIntervalNames) == 0 {
+		tms.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, nil
 	}
@@
 	activeTimeIntervalNames, ok := ActiveTimeIntervalNames(ctx)
 	if !ok {
+		tas.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, nil
 	}
@@
 	if len(activeTimeIntervalNames) == 0 {
+		tas.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, nil
 	}
@@
 	now, ok := Now(ctx)
 	if !ok {
+		tas.marker.SetMuted(routeID, gkey, nil)
 		return ctx, alerts, errors.New("missing now timestamp")
 	}

Also applies to: 205-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/mute.go` around lines 145 - 156, The early-return paths in the
MuteTimeIntervalNames/Now checks and the empty-interval branch skip clearing the
group's marker, so a previously-muted group can remain marked; update the code
in the function that calls MuteTimeIntervalNames and Now to call SetMuted(ctx,
groupKey, nil) (or the equivalent marker-clear helper) before each return where
muteTimeIntervalNames is missing/empty or Now fails, referencing the existing
SetMuted function and marker.GroupMarker semantics to ensure the group's marker
is cleared prior to returning from those branches.
🧹 Nitpick comments (3)
api/v2/compat.go (1)

141-142: Parameter name shadows the alert package import.

The parameter alert *alert.Alert shadows the imported alert package. While this works because the parameter is only used via its fields (not requiring additional package-level access), it could cause confusion during future maintenance.

♻️ Suggested rename to avoid shadowing
 // AlertToOpenAPIAlert converts internal alerts, alert types, and receivers to *open_api_models.GettableAlert.
-func AlertToOpenAPIAlert(alert *alert.Alert, status alert.AlertStatus, receivers, mutedBy []string) *open_api_models.GettableAlert {
-	startsAt := strfmt.DateTime(alert.StartsAt)
-	updatedAt := strfmt.DateTime(alert.UpdatedAt)
-	endsAt := strfmt.DateTime(alert.EndsAt)
+func AlertToOpenAPIAlert(a *alert.Alert, status alert.AlertStatus, receivers, mutedBy []string) *open_api_models.GettableAlert {
+	startsAt := strfmt.DateTime(a.StartsAt)
+	updatedAt := strfmt.DateTime(a.UpdatedAt)
+	endsAt := strfmt.DateTime(a.EndsAt)

(Continue renaming alert to a throughout the function)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/compat.go` around lines 141 - 142, The function AlertToOpenAPIAlert
currently declares a parameter named "alert" which shadows the imported "alert"
package; rename the parameter (e.g., to "a" or "alertObj") and update all uses
inside AlertToOpenAPIAlert to that new identifier so references to fields (e.g.,
a.ID, a.Labels) no longer shadow the package name and future edits won't confuse
package-level access to the alert package.
marker/alert.go (1)

68-82: Potential nil slice issue in Status return.

When the alert is found in the map, the returned status.SilencedBy and status.InhibitedBy are directly assigned from the stored alertStatus fields. If these were set to nil (which happens when SetSilenced or SetInhibited is called with nil/empty slices based on lines 47 and 60), the returned AlertStatus could have nil slices instead of empty slices.

This creates inconsistency: when a fingerprint is not found (lines 70-71), empty slices are returned. But when found with nil internal state, nil slices could be returned. This may cause issues for callers expecting consistent non-nil slices.

♻️ Suggested fix to ensure consistent non-nil slices
 	status.State = s.state()
-	status.SilencedBy = s.SilencedBy
-	status.InhibitedBy = s.InhibitedBy
+	if s.SilencedBy != nil {
+		status.SilencedBy = s.SilencedBy
+	}
+	if s.InhibitedBy != nil {
+		status.InhibitedBy = s.InhibitedBy
+	}
 	return status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/alert.go` around lines 68 - 82, When building the returned
alert.AlertStatus in the getter, ensure SilencedBy and InhibitedBy are never
nil: if m.status[fp] is found, copy s.SilencedBy and s.InhibitedBy but replace
any nil slice with an empty slice before assigning to status.SilencedBy and
status.InhibitedBy (or use a small helper like ensureNonNilSlice), so callers
always receive non-nil, empty slices even when internal alertStatus fields were
set to nil by SetSilenced/SetInhibited.
notify/notify.go (1)

777-796: Complete dedup typing migration to remove residual types coupling.

At Line 795, dedup passes []*alert.Alert, but the helper it calls is still typed with []*types.Alert (Line 757). It works via aliasing, but it leaves avoidable cross-package coupling and makes future types cleanup harder.

♻️ Proposed refactor
-	"github.com/prometheus/alertmanager/types"
@@
-func partitionAlertsByState(alerts []*types.Alert, hashFn func(*types.Alert) uint64) (firing, resolved []uint64, firingSet, resolvedSet map[uint64]struct{}) {
+func partitionAlertsByState(alerts []*alert.Alert, hashFn func(*alert.Alert) uint64) (firing, resolved []uint64, firingSet, resolvedSet map[uint64]struct{}) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/notify.go` around lines 777 - 796, The helper partitionAlertsByState
is still typed to accept []*types.Alert causing residual coupling; update the
partitionAlertsByState function signature and any related helpers to accept
[]*alert.Alert instead, update its implementation to use alert.Alert (not
types.Alert), adjust its call site in DedupStage.Exec (which currently passes
alerts and n.hash), and remove any now-unused references/imports to the types
package so the dedup path is fully decoupled from types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v2/api.go`:
- Around line 281-283: The alertFilter closure currently discards the incoming
request context by using context.Background() for status recomputation, which
detaches silence/inhibition lookups from the request deadline and span; update
the closure in api.alertFilter (and the same pattern in the other occurrences)
to capture the handler's context (e.g., pass ctx from the handler or close over
ctx variable) and use that ctx for all downstream lookups instead of
context.Background(), ensuring status recomputation honors cancellation and
deadlines associated with the request.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f024d09f-6fcf-456c-b696-44f03d44838d

📥 Commits

Reviewing files that changed from the base of the PR and between eac945c and e62a8c8.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • types/types_test.go
  • api/api.go
  • types/types.go
  • provider/mem/mem.go

@coderabbitai (bot) left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dispatch/dispatch.go (1)

470-511: ⚠️ Potential issue | 🔴 Critical

Refresh map state after failed swap to avoid dropping alerts under contention.

At Line 472, failed CompareAndSwap retries against stale el without reloading the current map entry. If another goroutine replaced the group, this path can repeatedly fail until Line 509, silently dropping the alert.

💡 Proposed fix
  retries := 0
  for {
    if loaded {
-     // Try to store the new group in the map. If another goroutine has already created the same group, use the existing one.
-     swapped := d.routeGroupsSlice[route.Idx].groups.CompareAndSwap(fp, el, ag)
+     // Try to replace destroyed group.
+     swapped := d.routeGroupsSlice[route.Idx].groups.CompareAndSwap(fp, el, ag)
      if swapped {
        // We swapped the new group in, we can break and start it.
        break
      }
+     // Refresh current state before retrying.
+     el, loaded = d.routeGroupsSlice[route.Idx].groups.Load(fp)
+     if loaded {
+       agExisting := el.(*aggrGroup)
+       if agExisting.insert(ctx, alert) {
+         return
+       }
+     }
    } else {
      el, loaded = d.routeGroupsSlice[route.Idx].groups.LoadOrStore(fp, ag)
      if !loaded {
        d.routeGroupsSlice[route.Idx].groupsLen.Add(1)
        d.aggrGroupsNum.Add(1)
        d.metrics.aggrGroups.Set(float64(d.aggrGroupsNum.Load()))
        // We stored the new group, we can break and start it.
        break
      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch.go` around lines 470 - 511, The CompareAndSwap failure path
currently retries using a stale `el` and can drop alerts under contention; after
a failed `d.routeGroupsSlice[route.Idx].groups.CompareAndSwap(fp, el, ag)` you
must reload the current map entry (call `Load`/`LoadOrStore` on
`d.routeGroupsSlice[route.Idx].groups` for `fp`) to get the up-to-date `el`,
check for nil, and if it yields an existing `*aggrGroup` attempt
`agExisting.insert(ctx, alert)` before incrementing retries or looping; only
increment `retries` and log/give up after confirming the entry is not usable/was
destroyed, keeping the existing logic that increments counters and starts the
group only when this goroutine successfully stored `ag`.
♻️ Duplicate comments (1)
api/v2/api.go (1)

281-283: ⚠️ Potential issue | 🟠 Major

Preserve request context in alertFilter execution path.

alertFilter creates spans from context.Background() (Line 472), so silence/inhibition recomputation ignores request cancellation/deadlines and detaches from request tracing. This can keep work running after client disconnects on large queries.

Suggested fix
-	tempMarker := marker.NewAlertMarker()
-	alertFilter := api.alertFilter(matchers, *params.Silenced, *params.Inhibited, *params.Active, tempMarker)
+	tempMarker := marker.NewAlertMarker()
+	alertFilter := api.alertFilter(ctx, matchers, *params.Silenced, *params.Inhibited, *params.Active, tempMarker)
@@
-	af := api.alertFilter(matchers, *params.Silenced, *params.Inhibited, *params.Active, marker.NewAlertMarker())
+	af := api.alertFilter(ctx, matchers, *params.Silenced, *params.Inhibited, *params.Active, marker.NewAlertMarker())
@@
-func (api *API) alertFilter(matchers []*labels.Matcher, silenced, inhibited, active bool, m marker.AlertMarker) func(a *alert.Alert, now time.Time) bool {
+func (api *API) alertFilter(parent context.Context, matchers []*labels.Matcher, silenced, inhibited, active bool, m marker.AlertMarker) func(a *alert.Alert, now time.Time) bool {
 	return func(a *alert.Alert, now time.Time) bool {
-		ctx, span := tracer.Start(context.Background(), "alertFilter")
+		ctx, span := tracer.Start(parent, "alertFilter")
 		defer span.End()

Also applies to: 433-433, 470-486

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/api.go` around lines 281 - 283, The alert recomputation path uses
context.Background() inside alertFilter (and its span creation), detaching work
from the incoming request; change alertFilter to accept a context parameter
(e.g., ctx) and propagate the handler's request context into alertFilter and any
internal calls that call tracer.Start/contexts instead of using
context.Background(); update all callers (e.g., where alertFilter is invoked to
build alertFilter, and the other referenced call sites) to pass the request ctx
so cancellation/deadlines and tracing are preserved during
silenced/inhibited/active recomputation.
🧹 Nitpick comments (1)
dispatch/metric.go (1)

111-124: Per-scrape allocations when group_key is enabled.

Each scrape with enableGroupKey=true allocates 3 new slices per aggregation group (one per state) via append(c.labels[...], groupKey). For deployments with many groups, this could generate noticeable GC pressure during metrics collection.

Consider pre-allocating a reusable buffer or using a sync.Pool if metrics scraping latency becomes a concern in large deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 111 - 124, The current loop in the
enableGroupKey branch creates three new slices per group via
append(c.labels[...], groupKey), causing per-scrape allocations; to fix,
allocate a reusable buffer slice once per group iteration (size =
len(c.labels[state])+1), copy the base labels from c.labels[alert.AlertStateX]
into that buffer, set the final element to ag.GroupKey(), and call c.emit with
that buffer for active, suppressed and unprocessed — reuse or reassign the same
buffer per state or use a sync.Pool for buffers if needed; update the loop
around routeGroupsSlice / aggrGroup (functions countAlertsByState and GroupKey
and calls to c.emit) to use this preallocated buffer instead of append.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51ca087b-f904-46c9-9be3-4af50acfede9

📥 Commits

Reviewing files that changed from the base of the PR and between e62a8c8 and 3708dcb.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • types/types.go
  • provider/mem/mem.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • marker/util.go
  • api/v2/api_test.go
  • marker/alert.go
  • api/v2/compat.go
  • featurecontrol/featurecontrol.go
  • marker/marker.go
  • marker/group_test.go
  • marker/group.go

@siavashs
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@siavashs siavashs force-pushed the feat/group_markers branch from 55723d3 to 0db0801 Compare March 30, 2026 13:14
@siavashs siavashs requested a review from a team as a code owner March 30, 2026 13:14
This change adds alert markers to the aggregation groups in dispatcher.
Alert markers replace the global marker and are used to track
the state of alerts in each aggregation group.

This change touches many components of the alertmanager.
Per Group alert markers are passed to the notifiers and then inhibitor
and silencer using context.

The API has no breaking changes:
- /alerts uses a temporary marker to track the state of alerts
- /alerts/groups returns the group markers

Update metrics to use group markers.
The `alertmanager_alerts` metric is moved to dispatcher.
The `alertmanager_marked_alerts` metric is removed.
By default it behaves the same as before, by aggregating
all alerts in the groups.
Enabling `group-key-in-metrics` flag will cause the metrics
to be grouped by `group_key`.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the feat/group_markers branch from 0db0801 to 35b9fb6 Compare March 30, 2026 13:23
@siavashs
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
notify/notify.go (1)

37-43: ⚠️ Potential issue | 🟡 Minor

Remove unused types import from dispatch/dispatch_test.go.

The types import in notify/notify.go (line 43) is needed and used at line 887 ([]*types.Alert). However, the same import in dispatch/dispatch_test.go (line 36) is unused and should be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/notify.go` around lines 37 - 43, Remove the unused "types" import from
dispatch/dispatch_test.go: open the import block in dispatch/dispatch_test.go
and delete the "github.com/prometheus/alertmanager/types" entry (it's not
referenced in that test), then run `go vet`/`go test` to ensure no other usages;
do not change the existing usage of types in notify/notify.go (e.g., the
[]*types.Alert at line 887) which must remain.
🧹 Nitpick comments (3)
marker/alert.go (1)

38-48: Copy the reason slices at the marker boundary.

SetSilenced / SetInhibited keep the caller’s slice as-is, and Status returns the same backing arrays back out. That makes the marker’s internal state externally mutable if a caller reuses or edits either slice later. Clone on write or before returning so the marker fully owns its state.

Also applies to: 51-61, 79-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/alert.go` around lines 38 - 48, SetSilenced, SetInhibited and Status
currently reuse caller slices and return the same backing arrays, exposing
internal state; fix by copying slices on write and before returning so the
marker owns its data. In alertMarker.SetSilenced and alertMarker.SetInhibited,
replace assignments to alertStatus.SilencedBy / InhibitedBy with a copied slice
(make + copy) of the incoming slice; in alertMarker.Status, when constructing
the returned Alert or alertStatus, clone any slices you return (SilencedBy,
InhibitedBy, and Reason slices if present) so callers never get pointers to the
marker’s internal backing arrays.
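The clone-on-write/clone-on-read boundary can be shown with `slices.Clone` (Go 1.21+). This sketch flattens the marker to a single status and drops the fingerprint keying and locking of the real `alertMarker`:

```go
package main

import (
	"fmt"
	"slices"
)

type alertStatus struct {
	SilencedBy  []string
	InhibitedBy []string
}

type alertMarker struct{ status alertStatus }

// SetSilenced clones the caller's slice so later mutations by the caller
// cannot reach into the marker's internal state.
func (m *alertMarker) SetSilenced(ids []string) {
	m.status.SilencedBy = slices.Clone(ids)
}

// Status returns copies whose backing arrays are not shared with the
// marker, so callers can edit the result freely.
func (m *alertMarker) Status() alertStatus {
	return alertStatus{
		SilencedBy:  slices.Clone(m.status.SilencedBy),
		InhibitedBy: slices.Clone(m.status.InhibitedBy),
	}
}

func main() {
	m := &alertMarker{}
	ids := []string{"sil-1"}
	m.SetSilenced(ids)
	ids[0] = "mutated" // does not leak into the marker
	fmt.Println(m.Status().SilencedBy[0]) // sil-1
}
```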
dispatch/metric.go (1)

35-79: registerLimitMetrics is ignored here.

NewDispatcherMetrics still registers alertmanager_dispatcher_aggregation_group_limit_reached_total unconditionally, so callers passing false no longer get the behavior the signature advertises. Either wire the flag back into registration or drop the parameter to avoid a misleading API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 35 - 79, The function NewDispatcherMetrics
ignores the registerLimitMetrics parameter but still unconditionally
creates/registers the aggrGroupLimitReached counter; update NewDispatcherMetrics
so that when registerLimitMetrics is false it does not create or register
alertmanager_dispatcher_aggregation_group_limit_reached_total (e.g., set
aggrGroupLimitReached to nil or a no-op metric) and only call
promauto.With(r).NewCounter for aggrGroupLimitReached when registerLimitMetrics
is true; alternatively, if the flag is unnecessary, remove the
registerLimitMetrics parameter and its callers—make the change in
NewDispatcherMetrics and the DispatcherMetrics struct usage
(aggrGroupLimitReached) accordingly.
marker/group_test.go (1)

53-75: Consider adding a concurrent access test.

Given that GroupMarker uses sync.RWMutex for thread safety, a test exercising concurrent SetMuted/Muted/DeleteByGroupKey calls would help validate the locking behavior under contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/group_test.go` around lines 53 - 75, Add a new test that creates a
GroupMarker via NewGroupMarker and runs concurrent goroutines calling SetMuted,
Muted, and DeleteByGroupKey to exercise the RWMutex under contention; use
sync.WaitGroup and multiple iterations with randomized sleeps to interleave
operations, assert no panics/races and that final state matches expected (e.g.,
set a key, concurrently read and delete it, then verify Muted returns false
after DeleteByGroupKey). Target the methods NewGroupMarker, SetMuted, Muted, and
DeleteByGroupKey in the test to validate thread-safety.
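The suggested concurrency exercise could look like the sketch below. Since the exact `GroupMarker` signatures aren't shown here, `groupMarker` is a minimal stand-in implementing the three methods the comment names (`SetMuted`, `Muted`, `DeleteByGroupKey`); run such a test with `-race` to catch locking bugs.

```go
package main

import (
	"fmt"
	"sync"
)

// groupMarker is a minimal stand-in for marker.GroupMarker; the real
// type's method signatures may differ.
type groupMarker struct {
	mu    sync.RWMutex
	muted map[string][]string // groupKey -> time interval names
}

func newGroupMarker() *groupMarker {
	return &groupMarker{muted: map[string][]string{}}
}

func (g *groupMarker) SetMuted(groupKey string, intervals []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.muted[groupKey] = intervals
}

func (g *groupMarker) Muted(groupKey string) ([]string, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	iv, ok := g.muted[groupKey]
	return iv, ok
}

func (g *groupMarker) DeleteByGroupKey(groupKey string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.muted, groupKey)
}

func main() {
	m := newGroupMarker()
	var wg sync.WaitGroup
	// Interleave writes, reads, and deletes on the same key.
	for i := 0; i < 100; i++ {
		wg.Add(3)
		go func() { defer wg.Done(); m.SetMuted("g1", []string{"weekend"}) }()
		go func() { defer wg.Done(); m.Muted("g1") }()
		go func() { defer wg.Done(); m.DeleteByGroupKey("g1") }()
	}
	wg.Wait()
	m.DeleteByGroupKey("g1")
	_, ok := m.Muted("g1")
	fmt.Println(ok) // false: key gone after the final delete
}
```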
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch_test.go`:
- Line 37: The build is failing because the import
"github.com/prometheus/alertmanager/types" is unused in dispatch_test.go; remove
that import from the file's import block (or replace it with a blank identifier
only if intentionally needed) so the test compiles—locate the import declaration
referencing types in dispatch_test.go and delete it or remove any references to
types.* so the compiler no longer reports an unused import.

In `@dispatch/metric.go`:
- Around line 124-149: The seen map population in the loop over
d.routeGroupsSlice (aggrGroup -> ag.alerts.List()) uses first-write-wins; change
it so on each fingerprint fp encountered you compute the incoming state via
ag.marker.Status(fp).State and merge it with any existing seen[fp] using
explicit precedence suppressed > active > unprocessed (i.e., if existing is
lower precedence than incoming, replace it; otherwise keep existing). Keep the
rest of the counting logic over seen unchanged.

In `@notify/notify.go`:
- Line 887: The method RetryStage.exec currently accepts variadic alerts of type
*alert.Alert but declares it returns []*types.Alert and then returns the
incoming alerts directly, causing a type mismatch; fix by converting the
incoming []*alert.Alert to []*types.Alert before returning (or change the
parameter type to []*types.Alert if callers can be updated). Locate
RetryStage.exec and either (A) map each *alert.Alert to a new *types.Alert
(create a small converter inside exec or call an existing conversion function)
and return that slice, or (B) change the exec signature to accept alerts
...*types.Alert and update call sites accordingly so the returned value matches
the declared []*types.Alert.

---

Outside diff comments:
In `@notify/notify.go`:
- Around line 37-43: Remove the unused "types" import from
dispatch/dispatch_test.go: open the import block in dispatch/dispatch_test.go
and delete the "github.com/prometheus/alertmanager/types" entry (it's not
referenced in that test), then run `go vet`/`go test` to ensure no other usages;
do not change the existing usage of types in notify/notify.go (e.g., the
[]*types.Alert at line 887) which must remain.

---

Nitpick comments:
In `@dispatch/metric.go`:
- Around line 35-79: The function NewDispatcherMetrics ignores the
registerLimitMetrics parameter but still unconditionally creates/registers the
aggrGroupLimitReached counter; update NewDispatcherMetrics so that when
registerLimitMetrics is false it does not create or register
alertmanager_dispatcher_aggregation_group_limit_reached_total (e.g., set
aggrGroupLimitReached to nil or a no-op metric) and only call
promauto.With(r).NewCounter for aggrGroupLimitReached when registerLimitMetrics
is true; alternatively, if the flag is unnecessary, remove the
registerLimitMetrics parameter and its callers—make the change in
NewDispatcherMetrics and the DispatcherMetrics struct usage
(aggrGroupLimitReached) accordingly.

In `@marker/alert.go`:
- Around line 38-48: SetSilenced, SetInhibited and Status currently reuse caller
slices and return the same backing arrays, exposing internal state; fix by
copying slices on write and before returning so the marker owns its data. In
alertMarker.SetSilenced and alertMarker.SetInhibited, replace assignments to
alertStatus.SilencedBy / InhibitedBy with a copied slice (make + copy) of the
incoming slice; in alertMarker.Status, when constructing the returned Alert or
alertStatus, clone any slices you return (SilencedBy, InhibitedBy, and Reason
slices if present) so callers never get pointers to the marker’s internal
backing arrays.

In `@marker/group_test.go`:
- Around line 53-75: Add a new test that creates a GroupMarker via
NewGroupMarker and runs concurrent goroutines calling SetMuted, Muted, and
DeleteByGroupKey to exercise the RWMutex under contention; use sync.WaitGroup
and multiple iterations with randomized sleeps to interleave operations, assert
no panics/races and that final state matches expected (e.g., set a key,
concurrently read and delete it, then verify Muted returns false after
DeleteByGroupKey). Target the methods NewGroupMarker, SetMuted, Muted, and
DeleteByGroupKey in the test to validate thread-safety.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 62685eb0-6c40-469e-87cc-0d7fd5712f7b

📥 Commits

Reviewing files that changed from the base of the PR and between 55723d3 and 0db0801.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • provider/mem/mem.go
  • types/types.go
✅ Files skipped from review due to trivial changes (3)
  • marker/status.go
  • featurecontrol/featurecontrol.go
  • api/v2/api_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • inhibit/inhibit_bench_test.go
  • notify/mute_test.go
  • silence/silence_test.go
  • notify/mute.go


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
inhibit/inhibit_test.go (1)

312-356: ⚠️ Potential issue | 🟡 Minor

Test case comments are misleading and don't match expected values.

Several comments in this test don't align with the expected results. This appears to be copy-paste from TestInhibitRuleMatches without updating for the different matcher types (rule1 uses MatchNotEqual for t1, not MatchEqual).

Examples:

  • Line 313: Comment says "inhibited" but expected: false
  • Line 335: Comment says "not inhibited" but expected: true
  • Line 349: Comment says "not inhibited" but expected: true

The expected values themselves appear correct for the MatchNotEqual semantics, but the comments should be updated to avoid confusion for future maintainers.

📝 Suggested comment fixes
 		{
-			// Matches target filter of rule1, inhibited.
+			// Does NOT match target filter of rule1 (MatchNotEqual t1!=1), not inhibited.
 			target:   model.LabelSet{"t1": "1", "e": "1"},
 			expected: false,
 		},
 		{
-			// Doesn't match target filter, not inhibited.
+			// Matches target filter of rule1 (t1!=1, so t1=0 matches), inhibited.
 			target:   model.LabelSet{"t1": "0", "e": "1"},
 			expected: true,
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inhibit/inhibit_test.go` around lines 312 - 356, Update the misleading inline
comments in the inhibit test cases to reflect the actual semantics of rule1
(which uses MatchNotEqual for label "t1") rather than MatchEqual; specifically,
review each test case referencing rule1 and rule2 (symbols: rule1, rule2,
MatchNotEqual) and change the comment text to match the existing expected
boolean values (keep expected values as-is), e.g. where a comment currently says
"inhibited" but expected:false change it to "not inhibited (because
MatchNotEqual on t1)" and where it says "not inhibited" but expected:true change
to "inhibited (because MatchNotEqual on t1 does not block)"; ensure all comments
describing source/target matching (mentions of sourceAlert1, sourceAlert2,
target filters) are consistent with MatchNotEqual semantics.
♻️ Duplicate comments (2)
dispatch/metric.go (1)

124-153: ⚠️ Potential issue | 🟠 Major

Merge duplicate fingerprint states with explicit precedence.

With per-group markers, the same fingerprint can be active in one aggregation group and suppressed in another. The current first-write-wins logic in seen (line 132-134) makes the metric non-deterministic depending on iteration order. Consider merging with explicit precedence (e.g., suppressed > active > unprocessed).

🛠️ Proposed fix with merge precedence
 // Deduplicate by fingerprint for backward compatibility.
 seen := map[model.Fingerprint]alert.AlertState{}
 for i := range d.routeGroupsSlice {
     d.routeGroupsSlice[i].groups.Range(func(_, el any) bool {
         ag := el.(*aggrGroup)
         for _, a := range ag.alerts.List() {
             fp := a.Fingerprint()
             if !a.Resolved() {
-                if _, ok := seen[fp]; !ok {
-                    seen[fp] = ag.marker.Status(fp).State
-                }
+                incoming := ag.marker.Status(fp).State
+                existing, ok := seen[fp]
+                if !ok || statePrecedence(incoming) > statePrecedence(existing) {
+                    seen[fp] = incoming
+                }
             }
         }
         return true
     })
 }

+// statePrecedence returns the precedence of the state for merging.
+// Higher precedence wins when the same fingerprint appears in multiple groups.
+func statePrecedence(s alert.AlertState) int {
+    switch s {
+    case alert.AlertStateSuppressed:
+        return 2
+    case alert.AlertStateActive:
+        return 1
+    default: // unprocessed
+        return 0
+    }
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 124 - 153, The deduplication currently uses
first-write-wins in the seen map (seen :=
map[model.Fingerprint]alert.AlertState{}) when iterating d.routeGroupsSlice and
aggrGroup alerts, causing non-deterministic metric values; change the logic
inside the loop that does fp := a.Fingerprint() and sets seen[fp] =
ag.marker.Status(fp).State to instead merge states by explicit precedence: treat
AlertStateSuppressed highest, AlertStateActive next, AlertStateUnprocessed
lowest, and when a fingerprint already exists in seen compare the existing state
and the new ag.marker.Status(fp).State and replace it only if the new state has
higher precedence so the final seen map deterministically reflects suppressed >
active > unprocessed before the counting/emission.
notify/notify.go (1)

887-887: ⚠️ Potential issue | 🔴 Critical

Critical: Return type mismatch will cause a compile error.

The exec method declares return type []*types.Alert but receives alerts ...*alert.Alert and returns alerts directly. This type mismatch won't compile.

🐛 Proposed fix
-func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*types.Alert, error) {
+func (r RetryStage) exec(ctx context.Context, l *slog.Logger, alerts ...*alert.Alert) (context.Context, []*alert.Alert, error) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notify/notify.go` at line 887, The exec method signature on RetryStage uses
parameter type alerts ...*alert.Alert but declares it returns []*types.Alert,
causing a type mismatch; fix by changing the parameter type to ...*types.Alert
in the RetryStage.exec signature (and update all callers to pass types.Alert
pointers), or alternatively, keep alerts as ...*alert.Alert and convert/marshal
each alert.Alert into a *types.Alert slice before returning; refer to the
function name RetryStage.exec and the symbols alerts, alert.Alert and
types.Alert when making the change.
🧹 Nitpick comments (1)
marker/group_test.go (1)

22-75: LGTM!

The tests provide good coverage of the GroupMarker functionality including isolation between different route/group combinations and clearing behavior.

Consider adding a concurrent access test in the future to validate thread safety under load, though this is not critical for the initial implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/group_test.go` around lines 22 - 75, Add a concurrent-access unit test
for GroupMarker to validate thread-safety: create a test (e.g.,
TestGroupMarker_ConcurrentAccess) that exercises GroupMarker.SetMuted,
GroupMarker.Muted and GroupMarker.DeleteByGroupKey from many goroutines using
sync.WaitGroup and randomized or patterned operations, wait for completion, and
assert final invariants (for example that Muted returns expected
timeIntervalNames or empty after deletes); ensure the test covers mixed
reads/writes across multiple route/group combinations (route1/route2,
group1/group2) to catch race conditions and run with -race in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@inhibit/inhibit_test.go`:
- Around line 312-356: Update the misleading inline comments in the inhibit test
cases to reflect the actual semantics of rule1 (which uses MatchNotEqual for
label "t1") rather than MatchEqual; specifically, review each test case
referencing rule1 and rule2 (symbols: rule1, rule2, MatchNotEqual) and change
the comment text to match the existing expected boolean values (keep expected
values as-is), e.g. where a comment currently says "inhibited" but
expected:false change it to "not inhibited (because MatchNotEqual on t1)" and
where it says "not inhibited" but expected:true change to "inhibited (because
MatchNotEqual on t1 does not block)"; ensure all comments describing
source/target matching (mentions of sourceAlert1, sourceAlert2, target filters)
are consistent with MatchNotEqual semantics.

---

Duplicate comments:
In `@dispatch/metric.go`:
- Around line 124-153: The deduplication currently uses first-write-wins in the
seen map (seen := map[model.Fingerprint]alert.AlertState{}) when iterating
d.routeGroupsSlice and aggrGroup alerts, causing non-deterministic metric
values; change the logic inside the loop that does fp := a.Fingerprint() and
sets seen[fp] = ag.marker.Status(fp).State to instead merge states by explicit
precedence: treat AlertStateSuppressed highest, AlertStateActive next,
AlertStateUnprocessed lowest, and when a fingerprint already exists in seen
compare the existing state and the new ag.marker.Status(fp).State and replace it
only if the new state has higher precedence so the final seen map
deterministically reflects suppressed > active > unprocessed before the
counting/emission.

In `@notify/notify.go`:
- Line 887: The exec method signature on RetryStage uses parameter type alerts
...*alert.Alert but declares it returns []*types.Alert, causing a type mismatch;
fix by changing the parameter type to ...*types.Alert in the RetryStage.exec
signature (and update all callers to pass types.Alert pointers), or
alternatively, keep alerts as ...*alert.Alert and convert/marshal each
alert.Alert into a *types.Alert slice before returning; refer to the function
name RetryStage.exec and the symbols alerts, alert.Alert and types.Alert when
making the change.

---

Nitpick comments:
In `@marker/group_test.go`:
- Around line 22-75: Add a concurrent-access unit test for GroupMarker to
validate thread-safety: create a test (e.g., TestGroupMarker_ConcurrentAccess)
that exercises GroupMarker.SetMuted, GroupMarker.Muted and
GroupMarker.DeleteByGroupKey from many goroutines using sync.WaitGroup and
randomized or patterned operations, wait for completion, and assert final
invariants (for example that Muted returns expected timeIntervalNames or empty
after deletes); ensure the test covers mixed reads/writes across multiple
route/group combinations (route1/route2, group1/group2) to catch race conditions
and run with -race in CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 14704549-fc4f-4c3d-99d7-c49241be9280

📥 Commits

Reviewing files that changed from the base of the PR and between 0db0801 and 35b9fb6.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • types/types_test.go
  • types/types.go
  • provider/mem/mem.go
  • api/api.go
✅ Files skipped from review due to trivial changes (9)
  • marker/status.go
  • silence/silence_bench_test.go
  • marker/alert.go
  • inhibit/inhibit_bench_test.go
  • marker/alert_test.go
  • dispatch/dispatch_test.go
  • marker/marker.go
  • featurecontrol/featurecontrol.go
  • silence/silence.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • cmd/alertmanager/main.go
  • api/v2/api_test.go
  • marker/util.go
  • silence/silence_test.go
  • notify/mute_test.go
  • notify/mute.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
dispatch/metric.go (1)

124-149: ⚠️ Potential issue | 🟠 Major

Deduped alert-state metrics are still order-dependent.

With per-group markers, the same fingerprint can legitimately have different states in different aggregation groups. First-write-wins on seen makes the default alertmanager_alerts result depend on sync.Map.Range order instead of a defined precedence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 124 - 149, The dedupe uses first-write-wins
on seen which makes results order-dependent; change the assignment logic inside
the loop that populates seen (where seen is
map[model.Fingerprint]alert.AlertState and the code calls
ag.marker.Status(fp).State) to apply a deterministic precedence instead of
skipping when a key exists: compute the newState := ag.marker.Status(fp).State
and update seen[fp] only if newState has higher precedence than the current
seen[fp] (e.g., prefer alert.AlertStateActive over alert.AlertStateSuppressed
over others), or implement a small helper priority function to compare states
before assigning.
🧹 Nitpick comments (3)
cmd/alertmanager/main.go (1)

291-291: Minor: Variable shadows package name.

The local variable marker shadows the imported marker package. While this works correctly (the package is only used for the NewGroupMarker() call at initialization), consider renaming to groupMarker for clarity.

♻️ Suggested rename for clarity
-	marker := marker.NewGroupMarker()
+	groupMarker := marker.NewGroupMarker()

Then update references at lines 381, 492, 510 accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/alertmanager/main.go` at line 291, The local variable named marker
shadows the imported marker package; rename the local variable returned by
marker.NewGroupMarker() to groupMarker (e.g., groupMarker :=
marker.NewGroupMarker()) and update all intra-file references that use that
local variable (formerly marker) to groupMarker so the package identifier marker
remains unshadowed—ensure you update every use site that currently refers to the
local variable (the subsequent references in this file).
dispatch/dispatch_test.go (1)

433-438: Assert the AlertStatuses contents, not just that the map exists.

This only verifies allocation. A regression that populates every entry with the wrong state would still pass, even though Groups() now exposes these per-alert statuses.

✅ Small test hardening
 	for _, ag := range alertGroups {
 		require.NotNil(t, ag.AlertStatuses)
+		for _, a := range ag.Alerts {
+			status, ok := ag.AlertStatuses[a.Fingerprint()]
+			require.True(t, ok)
+			require.Equal(t, alert.AlertStateActive, status.State)
+		}
 		ag.AlertStatuses = nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch_test.go` around lines 433 - 438, Test currently only checks
AlertStatuses is non-nil then clears it; instead assert the actual contents
before nil-ing so regressions in per-alert state are caught. In the loop over
alertGroups (variable alertGroups, field AlertStatuses) replace the single
require.NotNil with a check that AlertStatuses equals the expected map for that
group (use require.Equal or similar against your expected per-alert status map
for this test), then set ag.AlertStatuses = nil so the later struct comparison
still works; ensure you reference the same keys/values that Groups() should
expose.
dispatch/metric.go (1)

36-79: registerLimitMetrics is currently ignored.

aggrGroupLimitReached is always registered with promauto.With(r).NewCounter, so callers passing false still expose alertmanager_dispatcher_aggregation_group_limit_reached_total. Either drop the flag or make the false branch create an unregistered counter.

🛠️ One local way to honor the flag
-		aggrGroupLimitReached: promauto.With(r).NewCounter(
-			prometheus.CounterOpts{
-				Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
-				Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
-			},
-		),
+		aggrGroupLimitReached: func() prometheus.Counter {
+			opts := prometheus.CounterOpts{
+				Name: "alertmanager_dispatcher_aggregation_group_limit_reached_total",
+				Help: "Number of times when dispatcher failed to create new aggregation group due to limit.",
+			}
+			if !registerLimitMetrics || r == nil {
+				return prometheus.NewCounter(opts)
+			}
+			return promauto.With(r).NewCounter(opts)
+		}(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/metric.go` around lines 36 - 79, The registerLimitMetrics boolean is
ignored in NewDispatcherMetrics causing aggrGroupLimitReached to always be
registered; update NewDispatcherMetrics so that when registerLimitMetrics is
true you create and register the counter with promauto.With(r).NewCounter (as
now), and when false you create an unregistered counter using
prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation)
and do not register it with r; refer to the NewDispatcherMetrics function and
the aggrGroupLimitReached symbol to implement the conditional
creation/registration so callers passing false no longer expose
alertmanager_dispatcher_aggregation_group_limit_reached_total.
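Outside the Prometheus client, the honor-the-flag pattern the comment asks for reduces to conditional construction: the counter always exists so callers can increment it, but it is only exposed on the registry when requested. The `registry`, `counter`, and `newLimitCounter` names below are illustrative stand-ins, not Alertmanager code:

```go
package main

import "fmt"

// registry is a toy stand-in for a *prometheus.Registry.
type registry struct{ names []string }

func (r *registry) register(name string) { r.names = append(r.names, name) }

// counter is a toy stand-in for prometheus.Counter.
type counter struct{ name string }

func (c *counter) Inc() {}

// newLimitCounter honors the flag: the counter is always usable,
// but it is only registered (i.e. exposed to scrapes) when asked.
func newLimitCounter(r *registry, register bool) *counter {
	c := &counter{name: "aggregation_group_limit_reached_total"}
	if register && r != nil {
		r.register(c.name)
	}
	return c
}

func main() {
	r := &registry{}
	c := newLimitCounter(r, false) // usable but not exposed
	c.Inc()
	fmt.Println(len(r.names))
	newLimitCounter(r, true)
	fmt.Println(len(r.names))
}
```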
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: Normalize nil metrics when constructing the dispatcher:
if the incoming metrics or metrics.alertsCollector is nil, replace them with a
no-op implementation so subsequent unguarded calls on d.metrics (e.g., in
methods referencing d.metrics at the sites where you currently call
d.metrics.Alert.../d.metrics.*) are safe. Implement a simple no-op type that
satisfies the DispatcherMetrics interface (or the concrete shape returned by
NewDispatcherMetrics) with an alertsCollector containing a dispatcher.Store
method that is a no-op, then in the dispatcher constructor (where you currently
call metrics.alertsCollector.dispatcher.Store(disp)) set metrics =
NewNoopDispatcherMetrics() when metrics == nil or metrics.alertsCollector == nil
before storing the dispatcher.

---

Duplicate comments:
In `@dispatch/metric.go`:
- Around line 124-149: The dedupe uses first-write-wins on seen which makes
results order-dependent; change the assignment logic inside the loop that
populates seen (where seen is map[model.Fingerprint]alert.AlertState and the
code calls ag.marker.Status(fp).State) to apply a deterministic precedence
instead of skipping when a key exists: compute the newState :=
ag.marker.Status(fp).State and update seen[fp] only if newState has higher
precedence than the current seen[fp] (e.g., prefer alert.AlertStateActive over
alert.AlertStateSuppressed over others), or implement a small helper priority
function to compare states before assigning.
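The deterministic-precedence fix described above can be sketched as a small priority helper. `AlertState` and the specific ordering here are illustrative, not Alertmanager's actual types:

```go
package main

import "fmt"

// AlertState is an illustrative stand-in for the alert state type.
type AlertState string

const (
	StateUnprocessed AlertState = "unprocessed"
	StateSuppressed  AlertState = "suppressed"
	StateActive      AlertState = "active"
)

// priority imposes a total ordering so merging is independent of
// iteration order: active > suppressed > unprocessed.
func priority(s AlertState) int {
	switch s {
	case StateActive:
		return 2
	case StateSuppressed:
		return 1
	default:
		return 0
	}
}

// merge keeps the higher-precedence state for a fingerprint instead of
// letting the first write win.
func merge(seen map[string]AlertState, fp string, newState AlertState) {
	if cur, ok := seen[fp]; !ok || priority(newState) > priority(cur) {
		seen[fp] = newState
	}
}

func main() {
	seen := map[string]AlertState{}
	// The same fingerprint seen in two groups, in either order,
	// always resolves to the active state.
	merge(seen, "fp1", StateSuppressed)
	merge(seen, "fp1", StateActive)
	merge(seen, "fp2", StateActive)
	merge(seen, "fp2", StateSuppressed)
	fmt.Println(seen["fp1"], seen["fp2"])
}
```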

---

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Line 291: The local variable named marker shadows the imported marker package;
rename the local variable returned by marker.NewGroupMarker() to groupMarker
(e.g., groupMarker := marker.NewGroupMarker()) and update all intra-file
references that use that local variable (formerly marker) to groupMarker so the
package identifier marker remains unshadowed—ensure you update every use site
that currently refers to the local variable (the subsequent references in this
file).

In `@dispatch/dispatch_test.go`:
- Around line 433-438: Test currently only checks AlertStatuses is non-nil then
clears it; instead assert the actual contents before nil-ing so regressions in
per-alert state are caught. In the loop over alertGroups (variable alertGroups,
field AlertStatuses) replace the single require.NotNil with a check that
AlertStatuses equals the expected map for that group (use require.Equal or
similar against your expected per-alert status map for this test), then set
ag.AlertStatuses = nil so the later struct comparison still works; ensure you
reference the same keys/values that Groups() should expose.

In `@dispatch/metric.go`:
- Around line 36-79: The registerLimitMetrics boolean is ignored in
NewDispatcherMetrics causing aggrGroupLimitReached to always be registered;
update NewDispatcherMetrics so that when registerLimitMetrics is true you create
and register the counter with promauto.With(r).NewCounter (as now), and when
false you create an unregistered counter using
prometheus.NewCounter(prometheus.CounterOpts{...}) (or a no-op implementation)
and do not register it with r; refer to the NewDispatcherMetrics function and
the aggrGroupLimitReached symbol to implement the conditional
creation/registration so callers passing false no longer expose
alertmanager_dispatcher_aggregation_group_limit_reached_total.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d731d28-13a3-4d8a-929c-6c07a9cf7c9e

📥 Commits

Reviewing files that changed from the base of the PR and between 0db0801 and 35b9fb6.

📒 Files selected for processing (30)
  • api/api.go
  • api/v2/api.go
  • api/v2/api_test.go
  • api/v2/compat.go
  • cmd/alertmanager/main.go
  • dispatch/dispatch.go
  • dispatch/dispatch_bench_test.go
  • dispatch/dispatch_test.go
  • dispatch/metric.go
  • featurecontrol/featurecontrol.go
  • inhibit/inhibit.go
  • inhibit/inhibit_bench_test.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/group.go
  • marker/group_test.go
  • marker/marker.go
  • marker/status.go
  • marker/util.go
  • notify/mute.go
  • notify/mute_test.go
  • notify/notify.go
  • provider/mem/mem.go
  • provider/mem/mem_test.go
  • silence/silence.go
  • silence/silence_bench_test.go
  • silence/silence_test.go
  • types/types.go
  • types/types_test.go
💤 Files with no reviewable changes (4)
  • api/api.go
  • types/types_test.go
  • provider/mem/mem.go
  • types/types.go
✅ Files skipped from review due to trivial changes (5)
  • marker/status.go
  • marker/group_test.go
  • marker/marker.go
  • notify/notify.go
  • marker/group.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • api/v2/api_test.go
  • marker/alert.go
  • inhibit/inhibit_bench_test.go
  • silence/silence_test.go
  • featurecontrol/featurecontrol.go
  • provider/mem/mem_test.go
  • dispatch/dispatch_bench_test.go
  • silence/silence.go

Comment on lines +131 to +133
if metrics != nil && metrics.alertsCollector != nil {
metrics.alertsCollector.dispatcher.Store(disp)
}

⚠️ Potential issue | 🟠 Major

Normalize nil metrics before storing them on the dispatcher.

NewDispatcherMetrics can return nil, and callers can also pass nil directly. Leaving d.metrics nil here makes the unguarded metric calls at Line 157, Line 264, Line 278, Line 451, and Line 485 panic in metrics-disabled setups. Please install a no-op metrics implementation or guard those call sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch.go` around lines 131 - 133, Normalize nil metrics when
constructing the dispatcher: if the incoming metrics or metrics.alertsCollector
is nil, replace them with a no-op implementation so subsequent unguarded calls
on d.metrics (e.g., in methods referencing d.metrics at the sites where you
currently call d.metrics.Alert.../d.metrics.*) are safe. Implement a simple
no-op type that satisfies the DispatcherMetrics interface (or the concrete shape
returned by NewDispatcherMetrics) with an alertsCollector containing a
dispatcher.Store method that is a no-op, then in the dispatcher constructor
(where you currently call metrics.alertsCollector.dispatcher.Store(disp)) set
metrics = NewNoopDispatcherMetrics() when metrics == nil or
metrics.alertsCollector == nil before storing the dispatcher.
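The nil-normalization pattern this comment asks for can be illustrated without the Prometheus client. `Counter`, `noopCounter`, `Metrics`, and `normalize` below are hypothetical stand-ins, not the actual `DispatcherMetrics` shape:

```go
package main

import "fmt"

// Counter is a minimal stand-in for a prometheus.Counter.
type Counter interface {
	Inc()
}

// noopCounter satisfies Counter but does nothing.
type noopCounter struct{}

func (noopCounter) Inc() {}

// Metrics is an illustrative stand-in for DispatcherMetrics.
type Metrics struct {
	aggrGroupLimitReached Counter
}

// normalize returns a usable metrics value even when the caller passed
// nil, so later unguarded calls such as m.aggrGroupLimitReached.Inc()
// cannot hit a nil pointer dereference.
func normalize(m *Metrics) *Metrics {
	if m == nil {
		return &Metrics{aggrGroupLimitReached: noopCounter{}}
	}
	return m
}

func main() {
	var m *Metrics // metrics-disabled setup
	m = normalize(m)
	m.aggrGroupLimitReached.Inc() // safe: no-op instead of a panic
	fmt.Println("ok")
}
```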

Contributor

@Spaceman1701 Spaceman1701 left a comment


Sorry for the long delay before a review. Overall this looks good to me, just a few minor comments.

Nothing major enough to block this change if someone else approves before I get a chance to re-review!

siavashs added 5 commits April 3, 2026 16:08
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs requested a review from Spaceman1701 April 3, 2026 14:38
@siavashs
Contributor Author

siavashs commented Apr 4, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
dispatch/dispatch.go (1)

131-133: ⚠️ Potential issue | 🔴 Critical

Unguarded d.metrics access will panic when metrics are disabled.

The nil-check here only guards the alertsCollector.dispatcher.Store call. However, d.metrics is used without nil checks at Lines 157, 264, 278, 451, and 485. When metrics is nil, those calls will cause a nil pointer dereference.

Either guard all metric call sites or normalize nil metrics to a no-op implementation in the constructor.

🛡️ Proposed fix: normalize nil metrics to no-op
 	disp.state.Store(DispatcherStateUnknown)
 	disp.loaded = make(chan struct{})
 	disp.ctx, disp.cancel = context.WithCancel(context.Background())
 
+	if metrics == nil {
+		metrics = NewNoopDispatcherMetrics()
+	}
+
 	if metrics != nil && metrics.alertsCollector != nil {
 		metrics.alertsCollector.dispatcher.Store(disp)
 	}

This requires implementing a NewNoopDispatcherMetrics() function that returns a *DispatcherMetrics with no-op gauge/counter/histogram implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch.go` around lines 131 - 133, The code currently only guards
metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially
nil elsewhere (d.metrics used at lines referenced), causing panics; fix by
normalizing nil metrics in the dispatcher constructor: implement
NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose
gauges/counters/histograms and alertsCollector are no-op implementations, and
replace any assignment that may leave d.metrics nil with a call to
NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the
constructor always sets d.metrics to NewNoopDispatcherMetrics() when the
incoming metrics parameter is nil).
🧹 Nitpick comments (2)
inhibit/inhibit_test.go (1)

45-50: Assert AlertStateActive on the non-muted path.

The false branch currently accepts both “explicitly cleared” and “never written”. If inhibit.Inhibitor.Mutes stops calling SetInhibited(fp, nil), this helper still passes because InhibitedBy stays empty on an unprocessed marker.

✅ Small test hardening
 	if wantMuted {
 		require.Equal(t, alert.AlertStateSuppressed, status.State, msgAndArgs...)
 		require.NotEmpty(t, status.InhibitedBy, msgAndArgs...)
 	} else {
+		require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...)
 		require.Empty(t, status.InhibitedBy, msgAndArgs...)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inhibit/inhibit_test.go` around lines 45 - 50, The non-muted branch of the
test only checks that status.InhibitedBy is empty but doesn't assert the state;
update the else branch after calling m.Status(fp) to also require that
status.State equals alert.AlertStateActive so the test fails if an unprocessed
marker leaves the state unset (e.g., if inhibit.Inhibitor.Mutes stops calling
SetInhibited). Keep the existing InhibitedBy emptiness check and add the
AlertStateActive assertion referencing status and alert.AlertStateActive.
api/v2/api.go (1)

483-507: Short-circuit label filters before recomputing status.

api.setAlertStatus reruns the silencer/inhibitor pipeline for every alert here, even when the label matchers already exclude it. On large /alerts and /alerts/groups scans with selective filters, that does unnecessary work in the hot path.

⚡ Suggested reorder
 	if !a.EndsAt.IsZero() && a.EndsAt.Before(now) {
 		return false
 	}
+	if !alertMatchesFilterLabels(&a.Alert, matchers) {
+		return false
+	}
 
 	// Set alert's current status based on its label set.
 	// The inhibitor and silencer write to m via the context.
 	ctx = marker.WithContext(ctx, m)
 	api.setAlertStatus(ctx, a.Labels)
@@
 	if !inhibited && len(status.InhibitedBy) != 0 {
 		return false
 	}
 
-	return alertMatchesFilterLabels(&a.Alert, matchers)
+	return true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v2/api.go` around lines 483 - 507, The code is calling api.setAlertStatus
(which reruns the silencer/inhibitor pipeline) before applying label filters,
causing unnecessary work; change the order so you call
alertMatchesFilterLabels(&a.Alert, matchers) first and return false when the
alert doesn't match, and only then call marker.WithContext(ctx, m) and
api.setAlertStatus(ctx, a.Labels) and query m.Status(a.Fingerprint()) for the
alerts that passed the label filter; ensure the existing short-circuit checks
that use status (m.Status, status.SilencedBy, status.InhibitedBy, status.State)
remain after setAlertStatus so they only run when needed.
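The reordering suggested above is the cheap-check-first principle: run the inexpensive label filter before the expensive status recomputation. A minimal sketch with hypothetical `matchesLabels` and `computeStatus` helpers standing in for `alertMatchesFilterLabels` and the silencer/inhibitor pipeline:

```go
package main

import "fmt"

type alertItem struct {
	labels map[string]string
}

// matchesLabels is a cheap label filter (stand-in for
// alertMatchesFilterLabels).
func matchesLabels(a alertItem, want map[string]string) bool {
	for k, v := range want {
		if a.labels[k] != v {
			return false
		}
	}
	return true
}

var statusCalls int

// computeStatus is an expensive per-alert computation (stand-in for
// rerunning the silencer/inhibitor pipeline).
func computeStatus(a alertItem) string {
	statusCalls++
	return "active"
}

// filter runs the cheap label check first, so excluded alerts never
// reach the expensive status path.
func filter(alerts []alertItem, want map[string]string) []alertItem {
	var out []alertItem
	for _, a := range alerts {
		if !matchesLabels(a, want) {
			continue
		}
		if computeStatus(a) != "active" {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	alerts := []alertItem{
		{labels: map[string]string{"team": "a"}},
		{labels: map[string]string{"team": "b"}},
		{labels: map[string]string{"team": "b"}},
	}
	got := filter(alerts, map[string]string{"team": "b"})
	// Status was computed only for the two label matches.
	fmt.Println(len(got), statusCalls)
}
```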
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@marker/context.go`:
- Around line 35-39: Update the comment on FromContext to accurately describe
its behavior: state that it returns the AlertMarker and a boolean indicating
presence, and that when absent it returns nil and false (not a no-op marker).
Reference the FromContext function and keyAlertMarker so readers know where to
look and ensure callers check the returned ok before using the marker.

---

Duplicate comments:
In `@dispatch/dispatch.go`:
- Around line 131-133: The code currently only guards
metrics.alertsCollector.dispatcher.Store(disp) but leaves d.metrics potentially
nil elsewhere (d.metrics used at lines referenced), causing panics; fix by
normalizing nil metrics in the dispatcher constructor: implement
NewNoopDispatcherMetrics() that returns a *DispatcherMetrics whose
gauges/counters/histograms and alertsCollector are no-op implementations, and
replace any assignment that may leave d.metrics nil with a call to
NewNoopDispatcherMetrics() so d.metrics is never nil (alternatively ensure the
constructor always sets d.metrics to NewNoopDispatcherMetrics() when the
incoming metrics parameter is nil).

---

Nitpick comments:
In `@api/v2/api.go`:
- Around line 483-507: The code is calling api.setAlertStatus (which reruns the
silencer/inhibitor pipeline) before applying label filters, causing unnecessary
work; change the order so you call alertMatchesFilterLabels(&a.Alert, matchers)
first and return false when the alert doesn't match, and only then call
marker.WithContext(ctx, m) and api.setAlertStatus(ctx, a.Labels) and query
m.Status(a.Fingerprint()) for the alerts that passed the label filter; ensure
the existing short-circuit checks that use status (m.Status, status.SilencedBy,
status.InhibitedBy, status.State) remain after setAlertStatus so they only run
when needed.

In `@inhibit/inhibit_test.go`:
- Around line 45-50: The non-muted branch of the test only checks that
status.InhibitedBy is empty but doesn't assert the state; update the else branch
after calling m.Status(fp) to also require that status.State equals
alert.AlertStateActive so the test fails if an unprocessed marker leaves the
state unset (e.g., if inhibit.Inhibitor.Mutes stops calling SetInhibited). Keep
the existing InhibitedBy emptiness check and add the AlertStateActive assertion
referencing status and alert.AlertStateActive.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 172d86ce-7958-458a-bd24-de7fac631af1

📥 Commits

Reviewing files that changed from the base of the PR and between 35b9fb6 and 231742a.

📒 Files selected for processing (13)
  • alert/state.go
  • api/v2/api.go
  • dispatch/dispatch.go
  • dispatch/metric.go
  • inhibit/inhibit.go
  • inhibit/inhibit_test.go
  • marker/alert.go
  • marker/alert_test.go
  • marker/context.go
  • notify/mute_test.go
  • notify/notify.go
  • silence/silence.go
  • silence/silence_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • silence/silence_test.go
  • notify/mute_test.go
  • silence/silence.go
  • notify/notify.go

Comment on lines +35 to +39
// FromContext extracts the AlertMarker from the context if present,
// otherwise returns a no-op marker.
func FromContext(ctx context.Context) (AlertMarker, bool) {
m, ok := ctx.Value(keyAlertMarker).(AlertMarker)
return m, ok

⚠️ Potential issue | 🟡 Minor

Fix the FromContext contract comment.

Lines 35-36 say the absent case returns a no-op marker, but the implementation returns nil, false, and the new tests assert that behavior. Leaving the comment as-is makes it easy for a future caller to skip the ok check and dereference a nil marker.

📝 Suggested doc fix
-// FromContext extracts the AlertMarker from the context if present,
-// otherwise returns a no-op marker.
+// FromContext extracts the AlertMarker from the context if present.
+// It returns (nil, false) when no marker has been attached.
 func FromContext(ctx context.Context) (AlertMarker, bool) {
 	m, ok := ctx.Value(keyAlertMarker).(AlertMarker)
 	return m, ok
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@marker/context.go` around lines 35 - 39, Update the comment on FromContext to
accurately describe its behavior: state that it returns the AlertMarker and a
boolean indicating presence, and that when absent it returns nil and false (not
a no-op marker). Reference the FromContext function and keyAlertMarker so
readers know where to look and ensure callers check the returned ok before using
the marker.


Successfully merging this pull request may close these issues.

Migrated from Global Marker to per Aggregation Group markers

5 participants