
feat(dispatch): add start delay#4704

Merged
SuperQ merged 1 commit into prometheus:main from siavashs:feat/dispatch-wait
Nov 15, 2025

Conversation

@siavashs
Contributor

@siavashs siavashs commented Nov 6, 2025

This change adds a new cmd flag --dispatch.start-delay which corresponds to the --rules.alert.resend-delay flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush until we are confident all alerts have been resent by the Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:

  • remove hasFlushed flag from aggrGroup
  • remove mutex locking from aggrGroup

@siavashs siavashs changed the title feat(dispatch): honor group_wait on first flush & sync with Prometheus' --alerts.resend-delay feat(dispatch): honor group_wait on first flush & sync with Prometheus' --rules.alerts.resend-delay Nov 6, 2025
Contributor

@Spaceman1701 Spaceman1701 left a comment


Would you mind splitting this into two PRs? One that adds the --alerts.resend-delay and one that adds the wait_on_startup config to the route?

I'm asking because I think the --alerts.resend-delay is something we should definitely merge, but I'm a little concerned about wait_on_startup.

From the description in the PR, it seems like these are both aimed at solving the same problem: the inhibitor and the dispatcher race on Alertmanager restart because Alertmanager has to wait for Prometheus to resend alerts. resend-delay seems to address this directly, while wait_on_startup seems more like a hack: there's no guarantee that group_wait is the right duration to wait after a restart. Additionally, group_wait is intended to express the user's logic, not to handle the protocol between Alertmanager and Prometheus. I wouldn't want to give users competing concerns around what value to use for group_wait.

Is there any other use case you envision for wait_on_startup that I might be missing?

// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
Contributor


Somewhat unrelated to this change, but I noticed it when reviewing the new code: I think there's a very minor logic bug here. If an alert's StartsAt is in the past, but not at least ag.opts.GroupWait in the past, I think we should check whether the next flush is before or after the time it would be scheduled purely from the new alert. If it's after, we should reset the timer to that duration. I don't think we're keeping track of the next flush time outside of the timer, so that'd need to change too 🤔

E.g.

wantedFlush := time.Since(alert.StartsAt.Add(ag.opts.GroupWait))
if wantedFlush < time.Duration(0) {
	wantedFlush = time.Duration(0)
}
actualFlush := ag.durationToNextFlush()
if wantedFlush < actualFlush {
	timer.Reset(wantedFlush)
}

I don't think we should change the behavior in this PR though. Perhaps as a follow up.

Contributor Author


Good catch, we can add it here or as a follow up.

@juliusv
Member

juliusv commented Nov 7, 2025

> there's no guarantee that group_wait is the right duration to wait after a restart.

That's what I was thinking as well: some people may even have a group_wait: 1d for low-prio grouped alerts. Then you would never get any alerts if you restarted Alertmanager once a day, right?

@siavashs
Contributor Author

siavashs commented Nov 7, 2025

I'm dropping WaitOnStartup since we never used it internally and, based on the comments, it can be tricky if a user uses a long group_wait value.

@siavashs siavashs changed the title feat(dispatch): honor group_wait on first flush & sync with Prometheus' --rules.alerts.resend-delay feat(dispatch): sync with Prometheus resend delay Nov 7, 2025
@siavashs
Contributor Author

siavashs commented Nov 7, 2025

We are now failing this test, which is vague; I remember debugging it before but not documenting it:

func TestReload(t *testing.T) {
	t.Parallel()

	// This integration test ensures that the first alert isn't notified twice
	// and repeat_interval applies after the AlertManager process has been
	// reloaded.
	conf := `
route:
  receiver: "default"
  group_by: []
  group_wait: 1s
  group_interval: 6s
  repeat_interval: 10m

receivers:
- name: "default"
  webhook_configs:
  - url: 'http://%s'
`

	at := NewAcceptanceTest(t, &AcceptanceOpts{
		Tolerance: 150 * time.Millisecond,
	})

	co := at.Collector("webhook")
	wh := NewWebhook(t, co)

	amc := at.AlertmanagerCluster(fmt.Sprintf(conf, wh.Address()), 1)

	amc.Push(At(1), Alert("alertname", "test1"))
	at.Do(At(3), amc.Reload)
	amc.Push(At(4), Alert("alertname", "test2"))

	co.Want(Between(2, 2.5), Alert("alertname", "test1").Active(1))
	// Timers are reset on reload regardless, so we count the 6 second group
	// interval from 3 onwards.
	co.Want(Between(9, 9.5),
		Alert("alertname", "test1").Active(1),
		Alert("alertname", "test2").Active(4),
	)

	at.Run()

	t.Log(co.Check())
}

@ultrotter
Contributor

> We are now failing this test, which is vague; I remember debugging it before but not documenting it:
> […]

Maybe checking the hasFlushed condition would help this test too? After all, that is exactly what should prevent notifying twice?

@siavashs
Contributor Author

siavashs commented Nov 7, 2025

> Maybe checking the hasFlushed condition would help this test too? After all, that is exactly what should prevent notifying twice?

So the problem is not a duplicate notification but an earlier-than-expected notification:

        interval [2,2.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
          [ ✓ ]
        interval [9,9.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:04.000Z <nil> <nil> { map[alertname:test2]}}[-9.223372036854776e+09:]
          [ ✗ ]

        received:
        @ 2.00549375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        @ 4.009307375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:51.706+01:00 <nil> <nil> { map[alertname:test2]}}[4.003107:]
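A quick sanity check of the timings, using the values from the test config and the collector output above:

```go
package main

import "fmt"

func main() {
	// Timings from the acceptance test, in seconds from test start.
	reloadAt := 3.0
	groupInterval := 6.0

	// The test expects the post-reload flush to respect group_interval,
	// counted from the reload (timers reset on reload):
	expected := reloadAt + groupInterval
	observed := 4.009307375 // second batch's arrival in the output above

	fmt.Printf("expected ~%.1f, observed ~%.1f, early by ~%.1fs\n",
		expected, observed, expected-observed)
}
```

The second batch arrives roughly five seconds too early, i.e. right after the second alert's push instead of after the group interval.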

@siavashs siavashs changed the title feat(dispatch): sync with Prometheus resend delay feat(dispatch): add start delay Nov 10, 2025
@siavashs
Contributor Author

hasFlushed is dropped now since the Dispatcher controls that with status checks after creating an AG.
This also means that AGs are now lock-free.

@siavashs siavashs force-pushed the feat/dispatch-wait branch 2 times, most recently from bcfbfbe to 960ed5e on November 10, 2025 18:00
@siavashs siavashs force-pushed the feat/dispatch-wait branch 3 times, most recently from 253e717 to 885824c on November 13, 2025 09:11
@siavashs
Contributor Author

It seems I broke the acceptance tests again; I'll check and fix them.

maxSilenceSizeBytes = kingpin.Flag("silences.max-silence-size-bytes", "Maximum silence size in bytes. If negative or zero, no limit is set.").Default("0").Int()
alertGCInterval = kingpin.Flag("alerts.gc-interval", "Interval between alert GC.").Default("30m").Duration()
dispatchMaintenanceInterval = kingpin.Flag("dispatch.maintenance-interval", "Interval between maintenance of aggregation groups in the dispatcher.").Default("30s").Duration()
DispatchStartDelay = kingpin.Flag("dispatch.start-delay", "Minimum amount of time to wait before dispatching alerts. This option should be synced with value of --rules.alert.resend-delay on Prometheus.").Default("0s").Duration()
Contributor

The default in Prometheus is 1m, so should the default in AM also be 1m?

Contributor Author

@siavashs siavashs Nov 13, 2025

I initially set this to 1m but it changes the default behaviour and lots of acceptance tests fail. We can avoid breaking the existing tests by setting it to 0 for all tests.
Adjusting timings is not possible since we add +1m to each test.

But I thought the same thing could happen to users unexpectedly if they don't pay attention to the changelog and this new cmd flag.

I'm open to setting the default to 1m to sync it with Prometheus defaults.

Contributor

I see, whatever you choose, you will choose wrong :D

Member

I think we should set this to 1m to match the Prometheus defaults. But we can do that in a followup PR.

Contributor

@ultrotter ultrotter left a comment


LGTM, with a few minor comments! Thanks!


mtx sync.RWMutex
hasFlushed bool
running atomic.Bool
Contributor

I know we use go.uber.org/atomic elsewhere, but does this buy us anything over sync/atomic Bool? https://pkg.go.dev/sync/atomic#pkg-types Should we use this and start a switch?

Contributor Author

There is currently a lint check which enforces this.
I guess we inherit this from Prometheus.
I need to check if this is still required.
cc @SuperQ

Member

FYI, there's an open effort in Prometheus to switch to the new stdlib atomic types (those only got added in Go 1.19).

Member

Yea, I'm not sure. This was introduced in Prometheus in 2020. prometheus/prometheus#7647

I guess we'll need to find out from the rest of the Prometheus devs if we still need this.

Member

We can remove the check, std lib is what we want as per prometheus/prometheus#14866

Member

Looks like we can migrate to the stdlib now: prometheus/prometheus#14866

Contributor Author

@siavashs siavashs Nov 14, 2025

I guess we wait for this to be fixed in Prometheus, and Alertmanager will get it as part of the next sync of the common CI/build setup.

Contributor

Yeah, I was going to note that the new stdlib fixes the issue and provides that... I think we can target a migration for post 0.30, since it's more of a cleanup than something that is urgently needed?

Contributor

@Spaceman1701 Spaceman1701 left a comment


Overall, LGTM now. The state-to-string map is the only thing I'd really prefer we changed before merging.

@siavashs siavashs force-pushed the feat/dispatch-wait branch 3 times, most recently from 00d1c63 to 6f69d98 Compare November 14, 2025 14:31
@siavashs siavashs self-assigned this Nov 14, 2025
@TheMeier
Contributor

Should we add this one to the v0.30.0 project? Looks pretty far along to me.

@siavashs
Contributor Author

> Should we add this one to the v0.30.0 project? Looks pretty far along to me.

I think it can make it for that release; all related comments are resolved.
I'll wait for a final review by @Spaceman1701 and @ultrotter.

Contributor

@Spaceman1701 Spaceman1701 left a comment

LGTM. Thanks for making those changes!

This change adds a new cmd flag `--dispatch.start-delay` which
corresponds to the `--rules.alert.resend-delay` flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits
before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay
the aggregation groups' first flush, until we are confident all alerts
are resent by Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup

Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@SuperQ SuperQ merged commit 2e0970e into prometheus:main Nov 15, 2025
7 checks passed
holger-waschke pushed a commit to holger-waschke/alertmanager that referenced this pull request Nov 18, 2025
@SoloJacobs SoloJacobs mentioned this pull request Nov 24, 2025
@siavashs siavashs deleted the feat/dispatch-wait branch December 8, 2025 11:38