feat(dispatch): add start delay #4704
Conversation
Force-pushed: a6aaf6b → e524fe8
Spaceman1701
left a comment
Would you mind splitting this into two PRs? One that adds the --alerts.resend-delay and one that adds the wait_on_startup config to the route?
I'm asking because I think the --alerts.resend-delay is something we should definitely merge, but I'm a little concerned about wait_on_startup.
From the description in the PR, it seems like these are both aimed at solving the same problem: the inhibitor and the dispatcher race on alertmanager restart because alertmanager has to wait for prometheus to resend alerts. resend-delay seems to address this directly, while wait_on_startup seems more like a hack - there's no guarantee that group_wait is the right duration to wait after a restart. Additionally, group_wait is intended to express the user's logic, not to handle the protocol between alertmanager and prometheus. I wouldn't want to give users competing concerns around what value to use for group_wait.
Is there any other use case you envision for wait_on_startup that I might be missing?
```go
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
```
Somewhat unrelated to this change, but I noticed it when reviewing the new code. I think there's a very minor logic bug here: if an alert's StartsAt is in the past, but not at least ag.opts.GroupWait in the past, we should check whether the next flush is scheduled before or after the time the new alert alone would imply. If it's after, we should reset the timer to that duration. I don't think we're keeping track of the next flush time outside of the timer, so that would need to change too 🤔
E.g.

```go
wantedFlush := time.Since(alert.StartsAt.Add(ag.opts.GroupWait))
if wantedFlush < time.Duration(0) {
	wantedFlush = time.Duration(0)
}
actualFlush := ag.durationToNextFlush()
if wantedFlush < actualFlush {
	timer.Reset(wantedFlush)
}
```

I don't think we should change the behavior in this PR though. Perhaps as a follow up.
Good catch, we can add it here or as a follow up.
That's what I was thinking as well: some people may even have a

I'm dropping the
Force-pushed: e524fe8 → aa879e1
We are now failing this test, which is vague and which I remember debugging before but not documenting: alertmanager/test/with_api_v2/acceptance/send_test.go, lines 422 to 466 in aa879e1
Maybe checking the hasFlushed condition would help this test too? After all, that should be exactly what prevents notifying twice.
So the problem is not a duplicate notification but an earlier-than-expected notification:
Force-pushed: aa879e1 → 0247eba
Force-pushed: bcfbfbe → 960ed5e
Force-pushed: 253e717 → 885824c
Force-pushed: 885824c → 1028c37
It seems I broke acceptance tests again, I'll check and fix them.
Force-pushed: 1028c37 → 0222298
Force-pushed: 0222298 → d527f3f
```go
maxSilenceSizeBytes         = kingpin.Flag("silences.max-silence-size-bytes", "Maximum silence size in bytes. If negative or zero, no limit is set.").Default("0").Int()
alertGCInterval             = kingpin.Flag("alerts.gc-interval", "Interval between alert GC.").Default("30m").Duration()
dispatchMaintenanceInterval = kingpin.Flag("dispatch.maintenance-interval", "Interval between maintenance of aggregation groups in the dispatcher.").Default("30s").Duration()
DispatchStartDelay          = kingpin.Flag("dispatch.start-delay", "Minimum amount of time to wait before dispatching alerts. This option should be synced with value of --rules.alert.resend-delay on Prometheus.").Default("0s").Duration()
```
The default in Prometheus is 1m, so should the default in AM also be 1m?
I initially set this to 1m, but it changes the default behaviour and lots of acceptance tests fail. We can avoid breaking the existing tests by setting it to 0 for all tests.
Adjusting timings is not possible since it would add +1m to each test.
But I thought the same thing could happen to users unexpectedly if they don't pay attention to the changelog and this new cmd flag.
I'm open to setting the default to 1m to sync it with the Prometheus defaults.
I see, whatever you choose, you will choose wrong :D
I think we should set this to 1m to match the Prometheus defaults. But we can do that in a followup PR.
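To make the pairing discussed above concrete, the two flags are meant to be kept in sync across the two binaries. A minimal sketch, assuming the flag names from this PR land as shown; the config file paths are illustrative:

```shell
# Prometheus side: interval at which firing alerts are re-sent to
# Alertmanager (Prometheus's default is 1m).
prometheus --config.file=prometheus.yml --rules.alert.resend-delay=1m

# Alertmanager side: hold the first dispatch for the same duration, so a
# restarted Alertmanager has received everything Prometheus will resend
# before it starts flushing aggregation groups.
alertmanager --config.file=alertmanager.yml --dispatch.start-delay=1m
```

If the two values drift apart, an Alertmanager delay shorter than the Prometheus resend interval reopens the window for the inhibitor/dispatcher race this PR is trying to close.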
ultrotter
left a comment
LGTM, with a few minor comments! Thanks!
```go
mtx        sync.RWMutex
hasFlushed bool
running    atomic.Bool
```
I know we use go.uber.org/atomic elsewhere, but does this buy us anything over sync/atomic's Bool? https://pkg.go.dev/sync/atomic#pkg-types Should we use the stdlib type and start a switch?
There is currently a lint check which enforces this.
I guess we inherit this from Prometheus.
I need to check if this is still required.
cc @SuperQ
FYI, there's an open effort in Prometheus to switch to the new stdlib atomic types (those only got added in Go 1.19):
Yea, I'm not sure. This was introduced in Prometheus in 2020. prometheus/prometheus#7647
I guess we'll need to find out from the rest of the Prometheus devs if we still need this.
We can remove the check, std lib is what we want as per prometheus/prometheus#14866
Looks like we can migrate to the stdlib now: prometheus/prometheus#14866
I guess we wait for this to be fixed in Prometheus, and Alertmanager will get it as part of the next sync of the common CI/build setup.
Yeah, I was going to note that the new stdlib fixes the issue and provides that... I think we can target a migration for post 0.30, since it's more of a cleanup than something that is urgently needed?
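For reference, a minimal sketch of what the eventual migration could look like. The stdlib's sync/atomic.Bool (added in Go 1.19) covers the same Load/Store/CompareAndSwap surface used here; the aggrGroup type below is illustrative, not the actual Alertmanager struct:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// aggrGroup is a hypothetical stand-in: a field declared as
// `running go.uber.org/atomic.Bool` can become the stdlib
// `running atomic.Bool` with no other code changes in this pattern.
type aggrGroup struct {
	running atomic.Bool
}

// start flips running from false to true exactly once.
// CompareAndSwap makes "start once" race-free without a mutex.
func (ag *aggrGroup) start() bool {
	return ag.running.CompareAndSwap(false, true)
}

func main() {
	ag := &aggrGroup{}
	fmt.Println(ag.start()) // → true  (first start succeeds)
	fmt.Println(ag.start()) // → false (second start is a no-op)
}
```

The zero value of sync/atomic.Bool is usable as-is, which matches how the go.uber.org/atomic type is embedded in structs today.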
Spaceman1701
left a comment
Overall, LGTM now. The state to string map is the only thing I'd really prefer that we changed before merging.
Force-pushed: 00d1c63 → 6f69d98
Should we add this one to the v0.30.0 project? Looks pretty far along to me.

I think it can make it for that release, all related comments are resolved.
Spaceman1701
left a comment
LGTM. Thanks for making those changes!
Force-pushed: 6f69d98 → 5edf157
This change adds a new cmd flag `--dispatch.start-delay` which corresponds to the `--rules.alert.resend-delay` flag in Prometheus. This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush until we are confident all alerts have been resent by Prometheus instances. This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup

Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Force-pushed: 5edf157 → a78e2ee
This change adds a new cmd flag `--dispatch.start-delay` which corresponds to the `--rules.alert.resend-delay` flag in Prometheus. This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush until we are confident all alerts have been resent by Prometheus instances. This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup

Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Holger Waschke <holger.waschke@dvag.com>