Fix config reloads not respecting group interval #3074
Conversation
Signed-off-by: Andre Branchizio <andre.branchizio@shopify.com>
notify/notify.go
```go
}

if !entry.IsFiringSubset(firing) {
	groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))
```
Any chance we want to use keyNow from the context here?
This is somehow a feature because sending resolved notifications immediately might exacerbate flapping notifications.
Agreed, but would it be better to make it work like that by default with the ability to configure?
In #3283 I try to solve another problem in a similar way. There are some differences though:
I have modified this quite a bit in Shopify's Alertmanager fork: I do not use the 0.99 constant anymore, and I also had to solve alert flapping with a hard-coded check for whether the entry is recent. The flappy behavior was amplified by how we are running Alertmanager, because we change https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go#L452. For what it's worth, here is the needsUpdate in our forked version
Interesting, in your code you use time.Now(), which is a wall clock, but in this PR you use the time from the context.
Add an acceptance test that triggers a config reload and verifies
that no early notification occurs.
One of the challenges is which time to use to check for a previous
notification. The nflog captures roughly the time all notifications
were sent. That conflicts with the ag.next timer, which gets reset
before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to
fail.
I considered the following ways to fix it:
1.) Modify the nflog.proto to capture the flush time in addition
to the successful notification time.
2.) In addition to hasFlushed capture the last flush time and pass
it to the context like we do for Now.
3.) Set hasFlushed and reset the timer after the flush has been
done.
I started with option 3 as it seemed to have the fewest downsides with
things like drift. Based on later comments this is now option 1.
needsUpdate is based on:
prometheus#3074 (comment)
Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
Hi! 👋 I found this issue from another PR.
Config reload will re-create all existing aggregation groups. This causes an immediate flush because of this code here:

```go
// Immediately trigger a flush if the wait duration for this
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}
```

I think we should wait group_wait seconds. I'm not sure if we need to persist and wait the remainder of the previous group_interval?
Like Simon said, I think this is intended behavior. If you have a group of size 1 (i.e. grouping is disabled) that is flapping between firing and resolved, won't this change make Alertmanager send a notification each time it flaps, rather than smooth it out?
We have been running this modification at Shopify for the last few weeks and it has solved two issues for us.
Related issues that this addresses:
This change does the following:
TODO:
This change adds even further to the confusion between group_interval and group_wait. Three separate configs seem more appropriate: group_wait (how long we wait to start the evaluation background job for an alert group), group_interval (how frequently an alert group is evaluated for dispatch), and group_repeat_interval (how long to wait before sending a notification about new alerts that are added to a group of alerts). This would allow for finer tuning of alert dispatching.