Fix config reloads not respecting group interval #3074
Conversation
Signed-off-by: Andre Branchizio <andre.branchizio@shopify.com>
notify/notify.go
```go
}

if !entry.IsFiringSubset(firing) {
	groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))
```
Any chance we want to use keyNow from the context here?
This is somehow a feature because sending resolved notifications immediately might exacerbate flapping notifications.
Agreed, but would it be better to make it work like that by default with the ability to configure?
In #3283 I try to solve another problem in a similar way. There are some differences though:
I have modified this quite a bit in Shopify's Alertmanager fork: I do not use the 0.99 constant anymore, and I also had to solve alert flapping with a hard-coded check for whether the entry is recent. The flappy behavior was amplified by how we are running Alertmanager, because we change https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go#L452. For what it's worth, here is the needsUpdate in our forked version
Interesting, in your code you use time.Now(), which is a wall clock, but in this PR you use the time from the context.
Add an acceptance test that triggers a config reload and verifies
that no early notification occurs.
One of the challenges is which time to use to check for a previous
notification. The nflog captures roughly the time all notifications
were sent. That conflicts with the ag.next timer, which gets reset
before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to
fail.
I considered the following ways to fix it:
1.) Modify the nflog.proto to capture the flush time in addition
to the successful notification time.
2.) In addition to hasFlushed capture the last flush time and pass
it to the context like we do for Now.
3.) Set hasFlushed and reset the timer after the flush has been
done.
I started with option 3 as it seemed to have the fewest downsides with
things like drift. Based on later comments this is now option 1.
needsUpdate is based on:
prometheus#3074 (comment)
Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
Hi! 👋 I found this issue from another PR.
Config reload will re-create all existing aggregation groups. This causes an immediate flush because of this code here:

```go
// Immediately trigger a flush if the wait duration for this
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}
```

I think we should wait group_wait seconds. I'm not sure if we need to persist and wait the remainder of the previous group_interval?
Like Simon said, I think this is intended behavior. If you have a group of size 1 (i.e. grouping is disabled) that is flapping between firing and resolved, won't this change make Alertmanager send a notification each time it flaps, rather than smooth it out?
We have been running this modification at Shopify for the last few weeks and it has solved two issues for us.
Related issues that this addresses:
This change does the following:
TODO:
This change adds even further to the confusion between group_interval and group_wait. Three separate configs seem more appropriate: group_wait (how long we wait to start the evaluation background job for an alert group), group_interval (how frequently an alert group is evaluated for dispatch), and group_repeat_interval (how long to wait before sending a notification about new alerts that are added to a group of alerts). This would allow for finer tuning of alert dispatching.