Fix flapping notifications in HA mode #3283
yuri-tceretian wants to merge 1 commit into prometheus:main from yuri-tceretian:fix-obsolete-tick-dedup
Conversation
I do not think the failing CI is related to my change. I inspected the logs and did not find any message that I introduced.
I ran the tests locally.
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
I think I might have figured out the problem! My version of the main branch was ancient. I synced my fork and rebased the PR onto the head of main. CI is still failing, but now it is another test (seems to be flaky).
I have restarted the CI and filed #3287.
```go
}

if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
	if entry != nil {
```
Why not move this verification into needsUpdate()?
Initially I wanted to do it there. My idea was to emit a log message only when needsUpdate is positive (to better indicate the race). To achieve that inside needsUpdate I would have to rewrite the method, which I thought would reduce clarity. I can refactor that method if you think it would make the change easier to understand.
I don't think this is the correct change to make. If we're confident this only happens under certain conditions, then I'd say we need to protect against that at a configuration level rather than dropping evaluations. I don't fully understand, and I'm not confident about, what kind of side effects this produces.
Closing as this does not seem to have any potential to be merged.
In some situations in high-availability mode, a cluster of Alertmanagers in a normal state (e.g. notifications are sent immediately, no delays in state propagation, etc.) can send multiple (flapping) notifications for the same alert because of an unfortunate combination of the parameters `peer_timeout` and `group_interval`.

How to reproduce the bug:

- `peer_timeout=60s`
- `group_wait=30s`, `group_interval=70s`, and `repeat_interval=2d` (to exclude the possibility of repeat notifications)
- a receiver that sends a webhook

Diagram
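The settings above could look roughly like this in an Alertmanager configuration (an illustrative sketch, not the actual files from the reproduction archive; the receiver name and webhook URL are placeholders):

```yaml
# Sketch of the reproduction settings described above (illustrative only).
# Note: peer_timeout is not part of this file; it is set via the
# command-line flag, e.g. --cluster.peer-timeout=60s on each instance.
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 70s
  repeat_interval: 2d   # large on purpose, to rule out repeat notifications
receivers:
  - name: webhook
    webhook_configs:
      - url: http://localhost:5001/   # placeholder endpoint
```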
alertmanager-bug.zip contains everything that is needed to reproduce the bug. Unpack the archive and run `docker compose up`. When the cluster is up, run the script `./send.sh test`, which sends an alert to all 3 instances, waits for 50s, and then sends the same alert but with `EndsAt=now`.

This happens because an instance that spends more than `group_interval` in the `WaitStage` can encounter a state "from the future".
In the diagram above, `Alertmanager 3`, which processes tick `now=30` with 1 firing alert, compares during the `DedupStage` the current aggregation group with the state produced by `Alertmanager 1` while it was processing tick `now=100`, where the alert was resolved.

This PR proposes a fix for this behavior. It introduces an additional check that runs when the predicate `DedupStage.needsUpdate` returns true and a notification log entry exists. It compares the current tick time (the timestamp of the aggregation group tick) with the notification log timestamp; if the former is in the past, which means a `Log` event happened after the flush, it emits an info-level log message and returns an empty slice of alerts, which stops the current pipeline.