Remove immediate flush on reload/restart #3419
Closed
grobinson-grafana wants to merge 6 commits into prometheus:main
This commit changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart, as this behavior causes a number of issues:

1. Alertmanager will send notifications for inhibited alerts if, following a restart, the inhibited alert is sent to Alertmanager before the inhibiting alert.
2. Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (prometheus#3407).

A potential issue with this change is that, following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large, notifications could take longer to send than expected.

Signed-off-by: George Robinson <george.robinson@grafana.com>
I've just read a comment on another issue that suggests this might be causing issues for users:
We've been running a patched version of Alertmanager that includes this PR in production for a week. Is there any chance we can get some momentum behind merging this?
What this PR does
This pull request changes Alertmanager so it no longer flushes aggregation groups on configuration reload or restart, as this behavior causes a number of issues:

1. Alertmanager will send notifications for inhibited alerts if, following a restart, the inhibited alert is sent to Alertmanager before the inhibiting alert (https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html).
2. Reloading Alertmanager via /-/reload can cause incomplete flushes of aggregation groups (Reloading the config leads to incorrect notifications being sent due to a race condition #3407).
3. Reloading or restarting an Alertmanager while it is sending a notification can cause a race between the reloaded/restarted Alertmanager and the next peer in the Alertmanager cluster. In some cases this can cause firing and resolved notifications to be sent out of order, for example: resolved, firing, resolved.

A potential issue with this change is that, following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large, notifications could take longer to send than expected. Frequent reloads in combination with a large group_wait could even prevent alerts from being flushed at all.
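To make the group_wait trade-off concrete, here is a minimal simulation sketch. It is not Alertmanager code: `firstFlush` is a hypothetical helper, and it assumes that every reload restarts the group_wait timer from zero and that the alert is received again immediately after each reload.

```go
package main

import (
	"fmt"
	"time"
)

// firstFlush is a hypothetical model, not part of Alertmanager.
// It returns the offset at which an aggregation group would first flush,
// assuming (as a simplification) that each reload restarts group_wait
// from zero. reloads must be sorted in ascending order. The second
// return value is false if no flush happens within the horizon.
func firstFlush(groupWait, horizon time.Duration, reloads ...time.Duration) (time.Duration, bool) {
	start := time.Duration(0) // when the current group_wait timer started
	for _, r := range reloads {
		if r >= start+groupWait {
			break // the timer fired before this reload happened
		}
		start = r // the reload discards the elapsed wait
	}
	due := start + groupWait
	if due > horizon {
		return 0, false
	}
	return due, true
}

func main() {
	groupWait := 5 * time.Minute
	horizon := 30 * time.Minute

	// Without reloads the group flushes once group_wait has elapsed.
	fmt.Println(firstFlush(groupWait, horizon))

	// With a reload every 3 minutes the timer never reaches group_wait.
	var reloads []time.Duration
	for t := 3 * time.Minute; t <= horizon; t += 3 * time.Minute {
		reloads = append(reloads, t)
	}
	fmt.Println(firstFlush(groupWait, horizon, reloads...))
}
```

Under these assumptions, any reload interval shorter than group_wait starves the group for as long as the reloads continue, which is the worst case described above.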