Skip to content

feat(notifier): independent alertmanager sendloops#16355

Merged
SuperQ merged 3 commits intoprometheus:mainfrom
siavashs:notifier-queue-per-am
Jan 20, 2026
Merged

feat(notifier): independent alertmanager sendloops#16355
SuperQ merged 3 commits intoprometheus:mainfrom
siavashs:notifier-queue-per-am

Conversation

@siavashs
Copy link
Contributor

@siavashs siavashs commented Mar 31, 2025

Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:

  • prometheus_notifications_dropped_total
  • prometheus_notifications_queue_capacity
  • prometheus_notifications_queue_length

Which issue(s) does the PR fix:

Closes #7676

This PR also includes the test from #14099

Does this PR introduce a user-facing change?

[ENHANCEMENT] independent alertmanager sendloops
[CHANGE] add `alertmanager` dimention to following metrics: `prometheus_notifications_dropped_total`, `prometheus_notifications_queue_capacity`, `prometheus_notifications_queue_length`

Signed-off-by: Siavash Safi siavash@cloudflare.com
Signed-off-by: machine424 ayoubmrini424@gmail.com

@siavashs siavashs force-pushed the notifier-queue-per-am branch 3 times, most recently from 299dfc8 to f1ea687 Compare March 31, 2025 07:39
Copy link
Contributor

@OGKevin OGKevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@siavashs siavashs force-pushed the notifier-queue-per-am branch from f1ea687 to 2969971 Compare April 1, 2025 00:30
@krajorama krajorama self-requested a review April 1, 2025 07:12
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.

In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

@siavashs
Copy link
Contributor Author

siavashs commented Apr 1, 2025

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.

In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.

The refactoring commit is kept separate, so we can easily do another PR for that one first.
I can submit that later today.

@krajorama
Copy link
Member

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.

The refactoring commit is kept separate, so we can easily do another PR for that one first. I can submit that later today.

Hi, we discussed this with @grobinson-grafana and yes we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance!

@siavashs
Copy link
Contributor Author

siavashs commented Apr 3, 2025

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.
The refactoring commit is kept separate, so we can easily do another PR for that one first. I can submit that later today.

Hi, we discussed this with @grobinson-grafana and yes we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance!

Created #16398

@siavashs siavashs force-pushed the notifier-queue-per-am branch 6 times, most recently from e6e57c0 to 1605a50 Compare April 8, 2025 19:42
@siavashs siavashs requested a review from krajorama April 8, 2025 19:42
@krajorama
Copy link
Member

hi @siavashs , will you have time to rebase this PR and continue? thanks

@siavashs siavashs force-pushed the notifier-queue-per-am branch from 1605a50 to 410c4f3 Compare April 22, 2025 07:55
@siavashs siavashs marked this pull request as ready for review April 28, 2025 08:31
@siavashs siavashs force-pushed the notifier-queue-per-am branch 3 times, most recently from 70395be to 5438ece Compare May 1, 2025 15:21
@siavashs siavashs requested a review from krajorama May 1, 2025 15:37
@siavashs siavashs force-pushed the notifier-queue-per-am branch from 5438ece to 62c7f17 Compare May 2, 2025 09:59
@siavashs
Copy link
Contributor Author

siavashs commented May 2, 2025

@krajorama this is ready for another review.

@siavashs siavashs force-pushed the notifier-queue-per-am branch 2 times, most recently from 3e7680c to 12352ae Compare November 28, 2025 11:06
krajorama and others added 2 commits November 28, 2025 12:16
Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the notifier-queue-per-am branch 2 times, most recently from 0417be9 to 78b490d Compare November 28, 2025 11:28
@siavashs
Copy link
Contributor Author

Github is having a bad day, random stuff are failing.

@siavashs siavashs force-pushed the notifier-queue-per-am branch 2 times, most recently from ffe57b9 to 38f95f2 Compare November 28, 2025 14:02
@siavashs
Copy link
Contributor Author

All checks are green again 🎉

@siavashs
Copy link
Contributor Author

Thanks for the last review @machine424
Let me know how to proceed with the last few items remaining.

@siavashs siavashs force-pushed the notifier-queue-per-am branch from 38f95f2 to ea05297 Compare December 2, 2025 12:49
@siavashs siavashs requested a review from machine424 December 2, 2025 12:51
Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once #16355 (comment) is addressed.

let's give this a try, thanks again for working on it!

@siavashs siavashs force-pushed the notifier-queue-per-am branch from ea05297 to 63c4b7c Compare January 12, 2026 12:35
@siavashs
Copy link
Contributor Author

Thank you @machine424
I resolved that comment.

@siavashs
Copy link
Contributor Author

Hmm, I made a small change and now random unrelated CI checks are failing 🤔

@siavashs siavashs force-pushed the notifier-queue-per-am branch from 63c4b7c to 5bcb4dd Compare January 12, 2026 14:43
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Copy link
Contributor Author

All green again, let's merge this thing 😄

@SuperQ SuperQ merged commit a89c665 into prometheus:main Jan 20, 2026
32 checks passed
@machine424
Copy link
Member

I was on PTO and just returned today, thanks for taking care of this.

@siavashs siavashs deleted the notifier-queue-per-am branch January 20, 2026 10:01
@machine424
Copy link
Member

seems TestHandlerSendAll is flaky onboth linux https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149084 and windows https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149078

maybe it only flakes on the CI, we can add some debug logs and make it run there

@siavashs
Copy link
Contributor Author

seems TestHandlerSendAll is flaky onboth linux https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149084 and windows https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149078

maybe it only flakes on the CI, we can add some debug logs and make it run there

I could not reproduce this locally. Can we bump the CI machine to a bigger one?

@machine424
Copy link
Member

let's try to fix the test/or the code for the current CI first.
You can open a PR with some fmt.Println() around the check to see what's going on and try to reproduce the failure.

(I can look into this if you want when I have some time)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Notification queue fills with single down AM instance

7 participants