feat(notifier): independent alertmanager sendloops #16355
SuperQ merged 3 commits into prometheus:main
Conversation
krajorama left a comment:
Hi, awesome start! I've done my first pass on the code and left a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what others say on that first.
Hi, thanks for your feedback. The refactoring commit is kept separate, so we can easily open another PR for it first.
Hi, we discussed this with @grobinson-grafana and yes, we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance!
Created #16398
Hi @siavashs, will you have time to rebase this PR and continue? Thanks!
@krajorama this is ready for another review.
Ref: prometheus#7676
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
GitHub is having a bad day; random things are failing.
All checks are green again 🎉
Thanks for the last review, @machine424!
machine424 left a comment:
Let's give this a try once #16355 (comment) is addressed. Thanks again for working on it!
Thank you @machine424!
Hmm, I made a small change and now random unrelated CI checks are failing 🤔
Independent Alertmanager sendloops avoid queue overflow when one or more Alertmanager instances are unavailable, which could result in lost alert notifications. The sendloops are managed per AlertmanagerSet, which are dynamically added/removed with service discovery or configuration reload.

The following metrics now include an extra dimension for the alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
All green again, let's merge this thing 😄
I was on PTO and just returned today; thanks for taking care of this.
It seems it may only flake on CI; we can add some debug logs and run it there.
I could not reproduce this locally. Can we bump the CI machine to a bigger one?
Let's try to fix the test and/or the code for the current CI first. (I can look into this when I have some time, if you want.)
Independent Alertmanager sendloops avoid queue overflow when one or more Alertmanager instances are unavailable, which could result in lost alert notifications.
The sendloops are managed per AlertmanagerSet, which are dynamically added/removed with service discovery or configuration reload.
The following metrics now include an extra dimension for the alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

Which issue(s) does the PR fix:
Closes #7676
This PR also includes the test from #14099

Does this PR introduce a user-facing change?

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
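For illustration, the queue metrics listed above would then be expected to appear once per discovered Alertmanager on the `/metrics` endpoint, keyed by an `alertmanager` label. The endpoint URL and values below are hypothetical, not taken from the PR:

```
prometheus_notifications_queue_length{alertmanager="http://am-1:9093/api/v2/alerts"} 0
prometheus_notifications_queue_capacity{alertmanager="http://am-1:9093/api/v2/alerts"} 10000
prometheus_notifications_dropped_total{alertmanager="http://am-1:9093/api/v2/alerts"} 0
```

Dashboards or alerts that previously aggregated a single global queue metric would need to sum or max over the new label.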