feat(provider): implement per-alert limits#4819
Conversation
Spaceman1701
left a comment
I really like this change, and we'll definitely use it.
I am curious how it'll perform under extremely heavy load, but I think that's hard to test synthetically.
provider/mem/mem.go (outdated):

```go
	Name: "alertmanager_alerts_limited_total",
	Help: "Total number of alerts that were dropped due to per alert name limit",
},
[]string{"alertname"},
```
I worry that this will be a cardinality risk. Do we really need the label, given that the alertname is already logged for every alert in this condition?
Well, with some luck there would only be very few misbehaving alerts in this condition? This is more a safeguard than something that should regularly happen to a large number of alerts, and you want to know which alert is to blame, don't you?
> Well with some luck
"Hope is not a strategy". 😁
Added alerts-limited-metric feature flag.
That feature flag is a bit confusing.
How about enable-alerts-limited-alertname-label?
I made the feature name and implementation similar to what we have for receivers in the notify package.
We have a metric by default without any labels; enabling the feature alert-names-in-metrics will add the alertname dimension:
```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total 1
```

vs.

```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total{alertname="foo"} 1
```
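The flag behavior described above can be sketched with a small stdlib-only stand-in (no client_golang here; the type and helper names below are illustrative, not the PR's actual API): with the flag off every drop lands on one unlabeled series, with it on each alertname gets its own labeled series, which is exactly the cardinality concern being discussed.

```go
package main

import "fmt"

// limitedCounter is a toy stand-in for a Prometheus counter vector.
// The withAlertname flag mimics the feature flag: it decides whether
// increments are keyed by alertname or all collapse into one series.
type limitedCounter struct {
	withAlertname bool
	series        map[string]int
}

func newLimitedCounter(withAlertname bool) *limitedCounter {
	return &limitedCounter{withAlertname: withAlertname, series: map[string]int{}}
}

func (c *limitedCounter) Inc(alertname string) {
	key := "" // flag off: a single series with an empty label set
	if c.withAlertname {
		key = fmt.Sprintf("alertname=%q", alertname)
	}
	c.series[key]++
}

func main() {
	off := newLimitedCounter(false)
	off.Inc("foo")
	off.Inc("bar")
	// Both increments share one series.
	fmt.Println(len(off.series), off.series[""])

	on := newLimitedCounter(true)
	on.Inc("foo")
	on.Inc("bar")
	// One series per alertname: series count grows with alert cardinality.
	fmt.Println(len(on.series))
}
```

The trade-off is visible directly: the labeled variant tells you which alert is misbehaving, at the cost of one time series per offending alertname.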
Thanks! Last nit, can you add the feature flag to the docs?
Added the docs, thanks for the suggestion!
SuperQ
left a comment
After thinking about this, I'm going to say that we should not have alertname as a metric label.
Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.
It also solves the zero set issues with only one hit.
limit/bucket.go (outdated):

```go
if latest.expired(time.Now()) {
	// Remove all items from the heap and index.
	b.items = b.items[:0]
	clear(b.index)
```
I don't think we need either of these lines: if the bucket itself has been deleted from the map and is unreachable, the GC will not count the pointers in it as live, and whatever they point to will be collected anyway, so we don't need to clear their contents. Should we check, or just remove the two lines?
I agree this isn't necessary - the bucket isn't valid after delete is called
I removed the bucket deletion logic from store, so we will keep empty buckets to reuse them if necessary.
That would still cause the store to potentially grow indefinitely. How about we delete the buckets, but widen the condition, for example to something like if latest.expired(time.Now()+1h): instead of deleting a bucket the moment it expires, delete it only once it expired more than 1h ago, because we can definitely afford a reallocation for something firing so infrequently.
Also, if you just clear the index but leave it in place, it still uses all the space anyway? GC doesn't run that frequently, so maybe just deleting it is fine; it will only happen to alerts that fire infrequently, and for those we can afford the allocation.
> That would still cause the store to potentially grow indefinitely.
Max store bucket usage would be equal to cardinality of alertname for dropped alerts.
We only store fingerprints (which are uint64s) and time.Time values in the buckets, so the footprint is quite small.
Since alert limits are not enabled by default, we don't need to prematurely optimise this IMHO; we can get feedback from users enabling the feature and maybe do further optimisations then.
I'm still open to adding more logic for GC.
I don't think this is a premature optimization - as it's written, there's a memory leak: without restarting, the limiter will allocate but never free buckets. I really don't think it should do that.
What is the motivation for trying to reuse buckets?
Basically avoiding allocations.
But we can instead avoid doing GC within the bucket and just drop it.
I updated the code again, we basically do this:

```go
for alertName, bucket := range a.limits {
	if bucket.IsStale() {
		delete(a.limits, alertName)
	}
}
```
I think it'd be useful to have this so we can know which alerts are being dropped. Maybe it could be behind a flag? Since prometheus keeps the
The use-case we have at Cloudflare is to notify about these dropped alerts caused by a specific alert flooding the pipeline, so the owner of the alert can fix the alert expression or improve it to avoid so many instances of the same alert, etc.
By default no limits are applied and therefore there are no metrics for dropped alerts, but I can put the metric behind a feature flag.
Added
Yep, we do a very similar thing at HRT, but our limiter is a standalone service that ingests alerts first.
Same at Cloudflare.
@ultrotter any more comments? Otherwise I think this is ready to merge.
Add a new limit package with generic bucket implementation. This can be used for example to limit the number of alerts in memory.

Benchmarks:

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/limit
cpu: Apple M3 Pro
BenchmarkBucketUpsert/EmptyBucket-12                        8816954   122.4 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithExpiredItems-12    9861010   123.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithActiveItems-12     8343778   143.6 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/UpdateExistingAlert-12               10107787   118.9 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/MixedWorkload-12                      9436174   126.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_10-12              10255278   115.4 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_50-12              10166518   117.1 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_100-12             10457394   115.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_500-12              9644079   115.2 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_1000-12            10426184   116.6 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertConcurrent-12                          5796210   216.3 ns/op   406 B/op   5 allocs/op
PASS
ok  github.com/prometheus/alertmanager/limit 15.497s
```

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Use the new limit module to add optional per alert-name limits. The metrics for limited alerts can be enabled using the `alerts-limited-metric` feature flag.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Let's merge it, I think we may want to still improve the GC a little. But I
believe it can be done as a separate commit, at this point. We'll chat
about it later today but I think in the meantime we can progress!
Thanks,
Guido
On Sun, 1 Feb 2026, 09:18, Christoph Maser approved this pull request.
Ok, going to merge this so we can get it into 0.31.
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
* [ENHANCEMENT] docs(opsgenie): Fix description of `api_url` field. #4908
* [ENHANCEMENT] docs(slack): Document missing app configs. #4871
* [ENHANCEMENT] docs: Fix `max-silence-size-bytes`. #4805
* [ENHANCEMENT] docs: Update expr for `AlertmanagerClusterFailedToSendAlerts` to exclude value 0. #4872
* [ENHANCEMENT] docs: Use matchers for inhibit rules examples. #4131
* [ENHANCEMENT] docs: add notification integrations. #4901
* [ENHANCEMENT] docs: update `slack_config` attachments documentation links. #4802
* [ENHANCEMENT] docs: update description of filter query params in openapi doc. #4810
* [ENHANCEMENT] provider: Reduce lock contention. #4809
* [FEATURE] slack: Add support for top-level text field in slack notification. #4867
* [FEATURE] smtp: Add support for authsecret from file. #3087
* [FEATURE] smtp: Customize the ssl/tls port support (#4757). #4818
* [FEATURE] smtp: Enhance email notifier configuration validation. #4826
* [FEATURE] telegram: Add `chat_id_file` configuration parameter. #4909
* [FEATURE] telegram: Support global bot token. #4823
* [FEATURE] webhook: Support templating in url fields. #4798
* [FEATURE] wechat: Add config directive to pass api secret via file. #4734
* [FEATURE] provider: Implement per alert limits. #4819
* [BUGFIX] Allow empty `group_by` to override parent route. #4825
* [BUGFIX] Set `spellcheck=false` attribute on silence filter input. #4811
* [BUGFIX] jira: Fix for handling api v3 with ADF. #4756
* [BUGFIX] jira: Prevent hostname corruption in cloud api url replacement. #4892

---------

Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>


Add new limit package with bucket

Add a new limit package with generic bucket implementation.
This can be used for example to limit the number of alerts in memory.
Benchmarks:

Implement per-alert limits

Use the new limit module to add optional per alert-name limits.
The metrics for limited alerts can be enabled using the `alerts-limited-metric` feature flag.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>