feat(provider): implement per-alert limits#4819
Conversation
Spaceman1701
left a comment
I really like this change, and we'll definitely use it.
I am curious how it'll perform under extremely heavy load, but I think that's hard to test synthetically.
provider/mem/mem.go (outdated):

```go
	Name: "alertmanager_alerts_limited_total",
	Help: "Total number of alerts that were dropped due to per alert name limit",
},
[]string{"alertname"},
```
I worry that this will be a cardinality risk. Do we really need the label, given that the alertname is already logged for every alert in this condition?
Well, with some luck there would only be very few misbehaving alerts in this condition? This is more a safeguard than something that should regularly happen to a large number of alerts, and you want to know which alert is to blame, don't you?
> Well with some luck
"Hope is not a strategy". 😁
Added alerts-limited-metric feature flag.
That feature flag is a bit confusing.
How about enable-alerts-limited-alertname-label?
I made the feature name and implementation similar to what we have for receivers in the notify package.
We have a metric by default without any labels; enabling the feature alert-names-in-metrics will add the alertname dimension:
```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total 1
```

vs.

```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total{alertname="foo"} 1
```
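The flag behavior described above can be sketched with a small stdlib-only stand-in (no client_golang here; the type and helper names below are illustrative, not the PR's actual API): with the flag off every drop lands on one unlabeled series, with it on each alertname gets its own labeled series, which is exactly the cardinality concern being discussed.

```go
package main

import "fmt"

// limitedCounter is a toy stand-in for a Prometheus counter vector.
// The withAlertname flag mimics the feature flag: it decides whether
// increments are keyed by alertname or all collapse into one series.
type limitedCounter struct {
	withAlertname bool
	series        map[string]int
}

func newLimitedCounter(withAlertname bool) *limitedCounter {
	return &limitedCounter{withAlertname: withAlertname, series: map[string]int{}}
}

func (c *limitedCounter) Inc(alertname string) {
	key := "" // flag off: a single series with an empty label set
	if c.withAlertname {
		key = fmt.Sprintf("alertname=%q", alertname)
	}
	c.series[key]++
}

func main() {
	off := newLimitedCounter(false)
	off.Inc("foo")
	off.Inc("bar")
	// Both increments share one series.
	fmt.Println(len(off.series), off.series[""])

	on := newLimitedCounter(true)
	on.Inc("foo")
	on.Inc("bar")
	// One series per alertname: series count grows with alert cardinality.
	fmt.Println(len(on.series))
}
```

The trade-off is visible directly: the labeled variant tells you which alert is misbehaving, at the cost of one time series per offending alertname.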
Thanks! Last nit, can you add the feature flag to the docs?
Added the docs, thanks for the suggestion!
SuperQ
left a comment
After thinking about this, I'm going to say that we should not have alertname as a metric label.
Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.
It also solves the zero set issues with only one hit.
limit/bucket.go (outdated):

```go
if latest.expired(time.Now()) {
	// Remove all items from the heap and index.
	b.items = b.items[:0]
	clear(b.index)
```
I don't think we need either of these lines: if the bucket itself has been deleted from the map and is unreachable, the GC will not count the pointers in it as live, and whatever they point to will be collected anyway, so we don't need to clear their contents. Should we check, or just remove the two lines?
I agree this isn't necessary - the bucket isn't valid after delete is called
I removed the bucket deletion logic from store, so we will keep empty buckets to reuse them if necessary.
That would still cause the store to potentially grow indefinitely. How about we delete the buckets, but widen the condition, for example to something like if latest.expired(time.Now()+1h): instead of deleting a bucket the moment it expires, delete it only once it expired more than 1h ago, because we can definitely afford a reallocation for something firing so infrequently.
Also, if you just clear the index but leave it in place, it still uses all the space anyway? GC doesn't run that frequently, so maybe just deleting it is fine; it will only happen to alerts that fire infrequently, and for those we can afford the allocation.
> That would still cause the store to potentially grow indefinitely.
Max store bucket usage would be equal to cardinality of alertname for dropped alerts.
We only store fingerprints (which are uint64s) and time.Time values in the buckets, so the footprint is quite small.
Since alert limits are not enabled by default, we don't need to prematurely optimise this IMHO; we can get feedback from users enabling the feature and maybe do further optimisations then.
I'm still open to adding more logic for GC.
I don't think this is a premature optimization - as it's written, there's a memory leak: without restarting, the limiter will allocate but never free buckets. I really don't think it should do that.
What is the motivation for trying to reuse buckets?
Basically avoiding allocations.
But we can instead avoid doing GC within the bucket and just drop it.
I updated the code again, we basically do this:

```go
for alertName, bucket := range a.limits {
	if bucket.IsStale() {
		delete(a.limits, alertName)
	}
}
```
I think it'd be useful to have this so we can know which alerts are being dropped. Maybe it could be behind a flag? Since prometheus keeps the
The use-case we have at Cloudflare is to notify about these dropped alerts caused by a specific alert flooding the pipeline, so the owner of the alert can fix the alert expression or improve it to avoid so many instances of the same alert, etc.
By default no limits are applied and therefore there are no metrics for dropped alerts, but I can put the metric behind a feature flag.
Added
Yep, we do a very similar thing at HRT, but our limiter is a standalone service that ingests alerts first.
Same at Cloudflare.
@ultrotter any more comments? Otherwise I think this is ready to merge.
Add a new limit package with generic bucket implementation. This can be used for example to limit the number of alerts in memory.

Benchmarks:

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/limit
cpu: Apple M3 Pro
BenchmarkBucketUpsert/EmptyBucket-12                        8816954   122.4 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithExpiredItems-12    9861010   123.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithActiveItems-12     8343778   143.6 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/UpdateExistingAlert-12               10107787   118.9 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsert/MixedWorkload-12                      9436174   126.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_10-12              10255278   115.4 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_50-12              10166518   117.1 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_100-12             10457394   115.0 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_500-12              9644079   115.2 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_1000-12            10426184   116.6 ns/op    56 B/op   2 allocs/op
BenchmarkBucketUpsertConcurrent-12                          5796210   216.3 ns/op   406 B/op   5 allocs/op
PASS
ok  github.com/prometheus/alertmanager/limit 15.497s
```

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Use the new limit module to add optional per alert-name limits. The metrics for limited alerts can be enabled using the `alerts-limited-metric` feature flag.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Let's merge it, I think we may want to still improve the GC a little. But I
believe it can be done as a separate commit, at this point. We'll chat
about it later today but I think in the meantime we can progress!
Thanks,
Guido
On Sun, 1 Feb 2026, 09:18, Christoph Maser approved this pull request.
Ok, going to merge this so we can get it into 0.31.
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
* [ENHANCEMENT] docs(opsgenie): Fix description of `api_url` field. #4908
* [ENHANCEMENT] docs(slack): Document missing app configs. #4871
* [ENHANCEMENT] docs: Fix `max-silence-size-bytes`. #4805
* [ENHANCEMENT] docs: Update expr for `AlertmanagerClusterFailedToSendAlerts` to exclude value 0. #4872
* [ENHANCEMENT] docs: Use matchers for inhibit rules examples. #4131
* [ENHANCEMENT] docs: add notification integrations. #4901
* [ENHANCEMENT] docs: update `slack_config` attachments documentation links. #4802
* [ENHANCEMENT] docs: update description of filter query params in openapi doc. #4810
* [ENHANCEMENT] provider: Reduce lock contention. #4809
* [FEATURE] slack: Add support for top-level text field in slack notification. #4867
* [FEATURE] smtp: Add support for authsecret from file. #3087
* [FEATURE] smtp: Customize the ssl/tls port support (#4757). #4818
* [FEATURE] smtp: Enhance email notifier configuration validation. #4826
* [FEATURE] telegram: Add `chat_id_file` configuration parameter. #4909
* [FEATURE] telegram: Support global bot token. #4823
* [FEATURE] webhook: Support templating in url fields. #4798
* [FEATURE] wechat: Add config directive to pass api secret via file. #4734
* [FEATURE] provider: Implement per alert limits. #4819
* [BUGFIX] Allow empty `group_by` to override parent route. #4825
* [BUGFIX] Set `spellcheck=false` attribute on silence filter input. #4811
* [BUGFIX] jira: Fix for handling api v3 with ADF. #4756
* [BUGFIX] jira: Prevent hostname corruption in cloud api url replacement. #4892

---------

Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>


Add new limit package with bucket

Add a new limit package with generic bucket implementation.
This can be used for example to limit the number of alerts in memory.
Benchmarks:

Implement per-alert limits

Use the new limit module to add optional per alert-name limits.
The metrics for limited alerts can be enabled using the `alerts-limited-metric` feature flag.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>