feat(provider): implement per-alert limits #4819

Merged
SuperQ merged 2 commits into prometheus:main from siavashs:feat/alert-limits
Feb 1, 2026

Conversation


@siavashs siavashs commented Dec 17, 2025

Add new limit package with bucket

Add a new limit package with generic bucket implementation.
This can be used for example to limit the number of alerts in memory.

Benchmarks:

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/limit
cpu: Apple M3 Pro
BenchmarkBucketUpsert/EmptyBucket-12  	 8816954	       122.4 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithExpiredItems-12         	 9861010	       123.0 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsert/AddToFullBucketWithActiveItems-12          	 8343778	       143.6 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsert/UpdateExistingItem-12                     	10107787	       118.9 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsert/MixedWorkload-12                           	 9436174	       126.0 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_10-12                    	10255278	       115.4 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_50-12                    	10166518	       117.1 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_100-12                   	10457394	       115.0 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_500-12                   	 9644079	       115.2 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertScaling/BucketSize_1000-12                  	10426184	       116.6 ns/op	      56 B/op	       2 allocs/op
BenchmarkBucketUpsertConcurrent-12                               	 5796210	       216.3 ns/op	     406 B/op	       5 allocs/op
PASS
ok  	github.com/prometheus/alertmanager/limit	15.497s
```

Implement per-alert limits

Use the new limit module to add optional per-alert-name limits.
The metrics for limited alerts can be enabled using the `alerts-limited-metric` feature flag.


Signed-off-by: Siavash Safi <siavash@cloudflare.com>

@siavashs siavashs force-pushed the feat/alert-limits branch 6 times, most recently from 349d097 to 1b4bf64 on December 18, 2025 11:14
@siavashs siavashs requested a review from SuperQ December 18, 2025 12:06
@Spaceman1701 (Contributor) left a comment


I really like this change, and we'll definitely use it.

I am curious how it'll perform under extremely heavy load, but I think that's hard to test synthetically.

@siavashs siavashs force-pushed the feat/alert-limits branch 3 times, most recently from eaa22e8 to ec37156 on January 14, 2026 11:04
```go
		Name: "alertmanager_alerts_limited_total",
		Help: "Total number of alerts that were dropped due to per alert name limit",
	},
	[]string{"alertname"},
```
Member

I worry that this will be a cardinality risk. Do we really need the label, given the alertname is now logged for every alert in this condition?

Contributor

Well, with some luck there would be only a few misbehaving alerts in this condition. This is more of a safeguard, and you want to know which alert is to blame; it's not something that should regularly happen for a large number of alerts, is it?

@siavashs (Contributor Author) commented Jan 28, 2026

Exactly, here is the cardinality of a similar alert at Cloudflare over the past year:

[screenshot: cardinality graph]

Note that, depending on the configuration, setting a limit too low would drop many alerts, and therefore cardinality can increase.

Member

Well with some luck

"Hope is not a strategy". 😁

Contributor Author

Added alerts-limited-metric feature flag.

Member

That feature flag is a bit confusing.

How about enable-alerts-limited-alertname-label?

@siavashs (Contributor Author) commented Jan 30, 2026

I made the feature name and implementation similar to what we have for receiver in the notify package.
We have a metric by default without any labels; enabling the `alert-names-in-metrics` feature will add the alertname dimension:

```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total 1
```

vs.

```
# HELP alertmanager_alerts_limited_total Total number of alerts that were dropped due to per alert name limit
# TYPE alertmanager_alerts_limited_total counter
alertmanager_alerts_limited_total{alertname="foo"} 1
```
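The difference between the two exposition formats above can be sketched with a toy counter. This stand-in deliberately avoids `prometheus/client_golang` and only illustrates how the feature flag trades a single safe series against per-alertname cardinality; all names here are hypothetical:

```go
package main

import "fmt"

// limitedCounter sketches the two shapes of alertmanager_alerts_limited_total:
// one unlabeled counter by default, or one series per alertname when the
// feature flag is on. A real implementation would use prometheus/client_golang;
// this stand-in only illustrates the cardinality trade-off.
type limitedCounter struct {
	perAlertname bool // the feature-flag toggle
	total        float64
	byName       map[string]float64
}

func newLimitedCounter(perAlertname bool) *limitedCounter {
	return &limitedCounter{perAlertname: perAlertname, byName: map[string]float64{}}
}

func (c *limitedCounter) Inc(alertname string) {
	if c.perAlertname {
		c.byName[alertname]++ // one series per alertname: cardinality risk
		return
	}
	c.total++ // single series: safe default
}

func main() {
	off := newLimitedCounter(false)
	off.Inc("foo")
	off.Inc("bar")
	fmt.Println(off.total) // 2: both drops land on the one series

	on := newLimitedCounter(true)
	on.Inc("foo")
	fmt.Println(on.byName["foo"]) // 1: a dedicated series for "foo"
}
```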

Member

Thanks! Last nit, can you add the feature flag to the docs?

Contributor Author

Added the docs, thanks for the suggestion!

@siavashs siavashs force-pushed the feat/alert-limits branch 2 times, most recently from bcfb080 to c6a140b on January 28, 2026 11:41
@SuperQ (Member) left a comment

After thinking about this, I'm going to say that we should not have alertname as a metric label.

Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.

It also solves the zero set issues with only one hit.

limit/bucket.go (Outdated)

```go
if latest.expired(time.Now()) {
	// Remove all items from the heap and index.
	b.items = b.items[:0]
	clear(b.index)
```
Contributor

I don't think we need either of these lines: if the bucket itself has been deleted from the map and is unreachable, the GC will not count the pointers in it as active, and whatever they point to will be collected anyway, so we don't need to clear their contents. Should we check, or remove the two lines?

Contributor

I agree this isn't necessary - the bucket isn't valid after delete is called

Contributor Author

I removed the bucket deletion logic from store, so we will keep empty buckets to reuse them if necessary.

@ultrotter (Contributor) commented Jan 28, 2026

That would still cause the store to potentially grow indefinitely. How about we delete the buckets, but loosen the condition, for example to `latest.expired(time.Now()+1h)`: instead of deleting when it expired just now, delete the bucket altogether only when it expired more than 1h or so ago, because we can definitely afford a reallocation for something firing so infrequently?

Contributor

Also, if you just clear the index but leave it there, it still uses all the space anyway. The GC doesn't run that frequently, so maybe just deleting it is fine; it will only happen for alerts that are quite infrequent, and then we can afford the allocation...

Contributor Author

That would still cause the store to potentially grow indefinitely.

Max store bucket usage would equal the cardinality of alertname for dropped alerts.
We only store fingerprints (uint64s) and time.Time values in the buckets, so the footprint is quite small.
Since alert limits are not enabled by default, we don't need to prematurely optimise this IMHO; we can get feedback from users enabling the feature and maybe optimise further then.

I'm still open to adding more logic for GC.

Contributor

I don't think this is a premature optimization - as it's written, there's a memory leak. Without restarting, the limiter will allocate but never free buckets. I really don't think it should do that.

What is the motivation for trying to reuse buckets?

Contributor Author

Basically avoiding allocations.
But we can instead avoid doing GC within the bucket and just drop it.

Contributor Author

I updated the code again; we basically do this:

```go
for alertName, bucket := range a.limits {
	if bucket.IsStale() {
		delete(a.limits, alertName)
	}
}
```

@Spaceman1701 (Contributor)

After thinking about this, I'm going to say that we should not have alertname as a metric label.

Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.

It also solves the zero set issues with only one hit.

I think it'd be useful to have this so we can know which alerts are being dropped. Maybe it could be behind a flag?

Since Prometheus keeps the ALERTS metric, there's already at least one time series per alert in the TSDB.

@siavashs (Contributor Author) commented Jan 28, 2026

After thinking about this, I'm going to say that we should not have alertname as a metric label.

Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.

It also solves the zero set issues with only one hit.

The use-case we have at Cloudflare is to notify on these dropped alerts, caused by a specific alert flooding the pipeline, so the owner of the alert can fix or improve the alert expression to avoid so many instances of the same alert, etc.

@siavashs (Contributor Author)

After thinking about this, I'm going to say that we should not have alertname as a metric label.
Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.
It also solves the zero set issues with only one hit.

I think it'd be useful to have this so we can know which alerts are being dropped. Maybe it could be behind a flag?

Since prometheus keeps the ALERTS metric, there's already at least one timeseries per alert in the TSDB.

By default no limits are applied and therefore there are no metrics for dropped alerts, but I can put the metric behind a feature flag.

@siavashs (Contributor Author)

Added alerts-limited-metric feature flag.

@siavashs siavashs requested review from SuperQ and ultrotter January 28, 2026 13:01
@Spaceman1701 (Contributor)

After thinking about this, I'm going to say that we should not have alertname as a metric label.
Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.
It also solves the zero set issues with only one hit.

The use-case we have at Cloudflare is to notify about these dropped alerts caused by specific alert flooding the pipeline so the owner of the alert can fix the alert expression or improve it to avoid so many instances of the same alert, etc.

Yep, we do a very similar thing at HRT, but our limiter is a standalone service that ingests alerts first.

@siavashs (Contributor Author)

After thinking about this, I'm going to say that we should not have alertname as a metric label.
Please remove this, it's just too risky for end users. We have logs for debugging real prod issues.
It also solves the zero set issues with only one hit.

The use-case we have at Cloudflare is to notify about these dropped alerts caused by specific alert flooding the pipeline so the owner of the alert can fix the alert expression or improve it to avoid so many instances of the same alert, etc.

Yep, we do a very similar thing at HRT, but our limiter is a standalone service that ingests alerts first.

Same at Cloudflare.

@SuperQ (Member) commented Jan 30, 2026

@ultrotter any more comments? Otherwise I think this is ready to merge.

Add a new limit package with generic bucket implementation.
This can be used for example to limit the number of alerts in memory.

Benchmarks: see the PR description above.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Use the new limit module to add optional per alert-name limits.
The metrics for limited alerts can be enabled using
`alerts-limited-metric` feature flag.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@ultrotter (Contributor) commented Feb 1, 2026 via email

@SuperQ (Member) commented Feb 1, 2026

Ok, going to merge this so we can get it into 0.31.

@SuperQ SuperQ merged commit c90d870 into prometheus:main Feb 1, 2026
7 checks passed
SoloJacobs added a commit to SoloJacobs/alertmanager that referenced this pull request Feb 1, 2026
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
SoloJacobs added a commit to SoloJacobs/alertmanager that referenced this pull request Feb 1, 2026
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
@siavashs siavashs deleted the feat/alert-limits branch February 2, 2026 00:11
SuperQ added a commit that referenced this pull request Feb 2, 2026
* [ENHANCEMENT] docs(opsgenie): Fix description of `api_url` field. #4908
* [ENHANCEMENT] docs(slack): Document missing app configs. #4871
* [ENHANCEMENT] docs: Fix `max-silence-size-bytes`. #4805
* [ENHANCEMENT] docs: Update expr for `AlertmanagerClusterFailedToSendAlerts` to exclude value 0. #4872
* [ENHANCEMENT] docs: Use matchers for inhibit rules examples. #4131
* [ENHANCEMENT] docs: add notification integrations. #4901
* [ENHANCEMENT] docs: update `slack_config` attachments documentation links. #4802
* [ENHANCEMENT] docs: update description of filter query params in openapi doc. #4810
* [ENHANCEMENT] provider: Reduce lock contention. #4809
* [FEATURE] slack: Add support for top-level text field in slack notification. #4867
* [FEATURE] smtp: Add support for authsecret from file. #3087
* [FEATURE] smtp: Customize the ssl/tls port support (#4757). #4818
* [FEATURE] smtp: Enhance email notifier configuration validation. #4826
* [FEATURE] telegram: Add `chat_id_file` configuration parameter. #4909
* [FEATURE] telegram: Support global bot token. #4823
* [FEATURE] webhook: Support templating in url fields. #4798
* [FEATURE] wechat: Add config directive to pass api secret via file. #4734
* [FEATURE] provider: Implement per alert limits. #4819
* [BUGFIX] Allow empty `group_by` to override parent route. #4825
* [BUGFIX] Set `spellcheck=false` attribute on silence filter input. #4811
* [BUGFIX] jira: Fix for handling api v3 with ADF. #4756
* [BUGFIX] jira: Prevent hostname corruption in cloud api url replacement. #4892
---------

Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>