Optimize the new inhibitor implementation for ~2.5x performance improvement by Spaceman1701 · Pull Request #4668 · prometheus/alertmanager

Spaceman1701 · 2025-10-30T14:49:38Z

#4607 massively improved inhibitor performance, but had some regressions compared to #4134. This PR is the result of an optimization pass where I tried to get back some of the performance.

These changes were mostly guided by profiling and come in two categories:

1. Remove repeated calls to `With` on Prometheus metric vectors. Each call allocates a map, hashes every key, and then makes a somewhat expensive call to `With` itself. I moved these lookups into the one-time construction of the inhibitor metrics.
2. Reduce calls to `time.Now()`. First, use the time taken at the beginning of `Mutes` as `now` for all inhibit rule evaluations. This reduces calls to `time.Now()` and makes inhibitor behavior a little more consistent (since alerts can no longer resolve mid-execution). Second, call `time.Now()` once for the two metrics that track inhibitor performance when an inhibition is found, because each call to `time.Since` calls `time.Now`.
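
The two categories above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code from this PR: `counter`, `counterVec`, `rule`, `activeUntil`, `newRule`, `mutesAny`, and `runDemo` are hypothetical names, and `counterVec` only stands in for a Prometheus metric vector so the example runs without dependencies.

```go
package main

import (
	"fmt"
	"time"
)

// counter and counterVec are hypothetical stand-ins for a labeled Prometheus
// metric vector; they are not the client library's types.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

type counterVec struct {
	children map[string]*counter
	lookups  int // number of With calls, to make the saving visible
}

// With resolves the child for one label value. The real call is costlier
// (label map allocation plus hashing), which is why it is hoisted below.
func (v *counterVec) With(rule string) *counter {
	v.lookups++
	c, ok := v.children[rule]
	if !ok {
		c = &counter{}
		v.children[rule] = c
	}
	return c
}

// rule keeps the child resolved once at construction, not per evaluation.
type rule struct {
	muted       *counter
	activeUntil time.Time // hypothetical: when the inhibiting alert ends
}

func newRule(vec *counterVec, name string, activeUntil time.Time) *rule {
	return &rule{muted: vec.With(name), activeUntil: activeUntil}
}

// mutesAny reads the clock once at the start and reuses that value as "now"
// for every rule. On a match, a single end timestamp feeds both durations
// instead of two time.Since calls.
func mutesAny(rules []*rule) (matched bool, ruleDur, totalDur time.Duration) {
	start := time.Now() // doubles as "now" for every rule evaluation
	for _, r := range rules {
		ruleStart := time.Now() // hypothetical per-rule timer
		if start.Before(r.activeUntil) {
			r.muted.Inc()
			end := time.Now() // one clock read feeds both durations
			return true, end.Sub(ruleStart), end.Sub(start)
		}
	}
	return false, 0, 0
}

// runDemo builds three rules (only the last one is active) and evaluates
// them repeatedly, returning the With call count and the match count.
func runDemo(evaluations int) (lookups, muted int) {
	vec := &counterVec{children: map[string]*counter{}}
	now := time.Now()
	rules := []*rule{
		newRule(vec, "expired-a", now.Add(-time.Hour)),
		newRule(vec, "expired-b", now.Add(-time.Hour)),
		newRule(vec, "active", now.Add(time.Hour)),
	}
	for i := 0; i < evaluations; i++ {
		mutesAny(rules)
	}
	return vec.lookups, rules[2].muted.n
}

func main() {
	lookups, muted := runDemo(1000)
	fmt.Printf("With calls: %d, muted increments: %d\n", lookups, muted)
}
```

In this sketch the number of `With` calls depends only on how many rules are constructed, not on how often they are evaluated, which is the property the first optimization is after.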

Overall, this leads to reasonable performance gains:

`main` vs. this PR:

                                                     │ main-with-index.txt │ main-with-index-fix-metrics-and-time.txt │
                                                     │       sec/op        │      sec/op        vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                3.726µ ±  35%        1.806µ ± 10%  -51.52% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4              3.359µ ±  16%        1.894µ ± 25%  -43.63% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4             3.714µ ±  31%        1.922µ ± 13%  -48.24% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4            3.696µ ±  22%        2.071µ ± 20%  -43.95% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4           8.774µ ± 154%        2.856µ ± 26%  -67.45% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4              3.954µ ±  20%        1.877µ ± 43%  -52.54% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4             3.607µ ±   7%        2.134µ ± 18%  -40.84% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4            3.716µ ±  13%        1.912µ ± 36%  -48.55% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4           3.418µ ±  13%        2.047µ ± 16%  -40.11% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4         3.669µ ±  25%        1.860µ ±  6%  -49.30% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4              16.239µ ±   6%        4.400µ ± 10%  -72.91% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4             147.14µ ±  18%        27.75µ ±  6%  -81.14% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4            1449.5µ ±   7%        278.1µ ±  7%  -80.81% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4           16.880m ±  43%        2.888m ± 10%  -82.89% (p=0.002 n=6)
geomean                                                      15.76µ               6.152µ        -60.98%

Optimization (1) alone vs. optimizations (1) and (2) together (metrics fix only vs. metrics fix plus reduced `time.Now()` calls):

                                                     │ main-with-index-fix-metrics.txt │ main-with-index-fix-metrics-and-time.txt │
                                                     │             sec/op              │      sec/op        vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                             2.736µ ± 40%        1.806µ ± 10%  -34.00% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                           2.429µ ± 23%        1.894µ ± 25%  -22.05% (p=0.015 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                          2.693µ ± 16%        1.922µ ± 13%  -28.63% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                         2.702µ ± 43%        2.071µ ± 20%  -23.32% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                        4.248µ ± 26%        2.856µ ± 26%  -32.79% (p=0.026 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                           2.742µ ± 12%        1.877µ ± 43%  -31.55% (p=0.004 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                          2.528µ ±  5%        2.134µ ± 18%  -15.59% (p=0.009 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                         2.091µ ± 29%        1.912µ ± 36%        ~ (p=0.180 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                        2.206µ ± 15%        2.047µ ± 16%        ~ (p=0.589 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                      1.985µ ±  8%        1.860µ ±  6%        ~ (p=0.065 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                            4.982µ ± 16%        4.400µ ± 10%  -11.69% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                           34.80µ ±  5%        27.75µ ±  6%  -20.25% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                          335.6µ ±  5%        278.1µ ±  7%  -17.13% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                         3.701m ±  8%        2.888m ± 10%  -21.95% (p=0.002 n=6)
geomean                                                                   7.747µ              6.152µ        -20.60%

This PR (base) vs. #4134:

                                                                 │ main-with-index-fix-metrics-and-time.txt │    icache-with-first-result.txt     │
                                                                 │                  sec/op                  │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                                                  1.806µ ± 10%   1.381µ ± 19%  -23.53% (p=0.009 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                                                1.894µ ± 25%   1.448µ ± 17%  -23.53% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                                               1.922µ ± 13%   1.431µ ± 10%  -25.57% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                                              2.071µ ± 20%   1.498µ ± 25%  -27.69% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                                             2.856µ ± 26%   1.484µ ± 56%  -48.05% (p=0.004 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                                                1.877µ ± 43%   1.498µ ± 13%  -20.20% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                                               2.134µ ± 18%   1.552µ ± 35%  -27.27% (p=0.009 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                                              1.912µ ± 36%   1.468µ ±  8%  -23.23% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                                             2.047µ ± 16%   1.506µ ± 28%  -26.43% (p=0.015 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                                           1.860µ ±  6%   1.329µ ± 22%  -28.55% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                                                 4.400µ ± 10%   1.664µ ±  6%  -62.19% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                                               27.751µ ±  6%   4.798µ ± 17%  -82.71% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                                              278.10µ ±  7%   31.36µ ± 17%  -88.72% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                                             2888.3µ ± 10%   302.8µ ± 33%  -89.52% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                                                 1.793µ ±  8%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4                                                1.902µ ± 11%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4                                               1.781µ ± 18%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4                                              1.881µ ± 42%
geomean                                                                                        6.152µ         2.635µ        -52.52%

In summary: this PR is a fair bit faster than what's on main, but it's still ~3x slower than #4134 in the benchmark suite. This is mostly caused by a much higher per-inhibition-rule cost. I don't think we can significantly reduce that without removing some of the new metrics (which probably isn't worth it), but there might be something more I can find.
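
As a rough cross-check of this summary, the geomean ratios and an approximate per-rule cost can be derived from the benchstat numbers in the tables above. The sketch below is an illustration only: `ratio` and `perRuleCost` are hypothetical helpers, and estimating per-rule cost as the slope between the 100-rule and 1000-rule "last rule matches" rows is an assumption made here, not part of the original analysis.

```go
package main

import "fmt"

// ratio reports how many times larger slow is than fast.
func ratio(slow, fast float64) float64 { return slow / fast }

// perRuleCost estimates the marginal cost (µs) of one additional inhibition
// rule from two "last rule matches" timings taken at different rule counts.
func perRuleCost(rulesA int, usA float64, rulesB int, usB float64) float64 {
	return (usB - usA) / float64(rulesB-rulesA)
}

func main() {
	// Geomean sec/op in µs, copied from the benchstat tables above.
	const mainGeo, prGeo, oldGeo = 15.76, 6.152, 2.635
	fmt.Printf("main / this PR:  %.2fx\n", ratio(mainGeo, prGeo))
	fmt.Printf("this PR / #4134: %.2fx\n", ratio(prGeo, oldGeo))

	// "last rule matches" timings in µs at 100 and 1000 rules.
	pr := perRuleCost(100, 27.75, 1000, 278.1)
	old := perRuleCost(100, 4.798, 1000, 31.36)
	fmt.Printf("per-rule cost: this PR %.3fµs, #4134 %.3fµs (%.1fx)\n",
		pr, old, ratio(pr, old))
}
```

With these inputs the geomean gap to #4134 comes out a little above 2x, while the estimated per-rule gap is closer to 9x, which is consistent with the per-rule cost dominating the "last rule matches" cases.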

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

Spaceman1701 · 2025-10-30T14:53:01Z

When you get a minute this might be interesting to you @siavashs and @SuperQ

siavashs

Minor issue, otherwise LGTM 👍

SuperQ

LGTM

Spaceman1701 added 2 commits October 30, 2025 10:50

remove metric selection from inhibitor hot-path

355620c

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

add optimizations to improve inhibitor performance

f487c04

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

Spaceman1701 force-pushed the feature/inhibitor-optimization branch from bdfb74b to f487c04 Compare October 30, 2025 14:51

siavashs requested changes Oct 30, 2025

siavashs reviewed Oct 30, 2025

fix duration calculation

b7e776d

Co-authored-by: Siavash Safi <git@hosted.run>
Signed-off-by: Ethan Hunter <fc.spaceman@gmail.com>

siavashs approved these changes Oct 30, 2025

SuperQ approved these changes Nov 1, 2025

SuperQ merged commit 352d49c into prometheus:main Nov 1, 2025
7 checks passed

Spaceman1701 mentioned this pull request Nov 3, 2025

Signficantly improve inhibitor performance via new cache datastructure #4134

Closed

SoloJacobs mentioned this pull request Nov 24, 2025

Release v0.30.0-rc.0 #4770

Merged

SuperQ merged 3 commits into prometheus:main from Spaceman1701:feature/inhibitor-optimization