Skip to content

slack: retry on 429 and honor Retry-After header#5048

Open
biwwy0 wants to merge 1 commit intoprometheus:mainfrom
biwwy0:slack-retry-after-429
Open

slack: retry on 429 and honor Retry-After header#5048
biwwy0 wants to merge 1 commit intoprometheus:mainfrom
biwwy0:slack-retry-after-429

Conversation

@biwwy0
Copy link
Copy Markdown

@biwwy0 biwwy0 commented Feb 26, 2026

Make HTTP 429 (Too Many Requests) retryable for the Slack notifier by adding http.StatusTooManyRequests to the retrier's RetryCodes, matching the pattern already used by PagerDuty, OpsGenie, Jira, and incident.io.

Additionally, parse the Retry-After response header on 429 responses and sleep for the indicated duration before returning. This addresses the concern raised when a similar change was reverted in #2128: simply retrying 429 without honoring Retry-After causes rapid-fire retries that risk getting apps permanently disabled by Slack. The sleep respects context cancellation.

Fixes #2111
Ref #2205

Pull Request Checklist

Please check all the applicable boxes.

  • Please list all open issue(s) discussed with maintainers related to this change
    • Fixes #
  • Is this a new Receiver integration?
  • Is this a bugfix?
    • I have added tests that can reproduce the bug which pass with this bugfix applied
  • Is this a new feature?
    • I have added tests that test the new feature's functionality
  • Does this change affect performance?
    • I have provided benchmarks comparison that shows performance is improved or is not degraded
      • You can use benchstat to compare benchmarks
    • I have added new benchmarks if required or requested by maintainers
  • Is this a breaking change?
    • My changes do not break the existing cluster messages
    • My changes do not break the existing api
  • I have added/updated the required documentation
  • I have signed-off my commits
  • I will follow best practices for contributing to this project

Which user-facing changes does this PR introduce?

[ENHANCEMENT] slack: retry on 429 honoring Retry-After header

Make HTTP 429 (Too Many Requests) retryable for the Slack notifier by
adding http.StatusTooManyRequests to the retrier's RetryCodes, matching
the pattern already used by PagerDuty, OpsGenie, Jira, and incident.io.

Additionally, parse the Retry-After response header on 429 responses and
sleep for the indicated duration before returning. This addresses the
concern raised when a similar change was reverted in prometheus#2128: simply
retrying 429 without honoring Retry-After causes rapid-fire retries that
risk getting apps permanently disabled by Slack. The sleep respects
context cancellation.

Fixes prometheus#2111
Ref prometheus#2205

Signed-off-by: Anton Ivanov <antonivanov@bitgo.com>
@TheMeier TheMeier self-assigned this Mar 6, 2026
@TheMeier
Copy link
Copy Markdown
Contributor

TheMeier commented Mar 8, 2026

Hi @biwwy0 thank you for your contribution,

I do think this is good however I think we should take the time and fix this for all notifiers. There is multiple notifiers currently that just set http.StatusTooManyRequests as retryable and cause an immediate retry disregarding the Retry-After header.
On top of that one also needs to think about what happens if Retry-After contains a "long" time, If the time spent for the previous request(s) + the retry delay is greater than group_interval the notification will get cancelled. (A new one is triggered at that point)

@TheMeier
Copy link
Copy Markdown
Contributor

After checking again there is one issue here. After waiting the Retry-After the incremental backup from the notify stage gets added on top. I will try to investigate further how we could do proper 429-handling in the retry-stage globally.

TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 11, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 11, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 11, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
@biwwy0
Copy link
Copy Markdown
Author

biwwy0 commented Mar 13, 2026

@TheMeier Please let me know if you want me to change something here? I looked into moving Retry-After support into shared retrier, but it seems that every integration has very different return codes/headers and it won't do any good. So decided to think just for one we use mostly.

@TheMeier
Copy link
Copy Markdown
Contributor

I can't follow the reasoning. I think it should be possible to unify the handling of 429 accross all notifiers, regardless of special handling of other return codes in some notifieres. On top the reason for #2128 also applies for all notifieres that currently just set http.StatusTooManyRequests as retrieable, so one could consider it a bug (or at least bad-behaviour).

TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 14, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 14, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 14, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 14, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 16, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 16, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
TheMeier added a commit to TheMeier/alertmanager that referenced this pull request Mar 18, 2026
add generic handling of 429 in retier. This will allow to remove any
handling of 429 in the individual notifiers and instead rely on the
retier to handle it (TBD in follow-up PRs).

Fixes prometheus#2205
Refs: prometheus#5048

Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slack Integration does not respect rate limiting

2 participants