feat(notify): generic handling of 429 in retrier#5088
feat(notify): generic handling of 429 in retrier#5088TheMeier wants to merge 1 commit intoprometheus:mainfrom
Conversation
90af1ab to
b7912fc
Compare
4190a78 to
8f9c08d
Compare
|
There seem to be a few bugs captured by this analysis https://app.devin.ai/review/prometheus/alertmanager/pull/5088 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRetrier.Check now accepts a full *http.Response and extracts Retry-After (seconds or HTTP-date). ErrorWithReason gained a RetryAfter field and TooManyRequestsReason. RetryStage uses an attempt timer, honors Retry-After, and recalculates backoff. Many notifiers and tests were updated to pass synthetic *http.Response objects. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant RetryStage as RetryStage
participant Notifier as Notifier
participant Service as Service
participant Retrier as Retrier
participant Timer as Timer
Client->>RetryStage: send notification request
activate RetryStage
RetryStage->>Notifier: attempt send
activate Notifier
Notifier->>Service: HTTP POST
activate Service
Service-->>Notifier: HTTP response (e.g. 429, Retry-After: 30)
deactivate Service
Notifier->>Retrier: Check(resp)
activate Retrier
Retrier->>Retrier: parseRetryAfter(resp.Header)
Retrier-->>Notifier: shouldRetry=true, ErrorWithReason{TooManyRequests, RetryAfter:30s}
deactivate Retrier
Notifier-->>RetryStage: (shouldRetry, errWithReason)
deactivate Notifier
alt Retry-After present
RetryStage->>Timer: set timer for RetryAfter (30s)
activate Timer
Timer-->>RetryStage: timer fires
deactivate Timer
else No Retry-After
RetryStage->>Timer: set timer using exponential backoff
end
RetryStage->>Client: final result (success or final error)
deactivate RetryStage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (3)
notify/incidentio/incidentio_test.go (1)
170-191: Add aRetry-After429 scenario to lock in the new behavior.The updated 429 case asserts message text, but it still doesn’t validate header-driven retry timing. A dedicated test with
Retry-Afterwould prevent regressions in the new generic rate-limit handling.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/incidentio/incidentio_test.go` around lines 170 - 191, Add a dedicated 429-with-Retry-After test to TestIncidentIORetryScenarios: extend the testCases to include a case named like "rate limit with Retry-After" that uses statusCode http.StatusTooManyRequests, provides a Retry-After header (e.g. "Retry-After":"5"), sets expectRetry to true, and sets expectErrorMsgContains to include both the 429 text and the Retry-After value so the test harness/assertions validate header-driven retry timing; make sure the test runner uses the response headers when constructing the mock response so the new case exercises the header-based retry logic.notify/test/test.go (1)
197-209: PopulateStatusandHeaderon synthetic responses.The retrier now consumes the full
*http.Response. LeavingHeadernil means a follow-upresp.Header.Set("Retry-After", ...)test will panic, and leavingStatusempty makes status-text assertions diverge from real responses.Proposed change
func HTTPResponseForStatusCode(statusCode int) *http.Response { + status := fmt.Sprintf("%d %s", statusCode, http.StatusText(statusCode)) if statusCode == http.StatusBadRequest { body := `{"status":"invalid event"}` return &http.Response{ + Status: status, StatusCode: statusCode, + Header: make(http.Header), Body: io.NopCloser(bytes.NewBufferString(body)), } } return &http.Response{ + Status: status, StatusCode: statusCode, + Header: make(http.Header), Body: io.NopCloser(bytes.NewBufferString(fmt.Sprintf("response with status %d", statusCode))), } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/test/test.go` around lines 197 - 209, The synthetic response generator HTTPResponseForStatusCode should populate the Response.Status field and ensure Response.Header is non-nil so callers can safely read/set headers; update HTTPResponseForStatusCode to set resp.Header = make(http.Header) and set resp.Status = fmt.Sprintf("%d %s", statusCode, http.StatusText(statusCode)) (and preserve the Body and StatusCode as before) so tests that call resp.Header.Set("Retry-After", ...) or assert status text behave like real http.Responses.notify/util_test.go (1)
179-186: Assert the 429 metadata, not just the message.The new behavior in
notify/util.gois the structured*ErrorWithReasonplus optionalRetryAfter, but Lines 226-232 only checkretryandEqualError.Retrier.Checkcould regress to a plainerroror drop header parsing and this test would still pass. Please add 429 cases withRetry-Afterheaders (delta-seconds and HTTP-date) and assert them witherrors.As.Also applies to: 226-232
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/util_test.go` around lines 179 - 186, The test currently only asserts a 429 error string—update the test cases around the Retrier/response checks to include 429 responses that contain Retry-After headers (both delta-seconds and HTTP-date forms) and assert the typed metadata instead of just the message: after calling Retrier.Check use errors.As to cast the returned error to *ErrorWithReason (or the concrete error type defined in notify/util.go), verify the Reason/Status indicates 429/Too Many Requests, and assert the parsed RetryAfter field matches the expected time/duration for both header formats; apply the same change to the other 429 assertion block mentioned (lines ~226-232) so the test fails if header parsing or the typed error is dropped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@notify/notify.go`:
- Around line 915-924: The current logic reassigns nextAttemptC = tick.C after
using retryAfterTimer, which can immediately receive a stale blocked tick from
the original ticker and skip the intended backoff; fix by stopping and
recreating the backoff ticker (or otherwise resetting the backoff state) before
reassigning nextAttemptC so a fresh ticker channel is used (i.e., call
tick.Stop(), create a new ticker instance and set nextAttemptC = newTick.C), and
apply the same change to the other occurrence referenced (around the other block
at 965-980); ensure retryAfterTimer is stopped and nil'd where appropriate to
avoid races.
In `@notify/pagerduty/pagerduty.go`:
- Around line 296-299: The wrapper call discards RetryAfter from the error
returned by n.retrier.Check; change the wrapping to preserve RetryAfter by
detecting when err is an *notify.ErrorWithReason and updating/returning it
instead of creating a new one. Concretely, in the block handling
n.retrier.Check(resp) (where you currently call
notify.NewErrorWithReason(notify.GetFailureReasonFromStatusCode(resp.StatusCode),
err)), perform a type assertion like ew, ok := err.(*notify.ErrorWithReason); if
ok set ew.Reason = notify.GetFailureReasonFromStatusCode(resp.StatusCode) (keep
ew.RetryAfter as-is) and return retry, ew; otherwise fall back to
notify.NewErrorWithReason(...) so RetryStage.exec sees the original RetryAfter
value.
In `@notify/sns/sns.go`:
- Around line 84-88: The synthetic http.Response created before calling
n.retrier.Check is losing upstream headers (so parseRetryAfter reads nil and
drops Retry-After); when constructing the synthetic response (the resp variable
built from respErr.Response and apiErr.ErrorMessage()) copy or clone the
original respErr.Response.Header into resp.Header (e.g., resp.Header =
respErr.Response.Header.Clone() or copy values) so the retrier.Check /
parseRetryAfter sees the original Retry-After and other headers; apply the same
fix to both places where a synthetic http.Response is constructed (the resp
variable near respErr.Response.StatusCode and the similar block at lines
~115-119).
In `@notify/util.go`:
- Around line 238-258: parseRetryAfter currently returns 0 for both missing
headers and explicit "Retry-After: 0" (or already-elapsed dates), which loses
header presence; change parseRetryAfter to return (time.Duration, bool) where
the bool indicates header present, and update callers (notably the 429 handling
that sets e.RetryAfter) to use the presence flag to set e.RetryAfter even when
duration==0; reference parseRetryAfter and the 429 handler that assigns
e.RetryAfter (and ErrorWithReason/e.RetryAfter) and ensure the semantics are:
bool==false => treat as header absent, bool==true with duration==0 => explicit
immediate retry (do not fall back to exponential backoff).
In `@notify/webhook/webhook_test.go`:
- Line 77: The test is constructing an HTTP response with
test.HTTPResponseForStatusCode(tc.status), which ignores the table-provided
tc.body so the error-detail parsing path isn't exercised; change the test to
call the helper that lets you provide both status and body (or construct an
http.Response with tc.status and tc.body) and pass that into
notifier.retrier.Check so the table-driven tc.body values are actually used when
asserting error details for functions like notifier.retrier.Check in
webhook_test.go.
In `@notify/webhook/webhook.go`:
- Around line 149-152: Retrier.Check(resp) may return an *notify.ErrorWithReason
that includes RetryAfter for 429s, but webhook.go currently wraps it with
notify.NewErrorWithReason(GetFailureReasonFromStatusCode(resp.StatusCode), err)
which resets RetryAfter; fix by detecting when the returned err is an
*notify.ErrorWithReason with Reason == notify.TooManyRequestsReason and
RetryAfter > 0 and either (a) skip the second wrap and return the original err
as-is, or (b) after creating the new error via notify.NewErrorWithReason, copy
err.(*notify.ErrorWithReason).RetryAfter into the new error's RetryAfter field
so the Retry-After metadata is preserved (update the code around the call to
Retrier.Check / notify.NewErrorWithReason accordingly).
---
Nitpick comments:
In `@notify/incidentio/incidentio_test.go`:
- Around line 170-191: Add a dedicated 429-with-Retry-After test to
TestIncidentIORetryScenarios: extend the testCases to include a case named like
"rate limit with Retry-After" that uses statusCode http.StatusTooManyRequests,
provides a Retry-After header (e.g. "Retry-After":"5"), sets expectRetry to
true, and sets expectErrorMsgContains to include both the 429 text and the
Retry-After value so the test harness/assertions validate header-driven retry
timing; make sure the test runner uses the response headers when constructing
the mock response so the new case exercises the header-based retry logic.
In `@notify/test/test.go`:
- Around line 197-209: The synthetic response generator
HTTPResponseForStatusCode should populate the Response.Status field and ensure
Response.Header is non-nil so callers can safely read/set headers; update
HTTPResponseForStatusCode to set resp.Header = make(http.Header) and set
resp.Status = fmt.Sprintf("%d %s", statusCode, http.StatusText(statusCode)) (and
preserve the Body and StatusCode as before) so tests that call
resp.Header.Set("Retry-After", ...) or assert status text behave like real
http.Responses.
In `@notify/util_test.go`:
- Around line 179-186: The test currently only asserts a 429 error string—update
the test cases around the Retrier/response checks to include 429 responses that
contain Retry-After headers (both delta-seconds and HTTP-date forms) and assert
the typed metadata instead of just the message: after calling Retrier.Check use
errors.As to cast the returned error to *ErrorWithReason (or the concrete error
type defined in notify/util.go), verify the Reason/Status indicates 429/Too Many
Requests, and assert the parsed RetryAfter field matches the expected
time/duration for both header formats; apply the same change to the other 429
assertion block mentioned (lines ~226-232) so the test fails if header parsing
or the typed error is dropped.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c0e658e-a0f9-4f38-a5a6-d3fa11e7f07e
📒 Files selected for processing (36)
notify/discord/discord.gonotify/discord/discord_test.gonotify/incidentio/incidentio.gonotify/incidentio/incidentio_test.gonotify/jira/jira.gonotify/jira/jira_test.gonotify/mattermost/mattermost.gonotify/mattermost/mattermost_test.gonotify/msteams/msteams.gonotify/msteams/msteams_test.gonotify/msteamsv2/msteamsv2.gonotify/msteamsv2/msteamsv2_test.gonotify/notify.gonotify/notify_test.gonotify/opsgenie/opsgenie.gonotify/opsgenie/opsgenie_test.gonotify/pagerduty/pagerduty.gonotify/pagerduty/pagerduty_test.gonotify/pushover/pushover.gonotify/pushover/pushover_test.gonotify/rocketchat/rocketchat.gonotify/rocketchat/rocketchat_test.gonotify/slack/slack.gonotify/slack/slack_test.gonotify/sns/sns.gonotify/telegram/telegram_test.gonotify/test/test.gonotify/util.gonotify/util_test.gonotify/victorops/victorops.gonotify/victorops/victorops_test.gonotify/webex/webex.gonotify/webex/webex_test.gonotify/webhook/webhook.gonotify/webhook/webhook_test.gonotify/wechat/wechat.go
💤 Files with no reviewable changes (1)
- notify/wechat/wechat.go
292eada to
ddca42a
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
notify/wechat/wechat.go (1)
109-128:⚠️ Potential issue | 🟠 MajorApply the retrier to the
gettokenrequest as well.The new generic 429 handling only covers
message/send. Here,corpsecret/corpidtemplating errors are not checked before the request, and thegettokenresponse is decoded unconditionally. A 429/5xx during token refresh will therefore surface asinvalid APISecret for CorpIDor a JSON decode error instead of a retryableErrorWithReason.🛠️ Proposed fix
parameters.Add("corpsecret", tmpl(apiSecret)) parameters.Add("corpid", tmpl(string(n.conf.CorpID))) + if err != nil { + return false, fmt.Errorf("templating error: %w", err) + } u := n.conf.APIURL.Copy() u.Path += "gettoken" u.RawQuery = parameters.Encode() @@ resp, err := notify.Get(ctx, n.client, u.String()) if err != nil { return true, notify.RedactURL(err) } defer notify.Drain(resp) + + shouldRetry, errWithReason := n.retrier.Check(resp) + if errWithReason != nil { + return shouldRetry, errWithReason + } var wechatToken token if err := json.NewDecoder(resp.Body).Decode(&wechatToken); err != nil { return false, err }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/wechat/wechat.go` around lines 109 - 128, The gettoken HTTP call currently uses notify.Get directly and decodes the response unconditionally, so 429/5xx or templating errors become misleading decode/invalid APISecret errors; change the gettoken request to use the same retrier logic used for message/send (wrap the notify.Get call with the retrier retry/Do pattern), validate tmpl(...) results before adding corpsecret/corpid to parameters, and when retries exhaust return an ErrorWithReason consistent with other retryable failures; preserve notify.Drain and json decode of token only after a successful retried response and map HTTP 429/5xx into the retrier’s retryable error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@notify/util_test.go`:
- Around line 258-261: The table-driven test builds an HTTP-date Retry-After
value at table construction time which can be stale by the time the subtest
runs; update the test so the HTTP-date value for the case named "retry-after
http-date with retry-codes including 429" is generated inside the subtest (e.g.,
inside the loop before calling Retrier.ParseRetryAfter) rather than as the
precomputed retryAfter string in the table; locate the test using the table
entry and the Retrier type to make this change.
---
Outside diff comments:
In `@notify/wechat/wechat.go`:
- Around line 109-128: The gettoken HTTP call currently uses notify.Get directly
and decodes the response unconditionally, so 429/5xx or templating errors become
misleading decode/invalid APISecret errors; change the gettoken request to use
the same retrier logic used for message/send (wrap the notify.Get call with the
retrier retry/Do pattern), validate tmpl(...) results before adding
corpsecret/corpid to parameters, and when retries exhaust return an
ErrorWithReason consistent with other retryable failures; preserve notify.Drain
and json decode of token only after a successful retried response and map HTTP
429/5xx into the retrier’s retryable error path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e239cf5c-7cb9-4569-b935-5697331bdede
📒 Files selected for processing (36)
notify/discord/discord.gonotify/discord/discord_test.gonotify/incidentio/incidentio.gonotify/incidentio/incidentio_test.gonotify/jira/jira.gonotify/jira/jira_test.gonotify/mattermost/mattermost.gonotify/mattermost/mattermost_test.gonotify/msteams/msteams.gonotify/msteams/msteams_test.gonotify/msteamsv2/msteamsv2.gonotify/msteamsv2/msteamsv2_test.gonotify/notify.gonotify/notify_test.gonotify/opsgenie/opsgenie.gonotify/opsgenie/opsgenie_test.gonotify/pagerduty/pagerduty.gonotify/pagerduty/pagerduty_test.gonotify/pushover/pushover.gonotify/pushover/pushover_test.gonotify/rocketchat/rocketchat.gonotify/rocketchat/rocketchat_test.gonotify/slack/slack.gonotify/slack/slack_test.gonotify/sns/sns.gonotify/telegram/telegram_test.gonotify/test/test.gonotify/util.gonotify/util_test.gonotify/victorops/victorops.gonotify/victorops/victorops_test.gonotify/webex/webex.gonotify/webex/webex_test.gonotify/webhook/webhook.gonotify/webhook/webhook_test.gonotify/wechat/wechat.go
🚧 Files skipped from review as they are similar to previous changes (13)
- notify/jira/jira_test.go
- notify/pagerduty/pagerduty_test.go
- notify/pushover/pushover_test.go
- notify/incidentio/incidentio_test.go
- notify/webex/webex.go
- notify/slack/slack_test.go
- notify/sns/sns.go
- notify/msteams/msteams.go
- notify/jira/jira.go
- notify/test/test.go
- notify/discord/discord_test.go
- notify/incidentio/incidentio.go
- notify/mattermost/mattermost_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
notify/webex/webex.go (1)
104-114:⚠️ Potential issue | 🟠 MajorDrain the response before returning.
This path never closes
resp.Body.notify.Retrier.Checkreturns immediately on 2xx, so successful Webex sends leave the body open and gradually hurt connection reuse.♻️ Suggested fix
resp, err := notify.PostJSON(ctx, n.client, n.conf.APIURL.String(), &payload) if err != nil { return true, notify.RedactURL(err) } +defer notify.Drain(resp) shouldRetry, err := n.retrier.Check(resp) if err != nil { return shouldRetry, err }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/webex/webex.go` around lines 104 - 114, The HTTP response body from notify.PostJSON (resp.Body) is never drained/closed which prevents connection reuse; update the handler in webex.go to always consume and close resp.Body (e.g., io.Copy(io.Discard, resp.Body) then resp.Body.Close()) whenever resp != nil — do this before calling n.retrier.Check or before any early return so resp.Body is closed on both success and error paths (referencing resp, resp.Body, notify.PostJSON, and n.retrier.Check).
♻️ Duplicate comments (1)
notify/util_test.go (1)
258-260:⚠️ Potential issue | 🟡 MinorThis HTTP-date case is still time-flaky.
time.Now()is evaluated when the table is built, not when the subtest runs. Becausehttp.TimeFormatdrops sub-second precision, slower CI can reach the assertion with a near-zero or already-expiredRetry-Aftervalue. Generate the HTTP-date insidet.Run.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/util_test.go` around lines 258 - 260, The table-driven test seeds a retryAfter string using time.Now() at table construction, which is flaky; change the test to generate the HTTP-date inside the subtest invoked by t.Run instead of in the table entry. For the case where retrier is Retrier{RetryCodes: []int{http.StatusTooManyRequests}} and the table field retryAfter currently uses time.Now().Add(...).UTC().Format(http.TimeFormat), leave the table entry as a placeholder (empty string or a marker) and in the t.Run body compute retryAfter := time.Now().Add(2*time.Second).UTC().Format(http.TimeFormat) and use that value for the request/response used by the test, updating assertions accordingly; reference the Retrier struct and the t.Run subtest that processes table entries to locate where to inject the generation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@notify/jira/jira.go`:
- Around line 404-407: The reassignment of resp.Body to
io.NopCloser(bytes.NewReader(responseBody)) after defer notify.Drain(resp) leaks
the original network body; instead of replacing resp.Body, create a new
temporary reader for retrier.Check to use (or pass responseBody directly) and
leave resp.Body untouched so notify.Drain can close/drain the original
connection. Locate the call to n.retrier.Check(resp) in function handling the
HTTP response and change it to use the in-memory responseBody (or a new
http.Response with its Body set to a fresh NopCloser created exclusively for the
retrier) without reassigning resp.Body; ensure notify.Drain(resp) remains able
to close the original Body and preserve shouldRetry, errWithReason behavior from
retrier.Check.
---
Outside diff comments:
In `@notify/webex/webex.go`:
- Around line 104-114: The HTTP response body from notify.PostJSON (resp.Body)
is never drained/closed which prevents connection reuse; update the handler in
webex.go to always consume and close resp.Body (e.g., io.Copy(io.Discard,
resp.Body) then resp.Body.Close()) whenever resp != nil — do this before calling
n.retrier.Check or before any early return so resp.Body is closed on both
success and error paths (referencing resp, resp.Body, notify.PostJSON, and
n.retrier.Check).
---
Duplicate comments:
In `@notify/util_test.go`:
- Around line 258-260: The table-driven test seeds a retryAfter string using
time.Now() at table construction, which is flaky; change the test to generate
the HTTP-date inside the subtest invoked by t.Run instead of in the table entry.
For the case where retrier is Retrier{RetryCodes:
[]int{http.StatusTooManyRequests}} and the table field retryAfter currently uses
time.Now().Add(...).UTC().Format(http.TimeFormat), leave the table entry as a
placeholder (empty string or a marker) and in the t.Run body compute retryAfter
:= time.Now().Add(2*time.Second).UTC().Format(http.TimeFormat) and use that
value for the request/response used by the test, updating assertions
accordingly; reference the Retrier struct and the t.Run subtest that processes
table entries to locate where to inject the generation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c4dbb653-6fa3-46d9-9408-022569156a89
📒 Files selected for processing (36)
notify/discord/discord.gonotify/discord/discord_test.gonotify/incidentio/incidentio.gonotify/incidentio/incidentio_test.gonotify/jira/jira.gonotify/jira/jira_test.gonotify/mattermost/mattermost.gonotify/mattermost/mattermost_test.gonotify/msteams/msteams.gonotify/msteams/msteams_test.gonotify/msteamsv2/msteamsv2.gonotify/msteamsv2/msteamsv2_test.gonotify/notify.gonotify/notify_test.gonotify/opsgenie/opsgenie.gonotify/opsgenie/opsgenie_test.gonotify/pagerduty/pagerduty.gonotify/pagerduty/pagerduty_test.gonotify/pushover/pushover.gonotify/pushover/pushover_test.gonotify/rocketchat/rocketchat.gonotify/rocketchat/rocketchat_test.gonotify/slack/slack.gonotify/slack/slack_test.gonotify/sns/sns.gonotify/telegram/telegram_test.gonotify/test/test.gonotify/util.gonotify/util_test.gonotify/victorops/victorops.gonotify/victorops/victorops_test.gonotify/webex/webex.gonotify/webex/webex_test.gonotify/webhook/webhook.gonotify/webhook/webhook_test.gonotify/wechat/wechat.go
🚧 Files skipped from review as they are similar to previous changes (16)
- notify/incidentio/incidentio.go
- notify/sns/sns.go
- notify/webhook/webhook_test.go
- notify/webhook/webhook.go
- notify/msteams/msteams_test.go
- notify/msteams/msteams.go
- notify/pagerduty/pagerduty_test.go
- notify/slack/slack.go
- notify/slack/slack_test.go
- notify/incidentio/incidentio_test.go
- notify/jira/jira_test.go
- notify/victorops/victorops.go
- notify/pagerduty/pagerduty.go
- notify/pushover/pushover_test.go
- notify/test/test.go
- notify/discord/discord_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
notify/wechat/wechat.go (1)
175-202:⚠️ Potential issue | 🔴 CriticalResponse body consumed before WeChat-specific parsing.
n.retrier.Check(resp)at line 175 consumesresp.Body(viareadAllorCustomDetailsFunc). The subsequentio.ReadAll(resp.Body)at line 180 will return an empty result because the body stream has already been exhausted.This breaks the WeChat-specific response parsing (lines 186-202) that checks
weResp.Codefor success (0) or expired token (42001).🐛 Proposed fix: Read body before retrier.Check or use CustomDetailsFunc
One approach is to read the body first and pass it to a custom details function:
+func wechatErrDetails(code int, body io.Reader) string { + if body == nil { + return "" + } + b, err := io.ReadAll(body) + if err != nil { + return "" + } + return string(b) +} + // In New(): - return &Notifier{conf: c, tmpl: t, logger: l, client: client, retrier: ¬ify.Retrier{}}, nil + return &Notifier{conf: c, tmpl: t, logger: l, client: client, retrier: ¬ify.Retrier{CustomDetailsFunc: wechatErrDetails}}, nilThen capture the body content from the error message for WeChat-specific parsing, or restructure to read body once and pass a copy/buffer to the retrier.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/wechat/wechat.go` around lines 175 - 202, The response body is consumed by n.retrier.Check(resp) before io.ReadAll(resp.Body) runs, so we need to read and preserve the body once and use it both for the retrier and for WeChat parsing: read resp.Body into a byte slice first, restore resp.Body (e.g., rewrap with a new ReadCloser) before calling n.retrier.Check(resp) or use the retrier's CustomDetailsFunc to accept the already-read bytes; then unmarshal the saved bytes into weChatResponse and handle weResp.Code (0/42001) and n.accessToken exactly as before. Ensure you reference n.retrier.Check, resp.Body, io.ReadAll, weChatResponse, weResp.Code, weResp.Error, and n.accessToken when making the change.notify/discord/discord.go (1)
175-181:⚠️ Potential issue | 🟠 MajorMissing
notify.Drain(resp)causes HTTP connection leak.The response body is never drained or closed after calling
retrier.Check(resp). Unlike other notifiers (e.g.,rocketchat.goline 227,pushover.goline 175), there's nodefer notify.Drain(resp)after obtaining the response. This prevents HTTP connection reuse and leaks resources.🐛 Proposed fix
resp, err := notify.PostJSON(ctx, n.client, url, &payload) if err != nil { return true, notify.RedactURL(err) } + defer notify.Drain(resp) shouldRetry, errWithReason := n.retrier.Check(resp) return shouldRetry, errWithReason🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/discord/discord.go` around lines 175 - 181, The HTTP response body from notify.PostJSON is never drained/closed, leaking connections; after the successful call that sets resp (i.e., right after resp, err := notify.PostJSON(... ) and before calling n.retrier.Check(resp)), add a call to notify.Drain(resp) (use defer notify.Drain(resp) to ensure it always runs) so the response is drained/closed before returning from the function where resp is used by n.retrier.Check; reference resp, notify.Drain and n.retrier.Check in your change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@notify/test/test.go`:
- Around line 197-214: The StatusBadRequest branch in HTTPResponseForStatusCode
returns an http.Response without initializing Header, which can cause nil
dereferences when callers inspect headers; update the BadRequest return in
HTTPResponseForStatusCode to include Header: make(http.Header) (matching the
non-400 branch) so the response always has a non-nil Header map.
---
Outside diff comments:
In `@notify/discord/discord.go`:
- Around line 175-181: The HTTP response body from notify.PostJSON is never
drained/closed, leaking connections; after the successful call that sets resp
(i.e., right after resp, err := notify.PostJSON(... ) and before calling
n.retrier.Check(resp)), add a call to notify.Drain(resp) (use defer
notify.Drain(resp) to ensure it always runs) so the response is drained/closed
before returning from the function where resp is used by n.retrier.Check;
reference resp, notify.Drain and n.retrier.Check in your change.
In `@notify/wechat/wechat.go`:
- Around line 175-202: The response body is consumed by n.retrier.Check(resp)
before io.ReadAll(resp.Body) runs, so we need to read and preserve the body once
and use it both for the retrier and for WeChat parsing: read resp.Body into a
byte slice first, restore resp.Body (e.g., rewrap with a new ReadCloser) before
calling n.retrier.Check(resp) or use the retrier's CustomDetailsFunc to accept
the already-read bytes; then unmarshal the saved bytes into weChatResponse and
handle weResp.Code (0/42001) and n.accessToken exactly as before. Ensure you
reference n.retrier.Check, resp.Body, io.ReadAll, weChatResponse, weResp.Code,
weResp.Error, and n.accessToken when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5423a33c-039e-4c1a-9087-e65b76be1c82
📒 Files selected for processing (36)
notify/discord/discord.gonotify/discord/discord_test.gonotify/incidentio/incidentio.gonotify/incidentio/incidentio_test.gonotify/jira/jira.gonotify/jira/jira_test.gonotify/mattermost/mattermost.gonotify/mattermost/mattermost_test.gonotify/msteams/msteams.gonotify/msteams/msteams_test.gonotify/msteamsv2/msteamsv2.gonotify/msteamsv2/msteamsv2_test.gonotify/notify.gonotify/notify_test.gonotify/opsgenie/opsgenie.gonotify/opsgenie/opsgenie_test.gonotify/pagerduty/pagerduty.gonotify/pagerduty/pagerduty_test.gonotify/pushover/pushover.gonotify/pushover/pushover_test.gonotify/rocketchat/rocketchat.gonotify/rocketchat/rocketchat_test.gonotify/slack/slack.gonotify/slack/slack_test.gonotify/sns/sns.gonotify/telegram/telegram_test.gonotify/test/test.gonotify/util.gonotify/util_test.gonotify/victorops/victorops.gonotify/victorops/victorops_test.gonotify/webex/webex.gonotify/webex/webex_test.gonotify/webhook/webhook.gonotify/webhook/webhook_test.gonotify/wechat/wechat.go
🚧 Files skipped from review as they are similar to previous changes (12)
- notify/incidentio/incidentio.go
- notify/mattermost/mattermost_test.go
- notify/pagerduty/pagerduty_test.go
- notify/victorops/victorops_test.go
- notify/webhook/webhook_test.go
- notify/opsgenie/opsgenie_test.go
- notify/mattermost/mattermost.go
- notify/sns/sns.go
- notify/webex/webex.go
- notify/msteamsv2/msteamsv2_test.go
- notify/webex/webex_test.go
- notify/webhook/webhook.go
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/with_api_v2/acceptance/send_test.go (1)
559-581:⚠️ Potential issue | 🟠 MajorAssert that the timeout path actually retried.
Line 577 now accepts delivery at
t=2, which is before the500mstimeout and retry backoff have elapsed. That makes this test pass even if the first attempt is the only one observed. Please add an explicitattempts.Load() >= 2assertion afterat.Run()so the test still proves the retry behavior.Suggested test hardening
at.Run() + + if got := attempts.Load(); got < 2 { + t.Fatalf("expected webhook retry after timeout, got %d attempt(s)", got) + } t.Log(co.Check()) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/with_api_v2/acceptance/send_test.go` around lines 559 - 581, The test currently can pass without verifying a retry occurred; after calling at.Run(), add an explicit assertion that attempts.Load() >= 2 to ensure the timeout path retried (i.e., the webhook handler represented by wh.Func was invoked at least twice). Place this assertion immediately after at.Run() (and before t.Log(co.Check())) so the test fails if only the first, timed-out attempt was observed.notify/incidentio/incidentio_test.go (1)
186-190:⚠️ Potential issue | 🟡 MinorDon't lock in the loss of Incident.io's 429 body details.
Lines 284-287 later in this file still verify that
errDetailscan turn a 429 body into a more actionable message, but this assertion now only expects the generic status text. That would normalize away the provider response details just when operators need them most. Please preserve the parsed body message when wrapping the generic retry reason.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/incidentio/incidentio_test.go` around lines 186 - 190, The test currently expects only the generic status text for a 429 response but must preserve Incident.io's parsed body details; update the test case (the table entry with name "rate limit response") so expectErrorMsgContains asserts that the error includes the parsed body message produced by errDetails in addition to the status text (for example, ensure the expected substring contains "Too many requests" or the JSON "rate limit exceeded" message as the implementation formats it) so the wrapped retry error retains provider-specific details when statusCode is http.StatusTooManyRequests and expectRetry is true.
🧹 Nitpick comments (2)
notify/test/test.go (1)
197-216: Consider a header-aware response helper for 429 coverage.This shared helper only builds bare responses, so the notifier retry tests that use it can exercise the fallback 429 path but not the new
Retry-Afterbehavior. A small variant that accepts headers would make the seconds and HTTP-date cases easy to cover without open-coding responses in each package.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/test/test.go` around lines 197 - 216, The helper HTTPResponseForStatusCode should be made header-aware so tests can exercise Retry-After behavior; add a new function (e.g., HTTPResponseWithHeaders(statusCode int, headers http.Header) or extend HTTPResponseForStatusCode to accept an http.Header parameter) that sets the Response.Header to the provided headers (falling back to make(http.Header) when nil) and otherwise preserves the existing body and status behavior (including the BadRequest JSON case). Update notifier retry tests to call the new header-aware helper with Retry-After set to a seconds value and an HTTP-date value to cover both parsing branches.notify/util_test.go (1)
268-270: Decouple test branching from descriptive literals.Line 268 and Line 287 currently branch on
tc.name/"7"values. This is brittle for future test-case edits. Prefer explicit boolean fields in the table (e.g.,useHTTPDate,expectExactRetryAfter).♻️ Suggested refactor
for _, tc := range []struct { name string retrier Retrier retryAfter string expected time.Duration + useHTTPDate bool + expectExactRetryAfter bool }{ { name: "retry-after seconds", retrier: Retrier{}, retryAfter: "7", expected: 7 * time.Second, + expectExactRetryAfter: true, }, { name: "retry-after seconds with retry-codes including 429", retrier: Retrier{RetryCodes: []int{http.StatusTooManyRequests}}, retryAfter: "7", expected: 7 * time.Second, + expectExactRetryAfter: true, }, { name: "retry-after http-date with retry-codes including 429", retrier: Retrier{RetryCodes: []int{http.StatusTooManyRequests}}, retryAfter: "", expected: 2 * time.Second, + useHTTPDate: true, }, } { t.Run(tc.name, func(t *testing.T) { retryAfter := tc.retryAfter - if tc.name == "retry-after http-date with retry-codes including 429" { + if tc.useHTTPDate { retryAfter = time.Now().Add(tc.expected).UTC().Format(http.TimeFormat) } ... - if tc.retryAfter == "7" { + if tc.expectExactRetryAfter { require.Equal(t, tc.expected, errWithReason.RetryAfter) return }Also applies to: 287-290
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notify/util_test.go` around lines 268 - 270, Replace brittle branching that checks tc.name and string literals with explicit boolean fields on the test-case struct (e.g., add useHTTPDate and expectExactRetryAfter to the test case type), set those flags for the specific cases that currently rely on tc.name == "retry-after http-date with retry-codes including 429" or the `"7"` case, and update the conditional branches that currently read tc.name or compare to "7" to use tc.useHTTPDate and tc.expectExactRetryAfter instead (adjust test case initializers accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@notify/util.go`:
- Around line 263-266: The Check method on type Retrier currently dereferences
resp without guarding for nil; add an early nil check in Retrier.Check that
returns a non-nil error (and appropriate bool) when resp == nil instead of
accessing resp.StatusCode. Update the start of Retrier.Check to detect nil,
return (false, error) with a clear message (e.g. "nil http response") and then
proceed to the existing 2xx check that uses resp.StatusCode.
In `@notify/wechat/wechat.go`:
- Line 82: The token refresh logic in gettoken (and the second token-refresh
block at the later lines) must use the existing notify.Retrier instead of
decoding responses directly; modify the token fetch calls to invoke n.retrier
(notify.Retrier) to perform the HTTP request via n.client, retry on 429/5xx
honoring Retry-After, and only decode the response body after verifying
resp.StatusCode is 2xx (otherwise return the retrier error); locate the
token-fetch code in gettoken (and the similar block at lines 179-187) and
replace the direct HTTP/JSON decode path with the retrier-run path so transient
429/5xx are retried rather than turned into JSON/invalid-secret errors.
---
Outside diff comments:
In `@notify/incidentio/incidentio_test.go`:
- Around line 186-190: The test currently expects only the generic status text
for a 429 response but must preserve Incident.io's parsed body details; update
the test case (the table entry with name "rate limit response") so
expectErrorMsgContains asserts that the error includes the parsed body message
produced by errDetails in addition to the status text (for example, ensure the
expected substring contains "Too many requests" or the JSON "rate limit
exceeded" message as the implementation formats it) so the wrapped retry error
retains provider-specific details when statusCode is http.StatusTooManyRequests
and expectRetry is true.
In `@test/with_api_v2/acceptance/send_test.go`:
- Around line 559-581: The test currently can pass without verifying a retry
occurred; after calling at.Run(), add an explicit assertion that attempts.Load()
>= 2 to ensure the timeout path retried (i.e., the webhook handler represented
by wh.Func was invoked at least twice). Place this assertion immediately after
at.Run() (and before t.Log(co.Check())) so the test fails if only the first,
timed-out attempt was observed.
---
Nitpick comments:
In `@notify/test/test.go`:
- Around line 197-216: The helper HTTPResponseForStatusCode should be made
header-aware so tests can exercise Retry-After behavior; add a new function
(e.g., HTTPResponseWithHeaders(statusCode int, headers http.Header) or extend
HTTPResponseForStatusCode to accept an http.Header parameter) that sets the
Response.Header to the provided headers (falling back to make(http.Header) when
nil) and otherwise preserves the existing body and status behavior (including
the BadRequest JSON case). Update notifier retry tests to call the new
header-aware helper with Retry-After set to a seconds value and an HTTP-date
value to cover both parsing branches.
In `@notify/util_test.go`:
- Around line 268-270: Replace brittle branching that checks tc.name and string
literals with explicit boolean fields on the test-case struct (e.g., add
useHTTPDate and expectExactRetryAfter to the test case type), set those flags
for the specific cases that currently rely on tc.name == "retry-after http-date
with retry-codes including 429" or the `"7"` case, and update the conditional
branches that currently read tc.name or compare to "7" to use tc.useHTTPDate and
tc.expectExactRetryAfter instead (adjust test case initializers accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cdfb9c8b-194d-4906-94e7-cdadd4dbdb8b
📒 Files selected for processing (37)
notify/discord/discord.gonotify/discord/discord_test.gonotify/incidentio/incidentio.gonotify/incidentio/incidentio_test.gonotify/jira/jira.gonotify/jira/jira_test.gonotify/mattermost/mattermost.gonotify/mattermost/mattermost_test.gonotify/msteams/msteams.gonotify/msteams/msteams_test.gonotify/msteamsv2/msteamsv2.gonotify/msteamsv2/msteamsv2_test.gonotify/notify.gonotify/notify_test.gonotify/opsgenie/opsgenie.gonotify/opsgenie/opsgenie_test.gonotify/pagerduty/pagerduty.gonotify/pagerduty/pagerduty_test.gonotify/pushover/pushover.gonotify/pushover/pushover_test.gonotify/rocketchat/rocketchat.gonotify/rocketchat/rocketchat_test.gonotify/slack/slack.gonotify/slack/slack_test.gonotify/sns/sns.gonotify/telegram/telegram_test.gonotify/test/test.gonotify/util.gonotify/util_test.gonotify/victorops/victorops.gonotify/victorops/victorops_test.gonotify/webex/webex.gonotify/webex/webex_test.gonotify/webhook/webhook.gonotify/webhook/webhook_test.gonotify/wechat/wechat.gotest/with_api_v2/acceptance/send_test.go
🚧 Files skipped from review as they are similar to previous changes (11)
- notify/pushover/pushover_test.go
- notify/pagerduty/pagerduty.go
- notify/webhook/webhook_test.go
- notify/webex/webex_test.go
- notify/opsgenie/opsgenie.go
- notify/jira/jira_test.go
- notify/msteamsv2/msteamsv2_test.go
- notify/mattermost/mattermost_test.go
- notify/telegram/telegram_test.go
- notify/victorops/victorops.go
- notify/rocketchat/rocketchat_test.go
add generic handling of 429 in retier. This will allow to remove any handling of 429 in the individual notifiers and instead rely on the retier to handle it (TBD in follow-up PRs). Fixes prometheus#2205 Refs: prometheus#5048 Signed-off-by: Christoph Maser <christoph.maser+github@gmail.com>
Thanks @siavashs I did some major rework and I think it is in a state that is as least not completely broken. However I am still not sure if this way of trasporting information via custom error types is a good idea here. Also it is a quite significant behaviour change for nearly all notfiers. |
This is an attempt to handle 429 HTTP response codes (Too Many Request aka rate limited) in a generic fasion in the notifiy retrier. It (ab)uses custom
ErrorWithReasonerror types to encode the 429 status code and transport the optional "Retry-After" HTTP reposnse header value. SettingErrorWithReason.RetryAfterto 0 serve as a default to fall-back to the previously used exponential backoff timerThis approach was chosen in order to not change the ridgid Notifier/Multistage interface/type
NOTE: it will require follow-up changes to clean up notifier-specific handling of 429 status codes
NOTE: existing notifier-specific handling of 429 status codes often just does simply rely on the exponential backoff timer
NOTE: if group_interval elapses it will cancel the notifier if it is still waiting for either retryAfterTimer or exponential backoff timer
NOTE: wechat excluded since it does not use the std retrier
Fixes #2205
Refs: #5048
Pull Request Checklist
Please check all the applicable boxes.
benchstatto compare benchmarksWhich user-facing changes does this PR introduce?
Summary by CodeRabbit
New Features
Bug Fixes
Tests