Conversation

@thomasdesr

Summary

  • This PR adds (& enables by default) automatic retry on HTTP 429 (Too Many Requests) responses for the bk api commands
  • Adds a --verbose flag to observe retry behavior when using the CLI

Note: I didn't want to make this the default anywhere other than the CLI because I don't know what the consequences would be of making this the default 🙃 If you think it's safe, I'm happy to make it the default everywhere 🤣

Implementation Details

  • Built following the docs here: https://buildkite.com/docs/apis/rest-api/limits
  • When the client receives a 429, it waits until the RateLimit-Reset timer has expired before retrying
  • If it then hits another 429, it backs off exponentially on each repeated rate limit (see the sketch below)
  • internal/http gained a few additional options to configure the behavior:
    • WithMaxRetries to control how many attempts are made & WithMaxRetryDelay to set an upper bound on retry intervals
    • A WithOnRetry callback (currently used by the --verbose flag to print warnings when backoff kicks in)
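
For a concrete picture of the schedule, here's a rough sketch of the delay calculation (illustrative only: retryDelay is a made-up helper rather than the actual internal/http code; only the option names in the list above come from this PR):

package ratelimit // hypothetical package name, for illustration only

import "time"

// retryDelay sketches the schedule described above: the first retry waits the
// full RateLimit-Reset window, each further 429 doubles the previous wait,
// and a positive maxDelay (from WithMaxRetryDelay) caps the result.
func retryDelay(attempt int, reset, maxDelay time.Duration) time.Duration {
    delay := reset << attempt // attempt 0 => reset, 1 => 2*reset, 2 => 4*reset, ...
    if maxDelay > 0 && delay > maxDelay {
        delay = maxDelay
    }
    return delay
}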

@thomasdesr requested a review from a team as a code owner on December 10, 2025 04:04

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@JoeColeman95 enabled auto-merge (squash) on December 10, 2025 14:57
@JoeColeman95 previously approved these changes Dec 10, 2025
@scadu (Contributor) commented Dec 10, 2025

Hey @thomasdesr 👋
Thanks for the contribution!
While automatic retry for rate limit handling is usually a great idea, I think in the context of a CLI we might want to emit an error message stating that the rate limit has been reached, and stop the current execution.
RateLimit-Reset could be included in the message to communicate clearly when the limit is reset.

Would you be able to update the PR to make CLI behave that way?

@thomasdesr (Author)

So that's actually exactly how it behaves today :D When you hit the rate limit, the CLI exits non-zero and returns an error noting that you've hit a backoff.

The reason I wanted to add support (either default or opt-in) for respecting and retrying the rate limit within the CLI was that when scripting multiple calls, getting rate limit errors in the middle of a pipeline was quite difficult to handle correctly.

E.g. I was writing something like this to batch download logs for a build:

bk build view --pipeline "$ORG/$PIPELINE" "$build_id" \
  | jq -r '. as $b
      | $b.jobs[]
      | select(.type=="script")
      | [$b.pipeline.slug, $b.number, .id]
      | @tsv' \
  | parallel -j8 --colsep '\t' -u \
      'bk api --verbose "/pipelines/{1}/builds/{2}/jobs/{3}/log" | jq -r .content > {1}/{2}-{3}.log'

I can use something like parallel's built-in retries, but then I'm not actually going to respect the rate-limit backoff instruction. To do that I'd need to wrap the bk CLI and extract the backoff time from the error message? At that point I might as well use the SDK directly xD

Would love advice if I'm going about this wrong :D

@scadu (Contributor) commented Dec 12, 2025

@thomasdesr, after revisiting it, I agree that handling 429s for api commands would be a welcome addition. It looks like we currently only handle it for build|job GETs through go-buildkite.

If you can continue working on this pull request, I'd recommend using https://github.com/buildkite/roko to handle retries – it's a library already used by our agent, agent-stack-k8s and others.

Again, thanks a ton for your contribution! 🙇

@thomasdesr force-pushed the thomas/add-automatic-429-retry branch from 603731b to 7484bcf on December 27, 2025 17:31

The HTTP client now automatically retries when the Buildkite API returns 429 responses. On the first retry, it waits exactly the duration specified in the RateLimit-Reset header. On subsequent retries, it multiplies the delay exponentially (2x, 4x, etc.) to reduce contention when multiple clients are competing for quota. Configurable via WithMaxRetries (default 3) and WithMaxRetryDelay (default 60s).

Adds WithOnRetry(func(attempt int, delay time.Duration)) option to the HTTP client. The callback is invoked before each retry sleep, allowing callers to log retry attempts, collect metrics, or implement custom notification logic.

When --verbose is enabled, the api command logs rate limit retries to stderr with the delay duration and expected resume time.

Remove default retry settings from the HTTP client to avoid changing behavior for other commands until we discuss with the team. The bk api command explicitly opts in to retry behavior.

Previously, min(delay, 0) would clamp delays to 0 when WithMaxRetryDelay wasn't called. Now only clamp when a positive max is configured.

Use Buildkite's roko library for retry handling instead of the custom retry loop. This aligns with patterns used in other Buildkite projects (agent, agent-stack-k8s).

@thomasdesr force-pushed the thomas/add-automatic-429-retry branch from 7484bcf to 585ce7c on December 27, 2025 23:24
@thomasdesr (Author)

Hey @scadu! Thanks for the second look! I've rebased onto main and resolved conflicts from the Kong migration.

Two thoughts:

  • I re-added the --verbose flag to the ApiCmd Kong struct, but kept it command-local for now rather than root-level; lmk if you want me to move it back to global! I saw it wasn't set up at the global level yet
  • I'm also not super familiar with roko, so if you could take a close look at how I'm using it, I'd appreciate it :D Would love confirmation on whether roko.Constant(0) + SetNextInterval() is the right pattern for server-directed delays, or if there's a more idiomatic approach 🙏 (rough sketch of what I mean below)
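
For context, this is roughly the shape of the pattern, heavily simplified (the package name, doWithRetry, and the header parsing are placeholders, not the exact code in the PR; it also assumes a body-less request like a GET, since the same *http.Request is reused across attempts):

package bkapi // illustrative package name

import (
    "fmt"
    "net/http"
    "strconv"
    "time"

    "github.com/buildkite/roko"
)

// doWithRetry sketches the Constant(0) + SetNextInterval() idea: the strategy
// itself never sleeps, and on each 429 we tell roko exactly how long to wait
// based on the server's RateLimit-Reset header (seconds until the window resets).
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
    var resp *http.Response

    retrier := roko.NewRetrier(
        roko.WithMaxAttempts(3),
        roko.WithStrategy(roko.Constant(0)),
    )

    err := retrier.Do(func(r *roko.Retrier) error {
        var err error
        resp, err = client.Do(req)
        if err != nil {
            r.Break() // transport errors aren't rate limits, so don't keep retrying
            return err
        }
        if resp.StatusCode == http.StatusTooManyRequests {
            // Server-directed delay: sleep for exactly what RateLimit-Reset asks.
            if secs, perr := strconv.Atoi(resp.Header.Get("RateLimit-Reset")); perr == nil {
                r.SetNextInterval(time.Duration(secs) * time.Second)
            }
            resp.Body.Close()
            return fmt.Errorf("rate limited by the Buildkite API")
        }
        return nil
    })

    return resp, err
}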

Sorry for the turnaround delay! PTAL :D

@scadu (Contributor) commented Jan 2, 2026

@thomasdesr happy New Year!
Thank you so much for the update – it looks great!

The current implementation with roko.Constant(0) + SetNextInterval() should be sufficient, although I see you are adding exponential backoff down the line.
In this case we can consider using roko.Exponential, e.g.: roko.WithStrategy(roko.Exponential(2, 0)), // Wait (2 ^ attemptCount) + 0 seconds between attempts.

@thomasdesr (Author)

In this case we can consider using roko.Exponential, e.g.: roko.WithStrategy(roko.Exponential(2, 0)), // Wait (2 ^ attemptCount) + 0 seconds between attempts.

I don't think this actually works, due to how SetNextInterval works: it'll override the next attempt's sleep, right? And since we call it every time, it will never actually fall back to the strategy's "sleep amount".

Please let me know if I'm misunderstanding either its behavior or your suggestion 🙏

@scadu (Contributor) commented Jan 6, 2026

@thomasdesr I believe it should work, but the current approach with SetNextInterval is also fine.

Side note: looks like after merging main into this branch, you might need to run go get github.com/buildkite/roko and go mod tidy to clear the golangci-lint errors.
