Conversation

@thomasdesr

Summary

  • This PR adds (& enables by default) automatic retry on HTTP 429 (Too Many Requests) responses for the bk api commands
  • Adds a --verbose flag to observe retry behavior when using the CLI

Note: I didn't want to make this the default anywhere other than the CLI because I don't know what the consequences would be of making this the default 🙃 If you think it's safe, I'm happy to make it the default everywhere 🤣

Implementation Details

  • Built following the docs here: https://buildkite.com/docs/apis/rest-api/limits
  • When the client receives a 429, it waits until the RateLimit-Reset timer has expired before retrying
  • If it then hits another 429, it backs off exponentially on each repeated rate limit (see the sketch below)
  • internal/http gained a few additional options to configure the behavior:
    • WithMaxRetries to control how many attempts are made & WithMaxRetryDelay to set an upper bound on retry intervals
    • A WithOnRetry callback (currently used by the --verbose flag to print warnings when backoff kicks in)
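
For a concrete picture of the schedule, here's a rough sketch of the delay calculation (illustrative only: retryDelay is a made-up helper rather than the actual internal/http code; only the option names in the list above come from this PR):

package ratelimit // hypothetical package name, for illustration only

import "time"

// retryDelay sketches the schedule described above: the first retry waits the
// full RateLimit-Reset window, each further 429 doubles the previous wait,
// and a positive maxDelay (from WithMaxRetryDelay) caps the result.
func retryDelay(attempt int, reset, maxDelay time.Duration) time.Duration {
    delay := reset << attempt // attempt 0 => reset, 1 => 2*reset, 2 => 4*reset, ...
    if maxDelay > 0 && delay > maxDelay {
        delay = maxDelay
    }
    return delay
}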

@thomasdesr requested a review from a team as a code owner on December 10, 2025 04:04

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@JoeColeman95 enabled auto-merge (squash) on December 10, 2025 14:57
@JoeColeman95 previously approved these changes Dec 10, 2025
@scadu (Contributor) commented Dec 10, 2025

Hey @thomasdesr 👋
Thanks for the contribution!
While automatic retry for rate limit handling is usually a great idea, I think in the context of a CLI we might want to emit an error message stating that the rate limit has been reached, and stop the current execution.
RateLimit-Reset could be included in the message to communicate clearly when the limit is reset.

Would you be able to update the PR to make CLI behave that way?

@thomasdesr (Author)

So that's actually exactly how it behaves today :D When you hit the rate limit, the CLI exits non-zero and returns an error noting that you've hit a backoff.

The reason I wanted to add support (either default or opt-in) for respecting and retrying the rate limit within the CLI was that when scripting multiple calls, getting rate limit errors in the middle of a pipeline was quite difficult to handle correctly.

E.g. I was writing something like this to batch download logs for a build:

bk build view --pipeline "$ORG/$PIPELINE" "$build_id" \
  | jq -r '. as $b
      | $b.jobs[]
      | select(.type=="script")
      | [$b.pipeline.slug, $b.number, .id]
      | @tsv' \
  | parallel -j8 --colsep '\t' -u \
      'bk api --verbose "/pipelines/{1}/builds/{2}/jobs/{3}/log" | jq -r .content > {1}/{2}-{3}.log'

I can use something like parallel's built-in retries, but then I'm not actually going to respect the rate-limit backoff instruction. To do that I'd need to wrap the bk CLI and extract the backoff time from the error message? At that point I might as well use the SDK directly xD

Would love advice if I'm going about this wrong :D

@scadu (Contributor) commented Dec 12, 2025

@thomasdesr, after revisiting it, I agree that handling 429s for api commands would be a welcome addition. It looks like we currently only handle it for build|job GETs through go-buildkite.

If you can continue working on this pull request, I'd recommend using https://github.com/buildkite/roko to handle retries – it's a library already used by our agent, agent-stack-k8s and others.

Again, thanks a ton for your contribution! 🙇

@thomasdesr force-pushed the thomas/add-automatic-429-retry branch from 603731b to 7484bcf on December 27, 2025 17:31

The HTTP client now automatically retries when the Buildkite API returns 429 responses. On the first retry, it waits exactly the duration specified in the RateLimit-Reset header. On subsequent retries, it multiplies the delay exponentially (2x, 4x, etc.) to reduce contention when multiple clients are competing for quota. Configurable via WithMaxRetries (default 3) and WithMaxRetryDelay (default 60s).

Adds WithOnRetry(func(attempt int, delay time.Duration)) option to the HTTP client. The callback is invoked before each retry sleep, allowing callers to log retry attempts, collect metrics, or implement custom notification logic.

When --verbose is enabled, the api command logs rate limit retries to stderr with the delay duration and expected resume time.

Remove default retry settings from the HTTP client to avoid changing behavior for other commands until we discuss with the team. The bk api command explicitly opts in to retry behavior.

Previously, min(delay, 0) would clamp delays to 0 when WithMaxRetryDelay wasn't called. Now only clamp when a positive max is configured.

Use Buildkite's roko library for retry handling instead of the custom retry loop. This aligns with patterns used in other Buildkite projects (agent, agent-stack-k8s).

@thomasdesr force-pushed the thomas/add-automatic-429-retry branch from 7484bcf to 585ce7c on December 27, 2025 23:24
@thomasdesr (Author)

Hey @scadu! Thanks for the second look! I've rebased onto main and resolved conflicts from the Kong migration.

Two thoughts:

  • I re-added the --verbose flag to the ApiCmd Kong struct, but kept it command-local for now rather than root-level; lmk if you want me to move it back to global! I saw it wasn't set up at the global level yet
  • I'm also not super familiar with roko, so if you could take a close look at how I'm using it, I'd appreciate it :D Would love confirmation on whether roko.Constant(0) + SetNextInterval() is the right pattern for server-directed delays, or if there's a more idiomatic approach 🙏 (rough sketch of what I mean below)
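
For context, this is roughly the shape of the pattern, heavily simplified (the package name, doWithRetry, and the header parsing are placeholders, not the exact code in the PR; it also assumes a body-less request like a GET, since the same *http.Request is reused across attempts):

package bkapi // illustrative package name

import (
    "fmt"
    "net/http"
    "strconv"
    "time"

    "github.com/buildkite/roko"
)

// doWithRetry sketches the Constant(0) + SetNextInterval() idea: the strategy
// itself never sleeps, and on each 429 we tell roko exactly how long to wait
// based on the server's RateLimit-Reset header (seconds until the window resets).
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
    var resp *http.Response

    retrier := roko.NewRetrier(
        roko.WithMaxAttempts(3),
        roko.WithStrategy(roko.Constant(0)),
    )

    err := retrier.Do(func(r *roko.Retrier) error {
        var err error
        resp, err = client.Do(req)
        if err != nil {
            r.Break() // transport errors aren't rate limits, so don't keep retrying
            return err
        }
        if resp.StatusCode == http.StatusTooManyRequests {
            // Server-directed delay: sleep for exactly what RateLimit-Reset asks.
            if secs, perr := strconv.Atoi(resp.Header.Get("RateLimit-Reset")); perr == nil {
                r.SetNextInterval(time.Duration(secs) * time.Second)
            }
            resp.Body.Close()
            return fmt.Errorf("rate limited by the Buildkite API")
        }
        return nil
    })

    return resp, err
}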

Sorry for the turnaround delay! PTAL :D

@scadu (Contributor) commented Jan 2, 2026

@thomasdesr happy New Year!
Thank you so much for the update – it looks great!

The current implementation with roko.Constant(0) + SetNextInterval() should be sufficient, although I see you are adding exponential backoff down the line.
In this case we can consider using roko.Exponential, e.g.: roko.WithStrategy(roko.Exponential(2, 0)), // Wait (2 ^ attemptCount) + 0 seconds between attempts.

@thomasdesr (Author)

In this case we can consider using roko.Exponential, e.g.: roko.WithStrategy(roko.Exponential(2, 0)), // Wait (2 ^ attemptCount) + 0 seconds between attempts.

I don't think this actually works, due to how SetNextInterval works: it'll override the next attempt's sleep, right? And since we call it every time, it will never actually fall back to the strategy's "sleep amount".

Please let me know if I'm misunderstanding either its behavior or your suggestion 🙏

@scadu (Contributor) commented Jan 6, 2026

@thomasdesr I believe it should work, but the current approach with SetNextInterval is also fine.

Side note: looks like after merging main into this branch, you might need to run go get github.com/buildkite/roko and go mod tidy to clear the golangci-lint errors.
