feat: configurable RetryConfig for transient-failure retries#20

Merged
jackparnell merged 1 commit into main from feature/retry-config on Apr 9, 2026
@ColonistOne
Collaborator

Summary

Adds a frozen RetryConfig dataclass that callers pass via:

ColonyClient(api_key, retry=RetryConfig(...))
AsyncColonyClient(api_key, retry=RetryConfig(...))

| Field | Default | Notes |
| --- | --- | --- |
| max_retries | 2 | Number of retries after the initial attempt. 0 disables retries. |
| base_delay | 1.0 | Base backoff in seconds. Nth retry waits base_delay * 2**(N-1). |
| max_delay | 10.0 | Cap on the per-retry delay (seconds). |
| retry_on | frozenset({429, 502, 503, 504}) | HTTP statuses that trigger a retry. |

The server's Retry-After header always overrides the computed delay.
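
To make the backoff arithmetic concrete, here is a minimal sketch of how the delay could be computed from the fields above. It is not the SDK's code: the `RetryConfig` class is a stand-in that only mirrors the documented fields, `compute_retry_delay` is a hypothetical name, the attempt number is assumed to be 1-based, and only the numeric (seconds) form of Retry-After is handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    """Illustrative stand-in mirroring the documented fields and defaults."""

    max_retries: int = 2
    base_delay: float = 1.0
    max_delay: float = 10.0
    retry_on: frozenset[int] = frozenset({429, 502, 503, 504})


def compute_retry_delay(
    attempt: int, config: RetryConfig, retry_after: str | None = None
) -> float:
    """Return the wait in seconds before retry number `attempt` (1-based).

    Assumption: a numeric Retry-After value wins outright and is not capped
    by max_delay. The HTTP-date form is ignored in this sketch.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # unparseable header: fall back to the computed backoff
    return min(config.base_delay * 2 ** (attempt - 1), config.max_delay)
```

With the defaults this gives 1.0 s before the first retry and 2.0 s before the second; a fifth retry would be capped at 10.0 s.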

Why

Downstream packages (langchain-colony, crewai-colony) currently re-implement retry logic on top of the SDK with their own RetryConfig clones. That logic belongs in one place — here. With this change they can delete their wrappers and just pass retry= through.

⚠️ Behavior change: 5xx gateway errors retried by default

Previously the SDK only retried 429s. It now also retries 502 Bad Gateway, 503 Service Unavailable, and 504 Gateway Timeout — these almost always represent transient infra issues that clear on retry.

500 Internal Server Error is intentionally not retried by default. It usually signals a bug in the request, not a transient failure, so retrying just amplifies the problem.

To restore the old 1.4.x behaviour:

ColonyClient("col_...", retry=RetryConfig(retry_on=frozenset({429})))

Internals

  • _should_retry(status, attempt, config) and _compute_retry_delay(attempt, config, retry_after_header) helpers shared by sync + async _raw_request paths.
  • _raw_request signature gains a separate _token_refreshed flag so the 401-refresh path doesn't consume the configurable retry budget. (Otherwise a 401 followed by a 429 storm would only get max_retries - 1 retries, which is surprising.)
  • ColonyClient / AsyncColonyClient gain a retry attribute: the RetryConfig instance, defaulting to RetryConfig() when None is passed.
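
The interaction between the retry budget and the 401 refresh can be sketched as below. This is a simplified loop under stated assumptions, not the actual `_raw_request`: `send`, `refresh_token`, and `sleep` are hypothetical callables, `should_retry` treats `attempt` as the number of retries already made, and the real code carries `_token_refreshed` as a parameter rather than a local variable.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RetryConfig:
    """Illustrative stand-in mirroring the documented fields and defaults."""

    max_retries: int = 2
    base_delay: float = 1.0
    max_delay: float = 10.0
    retry_on: frozenset[int] = frozenset({429, 502, 503, 504})


def should_retry(status: int, attempt: int, config: RetryConfig) -> bool:
    """True if `status` is retryable and the retry budget is not yet spent."""
    return attempt < config.max_retries and status in config.retry_on


def request_with_retry(
    send: Callable[[], int],
    refresh_token: Callable[[], None],
    config: RetryConfig,
    sleep: Callable[[float], None],
) -> int:
    """Call `send` until it returns a non-retryable status or the budget runs out."""
    attempt = 0  # retries consumed so far
    token_refreshed = False  # tracked separately from the retry budget
    while True:
        status = send()
        if status == 401 and not token_refreshed:
            # One token refresh is allowed and does not count as a retry.
            refresh_token()
            token_refreshed = True
            continue
        if should_retry(status, attempt, config):
            attempt += 1
            sleep(min(config.base_delay * 2 ** (attempt - 1), config.max_delay))
            continue
        return status
```

Under these assumptions, a 401 followed by two 429s and then a 200 still succeeds with the default `max_retries=2`, because the refresh is tracked by its own flag.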

Examples

from colony_sdk import ColonyClient, RetryConfig

# Disable retries entirely — fail fast
client = ColonyClient("col_...", retry=RetryConfig(max_retries=0))

# Aggressive for a flaky network
client = ColonyClient(
    "col_...",
    retry=RetryConfig(max_retries=5, base_delay=0.5, max_delay=30.0),
)

# Also retry 500 in addition to defaults
client = ColonyClient(
    "col_...",
    retry=RetryConfig(retry_on=frozenset({429, 500, 502, 503, 504})),
)

Test plan

  • 14 new sync tests (TestRetryConfig) covering: defaults, frozen-ness, custom config wiring, max_retries=0, custom max_retries, default 503 retry, default 500 no-retry, custom retry_on, exponential backoff math, max_delay capping, Retry-After override, mixed 429/503 retry-then-success, token refresh not consuming retry budget
  • 7 new async tests (TestAsyncRetryConfig) mirroring the same scenarios
  • All existing tests still pass; the public API is fully backward compatible (only the default retry behaviour changes, as noted above)
  • Package coverage stays at 100% (463/463 statements)
  • ruff check / ruff format --check / mypy src/ all clean
  • CI green on Python 3.10 / 3.11 / 3.12 / 3.13

Follow-up (next PRs, not this one)

After release, crewai-colony and langchain-colony can delete their custom RetryConfig dataclasses and pass the SDK's RetryConfig straight through to the underlying client.

Adds a frozen RetryConfig dataclass that callers pass via:

  ColonyClient(api_key, retry=RetryConfig(...))
  AsyncColonyClient(api_key, retry=RetryConfig(...))

Fields:
  max_retries  — number of retries after the initial attempt (default 2)
  base_delay   — base backoff in seconds (default 1.0)
  max_delay    — cap on per-retry delay (default 10.0)
  retry_on     — frozenset of statuses that trigger retry
                 (default {429, 502, 503, 504})

The Nth retry waits min(base_delay * 2**(N-1), max_delay), unless the
server provides a Retry-After header which always overrides.

Why: downstream packages (langchain-colony, crewai-colony) currently
re-implement retry logic on top of the SDK with their own RetryConfig
clones. That logic belongs in one place — here. With this change they
can delete their wrappers and just pass `retry=` through.

Behavior change: 5xx gateway errors (502/503/504) are now retried by
default. They almost always represent transient infra issues that
clear on retry. 500 is intentionally NOT retried by default — it
usually signals a bug in the request, not a transient failure, so
retrying just amplifies the problem. Opt back into the old behaviour
with `RetryConfig(retry_on=frozenset({429}))`.

Internals:
- _should_retry(status, attempt, config) and
  _compute_retry_delay(attempt, config, retry_after_header) helpers
  shared by sync + async _raw_request paths
- _raw_request signature gains a separate `_token_refreshed` flag so
  the 401-refresh path doesn't consume the configurable retry budget
- ColonyClient/AsyncColonyClient gain a `retry` attribute (the
  RetryConfig instance, defaults to RetryConfig() if None passed)

Tests: 14 new sync + 7 new async tests covering defaults, max_retries=0,
custom retry_on, exponential backoff, max_delay capping, Retry-After
override, mixed 429/503 retry, and the token-refresh isolation. Coverage
stays at 100% (463/463 statements).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@jackparnell jackparnell merged commit f57cb6b into main Apr 9, 2026
7 checks passed
ColonistOne added a commit that referenced this pull request Apr 9, 2026
Three changes that ship together so v1.5.0 can be the first release cut
via the new automation:

1. Release workflow at .github/workflows/release.yml — triggered on
   `v*` tag push. Stages:
     - test:           runs ruff, mypy, pytest before anything else
     - build:          builds wheel + sdist, refuses to proceed if
                       the tag version doesn't match pyproject.toml
     - publish:        uploads to PyPI via OIDC trusted publishing
                       (no API token stored anywhere — short-lived
                       token minted by PyPI from the GitHub Actions
                       OIDC identity at publish time)
     - github-release: extracts the matching CHANGELOG section and
                       creates a GitHub Release with the wheel + sdist
                       attached

2. Version bump 1.4.0 → 1.5.0 in pyproject.toml and __init__.py.

3. CHANGELOG: consolidated the 1.5.0 section into a clean, ordered
   summary covering everything that's landed since 1.4.0:
     - AsyncColonyClient (PR #18)
     - Typed error hierarchy (PR #19)
     - RetryConfig + 5xx default retry (PR #20)
     - py.typed + verify_webhook + Dependabot (PR #21)
     - Pagination iterators (PR #23)
     - Coverage + Codecov (PR #17)
     - This release automation

Coverage at 100% (514/514 statements). 215 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>