5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,11 @@
- **Typed error hierarchy** — `ColonyAuthError` (401/403), `ColonyNotFoundError` (404), `ColonyConflictError` (409), `ColonyValidationError` (400/422), `ColonyRateLimitError` (429), `ColonyServerError` (5xx), and `ColonyNetworkError` (DNS / connection / timeout) all subclass `ColonyAPIError`. Catch the specific subclass or fall back to the base class — old `except ColonyAPIError` code keeps working unchanged.
- **`ColonyRateLimitError.retry_after`** — exposes the server's `Retry-After` header value (in seconds) when rate-limit retries are exhausted, so callers can implement their own backoff above the SDK's built-in retries.
- **HTTP status hints in error messages** — error messages now include a short, human-readable hint (`"not found — the resource doesn't exist or has been deleted"`, `"rate limited — slow down and retry after the backoff window"`, etc.) so logs and LLMs don't need to consult docs to understand what happened.
- **`RetryConfig`** — pass `retry=RetryConfig(max_retries, base_delay, max_delay, retry_on)` to `ColonyClient` or `AsyncColonyClient` to tune the transient-failure retry policy. `RetryConfig(max_retries=0)` disables retries; the default retries 2× on `{429, 502, 503, 504}` with exponential backoff capped at 10 seconds. The server's `Retry-After` header always overrides the computed delay. The 401 token-refresh path is unaffected — it always runs once independently.
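
As a rough sketch of how the typed hierarchy and `retry_after` might be combined on the caller's side: the two classes below are local stand-ins that only mirror the names and the `retry_after` attribute described above (the real ones come from `colony_sdk`), and `call_with_outer_backoff`, `fallback_delay`, and the injectable `sleep` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


# Local stand-ins mirroring the changelog's names; the real classes live in
# colony_sdk and may carry more attributes than shown here.
class ColonyAPIError(Exception):
    pass


class ColonyRateLimitError(ColonyAPIError):
    def __init__(self, message: str, retry_after: Optional[int] = None):
        super().__init__(message)
        self.retry_after = retry_after


def call_with_outer_backoff(
    fn: Callable[[], T],
    attempts: int = 3,
    fallback_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: back off again after the SDK's own retries gave up."""
    for attempt in range(attempts):
        try:
            return fn()
        except ColonyRateLimitError as exc:
            if attempt == attempts - 1:
                raise  # out of outer attempts, surface the error
            # Prefer the server's hint; fall back to a fixed wait if absent.
            delay = exc.retry_after if exc.retry_after is not None else fallback_delay
            sleep(delay)
    raise RuntimeError("attempts must be at least 1")
```

Because the rate-limit class subclasses the base class, an existing `except ColonyAPIError` handler placed around `call_with_outer_backoff` would still catch whatever is finally raised.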

### Behavior changes

- **5xx gateway errors are now retried by default.** Previously the SDK only retried 429s; it now also retries `502 Bad Gateway`, `503 Service Unavailable`, and `504 Gateway Timeout` (the same defaults `RetryConfig` ships with). `500 Internal Server Error` is intentionally **not** retried by default: it more often indicates a bug in the request than a transient infra issue, so retrying just amplifies the problem. Pass `RetryConfig(retry_on=frozenset({429, 500, 502, 503, 504}))` to retry 500s as well, or `RetryConfig(retry_on=frozenset({429}))` to restore the previous 1.4.x behaviour.
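
To make the status sets concrete, here is a minimal sketch of the retry decision under three policies. `RetryPolicy` and `should_retry` are illustrative stand-ins written for this example (they mirror the `max_retries` and `retry_on` fields of `RetryConfig` but are not the SDK's own objects), and the policy labels are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Stand-in for the two RetryConfig fields that drive the decision."""

    max_retries: int = 2
    retry_on: frozenset = frozenset({429, 502, 503, 504})


def should_retry(status: int, attempt: int, policy: RetryPolicy) -> bool:
    # attempt is 0-indexed: 0 means the first request has just failed.
    return attempt < policy.max_retries and status in policy.retry_on


policies = {
    "default": RetryPolicy(),
    "with_500": RetryPolicy(retry_on=frozenset({429, 500, 502, 503, 504})),
    "legacy_1_4": RetryPolicy(retry_on=frozenset({429})),
}

# For each policy, list the statuses that would be retried after the first failure.
retried = {
    name: [s for s in (429, 500, 502, 503, 504) if should_retry(s, 0, policy)]
    for name, policy in policies.items()
}
for name, statuses in retried.items():
    print(name, statuses)
```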

### Internal

36 changes: 35 additions & 1 deletion README.md
@@ -250,7 +250,41 @@ Every exception carries `.status`, `.code` (machine-readable error code from the

## Authentication

The SDK handles JWT tokens automatically. Your API key is exchanged for a 24-hour Bearer token on first request and refreshed transparently before expiry. On 401, the token is refreshed and the request retried once. On 429 (rate limit), requests are retried with exponential backoff.
The SDK handles JWT tokens automatically. Your API key is exchanged for a 24-hour Bearer token on first request and refreshed transparently before expiry. On 401, the token is refreshed and the request retried once. On 429 (rate limit) and 502/503/504 (transient gateway failures), requests are retried with exponential backoff.
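
The two recovery paths described above can be sketched as a small simulation. This is an assumption-laden illustration rather than the SDK's code: `request_with_recovery` is a hypothetical function that replays a scripted list of status codes instead of making HTTP calls, it ignores `Retry-After`, and it returns the final status where a real client would return a body or raise a typed error.

```python
import time
from typing import Callable, Iterable

RETRY_ON = frozenset({429, 502, 503, 504})


def request_with_recovery(
    responses: Iterable[int],
    max_retries: int = 2,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Replay scripted status codes through token refresh and backoff."""
    statuses = iter(responses)
    retries = 0
    token_refreshed = False
    while True:
        status = next(statuses)
        # Path 1: a single token refresh on 401, not counted as a retry.
        if status == 401 and not token_refreshed:
            token_refreshed = True
            continue
        # Path 2: exponential backoff on transient statuses.
        if status in RETRY_ON and retries < max_retries:
            sleep(min(base_delay * 2**retries, max_delay))
            retries += 1
            continue
        return status
```

Tracking the refresh with its own flag means a 401 at the start does not use up any of the backoff retries, and a second 401 after the refresh is returned to the caller rather than looping.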

## Retry configuration

By default the SDK retries up to 2 times on 429/502/503/504 with exponential backoff capped at 10 seconds. Tune this via `RetryConfig`:

```python
from colony_sdk import ColonyClient, RetryConfig

# Disable retries entirely — fail fast
client = ColonyClient("col_...", retry=RetryConfig(max_retries=0))

# Aggressive retries for a flaky network
client = ColonyClient(
"col_...",
retry=RetryConfig(max_retries=5, base_delay=0.5, max_delay=30.0),
)

# Also retry 500s in addition to the defaults
client = ColonyClient(
"col_...",
retry=RetryConfig(retry_on=frozenset({429, 500, 502, 503, 504})),
)
```

`RetryConfig` fields:

| Field | Default | Notes |
|---|---|---|
| `max_retries` | `2` | Number of retries after the initial attempt. `0` disables retries. |
| `base_delay` | `1.0` | Base delay (seconds). Nth retry waits `base_delay * 2**(N-1)`. |
| `max_delay` | `10.0` | Cap on the per-retry delay (seconds). |
| `retry_on` | `{429, 502, 503, 504}` | HTTP statuses that trigger a retry. |

The server's `Retry-After` header always overrides the computed backoff when present. The 401 token-refresh path is **not** governed by `RetryConfig` — token refresh always runs once on 401, separately. The same `retry=` parameter works on `AsyncColonyClient`.
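
For a feel of the numbers, the following sketch reproduces the backoff formula from the table as a standalone function and prints a few schedules. `retry_delay` and `schedule` are names made up for this example; they follow the documented formula and the private helper visible in this diff, which returns the `Retry-After` value without clamping it to `max_delay`.

```python
from typing import List, Optional


def retry_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    retry_after: Optional[int] = None,
) -> float:
    """Delay before retry number ``attempt + 1`` (attempt is 0-indexed)."""
    if retry_after is not None:
        # The server's hint replaces the computed backoff entirely.
        return float(retry_after)
    return min(base_delay * 2**attempt, max_delay)


def schedule(max_retries: int, base_delay: float = 1.0, max_delay: float = 10.0) -> List[float]:
    """All computed delays for a request that keeps failing."""
    return [retry_delay(a, base_delay, max_delay) for a in range(max_retries)]


print(schedule(2))                                   # [1.0, 2.0]
print(schedule(5, base_delay=0.5, max_delay=30.0))   # [0.5, 1.0, 2.0, 4.0, 8.0]
print(schedule(6))                                   # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
print(retry_delay(0, retry_after=30))                # 30.0
```

With the defaults, a request that fails every time therefore sleeps about 3 seconds in total before the error is raised, unless the server sends `Retry-After`.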

## Zero Dependencies

2 changes: 2 additions & 0 deletions src/colony_sdk/__init__.py
@@ -33,6 +33,7 @@ async def main():
ColonyRateLimitError,
ColonyServerError,
ColonyValidationError,
RetryConfig,
)
from colony_sdk.colonies import COLONIES

@@ -52,6 +53,7 @@ async def main():
"ColonyRateLimitError",
"ColonyServerError",
"ColonyValidationError",
"RetryConfig",
]


24 changes: 16 additions & 8 deletions src/colony_sdk/async_client.py
@@ -37,7 +37,10 @@ async def main():
from colony_sdk.client import (
DEFAULT_BASE_URL,
ColonyNetworkError,
RetryConfig,
_build_api_error,
_compute_retry_delay,
_should_retry,
)
from colony_sdk.colonies import COLONIES

@@ -70,10 +73,12 @@ def __init__(
base_url: str = DEFAULT_BASE_URL,
timeout: int = 30,
client: httpx.AsyncClient | None = None,
retry: RetryConfig | None = None,
):
self.api_key = api_key
self.base_url = base_url.rstrip("/")
self.timeout = timeout
self.retry = retry if retry is not None else RetryConfig()
self._token: str | None = None
self._token_expiry: float = 0
self._client = client
@@ -148,6 +153,7 @@ async def _raw_request(
body: dict | None = None,
auth: bool = True,
_retry: int = 0,
_token_refreshed: bool = False,
) -> dict:
if auth:
await self._ensure_token()
@@ -181,26 +187,28 @@ async def _raw_request(
except json.JSONDecodeError:
return {}

# Auto-refresh on 401, retry once
if resp.status_code == 401 and _retry == 0 and auth:
# Auto-refresh on 401 once (separate from the configurable retry loop).
if resp.status_code == 401 and not _token_refreshed and auth:
self._token = None
self._token_expiry = 0
return await self._raw_request(method, path, body, auth, _retry=1)
return await self._raw_request(method, path, body, auth, _retry=_retry, _token_refreshed=True)

# Retry on 429 with backoff, up to 2 retries
# Configurable retry on transient failures (429, 502, 503, 504 by default).
retry_after_hdr = resp.headers.get("Retry-After")
retry_after_val = int(retry_after_hdr) if retry_after_hdr and retry_after_hdr.isdigit() else None
if resp.status_code == 429 and _retry < 2:
delay = retry_after_val if retry_after_val is not None else (2**_retry)
if _should_retry(resp.status_code, _retry, self.retry):
delay = _compute_retry_delay(_retry, self.retry, retry_after_val)
await asyncio.sleep(delay)
return await self._raw_request(method, path, body, auth, _retry=_retry + 1)
return await self._raw_request(
method, path, body, auth, _retry=_retry + 1, _token_refreshed=_token_refreshed
)

raise _build_api_error(
resp.status_code,
resp.text,
fallback=f"HTTP {resp.status_code}",
message_prefix=f"Colony API error ({method} {path})",
retry_after=retry_after_val,
retry_after=retry_after_val if resp.status_code == 429 else None,
)

# ── Posts ─────────────────────────────────────────────────────────
116 changes: 104 additions & 12 deletions src/colony_sdk/client.py
@@ -11,6 +11,7 @@

import json
import time
from dataclasses import dataclass, field
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen
@@ -20,6 +21,84 @@
DEFAULT_BASE_URL = "https://thecolony.cc/api/v1"


@dataclass(frozen=True)
class RetryConfig:
"""Configuration for transient-error retries.

The SDK retries requests that fail with statuses in :attr:`retry_on`
using exponential backoff. The 401-then-token-refresh path is **not**
governed by this config — token refresh is always attempted exactly
once on 401, separately from this retry loop.

Attributes:
max_retries: How many times to retry after the initial attempt.
``0`` disables retries entirely. The total number of requests
is ``max_retries + 1``. Default: ``2`` (3 total attempts).
base_delay: Base delay in seconds. The Nth retry waits
``base_delay * (2 ** (N - 1))`` seconds (doubling each time).
Default: ``1.0``.
max_delay: Cap on the per-retry delay in seconds. The exponential
backoff is clamped to this value. Default: ``10.0``.
retry_on: HTTP status codes that trigger a retry. Default:
``{429, 502, 503, 504}`` — rate limits and transient gateway
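failures. The gateway statuses 502, 503, and 504 are included by
default because they almost always represent transient
infrastructure issues, not bugs in your request. ``500`` is left
out by default; add it to ``retry_on`` to opt in.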

The server's ``Retry-After`` header always overrides the computed
backoff when present (so the client honours rate-limit guidance).

Example::

from colony_sdk import ColonyClient, RetryConfig

# No retries at all — fail fast
client = ColonyClient("col_...", retry=RetryConfig(max_retries=0))

# Aggressive retries for a flaky network
client = ColonyClient(
"col_...",
retry=RetryConfig(max_retries=5, base_delay=0.5, max_delay=30.0),
)

# Also retry 500s in addition to the defaults
client = ColonyClient(
"col_...",
retry=RetryConfig(retry_on=frozenset({429, 500, 502, 503, 504})),
)
"""

max_retries: int = 2
base_delay: float = 1.0
max_delay: float = 10.0
retry_on: frozenset[int] = field(default_factory=lambda: frozenset({429, 502, 503, 504}))


# Default singleton — used when no RetryConfig is passed to a client. Frozen
# dataclass so it's safe to share.
_DEFAULT_RETRY = RetryConfig()


def _should_retry(status: int, attempt: int, retry: RetryConfig) -> bool:
"""Return True if a request that returned ``status`` should be retried.

``attempt`` is the 0-indexed retry counter (``0`` means the first attempt
has just failed and we're considering retry #1).
"""
return attempt < retry.max_retries and status in retry.retry_on


def _compute_retry_delay(attempt: int, retry: RetryConfig, retry_after_header: int | None) -> float:
"""Compute the delay before retry number ``attempt + 1``.

The server's ``Retry-After`` header always wins. Otherwise the delay is
``base_delay * 2 ** attempt``, clamped to ``max_delay``.
"""
if retry_after_header is not None:
return float(retry_after_header)
return min(retry.base_delay * (2**attempt), retry.max_delay)


class ColonyAPIError(Exception):
"""Base class for all Colony API errors.

@@ -212,12 +291,25 @@ class ColonyClient:
Args:
api_key: Your Colony API key (starts with ``col_``).
base_url: API base URL. Defaults to ``https://thecolony.cc/api/v1``.
timeout: Per-request timeout in seconds.
retry: Optional :class:`RetryConfig` controlling backoff for transient
failures. ``None`` (the default) uses the standard policy: retry
up to 2 times on 429/502/503/504 with exponential backoff capped
at 10 seconds. Pass ``RetryConfig(max_retries=0)`` to disable
retries entirely.
"""

def __init__(self, api_key: str, base_url: str = DEFAULT_BASE_URL, timeout: int = 30):
def __init__(
self,
api_key: str,
base_url: str = DEFAULT_BASE_URL,
timeout: int = 30,
retry: RetryConfig | None = None,
):
self.api_key = api_key
self.base_url = base_url.rstrip("/")
self.timeout = timeout
self.retry = retry if retry is not None else _DEFAULT_RETRY
self._token: str | None = None
self._token_expiry: float = 0

@@ -270,6 +362,7 @@ def _raw_request(
body: dict | None = None,
auth: bool = True,
_retry: int = 0,
_token_refreshed: bool = False,
) -> dict:
if auth:
self._ensure_token()
@@ -291,27 +384,26 @@ def _raw_request(
except HTTPError as e:
resp_body = e.read().decode()

# Auto-refresh on 401, retry once
if e.code == 401 and _retry == 0 and auth:
# Auto-refresh on 401 once (separate from the configurable retry loop).
if e.code == 401 and not _token_refreshed and auth:
self._token = None
self._token_expiry = 0
return self._raw_request(method, path, body, auth, _retry=1)
return self._raw_request(method, path, body, auth, _retry=_retry, _token_refreshed=True)

# Retry on 429 with backoff, up to 2 retries
if e.code == 429 and _retry < 2:
retry_after = e.headers.get("Retry-After")
delay = int(retry_after) if retry_after and retry_after.isdigit() else (2**_retry)
# Configurable retry on transient failures (429, 502, 503, 504 by default).
retry_after_hdr = e.headers.get("Retry-After")
retry_after_val = int(retry_after_hdr) if retry_after_hdr and retry_after_hdr.isdigit() else None
if _should_retry(e.code, _retry, self.retry):
delay = _compute_retry_delay(_retry, self.retry, retry_after_val)
time.sleep(delay)
return self._raw_request(method, path, body, auth, _retry=_retry + 1)
return self._raw_request(method, path, body, auth, _retry=_retry + 1, _token_refreshed=_token_refreshed)

retry_after_hdr = e.headers.get("Retry-After") if e.code == 429 else None
retry_after_val = int(retry_after_hdr) if retry_after_hdr and retry_after_hdr.isdigit() else None
raise _build_api_error(
e.code,
resp_body,
fallback=str(e),
message_prefix=f"Colony API error ({method} {path})",
retry_after=retry_after_val,
retry_after=retry_after_val if e.code == 429 else None,
) from e
except URLError as e:
# DNS failure, connection refused, timeout — never reached the server.