# Reliability Features
Detailed explanation of ReliAPI's reliability features and how they work for both HTTP and LLM targets.
## Retries

ReliAPI automatically retries failed requests based on error class:
- 429 (Rate Limit): Retries with exponential backoff, respecting the `Retry-After` header
- 5xx (Server Error): Retries transient server errors
- Network Errors: Retries timeouts and connection errors
- Key Pool Fallback: On 429/5xx errors, automatically retries with a different key from the pool (up to 3 key switches)
```yaml
retry_matrix:
  "429":
    attempts: 3
    backoff: "exp-jitter"
    base_s: 1.0
    max_s: 60.0
  "5xx":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
  "net":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
```

Backoff strategies:

- `exp-jitter`: Exponential backoff with jitter (recommended; sketched below)
- `linear`: Linear backoff
- HTTP Targets: Retries apply to all HTTP methods
- LLM Targets: Retries apply to LLM API calls
- Non-Retryable: 4xx errors (except 429) are not retried
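To make the schedule concrete, here is a minimal sketch of an exp-jitter delay calculation that honors `Retry-After`; the function name and the full-jitter formula are assumptions for illustration, not ReliAPI's exact code:

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, max_s: float = 60.0,
                  retry_after: float | None = None) -> float:
    """Delay in seconds before retry `attempt` (0-based), capped at max_s."""
    if retry_after is not None:
        # A Retry-After header takes precedence over the computed backoff.
        return min(retry_after, max_s)
    # "Full jitter": uniform delay in [0, base_s * 2**attempt], capped.
    ceiling = min(max_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)

# Example: three attempts against a 429 without a Retry-After header.
print([round(backoff_delay(a), 2) for a in range(3)])
```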
## Circuit Breaker

The circuit breaker prevents cascading failures by opening the circuit after a threshold of failures:

- Closed: Normal operation, requests pass through
- Open: After N consecutive failures the circuit opens and requests fail fast
- Half-Open: After the cooldown, test requests are allowed
- Closed: If a test request succeeds, the circuit closes again
```yaml
circuit:
  error_threshold: 5   # Open after 5 consecutive failures
  cooldown_s: 60       # Stay open for 60 seconds
```

- Per-Target: Each target has its own circuit breaker
- HTTP Targets: Opens on HTTP errors (5xx, timeouts)
- LLM Targets: Opens on LLM API errors
- Fast Fail: When open, requests fail immediately without an upstream call
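The state machine above can be sketched in a few lines; the class and method names here are invented for illustration, not ReliAPI's internal API:

```python
import time

class CircuitBreaker:
    """Sketch of the closed -> open -> half-open -> closed cycle."""

    def __init__(self, error_threshold: int = 5, cooldown_s: float = 60.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                        # closed: pass through
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                        # half-open: allow a test request
        return False                           # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                  # test succeeded: close circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.error_threshold:
            self.opened_at = time.monotonic()  # threshold reached: open circuit
```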
## Caching

ReliAPI caches responses to reduce upstream calls:

- HTTP: GET/HEAD requests cached by default
- LLM: POST requests cached if enabled
- TTL-Based: Responses cached for a configured TTL
- Redis-Backed: Uses Redis for storage
```yaml
cache:
  ttl_s: 300    # Cache for 5 minutes
  enabled: true
```

Cache keys include:

- Method (GET, POST, etc.)
- URL/path
- Query parameters (sorted)
- Significant headers (Accept, Content-Type)
- Body hash (for POST requests)
- HTTP Targets: GET/HEAD cached automatically
- LLM Targets: POST cached if `cache.enabled: true`
- Cache Hit: Returns the cached response instantly
- Cache Miss: Makes the upstream request and caches the result
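A sketch of how such a composite key can be derived from those components; the exact serialization ReliAPI uses may differ:

```python
import hashlib

def cache_key(method: str, url: str, params: dict[str, str],
              headers: dict[str, str], body: bytes = b"") -> str:
    """Build a deterministic cache key from the request components."""
    significant = {k: headers.get(k, "") for k in ("Accept", "Content-Type")}
    parts = [
        method.upper(),
        url,
        "&".join(f"{k}={v}" for k, v in sorted(params.items())),  # sorted query
        "|".join(f"{k}:{v}" for k, v in sorted(significant.items())),
        hashlib.sha256(body).hexdigest(),  # body hash (relevant for POST)
    ]
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()

# Same key regardless of query parameter order:
k1 = cache_key("GET", "/v1/items", {"b": "2", "a": "1"}, {"Accept": "application/json"})
k2 = cache_key("GET", "/v1/items", {"a": "1", "b": "2"}, {"Accept": "application/json"})
assert k1 == k2
```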
## Idempotency

Idempotency ensures that duplicate requests return the same result:

- Request Registration: A request carrying an `Idempotency-Key` is registered
- Conflict Detection: If the key already exists, the request body is checked for a match
- Coalescing: Concurrent requests with the same key execute once
- Result Caching: Results are cached for the configured TTL
Use the `Idempotency-Key` header or the `idempotency_key` field:

```bash
curl -X POST http://localhost:8000/proxy/llm \
  -H "Idempotency-Key: chat-123" \
  -d '{"target": "openai", "messages": [...]}'
```

- HTTP Targets: Works for POST/PUT/PATCH requests
- LLM Targets: Works for all LLM requests
- Coalescing: Concurrent requests with the same key execute once
- Conflict: Different request bodies with the same key return an error
- TTL: Results are cached for the same TTL as the cache config
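A simplified sketch of registration and conflict detection, with an in-memory dict standing in for the Redis-backed store:

```python
import hashlib

_registry: dict[str, dict] = {}  # idempotency key -> {body_hash, result}

def register(idempotency_key: str, body: bytes) -> tuple[str, object]:
    """Return (status, cached_result) for an incoming request."""
    body_hash = hashlib.sha256(body).hexdigest()
    entry = _registry.get(idempotency_key)
    if entry is None:
        _registry[idempotency_key] = {"body_hash": body_hash, "result": None}
        return "registered", None      # first request: execute upstream
    if entry["body_hash"] != body_hash:
        return "conflict", None        # same key, different body: error
    if entry["result"] is not None:
        return "hit", entry["result"]  # duplicate: return the same result
    return "in_flight", None           # concurrent duplicate: coalesce

status, _ = register("chat-123", b'{"target": "openai"}')
assert status == "registered"
```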
Response metadata indicates whether a result came from the idempotency cache:

```json
{
  "meta": {
    "idempotent_hit": true,
    "cache_hit": false
  }
}
```

## Budget Caps

Budget caps prevent unexpected LLM costs:
- Cost Estimation: Pre-call cost estimation based on model, messages, and max_tokens
- Hard Cap Check: Rejects requests exceeding the hard cap
- Soft Cap Check: Throttles by reducing `max_tokens` if the soft cap is exceeded
- Cost Tracking: Records the actual cost in metrics
```yaml
llm:
  soft_cost_cap_usd: 0.01   # Throttle if exceeded
  hard_cost_cap_usd: 0.05   # Reject if exceeded
```

- Hard Cap: Rejects the request if the estimated cost exceeds the hard cap
- Soft Cap: Reduces `max_tokens` if the estimated cost exceeds the soft cap
- Cost Estimation: Uses approximate pricing tables
- Cost Tracking: Records the actual cost in `meta.cost_usd`
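A sketch of the two cap checks under the config above; the proportional `max_tokens` reduction is an assumed throttling rule, and real pricing tables are not reproduced:

```python
def apply_cost_policy(estimated_usd: float, max_tokens: int,
                      soft_cap: float = 0.01, hard_cap: float = 0.05):
    """Reject over the hard cap, throttle over the soft cap."""
    if estimated_usd > hard_cap:
        raise ValueError("budget_error: estimated cost exceeds hard cap")
    if estimated_usd > soft_cap:
        # Scale max_tokens down so the estimate lands at the soft cap.
        scale = soft_cap / estimated_usd
        return max(1, int(max_tokens * scale)), "soft_cap_throttled"
    return max_tokens, None

tokens, policy = apply_cost_policy(estimated_usd=0.012, max_tokens=2000)
assert policy == "soft_cap_throttled" and tokens < 2000
```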
Response metadata records the estimate, the actual cost, and any applied policy:

```json
{
  "meta": {
    "cost_estimate_usd": 0.012,
    "cost_usd": 0.011,
    "cost_policy_applied": "soft_cap_throttled",
    "max_tokens_reduced": true,
    "original_max_tokens": 2000
  }
}
```

## Error Normalization

All errors are normalized to a unified format:
```json
{
  "success": false,
  "error": {
    "type": "upstream_error",
    "code": "TIMEOUT",
    "message": "Request timed out",
    "retryable": true,
    "target": "openai",
    "status_code": 504
  },
  "meta": {
    "target": "openai",
    "retries": 2,
    "duration_ms": 20000
  }
}
```

Error types:

- `client_error`: Client errors (4xx, invalid request)
- `upstream_error`: Upstream errors (5xx, timeout)
- `budget_error`: Budget errors (cost cap exceeded)
- `internal_error`: Internal errors (configuration, adapter)
- No Raw Stacktraces: Errors never expose internal stacktraces
- Retryable Flag: Indicates whether the error is retryable
- Consistent Format: All errors follow the same structure
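A sketch of mapping raw exceptions into this shape; the classification rules shown are illustrative, not the proxy's actual tables:

```python
def normalize_error(exc: Exception, target: str) -> dict:
    """Map an exception to the unified error format, hiding stacktraces."""
    if isinstance(exc, TimeoutError):
        err = {"type": "upstream_error", "code": "TIMEOUT",
               "message": "Request timed out", "retryable": True,
               "status_code": 504}
    elif isinstance(exc, ValueError):
        err = {"type": "client_error", "code": "INVALID_REQUEST",
               "message": str(exc), "retryable": False, "status_code": 400}
    else:
        # Never leak internals: a generic message, no stacktrace.
        err = {"type": "internal_error", "code": "UNEXPECTED",
               "message": "Internal error", "retryable": False,
               "status_code": 500}
    err["target"] = target
    return {"success": False, "error": err, "meta": {"target": target}}
```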
## Fallback Chains

Fallback chains provide automatic failover:

- Primary Target: Try the primary target first
- Failure Detection: If the primary fails, try the fallback targets
- Sequential Fallback: Try fallbacks in order
- Success: Return the first successful response
```yaml
targets:
  openai:
    base_url: "https://api.openai.com/v1"
    fallback_targets: ["anthropic", "mistral"]
```

- HTTP Targets: Fall back to backup HTTP APIs
- LLM Targets: Fall back to backup LLM providers
- Sequential: Tries fallbacks in order
- Metadata: Includes `fallback_used` and `fallback_target` in meta
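A sketch of the sequential failover loop; `send(target)` is a hypothetical callable standing in for the upstream adapter:

```python
def call_with_fallback(primary: str, fallbacks: list[str], send) -> dict:
    """Try the primary, then each fallback in order; return the first success."""
    last_error: Exception | None = None
    for i, target in enumerate([primary, *fallbacks]):
        try:
            response = send(target)
            response.setdefault("meta", {})
            response["meta"]["fallback_used"] = i > 0
            if i > 0:
                response["meta"]["fallback_target"] = target
            return response
        except Exception as exc:  # this target failed: move to the next one
            last_error = exc
    raise last_error  # every target failed
```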
## Provider Key Pool

The Provider Key Pool Manager manages multiple API keys per provider with health tracking:

- Key Selection: Selects the best key by load score (`current_qps / qps_limit` plus an error penalty)
- Health Tracking: Tracks error scores, consecutive errors, and key status
- Status Transitions: Keys transition active → degraded (5 errors) → exhausted (10 errors)
- Automatic Recovery: Degraded keys return to active when their error score decreases
- Fallback: On 429/5xx errors, automatically retries with a different key from the pool
```yaml
provider_key_pools:
  openai:
    keys:
      - id: "openai-main-1"
        api_key: "env:OPENAI_KEY_1"
        qps_limit: 3
      - id: "openai-main-2"
        api_key: "env:OPENAI_KEY_2"
        qps_limit: 3
```

- Backward Compatible: Falls back to `targets.auth` if no key pool is configured
- Health-Based Selection: Always selects the healthiest key with the lowest load
- Automatic Penalties: 429 errors add 0.1 to the error score, 5xx errors add 0.05
- Metrics: Exports metrics per provider_key_id (requests, errors, QPS, status)
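Selection by load score can be sketched directly from the formula quoted above (`current_qps / qps_limit` plus the error penalty); the field names mirror the config, and the rest is illustrative:

```python
def pick_key(keys: list[dict]) -> dict:
    """Pick the non-exhausted key with the lowest load score."""
    usable = [k for k in keys if k["status"] != "exhausted"]
    return min(usable,
               key=lambda k: k["current_qps"] / k["qps_limit"] + k["error_score"])

keys = [
    {"id": "openai-main-1", "current_qps": 2.5, "qps_limit": 3,
     "error_score": 0.2, "status": "degraded"},   # score ~ 1.03
    {"id": "openai-main-2", "current_qps": 1.0, "qps_limit": 3,
     "error_score": 0.0, "status": "active"},     # score ~ 0.33
]
assert pick_key(keys)["id"] == "openai-main-2"
```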
## Rate Scheduler

The Rate Scheduler uses a token bucket algorithm to smooth bursts and enforce rate limits:

- Token Buckets: Separate buckets for provider key, tenant, and client profile
- Rate Limiting: Enforces QPS limits before upstream requests
- Burst Protection: Configurable burst size for traffic smoothing
- Normalized 429: Returns stable 429 errors from ReliAPI (not the upstream)
Rate limits are configured via:

- Provider Key Pool: `qps_limit` per key
- Client Profiles: `max_qps_per_tenant`, `max_qps_per_provider_key`
- Tenant Config: `rate_limit_rpm` (legacy, in-memory)

Behavior:

- Per-Key Limits: Each provider key has its own token bucket
- Per-Tenant Limits: Each tenant has its own token bucket
- Per-Profile Limits: Each client profile can override limits
- Priority: Provider key → Tenant → Client profile (all checked)
- Normalized Errors: Returns 429 with `retry_after_s`, `provider_key_status`, and `hint`
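The token bucket itself is small enough to sketch; ReliAPI keeps one such bucket per provider key, tenant, and profile, and a request must pass all of them (the class below is illustrative):

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second up to a capacity of `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller emits the normalized 429 with retry_after_s

bucket = TokenBucket(rate=3.0, burst=2.0)  # e.g. qps_limit: 3, burst_size: 2
assert bucket.try_acquire()
```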
## Client Profiles

The Client Profile Manager provides different rate limits and behavior for different client types:

- Profile Detection: Priority is `X-Client` header → `tenant.profile` → default
- Limit Application: Applies profile limits to the rate scheduler
- Configurable: Different limits per client type (e.g., Cursor IDE vs API clients)
```yaml
client_profiles:
  cursor_default:
    max_parallel_requests: 4
    max_qps_per_tenant: 3
    max_qps_per_provider_key: 2
    burst_size: 2
    default_timeout_s: 60
```

Set the `X-Client` header in requests:

```bash
curl -X POST http://localhost:8000/proxy/llm \
  -H "X-Client: cursor" \
  -d '{"target": "openai", "messages": [...]}'
```

Or configure it per tenant:

```yaml
tenants:
  cursor_user:
    api_key: "sk-..."
    profile: "cursor_default"
```

- Header Priority: The `X-Client` header has the highest priority
- Tenant Fallback: Uses `tenant.profile` if the header is absent
- Default Profile: Falls back to the `default` profile if none is specified
- Limit Override: Profile limits override provider key limits (the minimum wins)
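The detection order reduces to a short lookup; treating the `X-Client` header value directly as a profile name is an assumption for illustration:

```python
def resolve_profile(headers: dict[str, str], tenant: dict,
                    profiles: dict[str, dict]) -> dict:
    """X-Client header, then tenant.profile, then the default profile."""
    name = headers.get("X-Client") or tenant.get("profile") or "default"
    return profiles.get(name, profiles["default"])

profiles = {"default": {"max_qps_per_tenant": 10},
            "cursor_default": {"max_qps_per_tenant": 3}}
tenant = {"profile": "cursor_default"}
assert resolve_profile({}, tenant, profiles)["max_qps_per_tenant"] == 3
```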
## Summary

All reliability features work uniformly for HTTP and LLM targets:
- Retries: Automatic retries with exponential backoff, Retry-After support, and key pool fallback
- Circuit Breaker: Per-target failure detection
- Cache: TTL cache for GET/HEAD and LLM responses
- Idempotency: Request coalescing for duplicate requests
- Budget Caps: Cost control for LLM requests (LLM only)
- Error Normalization: Unified error format
- Fallback Chains: Automatic failover to backup targets
- Provider Key Pool: Multi-key support with health tracking and automatic rotation
- Rate Smoothing: Token bucket algorithm for per-key/tenant/profile limits
- Client Profiles: Different rate limits and behavior for different client types
See also:

- Configuration — Configuration guide
- Comparison — Comparison with other tools