Reliability Features

Detailed explanation of ReliAPI's reliability features and how they work for both HTTP and LLM targets.


Retries

How It Works

ReliAPI automatically retries failed requests based on error class:

  • 429 (Rate Limit): Retries with exponential backoff, respecting Retry-After header
  • 5xx (Server Error): Retries transient server-side failures
  • Network Errors: Retries timeouts and connection errors
  • Key Pool Fallback: On 429/5xx errors, automatically retries with different key from pool (up to 3 key switches)

Configuration

retry_matrix:
  "429":
    attempts: 3
    backoff: "exp-jitter"
    base_s: 1.0
    max_s: 60.0
  "5xx":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
  "net":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0

Backoff Strategies

  • exp-jitter: Exponential backoff with jitter (recommended)
  • linear: Linear backoff
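
As an illustration, here is a minimal sketch of how an exp-jitter delay could be computed from the base_s and max_s fields above; the exact jitter formula inside ReliAPI may differ (full jitter is assumed here).

import random

def backoff_delay(attempt: int, base_s: float = 1.0, max_s: float = 60.0) -> float:
    """Exponential backoff with full jitter: the window doubles each attempt,
    is capped at max_s, and a random point inside the window is chosen."""
    window = min(max_s, base_s * (2 ** attempt))
    return random.uniform(0, window)

# Delays for the three attempts configured for "429" above.
for attempt in range(3):
    print(f"attempt {attempt}: sleep ~{backoff_delay(attempt):.2f}s")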

Behavior

  • HTTP Targets: Retries apply to all HTTP methods
  • LLM Targets: Retries apply to LLM API calls
  • Non-Retryable: 4xx errors (except 429) are not retried

Circuit Breaker

How It Works

The circuit breaker prevents cascading failures by opening the circuit after a threshold number of failures:

  1. Closed: Normal operation, requests pass through
  2. Open: Circuit opens after N consecutive failures, requests fail fast
  3. Half-Open: After cooldown, allows test requests
  4. Closed: If test succeeds, circuit closes

Configuration

circuit:
  error_threshold: 5      # Open after 5 failures
  cooldown_s: 60          # Stay open for 60 seconds

Behavior

  • Per-Target: Each target has its own circuit breaker
  • HTTP Targets: Opens on HTTP errors (5xx, timeouts)
  • LLM Targets: Opens on LLM API errors
  • Fast Fail: When open, requests fail immediately without upstream call
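
For intuition, here is a minimal sketch of the closed → open → half-open cycle using the error_threshold and cooldown_s settings above; ReliAPI's internal state tracking may differ.

import time

class CircuitBreaker:
    """Toy per-target breaker: opens after N consecutive failures, fails fast
    while open, and lets a test request through once the cooldown has passed."""

    def __init__(self, error_threshold: int = 5, cooldown_s: float = 60.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                  # closed: pass through
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                  # half-open: allow a test request
        return False                                     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                            # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.error_threshold:
            self.opened_at = time.monotonic()            # open (or re-open) the circuit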

Cache

How It Works

ReliAPI caches responses to reduce upstream calls:

  • HTTP: GET/HEAD requests cached by default
  • LLM: POST requests cached if enabled
  • TTL-Based: Responses cached for configured TTL
  • Redis-Backed: Uses Redis for storage

Configuration

cache:
  ttl_s: 300              # Cache for 5 minutes
  enabled: true

Cache Keys

Cache keys include:

  • Method (GET, POST, etc.)
  • URL/path
  • Query parameters (sorted)
  • Significant headers (Accept, Content-Type)
  • Body hash (for POST requests)
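
A sketch of how such a cache key could be derived from those fields; the hashing scheme and header whitelist here are illustrative assumptions, not ReliAPI's exact key format.

import hashlib
import json

def cache_key(method, url, params, headers, body=None):
    """Derive a deterministic cache key from the request fields listed above."""
    significant = {k: headers[k] for k in ("Accept", "Content-Type") if k in headers}
    parts = [
        method.upper(),
        url,
        json.dumps(sorted(params.items())),                # query parameters, sorted
        json.dumps(sorted(significant.items())),           # whitelisted headers only
        hashlib.sha256(body).hexdigest() if body else "",  # body hash (POST)
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

print(cache_key("GET", "https://api.example.com/v1/items",
                {"page": 1}, {"Accept": "application/json"}))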

Behavior

  • HTTP Targets: GET/HEAD cached automatically
  • LLM Targets: POST cached if cache.enabled: true
  • Cache Hit: Returns cached response instantly
  • Cache Miss: Makes upstream request and caches result

Idempotency

How It Works

Idempotency ensures that duplicate requests return the same result:

  1. Request Registration: A request carrying an Idempotency-Key is registered
  2. Conflict Detection: If the key already exists, the request body is checked against the original
  3. Coalescing: Concurrent requests with the same key execute only once
  4. Result Caching: Results are cached for the configured TTL

Usage

Use the Idempotency-Key header or the idempotency_key field:

curl -X POST http://localhost:8000/proxy/llm \
  -H "Idempotency-Key: chat-123" \
  -d '{"target": "openai", "messages": [...]}'

Behavior

  • HTTP Targets: Works for POST/PUT/PATCH requests
  • LLM Targets: Works for all LLM requests
  • Coalescing: Concurrent requests with same key execute once
  • Conflict: Different request bodies with same key return error
  • TTL: Results cached for same TTL as cache config
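
A rough sketch of the registration and conflict-detection steps, backed by Redis as described above; the key layout and body-hash scheme are assumptions for illustration.

import hashlib
import redis

r = redis.Redis()

def register_idempotent(key: str, body: bytes, ttl_s: int = 300) -> str:
    """Register an Idempotency-Key and detect conflicts via the body hash."""
    body_hash = hashlib.sha256(body).hexdigest()
    # NX registration: only the first request with this key claims it.
    if r.set(f"idem:{key}", body_hash, nx=True, ex=ttl_s):
        return "registered"        # first request: perform the upstream call
    stored = r.get(f"idem:{key}")
    if stored is not None and stored.decode() != body_hash:
        return "conflict"          # same key, different body -> error response
    return "duplicate"             # same key and body -> serve the cached result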

Response Meta

{
  "meta": {
    "idempotent_hit": true,    # True if result from idempotency cache
    "cache_hit": false
  }
}

Budget Caps (LLM Only)

How It Works

Budget caps prevent unexpected LLM costs:

  1. Cost Estimation: Pre-call cost estimation based on model, messages, max_tokens
  2. Hard Cap Check: Rejects requests exceeding hard cap
  3. Soft Cap Check: Throttles by reducing max_tokens if soft cap exceeded
  4. Cost Tracking: Records actual cost in metrics

Configuration

llm:
  soft_cost_cap_usd: 0.01    # Throttle if exceeded
  hard_cost_cap_usd: 0.05    # Reject if exceeded

Behavior

  • Hard Cap: Rejects request if estimated cost > hard cap
  • Soft Cap: Reduces max_tokens if estimated cost > soft cap
  • Cost Estimation: Uses approximate pricing tables
  • Cost Tracking: Records actual cost in meta.cost_usd
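
A simplified sketch of how the hard and soft caps could be applied to an estimated cost; the throttling math is illustrative, not ReliAPI's actual implementation or pricing tables.

def apply_cost_caps(estimated_usd: float, max_tokens: int,
                    soft_cap: float = 0.01, hard_cap: float = 0.05):
    """Reject above the hard cap; shrink max_tokens proportionally above the soft cap."""
    if estimated_usd > hard_cap:
        raise ValueError("budget_error: estimated cost exceeds hard_cost_cap_usd")
    if estimated_usd > soft_cap:
        scale = soft_cap / estimated_usd          # throttle so the estimate lands near the soft cap
        return max(1, int(max_tokens * scale)), "soft_cap_throttled"
    return max_tokens, None

print(apply_cost_caps(estimated_usd=0.012, max_tokens=2000))  # -> (1666, 'soft_cap_throttled')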

Response Meta

{
  "meta": {
    "cost_estimate_usd": 0.012,
    "cost_usd": 0.011,
    "cost_policy_applied": "soft_cap_throttled",
    "max_tokens_reduced": true,
    "original_max_tokens": 2000
  }
}

Error Normalization

How It Works

All errors are normalized to a unified format:

{
  "success": false,
  "error": {
    "type": "upstream_error",
    "code": "TIMEOUT",
    "message": "Request timed out",
    "retryable": true,
    "target": "openai",
    "status_code": 504
  },
  "meta": {
    "target": "openai",
    "retries": 2,
    "duration_ms": 20000
  }
}

Error Types

  • client_error: Client errors (4xx, invalid request)
  • upstream_error: Upstream errors (5xx, timeout)
  • budget_error: Budget errors (cost cap exceeded)
  • internal_error: Internal errors (configuration, adapter)

Behavior

  • No Raw Stacktraces: Errors never expose internal stacktraces
  • Retryable Flag: Indicates if error is retryable
  • Consistent Format: All errors follow same structure
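
For illustration, a sketch of how upstream failures might be mapped onto this envelope; the classification rules below are assumptions based on the error types listed above.

def normalize_error(status_code, message, target):
    """Map an upstream failure onto the unified error envelope shown above."""
    if status_code is None:
        err_type, code, retryable = "upstream_error", "TIMEOUT", True
    elif status_code == 429 or status_code >= 500:
        err_type, code, retryable = "upstream_error", f"HTTP_{status_code}", True
    else:
        err_type, code, retryable = "client_error", f"HTTP_{status_code}", False
    return {
        "success": False,
        "error": {
            "type": err_type,
            "code": code,
            "message": message,
            "retryable": retryable,
            "target": target,
            "status_code": status_code,
        },
    }

print(normalize_error(None, "Request timed out", "openai")["error"]["retryable"])  # -> True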

Fallback Chains

How It Works

Fallback chains provide automatic failover:

  1. Primary Target: Try primary target first
  2. Failure Detection: If primary fails, try fallback targets
  3. Sequential Fallback: Try fallbacks in order
  4. Success: Return first successful response

Configuration

targets:
  openai:
    base_url: "https://api.openai.com/v1"
    fallback_targets: ["anthropic", "mistral"]

Behavior

  • HTTP Targets: Fallback to backup HTTP APIs
  • LLM Targets: Fallback to backup LLM providers
  • Sequential: Tries fallbacks in order
  • Metadata: Includes fallback_used and fallback_target in meta
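
Conceptually, the sequential failover amounts to a loop like the following sketch; the call_target helper is hypothetical.

def call_with_fallback(primary, fallback_targets, call_target):
    """Try the primary target, then each fallback in order; return the first success."""
    last_error = None
    for target in [primary, *fallback_targets]:
        try:
            response = call_target(target)           # hypothetical per-target call
            response.setdefault("meta", {})
            response["meta"]["fallback_used"] = target != primary
            if target != primary:
                response["meta"]["fallback_target"] = target
            return response
        except Exception as exc:                     # any failure moves on to the next target
            last_error = exc
    raise last_error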

Provider Key Pool

How It Works

The Provider Key Pool Manager manages multiple API keys per provider with health tracking:

  1. Key Selection: Selects the best key based on load score (current_qps / qps_limit + error penalty)
  2. Health Tracking: Tracks error scores, consecutive errors, and key status
  3. Status Transitions: Keys move from active → degraded (after 5 errors) → exhausted (after 10 errors)
  4. Automatic Recovery: Degraded keys recover to active when their error score decreases
  5. Fallback: On 429/5xx errors, automatically retries with a different key from the pool

Configuration

provider_key_pools:
  openai:
    keys:
      - id: "openai-main-1"
        api_key: "env:OPENAI_KEY_1"
        qps_limit: 3
      - id: "openai-main-2"
        api_key: "env:OPENAI_KEY_2"
        qps_limit: 3

Behavior

  • Backward Compatible: Falls back to targets.auth if no key pool configured
  • Health-Based Selection: Always selects healthiest key with lowest load
  • Automatic Penalties: 429 errors add 0.1 to error score, 5xx add 0.05
  • Metrics: Exports metrics per provider_key_id (requests, errors, QPS, status)
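
For illustration, key selection by load score could look like the following sketch, using the formula and penalty values mentioned above; the data structures are assumptions.

from dataclasses import dataclass

@dataclass
class PoolKey:
    id: str
    qps_limit: float
    current_qps: float = 0.0
    error_score: float = 0.0
    status: str = "active"            # active -> degraded -> exhausted

def record_error(key: PoolKey, status_code: int) -> None:
    """Penalize a key: 429s add 0.1 to the error score, 5xx add 0.05."""
    key.error_score += 0.1 if status_code == 429 else 0.05

def select_key(keys):
    """Pick the healthiest, least-loaded key: the lowest load score wins."""
    usable = [k for k in keys if k.status != "exhausted"]
    return min(usable, key=lambda k: k.current_qps / k.qps_limit + k.error_score)

pool = [PoolKey("openai-main-1", qps_limit=3, current_qps=2.5),
        PoolKey("openai-main-2", qps_limit=3, current_qps=0.5)]
print(select_key(pool).id)            # -> openai-main-2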

Rate Smoothing

How It Works

The Rate Scheduler uses a token bucket algorithm to smooth bursts and enforce rate limits:

  1. Token Buckets: Separate buckets for provider key, tenant, and client profile
  2. Rate Limiting: Enforces QPS limits before upstream requests
  3. Burst Protection: Configurable burst size for traffic smoothing
  4. Normalized 429: Returns stable 429 errors from ReliAPI (not upstream)

Configuration

Rate limits are configured via:

  • Provider Key Pool: qps_limit per key
  • Client Profiles: max_qps_per_tenant, max_qps_per_provider_key
  • Tenant Config: rate_limit_rpm (legacy, in-memory)

Behavior

  • Per-Key Limits: Each provider key has its own token bucket
  • Per-Tenant Limits: Each tenant has its own token bucket
  • Per-Profile Limits: Each client profile can override limits
  • Priority: Provider key → Tenant → Client profile (all checked)
  • Normalized Errors: Returns 429 with retry_after_s, provider_key_status, hint
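
A minimal token bucket sketch showing how a per-key QPS limit with a burst size could be enforced; parameter names mirror the config, but the implementation is illustrative.

import time

class TokenBucket:
    """Refill tokens at `qps` per second up to `burst_size`; each request takes one."""

    def __init__(self, qps: float, burst_size: int):
        self.qps = qps
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.qps)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # caller would return a normalized 429 with retry_after_s

bucket = TokenBucket(qps=3, burst_size=2)
print([bucket.try_acquire() for _ in range(4)])   # the burst of 2 passes, the rest are throttled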

Client Profiles

How It Works

The Client Profile Manager provides different rate limits and behavior for different client types:

  1. Profile Detection: Resolution order: X-Client header → tenant.profile → default profile
  2. Limit Application: Applies profile limits to the rate scheduler
  3. Configurable: Different limits per client type (e.g., Cursor IDE vs. API clients)

Configuration

client_profiles:
  cursor_default:
    max_parallel_requests: 4
    max_qps_per_tenant: 3
    max_qps_per_provider_key: 2
    burst_size: 2
    default_timeout_s: 60

Usage

Set X-Client header in requests:

curl -X POST http://localhost:8000/proxy/llm \
  -H "X-Client: cursor" \
  -d '{"target": "openai", "messages": [...]}'

Or configure per tenant:

tenants:
  cursor_user:
    api_key: "sk-..."
    profile: "cursor_default"

Behavior

  • Header Priority: X-Client header has highest priority
  • Tenant Fallback: Uses tenant.profile if header absent
  • Default Profile: Falls back to default profile if none specified
  • Limit Override: Profile limits are combined with provider key limits; the stricter (minimum) value wins
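
A sketch of the resolution order and the minimum-wins rule; the function and field names are illustrative, and the header value is used directly as the profile name here for simplicity.

def resolve_profile(headers, tenant, profiles, default="default"):
    """X-Client header wins, then tenant.profile, then the default profile."""
    name = headers.get("X-Client") or tenant.get("profile") or default
    return profiles.get(name, profiles[default])

def effective_qps(profile, provider_key_qps):
    """Combine profile and provider key limits: the stricter (minimum) value applies."""
    return min(profile["max_qps_per_provider_key"], provider_key_qps)

profiles = {
    "default": {"max_qps_per_provider_key": 5},
    "cursor_default": {"max_qps_per_provider_key": 2},
}
profile = resolve_profile({"X-Client": "cursor_default"}, {}, profiles)
print(effective_qps(profile, provider_key_qps=3))   # -> 2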

Summary

All reliability features work uniformly for HTTP and LLM targets:

  • Retries: Automatic retries with exponential backoff, Retry-After support, and key pool fallback
  • Circuit Breaker: Per-target failure detection
  • Cache: TTL cache for GET/HEAD and LLM responses
  • Idempotency: Request coalescing for duplicate requests
  • Budget Caps: Cost control for LLM requests (LLM only)
  • Error Normalization: Unified error format
  • Fallback Chains: Automatic failover to backup targets
  • Provider Key Pool: Multi-key support with health tracking and automatic rotation
  • Rate Smoothing: Token bucket algorithm for per-key/tenant/profile limits
  • Client Profiles: Different rate limits and behavior for different client types

Next Steps