Reliability Features

Detailed explanation of ReliAPI's reliability features and how they work for both HTTP and LLM targets.


Retries

How It Works

ReliAPI automatically retries failed requests based on error class:

  • 429 (Rate Limit): Retries with exponential backoff, respecting Retry-After header
  • 5xx (Server Error): Retries transient server-side failures
  • Network Errors: Retries timeouts and connection errors
  • Key Pool Fallback: On 429/5xx errors, automatically retries with different key from pool (up to 3 key switches)

Configuration

retry_matrix:
  "429":
    attempts: 3
    backoff: "exp-jitter"
    base_s: 1.0
    max_s: 60.0
  "5xx":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
  "net":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0

Backoff Strategies

  • exp-jitter: Exponential backoff with jitter (recommended)
  • linear: Linear backoff
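
As an illustration, here is a minimal sketch of how an exp-jitter delay could be computed from the base_s and max_s fields above; the exact jitter formula inside ReliAPI may differ (full jitter is assumed here).

import random

def backoff_delay(attempt: int, base_s: float = 1.0, max_s: float = 60.0) -> float:
    """Exponential backoff with full jitter: the window doubles each attempt,
    is capped at max_s, and a random point inside the window is chosen."""
    window = min(max_s, base_s * (2 ** attempt))
    return random.uniform(0, window)

# Delays for the three attempts configured for "429" above.
for attempt in range(3):
    print(f"attempt {attempt}: sleep ~{backoff_delay(attempt):.2f}s")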

Behavior

  • HTTP Targets: Retries apply to all HTTP methods
  • LLM Targets: Retries apply to LLM API calls
  • Non-Retryable: 4xx errors (except 429) are not retried

Circuit Breaker

How It Works

The circuit breaker prevents cascading failures by opening the circuit after a threshold number of failures:

  1. Closed: Normal operation, requests pass through
  2. Open: Circuit opens after N consecutive failures, requests fail fast
  3. Half-Open: After cooldown, allows test requests
  4. Closed: If test succeeds, circuit closes

Configuration

circuit:
  error_threshold: 5      # Open after 5 failures
  cooldown_s: 60          # Stay open for 60 seconds

Behavior

  • Per-Target: Each target has its own circuit breaker
  • HTTP Targets: Opens on HTTP errors (5xx, timeouts)
  • LLM Targets: Opens on LLM API errors
  • Fast Fail: When open, requests fail immediately without upstream call
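
For intuition, here is a minimal sketch of the closed → open → half-open cycle using the error_threshold and cooldown_s settings above; ReliAPI's internal state tracking may differ.

import time

class CircuitBreaker:
    """Toy per-target breaker: opens after N consecutive failures, fails fast
    while open, and lets a test request through once the cooldown has passed."""

    def __init__(self, error_threshold: int = 5, cooldown_s: float = 60.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                  # closed: pass through
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                  # half-open: allow a test request
        return False                                     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                            # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.error_threshold:
            self.opened_at = time.monotonic()            # open (or re-open) the circuit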

Cache

How It Works

ReliAPI caches responses to reduce upstream calls:

  • HTTP: GET/HEAD requests cached by default
  • LLM: POST requests cached if enabled
  • TTL-Based: Responses cached for configured TTL
  • Redis-Backed: Uses Redis for storage

Configuration

cache:
  ttl_s: 300              # Cache for 5 minutes
  enabled: true

Cache Keys

Cache keys include:

  • Method (GET, POST, etc.)
  • URL/path
  • Query parameters (sorted)
  • Significant headers (Accept, Content-Type)
  • Body hash (for POST requests)
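
A sketch of how such a cache key could be derived from those fields; the hashing scheme and header whitelist here are illustrative assumptions, not ReliAPI's exact key format.

import hashlib
import json

def cache_key(method, url, params, headers, body=None):
    """Derive a deterministic cache key from the request fields listed above."""
    significant = {k: headers[k] for k in ("Accept", "Content-Type") if k in headers}
    parts = [
        method.upper(),
        url,
        json.dumps(sorted(params.items())),                # query parameters, sorted
        json.dumps(sorted(significant.items())),           # whitelisted headers only
        hashlib.sha256(body).hexdigest() if body else "",  # body hash (POST)
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

print(cache_key("GET", "https://api.example.com/v1/items",
                {"page": 1}, {"Accept": "application/json"}))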

Behavior

  • HTTP Targets: GET/HEAD cached automatically
  • LLM Targets: POST cached if cache.enabled: true
  • Cache Hit: Returns cached response instantly
  • Cache Miss: Makes upstream request and caches result

Idempotency

How It Works

Idempotency ensures that duplicate requests return the same result:

  1. Request Registration: A request carrying an Idempotency-Key is registered
  2. Conflict Detection: If the key already exists, the request body is checked against the original
  3. Coalescing: Concurrent requests with the same key execute only once
  4. Result Caching: Results are cached for the configured TTL

Usage

Use the Idempotency-Key header or the idempotency_key field:

curl -X POST http://localhost:8000/proxy/llm \
  -H "Idempotency-Key: chat-123" \
  -d '{"target": "openai", "messages": [...]}'

Behavior

  • HTTP Targets: Works for POST/PUT/PATCH requests
  • LLM Targets: Works for all LLM requests
  • Coalescing: Concurrent requests with same key execute once
  • Conflict: Different request bodies with same key return error
  • TTL: Results cached for same TTL as cache config
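
A rough sketch of the registration and conflict-detection steps, backed by Redis as described above; the key layout and body-hash scheme are assumptions for illustration.

import hashlib
import redis

r = redis.Redis()

def register_idempotent(key: str, body: bytes, ttl_s: int = 300) -> str:
    """Register an Idempotency-Key and detect conflicts via the body hash."""
    body_hash = hashlib.sha256(body).hexdigest()
    # NX registration: only the first request with this key claims it.
    if r.set(f"idem:{key}", body_hash, nx=True, ex=ttl_s):
        return "registered"        # first request: perform the upstream call
    stored = r.get(f"idem:{key}")
    if stored is not None and stored.decode() != body_hash:
        return "conflict"          # same key, different body -> error response
    return "duplicate"             # same key and body -> serve the cached result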

Response Meta

{
  "meta": {
    "idempotent_hit": true,    # True if result from idempotency cache
    "cache_hit": false
  }
}

Budget Caps (LLM Only)

How It Works

Budget caps prevent unexpected LLM costs:

  1. Cost Estimation: Pre-call cost estimation based on model, messages, max_tokens
  2. Hard Cap Check: Rejects requests exceeding hard cap
  3. Soft Cap Check: Throttles by reducing max_tokens if soft cap exceeded
  4. Cost Tracking: Records actual cost in metrics

Configuration

llm:
  soft_cost_cap_usd: 0.01    # Throttle if exceeded
  hard_cost_cap_usd: 0.05    # Reject if exceeded

Behavior

  • Hard Cap: Rejects request if estimated cost > hard cap
  • Soft Cap: Reduces max_tokens if estimated cost > soft cap
  • Cost Estimation: Uses approximate pricing tables
  • Cost Tracking: Records actual cost in meta.cost_usd
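
A simplified sketch of how the hard and soft caps could be applied to an estimated cost; the throttling math is illustrative, not ReliAPI's actual implementation or pricing tables.

def apply_cost_caps(estimated_usd: float, max_tokens: int,
                    soft_cap: float = 0.01, hard_cap: float = 0.05):
    """Reject above the hard cap; shrink max_tokens proportionally above the soft cap."""
    if estimated_usd > hard_cap:
        raise ValueError("budget_error: estimated cost exceeds hard_cost_cap_usd")
    if estimated_usd > soft_cap:
        scale = soft_cap / estimated_usd          # throttle so the estimate lands near the soft cap
        return max(1, int(max_tokens * scale)), "soft_cap_throttled"
    return max_tokens, None

print(apply_cost_caps(estimated_usd=0.012, max_tokens=2000))  # -> (1666, 'soft_cap_throttled')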

Response Meta

{
  "meta": {
    "cost_estimate_usd": 0.012,
    "cost_usd": 0.011,
    "cost_policy_applied": "soft_cap_throttled",
    "max_tokens_reduced": true,
    "original_max_tokens": 2000
  }
}

Error Normalization

How It Works

All errors are normalized to a unified format:

{
  "success": false,
  "error": {
    "type": "upstream_error",
    "code": "TIMEOUT",
    "message": "Request timed out",
    "retryable": true,
    "target": "openai",
    "status_code": 504
  },
  "meta": {
    "target": "openai",
    "retries": 2,
    "duration_ms": 20000
  }
}

Error Types

  • client_error: Client errors (4xx, invalid request)
  • upstream_error: Upstream errors (5xx, timeout)
  • budget_error: Budget errors (cost cap exceeded)
  • internal_error: Internal errors (configuration, adapter)

Behavior

  • No Raw Stacktraces: Errors never expose internal stacktraces
  • Retryable Flag: Indicates if error is retryable
  • Consistent Format: All errors follow same structure
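
For illustration, a sketch of how upstream failures might be mapped onto this envelope; the classification rules below are assumptions based on the error types listed above.

def normalize_error(status_code, message, target):
    """Map an upstream failure onto the unified error envelope shown above."""
    if status_code is None:
        err_type, code, retryable = "upstream_error", "TIMEOUT", True
    elif status_code == 429 or status_code >= 500:
        err_type, code, retryable = "upstream_error", f"HTTP_{status_code}", True
    else:
        err_type, code, retryable = "client_error", f"HTTP_{status_code}", False
    return {
        "success": False,
        "error": {
            "type": err_type,
            "code": code,
            "message": message,
            "retryable": retryable,
            "target": target,
            "status_code": status_code,
        },
    }

print(normalize_error(None, "Request timed out", "openai")["error"]["retryable"])  # -> True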

Fallback Chains

How It Works

Fallback chains provide automatic failover:

  1. Primary Target: Try primary target first
  2. Failure Detection: If primary fails, try fallback targets
  3. Sequential Fallback: Try fallbacks in order
  4. Success: Return first successful response

Configuration

targets:
  openai:
    base_url: "https://api.openai.com/v1"
    fallback_targets: ["anthropic", "mistral"]

Behavior

  • HTTP Targets: Fallback to backup HTTP APIs
  • LLM Targets: Fallback to backup LLM providers
  • Sequential: Tries fallbacks in order
  • Metadata: Includes fallback_used and fallback_target in meta
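
Conceptually, the sequential failover amounts to a loop like the following sketch; the call_target helper is hypothetical.

def call_with_fallback(primary, fallback_targets, call_target):
    """Try the primary target, then each fallback in order; return the first success."""
    last_error = None
    for target in [primary, *fallback_targets]:
        try:
            response = call_target(target)           # hypothetical per-target call
            response.setdefault("meta", {})
            response["meta"]["fallback_used"] = target != primary
            if target != primary:
                response["meta"]["fallback_target"] = target
            return response
        except Exception as exc:                     # any failure moves on to the next target
            last_error = exc
    raise last_error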

Provider Key Pool

How It Works

The Provider Key Pool Manager manages multiple API keys per provider with health tracking:

  1. Key Selection: Selects the best key based on load score (current_qps / qps_limit + error penalty)
  2. Health Tracking: Tracks error scores, consecutive errors, and key status
  3. Status Transitions: Keys move from active → degraded (after 5 errors) → exhausted (after 10 errors)
  4. Automatic Recovery: Degraded keys recover to active when their error score decreases
  5. Fallback: On 429/5xx errors, automatically retries with a different key from the pool

Configuration

provider_key_pools:
  openai:
    keys:
      - id: "openai-main-1"
        api_key: "env:OPENAI_KEY_1"
        qps_limit: 3
      - id: "openai-main-2"
        api_key: "env:OPENAI_KEY_2"
        qps_limit: 3

Behavior

  • Backward Compatible: Falls back to targets.auth if no key pool configured
  • Health-Based Selection: Always selects healthiest key with lowest load
  • Automatic Penalties: 429 errors add 0.1 to error score, 5xx add 0.05
  • Metrics: Exports metrics per provider_key_id (requests, errors, QPS, status)
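
For illustration, key selection by load score could look like the following sketch, using the formula and penalty values mentioned above; the data structures are assumptions.

from dataclasses import dataclass

@dataclass
class PoolKey:
    id: str
    qps_limit: float
    current_qps: float = 0.0
    error_score: float = 0.0
    status: str = "active"            # active -> degraded -> exhausted

def record_error(key: PoolKey, status_code: int) -> None:
    """Penalize a key: 429s add 0.1 to the error score, 5xx add 0.05."""
    key.error_score += 0.1 if status_code == 429 else 0.05

def select_key(keys):
    """Pick the healthiest, least-loaded key: the lowest load score wins."""
    usable = [k for k in keys if k.status != "exhausted"]
    return min(usable, key=lambda k: k.current_qps / k.qps_limit + k.error_score)

pool = [PoolKey("openai-main-1", qps_limit=3, current_qps=2.5),
        PoolKey("openai-main-2", qps_limit=3, current_qps=0.5)]
print(select_key(pool).id)            # -> openai-main-2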

Rate Smoothing

How It Works

The Rate Scheduler uses a token bucket algorithm to smooth bursts and enforce rate limits:

  1. Token Buckets: Separate buckets for provider key, tenant, and client profile
  2. Rate Limiting: Enforces QPS limits before upstream requests
  3. Burst Protection: Configurable burst size for traffic smoothing
  4. Normalized 429: Returns stable 429 errors from ReliAPI (not upstream)

Configuration

Rate limits are configured via:

  • Provider Key Pool: qps_limit per key
  • Client Profiles: max_qps_per_tenant, max_qps_per_provider_key
  • Tenant Config: rate_limit_rpm (legacy, in-memory)

Behavior

  • Per-Key Limits: Each provider key has its own token bucket
  • Per-Tenant Limits: Each tenant has its own token bucket
  • Per-Profile Limits: Each client profile can override limits
  • Priority: Provider key → Tenant → Client profile (all checked)
  • Normalized Errors: Returns 429 with retry_after_s, provider_key_status, hint
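
A minimal token bucket sketch showing how a per-key QPS limit with a burst size could be enforced; parameter names mirror the config, but the implementation is illustrative.

import time

class TokenBucket:
    """Refill tokens at `qps` per second up to `burst_size`; each request takes one."""

    def __init__(self, qps: float, burst_size: int):
        self.qps = qps
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.qps)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # caller would return a normalized 429 with retry_after_s

bucket = TokenBucket(qps=3, burst_size=2)
print([bucket.try_acquire() for _ in range(4)])   # the burst of 2 passes, the rest are throttled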

Client Profiles

How It Works

The Client Profile Manager provides different rate limits and behavior for different client types:

  1. Profile Detection: Resolution order: X-Client header → tenant.profile → default profile
  2. Limit Application: Applies profile limits to the rate scheduler
  3. Configurable: Different limits per client type (e.g., Cursor IDE vs. API clients)

Configuration

client_profiles:
  cursor_default:
    max_parallel_requests: 4
    max_qps_per_tenant: 3
    max_qps_per_provider_key: 2
    burst_size: 2
    default_timeout_s: 60

Usage

Set X-Client header in requests:

curl -X POST http://localhost:8000/proxy/llm \
  -H "X-Client: cursor" \
  -d '{"target": "openai", "messages": [...]}'

Or configure per tenant:

tenants:
  cursor_user:
    api_key: "sk-..."
    profile: "cursor_default"

Behavior

  • Header Priority: X-Client header has highest priority
  • Tenant Fallback: Uses tenant.profile if header absent
  • Default Profile: Falls back to default profile if none specified
  • Limit Override: Profile limits are combined with provider key limits; the stricter (minimum) value wins
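
A sketch of the resolution order and the minimum-wins rule; the function and field names are illustrative, and the header value is used directly as the profile name here for simplicity.

def resolve_profile(headers, tenant, profiles, default="default"):
    """X-Client header wins, then tenant.profile, then the default profile."""
    name = headers.get("X-Client") or tenant.get("profile") or default
    return profiles.get(name, profiles[default])

def effective_qps(profile, provider_key_qps):
    """Combine profile and provider key limits: the stricter (minimum) value applies."""
    return min(profile["max_qps_per_provider_key"], provider_key_qps)

profiles = {
    "default": {"max_qps_per_provider_key": 5},
    "cursor_default": {"max_qps_per_provider_key": 2},
}
profile = resolve_profile({"X-Client": "cursor_default"}, {}, profiles)
print(effective_qps(profile, provider_key_qps=3))   # -> 2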

Summary

All reliability features work uniformly for HTTP and LLM targets:

  • Retries: Automatic retries with exponential backoff, Retry-After support, and key pool fallback
  • Circuit Breaker: Per-target failure detection
  • Cache: TTL cache for GET/HEAD and LLM responses
  • Idempotency: Request coalescing for duplicate requests
  • Budget Caps: Cost control for LLM requests (LLM only)
  • Error Normalization: Unified error format
  • Fallback Chains: Automatic failover to backup targets
  • Provider Key Pool: Multi-key support with health tracking and automatic rotation
  • Rate Smoothing: Token bucket algorithm for per-key/tenant/profile limits
  • Client Profiles: Different rate limits and behavior for different client types

Next Steps