
Add cache-aware routing and provider prompt-cache support #52

@stackbilt-admin

Description


Goal

Add first-class cache-aware routing to the inference router so we can take advantage of provider-side prompt/prefix caching while keeping the public API provider-agnostic.

Why

Providers increasingly expose prompt/prefix caching that can materially reduce latency and/or cost whenever long prefixes repeat across requests: agent loops, RAG pipelines, coding workflows, stable tool definitions, and stable system prompts.

The router should understand cache hints, normalize cached-token usage reporting across providers, and eventually factor cached-token pricing into routing decisions.

Providers to cover

  • OpenAI: automatic prompt caching, prompt_cache_key, prompt_cache_retention, cached token usage.
  • Anthropic: explicit cache_control breakpoints, 5m/1h ephemeral TTLs, cache read/write token usage (hint translation for the OpenAI and Anthropic cases is sketched after this list).
  • Groq: automatic prompt caching on supported models, cached token usage.
  • Cerebras: automatic prompt caching on supported models, 128-token blocks, guaranteed 5m TTL with up to 1h max, usage.prompt_tokens_details.cached_tokens.
  • Cloudflare Workers AI: prefix caching on supported models, x-session-affinity, cached token usage.
  • Cloudflare AI Gateway: existing cf-aig-cache-key / cf-aig-cache-ttl support should be reconciled with the new abstraction.
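
To make the adapter work concrete, here is a hedged sketch of how generic hints might translate for the OpenAI and Anthropic cases. Only the provider payload fields (prompt_cache_key, prompt_cache_retention, cache_control) come from the public APIs; the CacheHints shape and function names are assumptions, not the final design:

// Sketch only: provider payload fields follow the public OpenAI and
// Anthropic APIs; the surrounding shapes are hypothetical.
interface CacheHints {
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number;
}

// OpenAI: caching is automatic; the adapter can only pass routing hints.
function applyOpenAICacheHints(body: Record<string, unknown>, cache?: CacheHints) {
  if (cache?.key) body.prompt_cache_key = cache.key;             // improves cache-hit routing
  if (cache?.ttl === '24h') body.prompt_cache_retention = '24h'; // extended retention
  return body;
}

// Anthropic: caching is explicit; mark the last system block as a breakpoint.
function applyAnthropicCacheHints(system: Array<Record<string, unknown>>, cache?: CacheHints) {
  if (!cache || system.length === 0) return system;
  const ttl = cache.ttl === '1h' ? '1h' : '5m'; // Anthropic supports only 5m and 1h
  system[system.length - 1].cache_control = { type: 'ephemeral', ttl };
  return system;
}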

Proposed shape

Add request-level cache hints, for example:

cache?: {
  strategy?: 'off' | 'provider-prefix' | 'response' | 'both';
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number;
  sessionId?: string;
  cacheablePrefix?: 'auto' | 'system' | 'tools' | 'messages';
}
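
A hypothetical call site, just to show how the hints would ride along with an otherwise ordinary request (router.complete and the surrounding request shape are placeholders, not the final API):

// Hypothetical call site; only the `cache` object reflects the proposed shape.
const response = await router.complete({
  model: 'claude-sonnet-4-5',
  system: LONG_STABLE_SYSTEM_PROMPT,
  messages: [{ role: 'user', content: 'Summarize the diff.' }],
  cache: {
    strategy: 'provider-prefix',
    key: 'agent-loop:repo-42', // a stable key keeps requests on warm cache shards
    ttl: '5m',
  },
});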

Normalize response usage:

usage: {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cost: number;
  cachedInputTokens?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
  cacheWriteInputTokens?: number;
}
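
As one illustration of the parsing step, hedged normalizers for the OpenAI-style and Anthropic-style usage payloads. The raw field names come from the providers' responses; the normalizer itself is a sketch, and cost is assumed to be filled in by the pricing layer:

// OpenAI-style usage: cached reads are nested under prompt_tokens_details.
function normalizeOpenAIUsage(u: any) {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    inputTokens: u.prompt_tokens,
    outputTokens: u.completion_tokens,
    totalTokens: u.total_tokens,
    cachedInputTokens: cached,
    cacheReadInputTokens: cached, // OpenAI reports reads only, not writes
  };
}

// Anthropic-style usage: reads and writes are reported separately, and
// input_tokens excludes both, so the normalized input total re-adds them.
function normalizeAnthropicUsage(u: any) {
  const read = u.cache_read_input_tokens ?? 0;
  const write = u.cache_creation_input_tokens ?? 0;
  return {
    inputTokens: u.input_tokens + read + write,
    outputTokens: u.output_tokens,
    totalTokens: u.input_tokens + read + write + u.output_tokens,
    cachedInputTokens: read,
    cacheReadInputTokens: read,
    cacheCreationInputTokens: write,
  };
}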

Implementation notes

  • Add cache capability metadata to the model catalog (one possible shape, with the cost math it feeds, is sketched after this list).
  • Translate generic cache hints inside each provider adapter.
  • Parse provider-specific cached token fields into normalized usage.
  • Update cost tracking to account for cache read/write pricing where providers discount cached tokens.
  • Keep response/output caching separate and disabled by default.
  • Add tests per provider for request translation and usage parsing.
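
One possible shape for the catalog metadata plus the cost math it would feed; the field names and the per-million-token price convention are made up for illustration:

// Hypothetical catalog entry; prices are placeholders in USD per million tokens.
interface ModelCacheInfo {
  promptCaching: 'none' | 'automatic' | 'explicit'; // e.g. Groq vs Anthropic
  ttls?: Array<'5m' | '1h' | '24h'>;
  minPrefixTokens?: number;   // e.g. 1024 for OpenAI; Cerebras caches in 128-token blocks
  cachedInputPrice?: number;  // discounted read price, where the provider discounts
  cacheWritePrice?: number;   // surcharge for cache creation (Anthropic)
}

function computeCost(
  usage: { inputTokens: number; outputTokens: number;
           cacheReadInputTokens?: number; cacheCreationInputTokens?: number },
  p: { inputPrice: number; outputPrice: number } & ModelCacheInfo,
): number {
  const read = usage.cacheReadInputTokens ?? 0;
  const write = usage.cacheCreationInputTokens ?? 0;
  const uncached = usage.inputTokens - read - write;
  return (
    uncached * p.inputPrice +
    read * (p.cachedInputPrice ?? p.inputPrice) +
    write * (p.cacheWritePrice ?? p.inputPrice) +
    usage.outputTokens * p.outputPrice
  ) / 1_000_000;
}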

Non-goals

  • Do not promise identical cache semantics across providers.
  • Do not cache arbitrary chat responses by default.
  • Do not persist sensitive prompt content unless an explicit response-cache store is configured.

Acceptance criteria

  • LLMRequest supports provider-agnostic cache hints without breaking existing callers.
  • TokenUsage exposes normalized cached-token fields.
  • OpenAI, Anthropic, Groq, Cerebras, and Cloudflare adapters either translate cache hints or explicitly no-op with documented behavior.
  • Cost tracking and ledger paths preserve existing behavior and include cached-token metadata where available.
  • Unit tests cover cache request translation and usage parsing for each provider; a rough test shape is sketched below.
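
A rough shape for the per-provider tests, assuming vitest and the hypothetical normalizeAnthropicUsage from the parsing sketch above (the import path is a placeholder):

import { describe, it, expect } from 'vitest';
import { normalizeAnthropicUsage } from '../src/providers/anthropic'; // hypothetical path

describe('anthropic usage normalization', () => {
  it('folds cache read/write tokens into the normalized totals', () => {
    const usage = normalizeAnthropicUsage({
      input_tokens: 20,
      output_tokens: 50,
      cache_read_input_tokens: 1000,
      cache_creation_input_tokens: 0,
    });
    expect(usage.inputTokens).toBe(1020);
    expect(usage.cacheReadInputTokens).toBe(1000);
    expect(usage.cacheCreationInputTokens).toBe(0);
  });
});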
