## Goal
Add first-class cache-aware routing to the inference router so we can take advantage of provider-side prompt/prefix caching while keeping the public API provider-agnostic.
## Why
Providers increasingly expose prompt/prefix caching that can materially reduce latency and/or cost for long repeated prefixes, agent loops, RAG, coding workflows, stable tool definitions, and stable system prompts.
The router should understand cache hints, normalize cache usage, and eventually account for cached-token pricing in routing decisions.
## Providers to cover
- OpenAI: automatic prompt caching, `prompt_cache_key`, `prompt_cache_retention`, cached token usage.
- Anthropic: explicit `cache_control` breakpoints, 5m/1h ephemeral TTLs, cache read/write token usage.
- Groq: automatic prompt caching on supported models, cached token usage.
- Cerebras: automatic prompt caching on supported models, 128-token blocks, guaranteed 5m TTL with up to 1h max, `usage.prompt_tokens_details.cached_tokens`.
- Cloudflare Workers AI: prefix caching on supported models, `x-session-affinity`, cached token usage.
- Cloudflare AI Gateway: existing `cf-aig-cache-key` / `cf-aig-cache-ttl` support should be reconciled with the new abstraction.
## Proposed shape

Add request-level cache hints, for example:

```ts
cache?: {
  strategy?: 'off' | 'provider-prefix' | 'response' | 'both';
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number;
  sessionId?: string;
  cacheablePrefix?: 'auto' | 'system' | 'tools' | 'messages';
}
```
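A minimal sketch of how these hints might be typed and pre-processed before reaching the provider adapters. The `CacheHints` type name and the `normalizeTtlSeconds` helper are assumptions for illustration, not existing API:

```typescript
// Sketch only: mirrors the proposed request-level cache hints.
type CacheStrategy = 'off' | 'provider-prefix' | 'response' | 'both';

interface CacheHints {
  strategy?: CacheStrategy;
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number; // shorthand strings or raw seconds
  sessionId?: string;
  cacheablePrefix?: 'auto' | 'system' | 'tools' | 'messages';
}

// Hypothetical helper: collapse the ttl shorthand to seconds so each
// provider adapter only has to reason about one unit.
function normalizeTtlSeconds(ttl: CacheHints['ttl']): number | undefined {
  if (ttl === undefined) return undefined;
  if (typeof ttl === 'number') return ttl;
  const table: Record<'5m' | '1h' | '24h', number> = {
    '5m': 300,
    '1h': 3600,
    '24h': 86400,
  };
  return table[ttl];
}
```

Keeping the shorthand-to-seconds conversion in one place means adapters can clamp to whatever TTLs their provider actually supports without re-parsing strings.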
Normalize response usage:

```ts
usage: {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cost: number;
  cachedInputTokens?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
  cacheWriteInputTokens?: number;
}
```
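As a sketch of the normalization step, here is how the documented raw usage payloads could map onto this shape. The field names on the raw payloads come from the providers' APIs (OpenAI's `prompt_tokens_details.cached_tokens`, Anthropic's `cache_read_input_tokens` / `cache_creation_input_tokens`); the normalizer function names and the decision to fold Anthropic's separately-reported cache tokens into `inputTokens` are assumptions:

```typescript
interface NormalizedUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cachedInputTokens?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
}

// OpenAI (and OpenAI-compatible providers like Groq/Cerebras) report cached
// tokens as a subset of prompt_tokens.
function normalizeOpenAIUsage(raw: {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}): NormalizedUsage {
  return {
    inputTokens: raw.prompt_tokens,
    outputTokens: raw.completion_tokens,
    totalTokens: raw.total_tokens,
    cachedInputTokens: raw.prompt_tokens_details?.cached_tokens,
  };
}

// Anthropic reports cache reads/writes *separately* from input_tokens, so the
// normalized total has to add them back in (a design choice assumed here).
function normalizeAnthropicUsage(raw: {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}): NormalizedUsage {
  const read = raw.cache_read_input_tokens ?? 0;
  const write = raw.cache_creation_input_tokens ?? 0;
  const input = raw.input_tokens + read + write;
  return {
    inputTokens: input,
    outputTokens: raw.output_tokens,
    totalTokens: input + raw.output_tokens,
    cachedInputTokens: read,
    cacheReadInputTokens: read,
    cacheCreationInputTokens: write,
  };
}
```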
## Implementation notes
- Add cache capability metadata to the model catalog.
- Translate generic cache hints inside each provider adapter.
- Parse provider-specific cached token fields into normalized usage.
- Update cost tracking to account for cache read/write pricing where providers discount cached tokens.
- Keep response/output caching separate and disabled by default.
- Add tests per provider for request translation and usage parsing.
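The hint-translation step can be sketched for the Anthropic adapter, which needs explicit breakpoints rather than automatic caching. Only `cache_control` with `ephemeral` 5m/1h TTLs comes from Anthropic's documented API; the simplified block type, function name, and the clamp-to-5m fallback are assumptions:

```typescript
type SystemBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral'; ttl?: '5m' | '1h' };
};

// Hypothetical adapter step: when the generic hints ask for a cacheable
// system prefix, attach a cache_control breakpoint to the last system block.
function applyAnthropicCacheHints(
  system: SystemBlock[],
  hints?: {
    strategy?: 'off' | 'provider-prefix' | 'response' | 'both';
    ttl?: '5m' | '1h' | '24h' | number;
    cacheablePrefix?: 'auto' | 'system' | 'tools' | 'messages';
  },
): SystemBlock[] {
  if (!hints || hints.strategy === 'off' || hints.strategy === 'response') return system;
  if (hints.cacheablePrefix !== undefined &&
      hints.cacheablePrefix !== 'system' &&
      hints.cacheablePrefix !== 'auto') return system;
  if (system.length === 0) return system;
  // Anthropic only supports 5m/1h ephemeral TTLs; clamp everything else to 5m.
  const ttl: '5m' | '1h' = hints.ttl === '1h' ? '1h' : '5m';
  const out = system.slice();
  out[out.length - 1] = {
    ...out[out.length - 1],
    cache_control: { type: 'ephemeral', ttl },
  };
  return out;
}
```

An adapter for an automatic-caching provider (OpenAI, Groq, Cerebras) would instead map `key` to something like `prompt_cache_key` or no-op with documented behavior.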
## Non-goals
- Do not promise identical cache semantics across providers.
- Do not cache arbitrary chat responses by default.
- Do not persist sensitive prompt content unless an explicit response-cache store is configured.
## Acceptance criteria

- `LLMRequest` supports provider-agnostic cache hints without breaking existing callers.
- `TokenUsage` exposes normalized cached-token fields.
- OpenAI, Anthropic, Groq, Cerebras, and Cloudflare adapters either translate cache hints or explicitly no-op with documented behavior.
- Cost tracking and ledger paths preserve existing behavior and include cached-token metadata where available.
- Unit tests cover cache request translation and usage parsing for each provider.