
Add cache-aware routing and provider prompt-cache support #52

@stackbilt-admin

Description


Goal

Add first-class cache-aware routing to the inference router so we can take advantage of provider-side prompt/prefix caching while keeping the public API provider-agnostic.

Why

Providers increasingly expose prompt/prefix caching that can materially reduce latency and/or cost whenever long prefixes repeat across requests: agent loops, RAG pipelines, coding workflows, stable tool definitions, and stable system prompts.

The router should understand cache hints, normalize cached-token usage reporting across providers, and eventually factor cached-token pricing into routing decisions.

Providers to cover

  • OpenAI: automatic prompt caching, prompt_cache_key, prompt_cache_retention, cached token usage.
  • Anthropic: explicit cache_control breakpoints, 5m/1h ephemeral TTLs, cache read/write token usage (hint translation for the OpenAI and Anthropic cases is sketched after this list).
  • Groq: automatic prompt caching on supported models, cached token usage.
  • Cerebras: automatic prompt caching on supported models, 128-token blocks, guaranteed 5m TTL with up to 1h max, usage.prompt_tokens_details.cached_tokens.
  • Cloudflare Workers AI: prefix caching on supported models, x-session-affinity, cached token usage.
  • Cloudflare AI Gateway: existing cf-aig-cache-key / cf-aig-cache-ttl support should be reconciled with the new abstraction.
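
To make the adapter work concrete, here is a hedged sketch of how generic hints might translate for the OpenAI and Anthropic cases. Only the provider payload fields (prompt_cache_key, prompt_cache_retention, cache_control) come from the public APIs; the CacheHints shape and function names are assumptions, not the final design:

// Sketch only: provider payload fields follow the public OpenAI and
// Anthropic APIs; the surrounding shapes are hypothetical.
interface CacheHints {
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number;
}

// OpenAI: caching is automatic; the adapter can only pass routing hints.
function applyOpenAICacheHints(body: Record<string, unknown>, cache?: CacheHints) {
  if (cache?.key) body.prompt_cache_key = cache.key;             // improves cache-hit routing
  if (cache?.ttl === '24h') body.prompt_cache_retention = '24h'; // extended retention
  return body;
}

// Anthropic: caching is explicit; mark the last system block as a breakpoint.
function applyAnthropicCacheHints(system: Array<Record<string, unknown>>, cache?: CacheHints) {
  if (!cache || system.length === 0) return system;
  const ttl = cache.ttl === '1h' ? '1h' : '5m'; // Anthropic supports only 5m and 1h
  system[system.length - 1].cache_control = { type: 'ephemeral', ttl };
  return system;
}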

Proposed shape

Add request-level cache hints, for example:

cache?: {
  strategy?: 'off' | 'provider-prefix' | 'response' | 'both';
  key?: string;
  ttl?: '5m' | '1h' | '24h' | number;
  sessionId?: string;
  cacheablePrefix?: 'auto' | 'system' | 'tools' | 'messages';
}
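
A hypothetical call site, just to show how the hints would ride along with an otherwise ordinary request (router.complete and the surrounding request shape are placeholders, not the final API):

// Hypothetical call site; only the `cache` object reflects the proposed shape.
const response = await router.complete({
  model: 'claude-sonnet-4-5',
  system: LONG_STABLE_SYSTEM_PROMPT,
  messages: [{ role: 'user', content: 'Summarize the diff.' }],
  cache: {
    strategy: 'provider-prefix',
    key: 'agent-loop:repo-42', // a stable key keeps requests on warm cache shards
    ttl: '5m',
  },
});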

Normalize response usage:

usage: {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cost: number;
  cachedInputTokens?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
  cacheWriteInputTokens?: number;
}
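
As one illustration of the parsing step, hedged normalizers for the OpenAI-style and Anthropic-style usage payloads. The raw field names come from the providers' responses; the normalizer itself is a sketch, and cost is assumed to be filled in by the pricing layer:

// OpenAI-style usage: cached reads are nested under prompt_tokens_details.
function normalizeOpenAIUsage(u: any) {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    inputTokens: u.prompt_tokens,
    outputTokens: u.completion_tokens,
    totalTokens: u.total_tokens,
    cachedInputTokens: cached,
    cacheReadInputTokens: cached, // OpenAI reports reads only, not writes
  };
}

// Anthropic-style usage: reads and writes are reported separately, and
// input_tokens excludes both, so the normalized input total re-adds them.
function normalizeAnthropicUsage(u: any) {
  const read = u.cache_read_input_tokens ?? 0;
  const write = u.cache_creation_input_tokens ?? 0;
  return {
    inputTokens: u.input_tokens + read + write,
    outputTokens: u.output_tokens,
    totalTokens: u.input_tokens + read + write + u.output_tokens,
    cachedInputTokens: read,
    cacheReadInputTokens: read,
    cacheCreationInputTokens: write,
  };
}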

Implementation notes

  • Add cache capability metadata to the model catalog (one possible shape, with the cost math it feeds, is sketched after this list).
  • Translate generic cache hints inside each provider adapter.
  • Parse provider-specific cached token fields into normalized usage.
  • Update cost tracking to account for cache read/write pricing where providers discount cached tokens.
  • Keep response/output caching separate and disabled by default.
  • Add tests per provider for request translation and usage parsing.
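
One possible shape for the catalog metadata plus the cost math it would feed; the field names and the per-million-token price convention are made up for illustration:

// Hypothetical catalog entry; prices are placeholders in USD per million tokens.
interface ModelCacheInfo {
  promptCaching: 'none' | 'automatic' | 'explicit'; // e.g. Groq vs Anthropic
  ttls?: Array<'5m' | '1h' | '24h'>;
  minPrefixTokens?: number;   // e.g. 1024 for OpenAI; Cerebras caches in 128-token blocks
  cachedInputPrice?: number;  // discounted read price, where the provider discounts
  cacheWritePrice?: number;   // surcharge for cache creation (Anthropic)
}

function computeCost(
  usage: { inputTokens: number; outputTokens: number;
           cacheReadInputTokens?: number; cacheCreationInputTokens?: number },
  p: { inputPrice: number; outputPrice: number } & ModelCacheInfo,
): number {
  const read = usage.cacheReadInputTokens ?? 0;
  const write = usage.cacheCreationInputTokens ?? 0;
  const uncached = usage.inputTokens - read - write;
  return (
    uncached * p.inputPrice +
    read * (p.cachedInputPrice ?? p.inputPrice) +
    write * (p.cacheWritePrice ?? p.inputPrice) +
    usage.outputTokens * p.outputPrice
  ) / 1_000_000;
}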

Non-goals

  • Do not promise identical cache semantics across providers.
  • Do not cache arbitrary chat responses by default.
  • Do not persist sensitive prompt content unless an explicit response-cache store is configured.

Acceptance criteria

  • LLMRequest supports provider-agnostic cache hints without breaking existing callers.
  • TokenUsage exposes normalized cached-token fields.
  • OpenAI, Anthropic, Groq, Cerebras, and Cloudflare adapters either translate cache hints or explicitly no-op with documented behavior.
  • Cost tracking and ledger paths preserve existing behavior and include cached-token metadata where available.
  • Unit tests cover cache request translation and usage parsing for each provider; a rough test shape is sketched below.
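
A rough shape for the per-provider tests, assuming vitest and the hypothetical normalizeAnthropicUsage from the parsing sketch above (the import path is a placeholder):

import { describe, it, expect } from 'vitest';
import { normalizeAnthropicUsage } from '../src/providers/anthropic'; // hypothetical path

describe('anthropic usage normalization', () => {
  it('folds cache read/write tokens into the normalized totals', () => {
    const usage = normalizeAnthropicUsage({
      input_tokens: 20,
      output_tokens: 50,
      cache_read_input_tokens: 1000,
      cache_creation_input_tokens: 0,
    });
    expect(usage.inputTokens).toBe(1020);
    expect(usage.cacheReadInputTokens).toBe(1000);
    expect(usage.cacheCreationInputTokens).toBe(0);
  });
});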
