
cllama feed injection breaks Anthropic prompt cache #122

@mostlydev

Description


Problem

InjectAnthropic and InjectOpenAI in cllama/internal/feeds/inject.go prepend dynamic content (feeds, timestamp) before the static system prompt. Anthropic's prompt cache is prefix-matched with a 5-minute TTL — since the prefix changes every request, the entire system prompt (~120K tokens for Allen) is re-cached from scratch every time.

Observed on tiverton-house over 7 days:

| Agent | Model | Requests | Cache Create Tokens | Cache Reuse | Weekly Cost |
|---|---|---|---|---|---|
| Allen | Sonnet 4.6 | 624 | 13.5M | 18% | $52.84 |
| Tiverton | Sonnet 4.6 | 467 | 5.4M | 8% | $21.30 |
| Dundas | Haiku 4.5 | 1,243 | 16.7M | 29% | $17.42 |
| **Total** | | 2,334 | | | $91.56 |

Cache creation is ~98% of the bill. Median request gap is 15 min (well beyond the 5-min TTL), so almost every request cold-starts the cache. But even when requests burst within 5 min, the volatile prefix (timestamp + feeds) still invalidates it.

Root Cause

inject.go line 87 (OpenAI) and line 118 (Anthropic):

// OpenAI path — prepends feed before existing system message
first["content"] = feedBlock + "\n\n" + existing

// Anthropic path — prepends feed before existing system prompt  
payload["system"] = feedBlock + "\n\n" + s

handler.go calls inject twice per request:

feeds.InjectAnthropic(payload, feedBlock)                              // feeds prepended
feeds.InjectAnthropic(payload, currentTimeLine(agentCtx, time.Now()))  // time prepended again

Final system prompt order: time (changes every minute) → feeds (change every request) → static system prompt (~120K)
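The cache impact of that ordering can be seen with a small sketch (prompt contents abbreviated; the helper is illustrative, not part of cllama): with a volatile prefix, two consecutive requests share almost nothing.

```go
package main

import (
	"fmt"
	"strings"
)

// sharedPrefixLen returns how many leading bytes two prompts share — roughly
// what a prefix-matched cache like Anthropic's can reuse between requests.
func sharedPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	static := strings.Repeat("system prompt ", 5) // stand-in for the ~120K-token prompt
	// Current prepend order: time, then feeds, then the static prompt.
	r1 := "12:00\n\nfeeds v1\n\n" + static
	r2 := "12:01\n\nfeeds v2\n\n" + static
	fmt.Println(sharedPrefixLen(r1, r2)) // prints 4 — the prompts diverge at the first changed byte
}
```

The static prompt never enters the shared prefix, so none of it is reusable even when requests land inside the TTL window.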

Fix

Append dynamic content instead of prepending. The static system prompt should come first so Anthropic can cache the prefix:

static system prompt (~120K, stable) → feeds (dynamic) → time (dynamic)

This means:

  1. InjectAnthropic and InjectOpenAI → append feed block after existing system content
  2. Same for the content-blocks path (lines 122-130)
  3. Handler call order stays the same (feeds then time) — both append
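A minimal sketch of the appended injection, assuming the string-valued `payload["system"]` path shown above (the real inject.go signatures may differ):

```go
package main

import "fmt"

// appendToSystem appends a dynamic block after any existing system prompt,
// so the static prefix stays byte-identical across requests.
// Name and shape are illustrative, not the actual inject.go API.
func appendToSystem(payload map[string]any, block string) {
	if s, ok := payload["system"].(string); ok && s != "" {
		payload["system"] = s + "\n\n" + block // append, not prepend
		return
	}
	payload["system"] = block // no existing system: set directly, as today
}

func main() {
	payload := map[string]any{"system": "static system prompt"}
	appendToSystem(payload, "feed block") // feeds first
	appendToSystem(payload, "time line")  // then time — both appended
	fmt.Println(payload["system"] == "static system prompt\n\nfeed block\n\ntime line") // prints true
}
```

With both calls appending, the handler's existing call order produces the target layout: static prompt, then feeds, then time.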

Expected Savings

With static prefix cached, cache_read (0.1x cost) replaces cache_create (1.25x cost) for the bulk of the system prompt on every request where the TTL hasn't expired. Estimated ~$60/week savings on the current tiverton-house pod.
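A rough sanity check on that estimate, using the multipliers above and an assumed Sonnet base input rate of $3 per million tokens (verify against current pricing). This is the upper bound where every Sonnet cache create becomes a cache read; the real saving is lower since the static prefix still re-caches after each TTL expiry:

```go
package main

import "fmt"

func main() {
	// Upper-bound check on the weekly estimate for the two Sonnet agents,
	// using the issue's multipliers (cache write 1.25x, cache read 0.1x).
	// ASSUMPTION: Sonnet base input rate of $3 per million tokens.
	const basePerMTok = 3.0
	createMTok := 13.5 + 5.4 // Allen + Tiverton weekly cache-create tokens, in millions

	writeCost := createMTok * basePerMTok * 1.25 // those tokens billed as cache writes
	readCost := createMTok * basePerMTok * 0.10  // the same tokens billed as cache reads
	fmt.Printf("max weekly Sonnet saving: $%.2f\n", writeCost-readCost)
}
```

That lands in the mid-$60s for Sonnet alone, so the ~$60/week figure is plausible once Haiku and TTL misses are factored in.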

Notes

  • OpenAI path should also append for consistency, though OpenAI caching semantics differ
  • The "no existing system" paths (setting payload["system"] = feedBlock directly) are fine — no reordering needed there
  • Tests in inject_test.go assert prepend order — they'll need updating


Labels: bug, cllama
