Problem
`InjectAnthropic` and `InjectOpenAI` in `cllama/internal/feeds/inject.go` prepend dynamic content (feeds, timestamp) before the static system prompt. Anthropic's prompt cache is prefix-matched with a 5-minute TTL — since the prefix changes every request, the entire system prompt (~120K tokens for Allen) is re-cached from scratch every time.
Observed on tiverton-house over 7 days:
| Agent | Model | Requests | Cache Create Tokens | Cache Reuse | Weekly Cost |
|---|---|---|---|---|---|
| Allen | Sonnet 4.6 | 624 | 13.5M | 18% | $52.84 |
| Tiverton | Sonnet 4.6 | 467 | 5.4M | 8% | $21.30 |
| Dundas | Haiku 4.5 | 1,243 | 16.7M | 29% | $17.42 |
| Total | | 2,334 | | | $91.56 |
Cache creation is ~98% of the bill. Median request gap is 15 min (well beyond the 5-min TTL), so almost every request cold-starts the cache. But even when requests burst within 5 min, the volatile prefix (timestamp + feeds) still invalidates it.
Root Cause
`inject.go` line 87 (OpenAI) and line 118 (Anthropic):

```go
// OpenAI path — prepends feed before existing system message
first["content"] = feedBlock + "\n\n" + existing

// Anthropic path — prepends feed before existing system prompt
payload["system"] = feedBlock + "\n\n" + s
```
`handler.go` calls inject twice per request:

```go
feeds.InjectAnthropic(payload, feedBlock)                             // feeds prepended
feeds.InjectAnthropic(payload, currentTimeLine(agentCtx, time.Now())) // time prepended again
```
Final system prompt order: time (changes every minute) → feeds (change every request) → static system prompt (~120K)
Fix
Append dynamic content instead of prepending. The static system prompt should come first so Anthropic can cache the prefix:
static system prompt (~120K, stable) → feeds (dynamic) → time (dynamic)
This means:
- `InjectAnthropic` and `InjectOpenAI` → append the feed block after existing system content
- Same for the content-blocks path (lines 122-130)
- Handler call order stays the same (feeds then time) — both append
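A minimal sketch of the append-order fix for the string-system path (the function shape here is an assumption based on the snippets above; the real `InjectAnthropic` also handles the content-blocks form):

```go
package main

import "fmt"

// injectAnthropic appends the feed block after any existing system prompt,
// keeping the static prefix byte-identical across requests so Anthropic's
// prefix cache can match it.
func injectAnthropic(payload map[string]any, feedBlock string) {
	if s, ok := payload["system"].(string); ok && s != "" {
		payload["system"] = s + "\n\n" + feedBlock
		return
	}
	// No existing system prompt: behavior unchanged from today.
	payload["system"] = feedBlock
}

func main() {
	p := map[string]any{"system": "STATIC PROMPT"}
	injectAnthropic(p, "FEEDS")
	injectAnthropic(p, "TIME")
	fmt.Println(p["system"]) // static prompt first, then FEEDS, then TIME
}
```

Called in the unchanged handler order (feeds, then time), this yields the target layout: static prompt first, volatile content last.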
Expected Savings
With static prefix cached, cache_read (0.1x cost) replaces cache_create (1.25x cost) for the bulk of the system prompt on every request where the TTL hasn't expired. Estimated ~$60/week savings on the current tiverton-house pod.
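A back-of-envelope model of that estimate. The 1.25x create and 0.1x read multipliers are from the pricing above; the $3/M-token base rate (Sonnet input) and the 80% hit rate are hypothetical parameters for illustration, not measured values:

```go
package main

import "fmt"

// weeklySavings models shifting cache-create tokens to cache-read tokens.
// createTokensM: weekly cache-create tokens, in millions.
// basePerMTok: base input price per million tokens.
// hitRate: hypothetical fraction of those tokens served as cache reads
// once the static prefix is stable.
func weeklySavings(createTokensM, basePerMTok, hitRate float64) float64 {
	before := createTokensM * basePerMTok * 1.25 // today: everything re-cached
	after := createTokensM * basePerMTok * (hitRate*0.10 + (1-hitRate)*1.25)
	return before - after
}

func main() {
	// Allen alone: 13.5M weekly cache-create tokens at an assumed $3/M base.
	fmt.Printf("$%.2f/week\n", weeklySavings(13.5, 3.0, 0.8)) // → $37.26/week
}
```

Summed across the three agents, numbers in this range are consistent with the ~$60/week estimate, with the residual creates reflecting TTL expiries on the 15-minute median request gap.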
Notes
- OpenAI path should also append for consistency, though OpenAI caching semantics differ
- The "no existing system" paths (setting `payload["system"] = feedBlock` directly) are fine — no reordering needed there
- Tests in `inject_test.go` assert prepend order — they'll need updating