Merged
23 changes: 23 additions & 0 deletions canon/constraints/decision-rules.md
@@ -35,6 +35,7 @@ Decision rules describe how decisions are made when multiple valid options exist
- Say "I Don't Know" Early
- Prefer One-Shot Builds
- Hard-Code Protocols, Not Domain Tables
- Measure Total Cost Before Optimizing

---

@@ -46,6 +47,7 @@ Decision rules describe how decisions are made when multiple valid options exist
- MUST NOT consider work complete unless it is verified with evidence
- MUST prefer one-shot builds over steering multi-turn misses; fix inputs and restart clean
- MUST name tradeoffs as part of design, not as postmortem
- MUST NOT accept "it will be faster" as justification for caching or optimization without Total Cost of Ownership evidence

---

@@ -68,6 +70,7 @@ Decision rules describe how decisions are made when multiple valid options exist
- **Steering a Miss**: "Just one more tweak" turning into extended multi-turn patching
- **Hidden Tradeoffs**: Decisions feeling arbitrary in hindsight; future changes requiring archaeology
- **Confidence Without Verification**: Bugs discovered by users instead of builders
- **Local Maxima Optimization (The Cache Trap)**: Optimizing a single metric while ignoring TCO; "it's faster" without measuring debugging hours, staleness incidents, or trust erosion

---

@@ -318,6 +321,26 @@ I do hard-code protocol contracts that define interoperability:

---

## 15. Measure Total Cost Before Optimizing

I do not accept "it will be faster" as justification without Total Cost of Ownership evidence.

**How I apply this**
- I require measurement of the cache-less or unoptimized path before accepting optimization
- I count debugging hours, maintenance burden, staleness risk, cognitive overhead, and trust erosion as costs
- I treat "pre-optimization" without TCO evidence as a claim without payment (Axiom 2)
- I recognize that a local maximum (faster requests) purchased at the cost of system-wide integrity is not optimization — it is debt
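The TCO comparison this rule demands can be back-of-the-envelope arithmetic; a minimal sketch, using hypothetical figures that are illustrative assumptions, not measurements from any incident:

```python
# Hypothetical figures for illustration only -- not measured values.
requests_per_month = 50_000
latency_saved_s = 0.002          # 2 ms saved per request by the cache

# Benefit: total wall-clock time the optimization saves, in hours.
benefit_hours = requests_per_month * latency_saved_s / 3600

# Costs the rule says must be counted alongside the benefit.
debugging_hours = 6              # "have you tried clearing the cache?" sessions
incident_hours = 16              # one stale-content incident and its cleanup
cost_hours = debugging_hours + incident_hours

tco_delta = benefit_hours - cost_hours
# benefit_hours is well under one hour, so tco_delta is deeply negative:
# the optimization fails the Total Cost of Ownership test.
```

Even with the benefit side inflated, a single staleness incident dominates the ledger, which is why "it will be faster" alone is not payment.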

**Signals this rule was violated**
- "Have you tried clearing the cache?" appears in debugging conversations
- An optimization is introduced on day one without benchmarking the unoptimized path
- The team spends more time managing the optimization than it would have spent without it
- Nobody can say what the system's actual state is without first flushing something

**See also:** `odd/constraint/anti-cache-lying.md` — the canonical constraint on caching derived content

---

## 💡 Closing Note

These rules describe how I tend to decide, not how decisions must always be made.
101 changes: 101 additions & 0 deletions docs/incidents/oddkit-stale-cache-2026-02.md
@@ -0,0 +1,101 @@
---
uri: klappy://docs/incidents/oddkit-stale-cache-2026-02
title: "Incident: OddKit Stale Cache (February 2026)"
audience: docs
exposure: nav
tier: 2
voice: neutral
stability: stable
tags: ["incident", "oddkit", "caching", "dogfooding", "axiom-violation"]
derives_from: "odd/constraint/anti-cache-lying.md"
epoch: E0005
date: 2026-02-12
---

# Incident: OddKit Stale Cache (February 2026)

## Summary

OddKit — the epistemic guide for ODD — served stale canon documents for days without detection. Its cache flush mechanism (`invalidate_cache`) only cleared `.zip` files, leaving other stale derived content in place. The tool built to enforce "Reality Is Sovereign" was itself substituting a past observation for current truth.

---

## What Happened

OddKit caches baseline canon documents fetched from GitHub to reduce latency on repeated calls (orient, search, get, etc.). This cache used a staleness window — content was fetched once and served from cache until either the TTL expired or `invalidate_cache` was manually invoked.

During a period of active canon development, documents in the baseline repo were updated. OddKit continued serving the old versions. No error was raised. No signal indicated staleness. Agents and users received outdated canon content and made decisions based on it.

When the issue was eventually discovered and `invalidate_cache` was called, it only cleared `.zip` files from storage. Other cached artifacts — parsed documents, search indexes, derived content — remained stale. The flush mechanism was itself incomplete.

---

## Duration

Days. The exact duration is unknown because the cache produced no staleness signal.

---

## Detection

The staleness was not detected by any automated system. It was discovered through human observation when canon content did not match expected updates.

This is consistent with the core failure mode of derived-content caching: the cache eliminates the very signal that would prompt investigation.

---

## Axiom Violations

**Axiom 1: Reality Is Sovereign**
The cache served a model of reality (past state) instead of reality itself (current state). Every response from OddKit during the stale period was an assertion about canon that was not grounded in the current state of the canon.

**Axiom 3: Integrity Is Non-Negotiable Efficiency**
The cache existed to save latency — a local optimization. The cost was days of incorrect canon being served. The "efficiency" of caching was purchased with system-wide integrity loss.

**Axiom 4: You Cannot Verify What You Did Not Observe**
The cache eliminated the observation path. Because cached content was served without contacting the source, there was no opportunity to observe that the source had changed. The system could not verify what it did not look at.

---

## Root Cause

Pre-optimization. The caching strategy was introduced to reduce GitHub API calls and improve response latency. No Total Cost of Ownership analysis was performed. The cost of the optimization — stale state, incomplete flush, debugging opacity, trust erosion — exceeded the benefit by orders of magnitude.

This is the named anti-pattern **Local Maxima Optimization (The Cache Trap)**: optimizing for a single metric (latency) while ignoring the full cost of the decision.

---

## Irony

The tool whose entire purpose is to enforce epistemic discipline — to ensure agents observe before asserting, verify before claiming, and prove before confirming — was itself asserting without observation, claiming without verification, and confirming without proof.

The Creed says: *"What I have not seen, I do not know."*
The cache said: *"What I saw yesterday is close enough."*

---

## Resolution

This incident led to the creation of:

1. **Constraint: Anti-Cache Lying** (`odd/constraint/anti-cache-lying.md`) — a permanent constraint prohibiting TTL-based caching of derived or mutable content
2. **Decision Rule #15: Measure Total Cost Before Optimizing** — added to the Decision Rules as a heuristic against pre-optimization without TCO evidence
3. **Implementation Instruction: Content-Addressed Storage** (`docs/oddkit/IMPL-content-addressed-caching.md`) — the plan to replace OddKit's caching with SHA-keyed immutable storage

---

## Lessons

1. A cache that can lie will lie. The question is not *if* but *when* and *for how long*.
2. A flush mechanism that exists for correctness is an admission that the cache is a liability.
3. A flush mechanism that only clears some artifacts is a lie about the lie-clearing mechanism.
4. The absence of an error signal is not evidence of correctness — it is evidence that the system cannot detect its own failures.
5. Dogfooding works. This incident was discovered because the canon author was actively using OddKit and noticed the gap. If the staleness had been in a less-observed part of the system, it could have persisted indefinitely.

---

## Canonical Tie-In

This incident exists because:

> *"Nobody noticed for days."*
126 changes: 126 additions & 0 deletions docs/oddkit/IMPL-content-addressed-caching.md
@@ -0,0 +1,126 @@
---
uri: klappy://docs/oddkit/impl-content-addressed-caching
title: "Implementation: Replace TTL Caching with Content-Addressed Storage"
audience: docs
exposure: internal
tier: 2
voice: neutral
stability: evolving
tags: ["oddkit", "implementation", "caching", "content-addressed", "anti-cache-lying"]
derives_from: "odd/constraint/anti-cache-lying.md"
---

# Implementation Instruction Set: Content-Addressed Storage for OddKit

## Replace TTL-based baseline caching with SHA-keyed immutable storage

---

## Intent

Eliminate the possibility of serving stale canon or baseline content.

OddKit's current caching strategy caches baseline documents with a staleness window and provides an `invalidate_cache` action as a manual correctness tool. This violates the Anti-Cache Lying constraint and was proven harmful by the stale-cache incident (days of stale content served without detection, incomplete flush that only cleared `.zip` files).

The goal is to make it impossible for OddKit to lie about the state of the canon.

---

## Background: The Incident

OddKit cached its baseline canon documents. For days, it served stale content. Nobody noticed because the cache made everything look like it was working. When `invalidate_cache` was invoked, it only cleared `.zip` files — other stale derived content continued to be served.

The tool whose purpose is epistemic integrity was itself violating Axiom 1 (Reality Is Sovereign).

---

## Required Changes

### 1. Replace TTL-based cache with commit-SHA-keyed storage

**Current behavior:** Fetch baseline content, cache it, serve from cache until TTL expires or `invalidate_cache` is called.

**Target behavior:**
- On first request in a session, fetch the current commit SHA for the baseline branch (one lightweight GitHub API call or HTTP HEAD with ETag)
- Use that SHA as the storage namespace key
- If content for this exact SHA exists in storage, serve it — this is a truthful assertion
- If the SHA has changed or no content exists, fetch fresh from GitHub and store keyed to the new SHA
- No TTL. No staleness window. No manual flush for correctness.
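The target behavior above can be sketched as a small storage wrapper. This is a sketch under stated assumptions: `fetch_sha` and `fetch_content` stand in for the GitHub calls, and none of these names are OddKit's actual API:

```python
class ShaKeyedStore:
    """Content-addressed storage: every entry is keyed to the commit SHA
    it was derived from, so serving a stored entry is a truthful assertion
    about that exact commit."""

    def __init__(self, fetch_sha, fetch_content):
        self._fetch_sha = fetch_sha          # () -> current baseline commit SHA
        self._fetch_content = fetch_content  # (sha, path) -> fresh content
        self._store = {}                     # (sha, path) -> content

    def get(self, path):
        sha = self._fetch_sha()              # one lightweight check per request
        key = (sha, path)
        if key not in self._store:
            # New SHA or never fetched: go to the source and store immutably.
            self._store[key] = self._fetch_content(sha, path)
        return self._store[key]              # no TTL, no flush, no staleness
```

Because the key includes the SHA, a change on the baseline branch makes every old entry unreachable on the very next request; nothing ever needs to be flushed for correctness.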

### 2. Redefine `invalidate_cache` as orphan cleanup

**Current behavior:** `invalidate_cache` attempts to realign cached content with reality by clearing stored files (incompletely — only `.zip` files).

**Target behavior:**
- `invalidate_cache` becomes `cleanup_storage` or equivalent
- Its purpose is garbage collection of orphaned SHA-keyed storage (old commit SHAs that are no longer current)
- It MUST NOT be required for correctness — the system MUST serve correct content regardless of whether cleanup has been run
- It is a storage hygiene operation, not a truth-recovery operation
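Redefined this way, cleanup is a pure garbage-collection pass over the same SHA-keyed store. A minimal sketch, assuming `(sha, path)` tuple keys as in a content-addressed layout (the function name and key shape are assumptions):

```python
def cleanup_storage(store, current_sha):
    """Delete entries keyed to SHAs other than the current one.
    Storage hygiene only: correctness never depends on running this,
    because reads already ignore every non-current SHA namespace."""
    orphaned = [key for key in store if key[0] != current_sha]
    for key in orphaned:
        del store[key]
    return len(orphaned)  # number of orphaned entries reclaimed
```

Skipping this call forever wastes disk, never truth; running it mid-request changes nothing an agent can observe.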

### 3. Ensure ALL cached/stored artifacts are SHA-keyed

**Current gap:** The previous `invalidate_cache` only cleared `.zip` files. Other derived artifacts (search indexes, parsed document caches, etc.) were not cleared.

**Target behavior:**
- Every stored artifact MUST be keyed to the commit SHA it was derived from
- When the SHA changes, ALL derived artifacts are either re-derived or the old SHA namespace is ignored entirely
- No partial flush. No "we cleared the zips but not the indexes."
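One way to make the no-partial-flush property structural rather than procedural is to put every artifact class under a single SHA-keyed directory. The layout below is an illustrative assumption, not OddKit's actual disk format:

```python
def sha_namespace(base_dir, sha):
    """All derived artifacts live under one SHA directory, so switching
    SHAs abandons archives, parsed caches, and indexes atomically --
    there is no way to clear the zips but keep the indexes."""
    root = f"{base_dir}/{sha}"
    return {
        "archives": f"{root}/archives",  # fetched .zip snapshots
        "parsed": f"{root}/parsed",      # parsed document cache
        "index": f"{root}/index",        # search indexes
    }
```

With this shape, "re-derive or ignore" is a single comparison on the directory name rather than a checklist of artifact types that can drift out of sync.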

### 4. Remove the need for a correctness-oriented flush action from the MCP tool interface

**Current behavior:** `invalidate_cache` is exposed as an MCP tool action, implying it is sometimes necessary for correct operation.

**Target behavior:**
- If a storage cleanup action exists, it is clearly labeled as hygiene, not correctness
- The system description and documentation make clear that cleanup is never required for accurate results
- Agents should never need to call a flush action to get truthful responses

---

## Acceptance Criteria

- [ ] OddKit serves content keyed to the current commit SHA of the baseline branch
- [ ] Changing a canon document in the baseline repo results in OddKit serving the updated content on the next request — without any manual intervention
- [ ] No TTL-based expiration exists in the caching/storage layer
- [ ] The `invalidate_cache` action is either removed or renamed and redefined as storage cleanup with no correctness implications
- [ ] ALL stored artifacts (not just `.zip`) are keyed to their source commit SHA
- [ ] Documentation reflects the new strategy and explains why TTL caching was removed

---

## Explicit Non-Goals

- ❌ Not optimizing for minimal GitHub API calls at the cost of truthfulness
- ❌ Not introducing a "smart" TTL that "probably" stays fresh long enough
- ❌ Not adding cache warming, pre-fetching, or speculative caching of derived content
- ❌ Not trading correctness for latency under any circumstance

---

## Depends On

- **Constraint: Anti-Cache Lying** (`odd/constraint/anti-cache-lying.md`) — the governing constraint
- **Foundational Axioms** (`canon/values/axioms.md`) — the axiomatic basis
- **Decision Rule #15** — Measure Total Cost Before Optimizing

---

## Decision Record: Why This Change

| Factor | TTL Cache (Current) | Content-Addressed (Target) |
|--------|---------------------|---------------------------|
| Correctness guarantee | None — staleness window | Complete — SHA is proof |
| Flush required for truth | Yes | Never |
| Partial flush risk | Yes (proven: .zip only) | Impossible — namespace is atomic |
| Silent staleness | Yes (proven: days) | Impossible — SHA mismatch = fresh fetch |
| Debugging clarity | "Did you clear the cache?" | "What SHA are we on?" |
| TCO | Unbounded (debugging, incidents, trust erosion) | Bounded (one SHA check per session) |

---

## Next

After this is implemented and validated:
1. Update OddKit documentation to reflect the new storage model
2. Remove or rename `invalidate_cache` in the MCP tool interface
3. Add the stale-cache incident as a test case — simulate a baseline change and verify OddKit serves fresh content without intervention