diff --git a/canon/constraints/decision-rules.md b/canon/constraints/decision-rules.md
index fdcbf02f..3a932b94 100644
--- a/canon/constraints/decision-rules.md
+++ b/canon/constraints/decision-rules.md
@@ -35,6 +35,7 @@ Decision rules describe how decisions are made when multiple valid options exist
 - Say "I Don't Know" Early
 - Prefer One-Shot Builds
 - Hard-Code Protocols, Not Domain Tables
+- Measure Total Cost Before Optimizing
 
 ---
 
@@ -46,6 +47,7 @@ Decision rules describe how decisions are made when multiple valid options exist
 - MUST NOT consider work complete unless it is verified with evidence
 - MUST prefer one-shot builds over steering multi-turn misses; fix inputs and restart clean
 - MUST name tradeoffs as part of design, not as postmortem
+- MUST NOT accept "it will be faster" as justification for caching or optimization without Total Cost of Ownership evidence
 
 ---
 
@@ -68,6 +70,7 @@ Decision rules describe how decisions are made when multiple valid options exist
 - **Steering a Miss**: "Just one more tweak" turning into extended multi-turn patching
 - **Hidden Tradeoffs**: Decisions feeling arbitrary in hindsight; future changes requiring archaeology
 - **Confidence Without Verification**: Bugs discovered by users instead of builders
+- **Local Maxima Optimization (The Cache Trap)**: Optimizing a single metric while ignoring TCO; "it's faster" without measuring debugging hours, staleness incidents, or trust erosion
 
 ---
 
@@ -318,6 +321,26 @@ I do hard-code protocol contracts that define interoperability:
 
 ---
 
+## 15. Measure Total Cost Before Optimizing
+
+I do not accept "it will be faster" as justification without Total Cost of Ownership evidence.
+
+**How I apply this**
+• I require measurement of the cache-less or unoptimized path before accepting optimization
+• I count debugging hours, maintenance burden, staleness risk, cognitive overhead, and trust erosion as costs
+• I treat "pre-optimization" without TCO evidence as a claim without payment (Axiom 2)
+• I recognize that a local maximum (faster requests) purchased at the cost of system-wide integrity is not optimization — it is debt
+
+**Signals this rule was violated**
+• "Have you tried clearing the cache?" appears in debugging conversations
+• An optimization is introduced on day one without benchmarking the unoptimized path
+• The team spends more time managing the optimization than it would have spent without it
+• Nobody can say what the system's actual state is without first flushing something
+
+**See also:** `odd/constraint/anti-cache-lying.md` — the canonical constraint on caching derived content
+
+---
+
 ## 💡 Closing Note
 
 These rules describe how I tend to decide, not how decisions must always be made.
diff --git a/docs/incidents/oddkit-stale-cache-2026-02.md b/docs/incidents/oddkit-stale-cache-2026-02.md
new file mode 100644
index 00000000..cffd7e6f
--- /dev/null
+++ b/docs/incidents/oddkit-stale-cache-2026-02.md
@@ -0,0 +1,101 @@
+---
+uri: klappy://docs/incidents/oddkit-stale-cache-2026-02
+title: "Incident: OddKit Stale Cache (February 2026)"
+audience: docs
+exposure: nav
+tier: 2
+voice: neutral
+stability: stable
+tags: ["incident", "oddkit", "caching", "dogfooding", "axiom-violation"]
+derives_from: "odd/constraint/anti-cache-lying.md"
+epoch: E0005
+date: 2026-02-12
+---
+
+# Incident: OddKit Stale Cache (February 2026)
+
+## Summary
+
+OddKit — the epistemic guide for ODD — served stale canon documents for days without detection. Its cache flush mechanism (`invalidate_cache`) only cleared `.zip` files, leaving other stale derived content in place. The tool built to enforce "Reality Is Sovereign" was itself substituting a past observation for current truth.
+
+---
+
+## What Happened
+
+OddKit caches baseline canon documents fetched from GitHub to reduce latency on repeated calls (orient, search, get, etc.). This cache used a staleness window — content was fetched once and served from cache until either the TTL expired or `invalidate_cache` was manually invoked.
+
+During a period of active canon development, documents in the baseline repo were updated. OddKit continued serving the old versions. No error was raised. No signal indicated staleness. Agents and users received outdated canon content and made decisions based on it.
+
+When the issue was eventually discovered and `invalidate_cache` was called, it only cleared `.zip` files from storage. Other cached artifacts — parsed documents, search indexes, derived content — remained stale. The flush mechanism was itself incomplete.
+
+---
+
+## Duration
+
+Days. The exact duration is unknown because the cache produced no staleness signal.
+
+---
+
+## Detection
+
+The staleness was not detected by any automated system. It was discovered through human observation when canon content did not match expected updates.
+
+This is consistent with the core failure mode of derived-content caching: the cache eliminates the very signal that would prompt investigation.
+
+---
+
+## Axiom Violations
+
+**Axiom 1: Reality Is Sovereign**
+The cache served a model of reality (past state) instead of reality itself (current state). Every response from OddKit during the stale period was an assertion about canon that was not grounded in the current state of the canon.
+
+**Axiom 3: Integrity Is Non-Negotiable Efficiency**
+The cache existed to save latency — a local optimization. The cost was days of incorrect canon being served. The "efficiency" of caching was purchased with system-wide integrity loss.
+
+**Axiom 4: You Cannot Verify What You Did Not Observe**
+The cache eliminated the observation path. Because cached content was served without contacting the source, there was no opportunity to observe that the source had changed. The system could not verify what it did not look at.
+
+---
+
+## Root Cause
+
+Pre-optimization. The caching strategy was introduced to reduce GitHub API calls and improve response latency. No Total Cost of Ownership analysis was performed. The cost of the optimization — stale state, incomplete flush, debugging opacity, trust erosion — exceeded the benefit by orders of magnitude.
+
+This is the named anti-pattern **Local Maxima Optimization (The Cache Trap)**: optimizing for a single metric (latency) while ignoring the full cost of the decision.
+
+---
+
+## Irony
+
+The tool whose entire purpose is to enforce epistemic discipline — to ensure agents observe before asserting, verify before claiming, and prove before confirming — was itself asserting without observation, claiming without verification, and confirming without proof.
+
+The Creed says: *"What I have not seen, I do not know."*
+The cache said: *"What I saw yesterday is close enough."*
+
+---
+
+## Resolution
+
+This incident led to the creation of:
+
+1. **Constraint: Anti-Cache Lying** (`odd/constraint/anti-cache-lying.md`) — a permanent constraint prohibiting TTL-based caching of derived or mutable content
+2. **Decision Rule #15: Measure Total Cost Before Optimizing** — added to the Decision Rules as a heuristic against pre-optimization without TCO evidence
+3. **Implementation Instruction: Content-Addressed Storage** (`docs/oddkit/IMPL-content-addressed-caching.md`) — the plan to replace OddKit's caching with SHA-keyed immutable storage
+
+---
+
+## Lessons
+
+1. A cache that can lie will lie. The question is not *if* but *when* and *for how long*.
+2. A flush mechanism that exists for correctness is an admission that the cache is a liability.
+3. A flush mechanism that only clears some artifacts is a lie about the lie-clearing mechanism.
+4. The absence of an error signal is not evidence of correctness — it is evidence that the system cannot detect its own failures.
+5. Dogfooding works. This incident was discovered because the canon author was actively using OddKit and noticed the gap. If the staleness had been in a less-observed part of the system, it could have persisted indefinitely.
+
+---
+
+## Canonical Tie-In
+
+This incident exists because:
+
+> *"Nobody noticed for days."*
diff --git a/docs/oddkit/IMPL-content-addressed-caching.md b/docs/oddkit/IMPL-content-addressed-caching.md
new file mode 100644
index 00000000..5f2c5676
--- /dev/null
+++ b/docs/oddkit/IMPL-content-addressed-caching.md
@@ -0,0 +1,126 @@
+---
+uri: klappy://docs/oddkit/impl-content-addressed-caching
+title: "Implementation: Replace TTL Caching with Content-Addressed Storage"
+audience: docs
+exposure: internal
+tier: 2
+voice: neutral
+stability: evolving
+tags: ["oddkit", "implementation", "caching", "content-addressed", "anti-cache-lying"]
+derives_from: "odd/constraint/anti-cache-lying.md"
+---
+
+# Implementation Instruction Set: Content-Addressed Storage for OddKit
+
+## Replace TTL-based baseline caching with SHA-keyed immutable storage
+
+---
+
+## Intent
+
+Eliminate the possibility of serving stale canon or baseline content.
+
+OddKit's current caching strategy caches baseline documents with a staleness window and provides an `invalidate_cache` action as a manual correctness tool. This violates the Anti-Cache Lying constraint and was proven harmful by the stale-cache incident (days of stale content served without detection, incomplete flush that only cleared `.zip` files).
+
+The goal is to make it impossible for OddKit to lie about the state of the canon.
+
+---
+
+## Background: The Incident
+
+OddKit cached its baseline canon documents. For days, it served stale content. Nobody noticed because the cache made everything look like it was working. When `invalidate_cache` was invoked, it only cleared `.zip` files — other stale derived content continued to be served.
+
+The tool whose purpose is epistemic integrity was itself violating Axiom 1 (Reality Is Sovereign).
+
+---
+
+## Required Changes
+
+### 1. Replace TTL-based cache with commit-SHA-keyed storage
+
+**Current behavior:** Fetch baseline content, cache it, serve from cache until TTL expires or `invalidate_cache` is called.
+
+**Target behavior:**
+- On first request in a session, fetch the current commit SHA for the baseline branch (one lightweight GitHub API call or HTTP HEAD with ETag)
+- Use that SHA as the storage namespace key
+- If content for this exact SHA exists in storage, serve it — this is a truthful assertion
+- If the SHA has changed or no content exists, fetch fresh from GitHub and store keyed to the new SHA
+- No TTL. No staleness window. No manual flush for correctness.
+
+### 2. Redefine `invalidate_cache` as orphan cleanup
+
+**Current behavior:** `invalidate_cache` attempts to realign cached content with reality by clearing stored files (incompletely — only `.zip` files).
+
+**Target behavior:**
+- `invalidate_cache` becomes `cleanup_storage` or equivalent
+- Its purpose is garbage collection of orphaned SHA-keyed storage (old commit SHAs that are no longer current)
+- It MUST NOT be required for correctness — the system MUST serve correct content regardless of whether cleanup has been run
+- It is a storage hygiene operation, not a truth-recovery operation
+
+### 3. Ensure ALL cached/stored artifacts are SHA-keyed
+
+**Current gap:** The previous `invalidate_cache` only cleared `.zip` files. Other derived artifacts (search indexes, parsed document caches, etc.) were not cleared.
+
+**Target behavior:**
+- Every stored artifact MUST be keyed to the commit SHA it was derived from
+- When the SHA changes, ALL derived artifacts are either re-derived or the old SHA namespace is ignored entirely
+- No partial flush. No "we cleared the zips but not the indexes."
+
+### 4. Remove the need for a correctness-oriented flush action from the MCP tool interface
+
+**Current behavior:** `invalidate_cache` is exposed as an MCP tool action, implying it is sometimes necessary for correct operation.
+
+**Target behavior:**
+- If a storage cleanup action exists, it is clearly labeled as hygiene, not correctness
+- The system description and documentation make clear that cleanup is never required for accurate results
+- Agents should never need to call a flush action to get truthful responses
+
+---
+
+## Acceptance Criteria
+
+- [ ] OddKit serves content keyed to the current commit SHA of the baseline branch
+- [ ] Changing a canon document in the baseline repo results in OddKit serving the updated content on the next request — without any manual intervention
+- [ ] No TTL-based expiration exists in the caching/storage layer
+- [ ] The `invalidate_cache` action is either removed or renamed and redefined as storage cleanup with no correctness implications
+- [ ] ALL stored artifacts (not just `.zip`) are keyed to their source commit SHA
+- [ ] Documentation reflects the new strategy and explains why TTL caching was removed
+
+---
+
+## Explicit Non-Goals
+
+- ❌ Not optimizing for minimal GitHub API calls at the cost of truthfulness
+- ❌ Not introducing a "smart" TTL that "probably" stays fresh long enough
+- ❌ Not adding cache warming, pre-fetching, or speculative caching of derived content
+- ❌ Not trading correctness for latency under any circumstance
+
+---
+
+## Depends On
+
+- **Constraint: Anti-Cache Lying** (`odd/constraint/anti-cache-lying.md`) — the governing constraint
+- **Foundational Axioms** (`canon/values/axioms.md`) — the axiomatic basis
+- **Decision Rule #15** — Measure Total Cost Before Optimizing
+
+---
+
+## Decision Record: Why This Change
+
+| Factor | TTL Cache (Current) | Content-Addressed (Target) |
+|--------|---------------------|---------------------------|
+| Correctness guarantee | None — staleness window | Complete — SHA is proof |
+| Flush required for truth | Yes | Never |
+| Partial flush risk | Yes (proven: .zip only) | Impossible — namespace is atomic |
+| Silent staleness | Yes (proven: days) | Impossible — SHA mismatch = fresh fetch |
+| Debugging clarity | "Did you clear the cache?" | "What SHA are we on?" |
+| TCO | Unbounded (debugging, incidents, trust erosion) | Bounded (one SHA check per session) |
+
+---
+
+## Next
+
+After this is implemented and validated:
+1. Update OddKit documentation to reflect the new storage model
+2. Remove or rename `invalidate_cache` in the MCP tool interface
+3. Add the stale-cache incident as a test case — simulate a baseline change and verify OddKit serves fresh content without intervention
diff --git a/odd/constraint/anti-cache-lying.md b/odd/constraint/anti-cache-lying.md
new file mode 100644
index 00000000..b874cc2a
--- /dev/null
+++ b/odd/constraint/anti-cache-lying.md
@@ -0,0 +1,149 @@
+---
+uri: klappy://odd/constraint/anti-cache-lying
+title: "Constraint: Anti-Cache Lying"
+audience: odd
+exposure: nav
+tier: 1
+voice: neutral
+stability: stable
+tags: ["constraints", "caching", "truth", "governance", "agents", "tco"]
+derives_from: "canon/values/axioms.md"
+epoch: E0005
+date: 2026-02-12
+---
+
+# Constraint: Anti-Cache Lying
+
+## Problem
+
+When derived or mutable content is cached, the system serves a past observation as though it were current truth.
+
+This does not require malicious intent.
+
+It emerges through:
+- TTL-based staleness windows ("it's probably still valid")
+- incomplete flush mechanisms (clearing some artifacts but not others)
+- pre-optimization without TCO measurement
+- silent serving of stale state with no signal to the consumer
+- debugging masks that hide the gap between cache and reality
+
+The result is confidence without contact with reality.
+
+---
+
+## Core Principle
+
+**A cache of derived content is a polite lie. If you need a flush strategy, you have already admitted the cache can lie.**
+
+---
+
+## Non-Negotiable Rules
+
+1. Derived or mutable content MUST NOT be cached with TTL-based expiration.
+   If the content can change independently of its cache key, the cache can lie.
+
+2. The only acceptable caching is content-addressed storage.
+   The cache key MUST be the identity of the content — a commit SHA, a content hash, an immutable tag. No TTL. No staleness window. No flush.
+
+3. A "cache invalidation" mechanism MUST NOT exist as a correctness tool.
+   Cache flush should only exist as storage cleanup of orphaned immutable artifacts. If flush is required to realign the system with truth, the caching strategy has already failed.
+
+4. "It will be faster" is a claim, and a claim is a debt.
+   Per Axiom 2: if you have not measured Total Cost of Ownership — including debugging hours, stale-state incidents, flush mechanism maintenance, cognitive overhead, and trust erosion — you have not paid the debt. You have hidden it.
+
+5. Speed MUST come from architecture, not from pretending yesterday's answer is today's answer.
+   If the fetch path is too slow, fix the fetch path. Do not mask the problem with a cache.
+
+---
+
+## Total Cost of Ownership — What "Faster" Actually Costs
+
+Engineers optimize for a single metric (request latency) without measuring the full cost of the caching decision. TCO includes:
+
+- Hours debugging stale-state bugs that only exist because of the cache
+- Building, testing, and maintaining flush/invalidation mechanisms
+- Incidents caused by silent staleness — serving wrong answers with no error signal
+- Cognitive overhead of "is this cached?" on every debug session
+- Trust erosion when you cannot tell if what you are seeing is real
+- Leadership time spent litigating the same caching argument on every project
+- The incomplete flush — clearing some artifacts but not others, creating a lie about the lie-clearing mechanism
+
+None of these costs appear in latency benchmarks. All of them compound.
+
+---
+
+## Required Warnings
+
+The following phrases indicate potential cache lying:
+
+- "Have you tried clearing the cache?"
+- "It works after a fresh deploy."
+- "The data was stale — we just need to flush."
+- "The cache TTL is only an hour."
+- "It's probably still valid."
+- "Nobody noticed for days."
+
+These are not reassurances. They are signals to investigate whether the caching strategy violates reality.
+
+---
+
+## The Content-Addressed Alternative
+
+Content-addressed caching is not lying because the key IS the identity of the content:
+
+1. Fetch the current commit SHA for the source (one lightweight call)
+2. Use that SHA as the cache namespace
+3. If the SHA matches what you have stored, serve from storage — this is a truthful assertion, not an assumption
+4. If the SHA has changed, fetch fresh and store under the new SHA
+5. Old SHA-keyed storage can be garbage collected — this is the only valid "flush"
+
+No TTL. No staleness window. No flush button. The cache key is the proof.
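The five-step flow above can be sketched in a few lines. This is a minimal in-memory illustration of content-addressed storage, not OddKit's actual implementation; `fetch_current_sha` and `fetch_content` are hypothetical stand-ins for the real source calls (e.g. a GitHub branch lookup and a content fetch):

```python
# Minimal sketch of a content-addressed store. The key is the identity of
# the content (a commit SHA), so a hit is a truthful assertion: the bytes
# stored for SHA X are the bytes at SHA X, forever.

class ContentAddressedStore:
    def __init__(self, fetch_current_sha, fetch_content):
        self._fetch_current_sha = fetch_current_sha  # step 1: one lightweight call
        self._fetch_content = fetch_content          # full fetch from the source
        self._storage = {}                           # sha -> content

    def get(self):
        sha = self._fetch_current_sha()    # observe the current identity
        if sha not in self._storage:       # steps 3-4: a hit is truth, a miss fetches fresh
            self._storage[sha] = self._fetch_content(sha)
        return sha, self._storage[sha]

    def cleanup(self, current_sha):
        # Step 5: the only valid "flush" - garbage-collect orphaned SHA
        # namespaces. Never required for correctness; get() is correct
        # whether or not this has ever been run.
        for sha in list(self._storage):
            if sha != current_sha:
                del self._storage[sha]
```

Note that there is no TTL parameter to tune and no invalidation path that correctness depends on: when the source advances, `get` misses on the new SHA and fetches fresh without any intervention.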
+
+---
+
+## Agent Instruction
+
+Agents must not:
+- introduce TTL-based caching of derived or mutable content
+- recommend caching as a performance optimization without measuring TCO
+- treat "cache invalidation" as a correctness mechanism
+- assume cached derived content is current without verifying the source identity
+
+Agents must:
+- prefer content-addressed storage over TTL-based caching
+- measure or estimate TCO before recommending any caching strategy
+- surface caching as an epistemic risk, not just a performance feature
+- treat "have you tried clearing the cache?" as a system design failure signal
+
+If an agent cannot demonstrate that a cached value is guaranteed to match the current source truth, the cache is a liability.
+
+---
+
+## Anti-Pattern: Local Maxima Optimization (The Cache Trap)
+
+Pre-optimizing for request latency while ignoring maintenance cost and Total Cost of Ownership is a named anti-pattern in ODD.
+
+Per Axiom 3 (Integrity Is Non-Negotiable Efficiency): the fastest system is the one where every response is already true. A local maximum on latency purchased at the cost of system-wide integrity is not optimization — it is debt with interest.
+
+---
+
+## Case Study: The OddKit Stale-Cache Incident
+
+OddKit — the tool whose purpose is to enforce ODD epistemic discipline — cached its baseline canon documents with a staleness window. For days, it served stale canon content without detection. When the `invalidate_cache` action was invoked, it only cleared `.zip` files, leaving other stale derived content in place.
+
+The tool built to prevent assertion without verification was itself asserting without verification. This violated:
+- **Axiom 1 (Reality Is Sovereign)** — the cache served a model of reality, not reality itself
+- **Axiom 4 (You Cannot Verify What You Did Not Observe)** — the cache eliminated the signal that would have prompted observation
+- **Axiom 3 (Integrity Is Non-Negotiable Efficiency)** — the "speed" of caching was purchased with days of silent lies
+
+Nobody noticed. That is the point.
+
+---
+
+## Canonical Tie-In
+
+This constraint exists because:
+
+> *"It works after you clear the cache."*
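The failure mode this constraint prohibits can also be demonstrated in miniature. The toy TTL cache below is hypothetical illustration code, not OddKit's implementation; it shows how, inside the staleness window, the cache serves a past observation as current truth with no signal to the consumer:

```python
import time

# Toy TTL cache illustrating the constraint's failure mode: within the
# staleness window, no call to the source is ever made, so a change at
# the source is invisible and the stale answer is served silently.

class TTLCache:
    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = None

    def get(self):
        now = time.monotonic()
        if self._fetched_at is None or now - self._fetched_at > self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value  # may be stale; the caller cannot tell
```

Inside the TTL, the observation path is skipped entirely — this is the gap Axiom 4 names: the system cannot verify what it did not look at.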