klappy · klappy · Apr 17, 2026 · Apr 17, 2026 · Apr 17, 2026 · Apr 17, 2026
diff --git a/docs/oddkit/evidence/challenge-governance-code-refactor.md b/docs/oddkit/evidence/challenge-governance-code-refactor.md
@@ -0,0 +1,155 @@
+# Gauntlet Evidence — Challenge Governance Code Refactor
+
+**Branch:** `feat/e0008-challenge-governance-driven`
+**Date:** 2026-04-17
+**Scope:** Governance-driven refactor of `oddkit_challenge` in `workers/src/orchestrate.ts` plus minor extension of `workers/src/bm25.ts`
+**Deliverable type:** Worker code change (TypeScript) — the runtime that consumes the canon governance articles landed in PR #99
+**Predecessor PRs:** #96 (governance-driven encode pattern, the structural mirror), #99 (klappy.dev governance articles, the canon this code reads)
+
+---
+
+## Definition of Done — Evidence
+
+### 1. Change Description
+
+Refactored `runChallengeAction` in `workers/src/orchestrate.ts` from hardcoded claim-type detection and question generation to governance-driven extraction. The structural mirror of PR #96 (encode). **Mid-implementation pivot:** replaced regex-OR detection with BM25 + stemming after the gauntlet surfaced a morphological brittleness (`"coin"` doesn't match trigger word `"coining"`). The architectural swap removed an entire class of bug and validated a reusable pattern for future governance-driven tools.
+
+**New types added (`orchestrate.ts`):**
+
+- `ChallengeTypeDef` — slug, name, blockquote, trigger words, `detectionText` (triggers + blockquote, fed to BM25 indexer), questions with tiers, prerequisite overlays, reframings, fallback flag
+- `BasePrerequisite` — prerequisite name, check description, gap message
+- `NormativeVocabulary` — case-sensitive regex (RFC 2119), case-insensitive regex (architectural phrases), directive type map (this one keeps regex since it's directive-vocabulary matching against retrieved canon quotes, not claim-type detection)
+- `StakesModeConfig` / `StakesCalibration` — mode → (question tiers, prerequisite strictness, reframing surfacing)
+
+**New discovery/fetch functions added (`orchestrate.ts`):**
+
+- `discoverChallengeTypes(fetcher, canonUrl)` — finds articles tagged `challenge-type`, parses each, builds a per-canonUrl BM25 index over detection text. Per-canonUrl cache for types AND index.
+- `fetchBasePrerequisites(fetcher, canonUrl)` — fetches `odd/challenge/base-prerequisites.md`, extracts the prerequisite overlays table. Per-canonUrl cache.
+- `fetchNormativeVocabulary(fetcher, canonUrl)` — fetches `odd/challenge/normative-vocabulary.md`, extracts both vocabulary tables, compiles case-sensitive and case-insensitive regexes. Falls back to minimal RFC 2119 set if the article is missing. Per-canonUrl cache.
+- `fetchStakesCalibration(fetcher, canonUrl)` — fetches `odd/challenge/stakes-calibration.md`, extracts the calibration table. Per-canonUrl cache.
+
+**`runChallengeAction` refactored to:**
+
+- Load all four governance sources in parallel
+- Honor voice-dump suppression invariant — return empty challenge output when mode's tier list is empty
+- Detect matching types via BM25 over per-type detection text (score > 0 = match)
+- Resolve fallback type when no type scores > 0
+- Aggregate questions, prerequisite overlays (base + type), and reframings across matched types with deduplication
+- Apply stakes calibration filter based on mode (question tiers, prerequisite strictness, reframing surfacing)
+- Detect tensions in retrieved canon quotes via governance-driven vocabulary regex (replacing hardcoded `MUST`/`MUST NOT` checks)
+- Surface matched type names and definitions in the response (teaching the model what governs the behavior)
+- Mark `block_until_addressed` when calibration says so
+
+**`evaluatePrerequisiteCheck` helper added:** interprets natural-language `check` strings from prerequisite overlay tables. Extracts quoted keywords and tests presence in input. Special-cases URL, numeric, proper-noun, and citation patterns.
+
+**`runCleanupStorage` extended:** clears all five new caches (types, type-index, base prerequisites, normative vocabulary, stakes calibration). Mirror of the PR #96 fix for cache staleness on governance edits.
+
+**Dead code removed:** `detectClaimType` in `workers/src/orchestrate.ts` (only used by the old hardcoded `runChallengeAction`). Legacy version in `src/tasks/challenge.js` retained for backward-compat on the non-worker CLI path.
+
+**`workers/src/bm25.ts` extension (backward-compatible):**
+
+- `tokenize(text, stopWords?)` — new optional parameter. Defaults to the existing `STOP_WORDS` set (unchanged behavior for existing callers).
+- `buildBM25Index(documents, stopWords?)` — same. Records the stop word set on the returned index so `searchBM25` tokenizes queries consistently with doc vocabularies.
+- `BM25Index` interface gained an optional `stopWords?: Set<string>` field.
+- Motivation: the default `STOP_WORDS` filters out modal verbs (`must`, `should`, `shall`, `may`, `not`) which are the load-bearing detection signal for strong-claim, proposal, and assumption challenge types. Challenge-type detection needs a custom stop-word set that preserves modals.
+
+### 2. Verification Performed
+
+- `npm run typecheck` (workers/) — clean both before and after the BM25 pivot, and after the dead-code removal
+- `bash tests/smoke.sh` (root) — 6 PASS, exercising the legacy CLI path. Confirms backward compat preserved (the worker path I refactored is separate from the CLI path).
+- `node workers/test/governance-parser.test.mjs` — new parser-fidelity test, 94 assertions against live governance articles fetched from klappy.dev raw. **94 pass, 0 fail.** Includes explicit regression tests for stemming (`coin`/`coining`, `proposed`/`propose`, `principles`/`principle`) and multi-match semantics via BM25.
+- `oddkit_preflight` — surfaced constraints (ai-voice-cliches, author-identity-language, definition-of-done, supersession, prompt-over-code)
+- `oddkit_get` on `canon/methods/supersession.md` — confirmed this refactor is "replace" on the supersession spectrum (provenance preserved via PR description, commit message, ledger entry, retained legacy file)
+- AI voice clichés audit on new code/comments via `git diff | grep` for negation parallelism, formulaic transitions, puffing — clean, zero hits
+- `oddkit_challenge` on the commit decision — generic prereqs answered honestly in the PR description
+- `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason documented in PR #99 — flagged in PR as future refactor candidate
+
+### 3. Observed Behavior
+
+Parser-fidelity test output (94/94 passed):
+
+```
+─── Test 1: Challenge type parsing ───  (7 types × 8 assertions = 56 passing)
+─── Test 2: Fallback resolution ───  (2 passing — observation has fallback: true, others don't)
+─── Test 3: BM25 detection with stemming ───  (7 passing — each type matches its first trigger word)
+─── Test 3b: Stemming defeats the original coin/coining bug ───  (5 passing — stemming equivalence + 4 real-world inputs)
+─── Test 4: Multi-match semantics (BM25) ───  (3 passing)
+─── Test 4b: Empty input + irrelevant input do not over-match ───  (1 passing)
+─── Test 5: Base prerequisites ───  (4 passing)
+─── Test 6: Normative vocabulary ───  (4 passing)
+─── Test 7: Stakes calibration ───  (5 passing — including the voice-dump suppression invariant)
+
+94 passed, 0 failed
+```
+
+### 4. Evidence Produced
+
+This file. Plus the diffs:
+
+- `workers/src/orchestrate.ts`: ~560 insertions, ~70 deletions
+- `workers/src/bm25.ts`: small additive change (stopWords parameter threaded through tokenize/buildBM25Index/searchBM25, no behavior change for existing callers)
+- `workers/test/governance-parser.test.mjs`: new (~200 lines)
+- `docs/oddkit/evidence/challenge-governance-code-refactor.md`: this note
+
+Visual proof: **N/A — server-side code change.** No UI, no interaction surface, no visible state. The `oddkit_challenge` MCP tool's response shape changes (adds `mode`, `matched_types`, `type_definitions`, `block_until_addressed` fields; removes `claim_type`) but this is consumed programmatically, not rendered.
+
+### 5. Self-Audit Completed
+
+- **Intended outcome:** the worker path of `oddkit_challenge` becomes governance-driven via extraction from canon, mirroring PR #96. Behavior changes when the canon governance articles change — no code redeploy required. Detection is morphologically resilient via BM25 + stemming.
+- **Constraints applied:** Definition of Done (this file), Writing Canon (n/a — code, not document, but evidence note follows the structure), AI voice clichés (audited clean on new comments), supersession ("replace" with provenance preserved), prompt-over-code (the principle this implements), Vodka Architecture (server stays thin — extraction and IR, no domain opinion baked in).
+- **Decision rules followed:** mirror PR #96's cache pattern (per-canonUrl keying, try-catch-graceful-degradation per article); preserve legacy CLI path; voice-dump suppression as a load-bearing invariant; multi-match by design; honor `fallback: true` frontmatter for type fallback resolution; keep `bm25.ts` changes backward-compatible.
+- **Tradeoffs:** four governance fetches per challenge call (mitigated by per-canonUrl module-level cache, so cold start is the only slow path); BM25 index built per cache invalidation (cheap — 5–10 tiny docs); BM25 score magnitudes aren't intuitive constants (anyone tuning thresholds later will need to reason in relative terms); the Porter-style stemmer handles common English morphology but not irregular forms.
+- **Remaining risks:**
+  - Parser regex assumes specific table column order. If a future governance article reorders columns, parsing degrades silently. The parser-fidelity test catches this for currently-shipped articles but won't catch it for hypothetical future structure changes.
+  - `evaluatePrerequisiteCheck` uses heuristics over natural-language check descriptions. Some prerequisite checks may evaluate incorrectly — watch for false-negative gap messages in production logs.
+  - `oddkit_gate` still returns NOT_READY due to its own hardcoded prereqs — same architectural pattern as challenge pre-refactor. Future refactor candidate. Documented in PR.
+  - `oddkit_encode` still uses regex-OR detection with the same morphological brittleness this PR fixes for challenge. Follow-up PR required to bring encode to parity; the pivot here provides the blueprint.
+  - klappy.dev meta governance article (`odd/challenge-types/how-to-write-challenge-types.md`) describes the runtime as "compiles into a case-insensitive word-boundary regex" — that's now stale. Small coordinated klappy.dev PR required to update the language.
+
+---
+
+## Bugs the Gauntlet Caught (this refactor sequence)
+
+1. **PR #99 — 10 of 11 articles missing required `## Summary` sections.** Writing Canon tier 4 violation. Same failure mode as the Feb 2026 Progressive Disclosure Failure incident.
+2. **PR #99 — broken `derives_from` path** in `stakes-calibration.md` (`canon/epistemic-modes.md` → `canon/definitions/epistemic-modes.md`).
+3. **This PR — voice-dump suppression invariant would have shipped broken.** The calibration cell content is `"none (suppress all challenge)"` not bare `"none"`. Initial parser checked `=== "none"` with strict equality, would have produced a single-element array, voice-dump mode would have surfaced all challenge questions in production. Fixed by checking `tiersRaw === "none" || tiersRaw.startsWith("none ") || tiersRaw.startsWith("none(")`.
+4. **This PR (BM25 pivot) — morphological brittleness revealed.** The test `pattern-coinage fires on 'coin the term'` failed under regex because the article has `coining` as a trigger but not `coin`. This signal triggered the full pivot from regex-OR to BM25 + stemming.
+5. **This PR (BM25 pivot) — default `STOP_WORDS` would have silently broken strong-claim and proposal detection.** The default filter drops modal verbs (`must`, `should`, `shall`, `may`, `not`) — exactly the load-bearing trigger words for these two types. Caught because the parser-fidelity test asserted each type matches its first trigger word and two types failed. Fixed by extending `bm25.ts` with an optional `stopWords: Set<string>` parameter and defining a `CHALLENGE_STOP_WORDS` set in `orchestrate.ts` that preserves modals.
+
+**The discipline is load-bearing, not ceremony.** Five real bugs caught across two PRs. Two of the five would have caused silent production failures of invariants specifically named in the governance.
+
+---
+
+## Bugbot Review + Combine — A Sixth Layer of Catch
+
+After the regex version of this PR was first pushed, the Cursor bugbot reviewed it and flagged five issues, four of which were addressed via three follow-up commits on the remote branch (`31f8134`, `e9ef2f9`, `84932f0`). Those commits existed in parallel to the local BM25 pivot work and were not visible until the remote was fetched. Discovering them late led to a fork situation that was resolved by:
+
+1. Resetting local to the remote tip (preserving the three follow-up fixes plus the ledger entry on `a88abf7`)
+2. Cherry-picking the BM25 commit on top of the resolved tip
+3. Hand-porting the four polish fixes onto the BM25 base — they touch unrelated code regions so the port was mechanical
+4. Adding a fixup commit (`e82164b`) on top of the BM25 commit that captures the ports
+
+**Bugbot's five review items, all closed by this PR:**
+
+| Severity | Issue | Resolution |
+|----------|-------|------------|
+| High | Mode column not lowercased breaks voice-dump suppression | Lowercased at parse time AND at lookup |
+| Medium | Regex alternation order breaks multi-word directive matching (`MUST` before `MUST NOT`) | Sort vocab by length descending |
+| Medium | `first_1` reframings surfaces multiple instead of one | Slice from aggregated list, not per-type |
+| Medium | SUPPRESSED response missing `governance` field | Detection runs before suppression check; both responses share the `governance` shape |
+| Low | Dead code: `detectClaimType` has zero callers | Removed by the BM25 commit before bugbot reviewed |
+
+**Lesson recorded:** when encountering a divergent remote on an existing PR branch, fetch and read PR review comments first. Bugbot leaves structured comments that explain the divergent commits — checking saves the user from explaining what already exists.
+
+---
+
+## Version Tracking
+
+- Branch: `feat/e0008-challenge-governance-driven`
+- Post-merge: ledger entry capturing E0008 challenge code-refactor milestone
+- Related PRs:
+  - **Predecessor (structural mirror):** klappy/oddkit#96 (governance-driven encode refactor)
+  - **Depends on:** klappy/klappy.dev#99 (governance articles in canon — the inputs this code reads)
+  - **Immediate follow-up:** encode parity PR — bring `oddkit_encode` to BM25 + stemming using the pattern proven here
+  - **Small follow-up:** klappy.dev PR updating `how-to-write-challenge-types.md` — swap "compiles into a case-insensitive word-boundary regex" for the BM25 description
+  - **Future candidate:** governance-driven gate refactor (gate has the same hardcoded-logic gap as challenge pre-refactor; surfaced again during this gauntlet run)
diff --git a/odd/ledger/journal/2026-04-17-pr100-combined.md b/odd/ledger/journal/2026-04-17-pr100-combined.md
@@ -0,0 +1,48 @@
+# Session Journal — PR #100 BM25 Pivot + Bugbot Combine
+
+**Date:** 2026-04-17
+**PR:** klappy/oddkit#100 — feat(challenge): governance-driven runChallengeAction (E0008)
+**Branch:** `feat/e0008-challenge-governance-driven`
+**Final commit on branch:** `fd14a60`
+
+## DOLCHE
+
+### Decisions
+
+- **D1: Pivot from regex-OR to BM25 + stemming for challenge-type detection.** Triggered by gauntlet observation that `coin` doesn't match trigger word `coining`. Klappy proposed BM25 as the right tool; agreed, executed the swap.
+- **D2: Use a custom `CHALLENGE_STOP_WORDS` set, not the default `STOP_WORDS`.** Default filters modal verbs (`must`, `should`, `shall`, `may`, `not`) which are the signal for strong-claim and proposal types. Without this fix, those two type detections would have silently broken in production.
+- **D3: Detection runs BEFORE voice-dump suppression check.** Lets SUPPRESSED response include the `governance` field so the model sees what types matched even when questions are suppressed. Closes bugbot Medium item; also better UX.
+- **D4: `governance` is the canonical response key for matched-type definitions.** CHALLENGED and SUPPRESSED both use it; SUPPRESSED no longer returns `undefined` for shape parity.
+- **D5: Combine fork rather than discard either side.** Remote had 5 commits (regex base + 3 polish + ledger entry), local had BM25 pivot. Both contained real work. Reset to remote tip, cherry-picked BM25 on top, hand-ported the 4 polish fixes.
+
+### Observations
+
+- **O1: PR review comments explain divergent commits.** Bugbot left 5 structured review comments on PR #100. Reading them first would have surfaced what those 3 unfamiliar commits did within seconds. Lesson recorded.
+- **O2: Cherry-picking after staged conflicts didn't fold subsequent edits.** Hand-edits made AFTER `git add` but BEFORE `git cherry-pick --continue` sat in the working tree post-commit. Required a fixup commit. Workflow note for future merges.
+- **O3: BM25 default `STOP_WORDS` is tuned for prose, not directive language.** The general-purpose IR assumption that modals are filler doesn't hold in the challenge taxonomy. The opt-in `stopWords: Set<string>` extension is now available to other use cases that face the same issue.
+- **O4: BM25 already had phrase boost machinery.** `PHRASE_BOOST_EXACT` and `PHRASE_BOOST_PARTIAL` give multi-word triggers free score amplification with no additional code.
+
+### Learnings
+
+- **L1: Read PR review comments first when fork is detected.** Bugbot/cursor leaves structured comments that explain divergent commits. Standard practice from now on.
+- **L2: General-purpose IR stop-word lists are domain-hostile in directive matching.** Modal verbs are content, not filler, in any context where claims and proposals are the signal.
+- **L3: Stemming + BM25 is the right shape for canon-defined category matching.** Stems handle morphology, IDF weights distinctive terms, score > 0 preserves multi-match and fallback semantics. Pattern is now reusable for encode parity and future gate refactor.
+- **L4: The gauntlet caught one bug; bugbot caught four; the combine surfaced the fifth (dead code).** Different review tools catch different classes. The gauntlet is for "would this satisfy our governance"; bugbot is for "would this satisfy basic correctness." Both have value.
+
+### Constraints
+
+- **C1: Voice-dump mode MUST suppress questions, prereqs, and reframings — but MAY surface governance.** The invariant is about not pressure-testing during raw thought capture, not about hiding what types matched.
+- **C2: bm25.ts extensions must preserve backward compatibility.** Default parameter values mean existing callers (oddkit_search, future encode pivot) are unaffected. New behavior is opt-in only.
+
+### Handoffs
+
+- **H1: Encode parity PR.** Same regex-OR brittleness in `runEncodeAction`. Pattern proven here, port near-mechanical. Highest-priority follow-up.
+- **H2: klappy.dev meta governance update.** `how-to-write-challenge-types.md` references "compiles into a case-insensitive word-boundary regex" — now stale. Two-line PR.
+- **H3: Gate refactor candidate.** `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason challenge had pre-refactor. Same shape, same fix pattern.
+- **H4: Score-based confidence in `matched_types`.** Currently `[slug, slug, ...]`; trivial upgrade to `[{slug, score}, ...]` if any consumer wants relative confidence. Not blocking.
+
+### Encodes
+
+- This journal at `odd/ledger/journal/2026-04-17-pr100-combined.md`
+- Evidence note at `docs/oddkit/evidence/challenge-governance-code-refactor.md` (updated with bugbot+combine section)
+- PR #100 comment summarizing the combine: https://github.com/klappy/oddkit/pull/100#issuecomment-4266114513