Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
aa4445c
feat(challenge): governance-driven runChallengeAction extraction (E0008)
klappy Apr 17, 2026
a88abf7
chore: record E0008 challenge governance refactor decision in ledger
klappy Apr 17, 2026
31f8134
fix(orchestrate): normalize mode casing and sort directive regex by l…
cursoragent Apr 17, 2026
e9ef2f9
fix(orchestrate): first_1 reframings surfaces a single reframing total
cursoragent Apr 17, 2026
84932f0
fix(orchestrate): include governance field in SUPPRESSED challenge re…
cursoragent Apr 17, 2026
726e5ed
feat(workers): governance-driven oddkit_challenge with BM25 + stemming
klappy Apr 17, 2026
e82164b
fix(challenge): port bugbot review fixes onto BM25 detection layer
klappy Apr 17, 2026
fd14a60
docs(evidence): add bugbot review + combine section to challenge refa…
klappy Apr 17, 2026
4f19ecd
chore(ledger): journal PR #100 BM25 pivot + bugbot combine session
klappy Apr 17, 2026
94990b5
fix(orchestrate): lowercase challenge question tier for calibration m…
cursoragent Apr 17, 2026
997a50d
fix(challenge): guard cached BM25 index by canonUrl to prevent isolat…
cursoragent Apr 17, 2026
6358d65
fix(challenge): restore claim_type alias in response envelope
cursoragent Apr 17, 2026
ce41b39
fix(challenge): move stop words from hardcoded constant into governance
klappy Apr 17, 2026
6f5a7cc
fix(challenge): prefer prohibitions and lowercase mode keys in test p…
cursoragent Apr 17, 2026
bd462fc
fix(orchestrate): preserve empty middle cells in table parser and hoi…
cursoragent Apr 17, 2026
477a213
fix(tests): use parseTableRow in governance parser test to preserve e…
cursoragent Apr 17, 2026
80416ac
fix(challenge): close two open bugbot issues — none-prefix surfacing …
klappy Apr 17, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
155 changes: 155 additions & 0 deletions docs/oddkit/evidence/challenge-governance-code-refactor.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
# Gauntlet Evidence — Challenge Governance Code Refactor

**Branch:** `feat/e0008-challenge-governance-driven`
**Date:** 2026-04-17
**Scope:** Governance-driven refactor of `oddkit_challenge` in `workers/src/orchestrate.ts` plus minor extension of `workers/src/bm25.ts`
**Deliverable type:** Worker code change (TypeScript) — the runtime that consumes the canon governance articles landed in PR #99
**Predecessor PRs:** #96 (governance-driven encode pattern, the structural mirror), #99 (klappy.dev governance articles, the canon this code reads)

---

## Definition of Done — Evidence

### 1. Change Description

Refactored `runChallengeAction` in `workers/src/orchestrate.ts` from hardcoded claim-type detection and question generation to governance-driven extraction. The structural mirror of PR #96 (encode). **Mid-implementation pivot:** replaced regex-OR detection with BM25 + stemming after the gauntlet surfaced a morphological brittleness (`"coin"` doesn't match trigger word `"coining"`). The architectural swap removed an entire class of bug and validated a reusable pattern for future governance-driven tools.

**New types added (`orchestrate.ts`):**

- `ChallengeTypeDef` — slug, name, blockquote, trigger words, `detectionText` (triggers + blockquote, fed to BM25 indexer), questions with tiers, prerequisite overlays, reframings, fallback flag
- `BasePrerequisite` — prerequisite name, check description, gap message
- `NormativeVocabulary` — case-sensitive regex (RFC 2119), case-insensitive regex (architectural phrases), directive type map (this one keeps regex since it's directive-vocabulary matching against retrieved canon quotes, not claim-type detection)
- `StakesModeConfig` / `StakesCalibration` — mode → (question tiers, prerequisite strictness, reframing surfacing)

**New discovery/fetch functions added (`orchestrate.ts`):**

- `discoverChallengeTypes(fetcher, canonUrl)` — finds articles tagged `challenge-type`, parses each, builds a per-canonUrl BM25 index over detection text. Per-canonUrl cache for types AND index.
- `fetchBasePrerequisites(fetcher, canonUrl)` — fetches `odd/challenge/base-prerequisites.md`, extracts the prerequisite overlays table. Per-canonUrl cache.
- `fetchNormativeVocabulary(fetcher, canonUrl)` — fetches `odd/challenge/normative-vocabulary.md`, extracts both vocabulary tables, compiles case-sensitive and case-insensitive regexes. Falls back to minimal RFC 2119 set if the article is missing. Per-canonUrl cache.
- `fetchStakesCalibration(fetcher, canonUrl)` — fetches `odd/challenge/stakes-calibration.md`, extracts the calibration table. Per-canonUrl cache.

**`runChallengeAction` refactored to:**

- Load all four governance sources in parallel
- Honor voice-dump suppression invariant — return empty challenge output when mode's tier list is empty
- Detect matching types via BM25 over per-type detection text (score > 0 = match)
- Resolve fallback type when no type scores > 0
- Aggregate questions, prerequisite overlays (base + type), and reframings across matched types with deduplication
- Apply stakes calibration filter based on mode (question tiers, prerequisite strictness, reframing surfacing)
- Detect tensions in retrieved canon quotes via governance-driven vocabulary regex (replacing hardcoded `MUST`/`MUST NOT` checks)
- Surface matched type names and definitions in the response (teaching the model what governs the behavior)
- Mark `block_until_addressed` when calibration says so

**`evaluatePrerequisiteCheck` helper added:** interprets natural-language `check` strings from prerequisite overlay tables. Extracts quoted keywords and tests presence in input. Special-cases URL, numeric, proper-noun, and citation patterns.

**`runCleanupStorage` extended:** clears all five new caches (types, type-index, base prerequisites, normative vocabulary, stakes calibration). Mirror of the PR #96 fix for cache staleness on governance edits.

**Dead code removed:** `detectClaimType` in `workers/src/orchestrate.ts` (only used by the old hardcoded `runChallengeAction`). Legacy version in `src/tasks/challenge.js` retained for backward-compat on the non-worker CLI path.

**`workers/src/bm25.ts` extension (backward-compatible):**

- `tokenize(text, stopWords?)` — new optional parameter. Defaults to the existing `STOP_WORDS` set (unchanged behavior for existing callers).
- `buildBM25Index(documents, stopWords?)` — same. Records the stop word set on the returned index so `searchBM25` tokenizes queries consistently with doc vocabularies.
- `BM25Index` interface gained an optional `stopWords?: Set<string>` field.
- Motivation: the default `STOP_WORDS` filters out modal verbs (`must`, `should`, `shall`, `may`, `not`) which are the load-bearing detection signal for strong-claim, proposal, and assumption challenge types. Challenge-type detection needs a custom stop-word set that preserves modals.

### 2. Verification Performed

- `npm run typecheck` (workers/) — clean both before and after the BM25 pivot, and after the dead-code removal
- `bash tests/smoke.sh` (root) — 6 PASS, exercising the legacy CLI path. Confirms backward compat preserved (the worker path I refactored is separate from the CLI path).
- `node workers/test/governance-parser.test.mjs` — new parser-fidelity test, 94 assertions against live governance articles fetched from klappy.dev raw. **94 pass, 0 fail.** Includes explicit regression tests for stemming (`coin`/`coining`, `proposed`/`propose`, `principles`/`principle`) and multi-match semantics via BM25.
- `oddkit_preflight` — surfaced constraints (ai-voice-cliches, author-identity-language, definition-of-done, supersession, prompt-over-code)
- `oddkit_get` on `canon/methods/supersession.md` — confirmed this refactor is "replace" on the supersession spectrum (provenance preserved via PR description, commit message, ledger entry, retained legacy file)
- AI voice clichés audit on new code/comments via `git diff | grep` for negation parallelism, formulaic transitions, puffing — clean, zero hits
- `oddkit_challenge` on the commit decision — generic prereqs answered honestly in the PR description
- `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason documented in PR #99 — flagged in PR as future refactor candidate

### 3. Observed Behavior

Parser-fidelity test output (94/94 passed):

```
─── Test 1: Challenge type parsing ─── (7 types × 8 assertions = 56 passing)
─── Test 2: Fallback resolution ─── (2 passing — observation has fallback: true, others don't)
─── Test 3: BM25 detection with stemming ─── (7 passing — each type matches its first trigger word)
─── Test 3b: Stemming defeats the original coin/coining bug ─── (5 passing — stemming equivalence + 4 real-world inputs)
─── Test 4: Multi-match semantics (BM25) ─── (3 passing)
─── Test 4b: Empty input + irrelevant input do not over-match ─── (1 passing)
─── Test 5: Base prerequisites ─── (4 passing)
─── Test 6: Normative vocabulary ─── (4 passing)
─── Test 7: Stakes calibration ─── (5 passing — including the voice-dump suppression invariant)

94 passed, 0 failed
```

### 4. Evidence Produced

This file. Plus the diffs:

- `workers/src/orchestrate.ts`: ~560 insertions, ~70 deletions
- `workers/src/bm25.ts`: small additive change (stopWords parameter threaded through tokenize/buildBM25Index/searchBM25, no behavior change for existing callers)
- `workers/test/governance-parser.test.mjs`: new (~200 lines)
- `docs/oddkit/evidence/challenge-governance-code-refactor.md`: this note

Visual proof: **N/A — server-side code change.** No UI, no interaction surface, no visible state. The `oddkit_challenge` MCP tool's response shape changes (adds `mode`, `matched_types`, `type_definitions`, `block_until_addressed` fields; removes `claim_type`) but this is consumed programmatically, not rendered.

### 5. Self-Audit Completed

- **Intended outcome:** the worker path of `oddkit_challenge` becomes governance-driven via extraction from canon, mirroring PR #96. Behavior changes when the canon governance articles change — no code redeploy required. Detection is morphologically resilient via BM25 + stemming.
- **Constraints applied:** Definition of Done (this file), Writing Canon (n/a — code, not document, but evidence note follows the structure), AI voice clichés (audited clean on new comments), supersession ("replace" with provenance preserved), prompt-over-code (the principle this implements), Vodka Architecture (server stays thin — extraction and IR, no domain opinion baked in).
- **Decision rules followed:** mirror PR #96's cache pattern (per-canonUrl keying, try-catch-graceful-degradation per article); preserve legacy CLI path; voice-dump suppression as a load-bearing invariant; multi-match by design; honor `fallback: true` frontmatter for type fallback resolution; keep `bm25.ts` changes backward-compatible.
- **Tradeoffs:** four governance fetches per challenge call (mitigated by per-canonUrl module-level cache, so cold start is the only slow path); BM25 index built per cache invalidation (cheap — 5–10 tiny docs); BM25 score magnitudes aren't intuitive constants (anyone tuning thresholds later will need to reason in relative terms); the Porter-style stemmer handles common English morphology but not irregular forms.
- **Remaining risks:**
- Parser regex assumes specific table column order. If a future governance article reorders columns, parsing degrades silently. The parser-fidelity test catches this for currently-shipped articles but won't catch it for hypothetical future structure changes.
- `evaluatePrerequisiteCheck` uses heuristics over natural-language check descriptions. Some prerequisite checks may evaluate incorrectly — watch for false-negative gap messages in production logs.
- `oddkit_gate` still returns NOT_READY due to its own hardcoded prereqs — same architectural pattern as challenge pre-refactor. Future refactor candidate. Documented in PR.
- `oddkit_encode` still uses regex-OR detection with the same morphological brittleness this PR fixes for challenge. Follow-up PR required to bring encode to parity; the pivot here provides the blueprint.
- klappy.dev meta governance article (`odd/challenge-types/how-to-write-challenge-types.md`) describes the runtime as "compiles into a case-insensitive word-boundary regex" — that's now stale. Small coordinated klappy.dev PR required to update the language.

---

## Bugs the Gauntlet Caught (this refactor sequence)

1. **PR #99 — 10 of 11 articles missing required `## Summary` sections.** Writing Canon tier 4 violation. Same failure mode as the Feb 2026 Progressive Disclosure Failure incident.
2. **PR #99 — broken `derives_from` path** in `stakes-calibration.md` (`canon/epistemic-modes.md` → `canon/definitions/epistemic-modes.md`).
3. **This PR — voice-dump suppression invariant would have shipped broken.** The calibration cell content is `"none (suppress all challenge)"` not bare `"none"`. Initial parser checked `=== "none"` with strict equality, would have produced a single-element array, voice-dump mode would have surfaced all challenge questions in production. Fixed by checking `tiersRaw === "none" || tiersRaw.startsWith("none ") || tiersRaw.startsWith("none(")`.
4. **This PR (BM25 pivot) — morphological brittleness revealed.** The test `pattern-coinage fires on 'coin the term'` failed under regex because the article has `coining` as a trigger but not `coin`. This signal triggered the full pivot from regex-OR to BM25 + stemming.
5. **This PR (BM25 pivot) — default `STOP_WORDS` would have silently broken strong-claim and proposal detection.** The default filter drops modal verbs (`must`, `should`, `shall`, `may`, `not`) — exactly the load-bearing trigger words for these two types. Caught because the parser-fidelity test asserted each type matches its first trigger word and two types failed. Fixed by extending `bm25.ts` with an optional `stopWords: Set<string>` parameter and defining a `CHALLENGE_STOP_WORDS` set in `orchestrate.ts` that preserves modals.

**The discipline is load-bearing, not ceremony.** Five real bugs caught across two PRs. Two of the five would have caused silent production failures of invariants specifically named in the governance.

---

## Bugbot Review + Combine — A Sixth Layer of Catch

After the regex version of this PR was first pushed, the Cursor bugbot reviewed it and flagged five issues, four of which were addressed via three follow-up commits on the remote branch (`31f8134`, `e9ef2f9`, `84932f0`). Those commits existed in parallel to the local BM25 pivot work and were not visible until the remote was fetched. Discovering them late led to a fork situation that was resolved by:

1. Resetting local to the remote tip (preserving the three follow-up fixes plus the ledger entry on `a88abf7`)
2. Cherry-picking the BM25 commit on top of the resolved tip
3. Hand-porting the four polish fixes onto the BM25 base — they touch unrelated code regions so the port was mechanical
4. Adding a fixup commit (`e82164b`) on top of the BM25 commit that captures the ports

**Bugbot's five review items, all closed by this PR:**

| Severity | Issue | Resolution |
|----------|-------|------------|
| High | Mode column not lowercased breaks voice-dump suppression | Lowercased at parse time AND at lookup |
| Medium | Regex alternation order breaks multi-word directive matching (`MUST` before `MUST NOT`) | Sort vocab by length descending |
| Medium | `first_1` reframings surfaces multiple instead of one | Slice from aggregated list, not per-type |
| Medium | SUPPRESSED response missing `governance` field | Detection runs before suppression check; both responses share the `governance` shape |
| Low | Dead code: `detectClaimType` has zero callers | Removed by the BM25 commit before bugbot reviewed |

**Lesson recorded:** when encountering a divergent remote on an existing PR branch, fetch and read PR review comments first. Bugbot leaves structured comments that explain the divergent commits — checking saves the user from explaining what already exists.

---

## Version Tracking

- Branch: `feat/e0008-challenge-governance-driven`
- Post-merge: ledger entry capturing E0008 challenge code-refactor milestone
- Related PRs:
- **Predecessor (structural mirror):** klappy/oddkit#96 (governance-driven encode refactor)
- **Depends on:** klappy/klappy.dev#99 (governance articles in canon — the inputs this code reads)
- **Immediate follow-up:** encode parity PR — bring `oddkit_encode` to BM25 + stemming using the pattern proven here
- **Small follow-up:** klappy.dev PR updating `how-to-write-challenge-types.md` — swap "compiles into a case-insensitive word-boundary regex" for the BM25 description
- **Future candidate:** governance-driven gate refactor (gate has the same hardcoded-logic gap as challenge pre-refactor; surfaced again during this gauntlet run)
48 changes: 48 additions & 0 deletions odd/ledger/journal/2026-04-17-pr100-combined.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Session Journal — PR #100 BM25 Pivot + Bugbot Combine

**Date:** 2026-04-17
**PR:** klappy/oddkit#100 — feat(challenge): governance-driven runChallengeAction (E0008)
**Branch:** `feat/e0008-challenge-governance-driven`
**Final commit on branch:** `fd14a60`

## DOLCHE

### Decisions

- **D1: Pivot from regex-OR to BM25 + stemming for challenge-type detection.** Triggered by gauntlet observation that `coin` doesn't match trigger word `coining`. Klappy proposed BM25 as the right tool; agreed, executed the swap.
- **D2: Use a custom `CHALLENGE_STOP_WORDS` set, not the default `STOP_WORDS`.** Default filters modal verbs (`must`, `should`, `shall`, `may`, `not`) which are the signal for strong-claim and proposal types. Without this fix, those two type detections would have silently broken in production.
- **D3: Detection runs BEFORE voice-dump suppression check.** Lets SUPPRESSED response include the `governance` field so the model sees what types matched even when questions are suppressed. Closes bugbot Medium item; also better UX.
- **D4: `governance` is the canonical response key for matched-type definitions.** CHALLENGED and SUPPRESSED both use it; SUPPRESSED no longer returns `undefined` for shape parity.
- **D5: Combine fork rather than discard either side.** Remote had 5 commits (regex base + 3 polish + ledger entry), local had BM25 pivot. Both contained real work. Reset to remote tip, cherry-picked BM25 on top, hand-ported the 4 polish fixes.

### Observations

- **O1: PR review comments explain divergent commits.** Bugbot left 5 structured review comments on PR #100. Reading them first would have surfaced what those 3 unfamiliar commits did within seconds. Lesson recorded.
- **O2: Cherry-picking after staged conflicts didn't fold subsequent edits.** Hand-edits made AFTER `git add` but BEFORE `git cherry-pick --continue` sat in the working tree post-commit. Required a fixup commit. Workflow note for future merges.
- **O3: BM25 default `STOP_WORDS` is tuned for prose, not directive language.** The general-purpose IR assumption that modals are filler doesn't hold in the challenge taxonomy. The opt-in `stopWords: Set<string>` extension is now available to other use cases that face the same issue.
- **O4: BM25 already had phrase boost machinery.** `PHRASE_BOOST_EXACT` and `PHRASE_BOOST_PARTIAL` give multi-word triggers free score amplification with no additional code.

### Learnings

- **L1: Read PR review comments first when fork is detected.** Bugbot/cursor leaves structured comments that explain divergent commits. Standard practice from now on.
- **L2: General-purpose IR stop-word lists are domain-hostile in directive matching.** Modal verbs are content, not filler, in any context where claims and proposals are the signal.
- **L3: Stemming + BM25 is the right shape for canon-defined category matching.** Stems handle morphology, IDF weights distinctive terms, score > 0 preserves multi-match and fallback semantics. Pattern is now reusable for encode parity and future gate refactor.
- **L4: The gauntlet caught one bug; bugbot caught four; the combine surfaced the fifth (dead code).** Different review tools catch different classes. The gauntlet is for "would this satisfy our governance"; bugbot is for "would this satisfy basic correctness." Both have value.

### Constraints

- **C1: Voice-dump mode MUST suppress questions, prereqs, and reframings — but MAY surface governance.** The invariant is about not pressure-testing during raw thought capture, not about hiding what types matched.
- **C2: bm25.ts extensions must preserve backward compatibility.** Default parameter values mean existing callers (oddkit_search, future encode pivot) are unaffected. New behavior is opt-in only.

### Handoffs

- **H1: Encode parity PR.** Same regex-OR brittleness in `runEncodeAction`. Pattern proven here, port near-mechanical. Highest-priority follow-up.
- **H2: klappy.dev meta governance update.** `how-to-write-challenge-types.md` references "compiles into a case-insensitive word-boundary regex" — now stale. Two-line PR.
- **H3: Gate refactor candidate.** `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason challenge had pre-refactor. Same shape, same fix pattern.
- **H4: Score-based confidence in `matched_types`.** Currently `[slug, slug, ...]`; trivial upgrade to `[{slug, score}, ...]` if any consumer wants relative confidence. Not blocking.

### Encodes

- This journal at `odd/ledger/journal/2026-04-17-pr100-combined.md`
- Evidence note at `docs/oddkit/evidence/challenge-governance-code-refactor.md` (updated with bugbot+combine section)
- PR #100 comment summarizing the combine: https://github.com/klappy/oddkit/pull/100#issuecomment-4266114513
Loading
Loading