Promote PR #100: governance-driven challenge with BM25 + stemming to prod by klappy · Pull Request #101 · klappy/oddkit

klappy · 2026-04-17T15:18:19Z

Promote E0008 challenge governance refactor (PR #100) to production

Promotes 18 commits from main to prod, all originating from PR #100 (governance-driven oddkit_challenge refactor with BM25 + stemming).

Scope verification

Diff confirmed clean — only challenge-related code, evidence note, ledger entry, parser test, and the small backward-compatible bm25.ts extension. runOrientAction, runGateAction, runEncodeAction bodies are byte-identical between main and prod. The two near-orient diff hunks are line-number shifts only (dead detectClaimType removed above orient; new pickStrongestDirective added after it).

What ships

- Governance-driven oddkit_challenge: claim type detection, questions, prerequisites, tension vocabulary, and stakes calibration all read from canon at runtime, mirroring the encode pattern from PR #96 (feat: governance-driven encode architecture)
- BM25 + stemming detection: replaces brittle regex-OR matching; coining and coin now map to the same stem
- bm25.ts extension: backward-compatible stopWords: Set<string> parameter on tokenize, buildBM25Index, and searchBM25. Default behavior unchanged for all existing callers, including oddkit_search
- Detection Noise governance: stop words for challenge-type detection now live in odd/challenge/normative-vocabulary.md (klappy.dev#100, already merged), not as a hardcoded constant in worker source
- Voice-dump suppression invariant: load-bearing; the parser tolerates "none (parenthetical)" cell content
- Shared parseTableRow helper: preserves empty interior cells across all six governance table parsers
- Response shape parity: both CHALLENGED and SUPPRESSED responses include the governance field
- Backward compat: claim_type alias retained in response envelope

Verification on `main`

TypeScript typecheck: clean
Parser-fidelity test (workers/test/governance-parser.test.mjs): 97/97 against live klappy.dev main
Smoke tests: 6/6 (legacy CLI path unaffected)
Cloudflare Workers preview deployed automatically on each commit; production deploy follows this merge

Bugs caught and fixed during PR #100

15+ across three review surfaces: oddkit gauntlet (3 governance/architectural), bugbot (12+ code-correctness across multiple defect classes), human review (1 Vodka Architecture violation). All resolved before merge to main.

Follow-ups (not blocking this promotion) — same anti-pattern in three remaining tools

The challenge refactor proved out a reusable pattern: governance-driven extraction with per-canonUrl caches, BM25 + stemming for detection, response-shape parity, parseTableRow safety. Three other tools still carry the pre-refactor shape and should be ported next:

- oddkit_encode parity: runEncodeAction still uses regex-OR matching for encoding-type detection; same morphological brittleness as challenge pre-refactor. The pattern is proven, so the port will be near-mechanical
- oddkit_gate refactor: runGateAction has hardcoded exploration→planning and planning→execution prereq lists; same hardcoded-logic gap as challenge pre-refactor (NOT_READY false negatives demonstrated twice during work on PR #100, feat(challenge): governance-driven runChallengeAction (E0008))
- oddkit_orient refactor: runOrientAction hardcodes governance in source in three places, namely per-mode question lists (lines 1489–1508), assumption-detection regex on modal verbs (line 1482), and the "Proactive posture" governance prose baked in as a string literal (line 1528). All three should move to canon articles parallel to odd/challenge/. The proactive-posture string is especially load-bearing: evolving the posture currently requires a worker redeploy

Refs

main PR: #100, feat(challenge): governance-driven runChallengeAction (E0008) (merged)
Companion canon PR: klappy.dev#100, feat(challenge): add Detection Noise vocabulary to normative-vocabulary (merged)
Predecessor: #96, feat: governance-driven encode architecture (the structural mirror)

Mirrors the PR #96 encode pattern. Extracts challenge behavior from live governance articles (landed in klappy.dev canon via PR #99) rather than hardcoded source logic.

New functions in workers/src/orchestrate.ts:
- discoverChallengeTypes: per-canonUrl cached type discovery
- fetchBasePrerequisites: universal prerequisite checks
- fetchNormativeVocabulary: RFC 2119 + architectural load-bearing terms
- fetchStakesCalibration: mode-to-depth filter
- extractPrereqTable / extractKeywordsFromCheck: shared helpers

Refactored:
- runChallengeAction: replaces hardcoded detectClaimType / generateChallenges / findTensions / findMissingPrerequisites with governance extraction. Supports multi-match. Filters output by stakes calibration based on mode parameter.
- runCleanupStorage: clears all four new caches on invalidation

Invariant: voice-dump mode suppresses all challenge output regardless of matched types. Load-bearing per stakes-calibration governance: some modes exist for raw capture, and pressure-testing at that stage damages the mode.

Graceful degradation: missing governance articles fall back to minimal built-in behavior with warnings, rather than failing.

Co-authored-by: Claude <noreply@anthropic.com>
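The per-canonUrl caching and graceful degradation described in this commit can be pictured roughly as follows. This is a minimal sketch, not the code in workers/src/orchestrate.ts: the class name `PerCanonCache`, the example URLs, and the loader are hypothetical, and the sketch is synchronous for brevity, whereas the worker presumably fetches articles asynchronously.

```typescript
// Hypothetical sketch: one cache entry per canonUrl, plus a built-in fallback
// that is returned (with a warning) when the governance article cannot be loaded.
interface GovernanceResult<T> {
  value: T;
  warnings: string[];
}

class PerCanonCache<T> {
  private entries: Map<string, T> = new Map();
  private loader: (canonUrl: string) => T;
  private fallback: T;

  constructor(loader: (canonUrl: string) => T, fallback: T) {
    this.loader = loader;
    this.fallback = fallback;
  }

  get(canonUrl: string): GovernanceResult<T> {
    const cached = this.entries.get(canonUrl);
    if (cached !== undefined) return { value: cached, warnings: [] };
    try {
      const loaded = this.loader(canonUrl);
      this.entries.set(canonUrl, loaded);
      return { value: loaded, warnings: [] };
    } catch (err) {
      // Degrade instead of failing. The fallback is deliberately not cached,
      // so a later call can retry the load (an assumption of this sketch).
      const reason = err instanceof Error ? err.message : String(err);
      return {
        value: this.fallback,
        warnings: ["governance unavailable for " + canonUrl + ": " + reason],
      };
    }
  }

  // Counterpart of the cache clears added to runCleanupStorage.
  clear(): void {
    this.entries.clear();
  }
}

// Illustrative loader: counts calls and fails for one made-up URL.
let vocabLoads = 0;
const vocabCache = new PerCanonCache<string[]>(
  (canonUrl) => {
    vocabLoads++;
    if (canonUrl.indexOf("unreachable") !== -1) throw new Error("fetch failed");
    return ["MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "load-bearing"];
  },
  ["MUST", "MUST NOT", "SHOULD", "SHOULD NOT"], // minimal RFC 2119 fallback
);
```

Keying the map by canonUrl means two canon sources never share an entry, which is the same concern the later "guard cached BM25 index by canonUrl" commit addresses for the index.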


Refactor runChallengeAction in workers/src/orchestrate.ts to extract challenge-type behavior from canon governance articles at runtime rather than hardcoding claim-type detection, questions, prerequisites, and tension rules in source. Structural mirror of PR #96 (encode).

Detection upgraded mid-implementation from regex-OR to BM25 + stemming after the gauntlet revealed that regex-based matching was morphologically brittle ("coin" doesn't match trigger "coining"). The pivot removed an entire class of bug and seeded a reusable pattern for future governance-driven tools.

Changes in workers/src/orchestrate.ts:
- New: ChallengeTypeDef, BasePrerequisite, NormativeVocabulary, StakesModeConfig, StakesCalibration
- New: discoverChallengeTypes (builds per-canonUrl BM25 index over detection text), fetchBasePrerequisites, fetchNormativeVocabulary, fetchStakesCalibration, each with per-canonUrl cache and graceful degradation on missing articles
- New: evaluatePrerequisiteCheck, which interprets natural-language check strings from prerequisite overlay tables
- Refactored runChallengeAction: multi-match via BM25 score > 0, base + overlay prerequisite aggregation, stakes calibration filtering, voice-dump suppression invariant, governance-driven tension detection
- Extended runCleanupStorage with five new cache clears (types, type-index, base prerequisites, vocabulary, calibration)
- Removed dead detectClaimType (legacy src/tasks/challenge.js retains its copy for CLI backward-compat)
- Added CHALLENGE_STOP_WORDS set preserving modal verbs as signal

Changes in workers/src/bm25.ts (backward-compatible extension):
- tokenize(), buildBM25Index() accept optional stopWords: Set<string>
- BM25Index gains optional stopWords field so searchBM25 tokenizes queries consistently with the index
- Default behavior unchanged; existing callers unaffected
- Motivation: default STOP_WORDS filters modals (must, should, shall, may, not) which are signal for challenge-type detection

New tests:
- workers/test/governance-parser.test.mjs: 94 assertions against live governance articles fetched from klappy.dev raw. Covers type parsing, fallback resolution, BM25 detection, stemming regression cases (coin/coining, propose/proposed, principle/principles), multi-match, and the voice-dump suppression invariant. 94/94 pass.

Bugs the gauntlet caught on this PR:
1. Voice-dump suppression invariant would have shipped broken: the calibration cell reads "none (suppress all challenge)", not bare "none". A strict-equality parser would have produced a single-element array, and voice-dump mode would have surfaced all challenges in prod.
2. Morphological brittleness in regex detection (coin vs coining), which triggered the pivot to BM25 + stemming.
3. Default BM25 STOP_WORDS silently breaks strong-claim and proposal detection by filtering modal verbs. Fixed via custom stop word set.

Verification:
- npm run typecheck: clean
- tests/smoke.sh: 6/6 pass (legacy CLI path, backward compat preserved)
- workers/test/governance-parser.test.mjs: 94/94 pass
- AI voice clichés audit on new comments: clean
- oddkit_preflight, challenge, gate, validate: all run; gate NOT_READY due to same hardcoded-logic gap as challenge pre-refactor (flagged as follow-up)

Response shape change: adds mode, matched_types, type_definitions, block_until_addressed; removes claim_type. Consumed programmatically, not rendered.

Follow-ups flagged:
- Encode parity PR: same regex-OR brittleness in runEncodeAction; pattern proven here, port will be near-mechanical
- klappy.dev meta governance PR: "compiles into a case-insensitive word-boundary regex" is now stale language
- Gate refactor candidate: same hardcoded-logic shape as challenge pre-refactor

Refs:
- Depends on: klappy/klappy.dev#99 (governance articles this code reads)
- Structural mirror: #96 (governance-driven encode)
- Evidence: docs/oddkit/evidence/challenge-governance-code-refactor.md
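To make the stop-word problem in bug 3 concrete, here is a small sketch of a tokenizer with an optional stopWords parameter. It is an illustration under assumptions, not the bm25.ts implementation: `tokenizeSketch` and both word lists are made up (only the five modals come from the commit message), and stemming is left out.

```typescript
// Assumed default list for illustration only. The commit message says the real
// default filters the modals must, should, shall, may, not; the rest is invented.
const DEFAULT_STOP_WORDS_SKETCH: Set<string> = new Set([
  "the", "a", "an", "of", "to",
  "must", "should", "shall", "may", "not",
]);

// Optional second parameter keeps existing callers unchanged:
// omit it and the default list applies.
function tokenizeSketch(
  text: string,
  stopWords: Set<string> = DEFAULT_STOP_WORDS_SKETCH,
): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((tok) => tok.length > 0 && !stopWords.has(tok));
}

// A challenge-specific list that drops filler but keeps modals as signal.
const challengeNoiseSketch: Set<string> = new Set(["the", "a", "an", "of", "to"]);

const claimText = "The service must not retry";
const withDefaults = tokenizeSketch(claimText);
const withChallengeList = tokenizeSketch(claimText, challengeNoiseSketch);
```

With the default list the modal tokens disappear before any scoring happens, so a strong-claim detector keyed on "must" has nothing to match. Passing the same set at index-build time and at query time is what keeps the two sides consistent.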

Re-applies the four review fixes from sibling commits (31f8134, e9ef2f9, 84932f0) and the dead-code removal that the bugbot review also flagged, on top of the BM25 + stemming detection swap.
- Vocabulary regex sorted by length descending so 'MUST NOT' matches before 'MUST' (closes bugbot 'Regex alternation order')
- Stakes calibration mode column lowercased at parse time AND mode normalized to lowercase at lookup time (closes bugbot 'Mode column not lowercased breaks voice-dump suppression')
- first_1 reframings policy now surfaces a single reframing total across all matched types, not one per type (closes bugbot 'first_1 reframings surfaces multiple instead of one')
- Detection runs BEFORE voice-dump suppression check, and SUPPRESSED response includes the governance field for shape parity with CHALLENGED (closes bugbot 'SUPPRESSED response missing governance')
- Renames type_definitions to governance in CHALLENGED response so both statuses return the same shape under the same key
- Dead detectClaimType already removed by the BM25 commit (closes bugbot 'Dead code: detectClaimType has zero callers')

Verification:
- npm run typecheck: clean
- workers/test/governance-parser.test.mjs: 94/94 pass
- tests/smoke.sh: 6/6 pass
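The alternation-order fix is easy to reproduce in isolation. The sketch below assumes the vocabulary arrives as a plain array of terms; `buildDirectiveRegexSketch` and `escapeRegExpSketch` are hypothetical names, not functions from the worker.

```typescript
// Escape regex metacharacters so vocabulary terms are matched literally.
function escapeRegExpSketch(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one alternation from the vocabulary. When sortByLength is true,
// longer terms come first so "MUST NOT" is tried before its prefix "MUST".
function buildDirectiveRegexSketch(terms: string[], sortByLength: boolean): RegExp {
  const ordered = sortByLength
    ? terms.slice().sort((a, b) => b.length - a.length)
    : terms.slice();
  const alternation = ordered.map(escapeRegExpSketch).join("|");
  return new RegExp("\\b(?:" + alternation + ")\\b", "g");
}

const rfcTerms = ["MUST", "MUST NOT", "SHOULD", "SHOULD NOT"];
const sampleExcerpt = "Callers MUST NOT cache the response";

// Regex alternation takes the first alternative that matches at a position,
// not the longest one, so list order decides the result.
const unsortedHits = sampleExcerpt.match(buildDirectiveRegexSketch(rfcTerms, false));
const sortedHits = sampleExcerpt.match(buildDirectiveRegexSketch(rfcTerms, true));
```

In the unsorted case the engine accepts "MUST" because the word boundary after it holds, and the prohibition is reported as a requirement. Sorting by length descending is a cheap way to get longest-match behavior without hand-ordering the vocabulary.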

…ctor evidence

Captures the fork-resolution and bugbot-review-driven fixes as a sixth layer of catch alongside the gauntlet bugs. Records the lesson: read PR review comments before treating divergent remote as unknown work.


Caught in PR #100 review by Klappy: the CHALLENGE_STOP_WORDS Set added mid-PR to fix a BM25 over-match was itself a Vodka Architecture violation in a refactor explicitly about removing such violations. The constant carried a domain opinion ('modals are signal, articles are filler in challenge detection') that belonged in canon, not in worker source.

Anti-pattern fixed:
- Drop the hardcoded CHALLENGE_STOP_WORDS Set from workers/src/orchestrate.ts
- Drop the duplicate hardcoded copy from workers/test/governance-parser.test.mjs
- Extend NormativeVocabulary interface with stopWords: Set<string>
- Extend fetchNormativeVocabulary to extract '## Detection Noise' code block from odd/challenge/normative-vocabulary.md (lands in klappy.dev#100)
- Move BM25 index build out of discoverChallengeTypes into a new lazy builder getOrBuildChallengeTypeIndex(types, vocab, canonUrl) so the index can use governance-sourced stop words rather than a constant
- Update parser test to fetch Detection Noise the same way the worker does: no hardcoded duplicate, no drift risk. Test gains 3 new assertions: Detection Noise parses non-empty, excludes modal verbs, includes common filler

Net hardcoded-constants delta: this PR removes ~6 classes of hardcoded domain opinion (claim type detection, questions, prereqs, tension regex, reframings, stop words) and adds zero. The remaining minimal RFC 2119 fallback ('MUST', 'MUST NOT', 'SHOULD', 'SHOULD NOT') and 'planning' default mode are server-availability fallbacks for when canon is unreachable, not domain governance.

Test currently runs against the feature branch via KLAPPYDEV_RAW env override. After klappy.dev#100 merges, the override comes off and the test reads from main with no further changes.

Verification:
- npm run typecheck: clean
- workers/test/governance-parser.test.mjs (vs feature branch): 97/97 pass
- tests/smoke.sh: 6/6 pass
- grep CHALLENGE_STOP_WORDS in workers/ and src/: zero matches

Refs:
- Caught in: this PR review by Klappy
- Depends on: klappy/klappy.dev#100 (Detection Noise section)
- Lesson: 'is this the right architectural shape' is a category the current gauntlet does not catch. The tools verify governance content, not whether new code is creating new ungoverned content. Possible future tool: a vodka-audit that flags non-trivial Sets/Maps/lists in worker source and asks 'should this be in canon?'
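As a rough picture of what "extract the '## Detection Noise' code block" could involve, the sketch below walks a markdown string, finds that heading, and collects the words from the first fenced block under it. The article's actual layout is not shown in this PR, so the whitespace-or-comma separated format, the sample article, and the name `extractDetectionNoiseSketch` are all assumptions.

```typescript
// Three backticks, built at runtime so this example can itself sit inside a fence.
const FENCE_MARKER = "`".repeat(3);

// Hypothetical extractor: returns the words found in the first fenced code
// block under a level-2 "Detection Noise" heading, lowercased.
function extractDetectionNoiseSketch(markdown: string): Set<string> {
  const words = new Set<string>();
  let inSection = false;
  let inFence = false;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (!inFence && /^##\s+/.test(line)) {
      // Any level-2 heading either opens the target section or closes it.
      inSection = /^##\s+Detection Noise\s*$/i.test(line);
      continue;
    }
    if (!inSection) continue;
    if (line.indexOf(FENCE_MARKER) === 0) {
      if (inFence) break; // first block finished, stop reading
      inFence = true;
      continue;
    }
    if (inFence) {
      line.split(/[\s,]+/).forEach((word) => {
        if (word.length > 0) words.add(word.toLowerCase());
      });
    }
  }
  return words;
}

// Made-up article content for illustration.
const sampleArticle = [
  "# Normative Vocabulary",
  "",
  "## Detection Noise",
  "",
  "Filler words ignored during challenge-type detection.",
  FENCE_MARKER,
  "the a an",
  "of, to",
  FENCE_MARKER,
  "",
  "## Another Section",
  FENCE_MARKER,
  "must",
  FENCE_MARKER,
].join("\n");

const noiseWords = extractDetectionNoiseSketch(sampleArticle);
```

Because the test suite and the worker would both read the same section, a sketch like this also shows why the duplicate constant in the test file could be dropped: there is only one source to drift from.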

…arser
- Use matchAll and prefer prohibition directive type over leftmost requirement match so excerpts like 'You MUST X and MUST NOT Y' surface the prohibition. Regex switched to global flag to support matchAll.
- Port fetchStakesCalibration toLowerCase fix to the fidelity test parser so byMode keys stay lowercase even if governance introduces capitalized mode names.
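The "prefer prohibition over leftmost requirement" rule can be sketched as a scan over every directive match in an excerpt. This is a hedged illustration: the type names, the fixed four-term vocabulary, and the rule that a term ending in NOT is a prohibition are assumptions, and the loop uses RegExp.exec with the global flag, which visits the same matches matchAll would.

```typescript
type DirectiveKindSketch = "prohibition" | "requirement";

interface DirectiveHitSketch {
  text: string;
  kind: DirectiveKindSketch;
  index: number;
}

// Assumed classification rule: terms ending in NOT are prohibitions.
function classifyDirectiveSketch(term: string): DirectiveKindSketch {
  return /\bNOT$/i.test(term.trim()) ? "prohibition" : "requirement";
}

function pickStrongestDirectiveSketch(excerpt: string): DirectiveHitSketch | null {
  // Longest alternatives first so MUST NOT is not shadowed by MUST.
  const directive = /\b(?:SHOULD NOT|MUST NOT|SHOULD|MUST)\b/g;
  let strongest: DirectiveHitSketch | null = null;
  let match: RegExpExecArray | null;
  while ((match = directive.exec(excerpt)) !== null) {
    const hit: DirectiveHitSketch = {
      text: match[0],
      kind: classifyDirectiveSketch(match[0]),
      index: match.index,
    };
    // Keep the leftmost hit unless a later one is a prohibition
    // and the current best is only a requirement.
    if (
      strongest === null ||
      (strongest.kind === "requirement" && hit.kind === "prohibition")
    ) {
      strongest = hit;
    }
  }
  return strongest;
}

const mixedExcerpt = "You MUST log the call and MUST NOT store the payload";
const strongestHit = pickStrongestDirectiveSketch(mixedExcerpt);
```

A single non-global match would stop at the first "MUST" and report a requirement, which is the behavior this commit moves away from.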

…st pickStrongest

Two fixes:
1. Table row parser (6 call sites) was using .split('|').map(trim).filter(c => c.length > 0), which also drops legitimately empty interior cells, silently collapsing the column count. In fetchStakesCalibration this would silently drop a voice-dump row with an empty tiers cell, breaking the suppression invariant with no error signal. Introduce parseTableRow helper that only strips the leading/trailing empties produced by surrounding pipes, preserving empty middle cells.
2. Hoist pickStrongest (now pickStrongestDirective) out of the per-entry loop in runChallengeAction. It captures no loop-scoped state, so defining it inside the loop needlessly re-allocated the closure on each iteration and misled readers about its scope. Matches the placement of evaluatePrerequisiteCheck.
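The difference between the two row parsers described in fix 1 shows up on any row with an empty middle cell. The sketch below is an assumption-laden illustration: `parseTableRowSketch` is not the helper from the PR, and the three-column row is invented for the example.

```typescript
// Old pattern from the commit message: filtering drops every empty cell,
// including legitimate empty cells in the middle of a row.
function lossyRowParse(line: string): string[] {
  return line
    .split("|")
    .map((cell) => cell.trim())
    .filter((cell) => cell.length > 0);
}

// Sketch of the safer approach: strip only the empties that the surrounding
// pipes produce at the very start and end of the split result.
function parseTableRowSketch(line: string): string[] {
  const cells = line.split("|").map((cell) => cell.trim());
  if (cells.length > 0 && cells[0] === "") cells.shift();
  if (cells.length > 0 && cells[cells.length - 1] === "") cells.pop();
  return cells;
}

// Invented row: a mode, an empty middle cell, and a "none (...)" cell.
const voiceDumpRow = "| voice-dump |  | none (suppress all challenge) |";

const lossyCells = lossyRowParse(voiceDumpRow);
const safeCells = parseTableRowSketch(voiceDumpRow);
```

With the lossy parser the row comes back one column short, so any code that indexes cells by position reads the wrong column or rejects the row. The safer version keeps the column count stable, which is what positional lookups rely on.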


…and dead branch

Two issues from bugbot's 14:29 review:
1. Reframing 'none' check applies same defensive pattern as the tiersRaw fix in fetchStakesCalibration. The cell may be 'none' or 'none (parenthetical reason)'; strict equality would silently surface all reframings via the 'all' fallback when authors include explanatory text. Same defect class as bug #3 in the evidence note; sweep applied.
2. Remove unreachable questionTiers.length === 0 branch in the question-surfacing condition. The SUPPRESSED early-return at line 1635 already handles that case, so the branch was dead code that misleadingly suggested 'surface all questions for empty tiers' semantics. The actual semantic is full suppression.

Verified: typecheck clean, parser test 97/97 against main, smoke 6/6. Defect-class sweep on governance cell strict-equality checks: only two sites (tiersRaw, surfacing), both now defensive.
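The defensive 'none' check from issue 1 might look something like this. It is a sketch only: the function names, the comma-separated tier format, and the tier labels in the usage lines are assumptions rather than details from the governance articles.

```typescript
// Treat "none", "None", and "none (any explanation)" as the same sentinel.
// The word boundary keeps values such as "nonessential" from matching.
function isNoneCellSketch(raw: string): boolean {
  return /^none\b/i.test(raw.trim());
}

// Strict version for comparison: explanatory text defeats the check.
function isNoneCellStrict(raw: string): boolean {
  return raw.trim() === "none";
}

// Assumed cell format: comma-separated tier names, or a "none" sentinel.
function parseTiersCellSketch(raw: string): string[] {
  if (isNoneCellSketch(raw)) return [];
  return raw
    .split(",")
    .map((tier) => tier.trim().toLowerCase())
    .filter((tier) => tier.length > 0);
}

const calibrationCell = "none (suppress all challenge)";
const strictResult = isNoneCellStrict(calibrationCell);
const tolerantTiers = parseTiersCellSketch(calibrationCell);
const ordinaryTiers = parseTiersCellSketch("Tier-A, tier-b");
```

An empty array is what lets the caller take the suppression path. With the strict check the same cell would be parsed as one tier whose name is the whole sentence, which is the single-element array described in the gauntlet's first bug.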


cloudflare-workers-and-pages · 2026-04-17T15:18:25Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	oddkit	`6e01a00`	Commit Preview URL Branch Preview URL	Apr 17 2026, 03:17 PM

klappy · 2026-04-17T15:23:19Z

⚠ HOLD — schema bug found by preview test

After opening this promotion PR, I actually tested the production preview at https://main-oddkit.klappy.workers.dev — which I should have done before opening the PR.

The MCP tool's mode parameter Zod schema accepts only 3 modes (exploration, planning, execution), but odd/challenge/stakes-calibration defines 9 modes. The 6 writing-lifecycle modes (voice-dump, drafting, peer-review-ready, canon-tier-2, canon-tier-1, published-essay) are unreachable from the public API.

Net effect: the voice-dump suppression invariant — the load-bearing feature this PR's evidence note specifically calls out — cannot be invoked through the public MCP tool. Schema rejects the call before runtime ever sees it.

PR #102 fixes this with a 10-line schema expansion. Hold this promotion until #102 lands and merges to main. Then this PR fast-forwards and the prod deploy ships a feature whose headline invariant actually works.
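The shape of the gap is simple to show without the schema library. The sketch below is a dependency-free stand-in for the Zod enum, not the code in PR #102: the constant and function names are hypothetical, and only the nine mode strings come from the comment above.

```typescript
// The three modes the schema accepted before the fix.
const MODES_BEFORE_FIX = ["exploration", "planning", "execution"];

// All nine modes named in this comment: the original three plus the six
// writing-lifecycle modes.
const MODES_AFTER_FIX = [
  "exploration",
  "planning",
  "execution",
  "voice-dump",
  "drafting",
  "peer-review-ready",
  "canon-tier-2",
  "canon-tier-1",
  "published-essay",
];

// Stand-in for schema validation: reject anything outside the allowed list
// before the handler runs.
function validateModeSketch(input: unknown, allowed: string[]): string {
  if (typeof input === "string" && allowed.indexOf(input) !== -1) return input;
  throw new Error("invalid mode: " + String(input));
}

// Helper so the outcome can be inspected without try/catch at each call site.
function isAcceptedSketch(input: unknown, allowed: string[]): boolean {
  try {
    validateModeSketch(input, allowed);
    return true;
  } catch (err) {
    return false;
  }
}

const voiceDumpBefore = isAcceptedSketch("voice-dump", MODES_BEFORE_FIX);
const voiceDumpAfter = isAcceptedSketch("voice-dump", MODES_AFTER_FIX);
```

Because validation runs before the handler, no amount of correct suppression logic inside runChallengeAction can help while the boundary list is short. That is also why none of the parser or type-level checks noticed: they exercise the code behind the boundary, not the boundary itself.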

What the production preview confirmed works

✅ oddkit_orient — 7s warm
✅ oddkit_challenge with mode: planning — multi-match, BM25 stemming (coin the term → pattern-coinage), governance field present, all working
✅ oddkit_challenge multi-match — must always coining fires strong-claim + pattern-coinage + principle-extraction correctly
❌ oddkit_challenge with mode: voice-dump — schema validation error before runtime

Lesson

Three verification layers passed (typecheck, parser-fidelity 97/97, smoke 6/6) and the deploy succeeded. None exercised the public API contract end-to-end with the load-bearing mode value. Testing the running preview is not optional.

Captures DOLCHE for the session that delivered PR #100 (governance-driven challenge refactor with BM25 + stemming) and the unresolved schema bug that made the voice-dump suppression invariant unreachable from the public API.

Critical for next model picking up:
- PR #101 (prod promotion) is BLOCKED: schema fix not yet merged
- fix/challenge-mode-schema-includes-writing-modes has the fix
- After fix lands, manually curl preview with mode=voice-dump before promoting

Also records lessons for the next session: defect-class sweep discipline, public API contract verification, parser test flakiness, and three follow-up refactors carrying the same anti-pattern as challenge pre-refactor (encode, gate, orient).

klappy and others added 18 commits April 17, 2026 05:33

chore: record E0008 challenge governance refactor decision in ledger

a88abf7

fix(orchestrate): normalize mode casing and sort directive regex by l…

31f8134

…ength

fix(orchestrate): first_1 reframings surfaces a single reframing total

e9ef2f9

fix(orchestrate): include governance field in SUPPRESSED challenge re…

84932f0

…sponse

chore(ledger): journal PR #100 BM25 pivot + bugbot combine session

4f19ecd

fix(orchestrate): lowercase challenge question tier for calibration m…

94990b5

…atch

fix(challenge): guard cached BM25 index by canonUrl to prevent isolat…

997a50d

…e cross-contamination

fix(challenge): restore claim_type alias in response envelope

6358d65

fix(tests): use parseTableRow in governance parser test to preserve e…

477a213

…mpty cells

Merge pull request #100 from klappy/feat/e0008-challenge-governance-d…

6e01a00

…riven

klappy mentioned this pull request Apr 17, 2026

fix(mcp): expand challenge mode enum to all 9 modes — unblocks voice-dump suppression #102

Merged

klappy merged commit e5d0983 into prod Apr 17, 2026
9 of 11 checks passed

klappy mentioned this pull request Apr 17, 2026

Promote PR #102: challenge mode enum schema fix to prod #103

Merged
