Promote PR #100: governance-driven challenge with BM25 + stemming to prod #101

Merged
klappy merged 18 commits into prod from main on Apr 17, 2026
Conversation

@klappy (Owner) commented Apr 17, 2026

Promote E0008 challenge governance refactor (PR #100) to production

Promotes 18 commits from main to prod, all originating from PR #100 (governance-driven oddkit_challenge refactor with BM25 + stemming).

Scope verification

Diff confirmed clean — only challenge-related code, evidence note, ledger entry, parser test, and the small backward-compatible bm25.ts extension. runOrientAction, runGateAction, runEncodeAction bodies are byte-identical between main and prod. The two near-orient diff hunks are line-number shifts only (dead detectClaimType removed above orient; new pickStrongestDirective added after it).

What ships

  • Governance-driven oddkit_challenge: claim type detection, questions, prerequisites, tension vocabulary, and stakes calibration all read from canon at runtime, mirroring the encode pattern from PR #96 (feat: governance-driven encode architecture)
  • BM25 + stemming detection: replaces brittle regex-OR matching; coining and coin now map to the same stem
  • bm25.ts extension: adds a backward-compatible optional stopWords (Set<string>) parameter to tokenize, buildBM25Index, and searchBM25. Default behavior unchanged for all existing callers, including oddkit_search
  • Detection Noise governance: stop words for challenge-type detection now live in odd/challenge/normative-vocabulary.md (klappy.dev#100, already merged), not as a hardcoded constant in worker source
  • Voice-dump suppression invariant: load-bearing; parser tolerates "none (parenthetical)" cell content
  • Shared parseTableRow helper: preserves empty interior cells across all six governance table parsers
  • Response shape parity: both CHALLENGED and SUPPRESSED responses include the governance field
  • Backward compat: claim_type alias retained in response envelope

Verification on main

  • TypeScript typecheck: clean
  • Parser-fidelity test (workers/test/governance-parser.test.mjs): 97/97 against live klappy.dev main
  • Smoke tests: 6/6 (legacy CLI path unaffected)
  • Cloudflare Workers preview deployed automatically on each commit; production deploy follows this merge

Bugs caught and fixed during PR #100

15+ across three review surfaces: oddkit gauntlet (3 governance/architectural), bugbot (12+ code-correctness across multiple defect classes), human review (1 Vodka Architecture violation). All resolved before merge to main.

Follow-ups (not blocking this promotion) — same anti-pattern in three remaining tools

The challenge refactor proved out a reusable pattern: governance-driven extraction with per-canonUrl caches, BM25 + stemming for detection, response-shape parity, parseTableRow safety. Three other tools still carry the pre-refactor shape and should be ported next:

  • oddkit_encode parity: runEncodeAction still uses regex-OR matching for encoding-type detection; same morphological brittleness as challenge pre-refactor. Pattern proven, port will be near-mechanical
  • oddkit_gate refactor: runGateAction has hardcoded exploration→planning and planning→execution prereq lists; same hardcoded-logic gap as challenge pre-refactor (NOT_READY false negatives demonstrated twice during work on PR #100, feat(challenge): governance-driven runChallengeAction (E0008))
  • oddkit_orient refactor: runOrientAction has three hardcoded instances of governance-in-source: per-mode question lists (lines 1489–1508), assumption-detection regex on modal verbs (line 1482), and the "Proactive posture" governance prose baked in as a string literal (line 1528). All three should move to canon articles parallel to odd/challenge/. The proactive-posture string is especially load-bearing: evolving the posture currently requires a worker redeploy

Refs

klappy and others added 18 commits April 17, 2026 05:33
Mirrors the PR #96 encode pattern. Extracts challenge behavior from
live governance articles (landed in klappy.dev canon via PR #99)
rather than hardcoded source logic.

New functions in workers/src/orchestrate.ts:
- discoverChallengeTypes — per-canonUrl cached type discovery
- fetchBasePrerequisites — universal prerequisite checks
- fetchNormativeVocabulary — RFC 2119 + architectural load-bearing terms
- fetchStakesCalibration — mode-to-depth filter
- extractPrereqTable / extractKeywordsFromCheck — shared helpers

Refactored:
- runChallengeAction — replaces hardcoded detectClaimType /
  generateChallenges / findTensions / findMissingPrerequisites
  with governance extraction. Supports multi-match. Filters output
  by stakes calibration based on mode parameter.
- runCleanupStorage — clears all four new caches on invalidation

Invariant: voice-dump mode suppresses all challenge output
regardless of matched types. Load-bearing per stakes-calibration
governance — some modes exist for raw capture and pressure-testing
at that stage damages the mode.

Graceful degradation: missing governance articles fall back to
minimal built-in behavior with warnings, rather than failing.

Co-authored-by: Claude <noreply@anthropic.com>

Refactor runChallengeAction in workers/src/orchestrate.ts to extract
challenge-type behavior from canon governance articles at runtime rather
than hardcoding claim-type detection, questions, prerequisites, and
tension rules in source. Structural mirror of PR #96 (encode).

Detection upgraded mid-implementation from regex-OR to BM25 + stemming
after the gauntlet revealed that regex-based matching was morphologically
brittle ("coin" doesn't match trigger "coining"). The pivot removed an
entire class of bug and seeded a reusable pattern for future
governance-driven tools.

Changes in workers/src/orchestrate.ts:
- New: ChallengeTypeDef, BasePrerequisite, NormativeVocabulary,
  StakesModeConfig, StakesCalibration
- New: discoverChallengeTypes (builds per-canonUrl BM25 index over
  detection text), fetchBasePrerequisites, fetchNormativeVocabulary,
  fetchStakesCalibration — each with per-canonUrl cache and graceful
  degradation on missing articles
- New: evaluatePrerequisiteCheck — interprets natural-language check
  strings from prerequisite overlay tables
- Refactored runChallengeAction: multi-match via BM25 score > 0, base
  + overlay prerequisite aggregation, stakes calibration filtering,
  voice-dump suppression invariant, governance-driven tension detection
- Extended runCleanupStorage with five new cache clears (types,
  type-index, base prerequisites, vocabulary, calibration)
- Removed dead detectClaimType (legacy src/tasks/challenge.js retains
  its copy for CLI backward-compat)
- Added CHALLENGE_STOP_WORDS set preserving modal verbs as signal

Changes in workers/src/bm25.ts (backward-compatible extension):
- tokenize(), buildBM25Index() accept optional stopWords: Set<string>
- BM25Index gains optional stopWords field so searchBM25 tokenizes
  queries consistently with the index
- Default behavior unchanged — existing callers unaffected
- Motivation: default STOP_WORDS filters modals (must, should, shall,
  may, not) which are signal for challenge-type detection
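
A minimal sketch of the optional stop-word parameter described above. The real tokenize in workers/src/bm25.ts is not shown on this page, so the default word list and the splitting rule here are assumptions, and stemming is left out for brevity. The point is only that omitting the second argument keeps the old behavior, while a caller-supplied set can leave modals in place.

```typescript
// Hypothetical default list: the actual STOP_WORDS in bm25.ts is not shown here.
// It is assumed to include the modals the commit message names.
const DEFAULT_STOP_WORDS: Set<string> = new Set([
  "the", "a", "an", "of", "must", "should", "shall", "may", "not",
]);

// Lowercase, split on non-alphanumerics, then drop stop words.
// Existing callers pass one argument and get the default set.
function tokenize(text: string, stopWords: Set<string> = DEFAULT_STOP_WORDS): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !stopWords.has(token));
}

// Default behavior: modals are filtered out as noise.
const defaultTokens = tokenize("The API must not change");

// Challenge detection: a caller-supplied set that keeps modals as signal.
const challengeStopWords: Set<string> = new Set(["the", "a", "an", "of"]);
const challengeTokens = tokenize("The API must not change", challengeStopWords);
```

With these assumed lists, the default call yields `api`, `change`, and the challenge call yields `api`, `must`, `not`, `change`.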

New tests: workers/test/governance-parser.test.mjs — 94 assertions
against live governance articles fetched from klappy.dev raw. Covers
type parsing, fallback resolution, BM25 detection, stemming regression
cases (coin/coining, propose/proposed, principle/principles), multi-
match, and the voice-dump suppression invariant. 94/94 pass.

Bugs the gauntlet caught on this PR:
1. Voice-dump suppression invariant would have shipped broken — the
   calibration cell reads "none (suppress all challenge)" not bare
   "none". Strict-equality parser would have produced a single-element
   array, voice-dump mode would have surfaced all challenges in prod.
2. Morphological brittleness in regex detection (coin vs coining) —
   triggered the pivot to BM25 + stemming.
3. Default BM25 STOP_WORDS silently breaks strong-claim and proposal
   detection by filtering modal verbs. Fixed via custom stop word set.
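
Bug 1 above can be illustrated with a small sketch. The function names and the comma-separated tier format are assumptions for illustration, not the worker's actual parser; only the cell text "none (suppress all challenge)" comes from the commit message.

```typescript
// Strict version: only a bare "none" means "no tiers".
function parseTiersStrict(cell: string): string[] {
  const value = cell.trim().toLowerCase();
  if (value === "none") return [];
  return value.split(",").map((t) => t.trim()).filter((t) => t.length > 0);
}

// Tolerant version: "none" followed by anything (for example a parenthetical
// explanation) still means "no tiers".
function parseTiersTolerant(cell: string): string[] {
  const value = cell.trim().toLowerCase();
  if (/^none\b/.test(value)) return [];
  return value.split(",").map((t) => t.trim()).filter((t) => t.length > 0);
}

const cell = "none (suppress all challenge)";
const strictResult = parseTiersStrict(cell);     // one bogus "tier" survives
const tolerantResult = parseTiersTolerant(cell); // empty: suppression holds
```

The strict parser returns a single-element array, so a downstream "are there any tiers?" check passes and challenges surface in voice-dump mode. The tolerant parser returns an empty array.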

Verification:
- npm run typecheck: clean
- tests/smoke.sh: 6/6 pass (legacy CLI path — backward compat preserved)
- workers/test/governance-parser.test.mjs: 94/94 pass
- AI voice clichés audit on new comments: clean
- oddkit_preflight, challenge, gate, validate: all run; gate NOT_READY
  due to same hardcoded-logic gap as challenge pre-refactor (flagged as
  follow-up)

Response shape change: adds mode, matched_types, type_definitions,
block_until_addressed; removes claim_type. Consumed programmatically,
not rendered.

Follow-ups flagged:
- Encode parity PR — same regex-OR brittleness in runEncodeAction;
  pattern proven here, port will be near-mechanical
- klappy.dev meta governance PR — "compiles into a case-insensitive
  word-boundary regex" is now stale language
- Gate refactor candidate — same hardcoded-logic shape as challenge pre-refactor

Refs:
- Depends on: klappy/klappy.dev#99 (governance articles this code reads)
- Structural mirror: #96 (governance-driven encode)
- Evidence: docs/oddkit/evidence/challenge-governance-code-refactor.md

Re-applies the four review fixes from sibling commits (31f8134, e9ef2f9,
84932f0) and the dead-code removal that the bugbot review also flagged,
on top of the BM25 + stemming detection swap.

- Vocabulary regex sorted by length descending so 'MUST NOT' matches
  before 'MUST' (closes bugbot 'Regex alternation order')
- Stakes calibration mode column lowercased at parse time AND mode
  normalized to lowercase at lookup time (closes bugbot 'Mode column
  not lowercased breaks voice-dump suppression')
- first_1 reframings policy now surfaces a single reframing total
  across all matched types, not one per type (closes bugbot
  'first_1 reframings surfaces multiple instead of one')
- Detection runs BEFORE voice-dump suppression check, and SUPPRESSED
  response includes the governance field for shape parity with
  CHALLENGED (closes bugbot 'SUPPRESSED response missing governance')
- Renames type_definitions to governance in CHALLENGED response so
  both statuses return the same shape under the same key
- Dead detectClaimType already removed by the BM25 commit (closes
  bugbot 'Dead code: detectClaimType has zero callers')
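
The alternation-order fix in the first bullet can be sketched as follows. The helper names are hypothetical; the four terms are the RFC 2119 fallback set mentioned elsewhere in this PR. A JavaScript regex alternation takes the first alternative that matches, so a shorter term listed first shadows a longer one that starts with it.

```typescript
// Escape regex metacharacters so vocabulary terms are matched literally.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one alternation regex, longest term first, so "MUST NOT" is tried
// before "MUST".
function buildVocabularyRegex(terms: string[]): RegExp {
  const alternation = [...terms]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  return new RegExp(`\\b(?:${alternation})\\b`, "g");
}

const text = "You MUST NOT deploy on Friday";

// Unsorted alternation: "MUST" wins and the prohibition is lost.
const unsorted = text.match(/\b(?:MUST|MUST NOT)\b/g);

// Sorted alternation: the full "MUST NOT" is matched.
const sorted = text.match(buildVocabularyRegex(["MUST", "MUST NOT", "SHOULD", "SHOULD NOT"]));
```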

Verification:
- npm run typecheck: clean
- workers/test/governance-parser.test.mjs: 94/94 pass
- tests/smoke.sh: 6/6 pass
…ctor evidence

Captures the fork-resolution and bugbot-review-driven fixes as a sixth
layer of catch alongside the gauntlet bugs. Records the lesson:
read PR review comments before treating divergent remote as unknown work.

Caught in PR #100 review by Klappy: the CHALLENGE_STOP_WORDS Set added
mid-PR to fix a BM25 over-match was itself a Vodka Architecture violation
in a refactor explicitly about removing such violations. The constant
carried a domain opinion ('modals are signal, articles are filler in
challenge detection') that belonged in canon, not in worker source.

Anti-pattern fixed:
- Drop the hardcoded CHALLENGE_STOP_WORDS Set from workers/src/orchestrate.ts
- Drop the duplicate hardcoded copy from workers/test/governance-parser.test.mjs
- Extend NormativeVocabulary interface with stopWords: Set<string>
- Extend fetchNormativeVocabulary to extract '## Detection Noise' code block
  from odd/challenge/normative-vocabulary.md (lands in klappy.dev#100)
- Move BM25 index build out of discoverChallengeTypes into a new lazy
  builder getOrBuildChallengeTypeIndex(types, vocab, canonUrl) so the
  index can use governance-sourced stop words rather than a constant
- Update parser test to fetch Detection Noise the same way the worker
  does — no hardcoded duplicate, no drift risk. Test gains 3 new
  assertions: Detection Noise parses non-empty, excludes modal verbs,
  includes common filler

Net hardcoded-constants delta: this PR removes ~6 classes of hardcoded
domain opinion (claim type detection, questions, prereqs, tension regex,
reframings, stop words) and adds zero. The remaining minimal RFC 2119
fallback ('MUST', 'MUST NOT', 'SHOULD', 'SHOULD NOT') and 'planning'
default mode are server-availability fallbacks for when canon is
unreachable, not domain governance.

Test currently runs against the feature branch via KLAPPYDEV_RAW env
override. After klappy.dev#100 merges, the override comes off and the
test reads from main with no further changes.

Verification:
- npm run typecheck: clean
- workers/test/governance-parser.test.mjs (vs feature branch): 97/97 pass
- tests/smoke.sh: 6/6 pass
- grep CHALLENGE_STOP_WORDS in workers/ and src/: zero matches

Refs:
- Caught in: this PR review by Klappy
- Depends on: klappy/klappy.dev#100 (Detection Noise section)
- Lesson: 'is this the right architectural shape' is a category the
  current gauntlet does not catch — the tools verify governance content,
  not whether new code is creating new ungoverned content. Possible
  future tool: a vodka-audit that flags non-trivial Sets/Maps/lists in
  worker source and asks 'should this be in canon?'
…arser

- Use matchAll and prefer prohibition directive type over leftmost
  requirement match so excerpts like 'You MUST X and MUST NOT Y'
  surface the prohibition. Regex switched to global flag to support
  matchAll.
- Port fetchStakesCalibration toLowerCase fix to the fidelity test
  parser so byMode keys stay lowercase even if governance introduces
  capitalized mode names.
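
A rough sketch of the first fix, assuming a simple two-kind classification. The Directive type and the function body are illustrative only; the real pickStrongestDirective in workers/src/orchestrate.ts is not shown on this page.

```typescript
type Directive = { term: string; kind: "prohibition" | "requirement" };

// Collect every directive in the excerpt, then prefer a prohibition over the
// leftmost requirement.
function pickStrongestDirective(excerpt: string): Directive | null {
  // matchAll throws a TypeError unless the regex carries the global flag.
  const pattern = /\b(?:MUST NOT|SHOULD NOT|MUST|SHOULD)\b/g;
  const found: Directive[] = Array.from(excerpt.matchAll(pattern)).map(
    (m): Directive => ({
      term: m[0],
      kind: m[0].endsWith("NOT") ? "prohibition" : "requirement",
    }),
  );
  if (found.length === 0) return null;
  return found.find((d) => d.kind === "prohibition") ?? found[0];
}

const strongest = pickStrongestDirective("You MUST X and MUST NOT Y");
```

A first-match approach would return the leftmost MUST for this excerpt; scanning all matches lets the later MUST NOT take priority.
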
…st pickStrongest

Two fixes:

1. Table row parser (6 call sites) was using
   .split('|').map(trim).filter(c => c.length > 0) which also drops
   legitimately-empty interior cells, silently collapsing the column
   count. In fetchStakesCalibration this would silently drop a
   voice-dump row with an empty tiers cell, breaking the suppression
   invariant with no error signal. Introduce parseTableRow helper that
   only strips the leading/trailing empties produced by surrounding
   pipes, preserving empty middle cells.

2. Hoist pickStrongest (now pickStrongestDirective) out of the
   per-entry loop in runChallengeAction. It captures no loop-scoped
   state, so defining it inside the loop needlessly re-allocated the
   closure on each iteration and misled readers about its scope.
   Matches the placement of evaluatePrerequisiteCheck.
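
The table-row defect in fix 1 is easy to reproduce. This is a minimal sketch of the idea, not the helper as merged; the sample row and its column meanings are made up for illustration.

```typescript
// Split a markdown table row into cells, dropping only the empty strings
// produced by the leading and trailing pipes. Empty interior cells survive.
function parseTableRow(line: string): string[] {
  const cells = line.split("|").map((c) => c.trim());
  if (cells.length > 0 && cells[0] === "") cells.shift();
  if (cells.length > 0 && cells[cells.length - 1] === "") cells.pop();
  return cells;
}

// Hypothetical row with an empty middle cell.
const row = "| voice-dump |  | none |";

// Old approach: filtering all empties collapses three columns into two.
const lossy = row.split("|").map((c) => c.trim()).filter((c) => c.length > 0);

// New approach: the column count is preserved.
const safe = parseTableRow(row);
```

Here `lossy` has two cells and `safe` has three, so a parser indexing by column position reads the right cell only in the second case.
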
…and dead branch

Two issues from bugbot's 14:29 review:

1. Reframing 'none' check applies same defensive pattern as the tiersRaw
   fix in fetchStakesCalibration. The cell may be 'none' or
   'none (parenthetical reason)' — strict equality would silently surface
   all reframings via the 'all' fallback when authors include explanatory
   text. Same defect class as bug #3 in the evidence note; sweep applied.

2. Remove unreachable questionTiers.length === 0 branch in the question-
   surfacing condition. The SUPPRESSED early-return at line 1635 already
   handles that case, so the branch was dead code that misleadingly
   suggested 'surface all questions for empty tiers' semantics — the
   actual semantic is full suppression.

Verified: typecheck clean, parser test 97/97 against main, smoke 6/6.
Defect-class sweep on governance cell strict-equality checks: only two
sites (tiersRaw, surfacing), both now defensive.

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: oddkit
Latest Commit: 6e01a00
Updated (UTC): Apr 17 2026, 03:17 PM

@klappy (Owner, Author) commented Apr 17, 2026

⚠ HOLD — schema bug found by preview test

After opening this promotion PR, I actually tested the production preview at https://main-oddkit.klappy.workers.dev — which I should have done before opening the PR.

The MCP tool's mode parameter Zod schema accepts only 3 modes (exploration, planning, execution), but odd/challenge/stakes-calibration defines 9 modes. The 6 writing-lifecycle modes (voice-dump, drafting, peer-review-ready, canon-tier-2, canon-tier-1, published-essay) are unreachable from the public API.

Net effect: the voice-dump suppression invariant — the load-bearing feature this PR's evidence note specifically calls out — cannot be invoked through the public MCP tool. Schema rejects the call before runtime ever sees it.

PR #102 fixes this with a 10-line schema expansion. Hold this promotion until #102 lands and merges to main. Then this PR fast-forwards and the prod deploy ships a feature whose headline invariant actually works.
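
To make the failure mode concrete, here is a dependency-free stand-in for the schema check. The actual fix lives in PR #102 and uses Zod; that code is not shown on this page, so the constant and function names below are hypothetical. The mode names are the nine listed above.

```typescript
// The three modes the schema accepted before the fix.
const LEGACY_MODES: readonly string[] = ["exploration", "planning", "execution"];

// All nine modes defined by odd/challenge/stakes-calibration.
const ALL_MODES: readonly string[] = [
  ...LEGACY_MODES,
  "voice-dump",
  "drafting",
  "peer-review-ready",
  "canon-tier-2",
  "canon-tier-1",
  "published-essay",
];

// Stand-in for schema validation: runs before any handler code.
function acceptsMode(allowed: readonly string[], mode: string): boolean {
  return allowed.includes(mode);
}

// Before the fix the request is rejected at the boundary, so the
// suppression logic never runs.
const rejectedBefore = !acceptsMode(LEGACY_MODES, "voice-dump");
// After widening the allowed set, the request reaches the runtime.
const acceptedAfter = acceptsMode(ALL_MODES, "voice-dump");
```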

What the production preview confirmed works

  • oddkit_orient: 7s warm
  • oddkit_challenge with mode: planning: multi-match, BM25 stemming ("coin the term" → pattern-coinage), governance field present, all working
  • oddkit_challenge multi-match: "must always coining" fires strong-claim + pattern-coinage + principle-extraction correctly
  • oddkit_challenge with mode: voice-dump: fails with a schema validation error before runtime

Lesson

Three verification layers passed (typecheck, parser-fidelity 97/97, smoke 6/6) and the deploy succeeded. None exercised the public API contract end-to-end with the load-bearing mode value. Testing the running preview is not optional.

@klappy klappy merged commit e5d0983 into prod Apr 17, 2026
9 of 11 checks passed
klappy added a commit that referenced this pull request Apr 17, 2026
Captures DOLCHE for the session that delivered PR #100 (governance-driven
challenge refactor with BM25 + stemming) and the unresolved schema bug
that made the voice-dump suppression invariant unreachable from the
public API.

Critical for next model picking up:
- PR #101 (prod promotion) is BLOCKED — schema fix not yet merged
- fix/challenge-mode-schema-includes-writing-modes has the fix
- After fix lands, manually curl preview with mode=voice-dump before promoting

Also records lessons for the next session: defect-class sweep discipline,
public API contract verification, parser test flakiness, and three
follow-up refactors carrying the same anti-pattern as challenge
pre-refactor (encode, gate, orient).