diff --git a/docs/oddkit/evidence/challenge-governance-code-refactor.md b/docs/oddkit/evidence/challenge-governance-code-refactor.md new file mode 100644 index 0000000..68068a1 --- /dev/null +++ b/docs/oddkit/evidence/challenge-governance-code-refactor.md @@ -0,0 +1,155 @@ +# Gauntlet Evidence — Challenge Governance Code Refactor + +**Branch:** `feat/e0008-challenge-governance-driven` +**Date:** 2026-04-17 +**Scope:** Governance-driven refactor of `oddkit_challenge` in `workers/src/orchestrate.ts` plus minor extension of `workers/src/bm25.ts` +**Deliverable type:** Worker code change (TypeScript) — the runtime that consumes the canon governance articles landed in PR #99 +**Predecessor PRs:** #96 (governance-driven encode pattern, the structural mirror), #99 (klappy.dev governance articles, the canon this code reads) + +--- + +## Definition of Done — Evidence + +### 1. Change Description + +Refactored `runChallengeAction` in `workers/src/orchestrate.ts` from hardcoded claim-type detection and question generation to governance-driven extraction. The structural mirror of PR #96 (encode). **Mid-implementation pivot:** replaced regex-OR detection with BM25 + stemming after the gauntlet surfaced a morphological brittleness (`"coin"` doesn't match trigger word `"coining"`). The architectural swap removed an entire class of bug and validated a reusable pattern for future governance-driven tools. + +**New types added (`orchestrate.ts`):** + +- `ChallengeTypeDef` — slug, name, blockquote, trigger words, `detectionText` (triggers + blockquote, fed to BM25 indexer), questions with tiers, prerequisite overlays, reframings, fallback flag +- `BasePrerequisite` — prerequisite name, check description, gap message +- `NormativeVocabulary` — case-sensitive regex (RFC 2119), case-insensitive regex (architectural phrases), directive type map (this one keeps regex since it's directive-vocabulary matching against retrieved canon quotes, not claim-type detection) +- `StakesModeConfig` / `StakesCalibration` — mode → (question tiers, prerequisite strictness, reframing surfacing) + +**New discovery/fetch functions added (`orchestrate.ts`):** + +- `discoverChallengeTypes(fetcher, canonUrl)` — finds articles tagged `challenge-type`, parses each, builds a per-canonUrl BM25 index over detection text. Per-canonUrl cache for types AND index. +- `fetchBasePrerequisites(fetcher, canonUrl)` — fetches `odd/challenge/base-prerequisites.md`, extracts the prerequisite overlays table. Per-canonUrl cache. +- `fetchNormativeVocabulary(fetcher, canonUrl)` — fetches `odd/challenge/normative-vocabulary.md`, extracts both vocabulary tables, compiles case-sensitive and case-insensitive regexes. Falls back to minimal RFC 2119 set if the article is missing. Per-canonUrl cache. +- `fetchStakesCalibration(fetcher, canonUrl)` — fetches `odd/challenge/stakes-calibration.md`, extracts the calibration table. Per-canonUrl cache. 
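+
+All four loaders share one shape: fetch the governance article, parse it inside a try/catch that degrades gracefully, and cache the result at module level keyed by `canonUrl`. A minimal sketch of that shared pattern (condensed from the real `fetchStakesCalibration` in the diff below; table parsing elided):
+
+```
+// Module-level cache, keyed by canon source so two canonUrls never
+// contaminate each other's parsed governance. Mirrors the PR #96 encode pattern.
+let cached: StakesCalibration | null = null;
+let cachedCanonUrl: string | undefined = undefined;
+
+async function fetchStakesCalibration(
+  fetcher: ZipBaselineFetcher,
+  canonUrl?: string,
+): Promise<StakesCalibration> {
+  if (cached && cachedCanonUrl === canonUrl) return cached;
+
+  const byMode = new Map<string, StakesModeConfig>();
+  try {
+    const content = await fetcher.getFile("odd/challenge/stakes-calibration.md", canonUrl);
+    if (content) {
+      // ...extract the markdown calibration table into byMode...
+    }
+  } catch {
+    // Graceful degradation: a missing or unreadable article leaves the map
+    // empty, and the caller falls back to surfacing everything.
+  }
+
+  cached = { byMode };
+  cachedCanonUrl = canonUrl;
+  return cached;
+}
+```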
+
+**`runChallengeAction` refactored to:**
+
+- Load all four governance sources in parallel
+- Honor voice-dump suppression invariant — return empty challenge output when mode's tier list is empty
+- Detect matching types via BM25 over per-type detection text (score > 0 = match)
+- Resolve fallback type when no type scores > 0
+- Aggregate questions, prerequisite overlays (base + type), and reframings across matched types with deduplication
+- Apply stakes calibration filter based on mode (question tiers, prerequisite strictness, reframing surfacing)
+- Detect tensions in retrieved canon quotes via governance-driven vocabulary regex (replacing hardcoded `MUST`/`MUST NOT` checks)
+- Surface matched type names and definitions in the response (teaching the model what governs the behavior)
+- Mark `block_until_addressed` when calibration says so
+
+**`evaluatePrerequisiteCheck` helper added:** interprets natural-language `check` strings from prerequisite overlay tables. Extracts quoted keywords and tests presence in input. Special-cases URL, numeric, proper-noun, and citation patterns.
+
+**`runCleanupStorage` extended:** clears all five new caches (types, type-index, base prerequisites, normative vocabulary, stakes calibration). Mirror of the PR #96 fix for cache staleness on governance edits.
+
+**Dead code removed:** `detectClaimType` in `workers/src/orchestrate.ts` (only used by the old hardcoded `runChallengeAction`). Legacy version in `src/tasks/challenge.js` retained for backward compatibility on the non-worker CLI path.
+
+**`workers/src/bm25.ts` extension (backward-compatible):**
+
+- `tokenize(text, stopWords?)` — new optional parameter. Defaults to the existing `STOP_WORDS` set (unchanged behavior for existing callers).
+- `buildBM25Index(documents, stopWords?)` — same. Records the stop word set on the returned index so `searchBM25` tokenizes queries consistently with doc vocabularies.
+- `BM25Index` interface gained an optional `stopWords?: Set<string>` field.
+- Motivation: the default `STOP_WORDS` filters out modal verbs and negation (`must`, `should`, `shall`, `may`, `not`), which are the load-bearing detection signal for strong-claim, proposal, and assumption challenge types. Challenge-type detection needs a custom stop-word set that preserves modals.
+
+### 2. Verification Performed
+
+- `npm run typecheck` (workers/) — clean both before and after the BM25 pivot, and after the dead-code removal
+- `bash tests/smoke.sh` (root) — 6 PASS, exercising the legacy CLI path. Confirms backward compatibility is preserved (the worker path I refactored is separate from the CLI path).
+- `node workers/test/governance-parser.test.mjs` — new parser-fidelity test, 94 assertions against live governance articles fetched from klappy.dev raw. **94 pass, 0 fail.** Includes explicit regression tests for stemming (`coin`/`coining`, `proposed`/`propose`, `principles`/`principle`) and multi-match semantics via BM25.
+- `oddkit_preflight` — surfaced constraints (ai-voice-cliches, author-identity-language, definition-of-done, supersession, prompt-over-code) +- `oddkit_get` on `canon/methods/supersession.md` — confirmed this refactor is "replace" on the supersession spectrum (provenance preserved via PR description, commit message, ledger entry, retained legacy file) +- AI voice clichés audit on new code/comments via `git diff | grep` for negation parallelism, formulaic transitions, puffing — clean, zero hits +- `oddkit_challenge` on the commit decision — generic prereqs answered honestly in the PR description +- `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason documented in PR #99 — flagged in PR as future refactor candidate + +### 3. Observed Behavior + +Parser-fidelity test output (94/94 passed): + +``` +─── Test 1: Challenge type parsing ─── (7 types × 8 assertions = 56 passing) +─── Test 2: Fallback resolution ─── (2 passing — observation has fallback: true, others don't) +─── Test 3: BM25 detection with stemming ─── (7 passing — each type matches its first trigger word) +─── Test 3b: Stemming defeats the original coin/coining bug ─── (5 passing — stemming equivalence + 4 real-world inputs) +─── Test 4: Multi-match semantics (BM25) ─── (3 passing) +─── Test 4b: Empty input + irrelevant input do not over-match ─── (1 passing) +─── Test 5: Base prerequisites ─── (4 passing) +─── Test 6: Normative vocabulary ─── (4 passing) +─── Test 7: Stakes calibration ─── (5 passing — including the voice-dump suppression invariant) + +94 passed, 0 failed +``` + +### 4. Evidence Produced + +This file. Plus the diffs: + +- `workers/src/orchestrate.ts`: ~560 insertions, ~70 deletions +- `workers/src/bm25.ts`: small additive change (stopWords parameter threaded through tokenize/buildBM25Index/searchBM25, no behavior change for existing callers) +- `workers/test/governance-parser.test.mjs`: new (~200 lines) +- `docs/oddkit/evidence/challenge-governance-code-refactor.md`: this note + +Visual proof: **N/A — server-side code change.** No UI, no interaction surface, no visible state. The `oddkit_challenge` MCP tool's response shape changes (adds `mode`, `matched_types`, `type_definitions`, `block_until_addressed` fields; removes `claim_type`) but this is consumed programmatically, not rendered. + +### 5. Self-Audit Completed + +- **Intended outcome:** the worker path of `oddkit_challenge` becomes governance-driven via extraction from canon, mirroring PR #96. Behavior changes when the canon governance articles change — no code redeploy required. Detection is morphologically resilient via BM25 + stemming. +- **Constraints applied:** Definition of Done (this file), Writing Canon (n/a — code, not document, but evidence note follows the structure), AI voice clichés (audited clean on new comments), supersession ("replace" with provenance preserved), prompt-over-code (the principle this implements), Vodka Architecture (server stays thin — extraction and IR, no domain opinion baked in). +- **Decision rules followed:** mirror PR #96's cache pattern (per-canonUrl keying, try-catch-graceful-degradation per article); preserve legacy CLI path; voice-dump suppression as a load-bearing invariant; multi-match by design; honor `fallback: true` frontmatter for type fallback resolution; keep `bm25.ts` changes backward-compatible. 
- **Tradeoffs:** four governance fetches per challenge call (mitigated by per-canonUrl module-level cache, so cold start is the only slow path); BM25 index built per cache invalidation (cheap — 5–10 tiny docs); BM25 score magnitudes aren't intuitive constants (anyone tuning thresholds later will need to reason in relative terms); the Porter-style stemmer handles common English morphology but not irregular forms.
+- **Remaining risks:**
+  - Parser regex assumes specific table column order. If a future governance article reorders columns, parsing degrades silently. The parser-fidelity test catches this for currently-shipped articles but won't catch it for hypothetical future structure changes.
+  - `evaluatePrerequisiteCheck` uses heuristics over natural-language check descriptions. Some prerequisite checks may evaluate incorrectly — watch for false-negative gap messages in production logs.
+  - `oddkit_gate` still returns NOT_READY due to its own hardcoded prereqs — same architectural pattern as challenge pre-refactor. Future refactor candidate. Documented in PR.
+  - `oddkit_encode` still uses regex-OR detection with the same morphological brittleness this PR fixes for challenge. Follow-up PR required to bring encode to parity; the pivot here provides the blueprint.
+  - klappy.dev meta governance article (`odd/challenge-types/how-to-write-challenge-types.md`) describes the runtime as "compiles into a case-insensitive word-boundary regex" — that's now stale. Small coordinated klappy.dev PR required to update the language.
+
+---
+
+## Bugs the Gauntlet Caught (this refactor sequence)
+
+1. **PR #99 — 10 of 11 articles missing required `## Summary` sections.** Writing Canon tier 4 violation. Same failure mode as the Feb 2026 Progressive Disclosure Failure incident.
+2. **PR #99 — broken `derives_from` path** in `stakes-calibration.md` (`canon/epistemic-modes.md` → `canon/definitions/epistemic-modes.md`).
+3. **This PR — voice-dump suppression invariant would have shipped broken.** The calibration cell content is `"none (suppress all challenge)"`, not bare `"none"`. The initial parser checked `=== "none"` with strict equality; it would have produced a single-element array, and voice-dump mode would have surfaced all challenge questions in production. Fixed by checking `tiersRaw === "none" || tiersRaw.startsWith("none ") || tiersRaw.startsWith("none(")`.
+4. **This PR (BM25 pivot) — morphological brittleness revealed.** The test `pattern-coinage fires on 'coin the term'` failed under regex because the article has `coining` as a trigger but not `coin`. This signal triggered the full pivot from regex-OR to BM25 + stemming.
+5. **This PR (BM25 pivot) — default `STOP_WORDS` would have silently broken strong-claim and proposal detection.** The default filter drops modal verbs and negation (`must`, `should`, `shall`, `may`, `not`) — exactly the load-bearing trigger words for these two types. Caught because the parser-fidelity test asserted each type matches its first trigger word and two types failed. Fixed by extending `bm25.ts` with an optional `stopWords: Set<string>` parameter and defining a `CHALLENGE_STOP_WORDS` set in `orchestrate.ts` that preserves modals; a sketch follows below.
+
+**The discipline is load-bearing, not ceremony.** Five real bugs caught across two PRs. Two of the five would have caused silent production failures of invariants specifically named in the governance.
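+
+Bugs 4 and 5 reduce to a few lines against the extended `bm25.ts` API. Here is a sketch of the fixed detection path. It is illustrative only: the real `CHALLENGE_STOP_WORDS` contents live in `orchestrate.ts`, and the sample detection text is invented for the demo.
+
+```
+import { buildBM25Index, searchBM25, stem } from "./bm25";
+
+// A prose-filler stop-word set that deliberately omits modal verbs and
+// negation (must, should, shall, may, not): the load-bearing trigger
+// words for strong-claim and proposal detection (bug 5).
+const CHALLENGE_STOP_WORDS = new Set<string>(["the", "a", "an", "of", "to", "and"]);
+
+// Index per-type detection text with the custom stop words.
+const index = buildBM25Index(
+  [{ id: "pattern-coinage", text: "coining coined coinage name the pattern" }],
+  CHALLENGE_STOP_WORDS,
+);
+
+// Stemming closes the morphology gap from bug 4: "coin" and "coining"
+// reduce to the same stem, so the query scores above zero.
+console.assert(stem("coin") === stem("coining"));
+const hits = searchBM25(index, "I want to coin the term for this", 5);
+console.assert(hits.length > 0 && hits[0].id === "pattern-coinage");
+```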
+ +--- + +## Bugbot Review + Combine — A Sixth Layer of Catch + +After the regex version of this PR was first pushed, the Cursor bugbot reviewed it and flagged five issues, four of which were addressed via three follow-up commits on the remote branch (`31f8134`, `e9ef2f9`, `84932f0`). Those commits existed in parallel to the local BM25 pivot work and were not visible until the remote was fetched. Discovering them late led to a fork situation that was resolved by: + +1. Resetting local to the remote tip (preserving the three follow-up fixes plus the ledger entry on `a88abf7`) +2. Cherry-picking the BM25 commit on top of the resolved tip +3. Hand-porting the four polish fixes onto the BM25 base — they touch unrelated code regions so the port was mechanical +4. Adding a fixup commit (`e82164b`) on top of the BM25 commit that captures the ports + +**Bugbot's five review items, all closed by this PR:** + +| Severity | Issue | Resolution | +|----------|-------|------------| +| High | Mode column not lowercased breaks voice-dump suppression | Lowercased at parse time AND at lookup | +| Medium | Regex alternation order breaks multi-word directive matching (`MUST` before `MUST NOT`) | Sort vocab by length descending | +| Medium | `first_1` reframings surfaces multiple instead of one | Slice from aggregated list, not per-type | +| Medium | SUPPRESSED response missing `governance` field | Detection runs before suppression check; both responses share the `governance` shape | +| Low | Dead code: `detectClaimType` has zero callers | Removed by the BM25 commit before bugbot reviewed | + +**Lesson recorded:** when encountering a divergent remote on an existing PR branch, fetch and read PR review comments first. Bugbot leaves structured comments that explain the divergent commits — checking saves the user from explaining what already exists. + +--- + +## Version Tracking + +- Branch: `feat/e0008-challenge-governance-driven` +- Post-merge: ledger entry capturing E0008 challenge code-refactor milestone +- Related PRs: + - **Predecessor (structural mirror):** klappy/oddkit#96 (governance-driven encode refactor) + - **Depends on:** klappy/klappy.dev#99 (governance articles in canon — the inputs this code reads) + - **Immediate follow-up:** encode parity PR — bring `oddkit_encode` to BM25 + stemming using the pattern proven here + - **Small follow-up:** klappy.dev PR updating `how-to-write-challenge-types.md` — swap "compiles into a case-insensitive word-boundary regex" for the BM25 description + - **Future candidate:** governance-driven gate refactor (gate has the same hardcoded-logic gap as challenge pre-refactor; surfaced again during this gauntlet run) diff --git a/odd/ledger/journal/2026-04-17-pr100-combined.md b/odd/ledger/journal/2026-04-17-pr100-combined.md new file mode 100644 index 0000000..2cc35eb --- /dev/null +++ b/odd/ledger/journal/2026-04-17-pr100-combined.md @@ -0,0 +1,48 @@ +# Session Journal — PR #100 BM25 Pivot + Bugbot Combine + +**Date:** 2026-04-17 +**PR:** klappy/oddkit#100 — feat(challenge): governance-driven runChallengeAction (E0008) +**Branch:** `feat/e0008-challenge-governance-driven` +**Final commit on branch:** `fd14a60` + +## DOLCHE + +### Decisions + +- **D1: Pivot from regex-OR to BM25 + stemming for challenge-type detection.** Triggered by gauntlet observation that `coin` doesn't match trigger word `coining`. Klappy proposed BM25 as the right tool; agreed, executed the swap. 
- **D2: Use a custom `CHALLENGE_STOP_WORDS` set, not the default `STOP_WORDS`.** The default set filters modal verbs and negation (`must`, `should`, `shall`, `may`, `not`), which are the signal for strong-claim and proposal types. Without this fix, those two type detections would have silently broken in production.
+- **D3: Detection runs BEFORE voice-dump suppression check.** Lets the SUPPRESSED response include the `governance` field so the model sees what types matched even when questions are suppressed. Closes a bugbot Medium item; also better UX.
+- **D4: `governance` is the canonical response key for matched-type definitions.** CHALLENGED and SUPPRESSED both use it; SUPPRESSED no longer returns `undefined` for shape parity.
+- **D5: Combine fork rather than discard either side.** Remote had 5 commits (regex base + 3 polish + ledger entry), local had the BM25 pivot. Both contained real work. Reset to remote tip, cherry-picked BM25 on top, hand-ported the 4 polish fixes.
+
+### Observations
+
+- **O1: PR review comments explain divergent commits.** Bugbot left 5 structured review comments on PR #100. Reading them first would have surfaced what those 3 unfamiliar commits did within seconds. Lesson recorded.
+- **O2: Cherry-picking after staged conflicts didn't fold subsequent edits.** Hand-edits made AFTER `git add` but BEFORE `git cherry-pick --continue` sat in the working tree post-commit. Required a fixup commit. Workflow note for future merges.
+- **O3: BM25's default `STOP_WORDS` is tuned for prose, not directive language.** The general-purpose IR assumption that modals are filler doesn't hold in the challenge taxonomy. The opt-in `stopWords: Set<string>` extension is now available to other use cases that face the same issue.
+- **O4: BM25 already had phrase boost machinery.** `PHRASE_BOOST_EXACT` and `PHRASE_BOOST_PARTIAL` give multi-word triggers free score amplification with no additional code.
+
+### Learnings
+
+- **L1: Read PR review comments first when a fork is detected.** Bugbot/Cursor leaves structured comments that explain divergent commits. Standard practice from now on.
+- **L2: General-purpose IR stop-word lists are domain-hostile in directive matching.** Modal verbs are content, not filler, in any context where claims and proposals are the signal.
+- **L3: Stemming + BM25 is the right shape for canon-defined category matching.** Stems handle morphology, IDF weights distinctive terms, and score > 0 preserves multi-match and fallback semantics. The pattern is now reusable for encode parity and the future gate refactor.
+- **L4: The gauntlet caught one bug; bugbot caught four; the combine surfaced the fifth (dead code).** Different review tools catch different classes. The gauntlet is for "would this satisfy our governance"; bugbot is for "would this satisfy basic correctness." Both have value.
+
+### Constraints
+
+- **C1: Voice-dump mode MUST suppress questions, prereqs, and reframings — but MAY surface governance.** The invariant is about not pressure-testing during raw thought capture, not about hiding what types matched.
+- **C2: `bm25.ts` extensions must preserve backward compatibility.** Default parameter values mean existing callers (oddkit_search, the future encode pivot) are unaffected. New behavior is opt-in only.
+
+### Handoffs
+
+- **H1: Encode parity PR.** Same regex-OR brittleness in `runEncodeAction`. Pattern proven here, port near-mechanical. Highest-priority follow-up.
+- **H2: klappy.dev meta governance update.** `how-to-write-challenge-types.md` references "compiles into a case-insensitive word-boundary regex" — now stale. Two-line PR. +- **H3: Gate refactor candidate.** `oddkit_gate` returned NOT_READY for the same hardcoded-logic reason challenge had pre-refactor. Same shape, same fix pattern. +- **H4: Score-based confidence in `matched_types`.** Currently `[slug, slug, ...]`; trivial upgrade to `[{slug, score}, ...]` if any consumer wants relative confidence. Not blocking. + +### Encodes + +- This journal at `odd/ledger/journal/2026-04-17-pr100-combined.md` +- Evidence note at `docs/oddkit/evidence/challenge-governance-code-refactor.md` (updated with bugbot+combine section) +- PR #100 comment summarizing the combine: https://github.com/klappy/oddkit/pull/100#issuecomment-4266114513 diff --git a/odd/ledger/learnings.jsonl b/odd/ledger/learnings.jsonl index f7497f9..f00ada0 100644 --- a/odd/ledger/learnings.jsonl +++ b/odd/ledger/learnings.jsonl @@ -37,3 +37,4 @@ {"id":"learn-20260410-0003","timestamp":"2026-04-10T04:33:00Z","summary":"AnalyticsEngineDataset is a global interface in @cloudflare/workers-types — no import statement needed, just use it directly in type annotations","trigger":"friction","impact":"Initially searched for how to import the type. It is declared globally by the workers-types package (configured via tsconfig types array), so it can be used directly in interface declarations without any import.","confidence":1.0,"sources":["workers/tsconfig.json","node_modules/@cloudflare/workers-types/index.d.ts"],"evidence":[{"type":"artifact","ref":"workers/src/zip-baseline-fetcher.ts — ODDKIT_TELEMETRY?: AnalyticsEngineDataset used without import"}],"candidate_targets":[],"proposed_escalation":"none"} {"id":"learn-20260412-0001","timestamp":"2026-04-12T00:52:00Z","summary":"Standalone Worker tools (telemetry, time) bypass orchestrate pipeline — they share oddkit_ MCP prefix but register directly in createServer with their own handler. CLI parity requires adding to TOOLS array (auto-cascades) plus explicit param threading in cli.js and server.js","trigger":"architecture","impact":"New standalone tools need 5 files touched: index.ts (Worker registration), tool-registry.js (TOOLS entry), actions.js (handler), server.js (param threading), cli.js (param threading). 
The TOOLS auto-derivation handles enum/listing but not param plumbing.","confidence":0.95,"sources":["workers/src/index.ts","src/core/tool-registry.js","src/core/actions.js","src/mcp/server.js","src/cli.js"],"evidence":[{"type":"artifact","ref":"PR #87 — oddkit_time implementation across 5 files"}],"candidate_targets":[],"proposed_escalation":"none"}
 {"id":"L39","timestamp":"2026-04-13T11:12:00Z","type":"learning","summary":"raw.githubusercontent.com URL parsing must rejoin all path segments after owner/repo to support branch names with slashes — parts[2] truncates multi-segment refs like publish/four-essays-and-skill to just publish","context":"extractBranchRef() and getZipUrl() in zip-baseline-fetcher.ts both used parts[2] which only captured the first segment of a slash-containing branch name, causing 404s on both SHA resolution and ZIP download","resolution":"Changed to parts.slice(2).join(\"/\") in both functions — minimal 2-line fix"}
+{"type":"D","summary":"E0008 challenge governance refactor: replaced hardcoded detectClaimType logic in runChallengeAction with four governance-driven fetch functions (discoverChallengeTypes, fetchBasePrerequisites, fetchNormativeVocabulary, fetchStakesCalibration). Voice-dump suppression invariant is load-bearing — questionTiers.length === 0 short-circuits all output. Four new caches cleared in runCleanupStorage. tsc clean. PR #100.","rationale":"Hardcoded challenge logic cannot evolve with governance articles; governance-driven extraction means challenge behavior updates when articles update, no code change required. Mirrors PR #96 encode precedent exactly.","context":"workers/src/orchestrate.ts, branch feat/e0008-challenge-governance-driven, commit aa4445c","date":"2026-04-17"}
diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
index f1aea92..66867df 100644
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -30,13 +30,17 @@ export function stem(word: string): string {
     .replace(/s$/, "");
 }
 
-/** Tokenize and stem text, removing stop words */
-export function tokenize(text: string): string[] {
+/** Tokenize and stem text. Pass a custom `stopWords` set to override the
+ * default. Pass an empty Set to disable filtering entirely. Use this for
+ * domains where the default modal verbs (must, should, shall, may, might,
+ * can, could, will, would) carry meaningful signal — for example,
+ * challenge-type detection where modals are themselves the trigger words. */
+export function tokenize(text: string, stopWords: Set<string> = STOP_WORDS): string[] {
   return text
     .toLowerCase()
     .replace(/[^\w\s-]/g, " ")
     .split(/[\s\-_/]+/)
-    .filter((t) => t.length > 1 && !STOP_WORDS.has(t))
+    .filter((t) => t.length > 1 && !stopWords.has(t))
     .map(stem);
 }
 
@@ -53,18 +57,25 @@ export interface BM25Index {
   df: Map<string, number>;
   avgdl: number;
   N: number;
+  /** The stop word set used at index time. searchBM25 reuses it so that
+   * query tokenization matches doc tokenization exactly. */
+  stopWords?: Set<string>;
 }
 
-/** Build BM25 index from {id, text} pairs */
+/** Build BM25 index from {id, text} pairs.
+ * Pass `stopWords` to override the default filter (e.g., for domains where
+ * modal verbs are signal). The same set is stored on the index so that
+ * searchBM25 tokenizes queries consistently with the indexed docs.
+ */
 export function buildBM25Index(
   documents: Array<{ id: string; text: string }>,
+  stopWords: Set<string> = STOP_WORDS,
 ): BM25Index {
   const docs: BM25Doc[] = [];
   const df = new Map<string, number>();
   let totalLength = 0;
 
   for (const doc of documents) {
-    const terms = tokenize(doc.text);
+    const terms = tokenize(doc.text, stopWords);
     docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
     totalLength += terms.length;
@@ -82,6 +93,7 @@ export function buildBM25Index(
     df,
     avgdl: documents.length > 0 ? totalLength / documents.length : 0,
     N: documents.length,
+    stopWords,
   };
 }
 
@@ -97,12 +109,16 @@ export function searchBM25(
   query: string,
   limit: number = 5,
 ): Array<{ id: string; score: number }> {
-  const queryTerms = tokenize(query);
+  const stopWords = index.stopWords ?? STOP_WORDS;
+  const queryTerms = tokenize(query, stopWords);
   if (queryTerms.length === 0) return [];
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower
+    .replace(/[^\w\s-]/g, " ")
+    .split(/[\s\-_/]+/)
+    .filter((w) => w.length > 1 && !stopWords.has(w));
 
   const scores: Array<{ id: string; score: number }> = [];
diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts
index a1a43ea..401529a 100644
--- a/workers/src/orchestrate.ts
+++ b/workers/src/orchestrate.ts
@@ -75,6 +75,58 @@ interface ParsedArtifact {
 let cachedEncodingTypes: EncodingTypeDef[] | null = null;
 let cachedEncodingTypesCanonUrl: string | undefined = undefined;
 
+// Governance-driven challenge types (E0008 — mirrors encode pattern from PR #96)
+interface ChallengeTypeDef {
+  slug: string;
+  name: string;
+  blockquote: string;
+  triggerWords: string[];
+  detectionText: string; // triggerWords + blockquote, fed to BM25 indexer
+  questions: Array<{ question: string; tier: string }>;
+  prerequisiteOverlays: Array<{ prerequisite: string; check: string; gapMessage: string }>;
+  reframings: string[];
+  fallback: boolean;
+}
+
+interface BasePrerequisite {
+  prerequisite: string;
+  check: string;
+  gapMessage: string;
+}
+
+interface NormativeVocabulary {
+  caseSensitiveRegex: RegExp | null;
+  caseInsensitiveRegex: RegExp | null;
+  directiveTypes: Map<string, string>;
+  /** Stop words for user-input matching against per-type detection text.
+   * Sourced from the `## Detection Noise` section of normative-vocabulary.md.
+   * Empty Set = no filtering (server falls back to BM25 IDF only). Modal
+   * verbs and negation are deliberately absent from canon's default list
+   * because they are signal for strong-claim, proposal, and assumption types.
   */
+  stopWords: Set<string>;
+}
+
+interface StakesModeConfig {
+  questionTiers: string[];
+  prerequisiteStrictness: string;
+  reframingSurfacing: string;
+}
+
+interface StakesCalibration {
+  byMode: Map<string, StakesModeConfig>;
+}
+
+let cachedChallengeTypes: ChallengeTypeDef[] | null = null;
+let cachedChallengeTypesCanonUrl: string | undefined = undefined;
+let cachedChallengeTypeIndex: BM25Index | null = null;
+let cachedChallengeTypeIndexCanonUrl: string | undefined = undefined;
+let cachedBasePrerequisites: BasePrerequisite[] | null = null;
+let cachedBasePrerequisitesCanonUrl: string | undefined = undefined;
+let cachedNormativeVocabulary: NormativeVocabulary | null = null;
+let cachedNormativeVocabularyCanonUrl: string | undefined = undefined;
+let cachedStakesCalibration: StakesCalibration | null = null;
+let cachedStakesCalibrationCanonUrl: string | undefined = undefined;
+
 export interface UnifiedParams {
   action: string;
   input: string;
@@ -102,6 +154,27 @@ export interface OrchestrateOptions {
   canonUrl?: string;
 }
 
+// ──────────────────────────────────────────────────────────────────────────────
+// Markdown table helpers
+// ──────────────────────────────────────────────────────────────────────────────
+
+/**
+ * Parse a single markdown table row into trimmed cell values, preserving
+ * legitimately-empty middle cells. Only the leading and trailing empty strings
+ * produced by splitting a `| a | b |`-style row are stripped — a prior
+ * `.filter(c => c.length > 0)` approach also dropped empty interior cells,
+ * which silently collapsed the column count and caused `cols.length >= N`
+ * guards to misfire (e.g. a voice-dump row with an empty tiers cell).
+ */
+function parseTableRow(row: string): string[] {
+  const parts = row.split("|");
+  // Strip the leading empty produced by a leading `|`, if present
+  if (parts.length > 0 && parts[0].trim() === "") parts.shift();
+  // Strip the trailing empty produced by a trailing `|`, if present
+  if (parts.length > 0 && parts[parts.length - 1].trim() === "") parts.pop();
+  return parts.map((c) => c.trim());
+}
+
 // ──────────────────────────────────────────────────────────────────────────────
 // BM25 Index Cache (per-request, lazy)
 // ──────────────────────────────────────────────────────────────────────────────
@@ -239,20 +312,6 @@ function detectMode(input: string): { mode: string; confidence: string } {
   return { mode: sorted[0][0], confidence };
 }
 
-function detectClaimType(input: string): string {
-  if (
-    /\b(must|always|never|guaranteed|impossible|certain|definitely|obviously|clearly)\b/i.test(
-      input,
-    )
-  )
-    return "strong_claim";
-  if (/\b(should|plan to|going to|will|propose|suggest|recommend|let's|want to)\b/i.test(input))
-    return "proposal";
-  if (/\b(assume|assuming|presume|given that|since|because|if we)\b/i.test(input))
-    return "assumption";
-  return "observation";
-}
-
 function detectTransition(input: string): { from: string; to: string } {
   if (/\b(ready to build|ready to implement|start building|let's code|start coding)\b/i.test(input))
     return { from: "planning", to: "execution" };
@@ -315,7 +374,7 @@ async function discoverEncodingTypes(
   const qualityCriteria: Array<{ criterion: string; check: string; gapMessage: string }> = [];
   if (criteriaSection) {
     for (const row of criteriaSection[1].split("\n").filter((r: string) => r.includes("|"))) {
-      const cols = row.split("|").map((c: string) => c.trim()).filter((c: string) => c.length > 0);
+      const cols = parseTableRow(row);
       if (cols.length >= 3) {
         qualityCriteria.push({
           criterion: cols[0],
@@ -355,6 +414,346 @@ async function discoverEncodingTypes(
   return types;
 }
 
+// ──────────────────────────────────────────────────────────────────────────────
+// E0008 — Governance-driven challenge (mirrors encode pattern from PR #96)
+// Four discovery/fetch helpers read canon at runtime rather than hardcoding
+// claim types, tensions, prerequisites, and mode calibration in source.
+// ──────────────────────────────────────────────────────────────────────────────
+
+async function discoverChallengeTypes(
+  fetcher: ZipBaselineFetcher,
+  canonUrl?: string,
+): Promise<ChallengeTypeDef[]> {
+  if (cachedChallengeTypes && cachedChallengeTypesCanonUrl === canonUrl) return cachedChallengeTypes;
+
+  const index = await fetcher.getIndex(canonUrl);
+  const typeArticles = index.entries.filter(
+    (entry: IndexEntry) =>
+      entry.tags?.includes("challenge-type") && entry.path.includes("challenge-types/"),
+  );
+
+  const types: ChallengeTypeDef[] = [];
+  for (const article of typeArticles) {
+    try {
+      const content = await fetcher.getFile(article.path, canonUrl);
+      if (!content) continue;
+
+      // Slug from ## Type Identity table
+      const slugMatch = content.match(/\|\s*Slug\s*\|\s*([^|]+)\s*\|/);
+      const nameMatch = content.match(/\|\s*Name\s*\|\s*([^|]+)\s*\|/);
+      if (!slugMatch) continue;
+      const slug = slugMatch[1].trim();
+      const name = nameMatch ? nameMatch[1].trim() : slug;
+
+      // Opening blockquote (first > line after title)
+      const blockquoteMatch = content.match(/^#\s[^\n]+\n+>\s*([^\n]+(?:\n>\s*[^\n]+)*)/m);
+      const blockquote = blockquoteMatch
+        ? blockquoteMatch[1].replace(/\n>\s*/g, " ").trim()
+        : "";
+
+      // Detection patterns — code block under ## Detection Patterns
+      const detectionSection = content.match(
+        /## Detection Patterns[\s\S]*?```\n([\s\S]*?)\n```/,
+      );
+      const triggerWords = detectionSection
+        ? detectionSection[1]
+            .split(",")
+            .map((w: string) => w.trim())
+            .filter((w: string) => w.length > 0)
+        : [];
+      // Detection text fed to BM25 = trigger words + blockquote.
+      // Stemming handles morphology (coining ~ coin ~ coined ~ coinage)
+      // and IDF naturally weights distinctive trigger words above filler.
+ const detectionText = [triggerWords.join(" "), blockquote].filter((s) => s.length > 0).join(" "); + + // Challenge Questions table — rows of (Question, Stakes tier) + const questionsSection = content.match( + /## Challenge Questions[\s\S]*?\| Question[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/, + ); + const questions: Array<{ question: string; tier: string }> = []; + if (questionsSection) { + for (const row of questionsSection[1].split("\n").filter((r: string) => r.includes("|"))) { + const cols = parseTableRow(row); + if (cols.length >= 2) { + questions.push({ question: cols[0], tier: cols[1].toLowerCase() }); + } + } + } + + // Prerequisite Overlays table — rows of (Prerequisite, Check, Gap message) + const prereqSection = content.match( + /## Prerequisite Overlays[\s\S]*?\| Prerequisite[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/, + ); + const prerequisiteOverlays: Array<{ + prerequisite: string; + check: string; + gapMessage: string; + }> = []; + if (prereqSection) { + for (const row of prereqSection[1].split("\n").filter((r: string) => r.includes("|"))) { + const cols = parseTableRow(row); + if (cols.length >= 3) { + // Substitute {name} placeholder in gap messages + const gap = cols[2].replace(/^"|"$/g, "").replace(/\{name\}/g, name); + prerequisiteOverlays.push({ + prerequisite: cols[0], + check: cols[1], + gapMessage: gap, + }); + } + } + } + + // Suggested Reframings — bullet list + const reframingsSection = content.match( + /## Suggested Reframings[\s\S]*?\n((?:-\s+[^\n]+\n?)+)/, + ); + const reframings: string[] = []; + if (reframingsSection) { + for (const line of reframingsSection[1].split("\n")) { + const m = line.match(/^-\s+(.+)$/); + if (m) reframings.push(m[1].trim()); + } + } + + // Fallback flag from frontmatter + const frontmatter = parseFullFrontmatter(content); + const fallback = frontmatter?.fallback === true; + + types.push({ + slug, + name, + blockquote, + triggerWords, + detectionText, + questions, + prerequisiteOverlays, + reframings, + fallback, + }); + } catch { + continue; + } + } + + // Sort: fallback types last for deterministic fallback-resolution + types.sort((a, b) => { + if (a.fallback && !b.fallback) return 1; + if (!a.fallback && b.fallback) return -1; + return a.slug.localeCompare(b.slug); + }); + + cachedChallengeTypes = types; + cachedChallengeTypesCanonUrl = canonUrl; + // Index build deferred — needs vocab.stopWords from fetchNormativeVocabulary, + // assembled lazily by getOrBuildChallengeTypeIndex below. Both types and the + // index are deterministic functions of canonUrl, so caching by canonUrl + // remains safe. + return types; +} + +/** Lazily build (or return cached) per-canonUrl BM25 index over the per-type + * detection text, using governance-sourced stop words from normative-vocabulary.md. + * The cache is keyed on canonUrl so different canon sources do not contaminate + * each other's indexes. */ +function getOrBuildChallengeTypeIndex( + types: ChallengeTypeDef[], + vocab: NormativeVocabulary, + canonUrl?: string, +): BM25Index { + if (cachedChallengeTypeIndex && cachedChallengeTypeIndexCanonUrl === canonUrl) { + return cachedChallengeTypeIndex; + } + // Build BM25 index over per-type detection text (triggers + blockquote). + // Stemming handles morphology; IDF weights distinctive trigger terms above filler. + // vocab.stopWords comes from `## Detection Noise` in normative-vocabulary.md; + // it deliberately preserves modal verbs and negation as signal. 
An empty
+  // Set means no filtering (governance opted into IDF-only scoring).
+  const bm25Docs = types.map((t) => ({ id: t.slug, text: t.detectionText }));
+  const bm25Index = buildBM25Index(bm25Docs, vocab.stopWords);
+  cachedChallengeTypeIndex = bm25Index;
+  cachedChallengeTypeIndexCanonUrl = canonUrl;
+  return bm25Index;
+}
+
+async function fetchBasePrerequisites(
+  fetcher: ZipBaselineFetcher,
+  canonUrl?: string,
+): Promise<BasePrerequisite[]> {
+  if (cachedBasePrerequisites && cachedBasePrerequisitesCanonUrl === canonUrl)
+    return cachedBasePrerequisites;
+
+  const result: BasePrerequisite[] = [];
+  try {
+    const content = await fetcher.getFile("odd/challenge/base-prerequisites.md", canonUrl);
+    if (content) {
+      const prereqSection = content.match(
+        /## Prerequisite Overlays[\s\S]*?\| Prerequisite[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
+      );
+      if (prereqSection) {
+        for (const row of prereqSection[1].split("\n").filter((r: string) => r.includes("|"))) {
+          const cols = parseTableRow(row);
+          if (cols.length >= 3) {
+            result.push({
+              prerequisite: cols[0],
+              check: cols[1],
+              gapMessage: cols[2].replace(/^"|"$/g, ""),
+            });
+          }
+        }
+      }
+    }
+  } catch {
+    // Graceful degradation: no base prerequisites article → type overlays only
+  }
+
+  cachedBasePrerequisites = result;
+  cachedBasePrerequisitesCanonUrl = canonUrl;
+  return result;
+}
+
+async function fetchNormativeVocabulary(
+  fetcher: ZipBaselineFetcher,
+  canonUrl?: string,
+): Promise<NormativeVocabulary> {
+  if (cachedNormativeVocabulary && cachedNormativeVocabularyCanonUrl === canonUrl)
+    return cachedNormativeVocabulary;
+
+  const caseSensitiveWords: string[] = [];
+  const caseInsensitiveWords: string[] = [];
+  const directiveTypes = new Map<string, string>();
+  const stopWords = new Set<string>();
+
+  try {
+    const content = await fetcher.getFile("odd/challenge/normative-vocabulary.md", canonUrl);
+    if (content) {
+      // ── Surface 1: Normative Vocabulary (signal in canon quotes) ──
+      // Two subsections under "## Normative Vocabulary": one keyed by "RFC 2119"
+      // or "Directive Language" (case-sensitive), one for architectural-writing
+      // load-bearing phrases (case-insensitive). Each is a markdown table with
+      // (Word | Directive type).
+      const sections = content.split(/###\s+/);
+      for (const section of sections) {
+        const isCaseSensitive = /RFC 2119|Directive Language/i.test(section.split("\n")[0] || "");
+        const tableMatch = section.match(/\|\s*(?:Word|Phrase)\s*\|[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/);
+        if (!tableMatch) continue;
+        for (const row of tableMatch[1].split("\n").filter((r: string) => r.includes("|"))) {
+          const cols = parseTableRow(row);
+          if (cols.length >= 2) {
+            const phrase = cols[0];
+            const dtype = cols[1];
+            directiveTypes.set(phrase, dtype);
+            if (isCaseSensitive) caseSensitiveWords.push(phrase);
+            else caseInsensitiveWords.push(phrase);
+          }
+        }
+      }
+
+      // ── Surface 2: Detection Noise (filler in user input) ──
+      // A code block of comma-and-newline separated words under "## Detection
+      // Noise". The set is passed to the BM25 indexer as the custom stop-word
+      // filter. Modal verbs and negation are deliberately absent — they are
+      // signal for strong-claim, proposal, and assumption type detection.
+      // If the section is missing, stopWords stays empty and BM25 falls back
+      // to IDF-only filtering — an explicit governance choice in the article.
+ const noiseMatch = content.match(/## Detection Noise[\s\S]*?```\n([\s\S]*?)\n```/); + if (noiseMatch) { + for (const word of noiseMatch[1].split(/[,\n]/)) { + const w = word.trim().toLowerCase(); + if (w.length > 0) stopWords.add(w); + } + } + } + } catch { + // Graceful degradation below + } + + // Fallback: minimal built-in RFC 2119 if article missing + if (caseSensitiveWords.length === 0 && caseInsensitiveWords.length === 0) { + for (const w of ["MUST", "MUST NOT", "SHOULD", "SHOULD NOT"]) { + caseSensitiveWords.push(w); + directiveTypes.set(w, w.includes("NOT") ? "prohibition" : "requirement"); + } + } + + const escape = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); + const caseSensitiveRegex = + caseSensitiveWords.length > 0 + ? new RegExp( + "\\b(" + + [...caseSensitiveWords].sort((a, b) => b.length - a.length).map(escape).join("|") + + ")\\b", + "g", + ) + : null; + const caseInsensitiveRegex = + caseInsensitiveWords.length > 0 + ? new RegExp( + "(" + + [...caseInsensitiveWords].sort((a, b) => b.length - a.length).map(escape).join("|") + + ")", + "gi", + ) + : null; + + const vocab: NormativeVocabulary = { + caseSensitiveRegex, + caseInsensitiveRegex, + directiveTypes, + stopWords, + }; + cachedNormativeVocabulary = vocab; + cachedNormativeVocabularyCanonUrl = canonUrl; + return vocab; +} + +async function fetchStakesCalibration( + fetcher: ZipBaselineFetcher, + canonUrl?: string, +): Promise { + if (cachedStakesCalibration && cachedStakesCalibrationCanonUrl === canonUrl) + return cachedStakesCalibration; + + const byMode = new Map(); + try { + const content = await fetcher.getFile("odd/challenge/stakes-calibration.md", canonUrl); + if (content) { + // Parse the Stakes Calibration table: + // | Mode | Question tiers surfaced | Prerequisite strictness | Reframings surfaced | + const tableMatch = content.match( + /## Stakes Calibration[\s\S]*?\| Mode[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/, + ); + if (tableMatch) { + for (const row of tableMatch[1].split("\n").filter((r: string) => r.includes("|"))) { + const cols = parseTableRow(row); + if (cols.length >= 4) { + const mode = cols[0].toLowerCase(); + const tiersRaw = cols[1].toLowerCase().trim(); + // The cell may be "none" or "none (suppress all challenge)" — both mean + // empty tier list and trigger the voice-dump suppression invariant. + // Without this leading-"none" check the suppression invariant ships broken. + const isNone = tiersRaw === "none" || tiersRaw.startsWith("none ") || tiersRaw.startsWith("none("); + const questionTiers: string[] = isNone + ? 
[] : tiersRaw.split(",").map((t: string) => t.trim()).filter((t: string) => t.length > 0);
+            byMode.set(mode, {
+              questionTiers,
+              prerequisiteStrictness: cols[2],
+              reframingSurfacing: cols[3],
+            });
+          }
+        }
+      }
+    }
+  } catch {
+    // Graceful degradation below
+  }
+
+  cachedStakesCalibration = { byMode };
+  cachedStakesCalibrationCanonUrl = canonUrl;
+  return cachedStakesCalibration;
+}
+
 function isStructuredInput(input: string): boolean {
   const lines = input.split("\n").filter((l) => l.trim().length > 0);
   return lines.length > 0 && lines.every((l) => /^[A-Z]\t/.test(l));
 }
@@ -740,6 +1139,17 @@ async function runCleanupStorage(
   cachedBM25Entries = null;
   cachedEncodingTypes = null;
   cachedEncodingTypesCanonUrl = undefined;
+  // E0008 — governance-driven challenge caches (mirror PR #96 fix)
+  cachedChallengeTypes = null;
+  cachedChallengeTypesCanonUrl = undefined;
+  cachedChallengeTypeIndex = null;
+  cachedChallengeTypeIndexCanonUrl = undefined;
+  cachedBasePrerequisites = null;
+  cachedBasePrerequisitesCanonUrl = undefined;
+  cachedNormativeVocabulary = null;
+  cachedNormativeVocabularyCanonUrl = undefined;
+  cachedStakesCalibration = null;
+  cachedStakesCalibrationCanonUrl = undefined;
 
   return {
     action: "cleanup_storage",
@@ -1150,6 +1560,30 @@ async function runOrientAction(
   };
 }
 
+// Governance-driven tension detection helper.
+//
+// `.match()` with a combined alternation returns the *leftmost* hit, so
+// "You MUST do X and MUST NOT do Y" would resolve to "MUST" (requirement)
+// even though a prohibition is present later in the excerpt. Collect all
+// matches via `matchAll` and prefer a prohibition over any other directive
+// type, falling back to the leftmost match otherwise. This preserves the
+// prior two-test priority (MUST NOT before MUST) without coupling to a
+// hard-coded vocabulary.
+function pickStrongestDirective(
+  matches: IterableIterator<RegExpMatchArray>,
+  lookup: (phrase: string) => string | undefined,
+): { phrase: string; dtype: string } | null {
+  let first: { phrase: string; dtype: string } | null = null;
+  let prohibition: { phrase: string; dtype: string } | null = null;
+  for (const m of matches) {
+    const phrase = m[1];
+    const dtype = lookup(phrase) || "directive";
+    if (!first) first = { phrase, dtype };
+    if (!prohibition && dtype === "prohibition") prohibition = { phrase, dtype };
+  }
+  return prohibition || first;
+}
+
@@ -1158,72 +1592,218 @@ async function runChallengeAction(
   input: string,
   modeHint: string | undefined,
   fetcher: ZipBaselineFetcher,
   canonUrl?: string,
   state?: OddkitState,
 ): Promise {
   const startMs = Date.now();
-  const claimType = detectClaimType(input);
+  const mode = (modeHint || "planning").toLowerCase();
+
+  // Load governance in parallel
+  const [types, basePrereqs, vocab, calibration] = await Promise.all([
+    discoverChallengeTypes(fetcher, canonUrl),
+    fetchBasePrerequisites(fetcher, canonUrl),
+    fetchNormativeVocabulary(fetcher, canonUrl),
+    fetchStakesCalibration(fetcher, canonUrl),
+  ]);
+
+  const modeConfig = calibration.byMode.get(mode);
+
+  // Detect matching types via BM25 over per-type detection text.
+  // Stemming makes "coining" match "coin", "rolled" match "rollback", etc.
+  // score > 0 = match (BM25 returns 0 when no stemmed query terms hit).
+  // Multi-match preserved: a single input may score against several types.
+  // Detection runs BEFORE the voice-dump suppression check so the SUPPRESSED
+  // response can still expose `governance` — the model sees what would have
+  // fired without surfacing the pressure-test questions.
+  // Stop words come from `## Detection Noise` in normative-vocabulary.md
+  // (governance), not a hardcoded constant in this file.
+  const typeIndex = getOrBuildChallengeTypeIndex(types, vocab, canonUrl);
+  const matchedTypes: ChallengeTypeDef[] = [];
+  const hits = searchBM25(typeIndex, input, types.length);
+  const typeBySlug = new Map<string, ChallengeTypeDef>(types.map((t) => [t.slug, t]));
+  for (const hit of hits) {
+    const t = typeBySlug.get(hit.id);
+    if (t) matchedTypes.push(t);
+  }
+
+  // Fallback resolution when no type scored above zero
+  if (matchedTypes.length === 0) {
+    const fallback = types.find((t) => t.fallback) || types[0];
+    if (fallback) matchedTypes.push(fallback);
+  }
+
+  // Voice-dump invariant: suppress all challenge output regardless of matched types.
+  // Encoded at klappy://odd/challenge/stakes-calibration. Some modes exist for getting
+  // thoughts out of the head; pressure-testing at that stage damages the mode.
+  // The `governance` field is still surfaced so the model sees what types matched.
+  if (modeConfig && modeConfig.questionTiers.length === 0) {
+    return {
+      action: "challenge",
+      result: {
+        status: "SUPPRESSED",
+        mode,
+        claim_type: matchedTypes[0]?.slug,
+        matched_types: matchedTypes.map((t) => t.slug),
+        governance: matchedTypes.map((t) => ({
+          slug: t.slug,
+          name: t.name,
+          description: t.blockquote,
+        })),
+        tensions: [],
+        missing_prerequisites: [],
+        challenges: [],
+        suggested_reframings: [],
+        canon_constraints: [],
+        suppression_reason:
+          `Mode '${mode}' suppresses challenge output. Challenge is not applied during raw thought capture.`,
+      },
+      state: state ? initState(state) : undefined,
+      assistant_text: `Challenge suppressed for mode '${mode}'. Raw thought capture protected.`,
+      debug: { duration_ms: Date.now() - startMs, generated_at: new Date().toISOString() },
+    };
+  }
+
+  // Aggregate questions across matched types, deduped by question string
+  const questionMap = new Map<string, { question: string; tier: string }>();
+  for (const t of matchedTypes) {
+    for (const q of t.questions) {
+      if (!questionMap.has(q.question)) questionMap.set(q.question, q);
+    }
+  }
+
+  // Aggregate prerequisite overlays: base + all matched type overlays, deduped by prerequisite name
+  const prereqMap = new Map<string, BasePrerequisite>();
+  for (const p of basePrereqs) {
+    prereqMap.set(p.prerequisite, p);
+  }
+  for (const t of matchedTypes) {
+    for (const p of t.prerequisiteOverlays) {
+      if (!prereqMap.has(p.prerequisite)) prereqMap.set(p.prerequisite, p);
+    }
+  }
+
+  // Aggregate reframings across matched types, deduped by string equality
+  const reframingSet = new Set<string>();
+  const reframingsByType = new Map<string, string[]>();
+  for (const t of matchedTypes) {
+    const typeReframings: string[] = [];
+    for (const r of t.reframings) {
+      if (!reframingSet.has(r)) {
+        reframingSet.add(r);
+        typeReframings.push(r);
+      }
+    }
+    reframingsByType.set(t.slug, typeReframings);
+  }
+
+  // Apply stakes calibration: filter questions by tier, evaluate prerequisites by strictness,
+  // surface reframings by the surfacing rule. When modeConfig is absent (no calibration
+  // article or mode not in table), surface everything — "uniformly loud" fallback.
+  // Note: the questionTiers.length === 0 case is impossible here because the
+  // SUPPRESSED early-return above already handled it. We branch only on
+  // modeConfig presence and tier-membership.
+ const surfacedQuestions: string[] = []; + for (const q of questionMap.values()) { + if (!modeConfig || modeConfig.questionTiers.includes(q.tier)) { + surfacedQuestions.push(q.question); + } + } + + const strictness = modeConfig?.prerequisiteStrictness?.toLowerCase() || "required"; + const missing: string[] = []; + for (const p of prereqMap.values()) { + const passed = evaluatePrerequisiteCheck(input, p.check); + if (!passed) { + // source-named check is escalated to blocking when strictness says so + if (strictness.includes("optional") && !p.prerequisite.includes("source-named")) { + continue; + } + missing.push(p.gapMessage); + } + } + + const surfacing = modeConfig?.reframingSurfacing?.toLowerCase() || "all"; + const allReframings: string[] = []; + for (const typeReframings of reframingsByType.values()) { + allReframings.push(...typeReframings); + } + let surfacedReframings: string[] = []; + // Same defensive shape as the tiersRaw "none" check in fetchStakesCalibration. + // The cell may be "none" or "none (parenthetical reason)" — both mean suppress + // all reframings. Strict equality would let the parenthetical fall through to + // the "all" branch and silently surface every reframing for a mode that opted + // out of them. + const surfaceNone = + surfacing === "none" || surfacing.startsWith("none ") || surfacing.startsWith("none("); + if (surfaceNone) { + surfacedReframings = []; + } else if ( + surfacing.includes("first 1") || + surfacing.includes("first-1") || + surfacing.includes("first one") + ) { + // Surface at most one reframing total — across all matched types, not one per type. + // The governance phrase "first 1" means a single reframing in the response; + // multi-match should not multiply the surfacing. + surfacedReframings = allReframings.slice(0, 1); + } else { + // "all" or "all, plus block-until-addressed" + surfacedReframings = allReframings; + } + const blockUntilAddressed = surfacing.includes("block-until-addressed"); + + // Retrieve canon quotes and detect tensions via governance-driven vocabulary const index = await fetcher.getIndex(canonUrl); const results = scoreEntries(index.entries, `constraints challenges risks ${input}`).slice(0, 4); const canonConstraints: Array<{ citation: string; quote: string }> = []; - const tensions: Array<{ type: string; message: string }> = []; + const tensions: Array<{ type: string; message: string; citation?: string; quote?: string }> = []; for (const entry of results) { const content = await fetcher.getFile(entry.path, canonUrl); if (content) { const stripped = content.replace(/^---[\s\S]*?---\n/, ""); const lines = stripped.split("\n").filter((l) => l.trim() && !l.startsWith("#")); const excerpt = lines.slice(0, 2).join(" ").slice(0, 150); - canonConstraints.push({ citation: `${entry.path}#${entry.title}`, quote: excerpt }); - if (/\bMUST NOT\b/.test(excerpt)) - tensions.push({ type: "prohibition", message: `Canon prohibition found in ${entry.path}` }); - else if (/\bMUST\b/.test(excerpt)) - tensions.push({ type: "requirement", message: `Canon requirement found in ${entry.path}` }); + const citation = `${entry.path}#${entry.title}`; + canonConstraints.push({ citation, quote: excerpt }); + + if (vocab.caseSensitiveRegex) { + const hit = pickStrongestDirective( + excerpt.matchAll(vocab.caseSensitiveRegex), + (p) => vocab.directiveTypes.get(p), + ); + if (hit) { + tensions.push({ + type: hit.dtype, + message: `Canon ${hit.dtype} (${hit.phrase}) found in ${entry.path}`, + citation, + quote: excerpt, + }); + continue; + } + } + if 
(vocab.caseInsensitiveRegex) { + const hit = pickStrongestDirective( + excerpt.matchAll(vocab.caseInsensitiveRegex), + (p) => vocab.directiveTypes.get(p) || vocab.directiveTypes.get(p.toLowerCase()) || "load-bearing-claim", + ); + if (hit) { + tensions.push({ + type: hit.dtype, + message: `Canon ${hit.dtype} (${hit.phrase}) found in ${entry.path}`, + citation, + quote: excerpt, + }); + } + } } } - const missing: string[] = []; - if (!/\bevidence\b/i.test(input) && !/\bdata\b/i.test(input)) - missing.push("No evidence cited — claims without evidence are assumptions"); - if (claimType === "strong_claim" || claimType === "proposal") { - if (!/\balternative/i.test(input)) missing.push("No alternatives mentioned"); - if (!/\brisk/i.test(input) && !/\bcost\b/i.test(input)) - missing.push("No risks or costs acknowledged"); - } - - const challenges: string[] = []; - if (claimType === "strong_claim") { - challenges.push( - "What evidence would disprove this?", - "Under what conditions does this NOT hold?", - "Who would disagree, and why?", - ); - } else if (claimType === "proposal") { - challenges.push( - "What's the cost of being wrong?", - "What alternatives were considered?", - "What would need to be true for this to fail?", - ); - } else if (claimType === "assumption") { - challenges.push( - "Has this assumption been validated?", - "What if this assumption is wrong — what breaks?", - ); - } else { - challenges.push("Is this observation representative?", "What context might change this?"); - } - - const reframings: string[] = []; - if (claimType === "strong_claim") - reframings.push("Reframe as hypothesis: 'We believe X because Y, and would reconsider if Z'"); - if (claimType === "assumption") - reframings.push("Make explicit: state the assumption and how you'd validate it"); - if (claimType === "proposal") - reframings.push("Add optionality: 'We're choosing X over Y because Z, reversible until W'"); - // Update state const updatedState = state ? 
initState(state) : undefined; if (updatedState && missing.length > 0) { updatedState.unresolved = [...updatedState.unresolved, ...missing]; } - const lines = [`Challenge (${claimType}):`, ""]; + // Assistant text — preserves prior format, extends with matched types and mode + const matchedSlugs = matchedTypes.map((t) => t.slug); + const lines = [`Challenge (${matchedSlugs.join(", ") || "no-match"}) [mode: ${mode}]:`, ""]; if (tensions.length > 0) { lines.push("Tensions found:"); for (const t of tensions) lines.push(` - [${t.type}] ${t.message}`); @@ -1234,12 +1814,20 @@ async function runChallengeAction( for (const m of missing) lines.push(` - ${m}`); lines.push(""); } - lines.push("Questions to address:"); - for (const c of challenges) lines.push(` - ${c}`); - lines.push(""); - if (reframings.length > 0) { + if (surfacedQuestions.length > 0) { + lines.push("Questions to address:"); + for (const c of surfacedQuestions) lines.push(` - ${c}`); + lines.push(""); + } + if (surfacedReframings.length > 0) { lines.push("Suggested reframings:"); - for (const r of reframings) lines.push(` - ${r}`); + for (const r of surfacedReframings) lines.push(` - ${r}`); + lines.push(""); + } + if (blockUntilAddressed && (missing.length > 0 || tensions.length > 0)) { + lines.push( + "⚠ Block-until-addressed: in this mode, the claim should not proceed until the gaps above are resolved or explicitly declined.", + ); lines.push(""); } if (canonConstraints.length > 0) { @@ -1255,11 +1843,19 @@ async function runChallengeAction( action: "challenge", result: { status: "CHALLENGED", - claim_type: claimType, + mode, + claim_type: matchedSlugs[0], + matched_types: matchedSlugs, + governance: matchedTypes.map((t) => ({ + slug: t.slug, + name: t.name, + description: t.blockquote, + })), tensions, missing_prerequisites: missing, - challenges, - suggested_reframings: reframings, + challenges: surfacedQuestions, + suggested_reframings: surfacedReframings, + block_until_addressed: blockUntilAddressed, canon_constraints: canonConstraints, }, state: updatedState, @@ -1268,6 +1864,38 @@ async function runChallengeAction( }; } +// Governance-driven check evaluator — interprets natural-language `check` strings +// from ## Prerequisite Overlays tables. Uses cheap heuristics: substring matching +// against quoted keywords in the check description, plus a few special-case patterns. +function evaluatePrerequisiteCheck(input: string, check: string): boolean { + // Extract quoted keywords like "evidence", "observed", "alternative" + const quotedKeywords: string[] = []; + const quotedRegex = /"([^"]+)"/g; + let m: RegExpExecArray | null; + while ((m = quotedRegex.exec(check)) !== null) { + quotedKeywords.push(m[1]); + } + + if (quotedKeywords.length > 0) { + // Pass if ANY quoted keyword appears in input (case-insensitive, word-boundary where possible) + for (const kw of quotedKeywords) { + const escaped = kw.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); + // Use word-boundary for single words, substring for phrases + const pattern = /^\w+$/.test(kw) ? 
new RegExp("\\b" + escaped + "\\b", "i") : new RegExp(escaped, "i"); + if (pattern.test(input)) return true; + } + // Special-case check descriptions that mention URLs, citations, numeric markers + if (/\bURL\b/i.test(check) && /https?:\/\//.test(input)) return true; + if (/numeric/i.test(check) && /\d/.test(input)) return true; + if (/proper-?noun/i.test(check) && /\b[A-Z][a-z]+\s+[A-Z]/.test(input)) return true; + if (/citation/i.test(check) && /\[\d+\]|\bper\s+[A-Z]|\baccording to\b/i.test(input)) return true; + return false; + } + + // No quoted keywords: conservative fallback — passes if input is non-trivial + return input.trim().length >= 20; +} + async function runGateAction( input: string, context: string | undefined, diff --git a/workers/test/governance-parser.test.mjs b/workers/test/governance-parser.test.mjs new file mode 100644 index 0000000..4ae40c5 --- /dev/null +++ b/workers/test/governance-parser.test.mjs @@ -0,0 +1,346 @@ +#!/usr/bin/env node +/** + * Parser-fidelity test for governance-driven challenge extraction. + * + * Fetches the 11 live governance articles from klappy.dev and runs the same + * regex patterns used in workers/src/orchestrate.ts to confirm the parsers + * correctly extract types, questions, prerequisites, vocabulary, and calibration. + * + * This is not a worker integration test — it exercises the parser logic + * outside the Cloudflare runtime. Run pre-PR to verify parser regexes match + * real-world article structure. + */ + +import { readFile } from "node:fs/promises"; +import { fileURLToPath } from "node:url"; +import { dirname, join } from "node:path"; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const REPO_ROOT = join(__dirname, "..", ".."); + +// Articles to test against — these MUST exist in the local clone of klappy.dev +// or we fetch from raw.githubusercontent.com +// Default to main; override via KLAPPYDEV_RAW env var when testing against +// an unmerged feature branch (e.g. while klappy.dev#100 is still open). +const KLAPPYDEV_RAW = + process.env.KLAPPYDEV_RAW || "https://raw.githubusercontent.com/klappy/klappy.dev/main"; +const ARTICLE_PATHS = { + meta: "odd/challenge-types/how-to-write-challenge-types.md", + strongClaim: "odd/challenge-types/strong-claim.md", + proposal: "odd/challenge-types/proposal.md", + assumption: "odd/challenge-types/assumption.md", + observation: "odd/challenge-types/observation.md", + patternCoinage: "odd/challenge-types/pattern-coinage.md", + comparativePositioning: "odd/challenge-types/comparative-positioning.md", + principleExtraction: "odd/challenge-types/principle-extraction.md", + basePrerequisites: "odd/challenge/base-prerequisites.md", + normativeVocabulary: "odd/challenge/normative-vocabulary.md", + stakesCalibration: "odd/challenge/stakes-calibration.md", +}; + +async function fetchArticle(path) { + const url = `${KLAPPYDEV_RAW}/${path}`; + const r = await fetch(url); + if (!r.ok) throw new Error(`Failed to fetch ${url}: ${r.status}`); + return r.text(); +} + +// ────────────────────────────────────────────────────────────────────────── +// Parser logic — verbatim copies of the regexes in workers/src/orchestrate.ts +// ────────────────────────────────────────────────────────────────────────── + +// Mirror of `parseTableRow` in workers/src/orchestrate.ts. Preserves +// legitimately-empty interior cells (a prior `.filter(c => c.length > 0)` +// approach dropped them and silently collapsed column indexes). 
+
 async function runGateAction(
   input: string,
   context: string | undefined,
diff --git a/workers/test/governance-parser.test.mjs b/workers/test/governance-parser.test.mjs
new file mode 100644
index 0000000..4ae40c5
--- /dev/null
+++ b/workers/test/governance-parser.test.mjs
@@ -0,0 +1,346 @@
+#!/usr/bin/env node
+/**
+ * Parser-fidelity test for governance-driven challenge extraction.
+ *
+ * Fetches the 11 live governance articles from klappy.dev and runs the same
+ * regex patterns used in workers/src/orchestrate.ts to confirm the parsers
+ * correctly extract types, questions, prerequisites, vocabulary, and calibration.
+ *
+ * This is not a worker integration test — it exercises the parser logic
+ * outside the Cloudflare runtime. Run pre-PR to verify parser regexes match
+ * real-world article structure.
+ */
+
+import { readFile } from "node:fs/promises";
+import { fileURLToPath } from "node:url";
+import { dirname, join } from "node:path";
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const REPO_ROOT = join(__dirname, "..", "..");
+
+// Articles to test against — these MUST exist in the local clone of klappy.dev,
+// or we fetch them from raw.githubusercontent.com.
+// Default to main; override via KLAPPYDEV_RAW env var when testing against
+// an unmerged feature branch (e.g. while klappy.dev#100 is still open).
+const KLAPPYDEV_RAW =
+  process.env.KLAPPYDEV_RAW || "https://raw.githubusercontent.com/klappy/klappy.dev/main";
+const ARTICLE_PATHS = {
+  meta: "odd/challenge-types/how-to-write-challenge-types.md",
+  strongClaim: "odd/challenge-types/strong-claim.md",
+  proposal: "odd/challenge-types/proposal.md",
+  assumption: "odd/challenge-types/assumption.md",
+  observation: "odd/challenge-types/observation.md",
+  patternCoinage: "odd/challenge-types/pattern-coinage.md",
+  comparativePositioning: "odd/challenge-types/comparative-positioning.md",
+  principleExtraction: "odd/challenge-types/principle-extraction.md",
+  basePrerequisites: "odd/challenge/base-prerequisites.md",
+  normativeVocabulary: "odd/challenge/normative-vocabulary.md",
+  stakesCalibration: "odd/challenge/stakes-calibration.md",
+};
+
+async function fetchArticle(path) {
+  const url = `${KLAPPYDEV_RAW}/${path}`;
+  const r = await fetch(url);
+  if (!r.ok) throw new Error(`Failed to fetch ${url}: ${r.status}`);
+  return r.text();
+}
+
+// ──────────────────────────────────────────────────────────────────────────
+// Parser logic — verbatim copies of the regexes in workers/src/orchestrate.ts
+// ──────────────────────────────────────────────────────────────────────────
+
+// Mirror of `parseTableRow` in workers/src/orchestrate.ts. Preserves
+// legitimately-empty interior cells (a prior `.filter(c => c.length > 0)`
+// approach dropped them and silently collapsed column indexes).
+function parseTableRow(row) {
+  const parts = row.split("|");
+  if (parts.length > 0 && parts[0].trim() === "") parts.shift();
+  if (parts.length > 0 && parts[parts.length - 1].trim() === "") parts.pop();
+  return parts.map((c) => c.trim());
+}
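+// e.g. parseTableRow("| a |  | c |") → ["a", "", "c"] — the empty middle cell
+// survives, so downstream column indexes (question/tier, prerequisite/check/gap)
+// stay aligned.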
+
+function parseChallengeType(content) {
+  const slugMatch = content.match(/\|\s*Slug\s*\|\s*([^|]+)\s*\|/);
+  const nameMatch = content.match(/\|\s*Name\s*\|\s*([^|]+)\s*\|/);
+  if (!slugMatch) return null;
+  const slug = slugMatch[1].trim();
+  const name = nameMatch ? nameMatch[1].trim() : slug;
+
+  const blockquoteMatch = content.match(/^#\s[^\n]+\n+>\s*([^\n]+(?:\n>\s*[^\n]+)*)/m);
+  const blockquote = blockquoteMatch
+    ? blockquoteMatch[1].replace(/\n>\s*/g, " ").trim()
+    : "";
+
+  const detectionSection = content.match(
+    /## Detection Patterns[\s\S]*?```\n([\s\S]*?)\n```/,
+  );
+  const triggerWords = detectionSection
+    ? detectionSection[1].split(",").map((w) => w.trim()).filter((w) => w.length > 0)
+    : [];
+
+  const questionsSection = content.match(
+    /## Challenge Questions[\s\S]*?\| Question[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
+  );
+  const questions = [];
+  if (questionsSection) {
+    for (const row of questionsSection[1].split("\n").filter((r) => r.includes("|"))) {
+      const cols = parseTableRow(row);
+      if (cols.length >= 2) questions.push({ question: cols[0], tier: cols[1] });
+    }
+  }
+
+  const prereqSection = content.match(
+    /## Prerequisite Overlays[\s\S]*?\| Prerequisite[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
+  );
+  const prerequisiteOverlays = [];
+  if (prereqSection) {
+    for (const row of prereqSection[1].split("\n").filter((r) => r.includes("|"))) {
+      const cols = parseTableRow(row);
+      if (cols.length >= 3) {
+        prerequisiteOverlays.push({
+          prerequisite: cols[0],
+          check: cols[1],
+          gapMessage: cols[2].replace(/^"|"$/g, "").replace(/\{name\}/g, name),
+        });
+      }
+    }
+  }
+
+  const reframingsSection = content.match(/## Suggested Reframings[\s\S]*?\n((?:-\s+[^\n]+\n?)+)/);
+  const reframings = [];
+  if (reframingsSection) {
+    for (const line of reframingsSection[1].split("\n")) {
+      const m = line.match(/^-\s+(.+)$/);
+      if (m) reframings.push(m[1].trim());
+    }
+  }
+
+  const fmMatch = content.match(/^---\n([\s\S]*?)\n---/);
+  let fallback = false;
+  if (fmMatch) {
+    fallback = /^fallback:\s*true\s*$/m.test(fmMatch[1]);
+  }
+
+  return { slug, name, blockquote, triggerWords, questions, prerequisiteOverlays, reframings, fallback };
+}
+
+function parseBasePrereqs(content) {
+  const section = content.match(
+    /## Prerequisite Overlays[\s\S]*?\| Prerequisite[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
+  );
+  const result = [];
+  if (section) {
+    for (const row of section[1].split("\n").filter((r) => r.includes("|"))) {
+      const cols = parseTableRow(row);
+      if (cols.length >= 3) {
+        result.push({ prerequisite: cols[0], check: cols[1], gapMessage: cols[2].replace(/^"|"$/g, "") });
+      }
+    }
+  }
+  return result;
+}
+
+function parseNormativeVocab(content) {
+  const caseSensitive = [];
+  const caseInsensitive = [];
+  const sections = content.split(/###\s+/);
+  for (const section of sections) {
+    const isCS = /RFC 2119|Directive Language/i.test(section.split("\n")[0] || "");
+    const tableMatch = section.match(/\|\s*(?:Word|Phrase)\s*\|[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/);
+    if (!tableMatch) continue;
+    for (const row of tableMatch[1].split("\n").filter((r) => r.includes("|"))) {
+      const cols = parseTableRow(row);
+      if (cols.length >= 2) {
+        if (isCS) caseSensitive.push(cols[0]);
+        else caseInsensitive.push(cols[0]);
+      }
+    }
+  }
+  return { caseSensitive, caseInsensitive };
+}
+
+function parseStakesCalibration(content) {
+  const tableMatch = content.match(
+    /## Stakes Calibration[\s\S]*?\| Mode[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
+  );
+  const byMode = new Map();
+  if (tableMatch) {
+    for (const row of tableMatch[1].split("\n").filter((r) => r.includes("|"))) {
+      const cols = parseTableRow(row);
+      if (cols.length >= 4) {
+        const tiersRaw = cols[1].toLowerCase().trim();
+        const isNone = tiersRaw === "none" || tiersRaw.startsWith("none ") || tiersRaw.startsWith("none(");
+        const tiers = isNone ? [] : tiersRaw.split(",").map((t) => t.trim()).filter((t) => t);
+        byMode.set(cols[0].toLowerCase(), { tiers, strictness: cols[2], surfacing: cols[3] });
+      }
+    }
+  }
+  return byMode;
+}
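+// e.g. a calibration row `| Voice-dump | none | relaxed | suppressed |` parses to
+//   byMode.get("voice-dump") → { tiers: [], strictness: "relaxed", surfacing: "suppressed" }
+// (cell values here are hypothetical; the "none" → empty-tiers rule is what matters).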
+
+// ──────────────────────────────────────────────────────────────────────────
+// Tests
+// ──────────────────────────────────────────────────────────────────────────
+
+let passed = 0;
+let failed = 0;
+
+function ok(name, cond, detail = "") {
+  if (cond) { console.log(` ✓ ${name}`); passed++; }
+  else { console.log(` ✗ ${name}${detail ? " — " + detail : ""}`); failed++; }
+}
+
+async function run() {
+  console.log("Fetching 11 governance articles from klappy.dev...\n");
+
+  const articles = {};
+  for (const [key, path] of Object.entries(ARTICLE_PATHS)) {
+    articles[key] = await fetchArticle(path);
+  }
+
+  console.log("─── Test 1: Challenge type parsing ───");
+  const types = [];
+  for (const key of ["strongClaim", "proposal", "assumption", "observation", "patternCoinage", "comparativePositioning", "principleExtraction"]) {
+    const t = parseChallengeType(articles[key]);
+    types.push(t);
+    ok(`${key} parses`, t !== null);
+    if (t) {
+      ok(`${key} has slug`, t.slug.length > 0, `got "${t.slug}"`);
+      ok(`${key} has name`, t.name.length > 0, `got "${t.name}"`);
+      ok(`${key} has blockquote`, t.blockquote.length > 20, `got ${t.blockquote.length} chars`);
+      ok(`${key} has trigger words`, t.triggerWords.length >= 3, `got ${t.triggerWords.length}`);
+      ok(`${key} has questions`, t.questions.length >= 2, `got ${t.questions.length}`);
+      ok(`${key} questions have tiers`, t.questions.every((q) => ["baseline", "elevated", "rigorous"].includes(q.tier)), `tiers: ${[...new Set(t.questions.map((q) => q.tier))].join(",")}`);
+      ok(`${key} has prerequisite overlays`, t.prerequisiteOverlays.length >= 1, `got ${t.prerequisiteOverlays.length}`);
+      ok(`${key} has reframings`, t.reframings.length >= 1, `got ${t.reframings.length}`);
+    }
+  }
+
+  console.log("\n─── Test 2: Fallback resolution ───");
+  const observation = types.find((t) => t && t.slug === "observation");
+  ok("observation has fallback: true", observation && observation.fallback === true);
+  const otherTypes = types.filter((t) => t && t.slug !== "observation");
+  ok("non-fallback types do not have fallback: true", otherTypes.every((t) => !t.fallback));
+
+  console.log("\n─── Test 3: BM25 detection with stemming ───");
+  // Build the per-type BM25 index the same way the worker does
+  const { buildBM25Index, searchBM25, stem } = await import("../src/bm25.ts").catch(() =>
+    import("../src/bm25.js"),
+  );
+  const detectionDocs = types
+    .filter((t) => t)
+    .map((t) => ({
+      id: t.slug,
+      text: [t.triggerWords.join(" "), t.blockquote].filter((s) => s.length > 0).join(" "),
+    }));
+  // Stop words come from the `## Detection Noise` section of normative-vocabulary.md
+  // (governance), exactly the same way the worker reads them. No hardcoded
+  // duplicate in this test — a drifted copy could let the test pass while production fails.
+  const noiseMatch = articles.normativeVocabulary.match(
+    /## Detection Noise[\s\S]*?```\n([\s\S]*?)\n```/,
+  );
+  const stopWords = new Set();
+  if (noiseMatch) {
+    for (const word of noiseMatch[1].split(/[,\n]/)) {
+      const w = word.trim().toLowerCase();
+      if (w.length > 0) stopWords.add(w);
+    }
+  }
+  ok(
+    "Detection Noise section parses non-empty stop word set",
+    stopWords.size > 0,
+    `parsed ${stopWords.size} stop words`,
+  );
+  ok(
+    "Detection Noise excludes modal verbs (signal preservation)",
+    !stopWords.has("must") && !stopWords.has("should") && !stopWords.has("not"),
+    `must=${stopWords.has("must")} should=${stopWords.has("should")} not=${stopWords.has("not")}`,
+  );
+  ok(
+    "Detection Noise includes common filler",
+    stopWords.has("the") && stopWords.has("of") && stopWords.has("in"),
+  );
+  const bm25 = buildBM25Index(detectionDocs, stopWords);
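+  // What the index sees (shape per detectionDocs above; text values are invented):
+  //   { id: "pattern-coinage", text: "coining coined ... <blockquote text>" }
+  // Tokenized with the governance stop-word set, so modals like "must" remain
+  // indexable signal instead of being stripped as noise.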
p.prerequisite === "evidence-cited")); + ok("base includes source-named", basePrereqs.some((p) => p.prerequisite === "source-named")); + ok("base includes confidence-signaled", basePrereqs.some((p) => p.prerequisite === "confidence-signaled")); + + console.log("\n─── Test 6: Normative vocabulary ───"); + const vocab = parseNormativeVocab(articles.normativeVocabulary); + ok("case-sensitive RFC 2119 words present", vocab.caseSensitive.length >= 4, `got ${vocab.caseSensitive.length}: ${vocab.caseSensitive.slice(0,5).join(",")}`); + ok("case-insensitive architectural words present", vocab.caseInsensitive.length >= 3, `got ${vocab.caseInsensitive.length}: ${vocab.caseInsensitive.slice(0,5).join(",")}`); + ok("includes MUST", vocab.caseSensitive.includes("MUST")); + ok("includes invariant", vocab.caseInsensitive.includes("invariant")); + + console.log("\n─── Test 7: Stakes calibration ───"); + const calib = parseStakesCalibration(articles.stakesCalibration); + ok("calibration parses 9 modes", calib.size >= 9, `got ${calib.size} modes: ${[...calib.keys()].join(", ")}`); + ok("voice-dump exists", calib.has("voice-dump")); + ok("voice-dump has empty tiers (suppression invariant)", calib.get("voice-dump")?.tiers.length === 0); + ok("planning has baseline+elevated", calib.get("planning")?.tiers.length === 2); + ok("execution has all three tiers", calib.get("execution")?.tiers.length === 3); + + console.log(`\n${passed} passed, ${failed} failed`); + process.exit(failed === 0 ? 0 : 1); +} + +run().catch((e) => { console.error(e); process.exit(1); });