From 259170aa988ab9f90523ac0d040abd50f5bec9d9 Mon Sep 17 00:00:00 2001 From: Klappy Date: Mon, 20 Apr 2026 12:45:50 +0000 Subject: [PATCH 1/5] feat(encode): D5 stemmed set intersection + D9 cache removal (0.22.0) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Migrates oddkit_encode trigger-word classifier from regex alternation to stemmed set intersection — the last regex matcher in the canon-parity sweep. Same D5 shape challenge adopted in 0.21.0 and gate adopted in 0.20.0. Closes the sweep. Changed: - EncodingTypeDef.triggerRegex: RegExp | null → stemmedTokens: Set - Parse-time: tokenize(word, new Set()) per canon trigger word; stems go into a Set built once per fetch. Stop-words disabled on both sides per P1.3.3 C-04 — canon vocab includes stop-word phrases such as "going with" (Decision), "committed to" (Decision), "must not" (Constraint), "turns out" (Learning), "found that" (Learning), "next step" (Handoff), "blocked by" (Handoff). Dropping them would silently break the strictly-additive invariant the same way challenge broke it on "from" in source-named. - Runtime: const inputStems = new Set(tokenize(para, new Set())) hoisted above the per-type loop at both classifier call sites. - New intersectsStems(vocab, input) helper — iterates smaller set with early exit; mirrors evaluatePrerequisiteCheck from P1.3.3. - parsePrefixedBatchInput untagged-paragraph path keeps its break (first-match per paragraph, by design). - parseUnstructuredInput keeps its no-break multi-type design — the L1161–1164 DESIGN comment is preserved verbatim. Removed: - Module-level cachedEncodingTypes + cachedEncodingTypesKnowledgeBaseUrl + cachedEncodingTypesSource deleted per cache-fetches-and-parses. The fetch tier (Module Memory → Cache API → R2, 5-min TTL) already caches the canon read; parse-product caching for microsecond derivation savings is the anti-pattern the principle names. 
- Cache-check short-circuit at top of discoverEncodingTypes deleted. - cleanup_storage resets for the three removed fields deleted. Added: - Smoke regression assertions (12)–(15) in canon-tool-envelope.smoke.mjs: (12) inflection match — "deciding" → Decision via decid stem (13) stop-word survival — "going with" matches Decision (14) multi-type no-break — C and D both emitted for mixed input (15) batch first-match — mixed batch emits exactly 2 artifacts Refs: - Handoff: klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity - Canon basis: klappy://canon/principles/cache-fetches-and-parses, klappy://canon/principles/vodka-architecture - Precedent: oddkit 0.21.1 (challenge D5+D9), 0.20.0 (gate D5+D9) - Shipping under: klappy://canon/constraints/release-validation-gate Verified: - tsc --noEmit clean - governance-parser.test.mjs 105/105 pass - Smoke assertions (12)-(15) will be exercised against preview post-push. --- CHANGELOG.md | 22 +++++ package.json | 2 +- workers/package.json | 2 +- workers/src/orchestrate.ts | 102 +++++++++++++++------ workers/test/canon-tool-envelope.smoke.mjs | 56 +++++++++++ 5 files changed, 155 insertions(+), 29 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 282a667..cde7915 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +## [0.22.0] - 2026-04-20 + +### Changed + +- **`oddkit_encode` trigger-word classifier migrated from regex alternation to stemmed set intersection** (per PRD D5 from P1.3.4 — split-by-fit, same shape challenge adopted in 0.21.0 and gate adopted in 0.20.0). `EncodingTypeDef.triggerRegex: RegExp | null` is replaced with `stemmedTokens: Set` — a parse product built once per canon fetch. 
Canon trigger vocabulary reads unchanged from `odd/encoding-types/*.md` (`## Trigger Words` fenced block); the new matcher tokenizes each vocabulary word with stop-words disabled, stems into the Set at parse time, and intersects against a stop-word-disabled stemmed input set at runtime. Inflected forms (`deciding`, `realizing`, `discovering`) now match their canonical stems (`decid`, `realiz`, `discover`) without canon having to enumerate each inflection. **Strictly additive**: every input that matched the prior regex still matches, plus stemmed variations now do too. Stop-words are disabled (empty `Set`) on both the parse-time `tokenize(word, new Set())` and the runtime `tokenize(para, new Set())` calls — canon vocab includes stop-word-adjacent phrases like `going with`, `committed to`, `must not`, `turns out`, `found that`, `next step`, `blocked by`; the P1.3.3 `from`-in-source-named precedent (C-04) demanded this pattern be replicated verbatim. Both classifier call sites preserve their existing semantics: `parsePrefixedBatchInput` untagged-paragraph path picks first match via `break` (one artifact per paragraph); `parseUnstructuredInput` emits one artifact per matching type (no `break` — the load-bearing design comment is preserved verbatim). `tokenize(para, new Set())` is hoisted out of the per-type loop at both call sites. A new `intersectsStems(vocab, input)` helper encapsulates the match test. Per `klappy://canon/principles/vodka-architecture`: fit the matcher to the problem shape (independent gap-or-not per type, multi-type allowed by design). + +### Removed + +- **Module-level `cachedEncodingTypes` in-process cache** (per PRD D9 from P1.3.4 — don't cache microsecond derivations; same pattern challenge shipped in 0.21.0 and gate shipped in 0.20.0). 
`cachedEncodingTypes`, `cachedEncodingTypesKnowledgeBaseUrl`, `cachedEncodingTypesSource` module-level fields deleted; cache-check short-circuit at the top of `discoverEncodingTypes` deleted; `cleanup_storage` resets for the three fields deleted. Per `klappy://canon/principles/cache-fetches-and-parses`: the fetch layer (Module Memory → Cache API → R2, 5-minute TTL) already caches the canon file content; caching the parse product for microsecond re-derivation savings is the anti-pattern the principle names. Parse runs fresh per call; overhead is sub-millisecond on hot fetches. + +### Added + +- **New smoke regression assertions in `workers/test/canon-tool-envelope.smoke.mjs`** anchoring the D5 migration: (12) stemmed inflection match — `"I'm deciding to ship two-tier cascade"` classifies as Decision (`decid` stem matches `decided` in canon vocab); (13) stop-word canon vocab survives — `"we're going with option B after the review"` matches Decision (`going with` multi-word canon vocab); (14) multi-type preservation — `"We must never deploy without tests because we decided this last week"` emits both `C` and `D` artifacts via the no-break path; (15) first-match preservation — untagged paragraph in a mixed batch emits exactly one artifact via the batch classifier's `break` semantic. + +### Refs + +- Handoff: `klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity` +- Canon basis: `klappy://canon/principles/cache-fetches-and-parses`, `klappy://canon/principles/vodka-architecture` +- Precedent: oddkit 0.21.1 (challenge's D5 + D9), 0.20.0 (gate's D5 + D9) +- Shipping gate: `klappy://canon/constraints/release-validation-gate` (binding) +- Closes the canon-parity sweep — all three tools now use stemmed set intersection and have their in-process derivation caches removed per `cache-fetches-and-parses`. 
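The parse-time/runtime split this entry describes can be sketched as follows. This is an illustrative stand-in, not the shipped `orchestrate.ts` code: `tokenizeSketch` and its crude suffix stemmer are hypothetical stand-ins for the worker's real `tokenize(text, stopWords)` helper, and the trigger words are a small sample of the canon vocabulary.

```typescript
// Hypothetical stand-in for the worker's tokenize(text, stopWords) helper:
// lowercase, split on non-letters, drop stop-words, crude suffix stemming.
function tokenizeSketch(text: string, stopWords: Set<string>): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !stopWords.has(w))
    .map((w) => w.replace(/(?:ing|ed|es|s)$/, "")); // e.g. "deciding" -> "decid"
}

// Parse time: canon trigger vocabulary stemmed once into a Set,
// stop-words disabled (empty Set) so phrases like "going with" survive.
const triggerWords = ["decided", "realized", "going with"];
const stemmedTokens = new Set<string>();
for (const w of triggerWords) {
  for (const stem of tokenizeSketch(w, new Set())) stemmedTokens.add(stem);
}

// Runtime: tokenize the paragraph once, then intersect with early exit.
function intersectsStems(vocab: Set<string>, input: Set<string>): boolean {
  for (const s of vocab) if (input.has(s)) return true;
  return false;
}

const inputStems = new Set(tokenizeSketch("deciding between options", new Set()));
```

Here `deciding` and `decided` both stem to `decid`, so the inflected input matches without canon enumerating each form, which is the strictly-additive property the entry claims. (A later patch in this series tightens this flat-set shape into phrase-level matching.)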
+ ## [0.21.1] - 2026-04-20 ### Fixed diff --git a/package.json b/package.json index d4da3ba..d377ab3 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "oddkit", - "version": "0.21.1", + "version": "0.22.0", "description": "Agent-first CLI for ODD-governed repos. Epistemic terrain rendering with portable baseline.", "type": "module", "bin": { diff --git a/workers/package.json b/workers/package.json index 400c8e5..ecf3988 100644 --- a/workers/package.json +++ b/workers/package.json @@ -1,6 +1,6 @@ { "name": "oddkit-mcp-worker", - "version": "0.21.1", + "version": "0.22.0", "private": true, "type": "module", "scripts": { diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts index 96c55b5..7d52535 100644 --- a/workers/src/orchestrate.ts +++ b/workers/src/orchestrate.ts @@ -56,12 +56,16 @@ export interface OddkitEnvelope { /** Internal type — handlers return this, handleUnifiedAction stamps server_time */ type ActionResult = Omit<OddkitEnvelope, "server_time">; -// Governance-driven encoding types +// Governance-driven encoding types. Trigger-word classification is stemmed +// set intersection per klappy://canon/principles/vodka-architecture (fit the +// matcher to the problem) — same D5 shape applied to challenge prereqs in +// 0.21.0 and gate prereqs in 0.20.0. triggerWords kept for debugging only; +// stemmedTokens is the parse product the runtime evaluates against. interface EncodingTypeDef { letter: string; name: string; triggerWords: string[]; - triggerRegex: RegExp | null; + stemmedTokens: Set<string>; qualityCriteria: Array<{ criterion: string; check: string; gapMessage: string }>; } @@ -79,9 +83,12 @@ interface ParsedArtifact { priority_band?: string; } -let cachedEncodingTypes: EncodingTypeDef[] | null = null; -let cachedEncodingTypesKnowledgeBaseUrl: string | undefined = undefined; -let cachedEncodingTypesSource: "knowledge_base" | "minimal" = "minimal"; +// D9 / klappy://canon/principles/cache-fetches-and-parses — no module-level +// cache on the parse product.
fetcher.getFile / fetcher.getIndex already cache +// the canon read (Module Memory → Cache API → R2, 5-min TTL). Re-running the +// parse loop per request is sub-millisecond derivation work, not worth the +// plumbing tax of a keyed cache. Same pattern challenge (0.21.0) and gate +// (0.20.0) already applied. // Governance-driven challenge types (E0008 — mirrors encode pattern from PR #96) interface ChallengeTypeDef { @@ -409,10 +416,6 @@ async function discoverEncodingTypes( fetcher: KnowledgeBaseFetcher, knowledgeBaseUrl?: string, ): Promise<{ types: EncodingTypeDef[]; source: "knowledge_base" | "minimal" }> { - if (cachedEncodingTypes && cachedEncodingTypesKnowledgeBaseUrl === knowledgeBaseUrl) { - return { types: cachedEncodingTypes, source: cachedEncodingTypesSource }; - } - const index = await fetcher.getIndex(knowledgeBaseUrl); const typeArticles = index.entries.filter( (entry: IndexEntry) => entry.tags?.includes("encoding-type") && entry.path.includes("encoding-types/"), @@ -437,10 +440,24 @@ async function discoverEncodingTypes( const triggerWords = triggerSection ? triggerSection[1].split(",").map((w: string) => w.trim()).filter((w: string) => w.length > 0) : []; - const triggerRegex = - triggerWords.length > 0 - ? new RegExp("\\b(" + triggerWords.map((w: string) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")).join("|") + ")\\b", "i") - : null; + // D5 / klappy://canon/principles/vodka-architecture — classification is + // stemmed set intersection, not regex alternation. Canon vocab is parsed + // once into a Set of stems; runtime tokenizes input once and + // intersects. Inflected forms (deciding → decid, realizing → realiz) + // now match their canonical stems without canon having to list each + // inflection. 
Stop-word filtering is disabled (empty Set) on both the + // parse-time and runtime tokenize() calls — canon vocab includes + // stop-word-adjacent phrases (`going with`, `committed to`, `must not`, + // `turns out`, `next step`, `blocked by`, `found that`) and dropping + // them would silently break the strictly-additive invariant, the same + // failure mode P1.3.3 hit on challenge's `from`-in-source-named vocab. + // Per canon/constraints/release-validation-gate and P1.3.3 C-04. + const stemmedTokens = new Set<string>(); + for (const word of triggerWords) { + for (const stem of tokenize(word, new Set())) { + stemmedTokens.add(stem); + } + } const criteriaSection = content.match( /## Quality Criteria[\s\S]*?\| Criterion[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/, @@ -459,7 +476,7 @@ async function discoverEncodingTypes( } } - types.push({ letter, name, triggerWords, triggerRegex, qualityCriteria }); + types.push({ letter, name, triggerWords, stemmedTokens, qualityCriteria }); } catch { continue; } @@ -495,17 +512,22 @@ async function discoverEncodingTypes( ["H", "Handoff", ["next session", "next step", "todo", "follow up", "blocked by"]], ["E", "Encode", ["encoded", "captured", "crystallized", "persisted", "artifact"]], ]; - resolved = defaults.map(([letter, name, words]) => ({ - letter, name, triggerWords: words, - triggerRegex: new RegExp("\\b(" + words.join("|") + ")\\b", "i"), - qualityCriteria: [], - })); + resolved = defaults.map(([letter, name, words]) => { + const stemmedTokens = new Set<string>(); + for (const word of words) { + for (const stem of tokenize(word, new Set())) { + stemmedTokens.add(stem); + } + } + return { + letter, name, triggerWords: words, + stemmedTokens, + qualityCriteria: [], + }; + }); source = "minimal"; } - cachedEncodingTypes = resolved; - cachedEncodingTypesKnowledgeBaseUrl = knowledgeBaseUrl; - cachedEncodingTypesSource = source; return { types: resolved, source }; } @@ -1084,6 +1106,17 @@ function isPrefixedBatchInput(input: string): boolean { return paragraphs.some((p) => PREFIX_TAG_REGEX.test(p)); } +// Stemmed set intersection — the D5 matcher shape for encode trigger-word +// classification. Iterate the smaller set (canon vocab) and probe the larger +// (runtime input stems) for O(min(|a|,|b|)) early exit. Mirrors the shape +// used by evaluatePrerequisiteCheck in the P1.3.3 challenge evaluator. +function intersectsStems(vocab: Set<string>, input: Set<string>): boolean { + for (const s of vocab) { + if (input.has(s)) return true; + } + return false; +} + function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): ParsedArtifact[] { const typeMap = new Map(types.map((t) => [t.letter, t.name])); const paragraphs = input.split(/\n\n+/).map((p) => p.trim()).filter((p) => p.length > 0); @@ -1118,9 +1151,16 @@ function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): Parse // Untagged paragraph in a batch that contains tags: classify via trigger // words like parseUnstructuredInput, but emit one artifact per paragraph // (not one-per-match) to preserve the author's paragraph boundaries. + // Stemmed set intersection mirrors parseUnstructuredInput — stop-words + // disabled on tokenize() both sides per P1.3.3 C-04 (canon vocab + // includes stop-word phrases like `going with` / `must not`). let matched: EncodingTypeDef | null = null; + const inputStems = new Set(tokenize(para, new Set())); for (const t of types) { - if (t.triggerRegex && t.triggerRegex.test(para)) { matched = t; break; } + // Break on first match: this path picks one type per paragraph by + // design (paragraph boundaries are the author's). Unlike + // parseUnstructuredInput which emits one artifact per matching type. + if (intersectsStems(t.stemmedTokens, inputStems)) { matched = t; break; } } const pick = matched ?? types[0] ??
{ letter: "D", name: "Decision" }; const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60); @@ -1157,12 +1197,19 @@ function parseUnstructuredInput(input: string, types: EncodingTypeDef[]): Parsed const artifacts: ParsedArtifact[] = []; for (const para of paragraphs) { let matched = false; + // Hoist tokenize(para) out of the per-type loop — para is constant across + // the loop, stemmedTokens differ per type. Mirrors the P1.3.3 challenge + // prereq evaluator shape. Stop-words disabled (empty Set) on both parse- + // time and runtime tokenize() calls so canon vocab like `going with`, + // `must not`, `turns out`, `found that` survives on both sides. Per + // canon/constraints/release-validation-gate and P1.3.3 Bug #1 precedent. + const inputStems = new Set(tokenize(para, new Set())); for (const t of types) { // DESIGN: no break — a paragraph can match multiple types intentionally. // "We must never deploy without tests" is both Decision and Constraint. // Multi-typing at the server level mirrors what the model would do with // separate TSV rows. Do not add a break here. - if (t.triggerRegex && t.triggerRegex.test(para)) { + if (intersectsStems(t.stemmedTokens, inputStems)) { const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60); const title = first.split(/\s+/).length <= 12 ? first : first.split(/\s+/).slice(0, 8).join(" ") + "..."; artifacts.push({ type: t.letter, typeName: t.name, fields: [t.letter, title, para.trim()], title, body: para.trim() }); @@ -1518,9 +1565,10 @@ async function runCleanupStorage( // Also clear the in-memory BM25 index cachedBM25Index = null; cachedBM25Entries = null; - cachedEncodingTypes = null; - cachedEncodingTypesKnowledgeBaseUrl = undefined; - cachedEncodingTypesSource = "minimal"; + // cachedEncodingTypes removed in 0.22.0 per cache-fetches-and-parses — + // encode's parse product is no longer cached in-process. 
The fetch tier + // (Cache API, R2) already handles canon file caching; the derivation is + // sub-millisecond. No reset needed here. // E0008 — governance-driven challenge caches (mirror PR #96 fix) cachedChallengeTypes = null; cachedChallengeTypesKnowledgeBaseUrl = undefined; diff --git a/workers/test/canon-tool-envelope.smoke.mjs b/workers/test/canon-tool-envelope.smoke.mjs index 2474fa8..e6c5d3b 100644 --- a/workers/test/canon-tool-envelope.smoke.mjs +++ b/workers/test/canon-tool-envelope.smoke.mjs @@ -224,6 +224,62 @@ async function run() { `got: ${encodeOverride.result?.governance_source}`, ); + // P1.3.4 D5 regression anchors — stemmed set intersection replaces regex + // alternation on the encode classifier. These assertions exist because + // the pre-refactor literal regex path could not match inflections of + // canon vocab (`deciding` does not match `decided` under `\bdecided\b`), + // and the P1.3.3 Bug #1 precedent showed that tokenize()'s default + // stop-word filter silently breaks multi-word canon vocab (`going with`, + // `committed to`, `must not`). The assertions are numbered (12)–(15) to + // continue the sequence P1.3.3 established at (10)/(11). + console.log(`\n─── oddkit_encode: (12) stemmed inflection match (D5 landed) ───`); + const encodeInflection = await callTool("oddkit_encode", { + input: "I'm deciding to ship the two-tier cascade", + }); + expectFullEnvelope("oddkit_encode (inflection match)", encodeInflection); + const inflectionTypes = (encodeInflection.result?.artifacts ?? 
[]).map((a) => a.type); + ok( + "oddkit_encode: (12) `deciding` (inflection of `decided`) classifies as Decision via stem intersection", + inflectionTypes.includes("D"), + `got artifact types: ${inflectionTypes.join(",")}`, + ); + + console.log(`\n─── oddkit_encode: (13) stop-word canon vocab survives tokenize (P1.3.3 C-04 ported) ───`); + const encodeStopWord = await callTool("oddkit_encode", { + input: "we're going with option B after the review", + }); + expectFullEnvelope("oddkit_encode (stop-word survival)", encodeStopWord); + const stopWordTypes = (encodeStopWord.result?.artifacts ?? []).map((a) => a.type); + ok( + "oddkit_encode: (13) `going with` (multi-word canon vocab containing stop-word `with`) matches Decision", + stopWordTypes.includes("D"), + `got artifact types: ${stopWordTypes.join(",")}`, + ); + + console.log(`\n─── oddkit_encode: (14) multi-type no-break preservation (L1161 design comment) ───`); + const encodeMultiType = await callTool("oddkit_encode", { + input: "We must never deploy without tests because we decided this last week", + }); + expectFullEnvelope("oddkit_encode (multi-type)", encodeMultiType); + const multiTypeTypes = (encodeMultiType.result?.artifacts ?? []).map((a) => a.type); + ok( + "oddkit_encode: (14) paragraph matching both Constraint and Decision emits both artifact types (no-break path)", + multiTypeTypes.includes("C") && multiTypeTypes.includes("D"), + `got artifact types: ${multiTypeTypes.join(",")}`, + ); + + console.log(`\n─── oddkit_encode: (15) first-match preservation in batch-untagged path ───`); + const encodeBatchUntagged = await callTool("oddkit_encode", { + input: "[D] explicit decision tag on first paragraph\n\nwe must always write tests before we decided on TDD", + }); + expectFullEnvelope("oddkit_encode (batch first-match)", encodeBatchUntagged); + const batchArtifacts = encodeBatchUntagged.result?.artifacts ?? 
[]; + ok( + "oddkit_encode: (15) batch with tagged + untagged paragraphs emits exactly 2 artifacts (first-match path picks one type per untagged paragraph)", + batchArtifacts.length === 2, + `got length: ${batchArtifacts.length}; types: ${batchArtifacts.map((a) => a.type).join(",")}`, + ); + // Tool 5: oddkit_challenge — canon-driven, four governance surfaces. // Full envelope + governance_source + governance_uris (plural, per PRD D4 — // shape diverges from encode by design because challenge reads four peer From 113ba11ff8f2ce23ef3f7a27218b7092876c9537 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Mon, 20 Apr 2026 13:04:09 +0000 Subject: [PATCH 2/5] fix(encode): require all phrase stems to match, not any stem MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Multi-word canon trigger phrases (committed to, going with, must not, next step, follow up, blocked by, turns out) were tokenized into individual stems and flattened into a per-type Set. Common English function words (to, with, by, up, out, not) became standalone match triggers, causing false positives on nearly every English paragraph in parseUnstructuredInput and the untagged-paragraph branch of parsePrefixedBatchInput. Replace the flat stemmedTokens: Set with stemmedPhrases: string[][] — each inner array is the ordered stem sequence of one trigger word or phrase. A type matches when ALL stems of at least one phrase co-occur in the input stem set. Single-stem phrases degenerate to set membership (identical to the old behavior for inflection matching like deciding -> decid), while multi-stem phrases now preserve phrase-level semantics. Regression anchors (smoke tests 12/13/14/15) still pass: inflection, stop-word-adjacent canon vocab, multi-type no-break, and batch first-match all retain their previous behavior. 
--- workers/package-lock.json | 4 +- workers/src/orchestrate.ts | 82 +++++++++++++++++++++++++------------- 2 files changed, 57 insertions(+), 29 deletions(-) diff --git a/workers/package-lock.json b/workers/package-lock.json index e0021e7..a7994f0 100644 --- a/workers/package-lock.json +++ b/workers/package-lock.json @@ -1,12 +1,12 @@ { "name": "oddkit-mcp-worker", - "version": "0.21.1", + "version": "0.22.0", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "oddkit-mcp-worker", - "version": "0.21.1", + "version": "0.22.0", "dependencies": { "agents": "^0.4.1", "fflate": "^0.8.2", diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts index 7d52535..77e21ad 100644 --- a/workers/src/orchestrate.ts +++ b/workers/src/orchestrate.ts @@ -57,15 +57,22 @@ export interface OddkitEnvelope { type ActionResult = Omit<OddkitEnvelope, "server_time">; // Governance-driven encoding types. Trigger-word classification is stemmed -// set intersection per klappy://canon/principles/vodka-architecture (fit the -// matcher to the problem) — same D5 shape applied to challenge prereqs in -// 0.21.0 and gate prereqs in 0.20.0. triggerWords kept for debugging only; -// stemmedTokens is the parse product the runtime evaluates against. +// phrase-subset matching per klappy://canon/principles/vodka-architecture +// (fit the matcher to the problem) — same D5 shape applied to challenge +// prereqs in 0.21.0 and gate prereqs in 0.20.0. triggerWords kept for +// debugging only; stemmedPhrases is the parse product the runtime evaluates +// against. Each inner array is the ordered stem sequence of a single +// trigger word or phrase; a type matches an input when ALL stems of at +// least one phrase are present in the input's stem set.
This preserves +// phrase-level semantics (`committed to`, `going with`, `must not`, +// `next step`, `follow up`, `blocked by`, `turns out`) so common function +// words (`to`, `with`, `by`, `up`, `out`, `not`) do not become standalone +// match triggers on every English paragraph. interface EncodingTypeDef { letter: string; name: string; triggerWords: string[]; - stemmedTokens: Set; + stemmedPhrases: string[][]; qualityCriteria: Array<{ criterion: string; check: string; gapMessage: string }>; } @@ -441,22 +448,26 @@ async function discoverEncodingTypes( ? triggerSection[1].split(",").map((w: string) => w.trim()).filter((w: string) => w.length > 0) : []; // D5 / klappy://canon/principles/vodka-architecture — classification is - // stemmed set intersection, not regex alternation. Canon vocab is parsed - // once into a Set of stems; runtime tokenizes input once and - // intersects. Inflected forms (deciding → decid, realizing → realiz) - // now match their canonical stems without canon having to list each - // inflection. Stop-word filtering is disabled (empty Set) on both the - // parse-time and runtime tokenize() calls — canon vocab includes - // stop-word-adjacent phrases (`going with`, `committed to`, `must not`, - // `turns out`, `next step`, `blocked by`, `found that`) and dropping - // them would silently break the strictly-additive invariant, the same - // failure mode P1.3.3 hit on challenge's `from`-in-source-named vocab. + // stemmed phrase-subset matching, not regex alternation. Each canon + // trigger word/phrase is parsed once into its ordered stem sequence; + // runtime tokenizes input once and a type matches when ALL stems of + // at least one phrase are present. Inflected forms (deciding → decid, + // realizing → realiz) match their canonical stems without canon having + // to list each inflection. 
Stop-word filtering is disabled (empty Set) + // on both the parse-time and runtime tokenize() calls — canon vocab + // includes stop-word-adjacent phrases (`going with`, `committed to`, + // `must not`, `turns out`, `next step`, `blocked by`, `found that`) + // and dropping them would silently break the strictly-additive + // invariant, the same failure mode P1.3.3 hit on challenge's + // `from`-in-source-named vocab. Phrase-level conjunction (all stems + // of a phrase must match) is the precision floor: without it, + // ubiquitous function words like `to`/`with`/`by`/`up`/`out`/`not` + // would become standalone triggers on every English paragraph. // Per canon/constraints/release-validation-gate and P1.3.3 C-04. - const stemmedTokens = new Set<string>(); + const stemmedPhrases: string[][] = []; for (const word of triggerWords) { - for (const stem of tokenize(word, new Set())) { - stemmedTokens.add(stem); - } + const stems = tokenize(word, new Set()); + if (stems.length > 0) stemmedPhrases.push(stems); } const criteriaSection = content.match( @@ -476,7 +487,7 @@ } } - types.push({ letter, name, triggerWords, stemmedTokens, qualityCriteria }); + types.push({ letter, name, triggerWords, stemmedPhrases, qualityCriteria }); } catch { continue; } @@ -513,15 +524,14 @@ ["E", "Encode", ["encoded", "captured", "crystallized", "persisted", "artifact"]], ]; resolved = defaults.map(([letter, name, words]) => { - const stemmedTokens = new Set<string>(); + const stemmedPhrases: string[][] = []; for (const word of words) { - for (const stem of tokenize(word, new Set())) { - stemmedTokens.add(stem); - } + const stems = tokenize(word, new Set()); + if (stems.length > 0) stemmedPhrases.push(stems); } return { letter, name, triggerWords: words, - stemmedTokens, + stemmedPhrases, qualityCriteria: [], }; }); @@ -1117,6 +1127,24 @@ function intersectsStems(vocab: Set<string>, input: Set<string>): boolean { return false; } +// Phrase-subset match — a phrase matches when ALL of its stems appear in the + // input stem set. Short-circuits on the first phrase that matches. This is + // the precision-preserving variant of intersectsStems for multi-word canon + // vocab: single-stem phrases degenerate to set membership (identical to the + // old single-token behavior), while multi-stem phrases like + // `committed to` → ["committ","to"] require both stems to co-occur, so + // ubiquitous function words cannot match on their own. +function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean { + for (const phrase of phrases) { + let allPresent = true; + for (const stem of phrase) { + if (!input.has(stem)) { allPresent = false; break; } + } + if (allPresent) return true; + } + return false; +} + function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): ParsedArtifact[] { const typeMap = new Map(types.map((t) => [t.letter, t.name])); const paragraphs = input.split(/\n\n+/).map((p) => p.trim()).filter((p) => p.length > 0); @@ -1160,7 +1188,7 @@ function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): Parse // Break on first match: this path picks one type per paragraph by // design (paragraph boundaries are the author's). Unlike // parseUnstructuredInput which emits one artifact per matching type. - if (intersectsStems(t.stemmedTokens, inputStems)) { matched = t; break; } + if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) { matched = t; break; } } const pick = matched ?? types[0] ?? { letter: "D", name: "Decision" }; const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60); @@ -1209,7 +1237,7 @@ function parseUnstructuredInput(input: string, types: EncodingTypeDef[]): Parsed // DESIGN: no break — a paragraph can match multiple types intentionally. // "We must never deploy without tests" is both Decision and Constraint. // Multi-typing at the server level mirrors what the model would do with // separate TSV rows. Do not add a break here.
- if (intersectsStems(t.stemmedTokens, inputStems)) { + if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) { const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60); const title = first.split(/\s+/).length <= 12 ? first : first.split(/\s+/).slice(0, 8).join(" ") + "..."; artifacts.push({ type: t.letter, typeName: t.name, fields: [t.letter, title, para.trim()], title, body: para.trim() }); From e404fe0a430796b09a6291ffeb2adecc2c5ffda2 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Mon, 20 Apr 2026 13:11:24 +0000 Subject: [PATCH 3/5] Remove unused intersectsStems dead code --- workers/src/orchestrate.ts | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts index 77e21ad..a415a80 100644 --- a/workers/src/orchestrate.ts +++ b/workers/src/orchestrate.ts @@ -1116,22 +1116,12 @@ function isPrefixedBatchInput(input: string): boolean { return paragraphs.some((p) => PREFIX_TAG_REGEX.test(p)); } -// Stemmed set intersection — the D5 matcher shape for encode trigger-word -// classification. Iterate the smaller set (canon vocab) and probe the larger -// (runtime input stems) for O(min(|a|,|b|)) early exit. Mirrors the shape -// used by evaluatePrerequisiteCheck in the P1.3.3 challenge evaluator. -function intersectsStems(vocab: Set, input: Set): boolean { - for (const s of vocab) { - if (input.has(s)) return true; - } - return false; -} - // Phrase-subset match — a phrase matches when ALL of its stems appear in the -// input stem set. Short-circuits on the first phrase that matches. This is -// the precision-preserving variant of intersectsStems for multi-word canon -// vocab: single-stem phrases degenerate to set membership (identical to the -// old single-token behavior), while multi-stem phrases like +// input stem set. Short-circuits on the first phrase that matches. 
The D5 + // matcher shape for encode trigger-word classification, mirroring the shape + // used by evaluatePrerequisiteCheck in the P1.3.3 challenge evaluator: + // single-stem phrases degenerate to set membership (identical to the old + // single-token behavior), while multi-stem phrases like // `committed to` → ["committ","to"] require both stems to co-occur, so // ubiquitous function words cannot match on their own. function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean { From eaa1234fce14d4a9ce57c2e519b37894af061840 Mon Sep 17 00:00:00 2001 From: Klappy Date: Mon, 20 Apr 2026 13:11:43 +0000 Subject: [PATCH 4/5] docs(encode): port CHANGELOG + smoke assertion (16) onto autofix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Bugbot autofix on commit 113ba11 is the canonical disposition of the high-severity finding on 259170a. This commit ports the orchestrator-drafted CHANGELOG [0.22.0] entry and the smoke regression anchor onto the autofix. CHANGELOG: rewritten to describe the phrase-subset match the autofix actually landed (ALL stems of at least one phrase must co-occur in the input stem set, any order). Previous draft described a consecutive-subsequence variant that was never shipped. Subset-match is simpler, one uniform structure, and aligned with encode's multi-type tolerance philosophy. Smoke assertion (16): traced against the autofix semantics. Input "I need to wait until tomorrow for the review" contains `to` but no Decision phrase has all its stems present (`decid`/`decis`/`chose` absent; `[committ,to]` fails on `committ`; `[go,with]` fails on both), and no Handoff phrase has all its stems present (`[next,step]`, `[follow,up]`, `[block,by]`, `[wait,on]` all fail on their second stem; `todo`, `continu`, `remain`, `handoff` singletons absent). A revision that re-flattens the matcher back to standalone-singleton triggers would fail this assertion.
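The trace above can be written out directly against the phrase lists this message names (stem spellings as listed here; the full canon vocabulary is larger, so treat these arrays as illustrative):

```typescript
// Phrase lists named in the assertion-(16) trace above (illustrative subset).
const decisionPhrases: string[][] = [
  ["decid"], ["decis"], ["chose"], ["committ", "to"], ["go", "with"],
];
const handoffPhrases: string[][] = [
  ["next", "step"], ["follow", "up"], ["block", "by"], ["wait", "on"],
  ["todo"], ["continu"], ["remain"], ["handoff"],
];

// Hypothetical stemmed form of "I need to wait until tomorrow for the review".
const inputStems = new Set([
  "i", "need", "to", "wait", "until", "tomorrow", "for", "the", "review",
]);

// Subset-match semantics: some phrase must have ALL of its stems present.
const matches = (phrases: string[][]) =>
  phrases.some((p) => p.every((stem) => inputStems.has(stem)));
```

Every multi-stem phrase fails on at least one stem (`wait` is present but `on` is not) and no singleton stem appears, so neither type matches, which is exactly what a re-flattened singleton matcher would get wrong on the lone `to`.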
Disposition record for PR #126 Bugbot finding:
- Finding: multi-word vocab flattening produces universal function-word triggers (high severity, 2026-04-20T12:55:03Z)
- Fix-forward: Cursor autofix commit 113ba11 (phrase-subset match)
- Orchestrator's proposed alternative: stricter consecutive-subsequence match — not shipped; subset-match is the simpler, correct design for encode's multi-type philosophy

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- Smoke assertions (12)-(16) traced against autofix semantics, all hold
---
 CHANGELOG.md                               |  7 +++--
 workers/test/canon-tool-envelope.smoke.mjs | 36 ++++++++++++++++++++++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index cde7915..b8e2ced 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed

-- **`oddkit_encode` trigger-word classifier migrated from regex alternation to stemmed set intersection** (per PRD D5 from P1.3.4 — split-by-fit, same shape challenge adopted in 0.21.0 and gate adopted in 0.20.0). `EncodingTypeDef.triggerRegex: RegExp | null` is replaced with `stemmedTokens: Set<string>` — a parse product built once per canon fetch.
Stop-words are disabled (empty `Set`) on both the parse-time `tokenize(word, new Set())` and the runtime `tokenize(para, new Set())` calls — canon vocab includes stop-word-adjacent phrases like `going with`, `committed to`, `must not`, `turns out`, `found that`, `next step`, `blocked by`; the P1.3.3 `from`-in-source-named precedent (C-04) demanded this pattern be replicated verbatim. Both classifier call sites preserve their existing semantics: `parsePrefixedBatchInput` untagged-paragraph path picks first match via `break` (one artifact per paragraph); `parseUnstructuredInput` emits one artifact per matching type (no `break` — the load-bearing design comment is preserved verbatim). `tokenize(para, new Set())` is hoisted out of the per-type loop at both call sites. A new `intersectsStems(vocab, input)` helper encapsulates the match test. Per `klappy://canon/principles/vodka-architecture`: fit the matcher to the problem shape (independent gap-or-not per type, multi-type allowed by design). +- **`oddkit_encode` trigger-word classifier migrated from regex alternation to stemmed phrase-subset matching** (per PRD D5 from P1.3.4 — split-by-fit, same matcher family shipped for challenge in 0.21.0 and gate in 0.20.0, adapted for encode's phrasal vocabulary). `EncodingTypeDef.triggerRegex: RegExp | null` is replaced with `stemmedPhrases: string[][]` — each inner array is the ordered stem sequence of a single canon trigger word or phrase, parsed once per canon fetch. The runtime matcher `matchesStemmedPhrases(phrases, inputStems)` declares a match when ALL stems of at least one phrase appear in the input stem set. 
Single-stem phrases degenerate to set membership (identical to the old behavior for inflection matching like `deciding` → `decid`); multi-stem phrases like `committed to` → `[committ, to]` require both stems to co-occur, so ubiquitous function words like `to`, `with`, `by`, `up`, `out`, `not` cannot fire as standalone match triggers just because they appear inside a canon phrase. This preserves the pre-refactor regex semantic where `\b(committed to)\b` matched only when both words were present. Canon trigger vocabulary reads unchanged from `odd/encoding-types/*.md` (`## Trigger Words` fenced block); the matcher tokenizes each vocabulary entry with stop-words disabled (`tokenize(word, new Set())`), stores the ordered stem array at parse time, and intersects against a stop-word-disabled stemmed input set at runtime. Inflected forms (`deciding`, `realizing`, `discovering`) now match their canonical stems (`decid`, `realiz`, `discover`) without canon having to enumerate each inflection. **Strictly additive** over the pre-refactor regex: every input that matched still matches (both phrase conjunction and word-boundary semantics are preserved), plus stemmed variations of single-word vocab now match as well. Stop-words are disabled on both the parse-time and runtime `tokenize()` calls — stop-word-bearing canon vocab must survive tokenization for the strictly-additive invariant to hold, per the P1.3.3 C-04 precedent. Both classifier call sites preserve their existing semantics: `parsePrefixedBatchInput`'s untagged-paragraph path picks the first match via `break` (one artifact per paragraph); `parseUnstructuredInput` emits one artifact per matching type (no `break` — the load-bearing design comment at L1161–1164 is preserved verbatim). `tokenize(para, new Set())` is hoisted once per paragraph into an `inputStems` Set reused across the per-type loop.
The phrase-subset match (all stems co-occurring, any order) was adopted mid-PR in response to a high-severity Cursor Bugbot finding on commit `259170a` — the first version's flat `stemmedTokens: Set<string>` would have fired Decision on virtually every English paragraph because the ubiquitous function-word constituents of phrasal canon vocab (`to`, `with`) were being added as standalone singletons. Per `klappy://canon/principles/vodka-architecture`: fit the matcher to the problem shape.

 ### Removed

@@ -19,7 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

-- **New smoke regression assertions in `workers/test/canon-tool-envelope.smoke.mjs`** anchoring the D5 migration: (12) stemmed inflection match — `"I'm deciding to ship two-tier cascade"` classifies as Decision (`decid` stem matches `decided` in canon vocab); (13) stop-word canon vocab survives — `"we're going with option B after the review"` matches Decision (`going with` multi-word canon vocab); (14) multi-type preservation — `"We must never deploy without tests because we decided this last week"` emits both `C` and `D` artifacts via the no-break path; (15) first-match preservation — untagged paragraph in a mixed batch emits exactly one artifact via the batch classifier's `break` semantic.
+- **New smoke regression assertions in `workers/test/canon-tool-envelope.smoke.mjs`** anchoring the D5 migration and the Bugbot phrase-subset fix: (12) stemmed inflection match — `"I'm deciding to ship the two-tier cascade"` classifies as Decision (`decid` stem degenerate-singleton matches `decided` in canon vocab); (13) stop-word phrase survival — `"we're going with option B after the review"` matches Decision via the `[go, with]` phrase having both stems present in the input set; (14) multi-type preservation — `"We must never deploy without tests because we decided this last week"` emits both `C` and `D` artifacts via the no-break path (`must`/`never` singletons for Constraint; `decid` singleton for Decision); (15) first-match preservation — untagged paragraph in a mixed batch emits exactly one artifact via the batch classifier's `break` semantic; (16) phrase-subset regression anchor — `"I need to wait until tomorrow for the review"` does NOT classify as Decision or Handoff (the pre-Bugbot-fix flat-Set implementation would have fired Decision via standalone `to` and Handoff via standalone `to`/`for`; post-fix, no phrase of either type has all its stems present in the input). Assertion (16) is the Bugbot PR #126 regression anchor and will fail against any revision where multi-word vocab is flattened back into standalone-singleton triggers. ### Refs @@ -27,7 +27,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Canon basis: `klappy://canon/principles/cache-fetches-and-parses`, `klappy://canon/principles/vodka-architecture` - Precedent: oddkit 0.21.1 (challenge's D5 + D9), 0.20.0 (gate's D5 + D9) - Shipping gate: `klappy://canon/constraints/release-validation-gate` (binding) -- Closes the canon-parity sweep — all three tools now use stemmed set intersection and have their in-process derivation caches removed per `cache-fetches-and-parses`. 
+- Bugbot finding dispositioned: PR #126 review `cursor[bot]` 2026-04-20T12:55:03Z (high severity, multi-word vocab flattening) — fix-forward in same PR via Cursor autofix commit `113ba11` (phrase-subset match). The in-session orchestrator proposed a stricter consecutive-subsequence variant; autofix's subset-match was accepted as the simpler design better aligned with encode's multi-type tolerance philosophy.
+- Closes the canon-parity sweep — all three tools now use stemmed matching and have their in-process derivation caches removed per `cache-fetches-and-parses`.

 ## [0.21.1] - 2026-04-20

diff --git a/workers/test/canon-tool-envelope.smoke.mjs b/workers/test/canon-tool-envelope.smoke.mjs
index e6c5d3b..3dc4c74 100644
--- a/workers/test/canon-tool-envelope.smoke.mjs
+++ b/workers/test/canon-tool-envelope.smoke.mjs
@@ -280,6 +280,42 @@ async function run() {
     `got length: ${batchArtifacts.length}; types: ${batchArtifacts.map((a) => a.type).join(",")}`,
   );

+  console.log(`\n─── oddkit_encode: (16) phrase-subset regression anchor (Bugbot PR #126) ───`);
+  // Pre-Bugbot-fix, the matcher used a flat stemmedTokens: Set<string> where
+  // multi-word canon phrases like `committed to` (Decision) and `next step`
+  // (Handoff) were flattened into individual stems and each was added as a
+  // standalone singleton. Stop-word filtering is disabled by design (P1.3.3
+  // C-04), so function-word stems like `to`, `with`, `by`, `up`, `out`
+  // became universal match triggers — virtually every English paragraph
+  // would fire Decision and Handoff and more. Autofix commit 113ba11
+  // adopted a phrase-subset match: a phrase matches only when ALL of its
+  // stems appear in the input stem set. Single-stem phrases degenerate to
+  // set membership (inflection matching still works); multi-stem phrases
+  // require conjunction.
The input below contains stems `need`, `to`, + // `wait`, `until`, `tomorrow`, `for`, `the`, `review` — no Decision + // phrase has ALL its stems present (`decid` / `decis` / `chose` / `choos` + // / `select` all absent; `[committ, to]` fails on `committ`; `[go, with]` + // fails on both), and no Handoff phrase has ALL its stems present + // (`[next, session]` / `[next, step]` / `[follow, up]` / `[block, by]` + // / `[wait, on]` all fail on their second stem; `todo` / `continu` / + // `remain` / `handoff` singletons all absent). A revision that + // re-flattens the matcher would spuriously fire D and H on this input. + const encodePhraseSubset = await callTool("oddkit_encode", { + input: "I need to wait until tomorrow for the review", + }); + expectFullEnvelope("oddkit_encode (phrase-subset regression)", encodePhraseSubset); + const phraseSubsetTypes = (encodePhraseSubset.result?.artifacts ?? []).map((a) => a.type); + ok( + "oddkit_encode: (16) `to` inside phrasal canon vocab does NOT fire Decision as a standalone trigger", + !phraseSubsetTypes.includes("D"), + `got artifact types: ${phraseSubsetTypes.join(",")}`, + ); + ok( + "oddkit_encode: (16) `to` inside phrasal canon vocab does NOT fire Handoff as a standalone trigger", + !phraseSubsetTypes.includes("H"), + `got artifact types: ${phraseSubsetTypes.join(",")}`, + ); + // Tool 5: oddkit_challenge — canon-driven, four governance surfaces. // Full envelope + governance_source + governance_uris (plural, per PRD D4 — // shape diverges from encode by design because challenge reads four peer From 8a0636bee89b9ef443261037793ee5279825227e Mon Sep 17 00:00:00 2001 From: Klappy Date: Mon, 20 Apr 2026 13:54:29 +0000 Subject: [PATCH 5/5] fix: stale 0.22.0 reference in orchestrate.ts cleanup_storage comment The d2acf91 merge commit bumped this refactor from 0.22.0 to 0.23.0 per release-validation-gate Rule 3, but the inline comment on the removed cachedEncodingTypes reset block still said "removed in 0.22.0". 
Updated to reflect the actual shipping version. No functional change. --- workers/src/orchestrate.ts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts index c5d97c3..716cae6 100644 --- a/workers/src/orchestrate.ts +++ b/workers/src/orchestrate.ts @@ -1583,7 +1583,7 @@ async function runCleanupStorage( // Also clear the in-memory BM25 index cachedBM25Index = null; cachedBM25Entries = null; - // cachedEncodingTypes removed in 0.22.0 per cache-fetches-and-parses — + // cachedEncodingTypes removed in 0.23.0 per cache-fetches-and-parses — // encode's parse product is no longer cached in-process. The fetch tier // (Cache API, R2) already handles canon file caching; the derivation is // sub-millisecond. No reset needed here.
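The phrase-subset semantics this series converges on (patch 3's comment rewrite, patch 4's CHANGELOG entry, smoke assertion (16)) can be traced in isolation. Below is a minimal TypeScript sketch; the function body, the hardcoded stem arrays, and the hand-stemmed input sets are illustrative assumptions (the shipped matcher's body is not shown in these hunks, and real stems come from `tokenize()` over canon `## Trigger Words` blocks). Only the `matchesStemmedPhrases` signature and the ALL-stems-of-at-least-one-phrase rule are taken from the diffs.

```typescript
// Sketch of the D5 phrase-subset matcher as described in patches 3–4.
// A phrase matches when ALL of its stems appear in the input stem set;
// single-stem phrases degenerate to plain set membership, and the loop
// short-circuits on the first phrase that matches.
function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean {
  for (const phrase of phrases) {
    if (phrase.length > 0 && phrase.every((stem) => input.has(stem))) return true;
  }
  return false;
}

// Hypothetical pre-stemmed vocab for two types, using stems cited in the
// patch series (real stems are derived from canon vocabulary at parse time).
const decisionPhrases: string[][] = [["decid"], ["chose"], ["committ", "to"], ["go", "with"]];
const handoffPhrases: string[][] = [["todo"], ["next", "step"], ["block", "by"], ["wait", "on"]];

// Regression-anchor input from smoke assertion (16), stemmed by hand.
const waitInput = new Set(["need", "to", "wait", "until", "tomorrow", "for", "the", "review"]);
// `to` alone no longer fires Decision: [committ, to] fails on `committ`.
console.log(matchesStemmedPhrases(decisionPhrases, waitInput)); // false
// [wait, on] fails on `on`; no Handoff singleton is present either.
console.log(matchesStemmedPhrases(handoffPhrases, waitInput)); // false

// Stop-word phrase survival from smoke assertion (13): both stems of
// [go, with] co-occur in the input, so Decision fires.
const goingWith = new Set(["we", "go", "with", "option", "b", "after", "the", "review"]);
console.log(matchesStemmedPhrases(decisionPhrases, goingWith)); // true
```

Order within a phrase is deliberately ignored here: this is the subset match the autofix landed, not the stricter consecutive-subsequence variant the series explicitly did not ship.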