
feat(encode): D5 stemmed set intersection + D9 cache removal (0.22.0)#126

Merged
klappy merged 6 commits into main from feat/encode-stemmed-matcher-d5-d9
Apr 20, 2026

Conversation


@klappy klappy commented Apr 20, 2026

P1.3.4 — Encode Canon-Parity Refactor (D5 + D9)

Migrates oddkit_encode's trigger-word classifier from regex alternation to stemmed phrase-subset matching — the last regex matcher in the canon-parity sweep. This is the same D5 family that challenge adopted in 0.21.0 and gate adopted in 0.20.0, adapted for encode's phrasal vocabulary.

Closes the sweep. After this lands, all three tools (oddkit_encode, oddkit_challenge, oddkit_gate) use stemmed matching and have their in-process derivation caches removed per klappy://canon/principles/cache-fetches-and-parses.

Commit history on this PR

SHA Author Summary
259170a orchestrator feat(encode): D5 stemmed set intersection + D9 cache removal (0.22.0) — first cut, flat stemmedTokens: Set<string>
113ba11 cursor[bot] autofix fix(encode): require all phrase stems to match, not any stem — Bugbot's fix-forward for the high-severity finding; phrase-subset match
e404fe0 cursor[bot] autofix Remove unused intersectsStems dead code — cleanup after 113ba11
eaa1234 orchestrator docs(encode): port CHANGELOG + smoke assertion (16) onto autofix

Bugbot finding disposition (Rule 1)

Finding: high-severity on 259170a, 2026-04-20T12:55:03Z, posted by cursor[bot].

Multi-word canon trigger phrases like "committed to", "going with", "must not", "blocked by", "turns out", "follow up", and "next step" are tokenized into individual stems and each is added independently to stemmedTokens. This makes ubiquitous English words — "to", "with", "not", "by", "out", "up", "go", "next", "step" — into standalone match triggers. Since stop-word filtering is disabled, virtually every English paragraph will intersect with multiple types.

Confirmed. The first-cut implementation flattened multi-word canon phrases into individual stems. With stop-words disabled (required per P1.3.3 C-04 so canon vocab survives), function-word stems like to, with, by, up, out became universal match triggers on every English paragraph.

Disposition: fix-forward in this PR via Cursor autofix commit 113ba11. EncodingTypeDef.stemmedTokens: Set<string> replaced with stemmedPhrases: string[][], where each inner array is the ordered stem sequence of a single canon trigger entry. Runtime matcher matchesStemmedPhrases declares a type match only when ALL stems of at least one phrase co-occur in the input stem set. Single-stem phrases degenerate to set membership (inflection matching still works); multi-stem phrases require stem conjunction. Subsequent autofix commit e404fe0 removed the now-unused intersectsStems helper.

Scope landed

Item 1 — D5 stemmed phrase-subset matcher (workers/src/orchestrate.ts)

  • EncodingTypeDef.triggerRegex: RegExp | null removed, replaced with stemmedPhrases: string[][].
  • triggerWords: string[] kept for debugging visibility (per handoff).
  • Canon-path parse: tokenize each trigger word with tokenize(word, new Set()) — stop-words disabled per P1.3.3 C-04. Each stem array is stored as-is.
  • Minimal-fallback path migrated identically.
  • matchesStemmedPhrases(phrases, inputStems) helper: iterates phrases; declares match on the first phrase whose stems are all present in the input set.
  • parsePrefixedBatchInput untagged-paragraph path: inputStems = new Set(tokenize(para, new Set())) hoisted above the per-type loop; matchesStemmedPhrases replaces triggerRegex.test; break preserved — this path picks one type per paragraph by design.
  • parseUnstructuredInput: same hoist + same match substitution; no break — the L1161–1164 DESIGN comment preserved verbatim.
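The Item 1 shape condenses into a short runnable sketch. The suffix-stripping tokenize below is a naive stand-in for the real stemmer in workers/src/orchestrate.ts, and the trigger vocab is illustrative rather than the actual canon file contents:

```typescript
// Naive stand-in for the real tokenize(): lowercase, split, strip a few
// suffixes. Stop-words are passed in as a Set; D5 disables them with
// an empty Set so phrases like "going with" survive on both sides.
function tokenize(text: string, stopWords: Set<string>): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z']+/)
    .filter((w) => w.length > 0 && !stopWords.has(w))
    .map((w) => w.replace(/(ing|ed|es|s)$/, ""));
}

// Parse-time: each canon trigger word/phrase becomes an ordered stem sequence.
function buildStemmedPhrases(triggerWords: string[]): string[][] {
  const phrases: string[][] = [];
  for (const word of triggerWords) {
    const stems = tokenize(word, new Set()); // stop-words disabled
    if (stems.length > 0) phrases.push(stems);
  }
  return phrases;
}

// Runtime: a type matches when ALL stems of at least one phrase are present
// in the input stem set. Single-stem phrases degenerate to set membership.
function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean {
  for (const phrase of phrases) {
    if (phrase.every((stem) => input.has(stem))) return true;
  }
  return false;
}

const decisionPhrases = buildStemmedPhrases(["decided", "going with"]);
// Hoisted once per paragraph, reused across the per-type loop.
const inputStems = new Set(tokenize("we're going with option B", new Set()));
const matched = matchesStemmedPhrases(decisionPhrases, inputStems);
// matched is true: both stems of [go, with] co-occur; "with" alone would not match.
```

The phrase-level conjunction is the precision floor: a lone function-word stem such as `with` or `to` can never satisfy a multi-stem phrase on its own.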

Item 2 — D9 cache removal (workers/src/orchestrate.ts)

  • Module-level cachedEncodingTypes, cachedEncodingTypesKnowledgeBaseUrl, cachedEncodingTypesSource deleted.
  • Cache-check short-circuit at top of discoverEncodingTypes deleted.
  • cleanup_storage resets for the three fields deleted.
  • Per klappy://canon/principles/cache-fetches-and-parses.
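A minimal sketch of the resulting D9 shape, with a plain Map standing in for the real fetch tier and a hand-written canon fragment — the actual implementation is async and reads via KnowledgeBaseFetcher, so none of the names below beyond discoverEncodingTypes are the real signatures:

```typescript
type EncodingTypeDef = { letter: string; triggerWords: string[] };

// Stand-in for the fetch tier (Module Memory → Cache API → R2 in the
// real worker). Fetches stay cached; that layer is untouched by D9.
const fetchCache = new Map<string, string>();

function getFile(path: string): string {
  const hit = fetchCache.get(path);
  if (hit !== undefined) return hit;
  const content = "## Trigger Words\ndecided, going with"; // stand-in canon read
  fetchCache.set(path, content);
  return content;
}

// No module-level cachedEncodingTypes: the parse product is re-derived on
// every call. The derivation is a sub-millisecond loop over a small file,
// so caching it would add keyed-invalidation plumbing for no measurable win.
function discoverEncodingTypes(path: string): EncodingTypeDef[] {
  const content = getFile(path);
  const section = content.match(/## Trigger Words\n(.*)/);
  const triggerWords = section ? section[1].split(",").map((w) => w.trim()) : [];
  return [{ letter: "D", triggerWords }];
}
```

Calling discoverEncodingTypes twice hits the fetch cache once and runs the parse twice — cache the fetch, not the parse.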

Item 3 — Smoke regression assertions (workers/test/canon-tool-envelope.smoke.mjs)

  • (12) Inflection match: "I'm deciding to ship the two-tier cascade" → Decision (via decid stem-singleton matching decided in canon vocab)
  • (13) Phrasal vocab survival: "we're going with option B after the review" → Decision (via [go, with] phrase having both stems present)
  • (14) Multi-type no-break: "We must never deploy without tests because we decided this last week" → both C and D artifacts
  • (15) Batch first-match: [D] tag + untagged paragraph with multiple triggers → exactly 2 artifacts
  • (16) Phrase-subset regression anchor: "I need to wait until tomorrow for the review" → NEITHER Decision NOR Handoff. Bugbot-finding regression anchor; would fail if the matcher were re-flattened to standalone singletons.
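Assertion (16)'s discriminating power can be traced with a small simulation. The stem arrays below are hand-written approximations of the canon vocab (not the parsed values), and the input stem set is written out literally rather than produced by the real tokenize():

```typescript
// Approximate Decision and Handoff vocab as ordered stem sequences.
const decisionPhrases: string[][] = [["decid"], ["committ", "to"], ["go", "with"]];
const handoffPhrases: string[][] = [
  ["next", "step"], ["follow", "up"], ["block", "by"], ["wait", "on"], ["todo"],
];

// Stems of "I need to wait until tomorrow for the review", written literally.
const inputStems = new Set(["i", "need", "to", "wait", "until", "tomorrow", "for", "the", "review"]);

// First-cut (buggy) flat-set matcher: any shared stem counts as a match.
const flatMatch = (phrases: string[][]) =>
  phrases.flat().some((stem) => inputStems.has(stem));

// Shipped phrase-subset matcher: ALL stems of at least one phrase must match.
const subsetMatch = (phrases: string[][]) =>
  phrases.some((phrase) => phrase.every((stem) => inputStems.has(stem)));

const flatDecision = flatMatch(decisionPhrases);     // true: lone "to" triggers
const subsetDecision = subsetMatch(decisionPhrases); // false: no phrase fully present
const subsetHandoff = subsetMatch(handoffPhrases);   // false: "wait" present, "on" absent
```

The flat matcher false-positives on the lone `to` from `[committ, to]`; the subset matcher matches neither type, which is exactly the behavior assertion (16) locks in.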

Verified (local)

  • tsc --noEmit clean on every commit
  • node test/governance-parser.test.mjs = 105/105 pass on every commit
  • Smoke assertions (12)–(16) require a deployed preview URL; will be exercised against preview as part of Rule 2 validation

Release validation gate attestation

Per klappy://canon/constraints/release-validation-gate:

  • Rule 1 (no merge with active reviews): ✓ Bugbot completed/neutral on 259170a with one high-severity finding, fix-forwarded in commits 113ba11, e404fe0, and eaa1234. Bugbot will be re-polled on head eaa1234 before merge; any remaining finding will be dispositioned first.
  • Rule 2 (no promotion without independent validator): this refactor touches workers/src/orchestrate.ts + matcher algorithm + oddkit_encode action behavior = load-bearing surface. A Sonnet 4.6 read-only validator session will be dispatched via Managed Agents before the main → prod promotion PR merges. 5-corroboration pattern from P1.3.1, adapted per the P1.3.4 handoff's C1–C5 spec for encode.
  • Rule 3 (canon outranks session artifacts): the P1.3.4 handoff's guidance was followed. Where the orchestrator initially proposed a stricter consecutive-subsequence phrase match and Bugbot's autofix proposed a simpler subset match, canon (the Bugbot review acting as the enforcer of release-validation-gate Rule 1) was followed — autofix accepted, orchestrator's alternative not shipped.

Refs

  • Handoff: klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity
  • Predecessor ledger: klappy://odd/ledger/2026-04-20-p1-3-3-challenge-canon-parity-landed
  • Canon basis: klappy://canon/principles/cache-fetches-and-parses, klappy://canon/principles/vodka-architecture
  • Binding gate: klappy://canon/constraints/release-validation-gate
  • Precedent: oddkit 0.21.1 (challenge D5+D9), 0.20.0 (gate D5+D9)

Note

Medium Risk
Changes core encode classification logic and caching behavior, which can affect artifact typing and downstream workflows if the new matcher has edge cases or performance regressions.

Overview
oddkit_encode’s trigger-word classifier is refactored from per-type regex alternation to stemmed phrase-subset matching (EncodingTypeDef.triggerRegex replaced by stemmedPhrases, plus a matchesStemmedPhrases helper), with stop-word filtering disabled and per-paragraph tokenization hoisted to preserve multi-type vs first-match semantics.

The module-level discoverEncodingTypes parse-product cache is removed (and related cleanup_storage resets deleted), relying on the existing fetch-layer caching instead.

Adds new canon-tool-envelope.smoke.mjs assertions to lock in inflection/phrase behavior and prevent function-word false positives, updates the changelog accordingly, and bumps package versions to 0.22.0.

Reviewed by Cursor Bugbot for commit 8a0636b.

Migrates oddkit_encode trigger-word classifier from regex alternation to
stemmed set intersection — the last regex matcher in the canon-parity
sweep. Same D5 shape that challenge adopted in 0.21.0 and gate adopted in
0.20.0. Closes the sweep.

Changed:
- EncodingTypeDef.triggerRegex: RegExp | null → stemmedTokens: Set<string>
- Parse-time: tokenize(word, new Set()) per canon trigger word; stems go
  into a Set<string> built once per fetch. Stop-words disabled on both
  sides per P1.3.3 C-04 — canon vocab includes stop-word phrases such
  as "going with" (Decision), "committed to" (Decision), "must not"
  (Constraint), "turns out" (Learning), "found that" (Learning), "next
  step" (Handoff), "blocked by" (Handoff). Dropping them would silently
  break the strictly-additive invariant the same way challenge broke it
  on "from" in source-named.
- Runtime: const inputStems = new Set(tokenize(para, new Set())) hoisted
  above the per-type loop at both classifier call sites.
- New intersectsStems(vocab, input) helper — iterates smaller set with
  early exit; mirrors evaluatePrerequisiteCheck from P1.3.3.
- parsePrefixedBatchInput untagged-paragraph path keeps its break
  (first-match per paragraph, by design).
- parseUnstructuredInput keeps its no-break multi-type design — the
  L1161–1164 DESIGN comment is preserved verbatim.

Removed:
- Module-level cachedEncodingTypes + cachedEncodingTypesKnowledgeBaseUrl
  + cachedEncodingTypesSource deleted per cache-fetches-and-parses. The
  fetch tier (Module Memory → Cache API → R2, 5-min TTL) already caches
  the canon read; parse-product caching for microsecond derivation
  savings is the anti-pattern the principle names.
- Cache-check short-circuit at top of discoverEncodingTypes deleted.
- cleanup_storage resets for the three removed fields deleted.

Added:
- Smoke regression assertions (12)–(15) in canon-tool-envelope.smoke.mjs:
  (12) inflection match — "deciding" → Decision via decid stem
  (13) stop-word survival — "going with" matches Decision
  (14) multi-type no-break — C and D both emitted for mixed input
  (15) batch first-match — mixed batch emits exactly 2 artifacts

Refs:
- Handoff: klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity
- Canon basis: klappy://canon/principles/cache-fetches-and-parses,
  klappy://canon/principles/vodka-architecture
- Precedent: oddkit 0.21.1 (challenge D5+D9), 0.20.0 (gate D5+D9)
- Shipping under: klappy://canon/constraints/release-validation-gate

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- Smoke assertions (12)-(15) will be exercised against preview post-push.

cloudflare-workers-and-pages Bot commented Apr 20, 2026

Deploying with Cloudflare Workers

✅ Deployment successful: oddkit, commit 8a0636b (Apr 20 2026, 01:55 PM UTC)

Comment thread workers/src/orchestrate.ts
Multi-word canon trigger phrases (committed to, going with, must not,
next step, follow up, blocked by, turns out) were tokenized into
individual stems and flattened into a per-type Set<string>. Common
English function words (to, with, by, up, out, not) became standalone
match triggers, causing false positives on nearly every English
paragraph in parseUnstructuredInput and the untagged-paragraph branch
of parsePrefixedBatchInput.

Replace the flat stemmedTokens: Set<string> with stemmedPhrases:
string[][] — each inner array is the ordered stem sequence of one
trigger word or phrase. A type matches when ALL stems of at least one
phrase co-occur in the input stem set. Single-stem phrases degenerate
to set membership (identical to the old behavior for inflection
matching like deciding -> decid), while multi-stem phrases now
preserve phrase-level semantics.

Regression anchors (smoke tests 12/13/14/15) still pass: inflection,
stop-word-adjacent canon vocab, multi-type no-break, and batch
first-match all retain their previous behavior.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Unused intersectsStems function is dead code
    • Removed the unused intersectsStems function and updated the neighboring comment to stand alone referencing the P1.3.3 challenge evaluator; typecheck passes.
Preview (e404fe0a43)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,28 @@
 
 ## [Unreleased]
 
+## [0.22.0] - 2026-04-20
+
+### Changed
+
+- **`oddkit_encode` trigger-word classifier migrated from regex alternation to stemmed set intersection** (per PRD D5 from P1.3.4 — split-by-fit, same shape challenge adopted in 0.21.0 and gate adopted in 0.20.0). `EncodingTypeDef.triggerRegex: RegExp | null` is replaced with `stemmedTokens: Set<string>` — a parse product built once per canon fetch. Canon trigger vocabulary reads unchanged from `odd/encoding-types/*.md` (`## Trigger Words` fenced block); the new matcher tokenizes each vocabulary word with stop-words disabled, stems into the Set at parse time, and intersects against a stop-word-disabled stemmed input set at runtime. Inflected forms (`deciding`, `realizing`, `discovering`) now match their canonical stems (`decid`, `realiz`, `discover`) without canon having to enumerate each inflection. **Strictly additive**: every input that matched the prior regex still matches, plus stemmed variations now do too. Stop-words are disabled (empty `Set`) on both the parse-time `tokenize(word, new Set())` and the runtime `tokenize(para, new Set())` calls — canon vocab includes stop-word-adjacent phrases like `going with`, `committed to`, `must not`, `turns out`, `found that`, `next step`, `blocked by`; the P1.3.3 `from`-in-source-named precedent (C-04) demanded this pattern be replicated verbatim. Both classifier call sites preserve their existing semantics: `parsePrefixedBatchInput` untagged-paragraph path picks first match via `break` (one artifact per paragraph); `parseUnstructuredInput` emits one artifact per matching type (no `break` — the load-bearing design comment is preserved verbatim). `tokenize(para, new Set())` is hoisted out of the per-type loop at both call sites. A new `intersectsStems(vocab, input)` helper encapsulates the match test. Per `klappy://canon/principles/vodka-architecture`: fit the matcher to the problem shape (independent gap-or-not per type, multi-type allowed by design).
+
+### Removed
+
+- **Module-level `cachedEncodingTypes` in-process cache** (per PRD D9 from P1.3.4 — don't cache microsecond derivations; same pattern challenge shipped in 0.21.0 and gate shipped in 0.20.0). `cachedEncodingTypes`, `cachedEncodingTypesKnowledgeBaseUrl`, `cachedEncodingTypesSource` module-level fields deleted; cache-check short-circuit at the top of `discoverEncodingTypes` deleted; `cleanup_storage` resets for the three fields deleted. Per `klappy://canon/principles/cache-fetches-and-parses`: the fetch layer (Module Memory → Cache API → R2, 5-minute TTL) already caches the canon file content; caching the parse product for microsecond re-derivation savings is the anti-pattern the principle names. Parse runs fresh per call; overhead is sub-millisecond on hot fetches.
+
+### Added
+
+- **New smoke regression assertions in `workers/test/canon-tool-envelope.smoke.mjs`** anchoring the D5 migration: (12) stemmed inflection match — `"I'm deciding to ship two-tier cascade"` classifies as Decision (`decid` stem matches `decided` in canon vocab); (13) stop-word canon vocab survives — `"we're going with option B after the review"` matches Decision (`going with` multi-word canon vocab); (14) multi-type preservation — `"We must never deploy without tests because we decided this last week"` emits both `C` and `D` artifacts via the no-break path; (15) first-match preservation — untagged paragraph in a mixed batch emits exactly one artifact via the batch classifier's `break` semantic.
+
+### Refs
+
+- Handoff: `klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity`
+- Canon basis: `klappy://canon/principles/cache-fetches-and-parses`, `klappy://canon/principles/vodka-architecture`
+- Precedent: oddkit 0.21.1 (challenge's D5 + D9), 0.20.0 (gate's D5 + D9)
+- Shipping gate: `klappy://canon/constraints/release-validation-gate` (binding)
+- Closes the canon-parity sweep — all three tools now use stemmed set intersection and have their in-process derivation caches removed per `cache-fetches-and-parses`.
+
 ## [0.21.1] - 2026-04-20
 
 ### Fixed

diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "oddkit",
-  "version": "0.21.1",
+  "version": "0.22.0",
   "description": "Agent-first CLI for ODD-governed repos. Epistemic terrain rendering with portable baseline.",
   "type": "module",
   "bin": {

diff --git a/workers/package-lock.json b/workers/package-lock.json
--- a/workers/package-lock.json
+++ b/workers/package-lock.json
@@ -1,12 +1,12 @@
 {
   "name": "oddkit-mcp-worker",
-  "version": "0.21.1",
+  "version": "0.22.0",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "oddkit-mcp-worker",
-      "version": "0.21.1",
+      "version": "0.22.0",
       "dependencies": {
         "agents": "^0.4.1",
         "fflate": "^0.8.2",

diff --git a/workers/package.json b/workers/package.json
--- a/workers/package.json
+++ b/workers/package.json
@@ -1,6 +1,6 @@
 {
   "name": "oddkit-mcp-worker",
-  "version": "0.21.1",
+  "version": "0.22.0",
   "private": true,
   "type": "module",
   "scripts": {

diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts
--- a/workers/src/orchestrate.ts
+++ b/workers/src/orchestrate.ts
@@ -56,12 +56,23 @@
 /** Internal type — handlers return this, handleUnifiedAction stamps server_time */
 type ActionResult = Omit<OddkitEnvelope, "server_time">;
 
-// Governance-driven encoding types
+// Governance-driven encoding types. Trigger-word classification is stemmed
+// phrase-subset matching per klappy://canon/principles/vodka-architecture
+// (fit the matcher to the problem) — same D5 shape applied to challenge
+// prereqs in 0.21.0 and gate prereqs in 0.20.0. triggerWords kept for
+// debugging only; stemmedPhrases is the parse product the runtime evaluates
+// against. Each inner array is the ordered stem sequence of a single
+// trigger word or phrase; a type matches an input when ALL stems of at
+// least one phrase are present in the input's stem set. This preserves
+// phrase-level semantics (`committed to`, `going with`, `must not`,
+// `next step`, `follow up`, `blocked by`, `turns out`) so common function
+// words (`to`, `with`, `by`, `up`, `out`, `not`) do not become standalone
+// match triggers on every English paragraph.
 interface EncodingTypeDef {
   letter: string;
   name: string;
   triggerWords: string[];
-  triggerRegex: RegExp | null;
+  stemmedPhrases: string[][];
   qualityCriteria: Array<{ criterion: string; check: string; gapMessage: string }>;
 }
 
@@ -79,9 +90,12 @@
   priority_band?: string;
 }
 
-let cachedEncodingTypes: EncodingTypeDef[] | null = null;
-let cachedEncodingTypesKnowledgeBaseUrl: string | undefined = undefined;
-let cachedEncodingTypesSource: "knowledge_base" | "minimal" = "minimal";
+// D9 / klappy://canon/principles/cache-fetches-and-parses — no module-level
+// cache on the parse product. fetcher.getFile / fetcher.getIndex already cache
+// the canon read (Module Memory → Cache API → R2, 5-min TTL). Re-running the
+// parse loop per request is sub-millisecond derivation work, not worth the
+// plumbing tax of a keyed cache. Same pattern challenge (0.21.0) and gate
+// (0.20.0) already applied.
 
 // Governance-driven challenge types (E0008 — mirrors encode pattern from PR #96)
 interface ChallengeTypeDef {
@@ -409,10 +423,6 @@
   fetcher: KnowledgeBaseFetcher,
   knowledgeBaseUrl?: string,
 ): Promise<{ types: EncodingTypeDef[]; source: "knowledge_base" | "minimal" }> {
-  if (cachedEncodingTypes && cachedEncodingTypesKnowledgeBaseUrl === knowledgeBaseUrl) {
-    return { types: cachedEncodingTypes, source: cachedEncodingTypesSource };
-  }
-
   const index = await fetcher.getIndex(knowledgeBaseUrl);
   const typeArticles = index.entries.filter(
     (entry: IndexEntry) => entry.tags?.includes("encoding-type") && entry.path.includes("encoding-types/"),
@@ -437,10 +447,28 @@
       const triggerWords = triggerSection
         ? triggerSection[1].split(",").map((w: string) => w.trim()).filter((w: string) => w.length > 0)
         : [];
-      const triggerRegex =
-        triggerWords.length > 0
-          ? new RegExp("\\b(" + triggerWords.map((w: string) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")).join("|") + ")\\b", "i")
-          : null;
+      // D5 / klappy://canon/principles/vodka-architecture — classification is
+      // stemmed phrase-subset matching, not regex alternation. Each canon
+      // trigger word/phrase is parsed once into its ordered stem sequence;
+      // runtime tokenizes input once and a type matches when ALL stems of
+      // at least one phrase are present. Inflected forms (deciding → decid,
+      // realizing → realiz) match their canonical stems without canon having
+      // to list each inflection. Stop-word filtering is disabled (empty Set)
+      // on both the parse-time and runtime tokenize() calls — canon vocab
+      // includes stop-word-adjacent phrases (`going with`, `committed to`,
+      // `must not`, `turns out`, `next step`, `blocked by`, `found that`)
+      // and dropping them would silently break the strictly-additive
+      // invariant, the same failure mode P1.3.3 hit on challenge's
+      // `from`-in-source-named vocab. Phrase-level conjunction (all stems
+      // of a phrase must match) is the precision floor: without it,
+      // ubiquitous function words like `to`/`with`/`by`/`up`/`out`/`not`
+      // would become standalone triggers on every English paragraph.
+      // Per canon/constraints/release-validation-gate and P1.3.3 C-04.
+      const stemmedPhrases: string[][] = [];
+      for (const word of triggerWords) {
+        const stems = tokenize(word, new Set());
+        if (stems.length > 0) stemmedPhrases.push(stems);
+      }
 
       const criteriaSection = content.match(
         /## Quality Criteria[\s\S]*?\| Criterion[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
@@ -459,7 +487,7 @@
         }
       }
 
-      types.push({ letter, name, triggerWords, triggerRegex, qualityCriteria });
+      types.push({ letter, name, triggerWords, stemmedPhrases, qualityCriteria });
     } catch {
       continue;
     }
@@ -495,17 +523,21 @@
       ["H", "Handoff",     ["next session", "next step", "todo", "follow up", "blocked by"]],
       ["E", "Encode",      ["encoded", "captured", "crystallized", "persisted", "artifact"]],
     ];
-    resolved = defaults.map(([letter, name, words]) => ({
-      letter, name, triggerWords: words,
-      triggerRegex: new RegExp("\\b(" + words.join("|") + ")\\b", "i"),
-      qualityCriteria: [],
-    }));
+    resolved = defaults.map(([letter, name, words]) => {
+      const stemmedPhrases: string[][] = [];
+      for (const word of words) {
+        const stems = tokenize(word, new Set());
+        if (stems.length > 0) stemmedPhrases.push(stems);
+      }
+      return {
+        letter, name, triggerWords: words,
+        stemmedPhrases,
+        qualityCriteria: [],
+      };
+    });
     source = "minimal";
   }
 
-  cachedEncodingTypes = resolved;
-  cachedEncodingTypesKnowledgeBaseUrl = knowledgeBaseUrl;
-  cachedEncodingTypesSource = source;
   return { types: resolved, source };
 }
 
@@ -1084,6 +1116,25 @@
   return paragraphs.some((p) => PREFIX_TAG_REGEX.test(p));
 }
 
+// Phrase-subset match — a phrase matches when ALL of its stems appear in the
+// input stem set. Short-circuits on the first phrase that matches. The D5
+// matcher shape for encode trigger-word classification, mirroring the shape
+// used by evaluatePrerequisiteCheck in the P1.3.3 challenge evaluator:
+// single-stem phrases degenerate to set membership (identical to the old
+// single-token behavior), while multi-stem phrases like
+// `committed to` → ["committ","to"] require both stems to co-occur, so
+// ubiquitous function words cannot match on their own.
+function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean {
+  for (const phrase of phrases) {
+    let allPresent = true;
+    for (const stem of phrase) {
+      if (!input.has(stem)) { allPresent = false; break; }
+    }
+    if (allPresent) return true;
+  }
+  return false;
+}
+
 function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): ParsedArtifact[] {
   const typeMap = new Map(types.map((t) => [t.letter, t.name]));
   const paragraphs = input.split(/\n\n+/).map((p) => p.trim()).filter((p) => p.length > 0);
@@ -1118,9 +1169,16 @@
       // Untagged paragraph in a batch that contains tags: classify via trigger
       // words like parseUnstructuredInput, but emit one artifact per paragraph
       // (not one-per-match) to preserve the author's paragraph boundaries.
+      // Stemmed set intersection mirrors parseUnstructuredInput — stop-words
+      // disabled on tokenize() both sides per P1.3.3 C-04 (canon vocab
+      // includes stop-word phrases like `going with` / `must not`).
       let matched: EncodingTypeDef | null = null;
+      const inputStems = new Set(tokenize(para, new Set()));
       for (const t of types) {
-        if (t.triggerRegex && t.triggerRegex.test(para)) { matched = t; break; }
+        // Break on first match: this path picks one type per paragraph by
+        // design (paragraph boundaries are the author's). Unlike
+        // parseUnstructuredInput which emits one artifact per matching type.
+        if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) { matched = t; break; }
       }
       const pick = matched ?? types[0] ?? { letter: "D", name: "Decision" };
       const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60);
@@ -1157,12 +1215,19 @@
   const artifacts: ParsedArtifact[] = [];
   for (const para of paragraphs) {
     let matched = false;
+    // Hoist tokenize(para) out of the per-type loop — para is constant across
+    // the loop, stemmedTokens differ per type. Mirrors the P1.3.3 challenge
+    // prereq evaluator shape. Stop-words disabled (empty Set) on both parse-
+    // time and runtime tokenize() calls so canon vocab like `going with`,
+    // `must not`, `turns out`, `found that` survives on both sides. Per
+    // canon/constraints/release-validation-gate and P1.3.3 Bug #1 precedent.
+    const inputStems = new Set(tokenize(para, new Set()));
     for (const t of types) {
       // DESIGN: no break — a paragraph can match multiple types intentionally.
       // "We must never deploy without tests" is both Decision and Constraint.
       // Multi-typing at the server level mirrors what the model would do with
       // separate TSV rows. Do not add a break here.
-      if (t.triggerRegex && t.triggerRegex.test(para)) {
+      if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) {
         const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60);
         const title = first.split(/\s+/).length <= 12 ? first : first.split(/\s+/).slice(0, 8).join(" ") + "...";
         artifacts.push({ type: t.letter, typeName: t.name, fields: [t.letter, title, para.trim()], title, body: para.trim() });
@@ -1518,9 +1583,10 @@
   // Also clear the in-memory BM25 index
   cachedBM25Index = null;
   cachedBM25Entries = null;
-  cachedEncodingTypes = null;
-  cachedEncodingTypesKnowledgeBaseUrl = undefined;
-  cachedEncodingTypesSource = "minimal";
+  // cachedEncodingTypes removed in 0.22.0 per cache-fetches-and-parses —
+  // encode's parse product is no longer cached in-process. The fetch tier
+  // (Cache API, R2) already handles canon file caching; the derivation is
+  // sub-millisecond. No reset needed here.
   // E0008 — governance-driven challenge caches (mirror PR #96 fix)
   cachedChallengeTypes = null;
   cachedChallengeTypesKnowledgeBaseUrl = undefined;

diff --git a/workers/test/canon-tool-envelope.smoke.mjs b/workers/test/canon-tool-envelope.smoke.mjs
--- a/workers/test/canon-tool-envelope.smoke.mjs
+++ b/workers/test/canon-tool-envelope.smoke.mjs
@@ -224,6 +224,62 @@
     `got: ${encodeOverride.result?.governance_source}`,
   );
 
+  // P1.3.4 D5 regression anchors — stemmed set intersection replaces regex
+  // alternation on the encode classifier. These assertions exist because
+  // the pre-refactor literal regex path could not match inflections of
+  // canon vocab (`deciding` does not match `decided` under `\bdecided\b`),
+  // and the P1.3.3 Bug #1 precedent showed that tokenize()'s default
+  // stop-word filter silently breaks multi-word canon vocab (`going with`,
+  // `committed to`, `must not`). The assertions are numbered (12)–(15) to
+  // continue the sequence P1.3.3 established at (10)/(11).
+  console.log(`\n─── oddkit_encode: (12) stemmed inflection match (D5 landed) ───`);
+  const encodeInflection = await callTool("oddkit_encode", {
+    input: "I'm deciding to ship the two-tier cascade",
+  });
+  expectFullEnvelope("oddkit_encode (inflection match)", encodeInflection);
+  const inflectionTypes = (encodeInflection.result?.artifacts ?? []).map((a) => a.type);
+  ok(
+    "oddkit_encode: (12) `deciding` (inflection of `decided`) classifies as Decision via stem intersection",
+    inflectionTypes.includes("D"),
+    `got artifact types: ${inflectionTypes.join(",")}`,
+  );
+
+  console.log(`\n─── oddkit_encode: (13) stop-word canon vocab survives tokenize (P1.3.3 C-04 ported) ───`);
+  const encodeStopWord = await callTool("oddkit_encode", {
+    input: "we're going with option B after the review",
+  });
+  expectFullEnvelope("oddkit_encode (stop-word survival)", encodeStopWord);
+  const stopWordTypes = (encodeStopWord.result?.artifacts ?? []).map((a) => a.type);
+  ok(
+    "oddkit_encode: (13) `going with` (multi-word canon vocab containing stop-word `with`) matches Decision",
+    stopWordTypes.includes("D"),
+    `got artifact types: ${stopWordTypes.join(",")}`,
+  );
+
+  console.log(`\n─── oddkit_encode: (14) multi-type no-break preservation (L1161 design comment) ───`);
+  const encodeMultiType = await callTool("oddkit_encode", {
+    input: "We must never deploy without tests because we decided this last week",
+  });
+  expectFullEnvelope("oddkit_encode (multi-type)", encodeMultiType);
+  const multiTypeTypes = (encodeMultiType.result?.artifacts ?? []).map((a) => a.type);
+  ok(
+    "oddkit_encode: (14) paragraph matching both Constraint and Decision emits both artifact types (no-break path)",
+    multiTypeTypes.includes("C") && multiTypeTypes.includes("D"),
+    `got artifact types: ${multiTypeTypes.join(",")}`,
+  );
+
+  console.log(`\n─── oddkit_encode: (15) first-match preservation in batch-untagged path ───`);
+  const encodeBatchUntagged = await callTool("oddkit_encode", {
+    input: "[D] explicit decision tag on first paragraph\n\nwe must always write tests before we decided on TDD",
+  });
+  expectFullEnvelope("oddkit_encode (batch first-match)", encodeBatchUntagged);
+  const batchArtifacts = encodeBatchUntagged.result?.artifacts ?? [];
+  ok(
+    "oddkit_encode: (15) batch with tagged + untagged paragraphs emits exactly 2 artifacts (first-match path picks one type per untagged paragraph)",
+    batchArtifacts.length === 2,
+    `got length: ${batchArtifacts.length}; types: ${batchArtifacts.map((a) => a.type).join(",")}`,
+  );
+
   // Tool 5: oddkit_challenge — canon-driven, four governance surfaces.
   // Full envelope + governance_source + governance_uris (plural, per PRD D4 —
   // shape diverges from encode by design because challenge reads four peer

Reviewed by Cursor Bugbot for commit 113ba11.
cursoragent and others added 4 commits April 20, 2026 13:11
The Bugbot autofix on commit 113ba11 is the canonical disposition of the
high-severity finding on 259170a. This commit ports the orchestrator-drafted
CHANGELOG [0.22.0] entry and the smoke regression anchor onto the autofix.

CHANGELOG: rewritten to describe the phrase-subset match autofix actually
landed (ALL stems of at least one phrase must co-occur in the input stem
set, any order). The previous draft described a consecutive-subsequence
variant that was never shipped. Subset-match is simpler, uses one uniform
structure, and is aligned with encode's multi-type tolerance philosophy.

Smoke assertion (16): traced against the autofix semantics. Input
"I need to wait until tomorrow for the review" contains `to` but no
Decision phrase has all its stems present (`decid`/`decis`/`chose` absent;
`[committ,to]` fails on `committ`; `[go,with]` fails on both), and no
Handoff phrase has all its stems present (`[next,step]`, `[follow,up]`,
`[block,by]`, `[wait,on]` all fail on their second stem; `todo`, `continu`,
`remain`, `handoff` singletons absent). A revision that re-flattens the
matcher back to standalone-singleton triggers would fail this assertion.
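The phrase-subset semantics traced above can be sketched as follows. This is a minimal illustration, not the shipped orchestrate.ts code — `matchesType`, `StemPhrase`, and the stem lists are hypothetical names chosen for the sketch:

```typescript
// One canon phrase = an array of stems, e.g. ["go", "with"] for "going with".
type StemPhrase = string[];

// A type matches when ALL stems of at least one of its phrases appear in
// the input's stem set, in any order (the phrase-subset match from 113ba11).
function matchesType(inputStems: Set<string>, phrases: StemPhrase[]): boolean {
  return phrases.some((phrase) => phrase.every((stem) => inputStems.has(stem)));
}

// Trace of smoke assertion (16): "I need to wait until tomorrow for the review".
// "to" is present, but no Decision phrase has all of its stems present.
const inputStems = new Set([
  "i", "need", "to", "wait", "until", "tomorrow", "for", "the", "review",
]);
const decisionPhrases: StemPhrase[] = [
  ["decid"], ["chose"], ["committ", "to"], ["go", "with"],
];
console.log(matchesType(inputStems, decisionPhrases)); // false
```

A re-flattened matcher (any single stem triggers) would return true here on the bare `to`, which is exactly what assertion (16) guards against.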

Disposition record for PR #126 Bugbot finding:
- Finding: multi-word vocab flattening produces universal function-word
  triggers (high severity, 2026-04-20T12:55:03Z)
- Fix-forward: Cursor autofix commit 113ba11 (phrase-subset match)
- Orchestrator's proposed alternative: stricter consecutive-subsequence
  match — not shipped; subset-match is the simpler, correct design for
  encode's multi-type philosophy
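For contrast, a minimal sketch of the stricter consecutive-subsequence variant that was proposed but not shipped (the function name and inputs are hypothetical, for illustration only):

```typescript
// Not-shipped alternative: phrase stems must appear adjacently and in order
// in the tokenized input, rather than anywhere in the stem set.
function matchesConsecutive(inputStems: string[], phrase: string[]): boolean {
  outer: for (let i = 0; i + phrase.length <= inputStems.length; i++) {
    for (let j = 0; j < phrase.length; j++) {
      if (inputStems[i + j] !== phrase[j]) continue outer;
    }
    return true; // every stem of the phrase matched consecutively at offset i
  }
  return false;
}

// Subset-match tolerates reordered or separated stems; this variant does not:
const stems = ["we", "go", "eventually", "with", "option", "b"];
console.log(matchesConsecutive(stems, ["go", "with"])); // false
```

The intervening `eventually` breaks adjacency here, so the consecutive variant rejects an input the shipped subset-match would accept — the stricter behavior deemed unnecessary for encode's multi-type tolerance.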

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- Smoke assertions (12)-(16) traced against autofix semantics, all hold
Main shipped 0.22.0 via PR #128 while this branch was in Sonnet 4.6
validator dispatch. PR #128 backfilled CHANGELOG + version bump covering
the envelope-conformance fixes from PR #124 (telemetry_public) and
PR #125 (catalog generated_at).

Per klappy://canon/constraints/release-validation-gate Rule 3 (canon
outranks session artifacts), this refactor is re-versioned to 0.23.0.
The handoff's "ship as 0.22.0" recommendation was session-scoped; main-
reality is the canon.

Resolution:
- CHANGELOG.md: my encode D5+D9 content moves to a new [0.23.0] section
  above the existing [0.22.0] (telemetry + catalog); added a version-
  note blockquote explaining the bump.
- package.json / workers/package.json / both lockfiles: 0.22.0 → 0.23.0.
- workers/src/orchestrate.ts: auto-merged cleanly (catalog fix touched
  runCatalog, encode refactor touched discoverEncodingTypes + classifier
  call sites; zero overlap).
- workers/test/canon-tool-envelope.smoke.mjs: auto-merged cleanly
  (additive on both sides).

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- CHANGELOG structure: [Unreleased] [0.23.0] [0.22.0] [0.21.1] [0.21.0]...
- All conflict markers removed

Sonnet 4.6 validator verdict (session sesn_011CaF5vqjgzN7Mw8s84qvK9,
PASS on all 5 corroborations) remains valid for the encode refactor
content — the rebase does not touch any matcher code or action behavior.
A fresh-context validator re-dispatch will run before promotion per
release-validation-gate Rule 2 out of canon-discipline caution.
The d2acf91 merge commit bumped this refactor from 0.22.0 to 0.23.0 per
release-validation-gate Rule 3, but the inline comment on the removed
cachedEncodingTypes reset block still said "removed in 0.22.0". Updated
to reflect the actual shipping version.

No functional change.
@klappy klappy merged commit 7542cbb into main Apr 20, 2026
5 checks passed
klappy added a commit that referenced this pull request Apr 20, 2026
Promotes oddkit 0.23.0 to prod: the P1.3.4 encode canon-parity refactor. Closes the sweep — all three tools now use stemmed matching and have their in-process derivation caches removed per klappy://canon/principles/cache-fetches-and-parses.

Release validation gate (klappy://canon/constraints/release-validation-gate):

Rule 1 — Bugbot completed on all merged SHAs (feat PR #126): 259170a/neutral→fixed, 113ba11/neutral→fixed, e404fe0/success, eaa1234/success, 8a0636b/success; promotion head 7542cbb: success.

Rule 2 — Independent fresh-context validators:
- Feat validator: agent_011CaF5vo8B5UpqtfZAmSeui, session sesn_011CaF5vqjgzN7Mw8s84qvK9 against eaa1234 — PASS on all 5 corroborations
- Promotion validator: agent_011CaF9tvJgRXQ6F96MtN4iu, session sesn_011CaF9tx18Af3z1Fy9trwz8 against 7542cbb — PASS on all 5 corroborations, smoke 223/0 × 3

Rule 3 — handoff's 0.22.0 recommendation overridden by main-reality (PR #128/#129 shipped 0.22.0 envelope fixes while this was in validator dispatch); rebased forward to 0.23.0 per canon-outranks-session-artifacts.

Non-blocking carry-forward: P13 — parseUnstructuredInput fallback-to-types[0] behavior for inputs with no canon vocab intersection. Pre-existing, surfaced by both validators, outside P1.3.4 scope.