feat(encode): D5 stemmed set intersection + D9 cache removal (0.22.0)#126
Merged
Conversation
Migrates the `oddkit_encode` trigger-word classifier from regex alternation to stemmed set intersection — the last regex matcher in the canon-parity sweep. Same D5 shape challenge adopted in 0.21.0 and gate adopted in 0.20.0. Closes the sweep.

Changed:
- `EncodingTypeDef.triggerRegex: RegExp | null` → `stemmedTokens: Set<string>`.
- Parse-time: `tokenize(word, new Set())` per canon trigger word; stems go into a `Set<string>` built once per fetch. Stop-words are disabled on both sides per P1.3.3 C-04 — canon vocab includes stop-word phrases such as "going with" (Decision), "committed to" (Decision), "must not" (Constraint), "turns out" (Learning), "found that" (Learning), "next step" (Handoff), "blocked by" (Handoff). Dropping them would silently break the strictly-additive invariant the same way challenge broke it on "from" in source-named.
- Runtime: `const inputStems = new Set(tokenize(para, new Set()))` hoisted above the per-type loop at both classifier call sites.
- New `intersectsStems(vocab, input)` helper — iterates the smaller set with early exit; mirrors `evaluatePrerequisiteCheck` from P1.3.3.
- `parsePrefixedBatchInput`'s untagged-paragraph path keeps its `break` (first-match per paragraph, by design).
- `parseUnstructuredInput` keeps its no-break multi-type design — the L1161–1164 DESIGN comment is preserved verbatim.

Removed:
- Module-level `cachedEncodingTypes` + `cachedEncodingTypesKnowledgeBaseUrl` + `cachedEncodingTypesSource` deleted per cache-fetches-and-parses. The fetch tier (Module Memory → Cache API → R2, 5-min TTL) already caches the canon read; parse-product caching for microsecond derivation savings is the anti-pattern the principle names.
- Cache-check short-circuit at the top of `discoverEncodingTypes` deleted.
- `cleanup_storage` resets for the three removed fields deleted.
Added:
- Smoke regression assertions (12)–(15) in `canon-tool-envelope.smoke.mjs`:
  - (12) inflection match — "deciding" → Decision via `decid` stem
  - (13) stop-word survival — "going with" matches Decision
  - (14) multi-type no-break — C and D both emitted for mixed input
  - (15) batch first-match — mixed batch emits exactly 2 artifacts

Refs:
- Handoff: klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity
- Canon basis: klappy://canon/principles/cache-fetches-and-parses, klappy://canon/principles/vodka-architecture
- Precedent: oddkit 0.21.1 (challenge D5+D9), 0.20.0 (gate D5+D9)
- Shipping under: klappy://canon/constraints/release-validation-gate

Verified:
- `tsc --noEmit` clean
- governance-parser.test.mjs 105/105 pass
- Smoke assertions (12)–(15) will be exercised against preview post-push.
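A minimal sketch of the `intersectsStems` helper described in the Changed list — iterate the smaller set, early-exit on the first shared stem. This is illustrative only (toy pre-stemmed sets stand in for `tokenize` output), not the shipped code:

```typescript
// Sketch of the intersectsStems shape: test whether two stem sets share
// any element, iterating the smaller set for fewer lookups.
function intersectsStems(vocab: Set<string>, input: Set<string>): boolean {
  const [small, large] =
    vocab.size <= input.size ? [vocab, input] : [input, vocab];
  for (const stem of small) {
    if (large.has(stem)) return true; // early exit on first shared stem
  }
  return false;
}

// Toy data: pretend tokenize() already stemmed both sides.
const vocabStems = new Set(["decid", "chose", "commit"]);
console.log(intersectsStems(vocabStems, new Set(["i", "am", "decid"]))); // true
console.log(intersectsStems(vocabStems, new Set(["the", "review"])));    // false
```

Note that the same shape appears in the P1.3.3 challenge evaluator; the PR reuses it here as a shared pattern rather than inventing a new matcher.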
Deploying with:

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | oddkit | 8a0636b | Commit Preview URL · Branch Preview URL | Apr 20 2026, 01:55 PM |
Multi-word canon trigger phrases (committed to, going with, must not, next step, follow up, blocked by, turns out) were tokenized into individual stems and flattened into a per-type `Set<string>`. Common English function words (to, with, by, up, out, not) became standalone match triggers, causing false positives on nearly every English paragraph in `parseUnstructuredInput` and the untagged-paragraph branch of `parsePrefixedBatchInput`.

Fix: replace the flat `stemmedTokens: Set<string>` with `stemmedPhrases: string[][]` — each inner array is the ordered stem sequence of one trigger word or phrase. A type matches when ALL stems of at least one phrase co-occur in the input stem set. Single-stem phrases degenerate to set membership (identical to the old behavior for inflection matching like deciding → decid), while multi-stem phrases now preserve phrase-level semantics.

Regression anchors (smoke tests 12/13/14/15) still pass: inflection, stop-word-adjacent canon vocab, multi-type no-break, and batch first-match all retain their previous behavior.
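The failure mode and the fix can be contrasted in a minimal sketch. Toy pre-stemmed data stands in for `tokenize` output, and `flatSetMatch`/`phraseSubsetMatch` are illustrative names, not the shipped helpers:

```typescript
type Phrase = string[];

// First cut: every stem of every phrase became a standalone trigger.
function flatSetMatch(vocabStems: Set<string>, input: Set<string>): boolean {
  for (const stem of vocabStems) if (input.has(stem)) return true;
  return false;
}

// Fix: a type matches only when ALL stems of at least one phrase co-occur.
function phraseSubsetMatch(phrases: Phrase[], input: Set<string>): boolean {
  return phrases.some((phrase) => phrase.every((stem) => input.has(stem)));
}

// Decision vocab: "decided" -> ["decid"], "going with" -> ["go", "with"]
const phrases: Phrase[] = [["decid"], ["go", "with"]];
const flat = new Set(phrases.flat()); // {"decid", "go", "with"}

// A paragraph with no decision language, stemmed with stop-words disabled:
const input = new Set(["i", "will", "talk", "with", "the", "team"]);

console.log(flatSetMatch(flat, input));         // true  — false positive on "with"
console.log(phraseSubsetMatch(phrases, input)); // false — no phrase fully present
```

Single-stem phrases like `["decid"]` reduce to plain set membership, so inflection matching is unchanged; only multi-stem phrases gain the conjunction requirement.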
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Unused `intersectsStems` function is dead code — removed the unused `intersectsStems` function and updated the neighboring comment to stand alone referencing the P1.3.3 challenge evaluator; typecheck passes.
Preview (e404fe0a43)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,28 @@
## [Unreleased]
+## [0.22.0] - 2026-04-20
+
+### Changed
+
+- **`oddkit_encode` trigger-word classifier migrated from regex alternation to stemmed set intersection** (per PRD D5 from P1.3.4 — split-by-fit, same shape challenge adopted in 0.21.0 and gate adopted in 0.20.0). `EncodingTypeDef.triggerRegex: RegExp | null` is replaced with `stemmedTokens: Set<string>` — a parse product built once per canon fetch. Canon trigger vocabulary reads unchanged from `odd/encoding-types/*.md` (`## Trigger Words` fenced block); the new matcher tokenizes each vocabulary word with stop-words disabled, stems into the Set at parse time, and intersects against a stop-word-disabled stemmed input set at runtime. Inflected forms (`deciding`, `realizing`, `discovering`) now match their canonical stems (`decid`, `realiz`, `discover`) without canon having to enumerate each inflection. **Strictly additive**: every input that matched the prior regex still matches, plus stemmed variations now do too. Stop-words are disabled (empty `Set`) on both the parse-time `tokenize(word, new Set())` and the runtime `tokenize(para, new Set())` calls — canon vocab includes stop-word-adjacent phrases like `going with`, `committed to`, `must not`, `turns out`, `found that`, `next step`, `blocked by`; the P1.3.3 `from`-in-source-named precedent (C-04) demanded this pattern be replicated verbatim. Both classifier call sites preserve their existing semantics: `parsePrefixedBatchInput` untagged-paragraph path picks first match via `break` (one artifact per paragraph); `parseUnstructuredInput` emits one artifact per matching type (no `break` — the load-bearing design comment is preserved verbatim). `tokenize(para, new Set())` is hoisted out of the per-type loop at both call sites. A new `intersectsStems(vocab, input)` helper encapsulates the match test. Per `klappy://canon/principles/vodka-architecture`: fit the matcher to the problem shape (independent gap-or-not per type, multi-type allowed by design).
+
+### Removed
+
+- **Module-level `cachedEncodingTypes` in-process cache** (per PRD D9 from P1.3.4 — don't cache microsecond derivations; same pattern challenge shipped in 0.21.0 and gate shipped in 0.20.0). `cachedEncodingTypes`, `cachedEncodingTypesKnowledgeBaseUrl`, `cachedEncodingTypesSource` module-level fields deleted; cache-check short-circuit at the top of `discoverEncodingTypes` deleted; `cleanup_storage` resets for the three fields deleted. Per `klappy://canon/principles/cache-fetches-and-parses`: the fetch layer (Module Memory → Cache API → R2, 5-minute TTL) already caches the canon file content; caching the parse product for microsecond re-derivation savings is the anti-pattern the principle names. Parse runs fresh per call; overhead is sub-millisecond on hot fetches.
+
+### Added
+
+- **New smoke regression assertions in `workers/test/canon-tool-envelope.smoke.mjs`** anchoring the D5 migration: (12) stemmed inflection match — `"I'm deciding to ship two-tier cascade"` classifies as Decision (`decid` stem matches `decided` in canon vocab); (13) stop-word canon vocab survives — `"we're going with option B after the review"` matches Decision (`going with` multi-word canon vocab); (14) multi-type preservation — `"We must never deploy without tests because we decided this last week"` emits both `C` and `D` artifacts via the no-break path; (15) first-match preservation — untagged paragraph in a mixed batch emits exactly one artifact via the batch classifier's `break` semantic.
+
+### Refs
+
+- Handoff: `klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity`
+- Canon basis: `klappy://canon/principles/cache-fetches-and-parses`, `klappy://canon/principles/vodka-architecture`
+- Precedent: oddkit 0.21.1 (challenge's D5 + D9), 0.20.0 (gate's D5 + D9)
+- Shipping gate: `klappy://canon/constraints/release-validation-gate` (binding)
+- Closes the canon-parity sweep — all three tools now use stemmed set intersection and have their in-process derivation caches removed per `cache-fetches-and-parses`.
+
## [0.21.1] - 2026-04-20
### Fixed
diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
{
"name": "oddkit",
- "version": "0.21.1",
+ "version": "0.22.0",
"description": "Agent-first CLI for ODD-governed repos. Epistemic terrain rendering with portable baseline.",
"type": "module",
"bin": {
diff --git a/workers/package-lock.json b/workers/package-lock.json
--- a/workers/package-lock.json
+++ b/workers/package-lock.json
@@ -1,12 +1,12 @@
{
"name": "oddkit-mcp-worker",
- "version": "0.21.1",
+ "version": "0.22.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "oddkit-mcp-worker",
- "version": "0.21.1",
+ "version": "0.22.0",
"dependencies": {
"agents": "^0.4.1",
"fflate": "^0.8.2",
diff --git a/workers/package.json b/workers/package.json
--- a/workers/package.json
+++ b/workers/package.json
@@ -1,6 +1,6 @@
{
"name": "oddkit-mcp-worker",
- "version": "0.21.1",
+ "version": "0.22.0",
"private": true,
"type": "module",
"scripts": {
diff --git a/workers/src/orchestrate.ts b/workers/src/orchestrate.ts
--- a/workers/src/orchestrate.ts
+++ b/workers/src/orchestrate.ts
@@ -56,12 +56,23 @@
/** Internal type — handlers return this, handleUnifiedAction stamps server_time */
type ActionResult = Omit<OddkitEnvelope, "server_time">;
-// Governance-driven encoding types
+// Governance-driven encoding types. Trigger-word classification is stemmed
+// phrase-subset matching per klappy://canon/principles/vodka-architecture
+// (fit the matcher to the problem) — same D5 shape applied to challenge
+// prereqs in 0.21.0 and gate prereqs in 0.20.0. triggerWords kept for
+// debugging only; stemmedPhrases is the parse product the runtime evaluates
+// against. Each inner array is the ordered stem sequence of a single
+// trigger word or phrase; a type matches an input when ALL stems of at
+// least one phrase are present in the input's stem set. This preserves
+// phrase-level semantics (`committed to`, `going with`, `must not`,
+// `next step`, `follow up`, `blocked by`, `turns out`) so common function
+// words (`to`, `with`, `by`, `up`, `out`, `not`) do not become standalone
+// match triggers on every English paragraph.
interface EncodingTypeDef {
letter: string;
name: string;
triggerWords: string[];
- triggerRegex: RegExp | null;
+ stemmedPhrases: string[][];
qualityCriteria: Array<{ criterion: string; check: string; gapMessage: string }>;
}
@@ -79,9 +90,12 @@
priority_band?: string;
}
-let cachedEncodingTypes: EncodingTypeDef[] | null = null;
-let cachedEncodingTypesKnowledgeBaseUrl: string | undefined = undefined;
-let cachedEncodingTypesSource: "knowledge_base" | "minimal" = "minimal";
+// D9 / klappy://canon/principles/cache-fetches-and-parses — no module-level
+// cache on the parse product. fetcher.getFile / fetcher.getIndex already cache
+// the canon read (Module Memory → Cache API → R2, 5-min TTL). Re-running the
+// parse loop per request is sub-millisecond derivation work, not worth the
+// plumbing tax of a keyed cache. Same pattern challenge (0.21.0) and gate
+// (0.20.0) already applied.
// Governance-driven challenge types (E0008 — mirrors encode pattern from PR #96)
interface ChallengeTypeDef {
@@ -409,10 +423,6 @@
fetcher: KnowledgeBaseFetcher,
knowledgeBaseUrl?: string,
): Promise<{ types: EncodingTypeDef[]; source: "knowledge_base" | "minimal" }> {
- if (cachedEncodingTypes && cachedEncodingTypesKnowledgeBaseUrl === knowledgeBaseUrl) {
- return { types: cachedEncodingTypes, source: cachedEncodingTypesSource };
- }
-
const index = await fetcher.getIndex(knowledgeBaseUrl);
const typeArticles = index.entries.filter(
(entry: IndexEntry) => entry.tags?.includes("encoding-type") && entry.path.includes("encoding-types/"),
@@ -437,10 +447,28 @@
const triggerWords = triggerSection
? triggerSection[1].split(",").map((w: string) => w.trim()).filter((w: string) => w.length > 0)
: [];
- const triggerRegex =
- triggerWords.length > 0
- ? new RegExp("\\b(" + triggerWords.map((w: string) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")).join("|") + ")\\b", "i")
- : null;
+ // D5 / klappy://canon/principles/vodka-architecture — classification is
+ // stemmed phrase-subset matching, not regex alternation. Each canon
+ // trigger word/phrase is parsed once into its ordered stem sequence;
+ // runtime tokenizes input once and a type matches when ALL stems of
+ // at least one phrase are present. Inflected forms (deciding → decid,
+ // realizing → realiz) match their canonical stems without canon having
+ // to list each inflection. Stop-word filtering is disabled (empty Set)
+ // on both the parse-time and runtime tokenize() calls — canon vocab
+ // includes stop-word-adjacent phrases (`going with`, `committed to`,
+ // `must not`, `turns out`, `next step`, `blocked by`, `found that`)
+ // and dropping them would silently break the strictly-additive
+ // invariant, the same failure mode P1.3.3 hit on challenge's
+ // `from`-in-source-named vocab. Phrase-level conjunction (all stems
+ // of a phrase must match) is the precision floor: without it,
+ // ubiquitous function words like `to`/`with`/`by`/`up`/`out`/`not`
+ // would become standalone triggers on every English paragraph.
+ // Per canon/constraints/release-validation-gate and P1.3.3 C-04.
+ const stemmedPhrases: string[][] = [];
+ for (const word of triggerWords) {
+ const stems = tokenize(word, new Set());
+ if (stems.length > 0) stemmedPhrases.push(stems);
+ }
const criteriaSection = content.match(
/## Quality Criteria[\s\S]*?\| Criterion[\s\S]*?\|[-|\s]+\|\n([\s\S]*?)(?=\n\n|\n##|$)/,
@@ -459,7 +487,7 @@
}
}
- types.push({ letter, name, triggerWords, triggerRegex, qualityCriteria });
+ types.push({ letter, name, triggerWords, stemmedPhrases, qualityCriteria });
} catch {
continue;
}
@@ -495,17 +523,21 @@
["H", "Handoff", ["next session", "next step", "todo", "follow up", "blocked by"]],
["E", "Encode", ["encoded", "captured", "crystallized", "persisted", "artifact"]],
];
- resolved = defaults.map(([letter, name, words]) => ({
- letter, name, triggerWords: words,
- triggerRegex: new RegExp("\\b(" + words.join("|") + ")\\b", "i"),
- qualityCriteria: [],
- }));
+ resolved = defaults.map(([letter, name, words]) => {
+ const stemmedPhrases: string[][] = [];
+ for (const word of words) {
+ const stems = tokenize(word, new Set());
+ if (stems.length > 0) stemmedPhrases.push(stems);
+ }
+ return {
+ letter, name, triggerWords: words,
+ stemmedPhrases,
+ qualityCriteria: [],
+ };
+ });
source = "minimal";
}
- cachedEncodingTypes = resolved;
- cachedEncodingTypesKnowledgeBaseUrl = knowledgeBaseUrl;
- cachedEncodingTypesSource = source;
return { types: resolved, source };
}
@@ -1084,6 +1116,25 @@
return paragraphs.some((p) => PREFIX_TAG_REGEX.test(p));
}
+// Phrase-subset match — a phrase matches when ALL of its stems appear in the
+// input stem set. Short-circuits on the first phrase that matches. The D5
+// matcher shape for encode trigger-word classification, mirroring the shape
+// used by evaluatePrerequisiteCheck in the P1.3.3 challenge evaluator:
+// single-stem phrases degenerate to set membership (identical to the old
+// single-token behavior), while multi-stem phrases like
+// `committed to` → ["committ","to"] require both stems to co-occur, so
+// ubiquitous function words cannot match on their own.
+function matchesStemmedPhrases(phrases: string[][], input: Set<string>): boolean {
+ for (const phrase of phrases) {
+ let allPresent = true;
+ for (const stem of phrase) {
+ if (!input.has(stem)) { allPresent = false; break; }
+ }
+ if (allPresent) return true;
+ }
+ return false;
+}
+
function parsePrefixedBatchInput(input: string, types: EncodingTypeDef[]): ParsedArtifact[] {
const typeMap = new Map(types.map((t) => [t.letter, t.name]));
const paragraphs = input.split(/\n\n+/).map((p) => p.trim()).filter((p) => p.length > 0);
@@ -1118,9 +1169,16 @@
// Untagged paragraph in a batch that contains tags: classify via trigger
// words like parseUnstructuredInput, but emit one artifact per paragraph
// (not one-per-match) to preserve the author's paragraph boundaries.
+ // Stemmed set intersection mirrors parseUnstructuredInput — stop-words
+ // disabled on tokenize() both sides per P1.3.3 C-04 (canon vocab
+ // includes stop-word phrases like `going with` / `must not`).
let matched: EncodingTypeDef | null = null;
+ const inputStems = new Set(tokenize(para, new Set()));
for (const t of types) {
- if (t.triggerRegex && t.triggerRegex.test(para)) { matched = t; break; }
+ // Break on first match: this path picks one type per paragraph by
+ // design (paragraph boundaries are the author's). Unlike
+ // parseUnstructuredInput which emits one artifact per matching type.
+ if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) { matched = t; break; }
}
const pick = matched ?? types[0] ?? { letter: "D", name: "Decision" };
const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60);
@@ -1157,12 +1215,19 @@
const artifacts: ParsedArtifact[] = [];
for (const para of paragraphs) {
let matched = false;
+ // Hoist tokenize(para) out of the per-type loop — para is constant across
+ // the loop, stemmedTokens differ per type. Mirrors the P1.3.3 challenge
+ // prereq evaluator shape. Stop-words disabled (empty Set) on both parse-
+ // time and runtime tokenize() calls so canon vocab like `going with`,
+ // `must not`, `turns out`, `found that` survives on both sides. Per
+ // canon/constraints/release-validation-gate and P1.3.3 Bug #1 precedent.
+ const inputStems = new Set(tokenize(para, new Set()));
for (const t of types) {
// DESIGN: no break — a paragraph can match multiple types intentionally.
// "We must never deploy without tests" is both Decision and Constraint.
// Multi-typing at the server level mirrors what the model would do with
// separate TSV rows. Do not add a break here.
- if (t.triggerRegex && t.triggerRegex.test(para)) {
+ if (matchesStemmedPhrases(t.stemmedPhrases, inputStems)) {
const first = para.split(/[.!?\n]/)[0]?.trim() || para.slice(0, 60);
const title = first.split(/\s+/).length <= 12 ? first : first.split(/\s+/).slice(0, 8).join(" ") + "...";
artifacts.push({ type: t.letter, typeName: t.name, fields: [t.letter, title, para.trim()], title, body: para.trim() });
@@ -1518,9 +1583,10 @@
// Also clear the in-memory BM25 index
cachedBM25Index = null;
cachedBM25Entries = null;
- cachedEncodingTypes = null;
- cachedEncodingTypesKnowledgeBaseUrl = undefined;
- cachedEncodingTypesSource = "minimal";
+ // cachedEncodingTypes removed in 0.22.0 per cache-fetches-and-parses —
+ // encode's parse product is no longer cached in-process. The fetch tier
+ // (Cache API, R2) already handles canon file caching; the derivation is
+ // sub-millisecond. No reset needed here.
// E0008 — governance-driven challenge caches (mirror PR #96 fix)
cachedChallengeTypes = null;
cachedChallengeTypesKnowledgeBaseUrl = undefined;
diff --git a/workers/test/canon-tool-envelope.smoke.mjs b/workers/test/canon-tool-envelope.smoke.mjs
--- a/workers/test/canon-tool-envelope.smoke.mjs
+++ b/workers/test/canon-tool-envelope.smoke.mjs
@@ -224,6 +224,62 @@
`got: ${encodeOverride.result?.governance_source}`,
);
+ // P1.3.4 D5 regression anchors — stemmed set intersection replaces regex
+ // alternation on the encode classifier. These assertions exist because
+ // the pre-refactor literal regex path could not match inflections of
+ // canon vocab (`deciding` does not match `decided` under `\bdecided\b`),
+ // and the P1.3.3 Bug #1 precedent showed that tokenize()'s default
+ // stop-word filter silently breaks multi-word canon vocab (`going with`,
+ // `committed to`, `must not`). The assertions are numbered (12)–(15) to
+ // continue the sequence P1.3.3 established at (10)/(11).
+ console.log(`\n─── oddkit_encode: (12) stemmed inflection match (D5 landed) ───`);
+ const encodeInflection = await callTool("oddkit_encode", {
+ input: "I'm deciding to ship the two-tier cascade",
+ });
+ expectFullEnvelope("oddkit_encode (inflection match)", encodeInflection);
+ const inflectionTypes = (encodeInflection.result?.artifacts ?? []).map((a) => a.type);
+ ok(
+ "oddkit_encode: (12) `deciding` (inflection of `decided`) classifies as Decision via stem intersection",
+ inflectionTypes.includes("D"),
+ `got artifact types: ${inflectionTypes.join(",")}`,
+ );
+
+ console.log(`\n─── oddkit_encode: (13) stop-word canon vocab survives tokenize (P1.3.3 C-04 ported) ───`);
+ const encodeStopWord = await callTool("oddkit_encode", {
+ input: "we're going with option B after the review",
+ });
+ expectFullEnvelope("oddkit_encode (stop-word survival)", encodeStopWord);
+ const stopWordTypes = (encodeStopWord.result?.artifacts ?? []).map((a) => a.type);
+ ok(
+ "oddkit_encode: (13) `going with` (multi-word canon vocab containing stop-word `with`) matches Decision",
+ stopWordTypes.includes("D"),
+ `got artifact types: ${stopWordTypes.join(",")}`,
+ );
+
+ console.log(`\n─── oddkit_encode: (14) multi-type no-break preservation (L1161 design comment) ───`);
+ const encodeMultiType = await callTool("oddkit_encode", {
+ input: "We must never deploy without tests because we decided this last week",
+ });
+ expectFullEnvelope("oddkit_encode (multi-type)", encodeMultiType);
+ const multiTypeTypes = (encodeMultiType.result?.artifacts ?? []).map((a) => a.type);
+ ok(
+ "oddkit_encode: (14) paragraph matching both Constraint and Decision emits both artifact types (no-break path)",
+ multiTypeTypes.includes("C") && multiTypeTypes.includes("D"),
+ `got artifact types: ${multiTypeTypes.join(",")}`,
+ );
+
+ console.log(`\n─── oddkit_encode: (15) first-match preservation in batch-untagged path ───`);
+ const encodeBatchUntagged = await callTool("oddkit_encode", {
+ input: "[D] explicit decision tag on first paragraph\n\nwe must always write tests before we decided on TDD",
+ });
+ expectFullEnvelope("oddkit_encode (batch first-match)", encodeBatchUntagged);
+ const batchArtifacts = encodeBatchUntagged.result?.artifacts ?? [];
+ ok(
+ "oddkit_encode: (15) batch with tagged + untagged paragraphs emits exactly 2 artifacts (first-match path picks one type per untagged paragraph)",
+ batchArtifacts.length === 2,
+ `got length: ${batchArtifacts.length}; types: ${batchArtifacts.map((a) => a.type).join(",")}`,
+ );
+
// Tool 5: oddkit_challenge — canon-driven, four governance surfaces.
// Full envelope + governance_source + governance_uris (plural, per PRD D4 —
// shape diverges from encode by design because challenge reads four peer
Reviewed by Cursor Bugbot for commit 113ba11.
The Bugbot autofix on commit 113ba11 is the canonical disposition of the high-severity finding on 259170a. This commit ports the orchestrator-drafted CHANGELOG [0.22.0] entry and the smoke regression anchor onto the autofix.

CHANGELOG: rewritten to describe the phrase-subset match the autofix actually landed (ALL stems of at least one phrase must co-occur in the input stem set, any order). The previous draft described a consecutive-subsequence variant that was never shipped. Subset-match is simpler, one uniform structure, and aligned with encode's multi-type tolerance philosophy.

Smoke assertion (16): traced against the autofix semantics. The input "I need to wait until tomorrow for the review" contains `to`, but no Decision phrase has all its stems present (`decid`/`decis`/`chose` absent; `[committ,to]` fails on `committ`; `[go,with]` fails on both), and no Handoff phrase has all its stems present (`[next,step]`, `[follow,up]`, `[block,by]`, `[wait,on]` all fail on their second stem; `todo`, `continu`, `remain`, `handoff` singletons absent). A revision that re-flattens the matcher back to standalone-singleton triggers would fail this assertion.

Disposition record for PR #126 Bugbot finding:
- Finding: multi-word vocab flattening produces universal function-word triggers (high severity, 2026-04-20T12:55:03Z)
- Fix-forward: Cursor autofix commit 113ba11 (phrase-subset match)
- Orchestrator's proposed alternative: stricter consecutive-subsequence match — not shipped; subset-match is the simpler, correct design for encode's multi-type philosophy

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- Smoke assertions (12)–(16) traced against autofix semantics, all hold
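The assertion-(16) trace above can be sketched directly. The stem set and phrase lists below paraphrase the comment's own enumeration (they are illustrative, not the full canon vocabulary):

```typescript
// Stems of "I need to wait until tomorrow for the review" with stop-words
// disabled — illustrative hand-stemmed set.
const inputStems = new Set([
  "i", "need", "to", "wait", "until", "tomorrow", "for", "the", "review",
]);

// Decision and Handoff phrase stem sequences as enumerated in the trace.
const decisionPhrases = [["decid"], ["chose"], ["committ", "to"], ["go", "with"]];
const handoffPhrases = [
  ["next", "step"], ["follow", "up"], ["block", "by"], ["wait", "on"], ["todo"],
];

// Phrase-subset semantics: ALL stems of at least one phrase must be present.
const matches = (phrases: string[][]) =>
  phrases.some((p) => p.every((stem) => inputStems.has(stem)));

console.log(matches(decisionPhrases)); // false — "to" alone is not enough
console.log(matches(handoffPhrases));  // false — "wait" present but "on" absent
```

A re-flattened singleton matcher would report `true` for both (on `to` and `wait` respectively), which is exactly what assertion (16) guards against.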
Main shipped 0.22.0 via PR #128 while this branch was in Sonnet 4.6 validator dispatch. PR #128 backfilled the CHANGELOG + version bump covering the envelope-conformance fixes from PR #124 (telemetry_public) and PR #125 (catalog generated_at). Per klappy://canon/constraints/release-validation-gate Rule 3 (canon outranks session artifacts), this refactor is re-versioned to 0.23.0. The handoff's "ship as 0.22.0" recommendation was session-scoped; main-reality is the canon.

Resolution:
- CHANGELOG.md: the encode D5+D9 content moves to a new [0.23.0] section above the existing [0.22.0] (telemetry + catalog); added a version-note blockquote explaining the bump.
- package.json / workers/package.json / both lockfiles: 0.22.0 → 0.23.0.
- workers/src/orchestrate.ts: auto-merged cleanly (the catalog fix touched runCatalog, the encode refactor touched discoverEncodingTypes + classifier call sites; zero overlap).
- workers/test/canon-tool-envelope.smoke.mjs: auto-merged cleanly (additive on both sides).

Verified:
- tsc --noEmit clean
- governance-parser.test.mjs 105/105 pass
- CHANGELOG structure: [Unreleased] [0.23.0] [0.22.0] [0.21.1] [0.21.0]...
- All conflict markers removed

The Sonnet 4.6 validator verdict (session sesn_011CaF5vqjgzN7Mw8s84qvK9, PASS on all 5 corroborations) remains valid for the encode refactor content — the rebase does not touch any matcher code or action behavior. A fresh-context validator re-dispatch will run before promotion per release-validation-gate Rule 2, out of canon-discipline caution.
The d2acf91 merge commit bumped this refactor from 0.22.0 to 0.23.0 per release-validation-gate Rule 3, but the inline comment on the removed cachedEncodingTypes reset block still said "removed in 0.22.0". Updated to reflect the actual shipping version. No functional change.
klappy added a commit that referenced this pull request on Apr 20, 2026
Promotes oddkit 0.23.0 to prod: the P1.3.4 encode canon-parity refactor. Closes the sweep — all three tools now use stemmed matching and have their in-process derivation caches removed per klappy://canon/principles/cache-fetches-and-parses.

Release validation gate (klappy://canon/constraints/release-validation-gate):
- Rule 1 — Bugbot completed on all merged SHAs (feat PR #126): 259170a/neutral→fixed, 113ba11/neutral→fixed, e404fe0/success, eaa1234/success, 8a0636b/success; promotion head 7542cbb: success.
- Rule 2 — Independent fresh-context validators:
  - Feat validator: agent_011CaF5vo8B5UpqtfZAmSeui, session sesn_011CaF5vqjgzN7Mw8s84qvK9 against eaa1234 — PASS on all 5 corroborations
  - Promotion validator: agent_011CaF9tvJgRXQ6F96MtN4iu, session sesn_011CaF9tx18Af3z1Fy9trwz8 against 7542cbb — PASS on all 5 corroborations, smoke 223/0 × 3
- Rule 3 — the handoff's 0.22.0 recommendation was overridden by main-reality (PR #128/#129 shipped 0.22.0 envelope fixes while this was in validator dispatch); rebased forward to 0.23.0 per canon-outranks-session-artifacts.

Non-blocking carry-forward: P13 — parseUnstructuredInput fallback-to-types[0] behavior for inputs with no canon vocab intersection. Pre-existing, surfaced by both validators, outside P1.3.4 scope.

P1.3.4 — Encode Canon-Parity Refactor (D5 + D9)
Migrates `oddkit_encode`'s trigger-word classifier from regex alternation to stemmed phrase-subset matching — the last regex matcher in the canon-parity sweep. Same D5 family challenge adopted in 0.21.0 and gate adopted in 0.20.0, adapted for encode's phrasal vocabulary.

Closes the sweep. After this lands, all three tools (`oddkit_encode`, `oddkit_challenge`, `oddkit_gate`) use stemmed matching and have their in-process derivation caches removed per `klappy://canon/principles/cache-fetches-and-parses`.

Commit history on this PR
- 259170a — first cut: flat `stemmedTokens: Set<string>`
- 113ba11 — Cursor autofix: phrase-subset match
- e404fe0 — removed the unused `intersectsStems` helper
- eaa1234 — CHANGELOG entry + smoke regression anchors ported onto the autofix

Bugbot finding disposition (Rule 1)
Finding: high-severity on 259170a, 2026-04-20T12:55:03Z, posted by cursor[bot]. Confirmed. The first-cut implementation flattened multi-word canon phrases into individual stems. With stop-words disabled (required per P1.3.3 C-04 so canon vocab survives), function-word stems like `to`, `with`, `by`, `up`, `out` became universal match triggers on every English paragraph.

Disposition: fix-forward in this PR via Cursor autofix commit 113ba11. `EncodingTypeDef.stemmedTokens: Set<string>` replaced with `stemmedPhrases: string[][]`, where each inner array is the ordered stem sequence of a single canon trigger entry. The runtime matcher `matchesStemmedPhrases` declares a type match only when ALL stems of at least one phrase co-occur in the input stem set. Single-stem phrases degenerate to set membership (inflection matching still works); multi-stem phrases require stem conjunction. Subsequent autofix commit e404fe0 removed the now-unused `intersectsStems` helper.

Scope landed
Item 1 — D5 stemmed phrase-subset matcher (`workers/src/orchestrate.ts`)
- `EncodingTypeDef.triggerRegex: RegExp | null` removed, replaced with `stemmedPhrases: string[][]`. `triggerWords: string[]` kept for debugging visibility (per handoff).
- `tokenize(word, new Set())` — stop-words disabled per P1.3.3 C-04. Each stem array is stored as-is.
- `matchesStemmedPhrases(phrases, inputStems)` helper: iterates phrases; declares match on the first phrase whose stems are all present in the input set.
- `parsePrefixedBatchInput` untagged-paragraph path: `inputStems = new Set(tokenize(para, new Set()))` hoisted above the per-type loop; `matchesStemmedPhrases` replaces `triggerRegex.test`; `break` preserved — this path picks one type per paragraph by design.
- `parseUnstructuredInput`: same hoist + same match substitution; no `break` — the L1161–1164 DESIGN comment preserved verbatim.

Item 2 — D9 cache removal (`workers/src/orchestrate.ts`)
- `cachedEncodingTypes`, `cachedEncodingTypesKnowledgeBaseUrl`, `cachedEncodingTypesSource` deleted.
- Cache-check short-circuit in `discoverEncodingTypes` deleted.
- `cleanup_storage` resets for the three fields deleted.
- Per `klappy://canon/principles/cache-fetches-and-parses`.

Item 3 — Smoke regression assertions (`workers/test/canon-tool-envelope.smoke.mjs`)
- `"I'm deciding to ship the two-tier cascade"` → Decision (via `decid` stem-singleton matching `decided` in canon vocab)
- `"we're going with option B after the review"` → Decision (via `[go, with]` phrase having both stems present)
- `"We must never deploy without tests because we decided this last week"` → both `C` and `D` artifacts
- `[D]` tag + untagged paragraph with multiple triggers → exactly 2 artifacts
- `"I need to wait until tomorrow for the review"` → NEITHER Decision NOR Handoff. Bugbot-finding regression anchor; would fail if the matcher was re-flattened to standalone singletons.
tsc --noEmitclean on every commitnode test/governance-parser.test.mjs= 105/105 pass on every commitRelease validation gate attestation
Per
klappy://canon/constraints/release-validation-gate:259170awith one high-severity finding; fix-forwarded in commits113ba11,e404fe0,eaa1234. Bugbot will be re-polled on headeaa1234before merge; if any finding remains, will be dispositioned before merge.workers/src/orchestrate.ts+ matcher algorithm +oddkit_encodeaction behavior = load-bearing surface. A Sonnet 4.6 read-only validator session will be dispatched via Managed Agents before themain → prodpromotion PR merges. 5-corroboration pattern from P1.3.1, adapted per the P1.3.4 handoff's C1–C5 spec for encode.Refs
- `klappy://odd/handoffs/2026-04-20-p1-3-4-encode-canon-parity`
- `klappy://odd/ledger/2026-04-20-p1-3-3-challenge-canon-parity-landed`
- `klappy://canon/principles/cache-fetches-and-parses`, `klappy://canon/principles/vodka-architecture`
- `klappy://canon/constraints/release-validation-gate`

Note
Medium Risk
Changes core encode classification logic and caching behavior, which can affect artifact typing and downstream workflows if the new matcher has edge cases or performance regressions.
Overview
`oddkit_encode`'s trigger-word classifier is refactored from per-type regex alternation to stemmed phrase-subset matching (`EncodingTypeDef.triggerRegex` → `stemmedPhrases` + `matchesStemmedPhrases`), with stop-word filtering disabled and per-paragraph tokenization hoisted to preserve multi-type vs first-match semantics.
discoverEncodingTypesparse-product cache is removed (and relatedcleanup_storageresets deleted), relying on the existing fetch-layer caching instead.Adds new
canon-tool-envelope.smoke.mjsassertions to lock in inflection/phrase behavior and prevent function-word false positives, updates the changelog accordingly, and bumps package versions to0.23.0.Reviewed by Cursor Bugbot for commit 8a0636b. Bugbot is set up for automated code reviews on this repo. Configure here.