
feat(telemetry): semantic names for telemetry_public — hide blob*/double* from consumers #137

Merged
klappy merged 6 commits into main from feat/telemetry-semantic-names
Apr 26, 2026

Conversation

klappy (Owner) commented Apr 24, 2026

Summary

Consumers of telemetry_public now write SQL using semantic names only. Raw slot names (blob1–blob9, double1–double6) are hidden from the consumer interface entirely.

  • Raw slot names in queries → rejected with a helpful error naming the semantic equivalent
  • Semantic names in queries → transparently rewritten to raw slots before forwarding to Cloudflare Analytics Engine
  • Result columns → rewritten back from raw slots to semantic names before returning
  • Tool docstring → semantic names only, two example queries, zero mentions of blob*/double*

Per maintainer directive: "no deprecation, nobody uses them yet."
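The mapping behind this round trip can be sketched with the schema-map builder this PR exports — the shape below mirrors `buildSchemaMapFromArrays` from the diff further down, reproduced here as a self-contained illustration rather than the real module:

```typescript
// Bidirectional slot↔semantic mapping, mirroring the SchemaMap shape in the PR diff.
type SchemaMap = {
  rawToSemantic: Map<string, string>; // blob1 → event_type
  semanticToRaw: Map<string, string>; // event_type → blob1
};

function buildSchemaMapFromArrays(
  blobNames: readonly string[],
  doubleNames: readonly string[],
): SchemaMap {
  const rawToSemantic = new Map<string, string>();
  const semanticToRaw = new Map<string, string>();
  // Slot names are positional: index 0 → blob1/double1, index 1 → blob2/double2, …
  blobNames.forEach((name, i) => {
    rawToSemantic.set(`blob${i + 1}`, name);
    semanticToRaw.set(name, `blob${i + 1}`);
  });
  doubleNames.forEach((name, i) => {
    rawToSemantic.set(`double${i + 1}`, name);
    semanticToRaw.set(name, `double${i + 1}`);
  });
  return { rawToSemantic, semanticToRaw };
}

const map = buildSchemaMapFromArrays(
  ["event_type", "method", "tool_name"],
  ["count", "duration_ms"],
);
console.log(map.semanticToRaw.get("tool_name")); // "blob3"
console.log(map.rawToSemantic.get("double2"));   // "duration_ms"
```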


Semantic Names Exposed

| Semantic name | Slot | Description |
| --- | --- | --- |
| event_type | blob1 | "mcp_request" \| "tool_call" |
| method | blob2 | JSON-RPC method |
| tool_name | blob3 | oddkit action name |
| consumer_label | blob4 | best-effort caller identity |
| consumer_source | blob5 | how label was resolved |
| knowledge_base_url | blob6 | which repo is being served |
| document_uri | blob7 | for get calls, the URI requested |
| worker_version | blob8 | oddkit version string |
| cache_tier | blob9 | which storage tier served the index |
| count | double1 | always 1 |
| duration_ms | double2 | full request wall-clock |
| bytes_in | double3 | UTF-8 byte length of request |
| bytes_out | double4 | UTF-8 byte length of response |
| tokens_in | double5 | cl100k_base token count of request |
| tokens_out | double6 | cl100k_base token count of response |

Vodka Architecture Compliance

The schema mapping is sourced from canon at runtime via KnowledgeBaseFetcher.getFile("canon/constraints/telemetry-governance.md") — the same content-addressed R2/ZIP/memory cache mechanism all other canon-loading paths use. Specifically:

  • Doubles: parsed from the "Numeric Values (Doubles)" table in the governance doc (backtick-quoted identifiers in the Value column)
  • Blobs: positional from hardcoded baseline (the canon table uses human-readable dimension names, not machine-readable identifiers, so positional is the only reliable signal)
  • Fallback: hardcoded BASELINE_BLOB/DOUBLE_SEMANTIC_NAMES arrays when canon is unreachable — the safety net per vodka architecture
  • Module cache: cachedSchemaMap — reset between isolate restarts, same pattern as other canon-derived data
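The canon-first/fallback pattern described above can be sketched as follows. `fetchCanonDoc` is a hypothetical stand-in for `KnowledgeBaseFetcher.getFile`; the table-row regex mirrors the one in the PR diff:

```typescript
// Canon-first load of double names with hardcoded baseline as the safety net.
const BASELINE_DOUBLES = [
  "count", "duration_ms", "bytes_in", "bytes_out", "tokens_in", "tokens_out",
];

// Module-level cache — reset between Worker isolate restarts.
let cached: string[] | null = null;

async function getDoubleNames(
  fetchCanonDoc: () => Promise<string | null>, // stand-in for KnowledgeBaseFetcher.getFile
): Promise<string[]> {
  if (cached) return cached;
  let names = BASELINE_DOUBLES;
  try {
    const doc = await fetchCanonDoc();
    const parsed = doc ? parseDoubles(doc) : null;
    if (parsed) names = parsed;
  } catch {
    // Canon unreachable — fall through to the baseline (vodka safety net).
  }
  cached = names;
  return cached;
}

function parseDoubles(doc: string): string[] | null {
  // Table data rows: "| 1 | `duration_ms` | …" — Value may be plain (Count)
  // or backtick-quoted (`duration_ms`).
  const rows = [...doc.matchAll(/^\|\s*\d+\s*\|\s*(`[^`]+`|[A-Za-z_]\w*)\s*\|/gm)];
  const names = rows.map((m) => m[1].replace(/`/g, "").toLowerCase());
  // Sanity check: only trust a parse that yields the expected slot count.
  return names.length === BASELINE_DOUBLES.length ? names : null;
}
```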

Phase 1 Audit Findings

Live data query (last 7 days, SUM(_sample_interval) = 10,301 rows):

| Field | Population |
| --- | --- |
| bytes_in (double3) | ✅ Non-zero — sum=41,545 |
| bytes_out (double4) | ✅ Non-zero — sum=1,316,480 |
| tokens_in (double5) | ✅ Non-zero — sum=12,602 |
| tokens_out (double6) | ✅ Non-zero — sum=325,627 |
| cache_tier (blob9) | ⚠️ Always "none" — written but always default |

Open issue (not in scope): cache_tier always reads "none" in the last 7 days. The blob9 slot is being written by recordTelemetry with the cacheTier param, but the write-path call sites appear to always pass "none" or no value. This is a write-path concern — the read interface is correct, the signal is just always the default value. Filed as an observation, not fixed here.

Previous docstring gap: The old docstring showed blob1–blob8 and double1–double2 only — bytes_in, bytes_out, tokens_in, tokens_out, and cache_tier were invisible to consumers even under the raw-slot scheme. The semantic view is complete.


Files Changed

  • workers/src/telemetry.ts (+260 lines): SchemaMap type; BASELINE_BLOB/DOUBLE_SEMANTIC_NAMES; buildSchemaMapFromArrays; parseDoublesFromCanon (regex parses the governance doc table); getSchemaMap (async, canon-first, module-level cache); detectRawSlotNames; rewriteSqlToRaw; rewriteResultToSemantic; queryTelemetry updated to wire all rewriting. Pure functions exported for testability.

  • workers/src/index.ts (~20 lines changed): telemetry_public docstring rewritten — semantic names, descriptions, two example queries. No blob*/double* anywhere.

  • workers/test/telemetry-integration.test.mjs (+140 lines): zip-baseline-fetcher.ts added to compile scope; compiled JS patched for Node ESM .js extension resolution; 8 new unit tests for the semantic rewriting layer.
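The ESM patch mentioned for the test harness can be sketched as a pure function (the actual harness applies the same two regex replaces to every compiled .js file in the build dir, per the diff below):

```typescript
// TypeScript's bundler moduleResolution emits extensionless local imports,
// but Node's ESM resolver requires explicit .js extensions — patch them in.
function patchEsmImports(src: string): string {
  return src
    // Newer Node requires an import attribute on JSON imports in ESM.
    .replace(
      /from ["']\.\.\/package\.json["'];/g,
      'from "../package.json" with { type: "json" };',
    )
    // Extensionless relative imports: ./foo → ./foo.js
    .replace(/from ["'](\.\/[^"'.]+)["'];/g, 'from "$1.js";');
}

console.log(
  patchEsmImports('import { KnowledgeBaseFetcher } from "./zip-baseline-fetcher";'),
);
// → import { KnowledgeBaseFetcher } from "./zip-baseline-fetcher.js";
```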


Test Evidence

telemetry integration tests (full write path)

  bytes_in=93 bytes_out=115 tokens_in=31 tokens_out=45
✓ oddkit_time tool call lands a complete telemetry record
  bytes_out=8633 (~8.4KB) tokens_out=1785
✓ oddkit_search with realistic ~8KB response — measurements are sane
✓ SSE response (empty body) records bytes_out=0 and tokens_out=0
✓ batch JSON-RPC produces one data point per message
✓ detectRawSlotNames: returns null for clean semantic query
✓ detectRawSlotNames: rejects blob1 with helpful message
✓ detectRawSlotNames: rejects double5 with helpful message
✓ rewriteSqlToRaw: translates all blob semantic names
✓ rewriteSqlToRaw: translates all double semantic names
✓ rewriteSqlToRaw: knowledge_base_url does not clobber shorter substrings
✓ rewriteResultToSemantic: renames blob/double columns in meta and data
✓ rewriteResultToSemantic: passes through non-slot result unchanged
✓ malformed JSON-RPC is silently dropped (telemetry never throws)
✓ missing env.ODDKIT_TELEMETRY is a graceful no-op

14 passed, 0 failed

npx tsc --noEmit — zero errors.


Smoke Tests (Phase 3 — post-deploy)

To be run against the deployed staging URL after CI passes:

  1. Semantic query works: SELECT tool_name, SUM(_sample_interval) AS calls FROM oddkit_telemetry WHERE timestamp > NOW() - INTERVAL 7 DAY GROUP BY tool_name ORDER BY calls DESC LIMIT 5 → returns rows with tool_name column (not blob3)

  2. Raw name rejected: SELECT blob3 FROM oddkit_telemetry LIMIT 1 → error: "Raw column names are not allowed. Use semantic names instead: blob3 → tool_name..."

  3. Result columns semantic: any query selecting blob/double columns returns those columns renamed to semantic names in meta and data.


Note

Medium Risk
Introduces SQL rewriting/validation logic for telemetry_public queries, which can subtly break query behavior or error handling if the rewrite rules miss edge cases. Scope is contained to telemetry querying and is backed by expanded integration/unit tests.

Overview
telemetry_public now exposes semantic telemetry column names only (e.g. tool_name, duration_ms) and removes blob*/double* from the public interface, including updated tool docs and example queries.

Telemetry querying (queryTelemetry) now loads a canon-derived schema mapping (with hardcoded fallback), rewrites consumer SQL from semantic→raw, rejects any raw slot references (while ignoring matches inside string literals), and rewrites Cloudflare Analytics Engine results back to semantic column names.
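The raw→semantic result rename can be sketched against the CF Analytics Engine response shape (`{ meta: [{name, type}], data: [row] }`); `renameColumns` is an illustrative stand-in for the PR's `rewriteResultToSemantic`:

```typescript
// Rename raw slot columns back to semantic names in both meta and data rows.
function renameColumns(
  result: { meta: { name: string; type: string }[]; data: Record<string, unknown>[] },
  rawToSemantic: Map<string, string>,
) {
  const remap = new Map<string, string>(); // old column name → semantic name
  const meta = result.meta.map((col) => {
    const semantic = rawToSemantic.get(col.name);
    if (semantic) {
      remap.set(col.name, semantic);
      return { ...col, name: semantic };
    }
    return col; // non-slot columns (e.g. aggregates) pass through unchanged
  });
  const data = result.data.map((row) =>
    Object.fromEntries(
      Object.entries(row).map(([k, v]) => [remap.get(k) ?? k, v]),
    ),
  );
  return { ...result, meta, data };
}

const renamed = renameColumns(
  { meta: [{ name: "blob3", type: "String" }], data: [{ blob3: "search" }] },
  new Map([["blob3", "tool_name"]]),
);
// meta[0].name is now "tool_name"; the data row key is renamed to match.
```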

Tests expand the telemetry integration harness to compile additional dependencies and patch emitted ESM imports, and add coverage for raw-slot detection, SQL rewrite edge cases (e.g. count(*), literals), and result-column renaming.

Reviewed by Cursor Bugbot for commit 3e48b7a.

…ble* slots

Consumers now write SQL using semantic names only. Raw slot names (blob1..9,
double1..6) are rejected with a helpful error pointing to the semantic
equivalent. Result column names are rewritten raw→semantic before returning
to the consumer.

Schema mapping is sourced from canon at runtime (vodka architecture):
  - KnowledgeBaseFetcher.getFile('canon/constraints/telemetry-governance.md')
  - Doubles table parsed for backtick-quoted identifiers (e.g. duration_ms)
  - Blob names are positional from the hardcoded baseline (canon table uses
    human-readable dimension names, not machine-readable identifiers)
  - Falls back to hardcoded baseline on canon fetch failure

Semantic names exposed:
  event_type, method, tool_name, consumer_label, consumer_source,
  knowledge_base_url, document_uri, worker_version, cache_tier (blobs)
  count, duration_ms, bytes_in, bytes_out, tokens_in, tokens_out (doubles)

Breaking: blob*/double* column names are no longer accepted. Per maintainer
directive: no deprecation, nobody uses them yet.

Changes:
- workers/src/telemetry.ts: SchemaMap type, baseline arrays, getSchemaMap
  (async, canon-first with baseline fallback), detectRawSlotNames,
  rewriteSqlToRaw, rewriteResultToSemantic; queryTelemetry wired to rewrite
  SQL in and results out; pure functions exported for testability.
- workers/src/index.ts: telemetry_public docstring rewritten to semantic names
  only, two example queries, no blob*/double* visible to consumers.
- workers/test/telemetry-integration.test.mjs: added zip-baseline-fetcher.ts
  to compile scope, patched compiled JS for Node ESM .js resolution, added
  8 new semantic-rewriting unit tests (14/14 total pass).

Phase 1 audit findings:
  - All fields populated: bytes_in/out and tokens_in/out non-zero in last 7
    days (sums: bytes_in=41545, bytes_out=1316480, tokens_in=12602,
    tokens_out=325627 over 10301 sampled rows).
  - cache_tier (blob9) always reads 'none': write-path concern, out of scope.
  - Previous docstring was missing double3-6 and blob9 entirely.
cloudflare-workers-and-pages (Bot) commented Apr 24, 2026

Deploying with Cloudflare Workers


| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | oddkit | 3e48b7a | Apr 26 2026, 02:09 AM |

Comment thread workers/src/telemetry.ts
…ollision

The semantic name count maps to double1, and the previous word-boundary
regex would rewrite SQL count(*) to double1(*), which is invalid SQL that
the CF Analytics Engine API rejects. Add a negative lookahead (?!\s*\()
so function-call positions are skipped; column references (e.g. SUM(count))
still rewrite correctly.
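The collision described above is easy to reproduce with two regexes — the plain word-boundary rewrite versus the lookahead-guarded one (assumed minimal repro, not the PR's full rewrite loop):

```typescript
const sql = "SELECT count(*) , SUM(count) FROM t";

// Naive word-boundary rewrite clobbers the aggregate function:
const naive = sql.replace(/\bcount\b/g, "double1");
console.log(naive); // SELECT double1(*) , SUM(double1) FROM t  ← invalid SQL

// Negative lookahead (?!\s*\() skips function-call positions;
// column references like SUM(count) still rewrite correctly:
const guarded = sql.replace(/\bcount\b(?!\s*\()/g, "double1");
console.log(guarded); // SELECT count(*) , SUM(double1) FROM t
```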
Comment thread workers/src/telemetry.ts
cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Raw slot detection scans inside string literals unlike rewrite
    • Updated detectRawSlotNames to strip single-quoted string literals (with doubled-quote escape handling) before scanning for raw slot names, matching the scoping rules already used by rewriteSqlToRaw.
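The literal-stripping rule the autofix applies can be sketched in isolation — replace every single-quoted literal (respecting SQL's doubled-quote `''` escape) with an empty literal before scanning for raw slot names:

```typescript
// Strip single-quoted SQL literals so values containing substrings like
// "blob1" never trigger a false rejection. '' inside a literal is an
// escaped quote, not a terminator.
const literalPattern = /'(?:[^']|'')*'/g;

const sql =
  "SELECT tool_name FROM t WHERE document_uri = 'klappy://blob1/it''s-fine'";
const scannable = sql.replace(literalPattern, "''");

// The raw-slot scan now sees nothing — blob1 lived only inside the literal.
const hasRawSlot = /\b(blob[1-9]|double[1-9])\b/i.test(scannable);
console.log(hasRawSlot); // false
```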
Preview (815585d028)
diff --git a/odd/ledger/learnings.jsonl b/odd/ledger/learnings.jsonl
--- a/odd/ledger/learnings.jsonl
+++ b/odd/ledger/learnings.jsonl
@@ -38,3 +38,5 @@
 {"id":"learn-20260412-0001","timestamp":"2026-04-12T00:52:00Z","summary":"Standalone Worker tools (telemetry, time) bypass orchestrate pipeline — they share oddkit_ MCP prefix but register directly in createServer with their own handler. CLI parity requires adding to TOOLS array (auto-cascades) plus explicit param threading in cli.js and server.js","trigger":"architecture","impact":"New standalone tools need 5 files touched: index.ts (Worker registration), tool-registry.js (TOOLS entry), actions.js (handler), server.js (param threading), cli.js (param threading). The TOOLS auto-derivation handles enum/listing but not param plumbing.","confidence":0.95,"sources":["workers/src/index.ts","src/core/tool-registry.js","src/core/actions.js","src/mcp/server.js","src/cli.js"],"evidence":[{"type":"artifact","ref":"PR #87 — oddkit_time implementation across 5 files"}],"candidate_targets":[],"proposed_escalation":"none"}
 {"id":"L39","timestamp":"2026-04-13T11:12:00Z","type":"learning","summary":"raw.githubusercontent.com URL parsing must rejoin all path segments after owner/repo to support branch names with slashes — parts[2] truncates multi-segment refs like publish/four-essays-and-skill to just publish","context":"extractBranchRef() and getZipUrl() in zip-baseline-fetcher.ts both used parts[2] which only captured the first segment of a slash-containing branch name, causing 404s on both SHA resolution and ZIP download","resolution":"Changed to parts.slice(2).join(\"/\") in both functions — minimal 2-line fix"}
 {"type":"D","summary":"E0008 challenge governance refactor: replaced hardcoded detectClaimType logic in runChallengeAction with four governance-driven fetch functions (discoverChallengeTypes, fetchBasePrerequisites, fetchNormativeVocabulary, fetchStakesCalibration). Voice-dump suppression invariant is load-bearing — questionTiers.length === 0 short-circuits all output. Four new caches cleared in runCleanupStorage. tsc clean. PR #100.","rationale":"Hardcoded challenge logic cannot evolve with governance articles; governance-driven extraction means challenge behavior updates when articles update, no code change required. Mirrors PR #96 encode precedent exactly.","context":"workers/src/orchestrate.ts, branch feat/e0008-challenge-governance-driven, commit aa4445c","date":"2026-04-17"}
+{"date": "2026-04-24", "epoch": "E0008", "task": "feat/telemetry-semantic-names", "summary": "TypeScript bundler moduleResolution omits .js extensions on local imports in compiled output \u2014 Node.js ESM resolver requires explicit .js suffix. When compiling telemetry.ts for integration tests, all compiled .js files in the build dir must be post-processed to add .js to extensionless relative imports. Patch all files in the build dir, not just telemetry.js.", "detail": "telemetry.ts now imports KnowledgeBaseFetcher (a value import, not just a type import) from zip-baseline-fetcher.ts. The existing integration test only compiled tokenize.ts and telemetry.ts. Adding zip-baseline-fetcher.ts to the tsconfig include list is necessary but insufficient \u2014 the compiled JS has extensionless imports (./zip-baseline-fetcher, ./tracing) that Node ESM cannot resolve. Must patch all .js files in the build dir with a regex replace of from \"./foo\" -> from \"./foo.js\".", "pr": "https://github.com/klappy/oddkit/pull/137"}
+{"date": "2026-04-24", "epoch": "E0008", "task": "feat/telemetry-semantic-names", "summary": "JSDoc block comments must not contain */ sequences \u2014 they terminate the comment prematurely. Patterns like blob*/double* in a JSDoc comment cause TypeScript parse errors. Use blob1..9/double1..6 or similar notation instead.", "detail": "detectRawSlotNames JSDoc had blob*/double* which the TypeScript parser reads as end-of-comment at the first */. tsc reported TS1109 (Expression expected) at line 459. The fix is trivial but the error message is cryptic \u2014 the real cause is invisible until you stare at the raw characters.", "pr": "https://github.com/klappy/oddkit/pull/137"}

diff --git a/workers/src/index.ts b/workers/src/index.ts
--- a/workers/src/index.ts
+++ b/workers/src/index.ts
@@ -425,20 +425,31 @@
 
 Dataset: oddkit_telemetry (Cloudflare Analytics Engine)
 Schema:
-  blob1  — event_type      "mcp_request" | "tool_call"
-  blob2  — method          JSON-RPC method (e.g. "tools/call")
-  blob3  — tool_name       oddkit action (e.g. "orient", "search")
-  blob4  — consumer_label  best-effort caller identity
-  blob5  — consumer_source how label was resolved (e.g. "user-agent")
-  blob6  — knowledge_base_url       which knowledge base is being served
-  blob7  — document_uri    for get calls, the klappy:// URI requested
-  blob8  — worker_version  oddkit version string
-  double1 — count          always 1
-  double2 — duration_ms    request processing time
-  index1 — sampling_key    consumer label
+  event_type        — "mcp_request" | "tool_call"
+  method            — JSON-RPC method (e.g. "tools/call")
+  tool_name         — oddkit action (e.g. "orient", "search")
+  consumer_label    — best-effort caller identity
+  consumer_source   — how label was resolved (e.g. "user-agent")
+  knowledge_base_url — which knowledge base is being served
+  document_uri      — for get calls, the klappy:// URI requested
+  worker_version    — oddkit version string
+  cache_tier        — which storage tier served the index
+  count             — always 1 (use SUM for aggregation)
+  duration_ms       — request processing time (full wall-clock at worker edge)
+  bytes_in          — UTF-8 byte length of the request body
+  bytes_out         — UTF-8 byte length of the response body (0 for SSE streams)
+  tokens_in         — cl100k_base token count of the request body
+  tokens_out        — cl100k_base token count of the response body
+  index1            — sampling key (consumer label)
 
 Use SUM(_sample_interval) instead of COUNT(*) to account for Analytics Engine sampling.
-Time filter example: WHERE timestamp > NOW() - INTERVAL '30' DAY`,
+Time filter example: WHERE timestamp > NOW() - INTERVAL '30' DAY
+
+Example — tool leaderboard:
+  SELECT tool_name, SUM(_sample_interval) AS calls FROM oddkit_telemetry WHERE timestamp > NOW() - INTERVAL '30' DAY GROUP BY tool_name ORDER BY calls DESC LIMIT 10
+
+Example — payload shape by tool:
+  SELECT tool_name, AVG(tokens_out) AS avg_tokens_out FROM oddkit_telemetry WHERE timestamp > NOW() - INTERVAL '7' DAY GROUP BY tool_name ORDER BY avg_tokens_out DESC`,
     {
       sql: z.string().describe("Analytics Engine SQL query against the oddkit_telemetry dataset."),
     },

diff --git a/workers/src/telemetry.ts b/workers/src/telemetry.ts
--- a/workers/src/telemetry.ts
+++ b/workers/src/telemetry.ts
@@ -55,6 +55,7 @@
  */
 
 import type { Env } from "./zip-baseline-fetcher";
+import { KnowledgeBaseFetcher } from "./zip-baseline-fetcher";
 import type { PayloadShape } from "./tokenize";
 import pkg from "../package.json";
 
@@ -301,6 +302,294 @@
 }
 
 // ──────────────────────────────────────────────────────────────────────────────
+// Semantic schema mapping (vodka-architecture compliant)
+//
+// Cloudflare Analytics Engine uses positional slot names (blob1..9, double1..6).
+// Consumers see semantic names only. This module:
+//   1. Loads the authoritative mapping from canon at runtime
+//      (canon/constraints/telemetry-governance.md, parsed via KnowledgeBaseFetcher)
+//   2. Falls back to the hardcoded baseline below if canon is unreachable
+//   3. Rewrites consumer SQL (semantic → raw) before forwarding to CF
+//   4. Rewrites result column names (raw → semantic) before returning to consumer
+//   5. Rejects any query containing raw blob*/double* names with a helpful error
+//
+// Source of truth: klappy://canon/constraints/telemetry-governance
+// ──────────────────────────────────────────────────────────────────────────────
+
+export interface SchemaMap {
+  /** raw slot name → semantic name (e.g. blob1 → event_type) */
+  rawToSemantic: Map<string, string>;
+  /** semantic name → raw slot name (e.g. event_type → blob1) */
+  semanticToRaw: Map<string, string>;
+}
+
+/**
+ * Canonical blob→semantic ordering derived from canon table
+ * "Structural Dimensions (Blobs)" (positional, 9 entries).
+ * The canon doc uses human-readable dimension names in that table, not
+ * machine-readable column identifiers, so positional order is the only
+ * reliable parse signal. These names mirror what the tool docstring exposes.
+ */
+const BASELINE_BLOB_SEMANTIC_NAMES = [
+  "event_type",       // blob1
+  "method",           // blob2
+  "tool_name",        // blob3
+  "consumer_label",   // blob4
+  "consumer_source",  // blob5
+  "knowledge_base_url", // blob6
+  "document_uri",     // blob7
+  "worker_version",   // blob8
+  "cache_tier",       // blob9
+] as const;
+
+/**
+ * Baseline double→semantic names. Canon table "Numeric Values (Doubles)"
+ * encodes these with backtick-quoted identifiers (except "Count" → "count")
+ * which are parseable at runtime. Baseline is the safety net.
+ */
+const BASELINE_DOUBLE_SEMANTIC_NAMES = [
+  "count",       // double1
+  "duration_ms", // double2
+  "bytes_in",    // double3
+  "bytes_out",   // double4
+  "tokens_in",   // double5
+  "tokens_out",  // double6
+] as const;
+
+/** Build a SchemaMap from ordered blob/double name arrays. Exported for unit testing. */
+export function buildSchemaMapFromArrays(
+  blobNames: readonly string[],
+  doubleNames: readonly string[],
+): SchemaMap {
+  const rawToSemantic = new Map<string, string>();
+  const semanticToRaw = new Map<string, string>();
+
+  blobNames.forEach((name, i) => {
+    const slot = `blob${i + 1}`;
+    rawToSemantic.set(slot, name);
+    semanticToRaw.set(name, slot);
+  });
+
+  doubleNames.forEach((name, i) => {
+    const slot = `double${i + 1}`;
+    rawToSemantic.set(slot, name);
+    semanticToRaw.set(name, slot);
+  });
+
+  return { rawToSemantic, semanticToRaw };
+}
+
+/**
+ * Attempt to parse semantic double names from the canon governance document.
+ * The "Numeric Values (Doubles)" table in canon uses backtick-quoted identifiers
+ * in the "Value" column (e.g. `duration_ms`, `bytes_in`). Row 1 uses "Count"
+ * (unquoted) which we lowercase to "count".
+ *
+ * Returns null if the section is missing or has too few rows to be trusted.
+ */
+function parseDoublesFromCanon(content: string): string[] | null {
+  // Match the Numeric Values section up to the next ## or ### heading
+  const sectionMatch = content.match(
+    /###\s+Numeric Values \(Doubles\)([\s\S]*?)(?=\n###|\n##|$)/,
+  );
+  if (!sectionMatch) return null;
+
+  const section = sectionMatch[1];
+
+  // Match table data rows: | # | Value | ...
+  // Value may be plain text (Count) or backtick-quoted (`duration_ms`)
+  const rowPattern = /^\|\s*\d+\s*\|\s*(`[^`]+`|[A-Za-z_][A-Za-z0-9_]*)\s*\|/gm;
+  const names: string[] = [];
+  let m: RegExpExecArray | null;
+  while ((m = rowPattern.exec(section)) !== null) {
+    const raw = m[1].replace(/`/g, "").trim().toLowerCase();
+    names.push(raw);
+  }
+
+  // Sanity: must match expected baseline count
+  if (names.length !== BASELINE_DOUBLE_SEMANTIC_NAMES.length) return null;
+
+  return names;
+}
+
+/** Module-level cache — reset between Worker isolate restarts (Cloudflare normal). */
+let cachedSchemaMap: SchemaMap | null = null;
+
+/**
+ * Return the schema map, loading from canon on first call.
+ * Canon-derived doubles names take precedence over baseline; blob names are
+ * always positional from the baseline (the canon table uses human-readable
+ * dimension names, not machine-readable identifiers).
+ */
+async function getSchemaMap(env: Env): Promise<SchemaMap> {
+  if (cachedSchemaMap) return cachedSchemaMap;
+
+  let doubleNames: readonly string[] = BASELINE_DOUBLE_SEMANTIC_NAMES;
+
+  try {
+    const fetcher = new KnowledgeBaseFetcher(env);
+    const content = await fetcher.getFile(
+      "canon/constraints/telemetry-governance.md",
+    );
+    if (content) {
+      const parsed = parseDoublesFromCanon(content);
+      if (parsed) {
+        doubleNames = parsed;
+      }
+    }
+  } catch {
+    // Canon unreachable — fall through to baseline (vodka architecture safety net)
+  }
+
+  cachedSchemaMap = buildSchemaMapFromArrays(
+    BASELINE_BLOB_SEMANTIC_NAMES,
+    doubleNames,
+  );
+  return cachedSchemaMap;
+}
+
+// ──────────────────────────────────────────────────────────────────────────────
+// SQL query rewriting
+// ──────────────────────────────────────────────────────────────────────────────
+
+/** Raw slot name pattern — used to detect forbidden column references. */
+const RAW_SLOT_PATTERN = /\b(blob[1-9]|double[1-9])\b/gi;
+
+/**
+ * Reject queries that contain raw slot names (blob1..9 / double1..6).
+ * Returns an error message string, or null if the query is clean.
+ *
+ * Single-quoted string literals are skipped so that values like
+ * `'https://example.com/blob1/readme'` do not trigger a false rejection.
+ * This matches the scoping rules used by `rewriteSqlToRaw`. SQL's
+ * doubled-quote escape (`''`) is respected so that escaped quotes do not
+ * terminate the literal prematurely.
+ * Exported for unit testing.
+ */
+export function detectRawSlotNames(
+  sql: string,
+  schemaMap: SchemaMap,
+): string | null {
+  // Strip single-quoted string literals before scanning for raw slot names
+  // so filter values containing substrings like `blob1` are not flagged.
+  const literalPattern = /'(?:[^']|'')*'/g;
+  const scannable = sql.replace(literalPattern, "''");
+  const matches = scannable.match(RAW_SLOT_PATTERN);
+  if (!matches) return null;
+
+  const unique = [...new Set(matches.map((m) => m.toLowerCase()))];
+  const suggestions = unique
+    .map((raw) => {
+      const semantic = schemaMap.rawToSemantic.get(raw);
+      return semantic ? `\`${raw}\` → \`${semantic}\`` : `\`${raw}\` (unmapped)`;
+    })
+    .join(", ");
+
+  return (
+    `Raw column names are not allowed. Use semantic names instead: ${suggestions}. ` +
+    `See the telemetry_public tool description for the full schema.`
+  );
+}
+
+/**
+ * Rewrite a SQL query from semantic names to raw Analytics Engine slot names.
+ * Semantic names are matched on word boundaries to avoid partial replacements.
+ * Longer names are replaced first to prevent prefix collisions (e.g.
+ * knowledge_base_url vs url).
+ *
+ * A negative lookahead `(?!\s*\()` prevents rewriting identifiers that are
+ * immediately followed by an opening parenthesis — i.e. SQL function calls.
+ * This is required because the semantic name `count` collides with the SQL
+ * aggregate `count()`; without the guard, `count(*)` would become
+ * `double1(*)` and be rejected by the CF API. Column references never
+ * take a `(` after them, so this is safe for all semantic names.
+ *
+ * Single-quoted string literals are skipped so that values like
+ * `'klappy://sources/scientific-method'` are not corrupted (word boundaries
+ * around `-` / `/` would otherwise cause `method` to be rewritten to the raw
+ * slot name inside the literal). SQL's doubled-quote escape (`''`) is
+ * respected so that escaped quotes do not terminate the literal prematurely.
+ * Exported for unit testing.
+ */
+export function rewriteSqlToRaw(sql: string, schemaMap: SchemaMap): string {
+  // Sort by semantic name length descending to avoid prefix collisions
+  const entries = [...schemaMap.semanticToRaw.entries()].sort(
+    (a, b) => b[0].length - a[0].length,
+  );
+
+  const rewriteSegment = (segment: string): string => {
+    let out = segment;
+    for (const [semantic, raw] of entries) {
+      // \b word-boundary anchors prevent partial matches inside longer identifiers.
+      // Negative lookahead (?!\s*\() skips function-call positions (e.g. count(*)).
+      const pattern = new RegExp(`\\b${semantic}\\b(?!\\s*\\()`, "g");
+      out = out.replace(pattern, raw);
+    }
+    return out;
+  };
+
+  // Split SQL into alternating non-literal and single-quoted literal segments.
+  // Only non-literal segments are subject to rewriting, so user-supplied
+  // filter values passed as string literals are preserved verbatim.
+  const literalPattern = /'(?:[^']|'')*'/g;
+  let rewritten = "";
+  let lastIndex = 0;
+  let match: RegExpExecArray | null;
+  while ((match = literalPattern.exec(sql)) !== null) {
+    rewritten += rewriteSegment(sql.slice(lastIndex, match.index));
+    rewritten += match[0];
+    lastIndex = match.index + match[0].length;
+  }
+  rewritten += rewriteSegment(sql.slice(lastIndex));
+  return rewritten;
+}
+
+/**
+ * Rewrite result column names from raw slot names back to semantic names.
+ * Operates on the `meta` array (CF Analytics Engine response format) and
+ * rewrites the corresponding keys in each `data` row.
+ * Exported for unit testing.
+ */
+export function rewriteResultToSemantic(
+  result: unknown,
+  schemaMap: SchemaMap,
+): unknown {
+  if (typeof result !== "object" || result === null) return result;
+
+  const r = result as Record<string, unknown>;
+  if (!Array.isArray(r.meta)) return result;
+
+  type MetaCol = { name: string; type: string };
+  const oldMeta = r.meta as MetaCol[];
+
+  // Build a remapping from old column name → semantic name (only for slots)
+  const colRemap = new Map<string, string>();
+  const newMeta: MetaCol[] = oldMeta.map((col) => {
+    const semantic = schemaMap.rawToSemantic.get(col.name);
+    if (semantic && semantic !== col.name) {
+      colRemap.set(col.name, semantic);
+      return { ...col, name: semantic };
+    }
+    return col;
+  });
+
+  if (colRemap.size === 0) return result; // nothing to rename
+
+  // Rewrite data rows
+  const newData = Array.isArray(r.data)
+    ? (r.data as Array<Record<string, unknown>>).map((row) => {
+        const newRow: Record<string, unknown> = {};
+        for (const [key, val] of Object.entries(row)) {
+          newRow[colRemap.get(key) ?? key] = val;
+        }
+        return newRow;
+      })
+    : r.data;
+
+  return { ...r, meta: newMeta, data: newData };
+}
+
+// ──────────────────────────────────────────────────────────────────────────────
 // Analytics Engine SQL query
 // ──────────────────────────────────────────────────────────────────────────────
 
@@ -345,10 +634,17 @@
 }
 
 /**
- * Query Analytics Engine SQL API.
+ * Query Analytics Engine SQL API with semantic-name rewriting.
  * Used by telemetry_public tool.
  * Requires CF_ACCOUNT_ID and CF_API_TOKEN env vars.
  * Only permits SELECT queries against the oddkit_telemetry dataset.
+ *
+ * Semantic-name contract:
+ *   - Consumers write SQL using semantic names (event_type, tool_name, etc.)
+ *   - Raw slot names (blob1, double2, etc.) are rejected with a helpful error
+ *   - The schema mapping is loaded from canon at runtime; the hardcoded baseline
+ *     is the safety net when canon is unreachable (vodka architecture)
+ *   - Result columns are renamed from raw slots back to semantic names
  */
 export async function queryTelemetry(env: Env, query: string): Promise<unknown> {
   if (!env.CF_ACCOUNT_ID || !env.CF_API_TOKEN) {
@@ -357,7 +653,19 @@
     };
   }
 
-  const validationError = validateTelemetryQuery(query);
+  // Load schema map (canon-first, baseline fallback)
+  const schemaMap = await getSchemaMap(env);
+
+  // Reject raw slot names before any other validation
+  const rawSlotError = detectRawSlotNames(query, schemaMap);
+  if (rawSlotError) {
+    return { error: rawSlotError };
+  }
+
+  // Rewrite semantic names → raw slot names for the CF API
+  const rawQuery = rewriteSqlToRaw(query, schemaMap);
+
+  const validationError = validateTelemetryQuery(rawQuery);
   if (validationError) {
     return { error: validationError };
   }
@@ -370,9 +678,12 @@
         Authorization: `Bearer ${env.CF_API_TOKEN}`,
         "Content-Type": "text/plain",
       },
-      body: query,
+      body: rawQuery,
     },
   );
 
-  return response.json();
+  const result = await response.json();
+
+  // Rewrite result column names from raw slots back to semantic names
+  return rewriteResultToSemantic(result, schemaMap);
 }

diff --git a/workers/test/telemetry-integration.test.mjs b/workers/test/telemetry-integration.test.mjs
--- a/workers/test/telemetry-integration.test.mjs
+++ b/workers/test/telemetry-integration.test.mjs
@@ -49,6 +49,7 @@
   include: [
     join(WORKERS_ROOT, "src", "tokenize.ts"),
     join(WORKERS_ROOT, "src", "telemetry.ts"),
+    join(WORKERS_ROOT, "src", "zip-baseline-fetcher.ts"),
   ],
 };
 const tsconfigPath = join(tmp, "tsconfig.json");
@@ -74,7 +75,8 @@
 // actually need weren't emitted.
 const tokenizeJs = join(tmp, "build", "tokenize.js");
 const telemetryJs = join(tmp, "build", "telemetry.js");
-if (!existsSync(tokenizeJs) || !existsSync(telemetryJs)) {
+const zipFetcherJs = join(tmp, "build", "zip-baseline-fetcher.js");
+if (!existsSync(tokenizeJs) || !existsSync(telemetryJs) || !existsSync(zipFetcherJs)) {
   console.error("TypeScript compile failed (target files not emitted):");
   console.error(compile.stdout);
   console.error(compile.stderr);
@@ -86,14 +88,25 @@
 }
 
 // Newer Node requires `with { type: "json" }` on JSON imports in ESM.
-// TypeScript doesn't add this — patch it in.
-const { readFileSync, writeFileSync: wf } = await import("node:fs");
-let telemetrySrc = readFileSync(telemetryJs, "utf8");
-telemetrySrc = telemetrySrc.replace(
-  /from ["']\.\.\/package\.json["'];/g,
-  'from "../package.json" with { type: "json" };',
-);
-wf(telemetryJs, telemetrySrc);
+// TypeScript bundler moduleResolution omits .js extensions on local imports.
+// Node.js ESM resolver requires explicit extensions — patch all compiled files.
+const { readFileSync, writeFileSync: wf, readdirSync: rds } = await import("node:fs");
+const buildDir = join(tmp, "build");
+for (const f of rds(buildDir).filter(n => n.endsWith(".js"))) {
+  const fpath = join(buildDir, f);
+  let src = readFileSync(fpath, "utf8");
+  // Patch JSON imports
+  src = src.replace(
+    /from ["']\.\.\/package\.json["'];/g,
+    'from "../package.json" with { type: "json" };',
+  );
+  // Patch extensionless local imports (TypeScript bundler mode omits .js)
+  src = src.replace(
+    /from ["'](\.\/[^"'.]+)["'];/g,
+    'from "$1.js";',
+  );
+  wf(fpath, src);
+}
 
 const { measurePayloadShape } = await import(tokenizeJs);
 const { recordTelemetry } = await import(telemetryJs);
@@ -276,6 +289,139 @@
   }
 });
 
+// ─── Semantic schema rewriting tests ──────────────────────────────────────
+
+const {
+  buildSchemaMapFromArrays,
+  detectRawSlotNames,
+  rewriteSqlToRaw,
+  rewriteResultToSemantic,
+} = await import(telemetryJs);
+
+// Build a test schema map (mirrors the production baseline)
+const TEST_BLOB_NAMES = [
+  "event_type", "method", "tool_name", "consumer_label", "consumer_source",
+  "knowledge_base_url", "document_uri", "worker_version", "cache_tier",
+];
+const TEST_DOUBLE_NAMES = [
+  "count", "duration_ms", "bytes_in", "bytes_out", "tokens_in", "tokens_out",
+];
+const testMap = buildSchemaMapFromArrays(TEST_BLOB_NAMES, TEST_DOUBLE_NAMES);
+
+await test("detectRawSlotNames: returns null for clean semantic query", async () => {
+  const result = detectRawSlotNames(
+    "SELECT tool_name, SUM(_sample_interval) FROM oddkit_telemetry GROUP BY tool_name",
+    testMap,
+  );
+  assert.equal(result, null, "clean query should return null");
+});
+
+await test("detectRawSlotNames: rejects blob1 with helpful message", async () => {
+  const result = detectRawSlotNames(
+    "SELECT blob1, blob3 FROM oddkit_telemetry",
+    testMap,
+  );
+  assert.ok(result !== null, "should return error string");
+  assert.ok(result.includes("blob1"), "error should mention the raw name");
+  assert.ok(result.includes("event_type"), "error should suggest semantic name");
+  assert.ok(result.includes("tool_name"), "error should suggest tool_name for blob3");
+});
+
+await test("detectRawSlotNames: rejects double5 with helpful message", async () => {
+  const result = detectRawSlotNames(
+    "SELECT SUM(double5) AS x FROM oddkit_telemetry",
+    testMap,
+  );
+  assert.ok(result !== null, "should return error string");
+  assert.ok(result.includes("double5"), "error should mention the raw name");
+  assert.ok(result.includes("tokens_in"), "error should suggest semantic name");
+});
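The rejection path these two tests exercise can be sketched roughly like this — a hypothetical shape, assuming `schemaMap` maps semantic → raw so it must be inverted for suggestions, and ignoring the literal-skip refinement covered elsewhere in this PR:

```javascript
// Hypothetical sketch of detectRawSlotNames: find blobN/doubleN references
// and name the semantic equivalent in the error; null means the query is clean.
function detectRawSlotNames(sql, schemaMap) {
  const rawToSemantic = Object.fromEntries(
    Object.entries(schemaMap).map(([sem, raw]) => [raw, sem]),
  );
  const hits = [...sql.matchAll(/\b(?:blob|double)\d+\b/g)]
    .map((m) => m[0])
    .filter((raw) => raw in rawToSemantic);
  if (hits.length === 0) return null;
  return hits
    .map((raw) => `${raw} is internal; use ${rawToSemantic[raw]} instead`)
    .join("; ");
}
```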
+
+await test("rewriteSqlToRaw: translates all blob semantic names", async () => {
+  const sql = "SELECT event_type, method, tool_name, consumer_label, consumer_source, knowledge_base_url, document_uri, worker_version, cache_tier FROM oddkit_telemetry";
+  const rewritten = rewriteSqlToRaw(sql, testMap);
+  assert.ok(rewritten.includes("blob1"), "event_type → blob1");
+  assert.ok(rewritten.includes("blob2"), "method → blob2");
+  assert.ok(rewritten.includes("blob3"), "tool_name → blob3");
+  assert.ok(rewritten.includes("blob6"), "knowledge_base_url → blob6");
+  assert.ok(rewritten.includes("blob9"), "cache_tier → blob9");
+  assert.ok(!rewritten.includes("event_type"), "event_type should be gone");
+});
+
+await test("rewriteSqlToRaw: translates all double semantic names", async () => {
+  const sql = "SELECT SUM(count) AS n, AVG(duration_ms), SUM(bytes_in), SUM(bytes_out), AVG(tokens_in), AVG(tokens_out) FROM oddkit_telemetry";
+  const rewritten = rewriteSqlToRaw(sql, testMap);
+  assert.ok(rewritten.includes("double1"), "count → double1");
+  assert.ok(rewritten.includes("double2"), "duration_ms → double2");
+  assert.ok(rewritten.includes("double3"), "bytes_in → double3");
+  assert.ok(rewritten.includes("double4"), "bytes_out → double4");
+  assert.ok(rewritten.includes("double5"), "tokens_in → double5");
+  assert.ok(rewritten.includes("double6"), "tokens_out → double6");
+  assert.ok(!rewritten.includes("duration_ms"), "duration_ms should be gone");
+  assert.ok(!rewritten.includes("tokens_out"), "tokens_out should be gone");
+});
+
+await test("rewriteSqlToRaw: knowledge_base_url doesn't clobber shorter substrings", async () => {
+  // 'url' as alias should not be mistaken for a semantic column name
+  // and 'knowledge_base_url' should replace as a whole unit
+  const sql = "SELECT knowledge_base_url AS url FROM oddkit_telemetry";
+  const rewritten = rewriteSqlToRaw(sql, testMap);
+  assert.ok(rewritten.includes("blob6"), "knowledge_base_url → blob6");
+  assert.ok(rewritten.includes("AS url"), "alias 'url' should be untouched");
+});
+
+await test("rewriteSqlToRaw: count() SQL aggregate is not rewritten to double1()", async () => {
+  // `count` is both a semantic column name (double1) and a SQL aggregate
+  // function. Rewriting `count(*)` to `double1(*)` would produce invalid SQL
+  // that CF rejects. A function-call guard (negative lookahead for `(`) keeps
+  // the aggregate intact while still rewriting column references to `count`.
+  const sql = "SELECT tool_name, count(*) AS n FROM oddkit_telemetry GROUP BY tool_name";
+  const rewritten = rewriteSqlToRaw(sql, testMap);
+  assert.ok(rewritten.includes("count(*)"), "count(*) aggregate should be preserved");
+  assert.ok(!rewritten.includes("double1(*)"), "count(*) must not become double1(*)");
+  assert.ok(rewritten.includes("blob3"), "tool_name should still rewrite to blob3");
+
+  // Lowercase count( with whitespace also preserved
+  const sql2 = "SELECT count (DISTINCT tool_name) FROM oddkit_telemetry";
+  const rewritten2 = rewriteSqlToRaw(sql2, testMap);
+  assert.ok(!rewritten2.includes("double1 ("), "count (DISTINCT ...) must not be rewritten");
+
+  // But a bare `count` column reference (no paren) still rewrites
+  const sql3 = "SELECT SUM(count) AS n FROM oddkit_telemetry";
+  const rewritten3 = rewriteSqlToRaw(sql3, testMap);
+  assert.ok(rewritten3.includes("SUM(double1)"), "count as column reference should still rewrite to double1");
+});
+
+await test("rewriteResultToSemantic: renames blob/double columns in meta and data", async () => {
+  const rawResult = {
+    meta: [
+      { name: "blob3", type: "String" },
+      { name: "double2", type: "Float64" },
+      { name: "total", type: "UInt64" },
+    ],
+    data: [
+      { blob3: "search", double2: 123.4, total: "42" },
+      { blob3: "orient", double2: 88.0, total: "17" },
+    ],
+    rows: 2,
+  };
+  const result = rewriteResultToSemantic(rawResult, testMap);
+  assert.deepEqual(result.meta[0], { name: "tool_name", type: "String" }, "blob3 → tool_name in meta");
+  assert.deepEqual(result.meta[1], { name: "duration_ms", type: "Float64" }, "double2 → duration_ms in meta");
+  assert.deepEqual(result.meta[2], { name: "total", type: "UInt64" }, "non-slot column unchanged");
+  assert.equal(result.data[0].tool_name, "search", "data row key renamed");
+  assert.equal(result.data[0].duration_ms, 123.4, "double2 key renamed");
+  assert.equal(result.data[0].total, "42", "non-slot key unchanged");
+  assert.ok(!("blob3" in result.data[0]), "old key blob3 removed");
+  assert.ok(!("double2" in result.data[0]), "old key double2 removed");
+});
+
+await test("rewriteResultToSemantic: passes through non-slot result unchanged", async () => {
+  const rawResult = { error: "bad query" };
+  const result = rewriteResultToSemantic(rawResult, testMap);
+  assert.deepEqual(result, rawResult, "error result passed through unchanged");
+});
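The result-direction rewrite asserted above can be sketched as follows — an illustrative shape, assuming the Analytics Engine response carries `meta` (column descriptors) and `data` (row objects), with anything else passed through untouched:

```javascript
// Hypothetical sketch of rewriteResultToSemantic: rename raw slot keys back
// to semantic names in both `meta` and `data`; non-slot columns are kept.
function rewriteResultToSemantic(result, schemaMap) {
  const rawToSemantic = Object.fromEntries(
    Object.entries(schemaMap).map(([sem, raw]) => [raw, sem]),
  );
  if (!result || !Array.isArray(result.meta) || !Array.isArray(result.data)) {
    return result; // e.g. { error: ... } — pass through unchanged
  }
  return {
    ...result,
    meta: result.meta.map((col) => ({
      ...col,
      name: rawToSemantic[col.name] ?? col.name,
    })),
    data: result.data.map((row) =>
      Object.fromEntries(
        Object.entries(row).map(([k, v]) => [rawToSemantic[k] ?? k, v]),
      ),
    ),
  };
}
```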
+
 // ─── Test 5: Malformed JSON-RPC gets dropped silently ──────────────────────
 
 await test("malformed JSON-RPC is silently dropped (telemetry never throws)", async () => {

Reviewed by Cursor Bugbot for commit b3a9c29.
Reviewed by Cursor Bugbot for commit b3a9c29.

cursoragent and others added 2 commits April 24, 2026 18:37
The three Cursor Agent fix commits on this PR addressed all three
Bugbot findings correctly, but only the count() collision (Bug #1)
got dedicated test coverage. Bugs #2 and #3 — both about literal
handling — landed without regression tests, so a future refactor
that reverts the literal-skip logic would not be caught by CI.

This commit closes the gap.

Test 1: rewriteSqlToRaw — semantic names inside single-quoted literals
  - 'klappy://sources/scientific-method' must not have method → blob2
  - the SQL doubled-quote escape '' must leave words inside the literal untouched
  - Mixed case: column ref outside literal still rewrites; same word
    inside a literal stays untouched

Test 2: detectRawSlotNames — raw slot names inside literals do not
        trigger rejection
  - 'https://example.com/blob1/readme' must not be falsely rejected
  - 'klappy://reports/double5-summary' likewise
  - Sanity guard: bare blob1 outside any literal STILL gets rejected
  - Mixed case: raw slot outside a literal is rejected even when
    another raw slot appears inside a literal in the same query

All 17 tests pass locally:
  17 passed, 0 failed
  (was 15 — added 2)

This addresses the test-coverage gap I noted in the PR review, not
a code defect — the fixes themselves are correct and these tests
verify they hold.
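The literal-skip logic these tests lock in can be sketched as a split-rewrite-stitch pass — a minimal illustration, assuming `rewriteOutsideLiterals` is a hypothetical helper name and the real code may structure this differently:

```javascript
// Hypothetical sketch of literal-aware rewriting: split the SQL on
// single-quoted literals (handling the '' escape), apply the rewrite only
// to the segments outside quotes, and stitch the pieces back together.
function rewriteOutsideLiterals(sql, rewrite) {
  // Matches a complete single-quoted literal, including doubled '' escapes.
  const literal = /'(?:[^']|'')*'/g;
  let out = "";
  let last = 0;
  for (const m of sql.matchAll(literal)) {
    out += rewrite(sql.slice(last, m.index)); // outside a literal: rewrite
    out += m[0]; // inside a literal: keep verbatim
    last = m.index + m[0].length;
  }
  return out + rewrite(sql.slice(last));
}

rewriteOutsideLiterals(
  "SELECT method FROM t WHERE uri = 'klappy://sources/scientific-method'",
  (s) => s.replace(/\bmethod\b/g, "blob2"),
);
// → "SELECT blob2 FROM t WHERE uri = 'klappy://sources/scientific-method'"
```

Running both `rewriteSqlToRaw` and `detectRawSlotNames` through a wrapper like this is one way to get the behavior the tests above require: identifiers outside literals are rewritten or rejected, while the same words inside literals are left alone.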
@klappy klappy merged commit 33cceee into main Apr 26, 2026
5 checks passed
@klappy klappy deleted the feat/telemetry-semantic-names branch April 26, 2026 02:27