fix(telemetry): blob8 falls back to BUILD_VERSION; clarify duration_ms semantics #131
…s semantics
blob8 (worker_version) was logging as the literal 'unknown' for 100% of tool calls in production. Root cause: env.ODDKIT_VERSION is only injected by 'npm run deploy' via --var, but Cloudflare's auto-deploy from GitHub invokes wrangler directly from wrangler.toml and never executes the deploy script. Other sites (index.ts, orchestrate.ts) already fall back to pkg.version; telemetry.ts was missed.
Fix: import pkg from ../package.json (mirroring index.ts pattern), define BUILD_VERSION = pkg.version, use it as the fallback. Telemetry now reports a real semver under the canonical deploy path.
Also clarifies the duration_ms docstring. The schema previously said 'request processing time (measured by caller)', which under-described the measurement. The value is full MCP request wall-clock measured at the worker edge: V8 cold-start, KB fetch, MCP SDK overhead, and action handler compute are all included. This is NOT the per-action debug.duration_ms in tool envelopes (which measures only the action handler's internal compute). The discrepancy explains why telemetry shows oddkit_time avg 269ms / max 9362ms while debug.duration_ms reports near-zero.
No behavior change to duration_ms measurement. Documentation only.
See: klappy://canon/constraints/telemetry-governance
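As a rough illustration of the two points in this commit message, the sketch below builds one telemetry record with a version fallback and an edge-level duration, next to a handler-level duration. It is not the code in telemetry.ts. The inlined pkg object (standing in for the package.json import), the Env, TelemetryPoint and ActionResult shapes, the injected clock, and the helper names resolveWorkerVersion, runAction and handleRequest are all assumptions made for the example. The use of ?? rather than || for the fallback is also a guess.

```typescript
// Hypothetical stand-in for `import pkg from "../package.json"`.
const pkg = { version: "0.23.0" };
const BUILD_VERSION: string = pkg.version;

interface Env {
  // Per the commit message, only set when deploying via `npm run deploy`.
  ODDKIT_VERSION?: string;
}

// Hypothetical shape: only the two fields discussed here.
interface TelemetryPoint {
  blob8: string; // worker_version
  double2: number; // duration_ms, measured at the worker edge
}

interface ActionResult {
  debug: { duration_ms: number }; // handler-internal compute only
}

// Injected clock so the example can be exercised deterministically.
type Clock = () => number;

// Prefer the deploy-time override, otherwise use the bundled version.
function resolveWorkerVersion(env: Env): string {
  return env.ODDKIT_VERSION ?? BUILD_VERSION;
}

// Times only the action handler's own compute.
function runAction(now: Clock, compute: () => void): ActionResult {
  const start = now();
  compute();
  return { debug: { duration_ms: now() - start } };
}

// Times the whole request: work before the handler plus the handler itself.
function handleRequest(
  env: Env,
  now: Clock,
  setup: () => void,
  compute: () => void,
): { result: ActionResult; point: TelemetryPoint } {
  const requestStart = now();
  setup(); // stands in for cold start, KB fetch, MCP SDK overhead
  const result = runAction(now, compute);
  const point: TelemetryPoint = {
    blob8: resolveWorkerVersion(env),
    double2: now() - requestStart,
  };
  return { result, point };
}
```

With an empty env and a setup step that takes 250 ms of clock time while the action itself takes none, point.blob8 is the package version, point.double2 is 250, and result.debug.duration_ms is 0. That matches the shape of the oddkit_time discrepancy described above.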
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | oddkit | fbe51e4 | Commit Preview URL, Branch Preview URL | Apr 21 2026, 05:50 PM |
Independent Validator Verdict (Sonnet 4.6, Managed Agents, fresh context)
Session:
OVERALL: PARTIAL (Safe to Promote)
Per
Bugbot: Clean (zero findings, conclusion=success).
Finding (non-blocking, carry-forward)
Telemetry instrumentation category is undefined in release-validation-gate canon. This PR sits in the ambiguous zone: it doesn't change user-facing behavior, but it changes what data is collected (blob8 shifts from literal
The author's
Recommendation: Carry-forward as O-open to amend
Disposition
Validator artifacts
Full reports in validator session filesystem at
Independent validator dispatched by the PR author's orchestrator session per release-validation-gate Rule 2. The author cannot validate their own work; this comment records the fresh-context reviewer's verdict. The PR body has been amended to retract the incorrect "non-load-bearing" framing.
Independent Validator: Smoke Round 2 (DNS retry)
Timestamp: 2026-04-21T21:20:05Z (≈30 min after initial round, DNS cache expected to have cleared)
Health checks
✅ Both
| Target | Run 1 | Run 2 | Run 3 | Rate |
|---|---|---|---|---|
| Preview | ✅ PASS (server_time=2026-04-21T21:20:08.317Z) | ❌ 503 DNS cache overflow | ✅ PASS (server_time=2026-04-21T21:20:08.724Z) | 2/3 |
| Production | ✅ PASS (server_time=2026-04-21T21:20:09.082Z) | ✅ PASS (server_time=2026-04-21T21:20:09.168Z) | ❌ 503 DNS cache overflow | 2/3 |
Analysis
- Round 1: Preview 0/3, Production 2/3 — preview appeared broken.
- Round 2: Preview 2/3, Production 2/3. The identical failure rate on both targets indicates that the DNS cache overflow is a validator-environment artifact, not a preview-specific fault.
- Both successful preview runs returned proper MCP envelopes with server_time.
- The /health endpoint returns version: 0.23.0 on both targets, consistent with the BUILD_VERSION fallback working (env.ODDKIT_VERSION still unset on auto-deploy, fallback in effect).
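The pass-rate comparison behind this analysis can be written out as a small sketch. The tallies are taken from the two rounds reported above. The Tally shape and the passRate and gap helpers are hypothetical names introduced only for this illustration.

```typescript
// Hypothetical tally of smoke runs for one target in one round.
interface Tally {
  passed: number;
  total: number;
}

interface Round {
  preview: Tally;
  production: Tally;
}

// Counts as reported in the analysis above.
const rounds: { round1: Round; round2: Round } = {
  round1: { preview: { passed: 0, total: 3 }, production: { passed: 2, total: 3 } },
  round2: { preview: { passed: 2, total: 3 }, production: { passed: 2, total: 3 } },
};

// Fraction of runs that passed; 0 when there were no runs.
function passRate(t: Tally): number {
  return t.total === 0 ? 0 : t.passed / t.total;
}

// Positive gap means production passed more often than preview.
function gap(r: Round): number {
  return passRate(r.production) - passRate(r.preview);
}
```

A gap of zero in round 2 is what points away from a preview-specific fault. With only three runs per target, it is a hint rather than a statistical result.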
Corroboration 5 status
FAIL-infra → PASS (with DNS caveat). Direct empirical verification of preview URL achieved — preview deployed, healthy, responding with correct MCP envelope shape. Combined with the CI "Test CF Preview" signal, this clears the smoke corroboration.
Overall verdict update
PARTIAL → PASS (effective). No change in code disposition, but the last remaining gap (smoke) is now empirically closed. Author's PR changes are verified functioning on the preview URL.
Post-merge verification (unchanged)
24h after prod promotion, run:
SELECT blob8 AS worker_version, SUM(_sample_interval) AS calls
FROM oddkit_telemetry
WHERE timestamp > NOW() - INTERVAL '1' DAY
  AND blob1 = 'tool_call'
GROUP BY worker_version
ORDER BY calls DESC;
Expected: worker_version shows 0.23.0, not "unknown".
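One way to turn that expectation into a check is sketched below. It assumes the query result has already been fetched and parsed into rows with a string worker_version and a numeric calls column. The VersionRow type and the unknownShare helper are hypothetical and not part of oddkit.

```typescript
// Hypothetical row shape matching the query's two output columns.
interface VersionRow {
  worker_version: string;
  calls: number; // assumed to be parsed to a number already
}

// Fraction of calls whose worker_version is still the literal "unknown".
function unknownShare(rows: VersionRow[]): number {
  const total = rows.reduce((sum, r) => sum + r.calls, 0);
  if (total === 0) return 0; // no traffic in the window
  const unknown = rows
    .filter((r) => r.worker_version === "unknown")
    .reduce((sum, r) => sum + r.calls, 0);
  return unknown / total;
}
```

Before the fix, the 7-day numbers in the PR body (6,051 of 6,051 calls) would give a share of 1. After promotion, the share for the last day should be at or near 0.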
Independent validator, round 2. Same fresh-context agent, new invocation. Orchestrator has not self-smoked at any point in this validation.
Problem
blob8 (worker_version) was logging as the literal string "unknown" for 100% of tool calls in production telemetry over the last 7 days (6,051 / 6,051 calls). This blocks version-aware regression detection right when 0.23.0 just shipped.
Surfaced during a telemetry evaluation pass:
Root cause
workers/src/telemetry.ts:218 falls back to the literal "unknown" when env.ODDKIT_VERSION is undefined:
env.ODDKIT_VERSION is only set when deploying via npm run deploy (which passes --var ODDKIT_VERSION:...). Cloudflare's auto-deploy from GitHub, the canonical deploy path, invokes wrangler directly from wrangler.toml and never executes the deploy script. So in production, env.ODDKIT_VERSION is always undefined, and the fallback is the literal "unknown".
Other call sites (workers/src/index.ts:30,184,894,915, workers/src/orchestrate.ts:1570) already fall back to pkg.version via a BUILD_VERSION constant. Telemetry was the lone holdout.
Fix
- Import pkg from "../package.json" (mirroring the index.ts pattern).
- Define BUILD_VERSION = pkg.version.
- Replace the "unknown" literal with BUILD_VERSION.
After merge, blob8 will report the actual semver on the canonical deploy path. Existing env.ODDKIT_VERSION override semantics are preserved: npm run deploy users still take precedence.
Bonus: duration_ms schema clarification
While in the file, also expanded the double2 (duration_ms) docstring. The previous one-liner, "MCP request processing time (measured by caller)", under-described the measurement. The same telemetry pass surfaced the discrepancy: oddkit_time reports debug.duration_ms: 0 in its tool envelope, but telemetry shows avg 269ms / max 9362ms.
duration_ms is full wall-clock from worker entry through handler() return (index.ts:962): V8 cold-start + KB fetch + MCP SDK overhead + action compute. The debug.duration_ms in tool envelopes measures only the action handler's internal compute.
No behavior change. Documentation only. The new docstring states the measurement boundary and explicitly disambiguates from debug.duration_ms.
Verification
- npx tsc --noEmit passes.
- env.ODDKIT_VERSION is set (override path unchanged).
Release validation (AMENDED)
Load-bearing classification: Ambiguous. Telemetry instrumentation is not explicitly listed in either the "always load-bearing" or "not load-bearing" categories in klappy://canon/constraints/release-validation-gate. Per canon: "Ambiguous cases default to load-bearing. When in doubt, dispatch the validator."
This PR changes what data gets collected (blob8 value shifts from "unknown" to real semver), which affects operational regression-detection queries. That makes it operationally load-bearing even if user-facing behavior is unchanged.
Validator dispatched: Sonnet 4.6 via Managed Agents, fresh context, read-only, 5-corroboration pattern. Verdict posted as comment below: PARTIAL (Safe to Promote). Code PASS, Bugbot clean, CI green, validator DNS-blocked on direct preview curl (CI's "Test CF Preview" is authoritative).
Retraction: An earlier version of this PR body claimed "Sonnet 4.6 read-only validator dispatch is not required." That framing was wrong per canon and has been retracted. Validator was dispatched; verdict recorded.
Out of scope
canon/canon/...md malformed URIs from klappy.dev-podcast and Deno/2.1.4 consumers) is a klappy.dev edge-function bug, not an oddkit bug. Vodka principle says fix at the source: separate PR in klappy.dev repo.
Carry-forward (non-blocking)
Validator flagged that release-validation-gate canon does not explicitly categorize telemetry instrumentation changes. Open an O-open to amend canon with explicit guidance so future telemetry PRs don't re-litigate this ambiguity.