chore(eval) Phase 4 W1-3 — re-curate eval ground-truth against deployed Cosmos OPDB IDs by jkeeley2073 · Pull Request #98 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-08T12:23:57Z

Summary

Path A — operator-pending recuration done. Cosmos auth worked from this worktree (az login active on the personal Earlybird sub), and the recuration ran end-to-end against deployed Cosmos pinwiz-cosmos-dev-hlpz4. 8 of 30 ground-truth ids were updated to actual deployed-catalog OPDB ids; 18 left as-is (machines absent from the current OPDB sync's view of the catalog); 4 out-of-scope rows skipped.

Phase 3 PR 8 shipped wizard.v1.jsonl with subagent-curated plausible OPDB-format ids. The deployed Cosmos catalog contains the actual OPDB ids and the two did not match — when the agent successfully called getMachineByTitle("Godzilla") and cited the real id, expected_citation_set held a different one, so citation_precision/recall scored 0 even on a correct lookup. This is one of the two reasons H2 baseline citation_precision was 0.133 (the other was connected-agents wiring, fixed in PR #96). See build-spec.md § Phase 4 § Scope item 9 and Phase 3 retrospective lesson 5.

What this PR adds

tools/eval/Recurate.csx — dotnet-script recuration tool. Reads each question's curated machine title from the side-car, queries deployed Cosmos via case-insensitive STRINGEQUALS on c.title (mirrors IMachineRepository.QueryByTitleAsync), rewrites expected_citation_set with the actual document id. --dry-run, --cosmos-endpoint, --jsonl, --titles flags. Idempotent.
tools/eval/wizard.v1.titles.json — explicit per-question machine-title curation side-car. 30 entries; 4 are null (out-of-scope rows).
data/eval/wizard.v1.jsonl — 8 questions recurated to actual OPDB ids:
- Godzilla (3 questions) → G5po2-MeP6B (mfg=sega — Stern's 2021 release lives under OPDB's Sega-era partition)
- The Wizard of Oz (3 questions) → G4xyR-MJ2v0 (mfg=jjp)
- Dialed In! (2 questions) → G4X2l-MYepl (mfg=jjp)
data/eval/wizard.v1.recuration.json — provenance side-car: timestamp, Cosmos endpoint, JSONL SHA before recuration, script SHA, per-question outcome with curated title + resolved manufacturer + resolved OPDB id.
data/eval/README.md — appended Phase 4 W1-3 section documenting the recuration outcome and the operator workflow for future re-runs.

Why some questions did not resolve

18 of 30 questions reference machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles, AC/DC, Metallica, Rush) that are either absent from the current OPDB sync's view of the catalog or appear only as edition-suffixed records (e.g., "AC/DC (Pro)"). Their expected_citation_set was left untouched per the script's no-fabrication contract — keeping the fictional ids in expected_citation_set makes the gap honestly visible: until the catalog has the machine, the agent cannot drive a non-zero citation_precision on those questions, and the metric should reflect that. This is a Phase 4 follow-up (re-target the questions to machines that ARE in the catalog, or re-sync the catalog).

Notable

SDK pin to Microsoft.Azure.Cosmos 3.43.1 (one minor below production's 3.59.0) — 3.59.0 + .NET 10 + dotnet-script + serverless Cosmos returns "BadRequest: One of the specified inputs is invalid" on every query path. Production code is unaffected (different runtime). Documented in the script header so a future operator does not get confused.
Godzilla's manufacturer surprise — the recuration manifest records resolved_manufacturer=sega for Stern's Godzilla 2021 because OPDB inherits the original Sega-era record's manufacturer key. This is a Phase 1 catalog observation, not a recuration choice; documented in the side-car's _about block so a future curator does not re-trip on it.

Test Plan

dotnet build — zero warnings, zero errors.
dotnet test — all 691 tests pass (eval tests verify the parser still reads the recurated JSONL).
dotnet script tools/eval/Recurate.csx -- --dry-run — prints proposed changes without writing.
dotnet script tools/eval/Recurate.csx — applies recuration; first run wrote 8 lines + manifest.
dotnet script tools/eval/Recurate.csx — re-run: 0 changes, manifest re-written with same outcomes (idempotent).

Out of Scope

Re-targeting the 18 unresolved questions to machines that ARE in the deployed catalog. Scoped as a Phase 4 follow-up; this PR delivers the tool + the actual recuration of the 8 resolvable questions.
Re-running the eval harness against the recurated ground-truth. That's hand-off H2 (intermediate) in build-spec § Phase 4 § Scope item 19, sequenced after items 8+9+10+11 land.
A unit test for the script. Per the spec, the script lives outside PinballWizard.slnx and does not have a test surface; the production MachineRepository.QueryByTitleAsync already has unit + integration tests.

Checklist

CI is green (build + test + coverage + CodeQL + sanitization)
PR title follows the Conventional Commits format above
If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (tools + data; no new architecture)
If user-visible behavior changes, README.md and/or docs/ are updated in the same PR (data/eval/README.md updated)
If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — N/A
No TODO / FIXME / commented-out code committed
No new entries in <NoWarn> without a comment explaining why and the removal criterion

Pre-push self-audit (additive PRs)

Step 0 — `/local-review` (qualitative)

Ran /local-review and addressed every 🔴 finding before push
Local review outcome: 0 🔴 / 2 ⚠️ / 8 categories ✅ — both ⚠️ deferred with justification:
- Minimal-shape TitleHit DTO used only in the probe (deferred — keeps probe payload smaller than the lookup; DTOs have distinct shapes by design)
- Dry-run path uses return instead of explicit Environment.Exit(0) (deferred — return from a dotnet-script main block is idiomatic and produces exit 0 reliably)

Step 1 — Mechanical checklist

Every new *Options property has at least one real getter call in src/ — N/A (no new Options classes; tool flags read in script)
Sibling-diffed against the closest existing implementation; drift is justified or removed — N/A (first .csx script in the repo)
No bare catch { } — minimum scope is catch (Exception)
New ISourceScraper? SourceAliasContractTests still passes without edit — N/A
Tests assert behavior, not just structure (named "rejects X" → fixture contains X) — N/A (no new tests; existing 691 still pass)
Build is zero-warning
git log -1 --format='%an <%ae>' shows personal noreply, not work email — verified 94459922+jkeeley2073@users.noreply.github.com

…ed Cosmos OPDB IDs Phase 3 PR 8 shipped wizard.v1.jsonl with subagent-curated plausible OPDB-format ids ("GRBN-MQR4P" etc.). The deployed Cosmos catalog contains the actual OPDB ids and they did not match — when the agent called getMachineByTitle("Godzilla") and got back the real catalog record, it cited the real id while expected_citation_set held a fictional one, so citation_precision/recall scored 0 even on a correct lookup. This is one of the two reasons H2 baseline citation_precision was 0.133 (the other was connected-agents wiring, fixed in PR #96). Adds tools/eval/Recurate.csx — a dotnet-script tool that reads each question's curated machine title from tools/eval/wizard.v1.titles.json (side-car, also added) and rewrites expected_citation_set with the actual deployed-Cosmos document id. Cross-partition case-insensitive STRINGEQUALS query mirrors IMachineRepository.QueryByTitleAsync so a recurated id is exactly what the production getMachineByTitle function tool would return. Provenance side-car wizard.v1.recuration.json records timestamp / endpoint / SHAs / per-question outcome. Idempotent. First run (2026-05-08): 8 of 30 questions resolved (Godzilla, The Wizard of Oz, Dialed In!). 18 did not resolve — their reference machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles, AC/DC, Metallica, Rush) are absent from the current OPDB sync's view of the catalog or appear only as edition-suffixed records ("AC/DC (Pro)"). expected_citation_set on those 18 left untouched per the script's no-fabrication contract; the unresolved questions are an honest signal that drives a future Phase 4 follow-up to re-target the questions or re-sync the catalog. 4 out-of-scope rows skipped (correct). Cosmos SDK pinned to 3.43.1 (one minor below production's 3.59.0) because 3.59.0 + .NET 10 + dotnet-script + serverless Cosmos returns "BadRequest: One of the specified inputs is invalid" on every query; 3.43.1 round-trips cleanly. Production code is unaffected (different runtime path). References: - build-spec.md § Phase 4 § Scope item 9 (the spec) - build-spec.md § Phase 3 § Retrospective lesson 5 (the motivation) - ADR-0014 (Microsoft Foundry orchestration; getMachineByTitle is the function tool) Local review: 0 🔴 / 2 ⚠️ (both deferred with justification — minimal-shape TitleHit probe DTO is intentional; dry-run `return` from script main is idiomatic dotnet-script) / 8 categories ✅. 7-item self-audit: identity ✅ (personal noreply), zero warnings ✅, behavior tests ✅ (all 691 pass).

…smatch A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS (c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras. This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then. Changes: - tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer column (lowercase string matching the deployed catalog's `manufacturer` partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of- scope rows use null to mirror their machine_title=null. _about field documents the new column. - tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose `manufacturer` matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer (in-scope row), fall back to first-hit-wins and log a manufacturer-unconstrained warning. New counts (skipped_mfg_ mismatch, manufacturer_unconstrained) flow into the manifest. - data/eval/README.md: append Hardening (2026-05-08) subsection documenting the new behavior + the dry-run verification output, with explicit "tooling-only; live re-run sequenced after OPDB sync investigation closes" callout. Verification: dry-run against the same deployed Cosmos endpoint as the W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now correctly flagged rather than silently taking Sega's id. Build clean (0 warnings).

jkeeley2073 added the claude-code Generated with Claude Code label May 8, 2026

jkeeley2073 merged commit 37e82cb into main May 8, 2026
4 of 5 checks passed

This was referenced May 8, 2026

chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch #99

Merged

fix(opdb) treat blank strings as null in OpdbMachineMapper fallback chain #100

Merged

fix(ci) un-trip Sanitization on its own DL-0004 documentation #101

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(eval) Phase 4 W1-3 — re-curate eval ground-truth against deployed Cosmos OPDB IDs#98

chore(eval) Phase 4 W1-3 — re-curate eval ground-truth against deployed Cosmos OPDB IDs#98
jkeeley2073 merged 1 commit into
mainfrom
Dev-Phase4W13EvalRecuration

jkeeley2073 commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jkeeley2073 commented May 8, 2026

Summary

What this PR adds

Why some questions did not resolve

Notable

Test Plan

Out of Scope

Checklist

Pre-push self-audit (additive PRs)

Step 0 — /local-review (qualitative)

Step 1 — Mechanical checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Step 0 — `/local-review` (qualitative)