chore(eval) Phase 4 W1-3 — re-curate eval ground-truth against deployed Cosmos OPDB IDs#98
Merged
Merged
Conversation
…ed Cosmos OPDB IDs
Phase 3 PR 8 shipped wizard.v1.jsonl with subagent-curated plausible
OPDB-format ids ("GRBN-MQR4P" etc.). The deployed Cosmos catalog
contains the actual OPDB ids and they did not match — when the agent
called getMachineByTitle("Godzilla") and got back the real catalog
record, it cited the real id while expected_citation_set held a
fictional one, so citation_precision/recall scored 0 even on a correct
lookup. This is one of the two reasons H2 baseline citation_precision
was 0.133 (the other was connected-agents wiring, fixed in PR #96).
Adds tools/eval/Recurate.csx — a dotnet-script tool that reads each
question's curated machine title from tools/eval/wizard.v1.titles.json
(side-car, also added) and rewrites expected_citation_set with the
actual deployed-Cosmos document id. Cross-partition case-insensitive
STRINGEQUALS query mirrors IMachineRepository.QueryByTitleAsync so a
recurated id is exactly what the production getMachineByTitle function
tool would return. Provenance side-car wizard.v1.recuration.json
records timestamp / endpoint / SHAs / per-question outcome. Idempotent.
First run (2026-05-08): 8 of 30 questions resolved (Godzilla, The
Wizard of Oz, Dialed In!). 18 did not resolve — their reference
machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles,
AC/DC, Metallica, Rush) are absent from the current OPDB sync's view
of the catalog or appear only as edition-suffixed records ("AC/DC
(Pro)"). expected_citation_set on those 18 left untouched per the
script's no-fabrication contract; the unresolved questions are an
honest signal that drives a future Phase 4 follow-up to re-target the
questions or re-sync the catalog. 4 out-of-scope rows skipped (correct).
Cosmos SDK pinned to 3.43.1 (one minor below production's 3.59.0)
because 3.59.0 + .NET 10 + dotnet-script + serverless Cosmos returns
"BadRequest: One of the specified inputs is invalid" on every query;
3.43.1 round-trips cleanly. Production code is unaffected (different
runtime path).
References:
- build-spec.md § Phase 4 § Scope item 9 (the spec)
- build-spec.md § Phase 3 § Retrospective lesson 5 (the motivation)
- ADR-0014 (Microsoft Foundry orchestration; getMachineByTitle is the function tool)
Local review: 0 🔴 / 2 ⚠️ (both deferred with justification — minimal-shape
TitleHit probe DTO is intentional; dry-run `return` from script main is
idiomatic dotnet-script) / 8 categories ✅. 7-item self-audit: identity ✅
(personal noreply), zero warnings ✅, behavior tests ✅ (all 691 pass).
jkeeley2073
added a commit
that referenced
this pull request
May 8, 2026
…smatch A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS (c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras. This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then. Changes: - tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer column (lowercase string matching the deployed catalog's `manufacturer` partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of- scope rows use null to mirror their machine_title=null. _about field documents the new column. - tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose `manufacturer` matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer (in-scope row), fall back to first-hit-wins and log a manufacturer-unconstrained warning. New counts (skipped_mfg_ mismatch, manufacturer_unconstrained) flow into the manifest. - data/eval/README.md: append Hardening (2026-05-08) subsection documenting the new behavior + the dry-run verification output, with explicit "tooling-only; live re-run sequenced after OPDB sync investigation closes" callout. Verification: dry-run against the same deployed Cosmos endpoint as the W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now correctly flagged rather than silently taking Sega's id. Build clean (0 warnings).
jkeeley2073
added a commit
that referenced
this pull request
May 8, 2026
…smatch A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS (c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras. This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then. Changes: - tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer column (lowercase string matching the deployed catalog's `manufacturer` partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of- scope rows use null to mirror their machine_title=null. _about field documents the new column. - tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose `manufacturer` matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer (in-scope row), fall back to first-hit-wins and log a manufacturer-unconstrained warning. New counts (skipped_mfg_ mismatch, manufacturer_unconstrained) flow into the manifest. - data/eval/README.md: append Hardening (2026-05-08) subsection documenting the new behavior + the dry-run verification output, with explicit "tooling-only; live re-run sequenced after OPDB sync investigation closes" callout. Verification: dry-run against the same deployed Cosmos endpoint as the W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now correctly flagged rather than silently taking Sega's id. Build clean (0 warnings).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Path A — operator-pending recuration done. Cosmos auth worked from this worktree (
az loginactive on the personal Earlybird sub), and the recuration ran end-to-end against deployed Cosmospinwiz-cosmos-dev-hlpz4. 8 of 30 ground-truth ids were updated to actual deployed-catalog OPDB ids; 18 left as-is (machines absent from the current OPDB sync's view of the catalog); 4 out-of-scope rows skipped.Phase 3 PR 8 shipped
wizard.v1.jsonlwith subagent-curated plausible OPDB-format ids. The deployed Cosmos catalog contains the actual OPDB ids and the two did not match — when the agent successfully calledgetMachineByTitle("Godzilla")and cited the real id,expected_citation_setheld a different one, socitation_precision/recallscored 0 even on a correct lookup. This is one of the two reasons H2 baselinecitation_precisionwas 0.133 (the other was connected-agents wiring, fixed in PR #96). See build-spec.md § Phase 4 § Scope item 9 and Phase 3 retrospective lesson 5.What this PR adds
tools/eval/Recurate.csx—dotnet-scriptrecuration tool. Reads each question's curated machine title from the side-car, queries deployed Cosmos via case-insensitiveSTRINGEQUALSonc.title(mirrorsIMachineRepository.QueryByTitleAsync), rewritesexpected_citation_setwith the actual document id.--dry-run,--cosmos-endpoint,--jsonl,--titlesflags. Idempotent.tools/eval/wizard.v1.titles.json— explicit per-question machine-title curation side-car. 30 entries; 4 arenull(out-of-scope rows).data/eval/wizard.v1.jsonl— 8 questions recurated to actual OPDB ids:G5po2-MeP6B(mfg=sega— Stern's 2021 release lives under OPDB's Sega-era partition)G4xyR-MJ2v0(mfg=jjp)G4X2l-MYepl(mfg=jjp)data/eval/wizard.v1.recuration.json— provenance side-car: timestamp, Cosmos endpoint, JSONL SHA before recuration, script SHA, per-question outcome with curated title + resolved manufacturer + resolved OPDB id.data/eval/README.md— appended Phase 4 W1-3 section documenting the recuration outcome and the operator workflow for future re-runs.Why some questions did not resolve
18 of 30 questions reference machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles, AC/DC, Metallica, Rush) that are either absent from the current OPDB sync's view of the catalog or appear only as edition-suffixed records (e.g., "AC/DC (Pro)"). Their
expected_citation_setwas left untouched per the script's no-fabrication contract — keeping the fictional ids inexpected_citation_setmakes the gap honestly visible: until the catalog has the machine, the agent cannot drive a non-zerocitation_precisionon those questions, and the metric should reflect that. This is a Phase 4 follow-up (re-target the questions to machines that ARE in the catalog, or re-sync the catalog).Notable
Microsoft.Azure.Cosmos3.43.1 (one minor below production's 3.59.0) — 3.59.0 + .NET 10 +dotnet-script+ serverless Cosmos returns "BadRequest: One of the specified inputs is invalid" on every query path. Production code is unaffected (different runtime). Documented in the script header so a future operator does not get confused.resolved_manufacturer=segafor Stern's Godzilla 2021 because OPDB inherits the original Sega-era record's manufacturer key. This is a Phase 1 catalog observation, not a recuration choice; documented in the side-car's_aboutblock so a future curator does not re-trip on it.Test Plan
dotnet build— zero warnings, zero errors.dotnet test— all 691 tests pass (eval tests verify the parser still reads the recurated JSONL).dotnet script tools/eval/Recurate.csx -- --dry-run— prints proposed changes without writing.dotnet script tools/eval/Recurate.csx— applies recuration; first run wrote 8 lines + manifest.dotnet script tools/eval/Recurate.csx— re-run: 0 changes, manifest re-written with same outcomes (idempotent).Out of Scope
PinballWizard.slnxand does not have a test surface; the productionMachineRepository.QueryByTitleAsyncalready has unit + integration tests.Checklist
docs/adr/— N/A (tools + data; no new architecture)README.mdand/ordocs/are updated in the same PR (data/eval/README.md updated)~/.claude/projects/c--projects-PinballWizard/memory/is now stale, it has been updated or removed in the same PR — N/ATODO/FIXME/ commented-out code committed<NoWarn>without a comment explaining why and the removal criterionPre-push self-audit (additive PRs)
Step 0 —
/local-review(qualitative)/local-reviewand addressed every 🔴 finding before pushTitleHitDTO used only in the probe (deferred — keeps probe payload smaller than the lookup; DTOs have distinct shapes by design)returninstead of explicitEnvironment.Exit(0)(deferred —returnfrom adotnet-scriptmain block is idiomatic and produces exit 0 reliably)Step 1 — Mechanical checklist
*Optionsproperty has at least one real getter call insrc/— N/A (no new Options classes; tool flags read in script).csxscript in the repo)catch { }— minimum scope iscatch (Exception)ISourceScraper?SourceAliasContractTestsstill passes without edit — N/Agit log -1 --format='%an <%ae>'shows personal noreply, not work email — verified94459922+jkeeley2073@users.noreply.github.com