
chore(eval) Phase 4 W1-3 — re-curate eval ground-truth against deployed Cosmos OPDB IDs #98

Merged: jkeeley2073 merged 1 commit into main from Dev-Phase4W13EvalRecuration on May 8, 2026

@jkeeley2073
Contributor

Summary

Path A — operator-pending recuration done. Cosmos auth worked from this worktree (az login active on the personal Earlybird sub), and the recuration ran end-to-end against deployed Cosmos pinwiz-cosmos-dev-hlpz4. 8 of 30 ground-truth ids were updated to actual deployed-catalog OPDB ids; 18 left as-is (machines absent from the current OPDB sync's view of the catalog); 4 out-of-scope rows skipped.

Phase 3 PR 8 shipped wizard.v1.jsonl with subagent-curated plausible OPDB-format ids. The deployed Cosmos catalog contains the actual OPDB ids and the two did not match — when the agent successfully called getMachineByTitle("Godzilla") and cited the real id, expected_citation_set held a different one, so citation_precision/recall scored 0 even on a correct lookup. This is one of the two reasons H2 baseline citation_precision was 0.133 (the other was connected-agents wiring, fixed in PR #96). See build-spec.md § Phase 4 § Scope item 9 and Phase 3 retrospective lesson 5.
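To make the scoring failure concrete, here is a minimal sketch in Python (chosen for brevity; the harness itself is not shown on this page). It assumes citation_precision and citation_recall are set-based overlaps between the ids the agent cited and expected_citation_set, which is an assumption about the harness rather than its confirmed definition. The function name `citation_scores` is hypothetical.

```python
def citation_scores(cited: set[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one question's cited ids.

    Assumed set-based definition; the real eval harness may differ.
    """
    overlap = len(cited & expected)
    precision = overlap / len(cited) if cited else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall


real_id = "G5po2-MeP6B"        # deployed-catalog id quoted in this PR
placeholder_id = "GRBN-MQR4P"  # placeholder-style id quoted in the commit message

# Before recuration: a correct lookup still scores zero on both metrics.
before = citation_scores({real_id}, {placeholder_id})
# After recuration: the same citation matches the ground truth.
after = citation_scores({real_id}, {real_id})
print(before, after)
```

Under that assumption, a question whose ground truth holds a placeholder id can never score above zero, however good the lookup is.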

What this PR adds

  • tools/eval/Recurate.csx: a dotnet-script recuration tool. Reads each question's curated machine title from the side-car, queries deployed Cosmos via case-insensitive STRINGEQUALS on c.title (mirrors IMachineRepository.QueryByTitleAsync), and rewrites expected_citation_set with the actual document id. Supports the --dry-run, --cosmos-endpoint, --jsonl, and --titles flags. Idempotent.
  • tools/eval/wizard.v1.titles.json — explicit per-question machine-title curation side-car. 30 entries; 4 are null (out-of-scope rows).
  • data/eval/wizard.v1.jsonl — 8 questions recurated to actual OPDB ids:
    • Godzilla (3 questions) → G5po2-MeP6B (mfg=sega — Stern's 2021 release lives under OPDB's Sega-era partition)
    • The Wizard of Oz (3 questions) → G4xyR-MJ2v0 (mfg=jjp)
    • Dialed In! (2 questions) → G4X2l-MYepl (mfg=jjp)
  • data/eval/wizard.v1.recuration.json — provenance side-car: timestamp, Cosmos endpoint, JSONL SHA before recuration, script SHA, per-question outcome with curated title + resolved manufacturer + resolved OPDB id.
  • data/eval/README.md — appended Phase 4 W1-3 section documenting the recuration outcome and the operator workflow for future re-runs.
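The recuration pass described in the list above can be sketched roughly as follows. This is an illustrative Python model, not the C# script: the in-memory `catalog` dict stands in for the Cosmos title query, the question key `id` and the function name `recurate` are hypothetical, and the outcome labels are borrowed from the dry-run output quoted in the follow-up commit.

```python
def recurate(rows, titles, catalog):
    """Rewrite expected_citation_set wherever the curated title resolves.

    rows:    dicts parsed from the JSONL (hypothetical shape)
    titles:  question id -> curated machine title, or None for out-of-scope
    catalog: casefolded title -> deployed document id (stand-in for Cosmos)
    """
    outcomes = {}
    for row in rows:
        title = titles.get(row["id"])
        if title is None:
            outcomes[row["id"]] = "out_of_scope"
            continue
        resolved = catalog.get(title.casefold())
        if resolved is None:
            # No-fabrication contract: leave the existing ids untouched.
            outcomes[row["id"]] = "no_match"
        elif row["expected_citation_set"] == [resolved]:
            outcomes[row["id"]] = "unchanged"
        else:
            row["expected_citation_set"] = [resolved]
            outcomes[row["id"]] = "recurated"
    return outcomes


rows = [
    {"id": "q1", "expected_citation_set": ["placeholder-1"]},
    {"id": "q2", "expected_citation_set": ["placeholder-2"]},
    {"id": "q3", "expected_citation_set": []},
]
titles = {"q1": "The Wizard of Oz", "q2": "Foo Fighters", "q3": None}
catalog = {"the wizard of oz": "G4xyR-MJ2v0"}

first = recurate(rows, titles, catalog)
second = recurate(rows, titles, catalog)  # re-run changes nothing
print(first, second)
```

Comparing before writing is what makes a re-run a no-op: a row that already holds the resolved id is reported as unchanged rather than rewritten.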

Why some questions did not resolve

18 of 30 questions reference machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles, AC/DC, Metallica, Rush) that are either absent from the current OPDB sync's view of the catalog or appear only as edition-suffixed records (e.g., "AC/DC (Pro)"). Their expected_citation_set was left untouched per the script's no-fabrication contract — keeping the fictional ids in expected_citation_set makes the gap honestly visible: until the catalog has the machine, the agent cannot drive a non-zero citation_precision on those questions, and the metric should reflect that. This is a Phase 4 follow-up (re-target the questions to machines that ARE in the catalog, or re-sync the catalog).
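The reason an edition-suffixed record does not resolve is that the lookup is whole-string equality, not a prefix or substring match. A small Python sketch of that behavior, assuming `str.casefold()` is a close enough stand-in for STRINGEQUALS with its ignore-case argument set to true (the exact case-folding rules in Cosmos may differ); `title_equals` is a hypothetical helper:

```python
def title_equals(stored: str, wanted: str) -> bool:
    """Case-insensitive whole-string equality (no prefix or substring match)."""
    return stored.casefold() == wanted.casefold()


catalog_titles = ["AC/DC (Pro)", "The Wizard of Oz"]

# The curated title "AC/DC" finds nothing: only the suffixed record exists.
acdc_hits = [t for t in catalog_titles if title_equals(t, "AC/DC")]
# Case differences alone do not prevent a match.
oz_hits = [t for t in catalog_titles if title_equals(t, "the wizard of oz")]
print(acdc_hits, oz_hits)
```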

Notable

  • SDK pin to Microsoft.Azure.Cosmos 3.43.1 (an older 3.x minor than production's 3.59.0): the combination of 3.59.0, .NET 10, dotnet-script, and serverless Cosmos returns "BadRequest: One of the specified inputs is invalid" on every query path. Production code is unaffected (different runtime). Documented in the script header so a future operator is not surprised by the version gap.
  • Godzilla's manufacturer surprise — the recuration manifest records resolved_manufacturer=sega for Stern's Godzilla 2021 because OPDB inherits the original Sega-era record's manufacturer key. This is a Phase 1 catalog observation, not a recuration choice; documented in the side-car's _about block so a future curator does not re-trip on it.

Test Plan

  • dotnet build — zero warnings, zero errors.
  • dotnet test — all 691 tests pass (eval tests verify the parser still reads the recurated JSONL).
  • dotnet script tools/eval/Recurate.csx -- --dry-run — prints proposed changes without writing.
  • dotnet script tools/eval/Recurate.csx — applies recuration; first run wrote 8 lines + manifest.
  • dotnet script tools/eval/Recurate.csx — re-run: 0 changes, manifest re-written with same outcomes (idempotent).
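The idempotency check in the last two steps can be modeled as "apply twice, confirm the second pass changes nothing and the content hash is stable". The Python sketch below is an assumption-laden illustration: `jsonl_sha256` and `apply_ids` are hypothetical helpers, and the real script's hashing and serialization details are not shown on this page.

```python
import hashlib
import json


def jsonl_sha256(rows):
    """SHA-256 over a canonical JSONL rendering of the rows."""
    body = "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def apply_ids(rows, resolved):
    """Write resolved ids into rows; return how many rows actually changed."""
    changed = 0
    for row in rows:
        new_id = resolved.get(row["id"])
        if new_id is not None and row["expected_citation_set"] != [new_id]:
            row["expected_citation_set"] = [new_id]
            changed += 1
    return changed


rows = [{"id": "q1", "expected_citation_set": ["placeholder-id"]}]
resolved = {"q1": "G4X2l-MYepl"}  # Dialed In! id quoted in this PR

sha_before = jsonl_sha256(rows)           # recorded for provenance
first = apply_ids(rows, resolved)         # first run rewrites the row
sha_after_first = jsonl_sha256(rows)
second = apply_ids(rows, resolved)        # re-run is a no-op
sha_after_second = jsonl_sha256(rows)
print(first, second, sha_after_first == sha_after_second)
```

Recording the pre-run hash, as the provenance side-car does, lets a later reader confirm which input a given set of outcomes was computed from.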

Out of Scope

  • Re-targeting the 18 unresolved questions to machines that ARE in the deployed catalog. Scoped as a Phase 4 follow-up; this PR delivers the tool + the actual recuration of the 8 resolvable questions.
  • Re-running the eval harness against the recurated ground-truth. That's hand-off H2 (intermediate) in build-spec § Phase 4 § Scope item 19, sequenced after items 8+9+10+11 land.
  • A unit test for the script. Per the spec, the script lives outside PinballWizard.slnx and does not have a test surface; the production MachineRepository.QueryByTitleAsync already has unit + integration tests.

Checklist

  • CI is green (build + test + coverage + CodeQL + sanitization)
  • PR title follows the Conventional Commits format above
  • If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (tools + data; no new architecture)
  • If user-visible behavior changes, README.md and/or docs/ are updated in the same PR (data/eval/README.md updated)
  • If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — N/A
  • No TODO / FIXME / commented-out code committed
  • No new entries in <NoWarn> without a comment explaining why and the removal criterion

Pre-push self-audit (additive PRs)

Step 0 — /local-review (qualitative)

  • Ran /local-review and addressed every 🔴 finding before push
  • Local review outcome: 0 🔴 / 2 ⚠️ / 8 categories ✅ — both ⚠️ deferred with justification:
    • Minimal-shape TitleHit DTO used only in the probe (deferred — keeps probe payload smaller than the lookup; DTOs have distinct shapes by design)
    • Dry-run path uses return instead of explicit Environment.Exit(0) (deferred — return from a dotnet-script main block is idiomatic and produces exit 0 reliably)

Step 1 — Mechanical checklist

  • Every new *Options property has at least one real getter call in src/ — N/A (no new Options classes; tool flags read in script)
  • Sibling-diffed against the closest existing implementation; drift is justified or removed — N/A (first .csx script in the repo)
  • No bare catch { } — minimum scope is catch (Exception)
  • New ISourceScraper? SourceAliasContractTests still passes without edit — N/A
  • Tests assert behavior, not just structure (named "rejects X" → fixture contains X) — N/A (no new tests; existing 691 still pass)
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' shows personal noreply, not work email — verified 94459922+jkeeley2073@users.noreply.github.com

…ed Cosmos OPDB IDs

Phase 3 PR 8 shipped wizard.v1.jsonl with subagent-curated plausible
OPDB-format ids ("GRBN-MQR4P" etc.). The deployed Cosmos catalog
contains the actual OPDB ids and they did not match — when the agent
called getMachineByTitle("Godzilla") and got back the real catalog
record, it cited the real id while expected_citation_set held a
fictional one, so citation_precision/recall scored 0 even on a correct
lookup. This is one of the two reasons H2 baseline citation_precision
was 0.133 (the other was connected-agents wiring, fixed in PR #96).

Adds tools/eval/Recurate.csx — a dotnet-script tool that reads each
question's curated machine title from tools/eval/wizard.v1.titles.json
(side-car, also added) and rewrites expected_citation_set with the
actual deployed-Cosmos document id. Cross-partition case-insensitive
STRINGEQUALS query mirrors IMachineRepository.QueryByTitleAsync so a
recurated id is exactly what the production getMachineByTitle function
tool would return. Provenance side-car wizard.v1.recuration.json
records timestamp / endpoint / SHAs / per-question outcome. Idempotent.

First run (2026-05-08): 8 of 30 questions resolved (Godzilla, The
Wizard of Oz, Dialed In!). 18 did not resolve — their reference
machines (Foo Fighters, Stranger Things, Iron Maiden, The Beatles,
AC/DC, Metallica, Rush) are absent from the current OPDB sync's view
of the catalog or appear only as edition-suffixed records ("AC/DC
(Pro)"). expected_citation_set on those 18 left untouched per the
script's no-fabrication contract; the unresolved questions are an
honest signal that drives a future Phase 4 follow-up to re-target the
questions or re-sync the catalog. 4 out-of-scope rows skipped (correct).

Cosmos SDK pinned to 3.43.1 (an older minor than production's 3.59.0)
because 3.59.0 + .NET 10 + dotnet-script + serverless Cosmos returns
"BadRequest: One of the specified inputs is invalid" on every query;
3.43.1 round-trips cleanly. Production code is unaffected (different
runtime path).

References:
- build-spec.md § Phase 4 § Scope item 9 (the spec)
- build-spec.md § Phase 3 § Retrospective lesson 5 (the motivation)
- ADR-0014 (Microsoft Foundry orchestration; getMachineByTitle is the function tool)

Local review: 0 🔴 / 2 ⚠️ (both deferred with justification — minimal-shape
TitleHit probe DTO is intentional; dry-run `return` from script main is
idiomatic dotnet-script) / 8 categories ✅. 7-item self-audit: identity ✅
(personal noreply), zero warnings ✅, behavior tests ✅ (all 691 pass).
jkeeley2073 added the claude-code (Generated with Claude Code) label on May 8, 2026
jkeeley2073 merged commit 37e82cb into main on May 8, 2026
4 of 5 checks passed
jkeeley2073 added a commit that referenced this pull request May 8, 2026
…smatch

A spot-check of the W1-3 first run (PR #98) revealed a silent failure
mode: the 3 Godzilla questions in the eval set are intended for Stern's
2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998
Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS
(c.title, 'Godzilla', true) and took the first hit blindly — recording
the Sega record's id under each Godzilla question's expected_citation_set.
The agent's correct answer about Stern 2021 would have failed eval
because its citation wouldn't match the (incorrect) Sega ground truth.
Same risk class exists for any title shared across manufacturers/eras.

This PR ships hardening of the recuration tool only. The next live
recuration run is sequenced after the OPDB sync investigation closes
(why Stern's modern catalog is currently absent from the deployed Cosmos
is a separate root-cause investigation already underway). Running the
live script before that closes would compound the catalog-state issue.
The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and
data/eval/wizard.v1.recuration.json) remain authoritative until then.

Changes:
- tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer
  column (lowercase string matching the deployed catalog's `manufacturer`
  partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of-
  scope rows use null to mirror their machine_title=null. _about field
  documents the new column.
- tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple
  from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller
  walks results and picks the first hit whose `manufacturer` matches
  expected_manufacturer (case-insensitive). On no-match, skip the row
  with new mfg_mismatch outcome — JSONL untouched. On null
  expected_manufacturer (in-scope row), fall back to first-hit-wins and
  log a manufacturer-unconstrained warning. New counts (skipped_mfg_
  mismatch, manufacturer_unconstrained) flow into the manifest.
- data/eval/README.md: append Hardening (2026-05-08) subsection
  documenting the new behavior + the dry-run verification output, with
  explicit "tooling-only; live re-run sequenced after OPDB sync
  investigation closes" callout.

Verification: dry-run against the same deployed Cosmos endpoint as the
W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of
Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog
absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has
sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now
correctly flagged rather than silently taking Sega's id. Build clean
(0 warnings).
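The hardened selection rule from the commit message above can be sketched as a small decision function. This is a Python illustration of the described behavior, not the C# in Recurate.csx: `pick_hit` and the "matched" label are hypothetical, while "no_match", "mfg_mismatch", and "manufacturer_unconstrained" follow the outcome names quoted in the commit.

```python
def pick_hit(hits, expected_manufacturer):
    """Choose a catalog hit for one question.

    hits: list of dicts with 'id' and 'manufacturer' (all title matches).
    Returns (document id or None, outcome label).
    """
    if not hits:
        return None, "no_match"
    if expected_manufacturer is None:
        # In-scope row with no constraint: first hit wins, caller logs a warning.
        return hits[0]["id"], "manufacturer_unconstrained"
    wanted = expected_manufacturer.casefold()
    for hit in hits:
        if hit["manufacturer"].casefold() == wanted:
            return hit["id"], "matched"  # hypothetical label
    # Title exists, but not from the expected manufacturer: skip the row.
    return None, "mfg_mismatch"


# The Godzilla case from the spot-check: only the Sega record is deployed.
godzilla_hits = [{"id": "G5po2-MeP6B", "manufacturer": "sega"}]

stern_result = pick_hit(godzilla_hits, "stern")
sega_result = pick_hit(godzilla_hits, "SEGA")
unconstrained = pick_hit(godzilla_hits, None)
empty = pick_hit([], "stern")
print(stern_result, sega_result, unconstrained, empty)
```

Returning no id on a mismatch is the key design choice: the row is flagged and its JSONL entry stays untouched instead of silently inheriting another manufacturer's record.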