chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch by jkeeley2073 · Pull Request #99 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-08T12:50:32Z

Summary

A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS(c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras.

This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then.

Changes

tools/eval/wizard.v1.titles.json — add per-question expected_manufacturer column (lowercase string matching the deployed catalog's manufacturer partition value — stern, jjp, sega, etc.). All 30 rows curated from the question text + notes field. Out-of-scope rows use null to mirror their machine_title=null. _about field documents the new column and the failure-mode that motivated it.
tools/eval/Recurate.csx — replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose manufacturer matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer for an in-scope row, fall back to first-hit-wins and log a "manufacturer-unconstrained" warning. New counts (skipped_mfg_mismatch, manufacturer_unconstrained) flow into the recuration manifest's counts block; per-row outcomes carry expected_manufacturer alongside resolved_manufacturer for full audit trail.
data/eval/README.md — append "Hardening (2026-05-08) — manufacturer-aware skip-on-mismatch" subsection documenting the failure mode + the new behavior + the dry-run verification output, with explicit tooling-only; live re-run sequenced after OPDB sync investigation closes callout.

Test Plan

dotnet build PinballWizard.slnx — 0 warnings, 0 errors (no src/ changes; build remains green).
dotnet script tools/eval/Recurate.csx -- --dry-run against the deployed Cosmos endpoint (pinwiz-cosmos-dev-hlpz4, the same endpoint used by the W1-3 first run; az login to the personal Earlybird sub) — captured output appended to data/eval/README.md. Result:
- 3 Godzilla rows (ev-rules-0003, ev-valuation-0004, ev-repair-0003) now flagged mfg_mismatch (expected stern, catalog has sega) — the failure mode this PR exists to catch.
- 5 JJP rows still resolve correctly (3× The Wizard of Oz, 2× Dialed In!) — JJP is the only manufacturer holding those titles in the deployed catalog.
- 18 no_match rows unchanged (Stern's modern catalog still absent — separate investigation).
- 4 out_of_scope rows unchanged (correct refusals).
- 0 manufacturer-unconstrained (every in-scope row now has expected_manufacturer set).
The live (non-dry-run) script was deliberately NOT run; the eval JSONL ground truth and the W1-3 recuration provenance are unchanged.

Out of Scope

The OPDB sync investigation (why Stern's modern catalog is missing from deployed Cosmos) — separate root-cause investigation already underway.
Re-running the live recuration to update data/eval/wizard.v1.jsonl against a populated catalog — sequenced after the OPDB sync investigation closes.
Unit tests for Recurate.csx — the script doesn't compile as part of PinballWizard.slnx; the dry-run output documented in README is the test-of-record (it exercises the new mfg_mismatch branch on real Godzilla rows).

Checklist

CI is green (build + test; tooling-only PR — script exercised via dry-run)
PR title follows the Conventional Commits format
If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (hardening of an existing tool, not a new architectural decision)
If user-visible behavior changes, README.md and/or docs/ are updated in the same PR (data/eval/README.md appended)
If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — N/A (no memory referenced this script)
No TODO / FIXME / commented-out code committed
No new entries in <NoWarn> without a comment

Pre-push self-audit

Step 0 — `/local-review` (qualitative)

Ran /local-review and addressed every 🔴 finding before push
Local review outcome: 0 🔴 / 2 ⚠️ / 8 categories ✅
- ⚠️ Test quality: dry-run-output-as-test approach is appropriate for a dotnet-script tool that doesn't compile as part of PinballWizard.slnx; the Godzilla rows actually exercise the new mfg_mismatch branch. Deferred — justification: there is no project to host xUnit tests in; the dry-run is the canonical verification path documented in the script header.
- ⚠️ Performance: QueryHitsByTitle reads ALL hits per title rather than SELECT TOP 1. Deferred — justification: TOP 1 was the bug; reading all hits is the necessary correctness fix. RU cost is negligible across 30 questions and the dry-run completed in 3.1s (no observable regression vs. the W1-3 first run).

Step 1 — Mechanical checklist

Every new *Options property has at least one real getter call in src/ — N/A (no *Options class added; the new expected_manufacturer side-car field is read at lines 188-190 and consumed at lines 332/344 of Recurate.csx)
Sibling-diffed against the closest existing implementation; drift is justified or removed — diffed against the prior version of the same script; no drift.
No bare catch { } — no new catches added.
New ISourceScraper? — N/A
Tests assert behavior, not just structure — the dry-run exercises the actual mfg_mismatch branch on real Godzilla rows.
Build is zero-warning — dotnet build PinballWizard.slnx → 0 warnings, 0 errors.
git log -1 --format='%an <%ae>' shows personal noreply, not work email — verified Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>.

…hain Critical data-correctness fix uncovered while spot-checking the W1-3 recuration outcome. The deployed Cosmos catalog has 58 modern Stern records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where title is empty string, breaking IMachineRepository.QueryByTitleAsync — the function tool the Wizard agent uses to ground every machine reference. The eval's W1-3 recuration silently picked the Sega 1998 Godzilla because that was the only "Godzilla" with a non-empty title the title-keyed lookup could find. ROOT CAUSE C#'s null-coalescing operator (??) preserves empty strings: `null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`. OPDB's /api/export returns some modern Stern records with empty `name` / `common_name`, so the existing chain `Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty title instead of the documented OpdbId fallback. Same issue affected MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles back to empty if OPDB returned blanks). Defense-in-depth: the manufacturer ShortName fallback had the same shape; a blank ShortName would have produced a "unknown" partition key. FIX Introduced FirstNonBlank(params string?[] candidates) helper that returns the first non-null AND non-whitespace candidate, or null if every candidate is blank. Applied at three sites in OpdbMachineMapper: - Map() Title fallback chain - Map() manufacturer-key resolution (ShortName → Name) - MergeOpdbFieldsInto() Title refresh REGRESSION TESTS (5 new) - Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5 blank/null/whitespace permutations; pins the documented OpdbId fallback fires when both name fields are blank - Map_BlankCommonName_FallsThroughToName — pins the falls-through-to-Name behavior when CommonName is blank but Name is non-blank - Map_BlankShortName_FallsBackToFullManufacturerName — pins the defense-in-depth fix on the manufacturer-key chain - MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle — pins that blanks on a re-sync don't wipe the existing title OPERATOR HAND-OFF The fix corrects the mapping logic; the deployed Cosmos catalog is still in the broken state until the OPDB sync is re-run. Sequenced hand-off: after this PR merges, re-run `dotnet run --project src/PinballWizard.Cli -- --source opdb` against the personal Earlybird subscription. The 58 Stern modern records will repopulate their titles. Then re-run the W1-3 hardened recuration tool (per PR #99) to reconcile the eval ground truth against the corrected catalog. Tests: 709 → 717. Build clean, zero warnings.

…smatch A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS (c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras. This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then. Changes: - tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer column (lowercase string matching the deployed catalog's `manufacturer` partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of- scope rows use null to mirror their machine_title=null. _about field documents the new column. - tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose `manufacturer` matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer (in-scope row), fall back to first-hit-wins and log a manufacturer-unconstrained warning. New counts (skipped_mfg_ mismatch, manufacturer_unconstrained) flow into the manifest. - data/eval/README.md: append Hardening (2026-05-08) subsection documenting the new behavior + the dry-run verification output, with explicit "tooling-only; live re-run sequenced after OPDB sync investigation closes" callout. Verification: dry-run against the same deployed Cosmos endpoint as the W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now correctly flagged rather than silently taking Sega's id. Build clean (0 warnings).

…hain Critical data-correctness fix uncovered while spot-checking the W1-3 recuration outcome. The deployed Cosmos catalog has 58 modern Stern records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where title is empty string, breaking IMachineRepository.QueryByTitleAsync — the function tool the Wizard agent uses to ground every machine reference. The eval's W1-3 recuration silently picked the Sega 1998 Godzilla because that was the only "Godzilla" with a non-empty title the title-keyed lookup could find. ROOT CAUSE C#'s null-coalescing operator (??) preserves empty strings: `null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`. OPDB's /api/export returns some modern Stern records with empty `name` / `common_name`, so the existing chain `Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty title instead of the documented OpdbId fallback. Same issue affected MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles back to empty if OPDB returned blanks). Defense-in-depth: the manufacturer ShortName fallback had the same shape; a blank ShortName would have produced a "unknown" partition key. FIX Introduced FirstNonBlank(params string?[] candidates) helper that returns the first non-null AND non-whitespace candidate, or null if every candidate is blank. Applied at three sites in OpdbMachineMapper: - Map() Title fallback chain - Map() manufacturer-key resolution (ShortName → Name) - MergeOpdbFieldsInto() Title refresh REGRESSION TESTS (5 new) - Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5 blank/null/whitespace permutations; pins the documented OpdbId fallback fires when both name fields are blank - Map_BlankCommonName_FallsThroughToName — pins the falls-through-to-Name behavior when CommonName is blank but Name is non-blank - Map_BlankShortName_FallsBackToFullManufacturerName — pins the defense-in-depth fix on the manufacturer-key chain - MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle — pins that blanks on a re-sync don't wipe the existing title OPERATOR HAND-OFF The fix corrects the mapping logic; the deployed Cosmos catalog is still in the broken state until the OPDB sync is re-run. Sequenced hand-off: after this PR merges, re-run `dotnet run --project src/PinballWizard.Cli -- --source opdb` against the personal Earlybird subscription. The 58 Stern modern records will repopulate their titles. Then re-run the W1-3 hardened recuration tool (per PR #99) to reconcile the eval ground truth against the corrected catalog. Tests: 709 → 717. Build clean, zero warnings.

jkeeley2073 added the claude-code Generated with Claude Code label May 8, 2026

This was referenced May 8, 2026

fix(opdb) treat blank strings as null in OpdbMachineMapper fallback chain #100

Merged

fix(ci) un-trip Sanitization on its own DL-0004 documentation #101

Merged

jkeeley2073 force-pushed the Dev-Phase4W13RecurateHardening branch from f8a19c1 to 4817eb0 Compare May 8, 2026 13:39

jkeeley2073 mentioned this pull request May 8, 2026

fix(ci) bypass errexit on WORK_EMAIL_PATTERN validation in sanitization #102

Merged

19 tasks

jkeeley2073 added 2 commits May 8, 2026 10:01

chore(ci) retrigger workflows after WORK_EMAIL_PATTERN secret fix

8b5b1a7

jkeeley2073 force-pushed the Dev-Phase4W13RecurateHardening branch from fe7dfa9 to 8b5b1a7 Compare May 8, 2026 14:01

jkeeley2073 merged commit e294e42 into main May 8, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch#99

chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch#99
jkeeley2073 merged 2 commits into
mainfrom
Dev-Phase4W13RecurateHardening

jkeeley2073 commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jkeeley2073 commented May 8, 2026

Summary

Changes

Test Plan

Out of Scope

Checklist

Pre-push self-audit

Step 0 — /local-review (qualitative)

Step 1 — Mechanical checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Step 0 — `/local-review` (qualitative)