chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch#99
Merged
Conversation
This was referenced May 8, 2026
f8a19c1 to
4817eb0
Compare
jkeeley2073
added a commit
that referenced
this pull request
May 8, 2026
…hain
Critical data-correctness fix uncovered while spot-checking the W1-3
recuration outcome. The deployed Cosmos catalog has 58 modern Stern
records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where
title is empty string, breaking IMachineRepository.QueryByTitleAsync —
the function tool the Wizard agent uses to ground every machine
reference. The eval's W1-3 recuration silently picked the Sega 1998
Godzilla because that was the only "Godzilla" with a non-empty title
the title-keyed lookup could find.
ROOT CAUSE
C#'s null-coalescing operator (??) preserves empty strings:
`null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`.
OPDB's /api/export returns some modern Stern records with empty
`name` / `common_name`, so the existing chain
`Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty
title instead of the documented OpdbId fallback. Same issue affected
MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles
back to empty if OPDB returned blanks). Defense-in-depth: the
manufacturer ShortName fallback had the same shape; a blank ShortName
would have produced a "unknown" partition key.
FIX
Introduced FirstNonBlank(params string?[] candidates) helper that
returns the first non-null AND non-whitespace candidate, or null if
every candidate is blank. Applied at three sites in
OpdbMachineMapper:
- Map() Title fallback chain
- Map() manufacturer-key resolution (ShortName → Name)
- MergeOpdbFieldsInto() Title refresh
REGRESSION TESTS (5 new)
- Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5
blank/null/whitespace permutations; pins the documented OpdbId
fallback fires when both name fields are blank
- Map_BlankCommonName_FallsThroughToName — pins the
falls-through-to-Name behavior when CommonName is blank but Name
is non-blank
- Map_BlankShortName_FallsBackToFullManufacturerName — pins the
defense-in-depth fix on the manufacturer-key chain
- MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle
— pins that blanks on a re-sync don't wipe the existing title
OPERATOR HAND-OFF
The fix corrects the mapping logic; the deployed Cosmos catalog is
still in the broken state until the OPDB sync is re-run. Sequenced
hand-off: after this PR merges, re-run `dotnet run --project
src/PinballWizard.Cli -- --source opdb` against the personal
Earlybird subscription. The 58 Stern modern records will repopulate
their titles. Then re-run the W1-3 hardened recuration tool (per PR
#99) to reconcile the eval ground truth against the corrected
catalog.
Tests: 709 → 717. Build clean, zero warnings.
19 tasks
…smatch A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS (c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras. This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then. Changes: - tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer column (lowercase string matching the deployed catalog's `manufacturer` partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of- scope rows use null to mirror their machine_title=null. _about field documents the new column. - tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose `manufacturer` matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer (in-scope row), fall back to first-hit-wins and log a manufacturer-unconstrained warning. New counts (skipped_mfg_ mismatch, manufacturer_unconstrained) flow into the manifest. - data/eval/README.md: append Hardening (2026-05-08) subsection documenting the new behavior + the dry-run verification output, with explicit "tooling-only; live re-run sequenced after OPDB sync investigation closes" callout. Verification: dry-run against the same deployed Cosmos endpoint as the W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now correctly flagged rather than silently taking Sega's id. Build clean (0 warnings).
fe7dfa9 to
8b5b1a7
Compare
jkeeley2073
added a commit
that referenced
this pull request
May 8, 2026
…hain
Critical data-correctness fix uncovered while spot-checking the W1-3
recuration outcome. The deployed Cosmos catalog has 58 modern Stern
records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where
title is empty string, breaking IMachineRepository.QueryByTitleAsync —
the function tool the Wizard agent uses to ground every machine
reference. The eval's W1-3 recuration silently picked the Sega 1998
Godzilla because that was the only "Godzilla" with a non-empty title
the title-keyed lookup could find.
ROOT CAUSE
C#'s null-coalescing operator (??) preserves empty strings:
`null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`.
OPDB's /api/export returns some modern Stern records with empty
`name` / `common_name`, so the existing chain
`Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty
title instead of the documented OpdbId fallback. Same issue affected
MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles
back to empty if OPDB returned blanks). Defense-in-depth: the
manufacturer ShortName fallback had the same shape; a blank ShortName
would have produced a "unknown" partition key.
FIX
Introduced FirstNonBlank(params string?[] candidates) helper that
returns the first non-null AND non-whitespace candidate, or null if
every candidate is blank. Applied at three sites in
OpdbMachineMapper:
- Map() Title fallback chain
- Map() manufacturer-key resolution (ShortName → Name)
- MergeOpdbFieldsInto() Title refresh
REGRESSION TESTS (5 new)
- Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5
blank/null/whitespace permutations; pins the documented OpdbId
fallback fires when both name fields are blank
- Map_BlankCommonName_FallsThroughToName — pins the
falls-through-to-Name behavior when CommonName is blank but Name
is non-blank
- Map_BlankShortName_FallsBackToFullManufacturerName — pins the
defense-in-depth fix on the manufacturer-key chain
- MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle
— pins that blanks on a re-sync don't wipe the existing title
OPERATOR HAND-OFF
The fix corrects the mapping logic; the deployed Cosmos catalog is
still in the broken state until the OPDB sync is re-run. Sequenced
hand-off: after this PR merges, re-run `dotnet run --project
src/PinballWizard.Cli -- --source opdb` against the personal
Earlybird subscription. The 58 Stern modern records will repopulate
their titles. Then re-run the W1-3 hardened recuration tool (per PR
#99) to reconcile the eval ground truth against the corrected
catalog.
Tests: 709 → 717. Build clean, zero warnings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued
SELECT TOP 1 c.id ... STRINGEQUALS(c.title, 'Godzilla', true)and took the first hit blindly — recording the Sega record's id under each Godzilla question'sexpected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras.This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (
data/eval/wizard.v1.jsonlanddata/eval/wizard.v1.recuration.json) remain authoritative until then.Changes
tools/eval/wizard.v1.titles.json— add per-questionexpected_manufacturercolumn (lowercase string matching the deployed catalog'smanufacturerpartition value —stern,jjp,sega, etc.). All 30 rows curated from the question text +notesfield. Out-of-scope rows usenullto mirror theirmachine_title=null._aboutfield documents the new column and the failure-mode that motivated it.tools/eval/Recurate.csx— replaceLookupOpdbIdByTitle(returned a tuple fromSELECT TOP 1) withQueryHitsByTitle(returns all hits). Caller walks results and picks the first hit whosemanufacturermatchesexpected_manufacturer(case-insensitive). On no-match, skip the row with newmfg_mismatchoutcome — JSONL untouched. On nullexpected_manufacturerfor an in-scope row, fall back to first-hit-wins and log a "manufacturer-unconstrained" warning. New counts (skipped_mfg_mismatch,manufacturer_unconstrained) flow into the recuration manifest'scountsblock; per-row outcomes carryexpected_manufactureralongsideresolved_manufacturerfor full audit trail.data/eval/README.md— append "Hardening (2026-05-08) — manufacturer-aware skip-on-mismatch" subsection documenting the failure mode + the new behavior + the dry-run verification output, with explicit tooling-only; live re-run sequenced after OPDB sync investigation closes callout.Test Plan
dotnet build PinballWizard.slnx— 0 warnings, 0 errors (nosrc/changes; build remains green).dotnet script tools/eval/Recurate.csx -- --dry-runagainst the deployed Cosmos endpoint (pinwiz-cosmos-dev-hlpz4, the same endpoint used by the W1-3 first run;az loginto the personal Earlybird sub) — captured output appended todata/eval/README.md. Result:ev-rules-0003,ev-valuation-0004,ev-repair-0003) now flaggedmfg_mismatch(expectedstern, catalog hassega) — the failure mode this PR exists to catch.expected_manufacturerset).Out of Scope
data/eval/wizard.v1.jsonlagainst a populated catalog — sequenced after the OPDB sync investigation closes.Recurate.csx— the script doesn't compile as part ofPinballWizard.slnx; the dry-run output documented in README is the test-of-record (it exercises the newmfg_mismatchbranch on real Godzilla rows).Checklist
docs/adr/— N/A (hardening of an existing tool, not a new architectural decision)README.mdand/ordocs/are updated in the same PR (data/eval/README.mdappended)~/.claude/projects/c--projects-PinballWizard/memory/is now stale, it has been updated or removed in the same PR — N/A (no memory referenced this script)TODO/FIXME/ commented-out code committed<NoWarn>without a commentPre-push self-audit
Step 0 —
/local-review(qualitative)/local-reviewand addressed every 🔴 finding before pushdotnet-scripttool that doesn't compile as part ofPinballWizard.slnx; the Godzilla rows actually exercise the newmfg_mismatchbranch. Deferred — justification: there is no project to host xUnit tests in; the dry-run is the canonical verification path documented in the script header.QueryHitsByTitlereads ALL hits per title rather thanSELECT TOP 1. Deferred — justification:TOP 1was the bug; reading all hits is the necessary correctness fix. RU cost is negligible across 30 questions and the dry-run completed in 3.1s (no observable regression vs. the W1-3 first run).Step 1 — Mechanical checklist
*Optionsproperty has at least one real getter call insrc/— N/A (no*Optionsclass added; the newexpected_manufacturerside-car field is read at lines 188-190 and consumed at lines 332/344 ofRecurate.csx)catch { }— no new catches added.ISourceScraper? — N/Amfg_mismatchbranch on real Godzilla rows.dotnet build PinballWizard.slnx→ 0 warnings, 0 errors.git log -1 --format='%an <%ae>'shows personal noreply, not work email — verifiedJim Keeley <94459922+jkeeley2073@users.noreply.github.com>.