Skip to content

chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch#99

Merged
jkeeley2073 merged 2 commits into
mainfrom
Dev-Phase4W13RecurateHardening
May 8, 2026
Merged

chore(eval) Phase 4 W1-3 hardening — mfg-aware re-curation skip-on-mismatch#99
jkeeley2073 merged 2 commits into
mainfrom
Dev-Phase4W13RecurateHardening

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

A spot-check of the W1-3 first run (PR #98) revealed a silent failure mode: the 3 Godzilla questions in the eval set are intended for Stern's 2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998 Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS(c.title, 'Godzilla', true) and took the first hit blindly — recording the Sega record's id under each Godzilla question's expected_citation_set. The agent's correct answer about Stern 2021 would have failed eval because its citation wouldn't match the (incorrect) Sega ground truth. Same risk class exists for any title shared across manufacturers/eras.

This PR ships hardening of the recuration tool only. The next live recuration run is sequenced after the OPDB sync investigation closes (why Stern's modern catalog is currently absent from the deployed Cosmos is a separate root-cause investigation already underway). Running the live script before that closes would compound the catalog-state issue. The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and data/eval/wizard.v1.recuration.json) remain authoritative until then.

Changes

  • tools/eval/wizard.v1.titles.json — add per-question expected_manufacturer column (lowercase string matching the deployed catalog's manufacturer partition value — stern, jjp, sega, etc.). All 30 rows curated from the question text + notes field. Out-of-scope rows use null to mirror their machine_title=null. _about field documents the new column and the failure-mode that motivated it.
  • tools/eval/Recurate.csx — replace LookupOpdbIdByTitle (returned a tuple from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller walks results and picks the first hit whose manufacturer matches expected_manufacturer (case-insensitive). On no-match, skip the row with new mfg_mismatch outcome — JSONL untouched. On null expected_manufacturer for an in-scope row, fall back to first-hit-wins and log a "manufacturer-unconstrained" warning. New counts (skipped_mfg_mismatch, manufacturer_unconstrained) flow into the recuration manifest's counts block; per-row outcomes carry expected_manufacturer alongside resolved_manufacturer for full audit trail.
  • data/eval/README.md — append "Hardening (2026-05-08) — manufacturer-aware skip-on-mismatch" subsection documenting the failure mode + the new behavior + the dry-run verification output, with explicit tooling-only; live re-run sequenced after OPDB sync investigation closes callout.

Test Plan

  • dotnet build PinballWizard.slnx — 0 warnings, 0 errors (no src/ changes; build remains green).
  • dotnet script tools/eval/Recurate.csx -- --dry-run against the deployed Cosmos endpoint (pinwiz-cosmos-dev-hlpz4, the same endpoint used by the W1-3 first run; az login to the personal Earlybird sub) — captured output appended to data/eval/README.md. Result:
    • 3 Godzilla rows (ev-rules-0003, ev-valuation-0004, ev-repair-0003) now flagged mfg_mismatch (expected stern, catalog has sega) — the failure mode this PR exists to catch.
    • 5 JJP rows still resolve correctly (3× The Wizard of Oz, 2× Dialed In!) — JJP is the only manufacturer holding those titles in the deployed catalog.
    • 18 no_match rows unchanged (Stern's modern catalog still absent — separate investigation).
    • 4 out_of_scope rows unchanged (correct refusals).
    • 0 manufacturer-unconstrained (every in-scope row now has expected_manufacturer set).
  • The live (non-dry-run) script was deliberately NOT run; the eval JSONL ground truth and the W1-3 recuration provenance are unchanged.

Out of Scope

  • The OPDB sync investigation (why Stern's modern catalog is missing from deployed Cosmos) — separate root-cause investigation already underway.
  • Re-running the live recuration to update data/eval/wizard.v1.jsonl against a populated catalog — sequenced after the OPDB sync investigation closes.
  • Unit tests for Recurate.csx — the script doesn't compile as part of PinballWizard.slnx; the dry-run output documented in README is the test-of-record (it exercises the new mfg_mismatch branch on real Godzilla rows).

Checklist

  • CI is green (build + test; tooling-only PR — script exercised via dry-run)
  • PR title follows the Conventional Commits format
  • If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (hardening of an existing tool, not a new architectural decision)
  • If user-visible behavior changes, README.md and/or docs/ are updated in the same PR (data/eval/README.md appended)
  • If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — N/A (no memory referenced this script)
  • No TODO / FIXME / commented-out code committed
  • No new entries in <NoWarn> without a comment

Pre-push self-audit

Step 0 — /local-review (qualitative)

  • Ran /local-review and addressed every 🔴 finding before push
  • Local review outcome: 0 🔴 / 2 ⚠️ / 8 categories ✅
    • ⚠️ Test quality: dry-run-output-as-test approach is appropriate for a dotnet-script tool that doesn't compile as part of PinballWizard.slnx; the Godzilla rows actually exercise the new mfg_mismatch branch. Deferred — justification: there is no project to host xUnit tests in; the dry-run is the canonical verification path documented in the script header.
    • ⚠️ Performance: QueryHitsByTitle reads ALL hits per title rather than SELECT TOP 1. Deferred — justification: TOP 1 was the bug; reading all hits is the necessary correctness fix. RU cost is negligible across 30 questions and the dry-run completed in 3.1s (no observable regression vs. the W1-3 first run).

Step 1 — Mechanical checklist

  • Every new *Options property has at least one real getter call in src/ — N/A (no *Options class added; the new expected_manufacturer side-car field is read at lines 188-190 and consumed at lines 332/344 of Recurate.csx)
  • Sibling-diffed against the closest existing implementation; drift is justified or removed — diffed against the prior version of the same script; no drift.
  • No bare catch { } — no new catches added.
  • New ISourceScraper? — N/A
  • Tests assert behavior, not just structure — the dry-run exercises the actual mfg_mismatch branch on real Godzilla rows.
  • Build is zero-warning — dotnet build PinballWizard.slnx → 0 warnings, 0 errors.
  • git log -1 --format='%an <%ae>' shows personal noreply, not work email — verified Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>.

@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 8, 2026
@jkeeley2073 jkeeley2073 force-pushed the Dev-Phase4W13RecurateHardening branch from f8a19c1 to 4817eb0 Compare May 8, 2026 13:39
jkeeley2073 added a commit that referenced this pull request May 8, 2026
…hain

Critical data-correctness fix uncovered while spot-checking the W1-3
recuration outcome. The deployed Cosmos catalog has 58 modern Stern
records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where
title is empty string, breaking IMachineRepository.QueryByTitleAsync —
the function tool the Wizard agent uses to ground every machine
reference. The eval's W1-3 recuration silently picked the Sega 1998
Godzilla because that was the only "Godzilla" with a non-empty title
the title-keyed lookup could find.

ROOT CAUSE
C#'s null-coalescing operator (??) preserves empty strings:
`null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`.
OPDB's /api/export returns some modern Stern records with empty
`name` / `common_name`, so the existing chain
`Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty
title instead of the documented OpdbId fallback. Same issue affected
MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles
back to empty if OPDB returned blanks). Defense-in-depth: the
manufacturer ShortName fallback had the same shape; a blank ShortName
would have produced a "unknown" partition key.

FIX
Introduced FirstNonBlank(params string?[] candidates) helper that
returns the first non-null AND non-whitespace candidate, or null if
every candidate is blank. Applied at three sites in
OpdbMachineMapper:
  - Map() Title fallback chain
  - Map() manufacturer-key resolution (ShortName → Name)
  - MergeOpdbFieldsInto() Title refresh

REGRESSION TESTS (5 new)
  - Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5
    blank/null/whitespace permutations; pins the documented OpdbId
    fallback fires when both name fields are blank
  - Map_BlankCommonName_FallsThroughToName — pins the
    falls-through-to-Name behavior when CommonName is blank but Name
    is non-blank
  - Map_BlankShortName_FallsBackToFullManufacturerName — pins the
    defense-in-depth fix on the manufacturer-key chain
  - MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle
    — pins that blanks on a re-sync don't wipe the existing title

OPERATOR HAND-OFF
The fix corrects the mapping logic; the deployed Cosmos catalog is
still in the broken state until the OPDB sync is re-run. Sequenced
hand-off: after this PR merges, re-run `dotnet run --project
src/PinballWizard.Cli -- --source opdb` against the personal
Earlybird subscription. The 58 Stern modern records will repopulate
their titles. Then re-run the W1-3 hardened recuration tool (per PR
#99) to reconcile the eval ground truth against the corrected
catalog.

Tests: 709 → 717. Build clean, zero warnings.
…smatch

A spot-check of the W1-3 first run (PR #98) revealed a silent failure
mode: the 3 Godzilla questions in the eval set are intended for Stern's
2021 Godzilla, but the deployed Cosmos catalog only contains Sega's 1998
Godzilla. The W1-3 script issued SELECT TOP 1 c.id ... STRINGEQUALS
(c.title, 'Godzilla', true) and took the first hit blindly — recording
the Sega record's id under each Godzilla question's expected_citation_set.
The agent's correct answer about Stern 2021 would have failed eval
because its citation wouldn't match the (incorrect) Sega ground truth.
Same risk class exists for any title shared across manufacturers/eras.

This PR ships hardening of the recuration tool only. The next live
recuration run is sequenced after the OPDB sync investigation closes
(why Stern's modern catalog is currently absent from the deployed Cosmos
is a separate root-cause investigation already underway). Running the
live script before that closes would compound the catalog-state issue.
The W1-3 first run's artifacts (data/eval/wizard.v1.jsonl and
data/eval/wizard.v1.recuration.json) remain authoritative until then.

Changes:
- tools/eval/wizard.v1.titles.json: add per-question expected_manufacturer
  column (lowercase string matching the deployed catalog's `manufacturer`
  partition value — stern, jjp, sega, etc.). All 30 rows curated; out-of-
  scope rows use null to mirror their machine_title=null. _about field
  documents the new column.
- tools/eval/Recurate.csx: replace LookupOpdbIdByTitle (returned a tuple
  from SELECT TOP 1) with QueryHitsByTitle (returns all hits). Caller
  walks results and picks the first hit whose `manufacturer` matches
  expected_manufacturer (case-insensitive). On no-match, skip the row
  with new mfg_mismatch outcome — JSONL untouched. On null
  expected_manufacturer (in-scope row), fall back to first-hit-wins and
  log a manufacturer-unconstrained warning. New counts (skipped_mfg_
  mismatch, manufacturer_unconstrained) flow into the manifest.
- data/eval/README.md: append Hardening (2026-05-08) subsection
  documenting the new behavior + the dry-run verification output, with
  explicit "tooling-only; live re-run sequenced after OPDB sync
  investigation closes" callout.

Verification: dry-run against the same deployed Cosmos endpoint as the
W1-3 first run produces — 0 recurated / 5 unchanged (3x The Wizard of
Oz, 2x Dialed In!) / 4 out_of_scope / 18 no_match (Stern catalog
absent) / 3 mfg_mismatch (Godzilla x3, expected stern, catalog has
sega) / 0 manufacturer-unconstrained. The 3 Godzilla rows are now
correctly flagged rather than silently taking Sega's id. Build clean
(0 warnings).
@jkeeley2073 jkeeley2073 force-pushed the Dev-Phase4W13RecurateHardening branch from fe7dfa9 to 8b5b1a7 Compare May 8, 2026 14:01
jkeeley2073 added a commit that referenced this pull request May 8, 2026
…hain

Critical data-correctness fix uncovered while spot-checking the W1-3
recuration outcome. The deployed Cosmos catalog has 58 modern Stern
records (2018+, including Stern Godzilla Pro 2021 / GweeP-MW95j) where
title is empty string, breaking IMachineRepository.QueryByTitleAsync —
the function tool the Wizard agent uses to ground every machine
reference. The eval's W1-3 recuration silently picked the Sega 1998
Godzilla because that was the only "Godzilla" with a non-empty title
the title-keyed lookup could find.

ROOT CAUSE
C#'s null-coalescing operator (??) preserves empty strings:
`null ?? "" ?? fallback` evaluates to `""`, never reaching `fallback`.
OPDB's /api/export returns some modern Stern records with empty
`name` / `common_name`, so the existing chain
`Title = dto.CommonName ?? dto.Name ?? dto.OpdbId` produced an empty
title instead of the documented OpdbId fallback. Same issue affected
MergeOpdbFieldsInto on subsequent re-syncs (would WIPE existing titles
back to empty if OPDB returned blanks). Defense-in-depth: the
manufacturer ShortName fallback had the same shape; a blank ShortName
would have produced a "unknown" partition key.

FIX
Introduced FirstNonBlank(params string?[] candidates) helper that
returns the first non-null AND non-whitespace candidate, or null if
every candidate is blank. Applied at three sites in
OpdbMachineMapper:
  - Map() Title fallback chain
  - Map() manufacturer-key resolution (ShortName → Name)
  - MergeOpdbFieldsInto() Title refresh

REGRESSION TESTS (5 new)
  - Map_BlankCommonNameAndName_FallsBackToOpdbId — Theory with 5
    blank/null/whitespace permutations; pins the documented OpdbId
    fallback fires when both name fields are blank
  - Map_BlankCommonName_FallsThroughToName — pins the
    falls-through-to-Name behavior when CommonName is blank but Name
    is non-blank
  - Map_BlankShortName_FallsBackToFullManufacturerName — pins the
    defense-in-depth fix on the manufacturer-key chain
  - MergeOpdbFieldsInto_BlankCommonNameAndName_PreservesExistingTitle
    — pins that blanks on a re-sync don't wipe the existing title

OPERATOR HAND-OFF
The fix corrects the mapping logic; the deployed Cosmos catalog is
still in the broken state until the OPDB sync is re-run. Sequenced
hand-off: after this PR merges, re-run `dotnet run --project
src/PinballWizard.Cli -- --source opdb` against the personal
Earlybird subscription. The 58 Stern modern records will repopulate
their titles. Then re-run the W1-3 hardened recuration tool (per PR
#99) to reconcile the eval ground truth against the corrected
catalog.

Tests: 709 → 717. Build clean, zero warnings.
@jkeeley2073 jkeeley2073 merged commit e294e42 into main May 8, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant