fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit by jkeeley2073 · Pull Request #34 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-02T23:18:38Z

Summary

Audit-driven hardening pass before stacking the Cosmos migration on top. This session ran an enterprise-quality audit on the recent Phase 1.2.a/b/c manufacturer-scraper PRs (#31 / #32 / #33) and found one 🔴 critical regression and several ⚠️ drift items that would have been painful to retrofit after a Cosmos migration was already ingesting polluted records.

Critical fix — JJP merch was being scraped as machines

JjpOptions.PinballMachinesCollectionSlug was declared, defaulted in appsettings.json, copied into integration test config, and never read by any code path. The XML doc explicitly described filtering merch out of the catalog — the wiring just never happened. As a result the JJP scraper would emit GameRecord entries for every /products/* URL on JJP's Shopify storefront, including:

jjp-merch-shirt, jjp-flag-tee, jjp-established-tee
jjp-skeleton-hoodie, jjp-flex-fit-hat, jjp-new-era-skull-hat
avatar-pinball-banner-navi-copy

The bug shipped through PRs #31, #32, #33 unchallenged. Wired the option as the canonical filter:

JjpSitemapClient.FetchPinballMachineHandlesAsync fetches /collections/{slug}/products.json, parses the Shopify product handle set
FilterByHandleSet intersects the sitemap output with the handle set
JjpOptions.PinballMachinesCollectionSlug now [Required] so a blank value fails fast at startup

Verified against the live storefront: the configured collection contains 12 actual machines (Harry Potter / Avatar / Guns N' Roses / Toy Story / Elton John editions). All apparel and accessories filtered out.

Drift / parity fixes (Spooky vs JJP/AP)

Item	Before	After
`SpookyGamePageScraper`	inline extraction, no try/catch	private `TryExtract` wrapper logs warning + returns null on failure (matches JJP/AP `TryExtractAsync`)
`SpookyGamePageExtractor.BuildAnchorTextLookup`	bare `catch { }` swallowed everything	`catch (Exception)` so OOM/cancellation propagate; comment documents intent
`SpookyOptions.MaxPagesToFetch`	hardcoded `if (page > 50)` magic	`[Range(1, 1000)]` configurable cap (default 50)
`JjpProductScraper`	unused `_options` field/ctor param	removed; startup log no longer interpolates a property that no longer backs it

New contract test pins `--source <alias>` end-to-end

Added ScraperOrchestrator.KnownSourceCanonicalNames + SourceAliasContractTests. Every ISourceScraper.Name registered in DI must appear in the canonical-names set, otherwise --source <alias> silently produces a no-op run. The test uses RuntimeHelpers.GetUninitializedObject to read each scraper's Name literal without invoking its DI-bound constructor — no test fixtures, no DI host setup, just the property contract.

Durable rule so this class of bug doesn't recur

The dead-config bug shipped through three PRs because no audit ran during the originals. The fix is a checklist that runs at push time, not in a later session:

📄 New ## PR self-audit (pre-push, BLOCKING) section in CLAUDE.md
✅ New ### Pre-push self-audit block in .github/PULL_REQUEST_TEMPLATE.md — visible at PR creation
🧠 New feedback memory: memory/feedback_pre_pr_self_audit.md — visible at session start

Seven-item checklist for additive PRs:

Every new *Options property has at least one real getter call in src/
Sibling-diff for drift against the closest existing implementation
No bare catch { }
New ISourceScraper? SourceAliasContractTests still passes without edit
Tests assert behavior, not just structure
Build is zero-warning
Identity check — personal noreply, not work email

Tests / build

260 tests passing (was 248) — +6 JJP filter tests, +2 contract tests, +4 (existing tests adapted)
Build: zero warnings
DI smoke: dotnet run -- --status clean, app reports existing catalog

Test plan

dotnet build — clean
dotnet test — 260/260 pass
dotnet run -- --status — DI startup clean
Verified Shopify /collections/pinball-machines-for-sale/products.json returns the expected 12 machine handles
Verified merch handles (jjp-merch-shirt, jjp-flag-tee) appear in the unfiltered storefront sample but NOT in the collection products.json
Live JJP scrape end-to-end (deferred — same discipline applied to original PRs)

Out of scope

Cosmos migration PR (the originally-planned next move) — pivoted to this hardening pass first; will land as Dev-CosmosMigration once this merges.
AP /games/{slug}/updates firmware scraper — deferred per feat(ap) American Pinball scraper -- Phase 1.2.b #32 description.
Phase 1.3 manufacturer scrapers — same template, lower priority than Cosmos migration.

🤖 Generated with Claude Code

Audit-driven hardening pass before stacking the Cosmos migration on top. Three findings the audit surfaced after Phase 1.2.a/b/c (JJP/AP/Spooky) all shipped in quick succession against the same template. Critical fix ------------ JJP scraper was emitting GameRecord entries for every /products/* URL on the Shopify storefront -- including JJP-branded apparel and accessories. JjpOptions.PinballMachinesCollectionSlug was declared, defaulted in appsettings.json, copied into integration test config, and never read. The bug shipped through PRs #31, #32, #33 unchallenged. Wired the option as the canonical filter via /collections/{slug}/products.json: fetch the Shopify product handle set, intersect with sitemap URLs. [Required] on the option so a blank value fails fast at startup. Drift / parity fixes (Spooky vs JJP/AP siblings) ------------------------------------------------ - SpookyGamePageScraper: per-page extraction now wrapped in TryExtract (matches JJP/AP), so a single bad page can never abort the run path. - SpookyGamePageExtractor.BuildAnchorTextLookup: bare catch{} replaced with catch (Exception) so OOM/cancellation can propagate. - SpookyOptions.MaxPagesToFetch retires the magic 50 pagination cap. - JjpProductScraper drops the unused _options field/ctor param; the startup log no longer interpolates a removed property. New contract test ----------------- ScraperOrchestrator.KnownSourceCanonicalNames + SourceAliasContractTests pin every ISourceScraper.Name to the --source <alias> filter map. A typo on either side previously produced a silent no-op run; this test catches it. Uses RuntimeHelpers.GetUninitializedObject so we read each scraper's literal Name without DI overhead. Durable rule (prevents recurrence) ---------------------------------- - New `## PR self-audit (pre-push, BLOCKING)` section in CLAUDE.md - New `### Pre-push self-audit` block in PULL_REQUEST_TEMPLATE.md - New feedback memory: `feedback_pre_pr_self_audit.md` Seven-item checklist (option fields read, sibling-diff for drift, no bare catch{}, CLI/orchestrator wiring, behavior-vs-structure tests, zero warnings, identity check). The dead-config bug shipped through three PRs because no audit ran during the originals; this checklist runs at push time, not in a later session. Tests: 248 → 260 (+12). Build: zero warnings. DI smoke: clean.

+            foreach (var product in payload.Products)
+            {
+                if (!string.IsNullOrWhiteSpace(product.Handle))
+                {
+                    handles.Add(product.Handle);
+                }
+            }


+        catch (Exception ex)
+        {
+            Logger.LogWarning(
+                ex, "Spooky scraper: failed to extract page {Url}; skipping.", page.Link);
+            return (null, []);
+        }


jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026

jkeeley2073 merged commit 824178c into main May 2, 2026
3 checks passed

github-advanced-security AI found potential problems May 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit#34

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit#34
jkeeley2073 merged 1 commit into
mainfrom
Dev-ScraperHardening

jkeeley2073 commented May 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jkeeley2073 commented May 2, 2026

Summary

Critical fix — JJP merch was being scraped as machines

Drift / parity fixes (Spooky vs JJP/AP)

New contract test pins --source <alias> end-to-end

Durable rule so this class of bug doesn't recur

Tests / build

Test plan

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

New contract test pins `--source <alias>` end-to-end