Skip to content

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit#34

Merged
jkeeley2073 merged 1 commit into
mainfrom
Dev-ScraperHardening
May 2, 2026
Merged

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit#34
jkeeley2073 merged 1 commit into
mainfrom
Dev-ScraperHardening

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Audit-driven hardening pass before stacking the Cosmos migration on top. This session ran an enterprise-quality audit on the recent Phase 1.2.a/b/c manufacturer-scraper PRs (#31 / #32 / #33) and found one 🔴 critical regression and several ⚠️ drift items that would have been painful to retrofit after a Cosmos migration was already ingesting polluted records.

Critical fix — JJP merch was being scraped as machines

JjpOptions.PinballMachinesCollectionSlug was declared, defaulted in appsettings.json, copied into integration test config, and never read by any code path. The XML doc explicitly described filtering merch out of the catalog — the wiring just never happened. As a result the JJP scraper would emit GameRecord entries for every /products/* URL on JJP's Shopify storefront, including:

  • jjp-merch-shirt, jjp-flag-tee, jjp-established-tee
  • jjp-skeleton-hoodie, jjp-flex-fit-hat, jjp-new-era-skull-hat
  • avatar-pinball-banner-navi-copy

The bug shipped through PRs #31, #32, #33 unchallenged. Wired the option as the canonical filter:

  • JjpSitemapClient.FetchPinballMachineHandlesAsync fetches /collections/{slug}/products.json, parses the Shopify product handle set
  • FilterByHandleSet intersects the sitemap output with the handle set
  • JjpOptions.PinballMachinesCollectionSlug now [Required] so a blank value fails fast at startup

Verified against the live storefront: the configured collection contains 12 actual machines (Harry Potter / Avatar / Guns N' Roses / Toy Story / Elton John editions). All apparel and accessories filtered out.

Drift / parity fixes (Spooky vs JJP/AP)

Item Before After
SpookyGamePageScraper inline extraction, no try/catch private TryExtract wrapper logs warning + returns null on failure (matches JJP/AP TryExtractAsync)
SpookyGamePageExtractor.BuildAnchorTextLookup bare catch { } swallowed everything catch (Exception) so OOM/cancellation propagate; comment documents intent
SpookyOptions.MaxPagesToFetch hardcoded if (page > 50) magic [Range(1, 1000)] configurable cap (default 50)
JjpProductScraper unused _options field/ctor param removed; startup log no longer interpolates a property that no longer backs it

New contract test pins --source <alias> end-to-end

Added ScraperOrchestrator.KnownSourceCanonicalNames + SourceAliasContractTests. Every ISourceScraper.Name registered in DI must appear in the canonical-names set, otherwise --source <alias> silently produces a no-op run. The test uses RuntimeHelpers.GetUninitializedObject to read each scraper's Name literal without invoking its DI-bound constructor — no test fixtures, no DI host setup, just the property contract.

Durable rule so this class of bug doesn't recur

The dead-config bug shipped through three PRs because no audit ran during the originals. The fix is a checklist that runs at push time, not in a later session:

  • 📄 New ## PR self-audit (pre-push, BLOCKING) section in CLAUDE.md
  • ✅ New ### Pre-push self-audit block in .github/PULL_REQUEST_TEMPLATE.md — visible at PR creation
  • 🧠 New feedback memory: memory/feedback_pre_pr_self_audit.md — visible at session start

Seven-item checklist for additive PRs:

  1. Every new *Options property has at least one real getter call in src/
  2. Sibling-diff for drift against the closest existing implementation
  3. No bare catch { }
  4. New ISourceScraper? SourceAliasContractTests still passes without edit
  5. Tests assert behavior, not just structure
  6. Build is zero-warning
  7. Identity check — personal noreply, not work email

Tests / build

  • 260 tests passing (was 248) — +6 JJP filter tests, +2 contract tests, +4 (existing tests adapted)
  • Build: zero warnings
  • DI smoke: dotnet run -- --status clean, app reports existing catalog

Test plan

  • dotnet build — clean
  • dotnet test — 260/260 pass
  • dotnet run -- --status — DI startup clean
  • Verified Shopify /collections/pinball-machines-for-sale/products.json returns the expected 12 machine handles
  • Verified merch handles (jjp-merch-shirt, jjp-flag-tee) appear in the unfiltered storefront sample but NOT in the collection products.json
  • Live JJP scrape end-to-end (deferred — same discipline applied to original PRs)

Out of scope

  • Cosmos migration PR (the originally-planned next move) — pivoted to this hardening pass first; will land as Dev-CosmosMigration once this merges.
  • AP /games/{slug}/updates firmware scraper — deferred per feat(ap) American Pinball scraper -- Phase 1.2.b #32 description.
  • Phase 1.3 manufacturer scrapers — same template, lower priority than Cosmos migration.

🤖 Generated with Claude Code

Audit-driven hardening pass before stacking the Cosmos migration on top.
Three findings the audit surfaced after Phase 1.2.a/b/c (JJP/AP/Spooky)
all shipped in quick succession against the same template.

Critical fix
------------
JJP scraper was emitting GameRecord entries for every /products/* URL on
the Shopify storefront -- including JJP-branded apparel and accessories.
JjpOptions.PinballMachinesCollectionSlug was declared, defaulted in
appsettings.json, copied into integration test config, and never read.
The bug shipped through PRs #31, #32, #33 unchallenged. Wired the option
as the canonical filter via /collections/{slug}/products.json: fetch
the Shopify product handle set, intersect with sitemap URLs.
[Required] on the option so a blank value fails fast at startup.

Drift / parity fixes (Spooky vs JJP/AP siblings)
------------------------------------------------
- SpookyGamePageScraper: per-page extraction now wrapped in TryExtract
  (matches JJP/AP), so a single bad page can never abort the run path.
- SpookyGamePageExtractor.BuildAnchorTextLookup: bare catch{} replaced
  with catch (Exception) so OOM/cancellation can propagate.
- SpookyOptions.MaxPagesToFetch retires the magic 50 pagination cap.
- JjpProductScraper drops the unused _options field/ctor param; the
  startup log no longer interpolates a removed property.

New contract test
-----------------
ScraperOrchestrator.KnownSourceCanonicalNames + SourceAliasContractTests
pin every ISourceScraper.Name to the --source <alias> filter map. A
typo on either side previously produced a silent no-op run; this test
catches it. Uses RuntimeHelpers.GetUninitializedObject so we read
each scraper's literal Name without DI overhead.

Durable rule (prevents recurrence)
----------------------------------
- New `## PR self-audit (pre-push, BLOCKING)` section in CLAUDE.md
- New `### Pre-push self-audit` block in PULL_REQUEST_TEMPLATE.md
- New feedback memory: `feedback_pre_pr_self_audit.md`
Seven-item checklist (option fields read, sibling-diff for drift, no
bare catch{}, CLI/orchestrator wiring, behavior-vs-structure tests,
zero warnings, identity check). The dead-config bug shipped through
three PRs because no audit ran during the originals; this checklist
runs at push time, not in a later session.

Tests: 248 → 260 (+12). Build: zero warnings. DI smoke: clean.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026
@jkeeley2073 jkeeley2073 merged commit 824178c into main May 2, 2026
3 checks passed
Comment on lines +167 to +173
foreach (var product in payload.Products)
{
if (!string.IsNullOrWhiteSpace(product.Handle))
{
handles.Add(product.Handle);
}
}
Comment on lines +111 to +116
catch (Exception ex)
{
Logger.LogWarning(
ex, "Spooky scraper: failed to extract page {Url}; skipping.", page.Link);
return (null, []);
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants