fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit #34
Merged
Audit-driven hardening pass before stacking the Cosmos migration on top. Three findings the audit surfaced after Phase 1.2.a/b/c (JJP/AP/Spooky) all shipped in quick succession against the same template.

Critical fix
------------
JJP scraper was emitting GameRecord entries for every /products/* URL on the Shopify storefront -- including JJP-branded apparel and accessories. JjpOptions.PinballMachinesCollectionSlug was declared, defaulted in appsettings.json, copied into integration test config, and never read. The bug shipped through PRs #31, #32, #33 unchallenged.

Wired the option as the canonical filter via /collections/{slug}/products.json: fetch the Shopify product handle set, intersect with sitemap URLs. [Required] on the option so a blank value fails fast at startup.

Drift / parity fixes (Spooky vs JJP/AP siblings)
------------------------------------------------
- SpookyGamePageScraper: per-page extraction now wrapped in TryExtract (matches JJP/AP), so a single bad page can never abort the run path.
- SpookyGamePageExtractor.BuildAnchorTextLookup: bare catch{} replaced with catch (Exception) so OOM/cancellation can propagate.
- SpookyOptions.MaxPagesToFetch retires the magic 50 pagination cap.
- JjpProductScraper drops the unused _options field/ctor param; the startup log no longer interpolates a removed property.

New contract test
-----------------
ScraperOrchestrator.KnownSourceCanonicalNames + SourceAliasContractTests pin every ISourceScraper.Name to the --source <alias> filter map. A typo on either side previously produced a silent no-op run; this test catches it. Uses RuntimeHelpers.GetUninitializedObject so we read each scraper's literal Name without DI overhead.

Durable rule (prevents recurrence)
----------------------------------
- New `## PR self-audit (pre-push, BLOCKING)` section in CLAUDE.md
- New `### Pre-push self-audit` block in PULL_REQUEST_TEMPLATE.md
- New feedback memory: `feedback_pre_pr_self_audit.md`

Seven-item checklist (option fields read, sibling-diff for drift, no bare catch{}, CLI/orchestrator wiring, behavior-vs-structure tests, zero warnings, identity check). The dead-config bug shipped through three PRs because no audit ran during the originals; this checklist runs at push time, not in a later session.

Tests: 248 → 260 (+12). Build: zero warnings. DI smoke: clean.
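The contract test described in the commit message can be illustrated with a minimal sketch. It is not the PR's `SourceAliasContractTests`: `ISourceScraper` here is a one-property stand-in, and `DemoScraper` and `FindUnknownNames` are hypothetical names. The sketch assumes each scraper's `Name` is an expression-bodied literal, because `RuntimeHelpers.GetUninitializedObject` skips constructors and field initializers alike.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

// Stand-in for the real interface; only the Name property matters here.
public interface ISourceScraper
{
    string Name { get; }
}

// Hypothetical scraper whose constructor needs a dependency the test lacks.
public sealed class DemoScraper : ISourceScraper
{
    public DemoScraper(object dependency)
    {
        if (dependency is null) throw new ArgumentNullException(nameof(dependency));
    }

    // Expression-bodied literal: readable even though no constructor ran.
    public string Name => "demo";
}

public static class SourceAliasContractSketch
{
    // Returns every scraper Name that is missing from the known-names set.
    public static IReadOnlyList<string> FindUnknownNames(
        IEnumerable<Type> scraperTypes, ISet<string> knownNames)
    {
        return scraperTypes
            // Allocate without running any constructor, then read the literal.
            .Select(t => ((ISourceScraper)RuntimeHelpers.GetUninitializedObject(t)).Name)
            .Where(name => !knownNames.Contains(name))
            .ToList();
    }
}
```

A test built this way would assert that `FindUnknownNames` returns an empty list for the registered scraper types. A `Name` backed by an auto-property initializer (`{ get; } = "demo"`) would read as null under this approach, which is why the literal-getter assumption matters.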
Comment on lines +167 to +173

    foreach (var product in payload.Products)
    {
        if (!string.IsNullOrWhiteSpace(product.Handle))
        {
            handles.Add(product.Handle);
        }
    }
Comment on lines +111 to +116

    catch (Exception ex)
    {
        Logger.LogWarning(
            ex, "Spooky scraper: failed to extract page {Url}; skipping.", page.Link);
        return (null, []);
    }
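One caveat on the catch block quoted above, offered as a sketch and not as a description of the merged code: in C#, `catch (Exception)` still matches `OperationCanceledException` and `OutOfMemoryException`, since both derive from `Exception`. If the intent is for cancellation and OOM to propagate, an exception filter is one way to express it. `TryExtractSketch`, its delegate parameters, and the string return type are hypothetical simplifications.

```csharp
using System;

public static class TryExtractSketch
{
    // Runs the extractor; returns null and logs when it fails with a recoverable error.
    // Cancellation and out-of-memory are excluded by the filter, so they propagate.
    public static string? TryExtract(Func<string> extract, Action<Exception> logWarning)
    {
        try
        {
            return extract();
        }
        catch (Exception ex) when (ex is not OperationCanceledException
                                      and not OutOfMemoryException)
        {
            logWarning(ex);
            return null;
        }
    }
}
```

Because `TaskCanceledException` derives from `OperationCanceledException`, the same filter also lets cancelled HTTP calls surface to the caller.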
This was referenced May 2, 2026
Summary
Audit-driven hardening pass before stacking the Cosmos migration on top. This session ran an enterprise-quality audit on the recent Phase 1.2.a/b/c manufacturer-scraper PRs (#31 / #32 / #33) and found one 🔴 critical regression and several ⚠️ drift items that would have been painful to retrofit after a Cosmos migration was already ingesting polluted records.
Critical fix — JJP merch was being scraped as machines
`JjpOptions.PinballMachinesCollectionSlug` was declared, defaulted in `appsettings.json`, copied into integration test config, and never read by any code path. The XML doc explicitly described filtering merch out of the catalog; the wiring just never happened. As a result the JJP scraper would emit `GameRecord` entries for every `/products/*` URL on JJP's Shopify storefront, including:

- `jjp-merch-shirt`, `jjp-flag-tee`, `jjp-established-tee`
- `jjp-skeleton-hoodie`, `jjp-flex-fit-hat`, `jjp-new-era-skull-hat`
- `avatar-pinball-banner-navi-copy`

The bug shipped through PRs #31, #32, #33 unchallenged. Wired the option as the canonical filter:

- `JjpSitemapClient.FetchPinballMachineHandlesAsync` fetches `/collections/{slug}/products.json`, parses the Shopify product handle set
- `FilterByHandleSet` intersects the sitemap output with the handle set
- `JjpOptions.PinballMachinesCollectionSlug` now `[Required]` so a blank value fails fast at startup

Verified against the live storefront: the configured collection contains 12 actual machines (Harry Potter / Avatar / Guns N' Roses / Toy Story / Elton John editions). All apparel and accessories filtered out.
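The parse-then-intersect step above can be sketched roughly as follows. This is an illustration, not the PR's implementation: the class name, the case-insensitive comparison, and matching on the last URL path segment are assumptions, and the payload shape (a top-level `products` array whose entries carry a `handle` string) is assumed from the `payload.Products` / `product.Handle` usage in the diff.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class HandleFilterSketch
{
    // Collects the non-blank product handles from a products.json payload.
    public static HashSet<string> ParseHandles(string json)
    {
        var handles = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        using var doc = JsonDocument.Parse(json);
        if (!doc.RootElement.TryGetProperty("products", out var products) ||
            products.ValueKind != JsonValueKind.Array)
        {
            return handles; // Unexpected shape: nothing passes the filter.
        }

        foreach (var product in products.EnumerateArray())
        {
            if (product.TryGetProperty("handle", out var handle) &&
                handle.ValueKind == JsonValueKind.String)
            {
                var value = handle.GetString();
                if (!string.IsNullOrWhiteSpace(value))
                {
                    handles.Add(value.Trim());
                }
            }
        }

        return handles;
    }

    // Keeps only the sitemap URLs whose last path segment is a known handle.
    public static List<string> FilterByHandleSet(
        IEnumerable<string> sitemapUrls, ISet<string> handles)
    {
        return sitemapUrls
            .Where(url => Uri.TryCreate(url, UriKind.Absolute, out var uri)
                          && handles.Contains(uri.Segments[^1].TrimEnd('/')))
            .ToList();
    }
}
```

With a handle set parsed from the collection payload, a sitemap URL such as `https://example.com/products/jjp-flag-tee` is dropped because its handle is absent from the set, while URLs for handles in the collection are kept. Returning an empty set on an unexpected payload means the scraper emits nothing instead of everything, which is the safer failure direction for this bug class.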
Drift / parity fixes (Spooky vs JJP/AP)
- `SpookyGamePageScraper`: `TryExtract` wrapper logs a warning and returns null on failure (matches JJP/AP `TryExtractAsync`)
- `SpookyGamePageExtractor.BuildAnchorTextLookup`: `catch { }` swallowed everything; now `catch (Exception)` so OOM/cancellation propagate; comment documents intent
- `SpookyOptions.MaxPagesToFetch`: the `if (page > 50)` magic number is replaced by a `[Range(1, 1000)]` configurable cap (default 50)
- `JjpProductScraper`: unused `_options` field/ctor param dropped

New contract test pins `--source <alias>` end-to-end

Added `ScraperOrchestrator.KnownSourceCanonicalNames` + `SourceAliasContractTests`. Every `ISourceScraper.Name` registered in DI must appear in the canonical-names set, otherwise `--source <alias>` silently produces a no-op run. The test uses `RuntimeHelpers.GetUninitializedObject` to read each scraper's `Name` literal without invoking its DI-bound constructor: no test fixtures, no DI host setup, just the property contract.

Durable rule so this class of bug doesn't recur
The dead-config bug shipped through three PRs because no audit ran during the originals. The fix is a checklist that runs at push time, not in a later session:
- `## PR self-audit (pre-push, BLOCKING)` section in `CLAUDE.md`
- `### Pre-push self-audit` block in `.github/PULL_REQUEST_TEMPLATE.md` (visible at PR creation)
- `memory/feedback_pre_pr_self_audit.md` (visible at session start)

Seven-item checklist for additive PRs:
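1. Option fields read: `*Options` property has at least one real getter call in `src/`
2. Sibling-diff for drift
3. No bare `catch { }`
4. CLI/orchestrator wiring: `ISourceScraper`? `SourceAliasContractTests` still passes without edit
5. Behavior-vs-structure tests
6. Zero warnings
7. Identity check

Tests / build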
- Tests: 248 → 260 (+12)
- Build: zero warnings
- DI smoke: `dotnet run -- --status` clean, app reports existing catalog

Test plan
- `dotnet build`: clean
- `dotnet test`: 260/260 pass
- `dotnet run -- --status`: DI startup clean
- `/collections/pinball-machines-for-sale/products.json` returns the expected 12 machine handles
- Merch handles (`jjp-merch-shirt`, `jjp-flag-tee`) appear in the unfiltered storefront sample but NOT in the collection `products.json`

Out of scope
- `Dev-CosmosMigration` once this merges.
- `/games/{slug}/updates` firmware scraper: deferred per feat(ap) American Pinball scraper -- Phase 1.2.b #32 description.

🤖 Generated with Claude Code