feat(jjp) Jersey Jack Pinball scraper -- Phase 1.2.a by jkeeley2073 · Pull Request #31 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-02T22:36:17Z

Summary

Phase 1.2.a — first non-Stern manufacturer scraper on the polite + Clean Architecture foundation. Validates that the layout cleanly accommodates a totally different source-site shape (Shopify storefront with sitemap + JSON-LD product schema, no Vue.js, no manuals on product pages) without any foundation refactoring.

Recon (via WebFetch before designing the scraper)

jerseyjackpinball.com is Shopify-based (server-rendered HTML) → HTTP scraping via PoliteScraperBase, no Playwright needed
robots.txt allows catalog crawling; explicitly blocks /cart, /checkout, /account, /admin paths and asserts "Checkouts are for humans" against automated buy-for-me agents. We only read catalog pages.
Catalog at /collections/pinball-machines-for-sale, products at /products/{slug}
Sitemap at /sitemap.xml — Shopify INDEX referencing sitemap_products_*.xml children
JSON-LD product schema present on product pages with name / description / image[] / offers (price + Schema.org availability)

Per the locked feedback memory feedback_machine_consumer_metadata_first.md, sitemap-first discovery + JSON-LD extraction beats DOM scraping. Designed accordingly.

What ships

| Component | What |
| --- | --- |
| `JjpOptions` | Bound from `"Jjp"` config section |
| `SourceType.JjpProductPage` | New enum value in `Enums.cs` |
| `["jjp"] = "JJP"` source-filter alias | `ScraperOrchestrator.SourceAliases` |
| `JjpSitemapClient` | Sitemap-first discovery: index → `sitemap_products_*.xml` children → `/products/{slug}` URLs. XML parsing surface exposed as static methods for direct testing. |
| `JjpProductExtractor` | Pure-function HTML → `GameRecord`. Prefers JSON-LD product schema; falls back to og: tags; last resort H1. Schema.org availability normalized. Tolerates malformed JSON-LD, array-wrapped JSON-LD, and non-Product JSON-LD. |
| `JjpProductScraper` | Extends `PoliteScraperBase`, implements `ISourceScraper`. Politeness + 429 backoff inherited from the gate. |
| `AddJjpScraping` DI extension | Two typed `HttpClient`s (sitemap + product), polite UA, Shopify-appropriate `Accept` headers |
| CLI | `--source jjp` invokes the scraper |

ID strategy

JJP GameRecord IDs use prefix game_jjp_{slug} to avoid colliding with Stern's existing game_{slug} pattern. Both manufacturers coexist in the same games.json catalog.
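
As a rough illustration of the ID scheme above, the sketch below derives a `game_jjp_{slug}` ID from a product URL. The class name `JjpIdSketch`, both method names, and the `example-machine` slug are hypothetical; the real slug parsing in `JjpProductExtractor` handles four URL shapes that are not shown in this PR text and may well differ.

```csharp
using System;

public static class JjpIdSketch
{
    // Hypothetical helper: returns the path segment(s) after "/products/",
    // or null when the URL is not a product URL.
    public static string? SlugFromUrl(Uri productUrl)
    {
        ArgumentNullException.ThrowIfNull(productUrl);
        const string marker = "/products/";
        var path = productUrl.AbsolutePath; // excludes query string and fragment
        var idx = path.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (idx < 0) return null;
        var slug = path[(idx + marker.Length)..].Trim('/');
        return slug.Length == 0 ? null : slug.ToLowerInvariant();
    }

    // The JJP prefix keeps these IDs apart from Stern's game_{slug} pattern.
    public static string? GameId(Uri productUrl) =>
        SlugFromUrl(productUrl) is { } slug ? $"game_jjp_{slug}" : null;
}
```

Under these assumptions, `JjpIdSketch.GameId(new Uri("https://jerseyjackpinball.com/products/example-machine/"))` yields `game_jjp_example-machine`, which cannot collide with a Stern ID of the form `game_example-machine`.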

Test Plan

dotnet build /warnaserror — zero warnings
dotnet test — 201/201 passing (was 185 pre-PR; +16 new JJP tests)
All 185 prior tests still pass
CI re-validates on this PR
Author = personal noreply

What the 16 new tests cover

JjpSitemapClientTests (5): sitemap-index parsing returns only sitemap_products entries (skips pages / collections / blogs); product-sitemap parsing returns only /products/ paths (skips collections / pages); empty sitemap returns empty list; null-arg validation on both static methods
JjpProductExtractorTests (11): full JSON-LD product happy path populates every mapped field; og-only fallback works when no JSON-LD product is present; non-Product JSON-LD on the page falls through to og: tags; array-wrapped JSON-LD parses correctly; malformed JSON-LD is tolerated; slug parsing across 4 URL shapes; null-arg validation

Out of Scope (intentional follow-ups)

JJP service-bulletin / manual discovery — JJP doesn't host docs on /products/. They may be on a separate /pages/support area; needs reconnaissance.
Cosmos Machine repository writes — this PR uses the existing catalog.json pipeline (mirroring Stern). Migrating the manufacturer scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after the live deployment exists.
OPDB-id reconciliation — a JJP product → Machine.ManufacturerSlugs["jjp"] backfill needs the Machine repository populated by the OPDB sync first; lives in the same Cosmos-migration concern.
Per-source PolitenessOverrides from IngestionSource — JJP currently uses the global Politeness defaults; per-source overrides ride along with the eventual IngestionSource Cosmos integration.

What this unblocks

AP (American Pinball) and Spooky scrapers can now follow the same template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield. Each lands as its own focused PR; together with Stern + JJP they cover ~95% of currently-shipping commercial pinball.
The polite-scraping pattern is validated against a second source — first proof that PoliteScraperBase + the gate generalize beyond Stern.

First non-Stern manufacturer scraper on the polite + Clean Architecture foundation. Validates that the layout cleanly accommodates a totally different source-site shape (Shopify storefront with sitemap + JSON-LD product schema, no Vue.js, no manuals on product pages) without any foundation refactoring.

Recon (via WebFetch before designing the scraper):
- jerseyjackpinball.com is Shopify-based (server-rendered HTML)
- robots.txt allows catalog crawling; explicitly blocks cart/checkout/account/admin paths and asserts "Checkouts are for humans" against automated buy-for-me agents (we only read catalog).
- Catalog at /collections/pinball-machines-for-sale; products at /products/{slug}
- Sitemap at /sitemap.xml -- Shopify INDEX referencing sitemap_products_*.xml children
- JSON-LD product schema present on product pages with name / description / image[] / offers (price + Schema.org availability)

Code:

src/PinballWizard.Core/Configuration/JjpOptions.cs
  Bound from "Jjp" config section. BaseUrl, SitemapPath, PinballMachinesCollectionSlug.

src/PinballWizard.Core/Models/Enums.cs
  Adds SourceType.JjpProductPage.

src/PinballWizard.Application/ScraperOrchestrator.cs
  Adds ["jjp"] = "JJP" alias to the source-filter map.

src/PinballWizard.Infrastructure/Scraping/Jjp/
- JjpSitemapClient
  Sitemap-first discovery (per the locked feedback memory feedback_machine_consumer_metadata_first.md). Reads index, follows sitemap_products_*.xml children, returns /products/{slug} URLs. XML parsing surface exposed as static methods for direct testing.
- JjpProductExtractor
  Pure-function HTML -> GameRecord. Prefers JSON-LD product schema; falls back to og: tags; last resort H1. Schema.org availability normalized to in_stock / out_of_stock / preorder / discontinued. Tolerates malformed JSON-LD, array-wrapped JSON-LD, and non-Product JSON-LD on the same page.
- JjpProductScraper
  Extends PoliteScraperBase, implements ISourceScraper. Discovers via the sitemap client, fetches each product page, yields ScrapedItem with .Game = GameRecord. Politeness and 429 backoff inherited from the gate.
- AddJjpScraping
  DI extension. Two typed HttpClients (sitemap + product) with the polite UA and Shopify-appropriate Accept headers. Bridges the typed-client registration into the ISourceScraper enumerable.

src/PinballWizard.Cli/
- Program.cs
  Wires AddJjpScraping; updates --source flag help text.
- appsettings.json
  Adds Jjp section with default Shopify-shape config.

ID strategy: JJP GameRecord IDs use prefix game_jjp_{slug} to avoid colliding with Stern's existing game_{slug} pattern. Both manufacturers coexist in the same games.json catalog.

Tests (16 new, 201 total passing):
- JjpSitemapClientTests
  Sitemap-index parsing returns only sitemap_products entries (skips pages / collections / blogs sitemaps); product-sitemap parsing returns only /products/ paths (skips collections / pages); empty sitemap returns empty list; null-arg validation.
- JjpProductExtractorTests
  Full JSON-LD product happy path populates every mapped field; og-only fallback works when no JSON-LD product is present; non-Product JSON-LD on the page falls through to og: tags; array-wrapped JSON-LD parses correctly; malformed JSON-LD is tolerated; slug parsing across 4 URL shapes; null-arg validation.

IntegrationTests.cs updated to register JJP via AddJjpScraping and assert the ISourceScraper enumerable now resolves four scrapers.

Out of scope (intentional follow-ups):
- JJP service-bulletin / manual discovery -- JJP doesn't host docs on /products/. They may be on a separate /pages/support area; needs reconnaissance.
- Cosmos Machine repository writes -- this PR uses the existing catalog.json pipeline (mirroring Stern). Migrating the manufacturer scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after the live deployment exists.
- OPDB-id reconciliation -- a JJP-product -> Machine.ManufacturerSlugs["jjp"] backfill needs the Machine repository populated by the OPDB sync first; lives in the same Cosmos-migration concern.
- Per-source PolitenessOverrides from IngestionSource -- JJP currently uses the global Politeness defaults; per-source overrides ride along with the eventual IngestionSource Cosmos integration.

What this unblocks:
- AP (American Pinball) and Spooky scrapers can now follow the same template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield. Each lands as its own focused PR; together with Stern + JJP they cover ~95% of currently-shipping commercial pinball.

+            "outofstock" => "out_of_stock",
+            "preorder" => "preorder",
+            "discontinued" => "discontinued",
+            _ => lastSegment?.ToLowerInvariant(),
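
The fragment above shows only the tail of the availability switch. A self-contained sketch of how the whole normalization might look follows, assuming the input is a Schema.org availability value such as `https://schema.org/InStock`. The class and method names are hypothetical, and the `"instock"` arm is inferred from the `in_stock` value named in the commit message rather than copied from the diff.

```csharp
using System;

public static class AvailabilitySketch
{
    // Maps a Schema.org availability value (URL or bare token) to the
    // catalog's snake_case vocabulary. Unknown values pass through lowercased.
    public static string? Normalize(string? availability)
    {
        if (string.IsNullOrWhiteSpace(availability)) return null;

        // Keep only the last path segment: "https://schema.org/InStock" -> "InStock".
        var trimmed = availability.Trim().TrimEnd('/');
        var lastSegment = trimmed[(trimmed.LastIndexOf('/') + 1)..];

        return lastSegment.ToLowerInvariant() switch
        {
            "instock" => "in_stock", // assumed arm, not visible in the fragment
            "outofstock" => "out_of_stock",
            "preorder" => "preorder",
            "discontinued" => "discontinued",
            var other => other,
        };
    }
}
```

Passing unknown values through lowercased, as the fragment's default arm does, keeps unexpected Schema.org states visible in the catalog instead of silently dropping them.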


+        foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']"))
+        {
+            var text = script.TextContent;
+            if (string.IsNullOrWhiteSpace(text)) continue;
+
+            JsonElement root;
+            try
+            {
+                using var parsed = JsonDocument.Parse(text);
+                root = parsed.RootElement.Clone();
+            }
+            catch (JsonException)
+            {
+                continue;
+            }
+
+            // Some Shopify themes wrap JSON-LD in an array; some don't.
+            if (root.ValueKind == JsonValueKind.Array)
+            {
+                foreach (var item in root.EnumerateArray())
+                {
+                    if (TryReadProduct(item) is { } prod) return prod;
+                }
+            }
+            else
+            {
+                if (TryReadProduct(root) is { } prod) return prod;
+            }
+        }
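
`TryReadProduct` itself is not shown in this excerpt. The sketch below is one plausible shape for that check, using only `System.Text.Json`: it accepts an element only when it is an object whose `@type` is the string `Product`, and it reads just the `name` field. The names `JsonLdSketch`, `TryReadProductName`, and `FindProductName` are hypothetical, and the real method presumably maps more fields (description, images, offers).

```csharp
using System;
using System.Text.Json;

public static class JsonLdSketch
{
    // Returns the product name when the element is a JSON-LD object with
    // "@type": "Product"; returns null for anything else.
    public static string? TryReadProductName(JsonElement element)
    {
        if (element.ValueKind != JsonValueKind.Object) return null;
        if (!element.TryGetProperty("@type", out var type)) return null;
        if (type.ValueKind != JsonValueKind.String ||
            !string.Equals(type.GetString(), "Product", StringComparison.OrdinalIgnoreCase))
        {
            return null; // e.g. Organization or BreadcrumbList blocks
        }
        return element.TryGetProperty("name", out var name) &&
               name.ValueKind == JsonValueKind.String
            ? name.GetString()
            : null;
    }

    // Mirrors the loop above for one script body: tolerate malformed JSON,
    // then handle both array-wrapped and bare JSON-LD.
    public static string? FindProductName(string jsonLdText)
    {
        JsonElement root;
        try
        {
            using var parsed = JsonDocument.Parse(jsonLdText);
            root = parsed.RootElement.Clone(); // Clone outlives the disposed document
        }
        catch (JsonException)
        {
            return null;
        }

        if (root.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in root.EnumerateArray())
            {
                if (TryReadProductName(item) is { } found) return found;
            }
            return null;
        }
        return TryReadProductName(root);
    }
}
```

This sketch ignores the case where `@type` is itself an array, which JSON-LD permits; whether the real extractor handles that is not stated in the PR.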



+                foreach (var item in imageProp.EnumerateArray())
+                {
+                    if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
+                    {
+                        images.Add(s);
+                    }
+                }


+        catch (Exception ex)
+        {
+            Logger.LogWarning(ex, "JJP scraper: failed to fetch / extract {Url}; skipping.", productUrl);
+            return null;
+        }


+        foreach (var sitemap in doc.Descendants(SitemapNs + "sitemap"))
+        {
+            var loc = sitemap.Element(SitemapNs + "loc")?.Value;
+            if (string.IsNullOrWhiteSpace(loc)) continue;
+            if (loc.Contains("sitemap_products", StringComparison.OrdinalIgnoreCase) &&
+                Uri.TryCreate(loc, UriKind.Absolute, out var uri))
+            {
+                sitemaps.Add(uri);
+            }
+        }


+        foreach (var url in doc.Descendants(SitemapNs + "url"))
+        {
+            var loc = url.Element(SitemapNs + "loc")?.Value;
+            if (string.IsNullOrWhiteSpace(loc)) continue;
+            if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;
+            if (!uri.AbsolutePath.Contains("/products/", StringComparison.OrdinalIgnoreCase)) continue;
+            urls.Add(uri);
+        }
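
To show the product-sitemap filter above end to end, here is a self-contained sketch that wraps the same loop in a static method and can be run against a small sample document. `SitemapSketch` and `ParseProductUrls` are hypothetical names, the `SitemapNs` value is assumed to be the standard sitemaps.org namespace, and the sample URLs are illustrative rather than taken from the live site.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class SitemapSketch
{
    // Assumption: SitemapNs in the diff is the standard sitemaps.org namespace.
    private static readonly XNamespace SitemapNs =
        "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns only absolute URLs whose path contains "/products/".
    public static List<Uri> ParseProductUrls(string xml)
    {
        ArgumentNullException.ThrowIfNull(xml);
        var doc = XDocument.Parse(xml);
        var urls = new List<Uri>();
        foreach (var url in doc.Descendants(SitemapNs + "url"))
        {
            var loc = url.Element(SitemapNs + "loc")?.Value;
            if (string.IsNullOrWhiteSpace(loc)) continue;
            if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;
            if (!uri.AbsolutePath.Contains("/products/", StringComparison.OrdinalIgnoreCase)) continue;
            urls.Add(uri);
        }
        return urls;
    }

    // Illustrative input: one product entry and one collection entry.
    public const string SampleXml =
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" +
        "<url><loc>https://jerseyjackpinball.com/products/example-machine</loc></url>" +
        "<url><loc>https://jerseyjackpinball.com/collections/pinball-machines-for-sale</loc></url>" +
        "</urlset>";
}
```

Calling `SitemapSketch.ParseProductUrls(SitemapSketch.SampleXml)` keeps the product entry and skips the collection entry, which is the behavior the `JjpSitemapClientTests` description attributes to the real parser.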


jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026

jkeeley2073 merged commit 75cd52d into main May 2, 2026
3 checks passed

github-advanced-security AI found potential problems May 2, 2026


This was referenced May 2, 2026

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit #34

Merged

feat(sync) scraper-to-Machine reconciliation service + ADR 0011 #35

Merged

jkeeley2073 merged 1 commit into main from Dev-JjpScraper
