feat(jjp) Jersey Jack Pinball scraper -- Phase 1.2.a #31
Merged
First non-Stern manufacturer scraper on the polite + Clean Architecture
foundation. Validates that the layout cleanly accommodates a totally
different source-site shape (Shopify storefront with sitemap + JSON-LD
product schema, no Vue.js, no manuals on product pages) without any
foundation refactoring.
Recon (via WebFetch before designing the scraper):
- jerseyjackpinball.com is Shopify-based (server-rendered HTML)
- robots.txt allows catalog crawling; explicitly blocks
cart/checkout/account/admin paths and asserts "Checkouts are for
humans" against automated buy-for-me agents (we only read catalog).
- Catalog at /collections/pinball-machines-for-sale; products at
/products/{slug}
- Sitemap at /sitemap.xml -- Shopify INDEX referencing
sitemap_products_*.xml children
- JSON-LD product schema present on product pages with name /
description / image[] / offers (price + Schema.org availability)
Code:
  src/PinballWizard.Core/Configuration/JjpOptions.cs
    Bound from "Jjp" config section. BaseUrl, SitemapPath,
    PinballMachinesCollectionSlug.
  src/PinballWizard.Core/Models/Enums.cs
    Adds SourceType.JjpProductPage.
  src/PinballWizard.Application/ScraperOrchestrator.cs
    Adds ["jjp"] = "JJP" alias to the source-filter map.
  src/PinballWizard.Infrastructure/Scraping/Jjp/
    - JjpSitemapClient     Sitemap-first discovery (per the locked
                           feedback memory
                           feedback_machine_consumer_metadata_first.md).
                           Reads index, follows
                           sitemap_products_*.xml children, returns
                           /products/{slug} URLs. XML parsing surface
                           exposed as static methods for direct testing.
    - JjpProductExtractor  Pure-function HTML -> GameRecord. Prefers
                           JSON-LD product schema; falls back to og: tags;
                           last resort H1. Schema.org availability
                           normalized to in_stock / out_of_stock /
                           preorder / discontinued. Tolerates malformed
                           JSON-LD, array-wrapped JSON-LD, and non-Product
                           JSON-LD on the same page.
    - JjpProductScraper    Extends PoliteScraperBase, implements
                           ISourceScraper. Discovers via the sitemap
                           client, fetches each product page, yields
                           ScrapedItem with .Game = GameRecord. Politeness
                           and 429 backoff inherited from the gate.
    - AddJjpScraping       DI extension. Two typed HttpClients (sitemap +
                           product) with the polite UA and
                           Shopify-appropriate Accept headers. Bridges the
                           typed-client registration into the
                           ISourceScraper enumerable.
  src/PinballWizard.Cli/
    - Program.cs           Wires AddJjpScraping; updates --source flag
                           help text.
    - appsettings.json     Adds Jjp section with default Shopify-shape
                           config.
ID strategy: JJP GameRecord IDs use prefix game_jjp_{slug} to avoid
colliding with Stern's existing game_{slug} pattern. Both manufacturers
coexist in the same games.json catalog.
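The ID strategy above can be illustrated with a minimal sketch. The class and method names (JjpIdSketch, SlugFromProductUrl, GameIdFromProductUrl) are hypothetical and not taken from the PR, and lowercasing the slug is an assumption; only the game_jjp_{slug} prefix and the /products/{slug} URL shape come from the description.

```csharp
#nullable enable
using System;

static class JjpIdSketch
{
    // Hypothetical helper: returns the path segment that follows "products",
    // or null when the URL is not a product URL. Expects an absolute Uri.
    public static string? SlugFromProductUrl(Uri url)
    {
        ArgumentNullException.ThrowIfNull(url);

        // AbsolutePath excludes the query string and fragment.
        var segments = url.AbsolutePath.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var i = Array.FindIndex(
            segments, s => s.Equals("products", StringComparison.OrdinalIgnoreCase));
        if (i < 0 || i + 1 >= segments.Length) return null;

        // Assumption: slugs are stored lowercased.
        return segments[i + 1].ToLowerInvariant();
    }

    // Prefixes the slug so JJP IDs cannot collide with Stern's game_{slug} IDs.
    public static string? GameIdFromProductUrl(Uri url) =>
        SlugFromProductUrl(url) is { } slug ? $"game_jjp_{slug}" : null;
}
```

Under these assumptions, a URL whose path is /products/example-machine would map to game_jjp_example-machine, while a non-product path such as /pages/support would yield null.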
Tests (16 new, 201 total passing):
- JjpSitemapClientTests
Sitemap-index parsing returns only sitemap_products entries (skips
pages / collections / blogs sitemaps); product-sitemap parsing
returns only /products/ paths (skips collections / pages); empty
sitemap returns empty list; null-arg validation.
- JjpProductExtractorTests
Full JSON-LD product happy path populates every mapped field;
og-only fallback works when no JSON-LD product is present;
non-Product JSON-LD on the page falls through to og: tags;
array-wrapped JSON-LD parses correctly; malformed JSON-LD is
tolerated; slug parsing across 4 URL shapes; null-arg validation.
IntegrationTests.cs updated to register JJP via AddJjpScraping and
assert the ISourceScraper enumerable now resolves four scrapers.
Out of scope (intentional follow-ups):
- JJP service-bulletin / manual discovery -- JJP doesn't host docs on
/products/. They may be on a separate /pages/support area; needs
reconnaissance.
- Cosmos Machine repository writes -- this PR uses the existing
catalog.json pipeline (mirroring Stern). Migrating the manufacturer
scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after
the live deployment exists.
- OPDB-id reconciliation -- a JJP-product -> Machine.ManufacturerSlugs
["jjp"] backfill needs the Machine repository populated by the OPDB
sync first; lives in the same Cosmos-migration concern.
- Per-source PolitenessOverrides from IngestionSource -- JJP currently
uses the global Politeness defaults; per-source overrides ride along
with the eventual IngestionSource Cosmos integration.
What this unblocks:
- AP (American Pinball) and Spooky scrapers can now follow the same
template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield.
Each lands as its own focused PR; together with Stern + JJP they
cover ~95% of currently-shipping commercial pinball.
    "outofstock" => "out_of_stock",
    "preorder" => "preorder",
    "discontinued" => "discontinued",
    _ => lastSegment?.ToLowerInvariant(),
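The fragment above is only the tail of the availability mapping. A self-contained sketch of how such a normalizer might look is given here; the class name AvailabilitySketch, the method name Normalize, the "instock" arm, and the way the last URL segment is extracted are assumptions, since the excerpt does not show them.

```csharp
#nullable enable
using System;

static class AvailabilitySketch
{
    // Maps a Schema.org availability value (e.g. "https://schema.org/InStock")
    // to a normalized token. Returns null for missing input.
    public static string? Normalize(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;

        // Keep only the part after the last '/', so both the full URL form
        // and the bare token ("InStock") are accepted.
        var trimmed = raw.Trim().TrimEnd('/');
        var lastSegment = trimmed[(trimmed.LastIndexOf('/') + 1)..];

        return lastSegment.ToLowerInvariant() switch
        {
            "instock" => "in_stock",        // assumed arm, not shown in the fragment
            "outofstock" => "out_of_stock",
            "preorder" => "preorder",
            "discontinued" => "discontinued",
            var other => other,             // unknown values pass through lowercased
        };
    }
}
```

Passing unknown values through lowercased, as the fragment's default arm does, keeps unexpected Schema.org values visible in the catalog instead of silently dropping them.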
Comment on lines +112 to +140
    foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']"))
    {
        var text = script.TextContent;
        if (string.IsNullOrWhiteSpace(text)) continue;

        JsonElement root;
        try
        {
            using var parsed = JsonDocument.Parse(text);
            root = parsed.RootElement.Clone();
        }
        catch (JsonException)
        {
            continue;
        }

        // Some Shopify themes wrap JSON-LD in an array; some don't.
        if (root.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in root.EnumerateArray())
            {
                if (TryReadProduct(item) is { } prod) return prod;
            }
        }
        else
        {
            if (TryReadProduct(root) is { } prod) return prod;
        }
    }
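TryReadProduct is called in the loop above but is not shown in this excerpt. The following is a hypothetical sketch of the type check such a helper would need so that non-Product JSON-LD falls through to the og: fallback. The class name JsonLdSketch and the JsonElement? return type are assumptions; the PR's helper presumably returns a richer product model.

```csharp
using System;
using System.Text.Json;

static class JsonLdSketch
{
    // Hypothetical stand-in: returns the element when it is a JSON-LD
    // Product object, otherwise null so the caller keeps looking.
    public static JsonElement? TryReadProduct(JsonElement element)
    {
        if (element.ValueKind != JsonValueKind.Object) return null;
        if (!element.TryGetProperty("@type", out var type)) return null;
        return IsProduct(type) ? element : null;
    }

    // JSON-LD allows "@type" to be a single string or an array of strings.
    private static bool IsProduct(JsonElement type)
    {
        if (type.ValueKind == JsonValueKind.String)
        {
            return string.Equals(
                type.GetString(), "Product", StringComparison.OrdinalIgnoreCase);
        }

        if (type.ValueKind == JsonValueKind.Array)
        {
            foreach (var entry in type.EnumerateArray())
            {
                if (IsProduct(entry)) return true;
            }
        }

        return false;
    }
}
```

Returning null instead of throwing for anything that is not a Product object matches the tolerant behavior the description claims: Organization or BreadcrumbList blocks on the same page are simply skipped.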
Comment on lines +131 to +134
    foreach (var item in root.EnumerateArray())
    {
        if (TryReadProduct(item) is { } prod) return prod;
    }
Comment on lines +179 to +185
    foreach (var item in imageProp.EnumerateArray())
    {
        if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
        {
            images.Add(s);
        }
    }
Comment on lines +122 to +126
    catch (Exception ex)
    {
        Logger.LogWarning(ex, "JJP scraper: failed to fetch / extract {Url}; skipping.", productUrl);
        return null;
    }
Comment on lines +86 to +95
    foreach (var sitemap in doc.Descendants(SitemapNs + "sitemap"))
    {
        var loc = sitemap.Element(SitemapNs + "loc")?.Value;
        if (string.IsNullOrWhiteSpace(loc)) continue;
        if (loc.Contains("sitemap_products", StringComparison.OrdinalIgnoreCase) &&
            Uri.TryCreate(loc, UriKind.Absolute, out var uri))
        {
            sitemaps.Add(uri);
        }
    }
Comment on lines +112 to +119
    foreach (var url in doc.Descendants(SitemapNs + "url"))
    {
        var loc = url.Element(SitemapNs + "loc")?.Value;
        if (string.IsNullOrWhiteSpace(loc)) continue;
        if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;
        if (!uri.AbsolutePath.Contains("/products/", StringComparison.OrdinalIgnoreCase)) continue;
        urls.Add(uri);
    }
This was referenced May 2, 2026
Summary
Phase 1.2.a: first non-Stern manufacturer scraper on the polite + Clean Architecture foundation. Validates that the layout cleanly accommodates a totally different source-site shape (Shopify storefront with sitemap + JSON-LD product schema, no Vue.js, no manuals on product pages) without any foundation refactoring.

Recon (via WebFetch before designing the scraper)
- jerseyjackpinball.com is Shopify-based (server-rendered HTML) → HTTP scraping via PoliteScraperBase, no Playwright needed
- robots.txt allows catalog crawling; explicitly blocks /cart, /checkout, /account, /admin paths and asserts "Checkouts are for humans" against automated buy-for-me agents. We only read catalog pages.
- Catalog at /collections/pinball-machines-for-sale, products at /products/{slug}
- Sitemap at /sitemap.xml -- Shopify INDEX referencing sitemap_products_*.xml children
- JSON-LD product schema present on product pages with name / description / image[] / offers (price + Schema.org availability)

Per the locked feedback memory feedback_machine_consumer_metadata_first.md, sitemap-first discovery + JSON-LD extraction beats DOM scraping. Designed accordingly.

What ships
- JjpOptions: bound from the "Jjp" config section
- SourceType.JjpProductPage: added in Enums.cs
- ["jjp"] = "JJP" source-filter alias: added to ScraperOrchestrator.SourceAliases
- JjpSitemapClient: reads the index, follows sitemap_products_*.xml children → /products/{slug} URLs. XML parsing surface exposed as static methods for direct testing.
- JjpProductExtractor: pure-function HTML → GameRecord. Prefers JSON-LD product schema; falls back to og: tags; last resort H1. Schema.org availability normalized. Tolerates malformed JSON-LD, array-wrapped JSON-LD, and non-Product JSON-LD.
- JjpProductScraper: extends PoliteScraperBase, implements ISourceScraper. Politeness + 429 backoff inherited from the gate.
- AddJjpScraping DI extension: two typed HttpClients (sitemap + product), polite UA, Shopify-appropriate Accept headers
- Program.cs (CLI): --source jjp invokes the scraper

ID strategy
JJP GameRecord IDs use prefix game_jjp_{slug} to avoid colliding with Stern's existing game_{slug} pattern. Both manufacturers coexist in the same games.json catalog.

Test Plan
- dotnet build /warnaserror -- zero warnings
- dotnet test -- 201/201 passing (was 185 pre-PR; +16 new JJP tests)

What the 16 new tests cover
- JjpSitemapClientTests (5): sitemap-index parsing returns only sitemap_products entries (skips pages / collections / blogs); product-sitemap parsing returns only /products/ paths (skips collections / pages); empty sitemap returns empty list; null-arg validation on both static methods
- JjpProductExtractorTests (11): full JSON-LD product happy path populates every mapped field; og-only fallback works when no JSON-LD product is present; non-Product JSON-LD on the page falls through to og: tags; array-wrapped JSON-LD parses correctly; malformed JSON-LD is tolerated; slug parsing across 4 URL shapes; null-arg validation

Out of Scope (intentional follow-ups)
- JJP service-bulletin / manual discovery -- JJP doesn't host docs on /products/. They may be on a separate /pages/support area; needs reconnaissance.
- Cosmos Machine repository writes -- this PR uses the existing catalog.json pipeline (mirroring Stern). Migrating the manufacturer scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after the live deployment exists.
- OPDB-id reconciliation -- a JJP product → Machine.ManufacturerSlugs["jjp"] backfill needs the Machine repository populated by the OPDB sync first; lives in the same Cosmos-migration concern.
- Per-source PolitenessOverrides from IngestionSource -- JJP currently uses the global Politeness defaults; per-source overrides ride along with the eventual IngestionSource Cosmos integration.

What this unblocks
- AP (American Pinball) and Spooky scrapers can now follow the same template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield. Each lands as its own focused PR; together with Stern + JJP they cover ~95% of currently-shipping commercial pinball.
- PoliteScraperBase + the gate generalize beyond Stern.