feat(jjp) Jersey Jack Pinball scraper -- Phase 1.2.a #31

Merged
jkeeley2073 merged 1 commit into main from Dev-JjpScraper
May 2, 2026

@jkeeley2073

Summary

Phase 1.2.a — first non-Stern manufacturer scraper on the polite + Clean Architecture foundation. Validates that the layout cleanly accommodates a totally different source-site shape (Shopify storefront with sitemap + JSON-LD product schema, no Vue.js, no manuals on product pages) without any foundation refactoring.

Recon (via WebFetch before designing the scraper)

  • jerseyjackpinball.com is Shopify-based (server-rendered HTML) → HTTP scraping via PoliteScraperBase, no Playwright needed
  • robots.txt allows catalog crawling; explicitly blocks /cart, /checkout, /account, /admin paths and asserts "Checkouts are for humans" against automated buy-for-me agents. We only read catalog pages.
  • Catalog at /collections/pinball-machines-for-sale, products at /products/{slug}
  • Sitemap at /sitemap.xml — Shopify INDEX referencing sitemap_products_*.xml children
  • JSON-LD product schema present on product pages with name / description / image[] / offers (price + Schema.org availability)

Per the locked feedback memory feedback_machine_consumer_metadata_first.md, sitemap-first discovery + JSON-LD extraction beats DOM scraping. Designed accordingly.
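
The sitemap-first discovery step can be sketched as below. This is a minimal illustration, not the PR's JjpSitemapClient: the class and method names (SitemapSketch, ParseIndex, ParseProductUrls) are hypothetical, and the only external fact assumed is the standard sitemaps.org XML namespace.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper for illustration; names are not taken from the PR.
public static class SitemapSketch
{
    // XML namespace defined by the sitemaps.org protocol.
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Step 1: from a sitemap index, keep only children whose URL mentions "sitemap_products".
    public static IReadOnlyList<Uri> ParseIndex(string xml)
    {
        ArgumentNullException.ThrowIfNull(xml);
        var result = new List<Uri>();
        foreach (var sitemap in XDocument.Parse(xml).Descendants(Ns + "sitemap"))
        {
            var loc = sitemap.Element(Ns + "loc")?.Value?.Trim();
            if (string.IsNullOrEmpty(loc)) continue;
            if (loc.Contains("sitemap_products", StringComparison.OrdinalIgnoreCase)
                && Uri.TryCreate(loc, UriKind.Absolute, out var uri))
            {
                result.Add(uri);
            }
        }
        return result;
    }

    // Step 2: from a product sitemap, keep only URLs whose path contains /products/.
    public static IReadOnlyList<Uri> ParseProductUrls(string xml)
    {
        ArgumentNullException.ThrowIfNull(xml);
        var result = new List<Uri>();
        foreach (var url in XDocument.Parse(xml).Descendants(Ns + "url"))
        {
            var loc = url.Element(Ns + "loc")?.Value?.Trim();
            if (string.IsNullOrEmpty(loc)) continue;
            if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;
            if (!uri.AbsolutePath.Contains("/products/", StringComparison.OrdinalIgnoreCase)) continue;
            result.Add(uri);
        }
        return result;
    }
}
```

Keeping both steps as pure string-to-list functions is what lets them be tested directly against fixture XML, with no HTTP involved.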

What ships

| Component | What |
| --- | --- |
| JjpOptions | Bound from "Jjp" config section |
| SourceType.JjpProductPage | New enum value in Enums.cs |
| ["jjp"] = "JJP" source-filter alias | ScraperOrchestrator.SourceAliases |
| JjpSitemapClient | Sitemap-first discovery: index → sitemap_products_*.xml children → /products/{slug} URLs. XML parsing surface exposed as static methods for direct testing. |
| JjpProductExtractor | Pure-function HTML → GameRecord. Prefers JSON-LD product schema; falls back to og: tags; last resort H1. Schema.org availability normalized. Tolerates malformed JSON-LD, array-wrapped JSON-LD, and non-Product JSON-LD. |
| JjpProductScraper | Extends PoliteScraperBase, implements ISourceScraper. Politeness + 429 backoff inherited from the gate. |
| AddJjpScraping DI extension | Two typed HttpClients (sitemap + product), polite UA, Shopify-appropriate Accept headers |
| CLI | --source jjp invokes the scraper |
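
The availability normalization mentioned for JjpProductExtractor might look roughly like this. It is a sketch under assumptions: AvailabilitySketch and Normalize are hypothetical names, and the "instock" arm is inferred from the in_stock target value named in the commit message rather than copied from the diff. Unknown values fall through lowercased, matching the fallback arm visible in the review excerpt.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not the PR's actual method.
public static class AvailabilitySketch
{
    // Maps a Schema.org availability value (e.g. "https://schema.org/InStock")
    // to the catalog's snake_case vocabulary.
    public static string? Normalize(string? schemaOrgAvailability)
    {
        if (string.IsNullOrWhiteSpace(schemaOrgAvailability)) return null;

        // Take the text after the last '/', so full URLs and bare names both work.
        var trimmed = schemaOrgAvailability.Trim().TrimEnd('/');
        var lastSegment = trimmed[(trimmed.LastIndexOf('/') + 1)..];

        return lastSegment.ToLowerInvariant() switch
        {
            "instock" => "in_stock",          // assumed arm, see lead-in
            "outofstock" => "out_of_stock",
            "preorder" => "preorder",
            "discontinued" => "discontinued",
            var other => other,               // unknown values pass through lowercased
        };
    }
}
```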

ID strategy

JJP GameRecord IDs use prefix game_jjp_{slug} to avoid colliding with Stern's existing game_{slug} pattern. Both manufacturers coexist in the same games.json catalog.
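
As an illustration of the ID scheme, the sketch below derives game_jjp_{slug} from a product URL. JjpIdSketch, SlugFromUrl, and GameId are hypothetical names, and the URL shapes handled here (trailing slash, query string, collection-prefixed path) are assumptions about what slug parsing needs to tolerate, not a list taken from the PR's tests.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not the PR's actual code.
public static class JjpIdSketch
{
    // Returns the path segment that follows "products", or null if there is none.
    public static string? SlugFromUrl(Uri productUrl)
    {
        ArgumentNullException.ThrowIfNull(productUrl);

        // AbsolutePath excludes the query string and fragment.
        var segments = productUrl.AbsolutePath.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var i = Array.FindIndex(segments, s => s.Equals("products", StringComparison.OrdinalIgnoreCase));
        if (i < 0 || i + 1 >= segments.Length) return null;

        return Uri.UnescapeDataString(segments[i + 1]).ToLowerInvariant();
    }

    // Prefixes the slug so JJP IDs cannot collide with Stern's game_{slug} IDs.
    public static string? GameId(Uri productUrl) =>
        SlugFromUrl(productUrl) is { } slug ? $"game_jjp_{slug}" : null;
}
```

With this scheme a Stern record and a JJP record that happen to share a slug still get distinct IDs in the shared catalog.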

Test Plan

  • dotnet build /warnaserror: zero warnings
  • dotnet test: 201/201 passing (was 185 pre-PR; +16 new JJP tests)
  • All 185 prior tests still pass
  • CI re-validates on this PR
  • Author = personal noreply

What the 16 new tests cover

  • JjpSitemapClientTests (5): sitemap-index parsing returns only sitemap_products entries (skips pages / collections / blogs); product-sitemap parsing returns only /products/ paths (skips collections / pages); empty sitemap returns empty list; null-arg validation on both static methods
  • JjpProductExtractorTests (11): full JSON-LD product happy path populates every mapped field; og-only fallback works when no JSON-LD product is present; non-Product JSON-LD on the page falls through to og: tags; array-wrapped JSON-LD parses correctly; malformed JSON-LD is tolerated; slug parsing across 4 URL shapes; null-arg validation
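
The JSON-LD behaviours those extractor tests describe (array-wrapped input, malformed input, non-Product blocks) can be sketched with System.Text.Json as follows. JsonLdSketch and TryReadProductName are hypothetical names, and the sketch reads only the name field; the PR's extractor maps more fields and is not reproduced here.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical helper for illustration; not the PR's JjpProductExtractor.
public static class JsonLdSketch
{
    // Returns the Product's "name", or null when the block is malformed
    // or contains no object with "@type": "Product".
    public static string? TryReadProductName(string jsonLd)
    {
        ArgumentNullException.ThrowIfNull(jsonLd);
        try
        {
            using var doc = JsonDocument.Parse(jsonLd);
            var root = doc.RootElement;

            // Some pages wrap several JSON-LD objects in one array.
            if (root.ValueKind == JsonValueKind.Array)
            {
                foreach (var item in root.EnumerateArray())
                {
                    if (NameIfProduct(item) is { } name) return name;
                }
                return null;
            }
            return NameIfProduct(root);
        }
        catch (JsonException)
        {
            // Malformed JSON-LD is tolerated: the caller can fall back to og: tags.
            return null;
        }
    }

    private static string? NameIfProduct(JsonElement e)
    {
        if (e.ValueKind != JsonValueKind.Object) return null;
        if (!e.TryGetProperty("@type", out var type) || type.ValueKind != JsonValueKind.String) return null;
        if (!string.Equals(type.GetString(), "Product", StringComparison.OrdinalIgnoreCase)) return null;
        return e.TryGetProperty("name", out var n) && n.ValueKind == JsonValueKind.String
            ? n.GetString()
            : null;
    }
}
```

A null result is the signal to move down the fallback chain (og: tags, then H1), so none of these cases needs to throw.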

Out of Scope (intentional follow-ups)

  • JJP service-bulletin / manual discovery — JJP doesn't host docs on /products/. They may be on a separate /pages/support area; needs reconnaissance.
  • Cosmos Machine repository writes — this PR uses the existing catalog.json pipeline (mirroring Stern). Migrating the manufacturer scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after the live deployment exists.
  • OPDB-id reconciliation -- a JJP-product → Machine.ManufacturerSlugs["jjp"] backfill needs the Machine repository populated by the OPDB sync first; lives in the same Cosmos-migration concern.
  • Per-source PolitenessOverrides from IngestionSource — JJP currently uses the global Politeness defaults; per-source overrides ride along with the eventual IngestionSource Cosmos integration.

What this unblocks

  • AP (American Pinball) and Spooky scrapers can now follow the same template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield. Each lands as its own focused PR; together with Stern + JJP they cover ~95% of currently-shipping commercial pinball.
  • The polite-scraping pattern is validated against a second source — first proof that PoliteScraperBase + the gate generalize beyond Stern.

First non-Stern manufacturer scraper on the polite + Clean Architecture
foundation. Validates that the layout cleanly accommodates a totally
different source-site shape (Shopify storefront with sitemap + JSON-LD
product schema, no Vue.js, no manuals on product pages) without any
foundation refactoring.

Recon (via WebFetch before designing the scraper):
- jerseyjackpinball.com is Shopify-based (server-rendered HTML)
- robots.txt allows catalog crawling; explicitly blocks
  cart/checkout/account/admin paths and asserts "Checkouts are for
  humans" against automated buy-for-me agents (we only read catalog).
- Catalog at /collections/pinball-machines-for-sale; products at
  /products/{slug}
- Sitemap at /sitemap.xml -- Shopify INDEX referencing
  sitemap_products_*.xml children
- JSON-LD product schema present on product pages with name /
  description / image[] / offers (price + Schema.org availability)

Code:

src/PinballWizard.Core/Configuration/JjpOptions.cs
  Bound from "Jjp" config section. BaseUrl, SitemapPath,
  PinballMachinesCollectionSlug.

src/PinballWizard.Core/Models/Enums.cs
  Adds SourceType.JjpProductPage.

src/PinballWizard.Application/ScraperOrchestrator.cs
  Adds ["jjp"] = "JJP" alias to the source-filter map.

src/PinballWizard.Infrastructure/Scraping/Jjp/
  - JjpSitemapClient   Sitemap-first discovery (per the locked
                       feedback memory feedback_machine_consumer
                       _metadata_first.md). Reads index, follows
                       sitemap_products_*.xml children, returns
                       /products/{slug} URLs. XML parsing surface
                       exposed as static methods for direct testing.
  - JjpProductExtractor  Pure-function HTML -> GameRecord. Prefers
                       JSON-LD product schema; falls back to og: tags;
                       last resort H1. Schema.org availability
                       normalized to in_stock / out_of_stock /
                       preorder / discontinued. Tolerates malformed
                       JSON-LD, array-wrapped JSON-LD, and non-Product
                       JSON-LD on the same page.
  - JjpProductScraper  Extends PoliteScraperBase, implements
                       ISourceScraper. Discovers via the sitemap
                       client, fetches each product page, yields
                       ScrapedItem with .Game = GameRecord. Politeness
                       and 429 backoff inherited from the gate.
  - AddJjpScraping     DI extension. Two typed HttpClients (sitemap +
                       product) with the polite UA and Shopify-
                       appropriate Accept headers. Bridges the typed-
                       client registration into the ISourceScraper
                       enumerable.

src/PinballWizard.Cli/
  - Program.cs         Wires AddJjpScraping; updates --source flag
                       help text.
  - appsettings.json   Adds Jjp section with default Shopify-shape
                       config.

ID strategy: JJP GameRecord IDs use prefix game_jjp_{slug} to avoid
colliding with Stern's existing game_{slug} pattern. Both manufacturers
coexist in the same games.json catalog.

Tests (16 new, 201 total passing):

- JjpSitemapClientTests
  Sitemap-index parsing returns only sitemap_products entries (skips
  pages / collections / blogs sitemaps); product-sitemap parsing
  returns only /products/ paths (skips collections / pages); empty
  sitemap returns empty list; null-arg validation.

- JjpProductExtractorTests
  Full JSON-LD product happy path populates every mapped field;
  og-only fallback works when no JSON-LD product is present;
  non-Product JSON-LD on the page falls through to og: tags;
  array-wrapped JSON-LD parses correctly; malformed JSON-LD is
  tolerated; slug parsing across 4 URL shapes; null-arg validation.

IntegrationTests.cs updated to register JJP via AddJjpScraping and
assert the ISourceScraper enumerable now resolves four scrapers.

Out of scope (intentional follow-ups):

- JJP service-bulletin / manual discovery -- JJP doesn't host docs on
  /products/. They may be on a separate /pages/support area; needs
  reconnaissance.
- Cosmos Machine repository writes -- this PR uses the existing
  catalog.json pipeline (mirroring Stern). Migrating the manufacturer
  scrapers to Cosmos lands in the Phase 2 "Cosmos migration" PR after
  the live deployment exists.
- OPDB-id reconciliation -- a JJP-product -> Machine.ManufacturerSlugs
  ["jjp"] backfill needs the Machine repository populated by the OPDB
  sync first; lives in the same Cosmos-migration concern.
- Per-source PolitenessOverrides from IngestionSource -- JJP currently
  uses the global Politeness defaults; per-source overrides ride along
  with the eventual IngestionSource Cosmos integration.

What this unblocks:

- AP (American Pinball) and Spooky scrapers can now follow the same
  template: sitemap-discover, JSON-LD-extract, ISourceScraper-yield.
  Each lands as its own focused PR; together with Stern + JJP they
  cover ~95% of currently-shipping commercial pinball.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026
@jkeeley2073 jkeeley2073 merged commit 75cd52d into main May 2, 2026
3 checks passed
"outofstock" => "out_of_stock",
"preorder" => "preorder",
"discontinued" => "discontinued",
_ => lastSegment?.ToLowerInvariant(),
Comment on lines +112 to +140
foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']"))
{
    var text = script.TextContent;
    if (string.IsNullOrWhiteSpace(text)) continue;

    JsonElement root;
    try
    {
        using var parsed = JsonDocument.Parse(text);
        root = parsed.RootElement.Clone();
    }
    catch (JsonException)
    {
        continue;
    }

    // Some Shopify themes wrap JSON-LD in an array; some don't.
    if (root.ValueKind == JsonValueKind.Array)
    {
        foreach (var item in root.EnumerateArray())
        {
            if (TryReadProduct(item) is { } prod) return prod;
        }
    }
    else
    {
        if (TryReadProduct(root) is { } prod) return prod;
    }
}
Comment on lines +131 to +134
foreach (var item in root.EnumerateArray())
{
    if (TryReadProduct(item) is { } prod) return prod;
}
Comment on lines +179 to +185
foreach (var item in imageProp.EnumerateArray())
{
    if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
    {
        images.Add(s);
    }
}
Comment on lines +122 to +126
catch (Exception ex)
{
    Logger.LogWarning(ex, "JJP scraper: failed to fetch / extract {Url}; skipping.", productUrl);
    return null;
}
Comment on lines +86 to +95
foreach (var sitemap in doc.Descendants(SitemapNs + "sitemap"))
{
    var loc = sitemap.Element(SitemapNs + "loc")?.Value;
    if (string.IsNullOrWhiteSpace(loc)) continue;
    if (loc.Contains("sitemap_products", StringComparison.OrdinalIgnoreCase) &&
        Uri.TryCreate(loc, UriKind.Absolute, out var uri))
    {
        sitemaps.Add(uri);
    }
}
Comment on lines +112 to +119
foreach (var url in doc.Descendants(SitemapNs + "url"))
{
    var loc = url.Element(SitemapNs + "loc")?.Value;
    if (string.IsNullOrWhiteSpace(loc)) continue;
    if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;
    if (!uri.AbsolutePath.Contains("/products/", StringComparison.OrdinalIgnoreCase)) continue;
    urls.Add(uri);
}