feat(spooky) Spooky Pinball scraper -- Phase 1.2.c by jkeeley2073 · Pull Request #33 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-02T23:00:07Z

Summary

Third non-Stern manufacturer scraper. Spooky Pinball brings the manufacturer count to four: Stern + JJP + AP + Spooky together cover the bulk of currently-shipping commercial pinball.

Spooky runs WordPress + WooCommerce + Yoast SEO with the WP REST API fully open at /wp-json/wp/v2/pages — so this scraper consumes structured JSON rather than scraping rendered HTML. More reliable than DOM heuristics, politer (less data per request), and naturally entity-decoded.

Discovery rule

A WP page is treated as a game page iff its rendered content contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com AND those URLs all share a single distinct first path segment (the canonical game slug). This naturally rejects aggregator/cross-game pages like the real "SCOOBY BASE IMAGE UPDATE" page (id 2445) that links to firmware for 5 different games — observed during recon and locked into the test suite.

The S3-derived slug becomes GameRecord.Slug, so games whose WP slug is a numeric placeholder (like 2486-2 for "Texas Chainsaw Massacre") still get a stable human-meaningful canonical slug (texaschainsaw). That's what Spooky uses internally and is more useful than a WP page slug for cross-source linking.
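The discovery rule can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class name, method names, and the exact regex are assumptions, and only the S3 host comes from the description above. The pattern assumes every firmware URL has at least one path segment followed by a slash.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch; the real SpookyWpPagesClient may differ.
public static class SpookyDiscoverySketch
{
    // Assumed pattern: capture the first path segment after the S3 host.
    private static readonly Regex S3Url = new Regex(
        @"https?://spookypinball\.s3\.us-east-2\.amazonaws\.com/([^/\s""'<>]+)/",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the distinct first path segments of all S3 URLs in the HTML.
    public static IReadOnlyCollection<string> ExtractSlugs(string html)
    {
        var slugs = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrEmpty(html)) return slugs;

        foreach (Match match in S3Url.Matches(html))
        {
            slugs.Add(match.Groups[1].Value);
        }
        return slugs;
    }

    // A page is a game page only when exactly one distinct slug is present.
    // Zero slugs (no firmware) and several slugs (aggregator) are both rejected.
    public static bool TryGetCanonicalSlug(string html, out string slug)
    {
        var slugs = ExtractSlugs(html);
        slug = slugs.Count == 1 ? slugs.First() : string.Empty;
        return slug.Length > 0;
    }
}
```

Under these assumptions, a page whose links all point under `/texaschainsaw/` yields `texaschainsaw` regardless of its WP slug, while a page linking to firmware under two or more different first segments is skipped.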

What's added

- PinballWizard.Core/Configuration/SpookyOptions.cs: BaseUrl, PagesEndpointPath, PageSize (max 100), S3Host
- PinballWizard.Core/Models/Enums.cs: adds SourceType.SpookyPinballGamePage
- PinballWizard.Application/ScraperOrchestrator.cs: adds ["spooky"] = "Spooky Pinball" source-filter alias
- PinballWizard.Infrastructure/Scraping/Spooky/:
  - SpookyPageRaw + WpRenderedField: WP REST DTOs as classes with init-only accessors rather than records (because of the records bug carried forward from the AP scraper)
  - SpookyWpPagesClient extends PoliteScraperBase: paginated WP REST consumer; static parsing surface (page JSON deserialization, S3-slug extraction, single-slug game filter) kept testable
  - SpookyGamePageExtractor: pure-function page → GameRecord + downloads, HTML-entity decoded, anchor-text labels attached where present
  - SpookyGamePageScraper extends PoliteScraperBase, implements ISourceScraper: yields one .Game ScrapedItem and one .Link ScrapedItem per S3 firmware URL
  - AddSpookyPinballScraping DI extension
- CLI: --source spooky
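As a rough illustration of the paginated consumer, the sketch below builds the request URL for one page of results. The helper name and the placeholder host are hypothetical; `per_page` and `page` are the standard WordPress REST query parameters, and WordPress caps `per_page` at 100, which matches the PageSize limit noted above. How the PR's client actually composes its URLs is not shown in this description.

```csharp
using System;

// Hypothetical helper; not taken from the PR.
public static class WpPagesUrlSketch
{
    // WordPress rejects per_page values above 100.
    public const int MaxPageSize = 100;

    public static string BuildPageUrl(string baseUrl, string endpointPath, int pageSize, int pageNumber)
    {
        if (string.IsNullOrWhiteSpace(baseUrl))
            throw new ArgumentException("Base URL is required.", nameof(baseUrl));
        if (string.IsNullOrWhiteSpace(endpointPath))
            throw new ArgumentException("Endpoint path is required.", nameof(endpointPath));
        if (pageNumber < 1)
            throw new ArgumentOutOfRangeException(nameof(pageNumber), "Pages are 1-based.");

        // Clamp rather than throw so a misconfigured PageSize still yields a valid request.
        var size = Math.Clamp(pageSize, 1, MaxPageSize);
        return $"{baseUrl.TrimEnd('/')}/{endpointPath.TrimStart('/')}?per_page={size}&page={pageNumber}";
    }
}
```

A caller would typically request page 1, read the `X-WP-TotalPages` response header that WordPress returns on collection endpoints, and then loop up to that page count, with every request still passing through the politeness gate.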

Politeness

Spooky's robots.txt declares Crawl-delay: 10; the per-origin throttle picks that up via the shared robots-txt cache
All HTTP routes through IPolitenessGate (per-origin throttle, 429 backoff, robots-txt enforcement)
User-Agent identifies the project (PinballWizard/0.1 (+https://github.com/Early-Bird-Solutions-LLC/PinballWizard; polite-scraper))
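For illustration only, here is one way a Crawl-delay value could be read from a robots.txt body. This is a deliberately simplified sketch with hypothetical names: it ignores User-agent grouping and returns the first Crawl-delay it finds, whereas the shared robots-txt cache in the project presumably does something more complete.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch; not the project's robots-txt cache.
public static class CrawlDelaySketch
{
    // Returns the first Crawl-delay directive as a TimeSpan, or null if none is present.
    public static TimeSpan? ParseFirstCrawlDelay(string robotsTxt)
    {
        if (string.IsNullOrEmpty(robotsTxt)) return null;

        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip trailing comments before looking for the directive.
            var line = rawLine;
            var hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);

            var colon = line.IndexOf(':');
            if (colon < 0) continue;

            var name = line.Substring(0, colon).Trim();
            var value = line.Substring(colon + 1).Trim();

            if (name.Equals("Crawl-delay", StringComparison.OrdinalIgnoreCase)
                && double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var seconds)
                && seconds >= 0)
            {
                return TimeSpan.FromSeconds(seconds);
            }
        }
        return null;
    }
}
```

With a body containing `Crawl-delay: 10`, this returns a ten-second delay that a per-origin throttle could use as its minimum spacing between requests.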

Tests

26 new unit tests → 248 total (was 222).

Coverage:

JSON deserialization round-trip (full field binding) + graceful handling of empty / non-array / malformed bodies
S3-slug extraction (distinct slugs across content, non-S3 URL rejection, empty content)
Single-S3-slug game filter (single-slug game accepted / multi-slug aggregator rejected / no-S3-link rejected)
Canonical S3-derived slug used even when WP slug is numeric placeholder (2486-2 → texaschainsaw)
HTML entity decoding in titles (Alice Cooper&#8217;s → Alice Cooper's)
Downloads dedup of repeated S3 hrefs
Anchor-text label attachment from rendered HTML
Null/blank-arg validation across both client and extractor
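The entity-decoding and dedup cases above can be pictured with a small sketch. The class and method names are hypothetical, and the sample title assumes WordPress delivers the apostrophe as the numeric entity `&#8217;`; `WebUtility.HtmlDecode` and `HashSet<T>.Add` are the same calls that appear in the diff excerpts of this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical sketch; not the PR's SpookyGamePageExtractor.
public static class DecodeAndDedupSketch
{
    // Decodes HTML entities in a rendered WP title and trims whitespace.
    public static string DecodeTitle(string rendered) =>
        WebUtility.HtmlDecode(rendered ?? string.Empty).Trim();

    // Decodes each href first, then keeps only the first occurrence of each URL.
    // Decoding before comparing means "a&amp;b" and "a&b" count as the same link.
    public static List<string> DistinctDecodedUrls(IEnumerable<string> rawHrefs)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var raw in rawHrefs)
        {
            if (string.IsNullOrWhiteSpace(raw)) continue;
            var url = WebUtility.HtmlDecode(raw);
            if (seen.Add(url)) result.Add(url);
        }
        return result;
    }
}
```

Note that decoding `&#8217;` produces the typographic apostrophe (U+2019); whether the extractor then normalizes it to a plain `'` is not something this sketch assumes.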

Test plan

dotnet build — clean, zero warnings
dotnet test — 248/248 pass
dotnet run -- --status — DI startup clean, app reports existing catalog
Live-site recon validated discovery rule against real page IDs (Beetlejuice, Evil Dead, Halloween, Ultraman, Scooby, Looney Tunes, Texas Chainsaw, plus negative case "SCOOBY BASE IMAGE UPDATE")
Live-site end-to-end run after merge (deferred — same discipline applied to AP / JJP)

Follow-ups (deferred to their own PRs)

Cosmos migration PR — GameRecord → Machine repository persistence; same architectural debt called out in OPDB / JJP / AP PRs
Update project_parallel_execution_plan.md to mark gates and Tracks B-mfg.{Stern, JJP, AP, Spooky} done
Phase 1.3 manufacturer scrapers (Multimorphic, Chicago Gaming, Haggis, Pinball Brothers, Dutch, Barrels of Fun) — same template

🤖 Generated with Claude Code

Third non-Stern manufacturer scraper. Spooky is WordPress + WooCommerce + Yoast with the WP REST API fully open at /wp-json/wp/v2/pages, so the scraper consumes structured JSON instead of scraping rendered HTML -- more reliable than DOM heuristics, politer (less data per request), and naturally entity-decoded. Discovery rule: a WP page is a game page iff its rendered content contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com AND those URLs all share a single distinct first path segment (the canonical game slug). This naturally rejects aggregator pages like "SCOOBY BASE IMAGE UPDATE" that link to firmware for several games. The S3-derived slug becomes GameRecord.Slug, so games whose WP slug is a numeric placeholder ("2486-2") still get a stable human-meaningful slug ("texaschainsaw"). 26 new unit tests (248 total). CLI: --source spooky.

+        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
+        {
+            var url = WebUtility.HtmlDecode(match.Value);
+            if (!seenUrls.Add(url)) continue;
+
+            anchorTextByHref.TryGetValue(url, out var linkText);
+
+            links.Add(new DiscoveredLink
+            {
+                FileUrl = url,
+                LinkText = string.IsNullOrWhiteSpace(linkText) ? null : linkText,
+                DiscoveryContext = "Spooky Pinball Game Page",
+                GameSlug = canonicalSlug,
+            });
+        }


+        catch
+        {
+            // Anchor lookup is best-effort — a parse failure should not
+            // block download discovery; the regex pass already captured
+            // the URLs we care about.
+        }


+        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
+        {
+            var slug = match.Groups[1].Value;
+            if (!string.IsNullOrWhiteSpace(slug))
+            {
+                slugs.Add(slug);
+            }
+        }


jkeeley2073 added the claude-code label (Generated with Claude Code) May 2, 2026

jkeeley2073 merged commit 23d3332 into main May 2, 2026
3 checks passed

github-advanced-security AI found potential problems May 2, 2026


This was referenced May 2, 2026

fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit #34

Merged

feat(sync) scraper-to-Machine reconciliation service + ADR 0011 #35

Merged

jkeeley2073 merged 1 commit into main from Dev-SpookyScraper