feat(spooky) Spooky Pinball scraper -- Phase 1.2.c #33
Merged
Third non-Stern manufacturer scraper. Spooky is WordPress + WooCommerce
+ Yoast with the WP REST API fully open at /wp-json/wp/v2/pages, so
the scraper consumes structured JSON instead of scraping rendered
HTML -- more reliable than DOM heuristics, politer (less data per
request), and naturally entity-decoded.
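
As a rough illustration of consuming that endpoint, the sketch below builds a paged request URL and deserializes the JSON response with System.Text.Json. The type names (`RenderedSketch`, `WpPageSketch`, `WpPagesSketch`) are hypothetical and are not the PR's actual `SpookyPageRaw` / `SpookyWpPagesClient`; the `per_page` and `page` query parameters and the `title.rendered` / `content.rendered` fields follow standard WP REST API conventions.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs for illustration only; the PR's real DTOs may differ.
// Classes with init-only accessors, mirroring the style the PR describes.
public sealed class RenderedSketch
{
    [JsonPropertyName("rendered")]
    public string Rendered { get; init; } = "";
}

public sealed class WpPageSketch
{
    [JsonPropertyName("id")]
    public int Id { get; init; }

    [JsonPropertyName("slug")]
    public string Slug { get; init; } = "";

    [JsonPropertyName("title")]
    public RenderedSketch Title { get; init; } = new();

    [JsonPropertyName("content")]
    public RenderedSketch Content { get; init; } = new();
}

public static class WpPagesSketch
{
    // WordPress caps per_page at 100; page numbers are 1-based.
    public static string BuildPageUrl(string baseUrl, int page, int pageSize = 100)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        var size = Math.Clamp(pageSize, 1, 100);
        return $"{baseUrl.TrimEnd('/')}/wp-json/wp/v2/pages?per_page={size}&page={page}";
    }

    // Parses one page of results (a JSON array of page objects).
    public static IReadOnlyList<WpPageSketch> ParsePages(string json) =>
        JsonSerializer.Deserialize<List<WpPageSketch>>(json) ?? new List<WpPageSketch>();
}
```

For example, `WpPagesSketch.BuildPageUrl("https://example.com", 2)` yields `https://example.com/wp-json/wp/v2/pages?per_page=100&page=2` (the base URL is a placeholder). A real consumer would also need a stop condition for pagination, for instance the `X-WP-TotalPages` response header that WordPress returns.
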
Discovery rule: a WP page is a game page iff its rendered content
contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com
AND those URLs all share a single distinct first path segment (the
canonical game slug). This naturally rejects aggregator pages like
"SCOOBY BASE IMAGE UPDATE" that link to firmware for several games.
The S3-derived slug becomes GameRecord.Slug, so games whose WP slug
is a numeric placeholder ("2486-2") still get a stable human-meaningful
slug ("texaschainsaw").
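
The discovery rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the class name, method names, and the regex are hypothetical; only the S3 host and the "exactly one distinct first path segment" rule come from the description.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GamePageRuleSketch
{
    // S3 host taken from the PR description.
    private const string S3Host = "spookypinball.s3.us-east-2.amazonaws.com";

    // Assumed pattern: captures the first path segment of any URL on the S3 host
    // that has at least one further segment after it (slug/file).
    private static readonly Regex FirstSegment = new Regex(
        @"https?://" + Regex.Escape(S3Host) + @"/([^/""'\s<>]+)/",
        RegexOptions.IgnoreCase);

    // Collects the distinct first path segments found in the rendered HTML.
    public static ISet<string> ExtractSlugs(string html)
    {
        var slugs = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in FirstSegment.Matches(html))
        {
            slugs.Add(m.Groups[1].Value);
        }
        return slugs;
    }

    // Returns the canonical slug when the page links to exactly one game's
    // firmware; returns null for pages with no S3 links or with several slugs.
    public static string? TryGetCanonicalSlug(string html)
    {
        var slugs = ExtractSlugs(html);
        return slugs.Count == 1 ? slugs.First() : null;
    }
}
```

With this shape, a page whose links all point under `/texaschainsaw/` resolves to `texaschainsaw`, while an aggregator page that links to firmware under several first segments resolves to `null` and is skipped.
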
26 new unit tests (248 total). CLI: --source spooky.
Comment on lines +86 to +100

    foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
    {
        var url = WebUtility.HtmlDecode(match.Value);
        if (!seenUrls.Add(url)) continue;

        anchorTextByHref.TryGetValue(url, out var linkText);

        links.Add(new DiscoveredLink
        {
            FileUrl = url,
            LinkText = string.IsNullOrWhiteSpace(linkText) ? null : linkText,
            DiscoveryContext = "Spooky Pinball Game Page",
            GameSlug = canonicalSlug,
        });
    }
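
For readers without the surrounding file, here is a self-contained sketch of the same loop: match S3 URLs, HTML-decode them, de-duplicate, and attach anchor text when a lookup has it. The `LinkSketch` type, the `LinkExtractionSketch` class, and the URL regex are assumptions made for illustration; the diff's actual `pattern`, `DiscoveredLink`, and anchor lookup are not shown in this excerpt.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the PR's DiscoveredLink.
public sealed class LinkSketch
{
    public string FileUrl { get; init; } = "";
    public string? LinkText { get; init; }
}

public static class LinkExtractionSketch
{
    // Assumed pattern: any URL on the S3 host, up to a quote, space, or angle bracket.
    private static readonly Regex S3Url = new Regex(
        @"https?://spookypinball\.s3\.us-east-2\.amazonaws\.com/[^""'\s<>]+",
        RegexOptions.IgnoreCase);

    public static List<LinkSketch> Extract(
        string html,
        IReadOnlyDictionary<string, string> anchorTextByHref)
    {
        var seenUrls = new HashSet<string>(StringComparer.Ordinal);
        var links = new List<LinkSketch>();

        foreach (Match match in S3Url.Matches(html))
        {
            // Decode entities such as &amp; before de-duplicating.
            var url = WebUtility.HtmlDecode(match.Value);
            if (!seenUrls.Add(url)) continue;

            // Best-effort label: a missing or blank entry leaves LinkText null.
            anchorTextByHref.TryGetValue(url, out var linkText);

            links.Add(new LinkSketch
            {
                FileUrl = url,
                LinkText = string.IsNullOrWhiteSpace(linkText) ? null : linkText,
            });
        }

        return links;
    }
}
```

De-duplicating on the decoded URL means the same file referenced once with `&amp;` and once with `&` in its query string yields a single link.
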
Comment on lines +122 to +127

    catch
    {
        // Anchor lookup is best-effort; a parse failure should not
        // block download discovery; the regex pass already captured
        // the URLs we care about.
    }
Comment on lines +144 to +151

    foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
    {
        var slug = match.Groups[1].Value;
        if (!string.IsNullOrWhiteSpace(slug))
        {
            slugs.Add(slug);
        }
    }
This was referenced May 2, 2026
Summary
Third non-Stern manufacturer scraper. Spooky Pinball brings the count to four: Stern + JJP + AP + Spooky covers the bulk of currently-shipping commercial pinball.
Spooky runs WordPress + WooCommerce + Yoast SEO with the WP REST API fully open at /wp-json/wp/v2/pages, so this scraper consumes structured JSON rather than scraping rendered HTML. More reliable than DOM heuristics, politer (less data per request), and naturally entity-decoded.

Discovery rule
A WP page is treated as a game page iff its rendered content contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com AND those URLs all share a single distinct first path segment (the canonical game slug). This naturally rejects aggregator/cross-game pages like the real "SCOOBY BASE IMAGE UPDATE" page (id 2445) that links to firmware for 5 different games, a case observed during recon and locked into the test suite.

The S3-derived slug becomes GameRecord.Slug, so games whose WP slug is a numeric placeholder (like 2486-2 for "Texas Chainsaw Massacre") still get a stable human-meaningful canonical slug (texaschainsaw). That's what Spooky uses internally and is more useful than a WP page slug for cross-source linking.

What's added
- PinballWizard.Core/Configuration/SpookyOptions.cs: BaseUrl, PagesEndpointPath, PageSize (max 100), S3Host
- PinballWizard.Core/Models/Enums.cs: adds SourceType.SpookyPinballGamePage
- PinballWizard.Application/ScraperOrchestrator.cs: adds ["spooky"] = "Spooky Pinball" source-filter alias
- PinballWizard.Infrastructure/Scraping/Spooky/:
  - SpookyPageRaw + WpRenderedField: WP REST DTOs as classes with init-only accessors (records bug carried forward from AP scraper)
  - SpookyWpPagesClient extends PoliteScraperBase: paginated WP REST consumer; static parsing surface (page JSON deserialization, S3-slug extraction, single-slug game filter) kept testable
  - SpookyGamePageExtractor: pure-function page → GameRecord + downloads, HTML-entity decoded, anchor-text labels attached where present
  - SpookyGamePageScraper extends PoliteScraperBase, implements ISourceScraper: yields one .Game ScrapedItem and one .Link ScrapedItem per S3 firmware URL
  - AddSpookyPinballScraping DI extension
- CLI: --source spooky

Politeness
- robots.txt declares Crawl-delay: 10; the per-origin throttle picks that up via the shared robots-txt cache
- IPolitenessGate (per-origin throttle, 429 backoff, robots-txt enforcement)
- PinballWizard/0.1 (+https://github.com/Early-Bird-Solutions-LLC/PinballWizard; polite-scraper)

Tests
26 new unit tests → 248 total (was 222).
Coverage:
- 2486-2 → texaschainsaw
- Alice Cooper’s → Alice Cooper's

Test plan
- dotnet build: clean, zero warnings
- dotnet test: 248/248 pass
- dotnet run -- --status: DI startup clean, app reports existing catalog

Follow-ups (deferred to their own PRs)
- GameRecord → Machine repository persistence; same architectural debt called out in OPDB / JJP / AP PRs
- project_parallel_execution_plan.md to mark gates and Tracks B-mfg.{Stern, JJP, AP, Spooky} done

🤖 Generated with Claude Code