feat(spooky) Spooky Pinball scraper -- Phase 1.2.c#33

Merged
jkeeley2073 merged 1 commit into main from Dev-SpookyScraper
May 2, 2026

Conversation

@jkeeley2073
Contributor

Summary

Third non-Stern manufacturer scraper. Spooky Pinball brings the count to four: Stern + JJP + AP + Spooky covers the bulk of currently-shipping commercial pinball.

Spooky runs WordPress + WooCommerce + Yoast SEO with the WP REST API fully open at /wp-json/wp/v2/pages — so this scraper consumes structured JSON rather than scraping rendered HTML. More reliable than DOM heuristics, politer (less data per request), and naturally entity-decoded.

Discovery rule

A WP page is treated as a game page iff its rendered content contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com AND those URLs all share a single distinct first path segment (the canonical game slug). This naturally rejects aggregator/cross-game pages like the real "SCOOBY BASE IMAGE UPDATE" page (id 2445) that links to firmware for 5 different games — observed during recon and locked into the test suite.

The S3-derived slug becomes GameRecord.Slug, so games whose WP slug is a numeric placeholder (like 2486-2 for "Texas Chainsaw Massacre") still get a stable human-meaningful canonical slug (texaschainsaw). That's what Spooky uses internally and is more useful than a WP page slug for cross-source linking.
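
The discovery rule above could be sketched roughly as follows. This is a minimal illustration, not the merged implementation: the class name, method names, and regex are assumptions, and only the S3 host comes from the description. A page counts as a game page only when exactly one distinct first path segment appears across its S3 links.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of the discovery rule; names and pattern are assumptions.
public static class SpookyDiscoverySketch
{
    // Host taken from the PR description.
    private const string S3Host = "spookypinball.s3.us-east-2.amazonaws.com";

    // Group 1 captures the first path segment after the S3 host.
    // Assumes firmware files always sit under a slug folder (trailing slash required).
    private static readonly Regex S3Url = new Regex(
        @"https?://" + Regex.Escape(S3Host) + @"/([^/\s""'<>]+)/",
        RegexOptions.IgnoreCase);

    // Returns the distinct first path segments found in the rendered HTML.
    public static ISet<string> ExtractSlugs(string renderedHtml)
    {
        var slugs = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrEmpty(renderedHtml)) return slugs;

        foreach (Match m in S3Url.Matches(renderedHtml))
        {
            slugs.Add(m.Groups[1].Value);
        }
        return slugs;
    }

    // True only when exactly one distinct slug is present; that slug is the
    // candidate canonical game slug. Zero slugs (no firmware links) and
    // multiple slugs (aggregator pages) are both rejected.
    public static bool TryGetCanonicalSlug(string renderedHtml, out string slug)
    {
        slug = string.Empty;
        var slugs = ExtractSlugs(renderedHtml);
        if (slugs.Count != 1) return false;

        foreach (var s in slugs) slug = s;
        return true;
    }
}
```

Under these assumptions, a page whose links all point under `/texaschainsaw/` yields `texaschainsaw`, while a page linking to firmware for several games yields several slugs and is skipped.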

What's added

  • PinballWizard.Core/Configuration/SpookyOptions.cs — BaseUrl, PagesEndpointPath, PageSize (max 100), S3Host
  • PinballWizard.Core/Models/Enums.cs — adds SourceType.SpookyPinballGamePage
  • PinballWizard.Application/ScraperOrchestrator.cs — adds ["spooky"] = "Spooky Pinball" source-filter alias
  • PinballWizard.Infrastructure/Scraping/Spooky/:
    • SpookyPageRaw + WpRenderedField — WP REST DTOs as classes with init-only accessors (records bug carried forward from AP scraper)
    • SpookyWpPagesClient extends PoliteScraperBase — paginated WP REST consumer; static parsing surface (page JSON deserialization, S3-slug extraction, single-slug game filter) kept testable
    • SpookyGamePageExtractor — pure-function page → GameRecord + downloads, HTML-entity decoded, anchor-text labels attached where present
    • SpookyGamePageScraper extends PoliteScraperBase, implements ISourceScraper — yields one .Game ScrapedItem and one .Link ScrapedItem per S3 firmware URL
    • AddSpookyPinballScraping DI extension
  • CLI: --source spooky
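
For illustration, the DTO and static parsing surface listed above might look something like the sketch below. The `id`, `slug`, `title.rendered`, and `content.rendered` fields are standard WP REST page fields; the type names (suffixed `Sketch`), the `ParsePages` helper, and the choice of System.Text.Json are assumptions rather than the merged code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs: classes with init-only accessors, mirroring the PR's description.
public sealed class WpRenderedFieldSketch
{
    [JsonPropertyName("rendered")]
    public string Rendered { get; init; } = string.Empty;
}

public sealed class SpookyPageRawSketch
{
    [JsonPropertyName("id")] public int Id { get; init; }
    [JsonPropertyName("slug")] public string Slug { get; init; } = string.Empty;
    [JsonPropertyName("title")] public WpRenderedFieldSketch Title { get; init; } = new();
    [JsonPropertyName("content")] public WpRenderedFieldSketch Content { get; init; } = new();
}

public static class WpPagesParserSketch
{
    // Parses one response body from /wp-json/wp/v2/pages.
    // Empty, non-array, and malformed bodies all yield an empty list.
    public static IReadOnlyList<SpookyPageRawSketch> ParsePages(string body)
    {
        if (string.IsNullOrWhiteSpace(body)) return Array.Empty<SpookyPageRawSketch>();

        try
        {
            // WP returns a JSON object (not an array) for errors, so check the root kind.
            using var doc = JsonDocument.Parse(body);
            if (doc.RootElement.ValueKind != JsonValueKind.Array)
                return Array.Empty<SpookyPageRawSketch>();

            return JsonSerializer.Deserialize<List<SpookyPageRawSketch>>(body)
                   ?? new List<SpookyPageRawSketch>();
        }
        catch (JsonException)
        {
            return Array.Empty<SpookyPageRawSketch>();
        }
    }
}
```

Keeping the parser static and free of HTTP concerns is what lets it be unit tested against canned JSON bodies without touching the live site.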

Politeness

  • Spooky's robots.txt declares Crawl-delay: 10; the per-origin throttle picks that up via the shared robots-txt cache
  • All HTTP routes through IPolitenessGate (per-origin throttle, 429 backoff, robots-txt enforcement)
  • User-Agent identifies the project (PinballWizard/0.1 (+https://github.com/Early-Bird-Solutions-LLC/PinballWizard; polite-scraper))
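
As a rough illustration of how a Crawl-delay value can be read from robots.txt, the sketch below scans for the first `Crawl-delay:` directive. It is deliberately simplified and is not the project's shared robots-txt cache: the class and method names are hypothetical, and it ignores User-agent grouping, which a real parser would need to respect.

```csharp
using System;
using System.Globalization;

// Hypothetical, simplified Crawl-delay reader; ignores User-agent groups.
public static class RobotsCrawlDelaySketch
{
    // Returns the first Crawl-delay found, or null when none is declared.
    public static TimeSpan? ParseCrawlDelay(string robotsTxt)
    {
        if (string.IsNullOrWhiteSpace(robotsTxt)) return null;

        const string key = "crawl-delay:";
        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Drop trailing comments, then surrounding whitespace (including '\r').
            var line = rawLine;
            var hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            if (!line.StartsWith(key, StringComparison.OrdinalIgnoreCase)) continue;

            var value = line.Substring(key.Length).Trim();
            if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var seconds)
                && seconds >= 0)
            {
                return TimeSpan.FromSeconds(seconds);
            }
        }
        return null;
    }
}
```

With the `Crawl-delay: 10` that Spooky's robots.txt declares, this would yield a ten-second minimum gap for a per-origin throttle to enforce between requests.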

Tests

26 new unit tests → 248 total (was 222).

Coverage:

  • JSON deserialization round-trip (full field binding) + graceful handling of empty / non-array / malformed bodies
  • S3-slug extraction (distinct slugs across content, non-S3 URL rejection, empty content)
  • Single-S3-slug game filter (single-slug game accepted / multi-slug aggregator rejected / no-S3-link rejected)
  • Canonical S3-derived slug used even when WP slug is numeric placeholder (2486-2 → texaschainsaw)
  • HTML entity decoding in titles (Alice Cooper’s → Alice Cooper's)
  • Downloads dedup of repeated S3 hrefs
  • Anchor-text label attachment from rendered HTML
  • Null/blank-arg validation across both client and extractor
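
Two of the behaviours covered above, entity decoding and download dedup, can be sketched with `System.Net.WebUtility` and a `HashSet`. The helper names are hypothetical, and the `&#8217;` input mentioned in the note after the code is an assumed example of how a WordPress title might be encoded, not a value confirmed by the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helpers illustrating entity decoding and href dedup.
public static class DownloadDedupSketch
{
    // Decodes HTML entities in a rendered WP title and trims whitespace.
    public static string DecodeTitle(string renderedTitle) =>
        WebUtility.HtmlDecode(renderedTitle ?? string.Empty).Trim();

    // Decodes each href, then keeps the first occurrence of each URL in order.
    public static IReadOnlyList<string> DedupUrls(IEnumerable<string> hrefs)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var href in hrefs)
        {
            var url = WebUtility.HtmlDecode(href);
            if (seen.Add(url)) result.Add(url);
        }
        return result;
    }
}
```

Decoding before deduplicating matters: an href written with `&amp;` and the same href written with a bare `&` should collapse to one download. Likewise, `DecodeTitle("Alice Cooper&#8217;s")` returns the title with a real apostrophe character (U+2019) in place of the numeric entity.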

Test plan

  • dotnet build — clean, zero warnings
  • dotnet test — 248/248 pass
  • dotnet run -- --status — DI startup clean, app reports existing catalog
  • Live-site recon validated discovery rule against real page IDs (Beetlejuice, Evil Dead, Halloween, Ultraman, Scooby, Looney Tunes, Texas Chainsaw, plus negative case "SCOOBY BASE IMAGE UPDATE")
  • Live-site end-to-end run after merge (deferred — same discipline applied to AP / JJP)

Follow-ups (deferred to their own PRs)

  • Cosmos migration PR — GameRecord → Machine repository persistence; same architectural debt called out in OPDB / JJP / AP PRs
  • Update project_parallel_execution_plan.md to mark gates and Tracks B-mfg.{Stern, JJP, AP, Spooky} done
  • Phase 1.3 manufacturer scrapers (Multimorphic, Chicago Gaming, Haggis, Pinball Brothers, Dutch, Barrels of Fun) — same template

🤖 Generated with Claude Code

Third non-Stern manufacturer scraper. Spooky is WordPress + WooCommerce
+ Yoast with the WP REST API fully open at /wp-json/wp/v2/pages, so
the scraper consumes structured JSON instead of scraping rendered
HTML -- more reliable than DOM heuristics, politer (less data per
request), and naturally entity-decoded.

Discovery rule: a WP page is a game page iff its rendered content
contains S3 firmware URLs at spookypinball.s3.us-east-2.amazonaws.com
AND those URLs all share a single distinct first path segment (the
canonical game slug). This naturally rejects aggregator pages like
"SCOOBY BASE IMAGE UPDATE" that link to firmware for several games.
The S3-derived slug becomes GameRecord.Slug, so games whose WP slug
is a numeric placeholder ("2486-2") still get a stable human-meaningful
slug ("texaschainsaw").

26 new unit tests (248 total). CLI: --source spooky.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026
@jkeeley2073 jkeeley2073 merged commit 23d3332 into main May 2, 2026
3 checks passed
Comment on lines +86 to +100
foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
{
    var url = WebUtility.HtmlDecode(match.Value);
    if (!seenUrls.Add(url)) continue;

    anchorTextByHref.TryGetValue(url, out var linkText);

    links.Add(new DiscoveredLink
    {
        FileUrl = url,
        LinkText = string.IsNullOrWhiteSpace(linkText) ? null : linkText,
        DiscoveryContext = "Spooky Pinball Game Page",
        GameSlug = canonicalSlug,
    });
}
Comment on lines +122 to +127
catch
{
    // Anchor lookup is best-effort — a parse failure should not
    // block download discovery; the regex pass already captured
    // the URLs we care about.
}
Comment on lines +144 to +151
foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
{
    var slug = match.Groups[1].Value;
    if (!string.IsNullOrWhiteSpace(slug))
    {
        slugs.Add(slug);
    }
}