feat(cgc) Chicago Gaming Company scraper — Phase 1.3.d #40

Merged: jkeeley2073 merged 1 commit into main from Dev-ChicagoGamingScraper on May 3, 2026

@jkeeley2073
Contributor

Summary

Eighth manufacturer scraper. CGC ships "Remake" editions of classic Bally/Williams machines: Attack from Mars, Cactus Canyon, Medieval Madness, Monster Bash, Pulp Fiction.

CGC's site is custom Nginx-served HTML — no WordPress, no Shopify, no SPA, no JSON-LD. The scraper is a hybrid of two existing templates:

  • Discovery mirrors BofCategoryClient: anchor extraction from the /coinop/ index page, with a single-segment-slug filter that rejects /coinop/{slug}/update and .../update/mac sub-pages. The site's sitemap.xml is incomplete in practice (missing Pulp Fiction and Cactus Canyon as of 2026-05), so the index page is the canonical source. This is the same defence-in-depth pattern as JJP's collection-handle filter from PR #34 (fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit).
  • Extraction mirrors ApGamePageExtractor: page <title> with the uniform "| Chicago Gaming Company" suffix stripped, then <h1> fallback, then prettified-slug fallback. Same-host .pdf links are extracted as downloads (Pulp Fiction alone exposes 5: brochure, deposit agreement, feature matrix, rules manual, warranty).
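
As a rough illustration of the name fallback order described above (stripped <title>, then <h1>, then prettified slug), here is a minimal self-contained sketch. It is not the code in CgcGamePageExtractor: the names NameFallbackSketch, ResolveName and PrettifySlug are hypothetical, and the sketch takes the already-extracted <title> and <h1> text as plain strings instead of parsing HTML.

```csharp
#nullable enable
using System;
using System.Globalization;

public static class NameFallbackSketch
{
    // Suffix quoted from the PR description; everything else is illustrative.
    private const string SiteSuffix = "| Chicago Gaming Company";

    // Returns the first usable name: stripped <title>, then <h1>, then slug.
    // Returns null when none of the three inputs yields a name.
    public static string? ResolveName(string? pageTitle, string? h1Text, string? slug)
    {
        if (!string.IsNullOrWhiteSpace(pageTitle))
        {
            var title = pageTitle.Trim();
            if (title.EndsWith(SiteSuffix, StringComparison.OrdinalIgnoreCase))
            {
                title = title.Substring(0, title.Length - SiteSuffix.Length).Trim();
            }
            if (title.Length > 0) return title;
        }

        if (!string.IsNullOrWhiteSpace(h1Text)) return h1Text.Trim();

        return string.IsNullOrWhiteSpace(slug) ? null : PrettifySlug(slug);
    }

    // Turns a hyphenated slug into title-cased words.
    public static string PrettifySlug(string slug)
    {
        var words = slug.Trim().Replace('-', ' ').ToLowerInvariant();
        return CultureInfo.InvariantCulture.TextInfo.ToTitleCase(words);
    }
}
```

The slug comes last because title-casing capitalizes every word: a slug such as attack-from-mars would come out as "Attack From Mars" rather than the title's own casing.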

Reconciler integration

ScraperManufacturerKey.ChicagoGaming = "cgc" matches OpdbMachineMapper.NormalizeManufacturerKey exactly — records produced here will land in the correct Cosmos partition for the reconciler from PR #35.

Politeness

CGC's robots.txt blocks /images for User-agent: * and bans several search-engine bots by name (Exabot, Baidu, Yandex, etc.). The scraper never fetches images and identifies itself as PinballWizard/0.1, so neither rule applies to its requests and the politeness gate honors the policy vacuously. The project honors all robots.txt rules (confirmed by the user earlier in this session), and no Disallow boundary is crossed here.
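
To make "honored vacuously" concrete, here is a minimal sketch that assumes plain prefix matching for Disallow rules (no wildcards, no Allow overrides), which is a simplification of real robots.txt handling. RobotsSketch and IsAllowed are hypothetical names and do not reflect the project's politeness gate.

```csharp
#nullable enable
using System;
using System.Linq;

public static class RobotsSketch
{
    // The single Disallow rule for "User-agent: *" mentioned in the PR text.
    private static readonly string[] DisallowedPrefixes = { "/images" };

    // A path is allowed when no Disallow prefix matches the start of it.
    public static bool IsAllowed(string path) =>
        !DisallowedPrefixes.Any(p => path.StartsWith(p, StringComparison.Ordinal));
}
```

Under that assumption, the /coinop/ index and the machine pages beneath it never match the /images prefix, so the rule has nothing to block.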

Pre-push self-audit (per PR #34 + PR #36)

Step 0 — /local-review (qualitative)

Local review: 0 🔴 / 4 ⚠️ / 6 categories ✅

| # | Severity | Finding | Action |
| --- | --- | --- | --- |
| 1 | ⚠️ | Missing test for LinkText populated/null on DiscoveredLink | Deferred — family-wide gap; AP has same gap |
| 2 | ⚠️ | Missing assertion for Source.ScrapedFrom / Source.ScrapedAt / DiscoveredOn provenance fields | Deferred — family-wide test-infra gap (same as Multimorphic PR #39) |
| 3 | ⚠️ | MachinesIndexPath and GamePathPrefix both default to /coinop/ (no [NotEqual] invariant) | Deferred — by design; the index page IS at /coinop/ |
| 4 | ⚠️ | Per-page double parse (Game extraction, then Downloads) — sibling parity with AP | Deferred — network/politeness delays dominate; refactor only if a future shop has hundreds of pages |

Step 1 — Mechanical checklist

  • Every ChicagoGamingOptions property is read by code (verified by grep: BaseUrl, MachinesIndexPath and GamePathPrefix all hit)
  • Sibling-diff: CgcMenuClient mirrors BofCategoryClient; CgcGamePageExtractor mirrors ApGamePageExtractor; CgcGamePageScraper mirrors ApGamePageScraper. Log shapes include the manufacturer name (the finding from PR #37, the Pinball Brothers scraper)
  • No bare catch { } in src/PinballWizard.Infrastructure/Scraping/ChicagoGaming/
  • SourceAliasContractTests still passes — Name = "Chicago Gaming" matches alias-map value
  • Tests assert behavior — CgcMenuClientTests fixture mixes 5 machines + sub-pages + index page + arcade games + external host (all 5 machines pass, others rejected)
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' — personal noreply

Tests

+19 tests (CgcMenuClient + CgcGamePageExtractor + ScraperManufacturerKey row).

  • CgcMenuClientTests (7) — canonical machine URL filter; sub-page rejection; fragment + query dedup; relative href base resolution; prefix without trailing slash; null/blank-arg validation
  • CgcGamePageExtractorTests (11) — title with suffix strip; h1 fallback; prettified-slug fallback; no-slug → null; same-host PDF set (5 fixtures); external + non-PDF rejection; dedup; slug extraction edge cases; null-arg validation
  • ScraperManufacturerKeyTests (+1 row) — game_cgc_medieval-madness → cgc
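
The URL-filter cases listed for CgcMenuClientTests (canonical machine URL, sub-page rejection, prefix without trailing slash) can be sketched roughly as below. This is a simplified stand-in that works on path strings only; SlugFilterSketch, NormalizePrefix and IsMachinePath are hypothetical names and are not taken from CgcMenuClient.

```csharp
#nullable enable
using System;

public static class SlugFilterSketch
{
    // Ensures the prefix starts and ends with '/', so "/coinop" and
    // "/coinop/" behave the same.
    public static string NormalizePrefix(string prefix)
    {
        var p = prefix.StartsWith('/') ? prefix : "/" + prefix;
        return p.EndsWith('/') ? p : p + "/";
    }

    // True only for paths of the form {prefix}{slug} or {prefix}{slug}/,
    // i.e. exactly one non-empty segment after the prefix.
    public static bool IsMachinePath(string absolutePath, string prefix)
    {
        var normalized = NormalizePrefix(prefix);
        if (!absolutePath.StartsWith(normalized, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        var afterPrefix = absolutePath.Substring(normalized.Length).TrimEnd('/');
        return afterPrefix.Length > 0 && !afterPrefix.Contains('/');
    }
}
```

With a prefix of /coinop, a path such as /coinop/medieval-madness passes, while /coinop/ itself and /coinop/medieval-madness/update are rejected.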

Branch state

Branched from main-at-PR-#38 (so #39 Multimorphic is not yet in main). When #39 merges first, this PR will need a trivial rebase: both touch Enums.cs, ScraperManufacturerKey.cs, ScraperOrchestrator.cs, Program.cs, appsettings.json with non-overlapping additions. The conflict is mechanical — different SourceType enum values, different alias-map entries, different DI registrations.

Phase 1.3 status after this PR

| Mfr | Status |
| --- | --- |
| Pinball Brothers (#37) | ✅ shipped |
| Barrels of Fun (#38) | ✅ shipped |
| Multimorphic (#39) | 🟡 open, will merge first |
| Chicago Gaming | ✅ this PR |
| Dutch Pinball | 🚫 deferred — Disallow: / |
| Haggis | 🚫 deferred — infrastructure outage; retry in 1-2 weeks |

After this PR and #39 merge, Phase 1.3 closes with 4 new manufacturers (5 if Haggis recovers in time). Catalog coverage will be ~99% of currently shipping commercial pinball.

Out of scope

  • Family-wide testing infrastructure (mocking IPolitenessGate + HttpMessageHandler to enable scraper-pipeline integration tests across all 8 scrapers) — separate cleanup PR
  • Refactor JSON-LD parser into shared Common/JsonLd/ helper — three storefronts use it now; planned for next cleanup pass
  • Haggis recon retry — propose /schedule an agent in 2 weeks

🤖 Generated with Claude Code

Eighth manufacturer scraper, second to use a custom-CMS template
(after AP). CGC ships "Remake" editions of classic Bally/Williams
machines: Attack from Mars, Cactus Canyon, Medieval Madness,
Monster Bash, Pulp Fiction.

CGC's site is custom Nginx-served HTML -- no WordPress, no Shopify,
no SPA, no JSON-LD. Hybrid template:
  * Discovery via BofCategoryClient pattern: /coinop/ index page
    anchor extraction with single-segment-slug filter rejecting
    /coinop/{slug}/update and .../update/mac sub-pages. The site's
    sitemap.xml is incomplete (missing Pulp Fiction and Cactus
    Canyon as of 2026-05), so the index page is the canonical source.
  * Extraction via ApGamePageExtractor pattern: page <title> with
    uniform "| Chicago Gaming Company" suffix stripped, h1 fallback,
    prettified-slug fallback. Same-host .pdf extraction for manuals,
    brochures, feature matrices, rules manuals, deposit agreements,
    warranties.

ScraperManufacturerKey adds ChicagoGaming = "cgc" matching
OpdbMachineMapper.NormalizeManufacturerKey exactly so reconciled
records land in the correct Cosmos partition.

CGC's robots.txt blocks /images for User-agent: *; the scraper
never fetches images, so the policy is honored vacuously.

Pre-push self-audit:
  * /local-review (Step 0): 0 🔴 / 4 ⚠️ -- all deferred (family-wide
    test-infra gaps + sibling parity with AP, none are regressions)
  * 7-item mechanical checklist (Step 1): all pass

Build: zero warnings. CLI: --source cgc.

Branched from main-at-PR-#38; will need trivial rebase if PR #39
(Multimorphic) merges first since both touch Enums.cs,
ScraperManufacturerKey.cs, ScraperOrchestrator.cs, Program.cs,
appsettings.json with non-overlapping additions.
jkeeley2073 added the claude-code label on May 3, 2026
Comment on lines +152 to +156
foreach (var sep in separators)
{
    var idx = title.IndexOf(sep, StringComparison.Ordinal);
    if (idx > 0) return title[..idx].Trim();
}
Comment on lines +168 to +171
foreach (var ext in DownloadableExtensions)
{
    if (path.EndsWith(ext, StringComparison.OrdinalIgnoreCase)) return true;
}
Comment on lines +123 to +127
catch (Exception ex)
{
    Logger.LogWarning(ex, "Chicago Gaming scraper: failed to fetch / extract {Url}; skipping.", machineUrl);
    return (null, []);
}
Comment on lines +83 to +105
foreach (var anchor in doc.QuerySelectorAll("a[href]"))
{
    var href = anchor.GetAttribute("href");
    if (string.IsNullOrWhiteSpace(href)) continue;
    if (!Uri.TryCreate(baseUri, href, out var absolute)) continue;

    if (!string.Equals(absolute.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase)) continue;
    if (!absolute.AbsolutePath.StartsWith(normalizedPrefix, StringComparison.OrdinalIgnoreCase)) continue;

    // Single-slug-segment requirement rejects /coinop/ itself, /coinop/{slug}/update,
    // /coinop/{slug}/update/mac, etc.
    var afterPrefix = absolute.AbsolutePath[normalizedPrefix.Length..].TrimEnd('/');
    if (afterPrefix.Length == 0) continue;
    if (afterPrefix.Contains('/', StringComparison.Ordinal)) continue;

    // Drop fragment / query so anchor variants of the same machine
    // canonicalise to one URL in the result set.
    var canonical = new UriBuilder(absolute) { Fragment = "", Query = "" }.Uri;
    if (seen.Add(canonical.AbsoluteUri))
    {
        urls.Add(canonical);
    }
}
jkeeley2073 merged commit 10fff1b into main on May 3, 2026
4 checks passed
Labels

claude-code Generated with Claude Code

2 participants