feat(cgc) Chicago Gaming Company scraper — Phase 1.3.d #40
Merged
Eighth manufacturer scraper, second to use a custom-CMS template
(after AP). CGC ships "Remake" editions of classic Bally/Williams
machines: Attack from Mars, Cactus Canyon, Medieval Madness,
Monster Bash, Pulp Fiction.
CGC's site is custom Nginx-served HTML -- no WordPress, no Shopify,
no SPA, no JSON-LD. Hybrid template:
* Discovery via BofCategoryClient pattern: /coinop/ index page
anchor extraction with single-segment-slug filter rejecting
/coinop/{slug}/update and .../update/mac sub-pages. The site's
sitemap.xml is incomplete (missing Pulp Fiction and Cactus
Canyon as of 2026-05), so the index page is the canonical source.
* Extraction via ApGamePageExtractor pattern: page <title> with
uniform "| Chicago Gaming Company" suffix stripped, h1 fallback,
prettified-slug fallback. Same-host .pdf extraction for manuals,
brochures, feature matrices, rules manuals, deposit agreements,
warranties.
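The name-resolution order described above (title with suffix stripped, then h1, then prettified slug) can be sketched roughly as follows. This is a minimal illustration, not the PR's `CgcGamePageExtractor`: the class name `TitleFallbackSketch`, the method `ResolveName`, and its parameters are hypothetical, and the title-casing rule for the slug is an assumption.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper; names are illustrative and not taken from the PR.
public static class TitleFallbackSketch
{
    // The uniform suffix the commit message says is stripped from <title>.
    private const string Suffix = "| Chicago Gaming Company";

    public static string? ResolveName(string? pageTitle, string? h1Text, string? slug)
    {
        // 1. Prefer <title>, with the site-wide suffix removed.
        if (!string.IsNullOrWhiteSpace(pageTitle))
        {
            var idx = pageTitle.IndexOf(Suffix, StringComparison.Ordinal);
            var stripped = (idx >= 0 ? pageTitle[..idx] : pageTitle).Trim();
            if (stripped.Length > 0) return stripped;
        }

        // 2. Fall back to the text of the first <h1>.
        if (!string.IsNullOrWhiteSpace(h1Text)) return h1Text.Trim();

        // 3. Last resort: prettify the URL slug (assumed rule: hyphens to spaces, title case).
        if (string.IsNullOrWhiteSpace(slug)) return null;
        var words = slug.Replace('-', ' ').Trim();
        return CultureInfo.InvariantCulture.TextInfo.ToTitleCase(words);
    }
}
```

Under these assumptions, a page with no usable title or h1 and the slug `medieval-madness` would resolve to "Medieval Madness", and a page with no slug at all resolves to null.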
ScraperManufacturerKey adds ChicagoGaming = "cgc" matching
OpdbMachineMapper.NormalizeManufacturerKey exactly so reconciled
records land in the correct Cosmos partition.
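As a rough illustration of why the key must match exactly, the sketch below pulls a manufacturer key out of a document id. It assumes ids have the shape `game_{key}_{slug}`, which is inferred only from the `game_cgc_medieval-madness` → `cgc` test row in the PR description; the class and method names are hypothetical and this is not the real `ScraperManufacturerKey` implementation.

```csharp
#nullable enable
using System;

// Hypothetical sketch; the id format "game_{key}_{slug}" is an assumption.
public static class ManufacturerKeySketch
{
    public static string? ExtractManufacturerKey(string? id)
    {
        const string prefix = "game_";
        if (string.IsNullOrEmpty(id) || !id.StartsWith(prefix, StringComparison.Ordinal))
        {
            return null;
        }

        // Everything between the prefix and the next underscore is the key.
        var rest = id[prefix.Length..];
        var sep = rest.IndexOf('_');
        return sep > 0 ? rest[..sep] : null;
    }
}
```

If the scraper emitted a different key than the mapper expects (say `chicagogaming` instead of `cgc`), the two sides would address different partitions and reconciliation would silently miss the records.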
CGC's robots.txt blocks /images for User-agent: *; the scraper
never fetches images, so the policy is honored vacuously.
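A simplified view of the check involved, assuming plain prefix matching against the Disallow rules that apply to our user agent. This is only a sketch: it ignores Allow rules, wildcards, and longest-match precedence, and the names `RobotsSketch` and `IsPathAllowed` are hypothetical rather than the project's politeness gate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a Disallow-prefix check; not the project's politeness gate.
public static class RobotsSketch
{
    public static bool IsPathAllowed(string path, IEnumerable<string> disallowPrefixes)
    {
        foreach (var prefix in disallowPrefixes)
        {
            // An empty Disallow value places no restriction.
            if (prefix.Length == 0) continue;
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

With a single `/images` rule, machine pages under `/coinop/` pass and anything under `/images` is refused, which is why a scraper that never requests images never hits the boundary.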
Pre-push self-audit:
* /local-review (Step 0): 0 🔴 / 4 ⚠️ -- all deferred (family-wide
test-infra gaps + sibling parity with AP, none are regressions)
* 7-item mechanical checklist (Step 1): all pass
Build: zero warnings. CLI: --source cgc.
Branched from main-at-PR-#38; will need trivial rebase if PR #39
(Multimorphic) merges first since both touch Enums.cs,
ScraperManufacturerKey.cs, ScraperOrchestrator.cs, Program.cs,
appsettings.json with non-overlapping additions.
Comment on lines +152 to +156
    foreach (var sep in separators)
    {
        var idx = title.IndexOf(sep, StringComparison.Ordinal);
        if (idx > 0) return title[..idx].Trim();
    }
Comment on lines +168 to +171
    foreach (var ext in DownloadableExtensions)
    {
        if (path.EndsWith(ext, StringComparison.OrdinalIgnoreCase)) return true;
    }
Comment on lines +123 to +127
    catch (Exception ex)
    {
        Logger.LogWarning(ex, "Chicago Gaming scraper: failed to fetch / extract {Url}; skipping.", machineUrl);
        return (null, []);
    }
Comment on lines +83 to +105
    foreach (var anchor in doc.QuerySelectorAll("a[href]"))
    {
        var href = anchor.GetAttribute("href");
        if (string.IsNullOrWhiteSpace(href)) continue;
        if (!Uri.TryCreate(baseUri, href, out var absolute)) continue;

        if (!string.Equals(absolute.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase)) continue;
        if (!absolute.AbsolutePath.StartsWith(normalizedPrefix, StringComparison.OrdinalIgnoreCase)) continue;

        // Single-slug-segment requirement rejects /coinop/ itself, /coinop/{slug}/update,
        // /coinop/{slug}/update/mac, etc.
        var afterPrefix = absolute.AbsolutePath[normalizedPrefix.Length..].TrimEnd('/');
        if (afterPrefix.Length == 0) continue;
        if (afterPrefix.Contains('/', StringComparison.Ordinal)) continue;

        // Drop fragment / query so anchor variants of the same machine
        // canonicalise to one URL in the result set.
        var canonical = new UriBuilder(absolute) { Fragment = "", Query = "" }.Uri;
        if (seen.Add(canonical.AbsoluteUri))
        {
            urls.Add(canonical);
        }
    }
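To make the single-slug-segment rule in the excerpt above easier to reason about, here is the path test pulled out as a standalone predicate. It is a sketch that mirrors the excerpt's logic; the class name `SlugFilterSketch` and method `IsSingleSlugPath` are hypothetical and do not appear in the PR.

```csharp
using System;

// Hypothetical standalone version of the path filter shown in the excerpt.
public static class SlugFilterSketch
{
    public static bool IsSingleSlugPath(string absolutePath, string normalizedPrefix)
    {
        if (!absolutePath.StartsWith(normalizedPrefix, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Whatever follows the prefix must be exactly one non-empty segment.
        var afterPrefix = absolutePath[normalizedPrefix.Length..].TrimEnd('/');
        return afterPrefix.Length > 0 && !afterPrefix.Contains('/');
    }
}
```

With the prefix `/coinop/`, the path `/coinop/medieval-madness` (with or without a trailing slash) passes, while `/coinop/` itself, `/coinop/medieval-madness/update`, and `/coinop/medieval-madness/update/mac` are all rejected.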
Summary
Eighth manufacturer scraper. CGC ships "Remake" editions of classic Bally/Williams machines: Attack from Mars, Cactus Canyon, Medieval Madness, Monster Bash, Pulp Fiction.
CGC's site is custom Nginx-served HTML — no WordPress, no Shopify, no SPA, no JSON-LD. The scraper is a hybrid of two existing templates:
* BofCategoryClient: /coinop/ index page anchor extraction with a single-segment-slug filter that rejects /coinop/{slug}/update and .../update/mac sub-pages. The site's sitemap.xml is incomplete in practice (missing Pulp Fiction and Cactus Canyon as of 2026-05), so the index page is the canonical source — same defence-in-depth pattern as JJP's collection-handle filter from PR #34 (fix(scrapers) wire JJP machine filter, harden Spooky, add pre-push self-audit).
* ApGamePageExtractor: page <title> with uniform "| Chicago Gaming Company" suffix stripped, <h1> fallback, prettified-slug fallback. Same-host .pdf link extraction (Pulp Fiction alone exposes 5: brochure, deposit agreement, feature matrix, rules manual, warranty).
Reconciler integration
ScraperManufacturerKey.ChicagoGaming = "cgc" matches OpdbMachineMapper.NormalizeManufacturerKey exactly — records produced here will land in the correct Cosmos partition for the reconciler from PR #35.
Politeness
CGC's robots.txt blocks /images for User-agent: * and bans various search-engine bots specifically (Exabot, Baidu, Yandex, etc.). The scraper never fetches images and we identify ourselves as PinballWizard/0.1, so the policy is honored vacuously by the polite gate. The user explicitly confirmed earlier in this session that we honor all robots.txt — no Disallow boundaries crossed here.
Pre-push self-audit (per PR #34 + PR #36)
Step 0 — /local-review (qualitative)
Local review: 0 🔴 / 4 ⚠️ / 6 categories ✅
* LinkText populated/null on DiscoveredLink
* Source.ScrapedFrom / Source.ScrapedAt / DiscoveredOn provenance fields
* MachinesIndexPath and GamePathPrefix both default to /coinop/ (no [NotEqual] invariant)
* /coinop/
Step 1 — Mechanical checklist
* ChicagoGamingOptions property read by code (verified by grep — BaseUrl / MachinesIndexPath / GamePathPrefix all hit)
* CgcMenuClient mirrors BofCategoryClient; CgcGamePageExtractor mirrors ApGamePageExtractor; CgcGamePageScraper mirrors ApGamePageScraper. Log shapes include manufacturer name (the PR #37 finding, feat(pinballbrothers) Pinball Brothers scraper — Phase 1.3.a)
* catch { } in src/PinballWizard.Infrastructure/Scraping/ChicagoGaming/
* SourceAliasContractTests still passes — Name = "Chicago Gaming" matches alias-map value
* CgcMenuClientTests fixture mixes 5 machines + sub-pages + index page + arcade games + external host (all 5 machines pass, others rejected)
* git log -1 --format='%an <%ae>' — personal noreply
Tests
+19 tests (CgcMenuClient + CgcGamePageExtractor + ScraperManufacturerKey row).
* CgcMenuClientTests (7) — canonical machine URL filter; sub-page rejection; fragment + query dedup; relative href base resolution; prefix without trailing slash; null/blank-arg validation
* CgcGamePageExtractorTests (11) — title with suffix strip; h1 fallback; prettified-slug fallback; no-slug → null; same-host PDF set (5 fixtures); external + non-PDF rejection; dedup; slug extraction edge cases; null-arg validation
* ScraperManufacturerKeyTests (+1 row) — game_cgc_medieval-madness → cgc
Branch state
Branched from main-at-PR-#38 (so #39 Multimorphic is not yet in main). When #39 merges first, this PR will need a trivial rebase: both touch Enums.cs, ScraperManufacturerKey.cs, ScraperOrchestrator.cs, Program.cs, appsettings.json with non-overlapping additions. The conflict is mechanical — different SourceType enum values, different alias-map entries, different DI registrations.
Phase 1.3 status after this PR
Disallow: /
After this and #39 merge: Phase 1.3 closes with 4 new manufacturers (5 if Haggis recovers in time). Catalog coverage ~99% of currently-shipping commercial pinball.
Out of scope
* IPolitenessGate + HttpMessageHandler to enable scraper-pipeline integration tests across all 8 scrapers) — separate cleanup PR
* Common/JsonLd/ helper — three storefronts use it now; planned for next cleanup pass
* /schedule an agent in 2 weeks
🤖 Generated with Claude Code