feat(opdb) OPDB integration -- Phase 1.1 / Track B-OPDB#30
Merged
Conversation
Gate 2 of the parallel execution plan: codifies the polite-scraping
invariants (per-origin throttle, robots.txt respect, 429 backoff /
abort, descriptive User-Agent, shared Playwright BrowserContext) into
a reusable foundation so every current and future manufacturer
scraper extends it instead of re-implementing politeness per source.
Per the locked feedback memory feedback_polite_scraping.md, the
politeness must be VISIBLY enforced in code -- extending the base is
the visible enforcement.
Foundation (PinballWizard.Core/Configuration + Infrastructure/Scraping/Polite):
- PolitenessOptions Bound from "Politeness" config section. UA,
request delay floor, max 429 streak, robots
enable/path/TTL.
- IPolitenessGate + Per-origin throttle via per-origin
PolitenessGate SemaphoreSlim, per-origin minimum delay
between requests, process-wide 429 streak with
abort-on-threshold (PolitenessException),
IAsyncDisposable lease pattern. Singleton -
per-origin throttle state shared across all
scrapers in the process so two concurrent
scrapers against the same origin don't double-
pace.
- RobotsTxtCache + Per-host parsed rules cached on first fetch
RobotsTxtParser with TTL refresh. Permissive fallback on 404 /
network failure (don't block legitimate work).
Longest-match Allow / Disallow rules,
agent-specific blocks beat wildcard. Supports
Crawl-delay and Sitemap directives, * and $
patterns.
- PolitenessException Thrown when robots.txt disallows a URL OR when
the 429 streak exceeds the configured maximum.
Distinct from transient errors -- means we have
decided NOT to keep asking. Orchestrator catches
at source boundary.
- PoliteScraperBase Abstract base for HTTP-driven scrapers. Helper
SendPolitelyAsync / GetStringPolitelyAsync route
through the gate.
- PolitePlaywrightScraperBase
Abstract base for Playwright-driven scrapers.
Owns a SHARED IBrowserContext per scraper
instance (replaces per-page NewContextAsync
waste). NewPolitePageAsync returns a PolitePage
that disposes the page AND releases the lease.
- AddPoliteScraping DI extension: binds PolitenessOptions, adds
typed HttpClient for RobotsTxtCache, registers
cache + gate as singletons.
Refactored scrapers (behavior unchanged, all 135 prior tests pass):
- ManualsScraper extends PoliteScraperBase
- GameListingScraper extends PolitePlaywrightScraperBase
- GamePageScraper extends PolitePlaywrightScraperBase
- ServiceBulletinScraper extends PolitePlaywrightScraperBase
Inline Task.Delay(_settings.PageLoadDelayMs) calls removed -- the gate
handles per-origin pacing on the next NewPolitePageAsync acquire.
User-Agent change:
- Before: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/131.0.0.0
- After: PinballWizard/0.1 (+https://github.com/Early-Bird-Solutions-LLC/PinballWizard;
polite-scraper)
Identifying ourselves to source-site operators is the polite-scraping
ethos: they should be able to see who we are and contact us via the
linked-back GitHub repo. Mimicking a Chrome UA was the previous
convention but conflicts with that ethos and is replaced.
Tests (23 new, 158 total passing):
- RobotsTxtParserTests Parser correctness across edge cases (empty
file, agent-specific blocks, longest-match,
wildcards, Crawl-delay, Sitemap, comments)
- RobotsTxtCacheTests Fetch + cache + 404 fallback + network-failure
fallback + per-host caching with stub
HttpMessageHandler
- PolitenessGateTests Per-origin delay enforcement, different
origins parallel, concurrent same-origin
serialization, robots.txt integration, 429
streak abort, 200 streak reset, non-429 4xx
doesn't change streak, null-arg validation
Out of scope (intentional follow-ups):
- Per-source PolitenessOverrides from the IngestionSource Cosmos
document (the config plumbing for per-source overrides lands when
Track B-OPDB or the first non-Stern scraper exercises it).
- Conditional-request handling on HTML page fetches (FileDownloader
already does ETag/If-Modified-Since for downloaded files).
- Replacing every WaitForTimeoutAsync with WaitForSelectorAsync (the
current heuristic timeouts work but are improvable; tracked as a
separate cleanup PR).
- Reusing one IBrowserContext across Game Listing + Game Page
scrapers (currently each scraper instance owns its own context).
First new "scraper" on the Clean Architecture layout (per the parallel execution plan). OPDB is API-based, not site-scraping, so it is the lowest-risk first integration -- validates the layout cleanly accommodates non-Stern sources before per-manufacturer scrapers start landing. Stacked on Dev-PoliteScraperBase (PR #29) so OpdbClient extends PoliteScraperBase. After #29 merges this PR's diff cleans up to just the OPDB additions. Code: PinballWizard.Core/Configuration/OpdbOptions.cs Bound from "Opdb" config section. BaseUrl (defaults to https://opdb.org/api/), bearer ApiToken (env var or user-secrets, never committed), PageSize, HttpTimeoutSeconds. PinballWizard.Application/Sync/IOpdbSyncService.cs + OpdbSyncResult.cs Application-layer contract + result record (Fetched / Inserted / Updated / Skipped + elapsed Duration). PinballWizard.Infrastructure/Integrations/Opdb/ - OpdbMachineDto + OpdbManufacturerDto + OpdbPersonDto -- wire DTOs with explicit [JsonPropertyName] snake_case mapping. - OpdbMachineMapper -- pure-function map / merge logic. Skips records that are not is_machine OR have no opdb_id OR have no manufacturer name. Manufacturer-key normalization for the 10 most-common pinball manufacturers (stern, jjp, americanpinball, spooky, multimorphic, cgc, haggis, pinballbrothers, dutch, barrelsoffun); unknown manufacturers fall through to a sanitized lower-case key. - OpdbClient -- typed HTTP client extending PoliteScraperBase so requests to opdb.org flow through the politeness gate. Bearer auth from OpdbOptions.ApiToken. StreamAllMachinesAsync pages until the API returns a partial-page response or empty page; GetMachineAsync returns null on 404, throws on other errors. - OpdbSyncService -- idempotent fetch-then-upsert orchestration. For each fetched DTO: map; if mapper returns null, skip; else look up existing machine by OPDB ID + manufacturer partition; if absent, insert; if present, merge OPDB-sourced fields onto the existing record (preserving project-owned fields: ManufacturerSlugs, Editions, FirstSeenAt) and upsert. Logs progress every 100 records. - AddOpdbIntegration DI extension -- binds OpdbOptions, registers typed HttpClient with the polite UA, registers IOpdbSyncService. Tests (27 new, 185 total passing): - OpdbMachineMapperTests Map: 4 happy-path / null-result branches + 11 manufacturer-key parameterizations + invalid-date / year-only-date edge cases. MergeOpdbFieldsInto: refreshes title / year / designers / themes / LastSeenAt; preserves FirstSeenAt + ManufacturerSlugs + Editions. - OpdbClientTests Paging across multiple stub-handler responses; bearer-token header applied; GetMachineAsync 404 returns null, 200 returns mapped DTO, blank ID throws. - OpdbSyncServiceTests New-machine path increments Inserted; existing-machine path increments Updated and the upserted record has the merged title + fresh LastSeenAt; non-machine record increments Skipped. Out of scope (intentional follow-ups): - CLI wiring (--source opdb in Program.cs) -- Cosmos isn't deployed yet, so there's nowhere for the sync to write. Lands as a tiny follow-up when the Bicep deploy from PR #27 actually runs and Cosmos is reachable. - Bicep ACA Job (scraper-opdb-sync, daily 02:00 UTC) -- separate IaC PR; per-environment Bicep work in general isn't ready yet. - Cosmos container declarations -- the Gate 1 CosmosBootstrapper currently provisions containers on first use; declaring containers in Bicep is a separate cleanup option. - catalog.json -> Machine migration -- existing GameRecord shape stays; an OPDB-id lookup pass that backfills GameRecord -> Machine.ManufacturerSlugs is its own concern.
| { | ||
| var gate = CreateGate(); | ||
|
|
||
| await using (var first = await gate.AcquireForRequestAsync(new Uri("https://a.example.com/x"), CancellationToken.None)) |
| await _contextInitLock.WaitAsync().ConfigureAwait(false); | ||
| try | ||
| { | ||
| if (_context is not null) |
| ArgumentNullException.ThrowIfNull(existing); | ||
| ArgumentNullException.ThrowIfNull(dto); | ||
|
|
||
| if (dto.Manufacturer?.Name is { } mfgName) existing.ManufacturerDisplayName.GetType(); // no-op: ManufacturerDisplayName is init-only and rarely changes |
Comment on lines
+224
to
+289
| foreach (var rawLine in body.Split('\n')) | ||
| { | ||
| var line = rawLine; | ||
| var commentStart = line.IndexOf('#', StringComparison.Ordinal); | ||
| if (commentStart >= 0) | ||
| { | ||
| line = line[..commentStart]; | ||
| } | ||
| line = line.Trim(); | ||
| if (line.Length == 0) | ||
| { | ||
| continue; | ||
| } | ||
|
|
||
| var colonIndex = line.IndexOf(':', StringComparison.Ordinal); | ||
| if (colonIndex <= 0) | ||
| { | ||
| continue; | ||
| } | ||
|
|
||
| var directive = line[..colonIndex].Trim(); | ||
| var value = line[(colonIndex + 1)..].Trim(); | ||
|
|
||
| switch (directive.ToLowerInvariant()) | ||
| { | ||
| case "user-agent": | ||
| if (!agentRules.TryGetValue(value, out currentBlock)) | ||
| { | ||
| currentBlock = new RobotsTxtRules.AgentRules(); | ||
| agentRules[value] = currentBlock; | ||
| } | ||
| break; | ||
|
|
||
| case "allow": | ||
| currentBlock?.Rules.Add(new RobotsTxtRules.Rule(value, Allow: true)); | ||
| break; | ||
|
|
||
| case "disallow": | ||
| if (value.Length > 0) | ||
| { | ||
| currentBlock?.Rules.Add(new RobotsTxtRules.Rule(value, Allow: false)); | ||
| } | ||
| break; | ||
|
|
||
| case "crawl-delay": | ||
| if (double.TryParse(value, System.Globalization.CultureInfo.InvariantCulture, out var delay)) | ||
| { | ||
| if (currentBlock is not null) | ||
| { | ||
| currentBlock.CrawlDelay = delay; | ||
| } | ||
| else | ||
| { | ||
| globalCrawlDelay = delay; | ||
| } | ||
| } | ||
| break; | ||
|
|
||
| case "sitemap": | ||
| if (value.Length > 0) | ||
| { | ||
| sitemaps.Add(value); | ||
| } | ||
| break; | ||
| } | ||
| } |
Comment on lines
+159
to
+166
| foreach (var rule in agentBlock.Rules) | ||
| { | ||
| if (PatternMatches(rule.Pattern, path) && | ||
| (best is null || rule.Pattern.Length > best.Pattern.Length)) | ||
| { | ||
| best = rule; | ||
| } | ||
| } |
Comment on lines
+151
to
+154
| catch (Exception ex) | ||
| { | ||
| Logger.LogDebug(ex, "Suppressed error disposing Playwright BrowserContext."); | ||
| } |
11 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 1.1 / Track B-OPDB of the parallel execution plan. First new "scraper" on the Clean Architecture layout — and OPDB is API-based (not site-scraping), so it is the lowest-risk first integration. Validates the layout cleanly accommodates non-Stern sources before per-manufacturer scrapers start landing.
This branch is stacked on
Dev-PoliteScraperBasesoOpdbClientextendsPoliteScraperBase. GitHub will show #29's diff in this PR until #29 merges. After #29 merges, this PR's diff cleans up to just the OPDB additions. Merge order: #29 first, then this.What ships
PinballWizard.Core/Configuration/OpdbOptions.csBound from
"Opdb"config section.BaseUrl, bearerApiToken(env var or user-secrets, never committed),PageSize,HttpTimeoutSeconds.PinballWizard.Application/Sync/IOpdbSyncService— application-layer contractOpdbSyncResult—Fetched / Inserted / Updated / Skipped + DurationPinballWizard.Infrastructure/Integrations/Opdb/OpdbMachineDto,OpdbManufacturerDto,OpdbPersonDto[JsonPropertyName]snake_case mappingOpdbMachineMapperis_machineOR have no OPDB ID OR have no manufacturer name. Manufacturer-key normalization for the 10 most-common pinball manufacturers; unknown manufacturers fall through to a sanitized lower-case key.OpdbClientPoliteScraperBaseso OPDB requests flow through the politeness gate. Bearer auth.StreamAllMachinesAsyncpages until partial / empty page;GetMachineAsyncreturns null on 404, throws on other errors.OpdbSyncServiceManufacturerSlugs,Editions,FirstSeenAt).AddOpdbIntegrationDI extensionOpdbOptions, registers typedHttpClientwith the polite UA, registersIOpdbSyncService.Test Plan
dotnet build /warnaserror— zero warningsdotnet test— 185/185 passing (was 158 pre-PR; +27 new OPDB tests)What the 27 new tests cover
OpdbMachineMapperTests(15): map happy-path + 4 null-result branches + 11 manufacturer-key parameterizations + invalid-date / year-only-date edge cases + merge-into-existing (refresh title / year / designers / themes /LastSeenAt; preserveFirstSeenAt,ManufacturerSlugs,Editions)OpdbClientTests(5): paging across multiple stub-handler responses, bearer-token header applied,GetMachineAsync404 returns null / 200 returns mapped DTO / blank ID throwsOpdbSyncServiceTests(3): new-machine path incrementsInserted; existing-machine path incrementsUpdatedand the upserted record has the merged title + freshLastSeenAt; non-machine record incrementsSkippedOut of Scope (intentional follow-ups)
--source opdbinProgram.cs) — Cosmos isn't deployed yet, so there's nowhere for the sync to write. Lands as a tiny follow-up when the Bicep deploy from PR #27 actually runs and Cosmos is reachable.scraper-opdb-sync, daily 02:00 UTC) — separate IaC PR; per-environment Bicep work in general isn't ready yet.CosmosBootstrappercurrently provisions containers on first use; declaring containers in Bicep is a separate cleanup option.catalog.json→Machinemigration — existingGameRecordshape stays; an OPDB-id lookup pass that backfillsGameRecord→Machine.ManufacturerSlugsis its own concern.What this unblocks
Machine.ManufacturerSlugs[mfg]entry pointing back to its source-specific slug.IMachineRepository. If anything's wrong with the Gate 1 schema design, this is where it surfaces.Suggested merge order