Skip to content

feat(opdb) OPDB integration -- Phase 1.1 / Track B-OPDB#30

Merged
jkeeley2073 merged 2 commits into
mainfrom
Dev-OpdbIntegration
May 2, 2026
Merged

feat(opdb) OPDB integration -- Phase 1.1 / Track B-OPDB#30
jkeeley2073 merged 2 commits into
mainfrom
Dev-OpdbIntegration

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Phase 1.1 / Track B-OPDB of the parallel execution plan. First new "scraper" on the Clean Architecture layout — and OPDB is API-based (not site-scraping), so it is the lowest-risk first integration. Validates the layout cleanly accommodates non-Stern sources before per-manufacturer scrapers start landing.

⚠️ Stacked on PR #29 (Gate 2)

This branch is stacked on Dev-PoliteScraperBase so OpdbClient extends PoliteScraperBase. GitHub will show #29's diff in this PR until #29 merges. After #29 merges, this PR's diff cleans up to just the OPDB additions. Merge order: #29 first, then this.

What ships

PinballWizard.Core/Configuration/OpdbOptions.cs

Bound from "Opdb" config section. BaseUrl, bearer ApiToken (env var or user-secrets, never committed), PageSize, HttpTimeoutSeconds.

PinballWizard.Application/Sync/

  • IOpdbSyncService — application-layer contract
  • OpdbSyncResultFetched / Inserted / Updated / Skipped + Duration

PinballWizard.Infrastructure/Integrations/Opdb/

File What
OpdbMachineDto, OpdbManufacturerDto, OpdbPersonDto Wire DTOs with explicit [JsonPropertyName] snake_case mapping
OpdbMachineMapper Pure-function map / merge. Skips records that aren't is_machine OR have no OPDB ID OR have no manufacturer name. Manufacturer-key normalization for the 10 most-common pinball manufacturers; unknown manufacturers fall through to a sanitized lower-case key.
OpdbClient Typed HTTP client extending PoliteScraperBase so OPDB requests flow through the politeness gate. Bearer auth. StreamAllMachinesAsync pages until partial / empty page; GetMachineAsync returns null on 404, throws on other errors.
OpdbSyncService Idempotent fetch-then-upsert orchestration. Existing-machine path merges OPDB-sourced fields onto the existing record, preserving project-owned fields (ManufacturerSlugs, Editions, FirstSeenAt).
AddOpdbIntegration DI extension Binds OpdbOptions, registers typed HttpClient with the polite UA, registers IOpdbSyncService.

Test Plan

  • dotnet build /warnaserrorzero warnings
  • dotnet test185/185 passing (was 158 pre-PR; +27 new OPDB tests)
  • All 158 prior tests still pass
  • CI re-validates on this PR
  • Author = personal noreply

What the 27 new tests cover

  • OpdbMachineMapperTests (15): map happy-path + 4 null-result branches + 11 manufacturer-key parameterizations + invalid-date / year-only-date edge cases + merge-into-existing (refresh title / year / designers / themes / LastSeenAt; preserve FirstSeenAt, ManufacturerSlugs, Editions)
  • OpdbClientTests (5): paging across multiple stub-handler responses, bearer-token header applied, GetMachineAsync 404 returns null / 200 returns mapped DTO / blank ID throws
  • OpdbSyncServiceTests (3): new-machine path increments Inserted; existing-machine path increments Updated and the upserted record has the merged title + fresh LastSeenAt; non-machine record increments Skipped

Out of Scope (intentional follow-ups)

  • CLI wiring (--source opdb in Program.cs) — Cosmos isn't deployed yet, so there's nowhere for the sync to write. Lands as a tiny follow-up when the Bicep deploy from PR #27 actually runs and Cosmos is reachable.
  • Bicep ACA Job (scraper-opdb-sync, daily 02:00 UTC) — separate IaC PR; per-environment Bicep work in general isn't ready yet.
  • Cosmos container declarations — the Gate 1 CosmosBootstrapper currently provisions containers on first use; declaring containers in Bicep is a separate cleanup option.
  • catalog.jsonMachine migration — existing GameRecord shape stays; an OPDB-id lookup pass that backfills GameRecordMachine.ManufacturerSlugs is its own concern.

What this unblocks

  • Phase 1.2 manufacturer scrapers (JJP, American Pinball, Spooky) can now reference OPDB IDs as the canonical machine key. Each manufacturer scraper writes its own Machine.ManufacturerSlugs[mfg] entry pointing back to its source-specific slug.
  • Cosmos schema validation — this PR is the first concrete consumer of IMachineRepository. If anything's wrong with the Gate 1 schema design, this is where it surfaces.

Suggested merge order

  1. feat(scraping) PoliteScraper base + politeness gate (Gate 2) #29 (Gate 2 — PoliteScraper base) — must merge first; this PR depends on it
  2. feat(opdb) OPDB integration -- Phase 1.1 / Track B-OPDB #30 (this PR — OPDB integration)
  3. Then start one of: JJP / AP / Spooky scrapers as Phase 1.2 PRs

Gate 2 of the parallel execution plan: codifies the polite-scraping
invariants (per-origin throttle, robots.txt respect, 429 backoff /
abort, descriptive User-Agent, shared Playwright BrowserContext) into
a reusable foundation so every current and future manufacturer
scraper extends it instead of re-implementing politeness per source.
Per the locked feedback memory feedback_polite_scraping.md, the
politeness must be VISIBLY enforced in code -- extending the base is
the visible enforcement.

Foundation (PinballWizard.Core/Configuration + Infrastructure/Scraping/Polite):

- PolitenessOptions    Bound from "Politeness" config section. UA,
                       request delay floor, max 429 streak, robots
                       enable/path/TTL.
- IPolitenessGate +    Per-origin throttle via per-origin
  PolitenessGate       SemaphoreSlim, per-origin minimum delay
                       between requests, process-wide 429 streak with
                       abort-on-threshold (PolitenessException),
                       IAsyncDisposable lease pattern. Singleton -
                       per-origin throttle state shared across all
                       scrapers in the process so two concurrent
                       scrapers against the same origin don't double-
                       pace.
- RobotsTxtCache +     Per-host parsed rules cached on first fetch
  RobotsTxtParser      with TTL refresh. Permissive fallback on 404 /
                       network failure (don't block legitimate work).
                       Longest-match Allow / Disallow rules,
                       agent-specific blocks beat wildcard. Supports
                       Crawl-delay and Sitemap directives, * and $
                       patterns.
- PolitenessException  Thrown when robots.txt disallows a URL OR when
                       the 429 streak exceeds the configured maximum.
                       Distinct from transient errors -- means we have
                       decided NOT to keep asking. Orchestrator catches
                       at source boundary.
- PoliteScraperBase    Abstract base for HTTP-driven scrapers. Helper
                       SendPolitelyAsync / GetStringPolitelyAsync route
                       through the gate.
- PolitePlaywrightScraperBase
                       Abstract base for Playwright-driven scrapers.
                       Owns a SHARED IBrowserContext per scraper
                       instance (replaces per-page NewContextAsync
                       waste). NewPolitePageAsync returns a PolitePage
                       that disposes the page AND releases the lease.
- AddPoliteScraping    DI extension: binds PolitenessOptions, adds
                       typed HttpClient for RobotsTxtCache, registers
                       cache + gate as singletons.

Refactored scrapers (behavior unchanged, all 135 prior tests pass):

- ManualsScraper             extends PoliteScraperBase
- GameListingScraper         extends PolitePlaywrightScraperBase
- GamePageScraper            extends PolitePlaywrightScraperBase
- ServiceBulletinScraper     extends PolitePlaywrightScraperBase

Inline Task.Delay(_settings.PageLoadDelayMs) calls removed -- the gate
handles per-origin pacing on the next NewPolitePageAsync acquire.

User-Agent change:

- Before: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/131.0.0.0
- After:  PinballWizard/0.1 (+https://github.com/Early-Bird-Solutions-LLC/PinballWizard;
          polite-scraper)

Identifying ourselves to source-site operators is the polite-scraping
ethos: they should be able to see who we are and contact us via the
linked-back GitHub repo. Mimicking a Chrome UA was the previous
convention but conflicts with that ethos and is replaced.

Tests (23 new, 158 total passing):

- RobotsTxtParserTests   Parser correctness across edge cases (empty
                         file, agent-specific blocks, longest-match,
                         wildcards, Crawl-delay, Sitemap, comments)
- RobotsTxtCacheTests    Fetch + cache + 404 fallback + network-failure
                         fallback + per-host caching with stub
                         HttpMessageHandler
- PolitenessGateTests    Per-origin delay enforcement, different
                         origins parallel, concurrent same-origin
                         serialization, robots.txt integration, 429
                         streak abort, 200 streak reset, non-429 4xx
                         doesn't change streak, null-arg validation

Out of scope (intentional follow-ups):

- Per-source PolitenessOverrides from the IngestionSource Cosmos
  document (the config plumbing for per-source overrides lands when
  Track B-OPDB or the first non-Stern scraper exercises it).
- Conditional-request handling on HTML page fetches (FileDownloader
  already does ETag/If-Modified-Since for downloaded files).
- Replacing every WaitForTimeoutAsync with WaitForSelectorAsync (the
  current heuristic timeouts work but are improvable; tracked as a
  separate cleanup PR).
- Reusing one IBrowserContext across Game Listing + Game Page
  scrapers (currently each scraper instance owns its own context).
First new "scraper" on the Clean Architecture layout (per the parallel
execution plan). OPDB is API-based, not site-scraping, so it is the
lowest-risk first integration -- validates the layout cleanly
accommodates non-Stern sources before per-manufacturer scrapers start
landing.

Stacked on Dev-PoliteScraperBase (PR #29) so OpdbClient extends
PoliteScraperBase. After #29 merges this PR's diff cleans up to just
the OPDB additions.

Code:

PinballWizard.Core/Configuration/OpdbOptions.cs
  Bound from "Opdb" config section. BaseUrl (defaults to
  https://opdb.org/api/), bearer ApiToken (env var or user-secrets,
  never committed), PageSize, HttpTimeoutSeconds.

PinballWizard.Application/Sync/IOpdbSyncService.cs + OpdbSyncResult.cs
  Application-layer contract + result record (Fetched / Inserted /
  Updated / Skipped + elapsed Duration).

PinballWizard.Infrastructure/Integrations/Opdb/
  - OpdbMachineDto + OpdbManufacturerDto + OpdbPersonDto -- wire
    DTOs with explicit [JsonPropertyName] snake_case mapping.
  - OpdbMachineMapper -- pure-function map / merge logic. Skips
    records that are not is_machine OR have no opdb_id OR have no
    manufacturer name. Manufacturer-key normalization for the 10
    most-common pinball manufacturers (stern, jjp, americanpinball,
    spooky, multimorphic, cgc, haggis, pinballbrothers, dutch,
    barrelsoffun); unknown manufacturers fall through to a sanitized
    lower-case key.
  - OpdbClient -- typed HTTP client extending PoliteScraperBase so
    requests to opdb.org flow through the politeness gate. Bearer
    auth from OpdbOptions.ApiToken. StreamAllMachinesAsync pages
    until the API returns a partial-page response or empty page;
    GetMachineAsync returns null on 404, throws on other errors.
  - OpdbSyncService -- idempotent fetch-then-upsert orchestration.
    For each fetched DTO: map; if mapper returns null, skip; else
    look up existing machine by OPDB ID + manufacturer partition; if
    absent, insert; if present, merge OPDB-sourced fields onto the
    existing record (preserving project-owned fields:
    ManufacturerSlugs, Editions, FirstSeenAt) and upsert. Logs
    progress every 100 records.
  - AddOpdbIntegration DI extension -- binds OpdbOptions, registers
    typed HttpClient with the polite UA, registers IOpdbSyncService.

Tests (27 new, 185 total passing):

- OpdbMachineMapperTests
  Map: 4 happy-path / null-result branches + 11 manufacturer-key
  parameterizations + invalid-date / year-only-date edge cases.
  MergeOpdbFieldsInto: refreshes title / year / designers / themes /
  LastSeenAt; preserves FirstSeenAt + ManufacturerSlugs + Editions.

- OpdbClientTests
  Paging across multiple stub-handler responses; bearer-token header
  applied; GetMachineAsync 404 returns null, 200 returns mapped DTO,
  blank ID throws.

- OpdbSyncServiceTests
  New-machine path increments Inserted; existing-machine path
  increments Updated and the upserted record has the merged title +
  fresh LastSeenAt; non-machine record increments Skipped.

Out of scope (intentional follow-ups):

- CLI wiring (--source opdb in Program.cs) -- Cosmos isn't deployed
  yet, so there's nowhere for the sync to write. Lands as a tiny
  follow-up when the Bicep deploy from PR #27 actually runs and
  Cosmos is reachable.
- Bicep ACA Job (scraper-opdb-sync, daily 02:00 UTC) -- separate IaC
  PR; per-environment Bicep work in general isn't ready yet.
- Cosmos container declarations -- the Gate 1 CosmosBootstrapper
  currently provisions containers on first use; declaring containers
  in Bicep is a separate cleanup option.
- catalog.json -> Machine migration -- existing GameRecord shape
  stays; an OPDB-id lookup pass that backfills GameRecord ->
  Machine.ManufacturerSlugs is its own concern.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 2, 2026
@jkeeley2073 jkeeley2073 merged commit 90a0a34 into main May 2, 2026
3 checks passed
Comment thread tests/PinballWizard.Scraper.Tests/Scraping/Polite/RobotsTxtCacheTests.cs Dismissed
Comment thread tests/PinballWizard.Scraper.Tests/Scraping/Polite/RobotsTxtCacheTests.cs Dismissed
Comment thread tests/PinballWizard.Scraper.Tests/Scraping/Polite/PolitenessGateTests.cs Dismissed
Comment thread tests/PinballWizard.Scraper.Tests/Scraping/Polite/PolitenessGateTests.cs Dismissed
{
var gate = CreateGate();

await using (var first = await gate.AcquireForRequestAsync(new Uri("https://a.example.com/x"), CancellationToken.None))
await _contextInitLock.WaitAsync().ConfigureAwait(false);
try
{
if (_context is not null)
ArgumentNullException.ThrowIfNull(existing);
ArgumentNullException.ThrowIfNull(dto);

if (dto.Manufacturer?.Name is { } mfgName) existing.ManufacturerDisplayName.GetType(); // no-op: ManufacturerDisplayName is init-only and rarely changes
Comment on lines +224 to +289
foreach (var rawLine in body.Split('\n'))
{
var line = rawLine;
var commentStart = line.IndexOf('#', StringComparison.Ordinal);
if (commentStart >= 0)
{
line = line[..commentStart];
}
line = line.Trim();
if (line.Length == 0)
{
continue;
}

var colonIndex = line.IndexOf(':', StringComparison.Ordinal);
if (colonIndex <= 0)
{
continue;
}

var directive = line[..colonIndex].Trim();
var value = line[(colonIndex + 1)..].Trim();

switch (directive.ToLowerInvariant())
{
case "user-agent":
if (!agentRules.TryGetValue(value, out currentBlock))
{
currentBlock = new RobotsTxtRules.AgentRules();
agentRules[value] = currentBlock;
}
break;

case "allow":
currentBlock?.Rules.Add(new RobotsTxtRules.Rule(value, Allow: true));
break;

case "disallow":
if (value.Length > 0)
{
currentBlock?.Rules.Add(new RobotsTxtRules.Rule(value, Allow: false));
}
break;

case "crawl-delay":
if (double.TryParse(value, System.Globalization.CultureInfo.InvariantCulture, out var delay))
{
if (currentBlock is not null)
{
currentBlock.CrawlDelay = delay;
}
else
{
globalCrawlDelay = delay;
}
}
break;

case "sitemap":
if (value.Length > 0)
{
sitemaps.Add(value);
}
break;
}
}
Comment on lines +159 to +166
foreach (var rule in agentBlock.Rules)
{
if (PatternMatches(rule.Pattern, path) &&
(best is null || rule.Pattern.Length > best.Pattern.Length))
{
best = rule;
}
}
Comment on lines +151 to +154
catch (Exception ex)
{
Logger.LogDebug(ex, "Suppressed error disposing Playwright BrowserContext.");
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants