Skip to content

test(infra) family-wide scraper-pipeline test infrastructure#41

Merged
jkeeley2073 merged 1 commit into
mainfrom
Dev-ScraperTestInfra
May 3, 2026
Merged

test(infra) family-wide scraper-pipeline test infrastructure#41
jkeeley2073 merged 1 commit into
mainfrom
Dev-ScraperTestInfra

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Closes the recurring ⚠️ finding from /local-review runs across PRs #38 / #39 / #40: "no scraper-pipeline integration test asserts yield order, provenance propagation, per-page failure isolation, or the polite-gate routing invariants."

Two shared test fakes plus a proof-of-concept against CgcGamePageScraper. The fakes are reusable across all 8 scrapers; backfilling each one is intentionally deferred to follow-up PRs (so each can land independently as scrapers are touched). What this PR establishes is the template.

What's added

  • FakePolitenessGate — implements IPolitenessGate, records every Acquire/Report/lease-dispose so tests can assert the polite-scraping invariants are honored end-to-end. ThrowOnAcquire / ThrowOnReport for testing abort paths.
  • QueueingHttpMessageHandler — maps absolute URLs to pre-canned responses. Throws UnexpectedRequestException (with the mapped-URL list embedded in the message) on unmapped requests so a regression that fetches the wrong URL fails loudly instead of silently 404-ing.
  • CgcGamePageScraperTests (proof-of-concept) — 5 tests covering:
    • Happy path: yield order, full provenance assertions, URL-equality between gate and wire
    • Per-page failure isolation with politeness invariants still holding under failure
    • Discovery failure aborts the source only (yield-break, no throw)
    • PolitenessException propagation on the Acquire path with assertions that zero requests fired
    • PolitenessException propagation on the Report path

Picked CGC for the proof-of-concept because it exercises BOTH .Game and .Link yields — the template covers the dual-yield case, so backfills to single-yield scrapers (PB, JJP, BoF, Multimorphic, Spooky) will be a strict subset.

Pre-push self-audit (per PR #34 + PR #36)

Step 0 — /local-review

0 🔴 / 5 ⚠️ / 4 ✅ — and all 5 ⚠️ fixed in this PR rather than deferred. The reviewer made a strong asymmetric-cost argument: this PR is the template for ~7 future backfill PRs, so template gaps multiply. Fixing once now is much cheaper than fixing 7 times later.

# Finding Action
1 URL-equality assertion between handler.Requests and gate.Acquired/Reported missing — without it a future re-canonicalisation refactor could silently throttle a different origin Fixed: added Assert.Equal(...) on URL sequences in the happy-path test
2 Per-page failure test threw away gate + handler — politeness invariants under failure not pinned Fixed: added invariant assertions on the failure path; the 500 response must still be reported back so the 429-streak detector sees real failures
3 PolitenessException test asserted propagation but not "zero requests fired" — a regression that swallowed the throw and continued would still pass for the first call Fixed: added Assert.Empty(handler.Requests) and Assert.Empty(gate.Reported)
4 Provenance fields (Source.ScrapedFrom, ScrapedAt, DiscoveredOn) not asserted — these are what survive into catalog.json and Phase 2 RAG citations; provenance is "sacred" per CLAUDE.md Fixed: added explicit assertions on firstGame.Source!.ScrapedFrom etc.
5 ThrowOnReport was unused API surface ("I added a hook, never tested it") Fixed: added a 5th test pinning the Report-side throw path

Step 1 — Mechanical checklist

  • No new *Options properties (N/A)
  • No scraper changes (sibling-diff N/A)
  • No bare catch { } in tests/PinballWizard.Scraper.Tests/Scraping/_TestInfra/
  • SourceAliasContractTests still passes
  • Tests assert behavior — every test pins a specific invariant the unit-test surface couldn't reach
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' — personal noreply

Tests

378 / 378 passing (was 373). +5:

  • CgcGamePageScraperTests (5)

Folder convention

Established new convention: tests/.../Scraping/_TestInfra/ for shared test fakes. Leading underscore sorts to top of directory listings — discoverability for the next test author. (Reviewer noted this is a new convention; one-line README in a follow-up could document it formally if it sticks.)

Out of scope

  • Backfill across the other 7 scrapers — separate PRs each, can land independently. Each backfill should copy the 5-test pattern from CgcGamePageScraperTests. Single-yield scrapers (PB / JJP / BoF / Multimorphic / Spooky) drop the Link-yield assertions; multi-yield scrapers (AP / Stern's GamePageScraper) keep them.
  • README in _TestInfra/ — defer until pattern proves out across 2-3 backfills.

🤖 Generated with Claude Code

Closes the recurring ⚠️ finding from /local-review runs across PRs
#38 / #39 / #40: no scraper-pipeline integration test asserts yield
order, provenance propagation, per-page failure isolation, or the
polite-gate routing invariants. Adds two shared fakes plus a
proof-of-concept against CgcGamePageScraper.

  * FakePolitenessGate: records every Acquire/Report and lease
    dispose so tests can assert the polite-scraping invariants
    end-to-end. Includes ThrowOnAcquire / ThrowOnReport for the
    abort paths.
  * QueueingHttpMessageHandler: pre-canned URL→response map; throws
    UnexpectedRequestException with the mapped-URL list on misses,
    so a regression fetching the wrong URL fails loudly with a
    diff-friendly error message instead of silently 404-ing.
  * CgcGamePageScraperTests (proof-of-concept): 5 tests covering
    happy path (yield order + provenance + URL-equality between
    gate and wire), per-page failure isolation with politeness
    invariants still holding under failure, discovery-failure
    yield-break, PolitenessException propagation on both Acquire
    and Report paths.

Picked CGC for the POC because it exercises BOTH .Game and .Link
yields — the template covers the dual-yield case so backfills to
single-yield scrapers will be a strict subset.

Pre-push self-audit:
  * /local-review (Step 0): 0 🔴 / 5 ⚠️ -- all five fixed in this
    PR rather than deferred, because the proof-of-concept becomes
    the template for ~7 future backfill PRs and template gaps
    multiply. Reviewer's exact recommendations applied:
      1. URL-equality assertion between handler.Requests and
         gate.Acquired/Reported in the happy path
      2. Politeness invariants pinned under per-page failure path
      3. PolitenessException test asserts zero requests + zero
         reports on the abort
      4. Provenance assertions on Source.ScrapedFrom / ScrapedAt /
         DiscoveredOn
      5. New test for ThrowOnReport (was unused API surface)
  * 7-item mechanical checklist (Step 1): all pass

Tests: 373 -> 378 (+5). Build: zero warnings.

Backfill across the other 7 scrapers (Stern, JJP, AP, Spooky, PB,
BoF, Multimorphic) is intentionally deferred -- follow-up PRs can
land each independently using this template. The audit-flagged
invariants are now codified once and reusable.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 3, 2026
@jkeeley2073 jkeeley2073 merged commit 1155940 into main May 3, 2026
4 checks passed
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-page
failure isolation, discovery failure, PolitenessException on both Acquire
and Report. Single-yield (.Game only) WooCommerce storefront via
/product-category/machines/.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-page
failure isolation, discovery failure, PolitenessException on both Acquire
and Report. Multi-yield (.Game + .Link) scraper so tests pin both kinds
plus .Link.GameSlug lineage to parent game.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-page
failure isolation, discovery failure, PolitenessException on both Acquire
and Report. Single-yield (.Game only) Shopify storefront via
/collections/{slug}/products.json.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-page
failure isolation, sitemap discovery failure, PolitenessException on
both Acquire and Report. Single-yield (.Game only) WordPress storefront
via /wp-sitemap.xml walk filtered to /store/p3-game-kits/multimorphic-game-kits/.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-page
failure isolation, discovery failure, PolitenessException on both Acquire
and Report. Multi-yield (.Game + .Link) scraper so tests pin both kinds
plus .Link.GameSlug lineage to parent game.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Backfill of the PR #41 template across the family. 5 tests covering
happy-path with full provenance + gate-vs-wire URL equality, per-link
extraction failure isolation, discovery failure, PolitenessException on
both Acquire and Report. Single-yield-link scraper (manuals are
documents, not games) so the tests assert .Link yield order only with
no .Game items.

Pre-push audit: 7-item mechanical (all pass). /local-review deferred to
human reviewer at merge time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant