
feat(multimorphic) Multimorphic scraper — Phase 1.3.c #39

Merged: jkeeley2073 merged 1 commit into main from Dev-MultimorphicScraper on May 3, 2026

@jkeeley2073
Contributor

Summary

Seventh manufacturer scraper, third to consume JSON-LD schema.org/Product (after JJP and BoF). Multimorphic is WordPress + WooCommerce; discovery walks the WP sitemap index (/wp-sitemap.xml) → product sub-sitemaps and filters URLs to /store/p3-game-kits/multimorphic-game-kits/{slug}/ only — the 16 Multimorphic-published P3 game kits.

Why exclude third-party kits

Multimorphic's storefront sells both their own game kits AND third-party kits (Drained, Princess Bride, Portal, Silver Falls, Flipper Foxtrot, etc.) for the P3 platform. Per ADR 0011, OPDB owns the catalog spine: it attributes those games to their originating studios, not Multimorphic. Running third-party kits through the reconciler with manufacturer = multimorphic would land them in the wrong Cosmos partition.

The path-prefix filter (/store/p3-game-kits/multimorphic-game-kits/ exclusively) draws the line cleanly. Code-level rationale lives in MultimorphicOptions.cs and MultimorphicProductScraper.cs, not just this PR description.
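
A minimal, self-contained sketch of the path-prefix rule described above, modeled on the sitemap filter excerpt quoted in the review comments. The class name `KitPathFilterSketch`, the method name `IsFirstPartyKitUrl`, and the `example.com` host used in the usage note are hypothetical; the real logic lives in MultimorphicSitemapClient and may differ in detail.

```csharp
using System;

public static class KitPathFilterSketch
{
    // Prefix taken from the PR description.
    public const string DefaultPrefix = "/store/p3-game-kits/multimorphic-game-kits/";

    public static bool IsFirstPartyKitUrl(string url, string prefix = DefaultPrefix)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return false;

        // Assumption: only http(s) URLs are valid product pages. This also guards
        // against rooted paths being parsed as file:// URIs on Unix-like systems.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        // Enforce a trailing slash so a sibling category whose name merely
        // starts with the same text cannot match by accident.
        var normalizedPrefix = prefix.EndsWith('/') ? prefix : prefix + "/";

        var path = uri.AbsolutePath;
        if (!path.StartsWith(normalizedPrefix, StringComparison.OrdinalIgnoreCase)) return false;

        // Exactly one path segment (the slug) may follow the prefix:
        // the category root and any sub-pages are rejected.
        var afterPrefix = path[normalizedPrefix.Length..].TrimEnd('/');
        return afterPrefix.Length > 0 && !afterPrefix.Contains('/');
    }
}
```

Under these assumptions, `https://example.com/store/p3-game-kits/multimorphic-game-kits/lexy-lightspeed-escape-from-earth/` passes, while the category root, any sub-page beneath a slug, and any URL under a different category path are rejected.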

JSON-LD shape — third permutation handled

| Storefront | offers | priceSpecification | Availability protocol |
| --- | --- | --- | --- |
| JJP (Shopify) | flat price | absent | https://schema.org/... |
| BoF (WooCommerce) | absent (nested only) | array | https://schema.org/... |
| Multimorphic | flat price AND nested | object (not array) | http://schema.org/... |

MultimorphicProductExtractor handles every combination — flat-only / nested-only / both, object-or-array priceSpecification, http-or-https availability, plus @graph wrapping.
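
A hedged sketch of how the flat / nested price permutations in the table might be resolved with System.Text.Json. The flat-price-first order follows the `offer.TryGetProperty("price", ...)` excerpt quoted in the review comments; the nested fallback, the class name `OfferPriceSketch`, and the helper `Format` are assumptions for illustration, not the extractor's actual code.

```csharp
#nullable enable
using System;
using System.Text.Json;

public static class OfferPriceSketch
{
    // Returns the price as a string, or null when no usable price is present.
    public static string? ResolvePrice(JsonElement offer)
    {
        if (offer.ValueKind != JsonValueKind.Object) return null;

        // 1. Flat offers[].price (JJP and Multimorphic shapes).
        if (offer.TryGetProperty("price", out var flat) && Format(flat) is { } direct)
        {
            return direct;
        }

        // 2. Nested priceSpecification: object (Multimorphic) or array (BoF).
        if (!offer.TryGetProperty("priceSpecification", out var spec)) return null;

        if (spec.ValueKind == JsonValueKind.Object)
        {
            return spec.TryGetProperty("price", out var single) ? Format(single) : null;
        }

        if (spec.ValueKind == JsonValueKind.Array)
        {
            foreach (var entry in spec.EnumerateArray())
            {
                if (entry.ValueKind == JsonValueKind.Object
                    && entry.TryGetProperty("price", out var candidate)
                    && Format(candidate) is { } nested)
                {
                    return nested;
                }
            }
        }

        return null;
    }

    // Accepts string or numeric JSON values; anything else is treated as absent.
    private static string? Format(JsonElement value) => value.ValueKind switch
    {
        JsonValueKind.String when !string.IsNullOrWhiteSpace(value.GetString()) => value.GetString()!.Trim(),
        JsonValueKind.Number => value.GetRawText(),
        _ => null,
    };
}
```

Checking `ValueKind` before each `TryGetProperty` call matters because `JsonElement.TryGetProperty` throws `InvalidOperationException` on non-object elements, which a malformed storefront payload could otherwise trigger.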

Phase 1.3 status

| Mfr | Status |
| --- | --- |
| Pinball Brothers (#37) | ✅ shipped |
| Barrels of Fun (#38) | ✅ shipped |
| Multimorphic | ✅ this PR |
| Chicago Gaming | next (custom CMS, AP-style template) |
| Dutch Pinball | 🚫 deferred with prejudice: robots.txt says Disallow: / (complete crawl ban). We honor robots.txt; user explicitly confirmed the policy. Would require polite outreach to dutchpinball.com for explicit permission before any code. |
| Haggis | 🚫 deferred (infrastructure): domain haggispinball.com.au resolves to a real Australian IP but the web server is unreachable (connection timeouts on all ports). Different blocker than Dutch Pinball; retry in 1-2 weeks. |

Pre-push self-audit (per PR #34 + PR #36)

Step 0 — /local-review (qualitative)

Local review: 0 🔴 / 1 ⚠️ / 9 categories ✅ — same clean result as BoF.

| # | Severity | Finding | Action |
| --- | --- | --- | --- |
| 1 | ⚠️ | No scraper-pipeline integration test asserting ScrapedItem.DiscoveryUrl / DiscoveryContext / Source.ScrapedFrom propagation | Deferred: family-wide gap. None of the 7 scrapers have this; closing it requires shared IPolitenessGate + HttpMessageHandler mocking infrastructure that doesn't exist yet. Better as a focused cross-cutting test-infra PR than a one-off addition here. |

Step 1 — Mechanical checklist

  • Every MultimorphicOptions property read by code (verified by grep — BaseUrl / SitemapPath / MultimorphicGameKitsPathPrefix all hit, no dead config)
  • Sibling-diff: MultimorphicSitemapClient mirrors JjpSitemapClient (index walk); MultimorphicProductExtractor mirrors BofProductExtractor (JSON-LD parser); MultimorphicProductScraper mirrors BofProductScraper (TryExtractAsync, log shapes including the manufacturer name per the PR #37 finding)
  • No bare catch { } in src/PinballWizard.Infrastructure/Scraping/Multimorphic/
  • SourceAliasContractTests still passes: Name = "Multimorphic" matches alias-map value
  • Tests assert behavior — sitemap fixture mixes Multimorphic kits + 3rd-party kits + circuit boards + accessories + apparel + the platform itself; only the 3 multimorphic-game-kits URLs pass
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' — personal noreply

Tests

375 / 375 passing (was 347). +28:

  • MultimorphicSitemapClientTests (8) — index parse rejects non-product sub-sitemaps; paginated sub-sitemaps walked; path-prefix filter rejects 3rd-party kits / circuit boards / accessories / apparel / the P3 platform; sub-page rejection; prefix-without-trailing-slash; null/blank-arg validation
  • MultimorphicProductExtractorTests (12) — real Multimorphic shape (flat + nested + http schema.org); nested-only path; @graph wrap; og:title fallback; no-title → null; malformed JSON-LD fall-through; slug extraction (positive + negative landmark multimorphic-game-kits); availability normalization across http / https / bare token; null-arg validation
  • ScraperManufacturerKeyTests (+1 row): game_multimorphic_lexy-lightspeed-escape-from-earth → multimorphic
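
The availability normalization covered by MultimorphicProductExtractorTests (http / https / bare token) could look roughly like the sketch below. The `outofstock`, `preorder`, and `discontinued` arms and the lowercase fallback come from the switch excerpt quoted in the review comments; the `instock` mapping, the class name `AvailabilitySketch`, and the last-segment splitting are assumptions.

```csharp
#nullable enable
using System;

public static class AvailabilitySketch
{
    public static string? Normalize(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;

        // "http://schema.org/OutOfStock", "https://schema.org/OutOfStock" and the
        // bare token "OutOfStock" all reduce to the same last segment.
        var trimmed = raw.Trim().TrimEnd('/');
        var lastSegment = trimmed[(trimmed.LastIndexOf('/') + 1)..];
        if (lastSegment.Length == 0) return null;

        return lastSegment.ToLowerInvariant() switch
        {
            "instock" => "in_stock",            // assumed mapping, not shown in the PR excerpt
            "outofstock" => "out_of_stock",
            "preorder" => "preorder",
            "discontinued" => "discontinued",
            var other => other,                 // unknown tokens pass through lowercased
        };
    }
}
```

Keying on the last path segment rather than the full URL is what makes the http-vs-https difference between storefronts irrelevant to the mapping.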

Out of scope

  • Scraper-pipeline integration tests (deferred per Step 0 review — family-wide cleanup)
  • Chicago Gaming scraper (separate PR)
  • Refactor JSON-LD parser to shared helper (3-storefront threshold reached this PR; planned for next cleanup pass)

🤖 Generated with Claude Code

jkeeley2073 added the claude-code (Generated with Claude Code) label on May 3, 2026
"outofstock" => "out_of_stock",
"preorder" => "preorder",
"discontinued" => "discontinued",
_ => lastSegment?.ToLowerInvariant(),
Comment on lines +133 to +167
    foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']"))
    {
        var text = script.TextContent;
        if (string.IsNullOrWhiteSpace(text)) continue;

        JsonElement root;
        try
        {
            using var parsed = JsonDocument.Parse(text);
            root = parsed.RootElement.Clone();
        }
        catch (JsonException)
        {
            continue;
        }

        if (root.ValueKind == JsonValueKind.Object && root.TryGetProperty("@graph", out var graph) && graph.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in graph.EnumerateArray())
            {
                if (TryReadProduct(item) is { } prod) return prod;
            }
        }
        else if (root.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in root.EnumerateArray())
            {
                if (TryReadProduct(item) is { } prod) return prod;
            }
        }
        else
        {
            if (TryReadProduct(root) is { } prod) return prod;
        }
    }
Comment on lines +151 to +154
    foreach (var item in graph.EnumerateArray())
    {
        if (TryReadProduct(item) is { } prod) return prod;
    }
Comment on lines +158 to +161
    foreach (var item in root.EnumerateArray())
    {
        if (TryReadProduct(item) is { } prod) return prod;
    }
Comment on lines +206 to +212
    foreach (var item in imageProp.EnumerateArray())
    {
        if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
        {
            images.Add(s);
        }
    }
Comment on lines +240 to +243
    if (offer.TryGetProperty("price", out var direct))
    {
        if (FormatPrice(direct) is { } flat) return flat;
    }
Comment on lines +99 to +104
    catch (Exception ex)
    {
        Logger.LogWarning(
            ex, "Multimorphic scraper: failed to fetch / extract {Url}; skipping.", productUrl);
        return null;
    }
Comment on lines +91 to +100
    foreach (var sitemap in doc.Descendants(SitemapNs + "sitemap"))
    {
        var loc = sitemap.Element(SitemapNs + "loc")?.Value;
        if (string.IsNullOrWhiteSpace(loc)) continue;
        if (loc.Contains("wp-sitemap-posts-product", StringComparison.OrdinalIgnoreCase)
            && Uri.TryCreate(loc, UriKind.Absolute, out var uri))
        {
            sitemaps.Add(uri);
        }
    }
Comment on lines +121 to +135
    foreach (var url in doc.Descendants(SitemapNs + "url"))
    {
        var loc = url.Element(SitemapNs + "loc")?.Value;
        if (string.IsNullOrWhiteSpace(loc)) continue;
        if (!Uri.TryCreate(loc, UriKind.Absolute, out var uri)) continue;

        var path = uri.AbsolutePath;
        if (!path.StartsWith(normalizedPrefix, StringComparison.OrdinalIgnoreCase)) continue;

        var afterPrefix = path[normalizedPrefix.Length..].TrimEnd('/');
        if (afterPrefix.Length == 0) continue;
        if (afterPrefix.Contains('/', StringComparison.Ordinal)) continue;

        urls.Add(uri);
    }
Seventh manufacturer scraper, third using JSON-LD product schema
(after JJP and BoF). WordPress + WooCommerce; discovery walks the
WP sitemap index and filters URLs to /store/p3-game-kits/multimorphic-game-kits/{slug}/
only -- Multimorphic-published P3 game kits, not the 13 third-party
kits sold through the same storefront. Third-party kits belong to
their originating studios per OPDB attribution; running them through
the reconciler with manufacturer=multimorphic would land them in the
wrong Cosmos partition (see ADR 0011).

Multimorphic JSON-LD ships BOTH flat offers[].price AND nested
offers[].priceSpecification (object, not array -- distinct from BoF),
and uses http://schema.org/... not https:// for availability.
MultimorphicProductExtractor handles every combination + @graph
wrapping -- the same code would work against any well-formed
WooCommerce-on-WordPress storefront.

ScraperManufacturerKey adds Multimorphic = "multimorphic" matching
OpdbMachineMapper.NormalizeManufacturerKey exactly so reconciled
records land in the correct Cosmos partition.

Pre-push self-audit:
  * /local-review (Step 0): 0 🔴 / 1 ⚠️ -- family-wide test-infra
    gap (no scraper-pipeline integration test exists for any of the
    7 scrapers; would need shared IPolitenessGate + HttpMessageHandler
    mocking infra to close across all of them); deferred to a focused
    cross-cutting follow-up
  * 7-item mechanical checklist (Step 1): all pass

Tests: 347 -> 375 (+28). Build: zero warnings. CLI: --source multimorphic.

Phase 1.3 status after this PR:
  * Pinball Brothers (#37) -- shipped
  * Barrels of Fun (#38) -- shipped
  * Multimorphic -- this PR
  * Chicago Gaming -- next candidate (custom CMS, AP-style template)
  * Dutch Pinball -- DEFERRED with prejudice (robots.txt Disallow: /)
  * Haggis -- DEFERRED, infrastructure outage (haggispinball.com.au
    web server unreachable; recommend retry in 1-2 weeks)
jkeeley2073 force-pushed the Dev-MultimorphicScraper branch from 5cdf85e to 0f3dcea on May 3, 2026 11:11
jkeeley2073 merged commit ec8296b into main on May 3, 2026
5 checks passed
jkeeley2073 deleted the Dev-MultimorphicScraper branch on May 3, 2026 11:14