refactor(scrapers) consolidate JSON-LD parsing into shared helper #42

Merged: jkeeley2073 merged 1 commit into main from Dev-JsonLdParserRefactor on May 3, 2026

Conversation

@jkeeley2073
Contributor

Summary

Pure refactor — net -300 lines. JJP and BoF shipped near-identical 100-line copies of the JSON-LD walker; the same pattern would ship a third time when PR #39 (Multimorphic) merges. Three storefronts is the threshold called out in PR #38's review and PR #39's CHANGELOG note — extracting now keeps the next storefront PR cheap.

What's new

  • PinballWizard.Infrastructure/Scraping/JsonLd/ — new namespace
    • JsonLdProductParser (static; FindFirstProduct entry point, ReadProduct exposed for tests)
    • JsonLdProduct + JsonLdOffer (storefront-agnostic DTOs)

The parser handles every shape across JJP / BoF / Multimorphic:

| Aspect | Shapes supported |
| --- | --- |
| Container | bare object / top-level array / @graph wrapper |
| Price | flat offers[].price (Shopify) AND nested offers[].priceSpecification (object or array; both WooCommerce dialects) |
| Image | string or array, with empty-string filter |
| @type | string or array-containing-"Product" |
| Malformed JSON-LD | block falls through to next sibling block |
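
The @type rule listed above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: the class TypeMatchSketch and both of its methods are hypothetical names, and the exact, case-sensitive comparison against "Product" is an assumption.

```csharp
using System.Text.Json;

// Hypothetical sketch, not the shipped JsonLdProductParser.
public static class TypeMatchSketch
{
    // True when @type is the string "Product" or an array containing "Product".
    // Assumption: the comparison is exact and case-sensitive.
    public static bool IsProduct(JsonElement node)
    {
        // TryGetProperty throws on non-objects, so guard the kind first.
        if (node.ValueKind != JsonValueKind.Object) return false;
        if (!node.TryGetProperty("@type", out var type)) return false;

        if (type.ValueKind == JsonValueKind.String)
            return type.GetString() == "Product";

        if (type.ValueKind == JsonValueKind.Array)
        {
            foreach (var entry in type.EnumerateArray())
            {
                if (entry.ValueKind == JsonValueKind.String && entry.GetString() == "Product")
                    return true;
            }
        }

        return false;
    }

    // Convenience wrapper for trying the rule on raw JSON text.
    public static bool IsProductJson(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return IsProduct(doc.RootElement);
    }
}
```

Under these assumptions, both `{"@type":"Product"}` and `{"@type":["Thing","Product"]}` match, while `{"@type":"Organization"}` and a node with no @type do not.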

What changed

  • JjpProductExtractor reduced ~270 → ~140 lines by delegating
  • BofProductExtractor reduced ~310 → ~140 lines by delegating

Each kept its manufacturer-specific surface: slug-segment landmark, GameId prefix, DiscoveredOn sentinel, OG/h1 fallbacks, Edition construction.

Behavior preservation

Every pre-existing test still passes without modification (378 → 402 with the new shared-parser tests). Diffed GameRecord construction blocks against origin/main in both extractors — line-for-line identical.

JJP gains @graph support as a strict superset (the shape doesn't appear on Shopify so no real-world impact, but the parser is now uniform).

Multimorphic follow-up

PR #39 adoption is a strict-subset change: delete the duplicated parser block, add the using, swap the method call. The shared parser already covers Multimorphic's simultaneous-flat-and-nested-price case — verified by JsonLdProductParserTests.FindFirstProduct_FlatAndNestedBothPresent_PrefersFlat.
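
The flat-before-nested preference can be illustrated with a small sketch. It is not the shared parser's implementation: PriceSketch, ResolvePrice, ReadSpec, and this FormatPrice are hypothetical stand-ins, and returning the price as a string (raw JSON text for numbers) is an assumption.

```csharp
#nullable enable
using System.Text.Json;

// Hypothetical sketch of "prefer flat price, fall back to priceSpecification".
public static class PriceSketch
{
    public static string? ResolvePrice(JsonElement offer)
    {
        if (offer.ValueKind != JsonValueKind.Object) return null;

        // Flat form (offers[].price) wins when it yields a usable value.
        if (offer.TryGetProperty("price", out var direct) && FormatPrice(direct) is { } flat)
            return flat;

        if (!offer.TryGetProperty("priceSpecification", out var spec)) return null;

        // Nested form may be a single object or an array of objects.
        if (spec.ValueKind == JsonValueKind.Object) return ReadSpec(spec);

        if (spec.ValueKind == JsonValueKind.Array)
        {
            foreach (var entry in spec.EnumerateArray())
            {
                if (ReadSpec(entry) is { } nested) return nested;
            }
        }

        return null;
    }

    private static string? ReadSpec(JsonElement spec) =>
        spec.ValueKind == JsonValueKind.Object && spec.TryGetProperty("price", out var p)
            ? FormatPrice(p)
            : null;

    // Assumption: non-empty strings pass through, numbers keep their raw JSON text.
    private static string? FormatPrice(JsonElement value) => value.ValueKind switch
    {
        JsonValueKind.String when value.GetString() is { Length: > 0 } s => s,
        JsonValueKind.Number => value.GetRawText(),
        _ => null,
    };

    // Convenience wrapper for trying the rule on raw JSON text.
    public static string? ResolvePriceJson(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return ResolvePrice(doc.RootElement);
    }
}
```

With this sketch, an offer carrying both `"price":"9999.00"` and a nested `priceSpecification` price of `"8888.00"` resolves to `9999.00`, and an offer with only a nested array entry resolves to that entry's price.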

Pre-push self-audit (per PR #34 + PR #36)

Step 0 — /local-review

Local review: 0 🔴 / 2 ⚠️ / 8 categories ✅

Both ⚠️ fixed:

| # | Finding | Action |
| --- | --- | --- |
| 1 | JsonLdOffers (plural type name) holds a single offer's fields, so it should be JsonLdOffer (singular). The Schema.org property name stays plural since offers can be an array. | Fixed: renamed JsonLdOffers → JsonLdOffer |
| 2 | Missing robustness tests: empty @graph array fall-through, empty-string image filter, graph-without-Product-falls-through-to-sibling-script | Fixed: added 3 tests |

Step 1 — Mechanical checklist

  • No new *Options properties (N/A)
  • Sibling-diff: JJP and BoF extractors are now near-twins; differences justified per-storefront (slug landmark, GameId prefix, DiscoveredOn sentinel)
  • No bare catch { } in src/PinballWizard.Infrastructure/Scraping/JsonLd/. The narrow catch (JsonException) for malformed-JSON-block fall-through is intentional and tested
  • SourceAliasContractTests still passes
  • Tests assert behavior — every shape we've seen in the wild is pinned with a fixture
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' — personal noreply

Tests

402 / 402 passing (was 378). +24:

  • JsonLdProductParserTests (24) — container shapes (7), type matching (3), price shapes (6), image shapes (3), ReadProduct direct (3), null arg (1), plus 3 added in response to the review

Out of scope

  • Multimorphic adoption: strict-subset follow-up after PR #39 (feat(multimorphic) Multimorphic scraper - Phase 1.3.c) merges
  • OpenGraphExtractor shared helper — the next deduplication target (GetMetaContent is now byte-identical in both extractors). Defer until a third caller appears
  • BofProductExtractor.NormalizeAvailability is public, JJP's is private — pre-existing inconsistency, out of scope here

🤖 Generated with Claude Code

JJP and BoF previously shipped near-identical 100-line copies of the
JSON-LD walker. The same pattern would have shipped a third time when
PR #39 (Multimorphic) merges. Three storefronts is the threshold called
out in PR #38's review and PR #39's CHANGELOG note -- extracting now
keeps the next storefront PR cheap.

New: PinballWizard.Infrastructure/Scraping/JsonLd/
  * JsonLdProductParser (static; FindFirstProduct entry point,
    ReadProduct exposed for direct test access)
  * JsonLdProduct + JsonLdOffer (storefront-agnostic DTOs)

Shared parser handles every shape we've seen across JJP / BoF /
Multimorphic:
  * Container: bare object / top-level array / @graph wrapper
  * Price: flat offers[].price (Shopify) AND nested
    offers[].priceSpecification (object or array; both WooCommerce
    dialects)
  * Image: string or array, with empty-string filter
  * @type as string or as array containing "Product"
  * Malformed JSON-LD blocks fall through to next sibling block

JJP and BoF extractors reduced ~270/~310 lines -> ~140/~140 (-300
lines net) by delegating to the shared parser. Each kept its own
manufacturer-specific surface: slug-segment landmark, GameId prefix,
DiscoveredOn sentinel, OG/h1 fallbacks, Edition construction.

End-to-end behavior preserved: every pre-existing test still passes
without modification. JJP gains @graph support as a strict superset
(shape doesn't appear on Shopify so no real-world impact).

Multimorphic adoption is a strict-subset follow-up once PR #39 merges:
delete duplicated parser block, add the using, swap the call. The
shared parser already covers Multimorphic's simultaneous-flat-and-nested
case (verified by FlatAndNestedBothPresent_PrefersFlat test).

Pre-push self-audit:
  * /local-review (Step 0): 0 🔴 / 2 ⚠️ -- both fixed:
    - JsonLdOffers (plural) renamed to JsonLdOffer (singular,
      since the type holds one offer; schema property name stays
      plural)
    - Added 3 robustness tests: empty @graph fall-through,
      empty-string image filter, graph-without-Product
      fall-through-to-sibling-script
  * 7-item mechanical checklist (Step 1): all pass

Tests: 378 -> 402 (+24). Build: zero warnings.

@jkeeley2073 added the claude-code label (Generated with Claude Code) on May 3, 2026
@jkeeley2073 merged commit 99b0246 into main on May 3, 2026
3 checks passed
Comment on lines +49 to +85
foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']"))
{
    var text = script.TextContent;
    if (string.IsNullOrWhiteSpace(text)) continue;

    JsonElement root;
    try
    {
        using var parsed = JsonDocument.Parse(text);
        root = parsed.RootElement.Clone();
    }
    catch (JsonException)
    {
        continue;
    }

    if (root.ValueKind == JsonValueKind.Object
        && root.TryGetProperty("@graph", out var graph)
        && graph.ValueKind == JsonValueKind.Array)
    {
        foreach (var item in graph.EnumerateArray())
        {
            if (ReadProduct(item) is { } prod) return prod;
        }
    }
    else if (root.ValueKind == JsonValueKind.Array)
    {
        foreach (var item in root.EnumerateArray())
        {
            if (ReadProduct(item) is { } prod) return prod;
        }
    }
    else
    {
        if (ReadProduct(root) is { } prod) return prod;
    }
}
Comment on lines +69 to +72
foreach (var item in graph.EnumerateArray())
{
    if (ReadProduct(item) is { } prod) return prod;
}
Comment on lines +76 to +79
foreach (var item in root.EnumerateArray())
{
    if (ReadProduct(item) is { } prod) return prod;
}
Comment on lines +131 to +137
foreach (var item in imageProp.EnumerateArray())
{
    if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
    {
        images.Add(s);
    }
}
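
For context, the array branch quoted in this comment is one half of the "string or array, with empty-string filter" rule from the PR summary. A minimal sketch of both halves together might look like this; ImageSketch and its methods are hypothetical names, and returning a List<string> is an assumption about the shape of the result.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sketch of reading "image" as either a string or an array.
public static class ImageSketch
{
    public static List<string> ReadImages(JsonElement product)
    {
        var images = new List<string>();
        if (product.ValueKind != JsonValueKind.Object) return images;
        if (!product.TryGetProperty("image", out var imageProp)) return images;

        if (imageProp.ValueKind == JsonValueKind.String)
        {
            // Single-string form: keep it only when non-empty.
            if (imageProp.GetString() is { Length: > 0 } single) images.Add(single);
        }
        else if (imageProp.ValueKind == JsonValueKind.Array)
        {
            // Array form: skip non-strings and empty strings.
            foreach (var item in imageProp.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s)
                {
                    images.Add(s);
                }
            }
        }

        return images;
    }

    // Convenience wrapper for trying the rule on raw JSON text.
    public static List<string> ReadImagesJson(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return ReadImages(doc.RootElement);
    }
}
```

Under these assumptions, `{"image":["a.jpg","",42,"b.jpg"]}` yields `a.jpg` and `b.jpg`, and `{"image":""}` yields an empty list.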
Comment on lines +165 to +168
if (offer.TryGetProperty("price", out var direct))
{
    if (FormatPrice(direct) is { } flat) return flat;
}
jkeeley2073 added a commit that referenced this pull request May 3, 2026
Third dedup PR in the series (after #42 JsonLdProductParser and #43
Multimorphic adoption). New shared static helper at
PinballWizard.Infrastructure.Scraping.OpenGraph.OpenGraphExtractor
exposes GetMetaContent(IHtmlDocument doc, string property) which routes
meta[property=] first then meta[name=], returning the trimmed content
attribute. JJP, BoF, Multimorphic each delete the byte-identical
private GetMetaContent and add a using; net -30/+63 across the three
consumers and the helper.

Behavior preserved exactly, including the semantics that content=""
returns the empty string, which the consumer fallback chains depend on
(the ?? operator only triggers on null; changing empty->null would
silently change downstream fallback ordering).
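
The empty-versus-null distinction can be seen in a tiny sketch. The fallback chain below is hypothetical (the real consumers' chains are not shown in this PR); it only demonstrates that ?? skips a value solely when that value is null.

```csharp
#nullable enable

// Hypothetical fallback chain, used only to show how ?? treats "" versus null.
public static class FallbackSketch
{
    // Prefers the Open Graph title, then the <h1> text, then a fixed placeholder.
    // An empty string is not null, so it stops the chain.
    public static string PickTitle(string? ogTitle, string? h1) =>
        ogTitle ?? h1 ?? "(untitled)";
}
```

Here `PickTitle("", "Heading")` returns the empty string, while `PickTitle(null, "Heading")` returns `Heading`. If the helper began returning null for content="", the first call would switch to the h1 text.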

12 new tests pin every shape: spec form, loose form, both-prefer-property,
missing meta, present-meta-without-content-attribute, whitespace
trimming, empty-content-string-parity (load-bearing), first-match-wins
on duplicates, null/empty/whitespace guards.

Pre-push self-audit: /local-review (0 critical / 3 minor / 7 categories
clean — namespace mild misnomer, doc overpromise, unescaped CSS
interpolation; all documented and acceptable for an internal helper)
plus 7-item mechanical checklist (all pass).