refactor(scrapers) consolidate JSON-LD parsing into shared helper#42
Merged
Conversation
JJP and BoF previously shipped near-identical 100-line copies of the JSON-LD walker. The same pattern would have shipped a third time when PR #39 (Multimorphic) merges. Three storefronts is the threshold called out in PR #38's review and PR #39's CHANGELOG note -- extracting now keeps the next storefront PR cheap. New: PinballWizard.Infrastructure/Scraping/JsonLd/ * JsonLdProductParser (static; FindFirstProduct entry point, ReadProduct exposed for direct test access) * JsonLdProduct + JsonLdOffer (storefront-agnostic DTOs) Shared parser handles every shape we've seen across JJP / BoF / Multimorphic: * Container: bare object / top-level array / @graph wrapper * Price: flat offers[].price (Shopify) AND nested offers[].priceSpecification (object or array; both WooCommerce dialects) * Image: string or array, with empty-string filter * @type as string or as array containing "Product" * Malformed JSON-LD blocks fall through to next sibling block JJP and BoF extractors reduced ~270/~310 lines -> ~140/~140 (-300 lines net) by delegating to the shared parser. Each kept its own manufacturer-specific surface: slug-segment landmark, GameId prefix, DiscoveredOn sentinel, OG/h1 fallbacks, Edition construction. End-to-end behavior preserved: every pre-existing test still passes without modification. JJP gains @graph support as a strict superset (shape doesn't appear on Shopify so no real-world impact). Multimorphic adoption is a strict-subset follow-up once PR #39 merges: delete duplicated parser block, add the using, swap the call. The shared parser already covers Multimorphic's simultaneous-flat-and-nested case (verified by FlatAndNestedBothPresent_PrefersFlat test). Pre-push self-audit: * /local-review (Step 0): 0 🔴 / 2⚠️ -- both fixed: - JsonLdOffers (plural) renamed to JsonLdOffer (singular, since the type holds one offer; schema property name stays plural) - Added 3 robustness tests: empty @graph fall-through, empty-string image filter, graph-without-Product fall-through-to-sibling-script * 7-item mechanical checklist (Step 1): all pass Tests: 378 -> 402 (+24). Build: zero warnings.
Comment on lines
+49
to
+85
| foreach (var script in doc.QuerySelectorAll("script[type='application/ld+json']")) | ||
| { | ||
| var text = script.TextContent; | ||
| if (string.IsNullOrWhiteSpace(text)) continue; | ||
|
|
||
| JsonElement root; | ||
| try | ||
| { | ||
| using var parsed = JsonDocument.Parse(text); | ||
| root = parsed.RootElement.Clone(); | ||
| } | ||
| catch (JsonException) | ||
| { | ||
| continue; | ||
| } | ||
|
|
||
| if (root.ValueKind == JsonValueKind.Object | ||
| && root.TryGetProperty("@graph", out var graph) | ||
| && graph.ValueKind == JsonValueKind.Array) | ||
| { | ||
| foreach (var item in graph.EnumerateArray()) | ||
| { | ||
| if (ReadProduct(item) is { } prod) return prod; | ||
| } | ||
| } | ||
| else if (root.ValueKind == JsonValueKind.Array) | ||
| { | ||
| foreach (var item in root.EnumerateArray()) | ||
| { | ||
| if (ReadProduct(item) is { } prod) return prod; | ||
| } | ||
| } | ||
| else | ||
| { | ||
| if (ReadProduct(root) is { } prod) return prod; | ||
| } | ||
| } |
Comment on lines
+69
to
+72
| foreach (var item in graph.EnumerateArray()) | ||
| { | ||
| if (ReadProduct(item) is { } prod) return prod; | ||
| } |
Comment on lines
+76
to
+79
| foreach (var item in root.EnumerateArray()) | ||
| { | ||
| if (ReadProduct(item) is { } prod) return prod; | ||
| } |
Comment on lines
+131
to
+137
| foreach (var item in imageProp.EnumerateArray()) | ||
| { | ||
| if (item.ValueKind == JsonValueKind.String && item.GetString() is { Length: > 0 } s) | ||
| { | ||
| images.Add(s); | ||
| } | ||
| } |
Comment on lines
+165
to
+168
| if (offer.TryGetProperty("price", out var direct)) | ||
| { | ||
| if (FormatPrice(direct) is { } flat) return flat; | ||
| } |
This was referenced May 3, 2026
jkeeley2073
added a commit
that referenced
this pull request
May 3, 2026
Third dedup PR in the series (after #42 JsonLdProductParser and #43 Multimorphic adoption). New shared static helper at PinballWizard.Infrastructure.Scraping.OpenGraph.OpenGraphExtractor exposes GetMetaContent(IHtmlDocument doc, string property) which routes meta[property=] first then meta[name=], returning the trimmed content attribute. JJP, BoF, Multimorphic each delete the byte-identical private GetMetaContent and add a using; net -30/+63 across the three consumers and the helper. Behavior preserved exactly — including the content="" returns the empty string semantics that the consumer fallback chains depend on (the ?? operator only triggers on null; changing empty->null would silently change downstream fallback ordering). 12 new tests pin every shape: spec form, loose form, both-prefer-property, missing meta, present-meta-without-content-attribute, whitespace trimming, empty-content-string-parity (load-bearing), first-match-wins on duplicates, null/empty/whitespace guards. Pre-push self-audit: /local-review (0 critical / 3 minor / 7 categories clean — namespace mild misnomer, doc overpromise, unescaped CSS interpolation; all documented and acceptable for an internal helper) plus 7-item mechanical checklist (all pass).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pure refactor — net -300 lines. JJP and BoF shipped near-identical 100-line copies of the JSON-LD walker; the same pattern would ship a third time when PR #39 (Multimorphic) merges. Three storefronts is the threshold called out in PR #38's review and PR #39's CHANGELOG note — extracting now keeps the next storefront PR cheap.
What's new
PinballWizard.Infrastructure/Scraping/JsonLd/— new namespaceJsonLdProductParser(static;FindFirstProductentry point,ReadProductexposed for tests)JsonLdProduct+JsonLdOffer(storefront-agnostic DTOs)The parser handles every shape across JJP / BoF / Multimorphic:
@graphwrapperoffers[].price(Shopify) AND nestedoffers[].priceSpecification(object or array; both WooCommerce dialects)@type"Product"What changed
JjpProductExtractorreduced ~270 → ~140 lines by delegatingBofProductExtractorreduced ~310 → ~140 lines by delegatingEach kept its manufacturer-specific surface: slug-segment landmark, GameId prefix,
DiscoveredOnsentinel, OG/h1 fallbacks, Edition construction.Behavior preservation
Every pre-existing test still passes without modification (378 → 402 with the new shared-parser tests). Diffed
GameRecordconstruction blocks againstorigin/mainin both extractors — line-for-line identical.JJP gains
@graphsupport as a strict superset (the shape doesn't appear on Shopify so no real-world impact, but the parser is now uniform).Multimorphic follow-up
PR #39 adoption is a strict-subset change: delete the duplicated parser block, add the
using, swap the method call. The shared parser already covers Multimorphic's simultaneous-flat-and-nested-price case — verified byJsonLdProductParserTests.FindFirstProduct_FlatAndNestedBothPresent_PrefersFlat.Pre-push self-audit (per PR #34 + PR #36)
Step 0 —
/local-reviewLocal review: 0 🔴 / 2 ⚠️ / 8 categories ✅Both⚠️ fixed:
JsonLdOffers(plural type name) holds a single offer's fields — should beJsonLdOffer(singular). Schema.org property name stays plural sinceofferscan be an array.JsonLdOffers→JsonLdOffer@grapharray fall-through, empty-string image filter, graph-without-Product-falls-through-to-sibling-scriptStep 1 — Mechanical checklist
*Optionsproperties (N/A)catch { }insrc/PinballWizard.Infrastructure/Scraping/JsonLd/. The narrowcatch (JsonException)for malformed-JSON-block fall-through is intentional and testedSourceAliasContractTestsstill passesgit log -1 --format='%an <%ae>'— personal noreplyTests
402 / 402 passing (was 378). +24:
JsonLdProductParserTests(24) — container shapes (7), type matching (3), price shapes (6), image shapes (3),ReadProductdirect (3), null arg (1), plus 3 added in response to the reviewOut of scope
OpenGraphExtractorshared helper — the next deduplication target (GetMetaContentis now byte-identical in both extractors). Defer until a third caller appearsBofProductExtractor.NormalizeAvailabilityispublic, JJP's isprivate— pre-existing inconsistency, out of scope here🤖 Generated with Claude Code