Skip to content

Round 2 polish: README, HTTP retry, cross-source linking, DOM research#1

Merged
jkeeley2073 merged 4 commits into
mainfrom
Dev-Round2Polish
May 1, 2026
Merged

Round 2 polish: README, HTTP retry, cross-source linking, DOM research#1
jkeeley2073 merged 4 commits into
mainfrom
Dev-Round2Polish

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Four parallel-agent tracks landing as four focused commits.

Commit Type What
f8da7e0 docs README.md for the public landing page
2979cab feat(downloader) HTTP retry with exponential backoff (+ 1 latent-bug fix on non-retryable status handling)
95e990e feat(catalog) Cross-source game linking — manuals page docs now resolve to known games
882a204 docs DOM validation research notes for the next selector-fix round

Test plan

  • dotnet build PinballWizard.slnx clean (1 pre-existing nullable warning at CatalogBuilder.cs:154, untouched in this round)
  • dotnet test --no-build → 71/71 pass (was 63; +8 new tests: 3 retry, 5 cross-source linking)
  • Per-commit message body explains WHY for each change
  • Manual: spot-check README.md on the GitHub PR diff for client refs (none expected — sanitized in initial commit, agent was instructed to preserve)
  • Manual: read docs/dom-validation.md to confirm it's a useful baseline for the next selector-fix round (the agent fell back to read-only analysis since it couldn't launch Playwright from its sandbox)

Out of scope (deferred to future rounds)

  • Playwright 1.12 → 1.49 upgrade
  • ChangeDetector + snapshot system wiring
  • First non-dry-run end-to-end download (needs a human watching the first ~50 downloads)
  • IFileDownloader extraction (testability seam noted by test-engineer)
  • Nullable warning at CatalogBuilder.cs:154 (too small for an agent)
  • Phase 2 design artifacts (Bicep, EF Core) — explicitly excluded per Phase 1-only scope
  • 4 in-FileDownloader.cs observations from the retry agent (SHA256 leak on exception, partial-file orphan on size-cap mid-stream, weak ETag handling, buffer-size constant) — not regressions, queued for future round

The repo went public without a README, which made the GitHub landing page
empty for new visitors. CLAUDE.md is internal context for Claude (not a
substitute for a README), so add a proper front door covering: what the
project is, three data sources, project status (Phase 1 of 2), tech stack,
quickstart, CLI flags, project layout, Docker, and links to the design docs.
A real production run downloads ~600+ files; without retry, any single
transient failure (5xx, 408, 429, network blip, HttpClient timeout) aborts
that file's download permanently for the run. Add retry with exponential
backoff capped at 30s, Retry-After header honoring (clamped to 60s for
safety), and proper outer-cancellation propagation.

Also fix a bug exposed by the new test coverage: a non-retryable status
like 404 was calling EnsureSuccessStatusCode(), throwing HttpRequestException,
which the retryable-exception handler then caught and looped on. Permanent
client errors now return Failed directly without going through the throw.

3 new tests; 71/71 pass.
Manuals discovered on /manuals/ previously had Game = null even when their
filename clearly identified a known game (e.g. StrangerThings_Pro_web.pdf).
Phase 2's RAG attribution chain depends on every chunk being able to resolve
back to its game; broken cross-source linking would produce manuals-page
search results with no game context.

Add LinkDocumentsToGames as a post-pass in ScrapeAsync: normalize filename
and known slugs (strip _ - . space, lowercase), match by substring, longest
slug wins, ties leave Game null (no guessing). Also backfills GameReference
.Title from the canonical GameRecord.Title when the previous value was the
slug-cased guess from BuildGameReference.

5 new tests covering match, no-match, title backfill, longest-wins, and
tied-leaves-null.
The selectors and JS heuristics in GamePageScraper and ServiceBulletinScraper
were written without inspecting Stern's actual DOM. Capture what each scraper
EXPECTS and the specific empirical probes needed before any selector fix
round, so the next iteration can start from a written baseline rather than
re-deriving the questions.

Includes a flagged bug independent of any DOM finding: bulletin date and
related-game text only land in DiscoveryContext today; they should be
typed fields.
@jkeeley2073 jkeeley2073 merged commit d5f56ff into main May 1, 2026
@jkeeley2073 jkeeley2073 deleted the Dev-Round2Polish branch May 2, 2026 00:07
jkeeley2073 added a commit that referenced this pull request May 4, 2026
Per /local-review on PR #68: grep -E exits 2 on a malformed extended
regex, but run_rule wraps the grep call with `|| true` which masks
exit 2 as "no match" — silently disabling the rule. A typo in the
WORK_EMAIL_PATTERN secret would pass the workflow without ever
checking commits against the work-email pattern.

The narrow fix: pre-validate the pattern by running it against an
empty stdin via printf '' | grep -E "$WORK_EMAIL_PATTERN" and
checking grep's exit code directly. Exit 2 = malformed pattern,
fail the workflow with an error annotation that names the issue
and points the operator at how to fix the secret. Exit 0 (matches
empty) or 1 (no match against empty) → pattern is well-formed,
proceed to run_rule normally.

The broader cleanup of run_rule itself (distinguishing grep exit
codes 0/1/2 for every rule) is out of scope for this PR — the
narrow fix here addresses the new rule's specific risk without
touching pre-existing behavior of the other rules.

Local review summary (retroactive on PR #68): 0 🔴, 3 ⚠️ findings.
- ⚠️ #1 (post-merge smoke test): already covered in the PR
  description's "Validation hand-off after merge" section.
- ⚠️ #2 (grep exit-2 silent swallow): fixed by this commit.
- ⚠️ #3 (doc-anchor verification): the comment cites
  "docs/build-spec.md Phase 2 § Scope item 9" which exists at
  build-spec.md:225 — verified, no change needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant