Round 2 polish: README, HTTP retry, cross-source linking, DOM research #1
Merged
The repo went public without a README, which made the GitHub landing page empty for new visitors. CLAUDE.md is internal context for Claude (not a substitute for a README), so add a proper front door covering: what the project is, three data sources, project status (Phase 1 of 2), tech stack, quickstart, CLI flags, project layout, Docker, and links to the design docs.
A real production run downloads ~600+ files; without retry, any single transient failure (5xx, 408, 429, network blip, HttpClient timeout) aborts that file's download permanently for the run. Add retry with exponential backoff capped at 30s, Retry-After header honoring (clamped to 60s for safety), and proper outer-cancellation propagation. Also fix a bug exposed by the new test coverage: a non-retryable status like 404 was calling EnsureSuccessStatusCode(), throwing HttpRequestException, which the retryable-exception handler then caught and looped on. Permanent client errors now return Failed directly without going through the throw. 3 new tests; 71/71 pass.
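The retry policy described above can be sketched in a language-neutral way. This is only an illustration in Python, not the project's .NET code: the base delay, the attempt limit, the `"Succeeded"` result name, and the `send` callback are assumptions made for the example, while the 30s backoff cap, the 60s Retry-After clamp, and the rule that permanent errors return Failed without raising come from the description.

```python
import time
from typing import Callable, Optional, Tuple

MAX_BACKOFF_S = 30.0      # cap on exponential backoff (from the description)
MAX_RETRY_AFTER_S = 60.0  # clamp on a server-supplied Retry-After (from the description)
BASE_DELAY_S = 1.0        # assumption: the description does not state the base delay
MAX_ATTEMPTS = 4          # assumption: the description does not state the attempt limit


def is_retryable(status: int) -> bool:
    """Transient statuses named in the description: 408, 429, and any 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def next_delay(attempt: int, retry_after_s: Optional[float] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after_s is not None:
        # Honor the server's hint, but never wait longer than the clamp.
        return max(0.0, min(retry_after_s, MAX_RETRY_AFTER_S))
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_BACKOFF_S)


def download(send: Callable[[], Tuple[int, Optional[float]]],
             sleep: Callable[[float], None] = time.sleep) -> str:
    """`send` is a hypothetical callback returning (status, retry_after_seconds or None);
    it may raise ConnectionError or TimeoutError for network-level failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        retry_after = None
        try:
            status, retry_after = send()
        except (ConnectionError, TimeoutError):
            pass  # network blip or timeout: treated as retryable
        else:
            if 200 <= status < 300:
                return "Succeeded"
            if not is_retryable(status):
                # Permanent error (e.g. 404): return directly instead of raising,
                # so the retryable-exception handler never sees it.
                return "Failed"
        if attempt < MAX_ATTEMPTS:
            sleep(next_delay(attempt, retry_after))
    return "Failed"
```

Returning the failure for a 404 directly, rather than raising and catching, is what keeps a permanent client error out of the retry loop. Outer-cancellation propagation is omitted from this sketch.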
Manuals discovered on /manuals/ previously had Game = null even when their filename clearly identified a known game (e.g. StrangerThings_Pro_web.pdf). Phase 2's RAG attribution chain depends on every chunk being able to resolve back to its game; broken cross-source linking would produce manuals-page search results with no game context. Add LinkDocumentsToGames as a post-pass in ScrapeAsync: normalize filename and known slugs (strip _ - . space, lowercase), match by substring, longest slug wins, ties leave Game null (no guessing). Also backfills GameReference.Title from the canonical GameRecord.Title when the previous value was the slug-cased guess from BuildGameReference. 5 new tests covering match, no-match, title backfill, longest-wins, and tied-leaves-null.
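The matching rule above (normalize, substring match, longest slug wins, ties stay unlinked) can be sketched as follows. This is a Python illustration rather than the actual LinkDocumentsToGames code; the function names and the example slugs are hypothetical, and title backfill is left out.

```python
import re
from typing import Iterable, Optional

# Characters stripped before comparison: underscore, hyphen, dot, space.
_STRIP = re.compile(r"[_\-. ]")


def normalize(text: str) -> str:
    """Strip separators and lowercase, so 'Stranger-Things' and 'StrangerThings' compare equal."""
    return _STRIP.sub("", text).lower()


def link_document_to_game(filename: str, slugs: Iterable[str]) -> Optional[str]:
    """Return the single longest slug contained in the filename, or None.

    None is returned both when nothing matches and when two or more slugs
    tie for the longest match (no guessing).
    """
    name = normalize(filename)
    hits = [s for s in slugs if normalize(s) and normalize(s) in name]
    if not hits:
        return None
    best = max(len(normalize(s)) for s in hits)
    winners = [s for s in hits if len(normalize(s)) == best]
    return winners[0] if len(winners) == 1 else None


# Hypothetical slugs, for illustration only.
slugs = ["stranger-things", "things", "jaws"]
print(link_document_to_game("StrangerThings_Pro_web.pdf", slugs))
print(link_document_to_game("Unknown_Manual.pdf", slugs))
```

Preferring the longest slug keeps a short slug that happens to be a substring of a longer one from capturing the document, and leaving ties unlinked avoids attributing a manual to the wrong game.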
The selectors and JS heuristics in GamePageScraper and ServiceBulletinScraper were written without inspecting Stern's actual DOM. Capture what each scraper EXPECTS and the specific empirical probes needed before any selector fix round, so the next iteration can start from a written baseline rather than re-deriving the questions. Includes a flagged bug independent of any DOM finding: bulletin date and related-game text only land in DiscoveryContext today; they should be typed fields.
jkeeley2073 added a commit that referenced this pull request May 4, 2026
Per /local-review on PR #68: grep -E exits 2 on a malformed extended regex, but run_rule wraps the grep call with `|| true`, which masks exit 2 as "no match" and silently disables the rule. A typo in the WORK_EMAIL_PATTERN secret would pass the workflow without ever checking commits against the work-email pattern.

The narrow fix: pre-validate the pattern by running it against an empty stdin via `printf '' | grep -E "$WORK_EMAIL_PATTERN"` and checking grep's exit code directly. Exit 2 = malformed pattern: fail the workflow with an error annotation that names the issue and points the operator at how to fix the secret. Exit 0 (matches empty) or 1 (no match against empty) → pattern is well-formed, proceed to run_rule normally.

The broader cleanup of run_rule itself (distinguishing grep exit codes 0/1/2 for every rule) is out of scope for this PR; the narrow fix here addresses the new rule's specific risk without touching pre-existing behavior of the other rules.

Local review summary (retroactive on PR #68): 0 🔴, 3 ⚠️ findings.
- ⚠️ #1 (post-merge smoke test): already covered in the PR description's "Validation hand-off after merge" section.
- ⚠️ #2 (grep exit-2 silent swallow): fixed by this commit.
- ⚠️ #3 (doc-anchor verification): the comment cites "docs/build-spec.md Phase 2 § Scope item 9", which exists at build-spec.md:225. Verified, no change needed.
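The exit-code check in this commit can be illustrated outside the workflow. The sketch below is Python calling the system `grep` through `subprocess`, not the workflow's shell step; the function name and the sample patterns are hypothetical, and it assumes a `grep` binary is on the PATH.

```python
import subprocess


def pattern_is_well_formed(pattern: str) -> bool:
    """Run `grep -E` against empty input and inspect the exit code directly.

    Exit 0 or 1 means grep accepted the pattern; exit 2 means grep hit an
    error, which for empty input points at a malformed pattern.
    """
    result = subprocess.run(
        ["grep", "-E", "-e", pattern],
        input=b"",                  # same idea as `printf '' | grep -E ...`
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode in (0, 1)


# Hypothetical patterns, for illustration only.
print(pattern_is_well_formed(r"@example\.com$"))  # a valid extended regex
print(pattern_is_well_formed("[a-z"))             # unbalanced bracket
```

Checking the return code explicitly, instead of collapsing every non-zero exit into "no match", is what separates a broken pattern from a pattern that did not match.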
This was referenced May 4, 2026
Summary
Four parallel-agent tracks landing as four focused commits.
- `README.md` for the public landing page
Test plan
- `dotnet build PinballWizard.slnx` clean (1 pre-existing nullable warning at `CatalogBuilder.cs:154`, untouched in this round)
- `dotnet test --no-build` → 71/71 pass (was 63; +8 new tests: 3 retry, 5 cross-source linking)
- Review `README.md` on the GitHub PR diff for client refs (none expected: sanitized in initial commit, agent was instructed to preserve)
- Review `docs/dom-validation.md` to confirm it's a useful baseline for the next selector-fix round (the agent fell back to read-only analysis since it couldn't launch Playwright from its sandbox)
Out of scope (deferred to future rounds)
- `ChangeDetector` + snapshot system wiring
- `IFileDownloader` extraction (testability seam noted by test-engineer)
- Nullable warning at `CatalogBuilder.cs:154` (too small for an agent)
- `FileDownloader.cs` observations from the retry agent (SHA256 leak on exception, partial-file orphan on size-cap mid-stream, weak ETag handling, buffer-size constant): not regressions, queued for a future round