Skip to content

Replace hand-rolled HTTP retry with Microsoft.Extensions.Http.Resilience#2

Merged
jkeeley2073 merged 2 commits into
mainfrom
Dev-HttpResilience
May 2, 2026
Merged

Replace hand-rolled HTTP retry with Microsoft.Extensions.Http.Resilience#2
jkeeley2073 merged 2 commits into
mainfrom
Dev-HttpResilience

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Round 2 added hand-rolled retry to FileDownloader only. This PR replaces it with the Microsoft-blessed Microsoft.Extensions.Http.Resilience package, applied to both HttpClients via a shared resilience pipeline. Decision and rationale are committed alongside as docs/http-resilience-research.md.

Commit Type What
9397cd6 docs Research and recommendation
21f05d1 refactor(http) Implementation

Why not the stock AddStandardResilienceHandler?

Defaults are wrong for a polite single-host scraper:

  • 10s per-attempt timeout would kill 50MB PDF downloads mid-stream
  • 30s total request timeout would kill any large file
  • 1,000 concurrent permits is bad citizenship by default
  • 100-throughput circuit-breaker minimum means it never engages on a 600-file batch run

We use AddResilienceHandler (custom pipeline) with two stages: concurrency limiter (politeness) + Polly v8 retry. Skip per-attempt timeout (HttpClient.Timeout already suffices), total timeout, circuit breaker, hedging.

Test plan

  • dotnet build clean (0 warnings, 0 errors)
  • dotnet test → 68/68 pass (was 71; -3 retry-specific tests now Polly's responsibility)
  • Live smoke test: --source manuals --scrape-only --dry-run returns 166 links, 0 errors
  • Manual: confirm ManualsScraper retry behavior next time Stern hiccups (no longer relies on a single fetch succeeding first try)

Notable LOC delta

4 files changed, 143 insertions(+), 402 deletions(-)

FileDownloader.cs shrinks from 326 to 187 lines. Retry policy lives in Polly's tested code, not ours.

Out of scope (deferred)

  • Stream resumption: none of the three options (Microsoft, Polly direct, hand-roll) re-attaches a partial-write download mid-stream. Worth a Range: header follow-up in FileDownloader itself, independent of the resilience handler. For typical Stern PDF sizes (<20MB) a from-scratch retry is acceptable.
  • Pipeline-end-to-end integration tests via ConfigurePrimaryHttpMessageHandler: skipped this PR; Polly's behavior is tested upstream and we don't need to re-verify it. Add if/when we customize ShouldHandle ourselves.

Captures the decision (and rationale) to adopt Microsoft.Extensions.Http
.Resilience with a custom AddResilienceHandler pipeline rather than the
stock AddStandardResilienceHandler or hand-rolled retry. The standard
handler's defaults (10s per-attempt timeout, 1000 concurrent permits,
30s total timeout) would break large-PDF downloads and hammer
sternpinball.com. The hand-rolled approach (added in Round 2 polish)
already exists but only protects FileDownloader, leaving ManualsScraper
without retry.

This doc precedes the implementation commit so the rationale is
preserved alongside the change for future readers.
…peline

Wire Microsoft.Extensions.Http.Resilience 10.4.0 onto both HttpClients
in Program.cs via AddResilienceHandler with a custom 2-stage pipeline:
concurrency limiter (politeness, permit=MaxConcurrentDownloads) and Polly
v8 retry (3 attempts, exponential backoff, jitter, MaxDelay 30s,
ShouldRetryAfterHeader=true). The standard handler defaults are wrong
for a polite single-host scraper — see docs/http-resilience-research.md.

Effects:
- ManualsScraper now gets retry for free (the gap that motivated this).
- FileDownloader sheds 110 LOC: the retry loop, IsRetryableStatus,
  IsRetryableException, ComputeDelay, TryGetRetryAfterMs, MaxBackoffMs,
  and MaxRetryAfterSeconds — Polly owns all of it.
- Pipeline is per-client by design — each AddResilienceHandler call gets
  its own name and config, so future Phase 2 clients (Azure AI, embeddings)
  can have different policies without touching this code.

Tests: drop 3 retry-specific tests (RetriesAndSucceeds, ReturnsFailedAfter
MaxRetries, ClientErrorDoesNotRetry) plus the SequencedHandler /
CountingStatusHandler helpers. The behavior they verified is now Polly's
responsibility, tested in the upstream Polly suite. All other
FileDownloader tests (304 fast-path, 200 streaming, conditional headers,
size cap, network exception) still pass: 68/68 green.

Verified end-to-end: dry-run manuals scrape returns 166 links, 0 errors.
@jkeeley2073 jkeeley2073 merged commit 93f0547 into main May 2, 2026
@jkeeley2073 jkeeley2073 deleted the Dev-HttpResilience branch May 2, 2026 11:31
jkeeley2073 added a commit that referenced this pull request May 4, 2026
Per /local-review on PR #68: grep -E exits 2 on a malformed extended
regex, but run_rule wraps the grep call with `|| true` which masks
exit 2 as "no match" — silently disabling the rule. A typo in the
WORK_EMAIL_PATTERN secret would pass the workflow without ever
checking commits against the work-email pattern.

The narrow fix: pre-validate the pattern by running it against an
empty stdin via printf '' | grep -E "$WORK_EMAIL_PATTERN" and
checking grep's exit code directly. Exit 2 = malformed pattern,
fail the workflow with an error annotation that names the issue
and points the operator at how to fix the secret. Exit 0 (matches
empty) or 1 (no match against empty) → pattern is well-formed,
proceed to run_rule normally.

The broader cleanup of run_rule itself (distinguishing grep exit
codes 0/1/2 for every rule) is out of scope for this PR — the
narrow fix here addresses the new rule's specific risk without
touching pre-existing behavior of the other rules.

Local review summary (retroactive on PR #68): 0 🔴, 3 ⚠️ findings.
- ⚠️ #1 (post-merge smoke test): already covered in the PR
  description's "Validation hand-off after merge" section.
- ⚠️ #2 (grep exit-2 silent swallow): fixed by this commit.
- ⚠️ #3 (doc-anchor verification): the comment cites
  "docs/build-spec.md Phase 2 § Scope item 9" which exists at
  build-spec.md:225 — verified, no change needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant