Enhance HTTP requests with retry logic and headers#1691
Conversation
WalkthroughThe LondonBoroughSutton council scraper is enhanced with HTTP resilience via retry-backed sessions, polling logic with exponential backoff for page-loading states, and improved date extraction with ordinal normalization, format flexibility, and year rollover handling for December-January transitions. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Session as HTTP Session<br/>(w/ Retry)
participant Endpoint as Endpoint
participant Parser as Date Parser
Caller->>Session: Fetch with timeout
Session->>Endpoint: GET request
alt Page loading (status indicates)
Endpoint-->>Session: 200 + loading indicator
Session->>Session: Exponential backoff
Note over Session: Poll with retry-after
loop Until loaded or max retries
Session->>Endpoint: GET request (retry)
Endpoint-->>Session: 200 + loading/ready
end
else Page ready
Endpoint-->>Session: 200 + content
end
Session-->>Caller: HTML content
Caller->>Parser: Extract date (sibling-walk)
Parser->>Parser: Normalize ordinals
Parser->>Parser: Parse format variants
Parser->>Parser: Year rollover logic
Parser-->>Caller: Extracted date (DD/MM/YYYY)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## November_release #1691 +/- ##
=================================================
Coverage 86.79% 86.79%
=================================================
Files 9 9
Lines 1136 1136
=================================================
Hits 986 986
Misses 150 150 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py (6)
22-38: Solid retry/session setup; minor hardening suggested.
- Use a context manager for the Session to guarantee closure, or explicitly close at the end.
- Prefer allowed_methods=frozenset({"GET"}) for clarity with urllib3 Retry.
- Consider omitting "Connection": "close" to retain keep-alive unless the target specifically requests otherwise.
39-43: Timeouts look good; consider connect/read split.Using timeout=(5, 20) (connect, read) gives quicker fail on TCP connect stalls while keeping a generous read timeout.
60-65: Shorten/standardize the error message (TRY003) and keep it single-line.Trim the message to be concise and avoid multi-line string formatting.
Apply this diff:
@@ - if "Loading your bin days..." in r.text: - # fail fast with a clear message so callers can back off scheduling - raise RuntimeError( - "Sutton page still loading after polite retries; back off and try later." - ) + if "Loading your bin days..." in r.text: + # fail fast with a clear message so callers can back off scheduling + raise RuntimeError("Sutton page still loading after retries; try later.")Based on static analysis hints.
66-69: Compute datetime.now() once to avoid drift and repeated calls.Capture now once and reuse month/year for roll-over logic.
Apply this diff:
@@ - current_year = datetime.now().year - next_year = current_year + 1 + now = datetime.now() + current_year = now.year + next_year = current_year + 1 + current_month = now.month @@ - if datetime.now().month == 12 and next_dt.month == 1: + if current_month == 12 and next_dt.month == 1: next_dt = next_dt.replace(year=next_year) else: next_dt = next_dt.replace(year=current_year)Also applies to: 104-109
76-90: Sibling walk is robust; consider scanning until next h3 and precompiling regex.
- Replace the fixed range(4) with a loop that stops at the next h3 to cover more markup variations.
- Precompile the two regex patterns once at module scope for readability and small perf gains.
115-115: Sorting by parsed date is correct; micro-optimization optional.Current approach is fine. If bins grows large, cache next_dt alongside the entry to avoid reparsing in the sort key, then drop it before return.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py(2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py
54-54: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
62-64: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run Integration Tests (3.12, 1.8.4)
- GitHub Check: Run Unit Tests (3.12, 1.8.4)
🔇 Additional comments (1)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py (1)
94-101: Date normalization and flexible parsing — nice.Removing ordinals and supporting both "%A, %d %B" and "%A %d %B" improves resilience without changing outputs.
| while "Loading your bin days..." in r.text and poll < max_polls: | ||
| time.sleep(delay) | ||
| delay = min(delay * 2, 30) # grow delay but cap it | ||
| r = s.get(URI, timeout=20) | ||
| if r.status_code == 429: | ||
| # manual respect if upstream Retry didn’t catch (e.g., no header) | ||
| retry_after = int(r.headers.get("Retry-After", "10")) | ||
| time.sleep(min(retry_after, 60)) | ||
| r.raise_for_status() | ||
| poll += 1 | ||
|
|
There was a problem hiding this comment.
429 handling: int() on HTTP-date and immediate raise abort polling.
- Retry-After can be an HTTP-date or seconds. int() will raise ValueError on dates.
- After sleeping, r.raise_for_status() will still raise on 429, exiting the loop instead of continuing.
Fix by parsing both formats and continuing the loop after sleeping. Also replace the smart quote to satisfy RUF003.
Apply this diff:
@@
- r = s.get(URI, timeout=20)
- if r.status_code == 429:
- # manual respect if upstream Retry didn’t catch (e.g., no header)
- retry_after = int(r.headers.get("Retry-After", "10"))
- time.sleep(min(retry_after, 60))
- r.raise_for_status()
- poll += 1
+ r = s.get(URI, timeout=20)
+ if r.status_code == 429:
+ # manual respect if upstream Retry didn't catch (e.g., no header)
+ hdr = r.headers.get("Retry-After")
+ sleep_for = 10
+ if hdr:
+ try:
+ sleep_for = int(hdr)
+ except ValueError:
+ try:
+ dt = parsedate_to_datetime(hdr)
+ sleep_for = max(0, int(dt.timestamp() - time.time()))
+ except Exception:
+ sleep_for = 10
+ time.sleep(min(sleep_for, 60))
+ poll += 1
+ continue
+ r.raise_for_status()
+ poll += 1And add the required import:
@@
import re
import time
+from email.utils import parsedate_to_datetimeBased on static analysis hints.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.2)
54-54: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
This update improves reliability and politeness of the Sutton Council data fetcher. It introduces retry logic, request headers, timeouts, and capped exponential backoff when polling for bin data. These changes reduce the chance of rate limits (HTTP 429), transient 5xx errors, and infinite loops during slow server responses.
Key changes:
Added a requests.Session with custom headers and polite retry policy (Retry + HTTPAdapter).
Implemented timeouts and exponential backoff for polling while the page loads.
Respects Retry-After headers when rate-limited.
Raises a clear error if the page remains in “Loading your bin days...” state after max retries.
Improved parsing logic to handle minor format variations (missing commas, extra siblings).
Impact:
No breaking changes — same function signature and output schema. Improves resilience, reduces unnecessary requests, and handles network throttling gracefully.
Summary by CodeRabbit
Release Notes