Skip to content

Enhance HTTP requests with retry logic and headers#1691

Merged
robbrad merged 1 commit into
robbrad:November_releasefrom
danish-din:sutton-retry-patch
Nov 8, 2025
Merged

Enhance HTTP requests with retry logic and headers#1691
robbrad merged 1 commit into
robbrad:November_releasefrom
danish-din:sutton-retry-patch

Conversation

@danish-din
Copy link
Copy Markdown
Contributor

@danish-din danish-din commented Oct 29, 2025

This update improves reliability and politeness of the Sutton Council data fetcher. It introduces retry logic, request headers, timeouts, and capped exponential backoff when polling for bin data. These changes reduce the chance of rate limits (HTTP 429), transient 5xx errors, and infinite loops during slow server responses.

Key changes:

Added a requests.Session with custom headers and polite retry policy (Retry + HTTPAdapter).

Implemented timeouts and exponential backoff for polling while the page loads.

Respects Retry-After headers when rate-limited.

Raises a clear error if the page remains in “Loading your bin days...” state after max retries.

Improved parsing logic to handle minor format variations (missing commas, extra siblings).

Impact:
No breaking changes — same function signature and output schema. Improves resilience, reduces unnecessary requests, and handles network throttling gracefully.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved reliability of bin collection date retrieval for this council
    • Enhanced handling of date format variations and year transitions (December to January)
    • Better resilience against temporarily unavailable or slow-loading pages with automatic retry logic

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 29, 2025

Walkthrough

The LondonBoroughSutton council scraper is enhanced with HTTP resilience via retry-backed sessions, polling logic with exponential backoff for page-loading states, and improved date extraction with ordinal normalization, format flexibility, and year rollover handling for December-January transitions.

Changes

Cohort / File(s) Summary
HTTP resilience and retry logic
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py
Replaces simple single-request HTTP fetch with a resilient session using custom headers, Retry configuration with exponential backoff, and mounted adapters. Adds controlled polling loop to detect page-loading states with timeout and explicit 429 (rate-limit) header handling via Retry-After. Removes hard-coded sleep-based polling.
Date extraction and parsing improvements
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py
Implements robust sibling-walking approach to locate text matching "Next collection" across multiple following elements. Normalizes date strings by removing ordinal indicators. Extends date parsing to handle both "%A, %d %B" and "%A %d %B" formats. Adds year rollover logic to select correct year (current or next) based on current month for December-January transitions.
Minor import updates
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py
Adds HTTPAdapter and Retry imports from urllib3. Uses time.sleep from time module for controlled delays within polling loop.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Session as HTTP Session<br/>(w/ Retry)
    participant Endpoint as Endpoint
    participant Parser as Date Parser

    Caller->>Session: Fetch with timeout
    Session->>Endpoint: GET request
    alt Page loading (status indicates)
        Endpoint-->>Session: 200 + loading indicator
        Session->>Session: Exponential backoff
        Note over Session: Poll with retry-after
        loop Until loaded or max retries
            Session->>Endpoint: GET request (retry)
            Endpoint-->>Session: 200 + loading/ready
        end
    else Page ready
        Endpoint-->>Session: 200 + content
    end
    Session-->>Caller: HTML content
    Caller->>Parser: Extract date (sibling-walk)
    Parser->>Parser: Normalize ordinals
    Parser->>Parser: Parse format variants
    Parser->>Parser: Year rollover logic
    Parser-->>Caller: Extracted date (DD/MM/YYYY)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Focus areas:
    • Verify retry configuration (exponential backoff settings, max retries) prevents infinite loops and respects rate limits
    • Validate year rollover logic for December-January edge cases across different current dates
    • Confirm date extraction robustness handles variations in HTML structure (sibling-walking approach)
    • Check ordinal normalization covers all ordinal variants (1st, 2nd, 3rd, 4th, etc.)
    • Ensure error handling for malformed or missing dates doesn't mask upstream failures

Suggested reviewers

  • dp247

Poem

🐰 A rabbit hops through retry loops,
With backoff and resilience—no gloops!
December dates now know their year,
And parsing's robust, crystal clear.
No more hard sleeps, just gentle polls—
Smart collection schedules reach their goals! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Enhance HTTP requests with retry logic and headers" accurately describes a significant portion of the changes in this pull request. The raw summary confirms that the code introduces a resilient HTTP session with custom headers, retry configuration with exponential backoff, and proper handling of HTTP 429 responses—exactly what the title conveys. The title is specific, clear, and directly related to real changes in the codebase. However, the changeset also includes substantial improvements to date parsing logic (sibling-walking approach, ordinal normalization, format variations, and year rollover handling), which the title does not mention. The title therefore captures a primary aspect of the change but not the full scope.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented Oct 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.79%. Comparing base (475e160) to head (4a7e249).
⚠️ Report is 40 commits behind head on November_release.

Additional details and impacted files
@@                Coverage Diff                @@
##           November_release    #1691   +/-   ##
=================================================
  Coverage             86.79%   86.79%           
=================================================
  Files                     9        9           
  Lines                  1136     1136           
=================================================
  Hits                    986      986           
  Misses                  150      150           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py (6)

22-38: Solid retry/session setup; minor hardening suggested.

  • Use a context manager for the Session to guarantee closure, or explicitly close at the end.
  • Prefer allowed_methods=frozenset({"GET"}) for clarity with urllib3 Retry.
  • Consider omitting "Connection": "close" to retain keep-alive unless the target specifically requests otherwise.

39-43: Timeouts look good; consider connect/read split.

Using timeout=(5, 20) (connect, read) gives quicker fail on TCP connect stalls while keeping a generous read timeout.


60-65: Shorten/standardize the error message (TRY003) and keep it single-line.

Trim the message to be concise and avoid multi-line string formatting.

Apply this diff:

@@
-        if "Loading your bin days..." in r.text:
-            # fail fast with a clear message so callers can back off scheduling
-            raise RuntimeError(
-                "Sutton page still loading after polite retries; back off and try later."
-            )
+        if "Loading your bin days..." in r.text:
+            # fail fast with a clear message so callers can back off scheduling
+            raise RuntimeError("Sutton page still loading after retries; try later.")

Based on static analysis hints.


66-69: Compute datetime.now() once to avoid drift and repeated calls.

Capture now once and reuse month/year for roll-over logic.

Apply this diff:

@@
-        current_year = datetime.now().year
-        next_year = current_year + 1
+        now = datetime.now()
+        current_year = now.year
+        next_year = current_year + 1
+        current_month = now.month
@@
-            if datetime.now().month == 12 and next_dt.month == 1:
+            if current_month == 12 and next_dt.month == 1:
                 next_dt = next_dt.replace(year=next_year)
             else:
                 next_dt = next_dt.replace(year=current_year)

Also applies to: 104-109


76-90: Sibling walk is robust; consider scanning until next h3 and precompiling regex.

  • Replace the fixed range(4) with a loop that stops at the next h3 to cover more markup variations.
  • Precompile the two regex patterns once at module scope for readability and small perf gains.

115-115: Sorting by parsed date is correct; micro-optimization optional.

Current approach is fine. If bins grows large, cache next_dt alongside the entry to avoid reparsing in the sort key, then drop it before return.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 475e160 and 4a7e249.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py

54-54: Comment contains ambiguous (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?

(RUF003)


62-64: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Run Integration Tests (3.12, 1.8.4)
  • GitHub Check: Run Unit Tests (3.12, 1.8.4)
🔇 Additional comments (1)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughSutton.py (1)

94-101: Date normalization and flexible parsing — nice.

Removing ordinals and supporting both "%A, %d %B" and "%A %d %B" improves resilience without changing outputs.

Comment on lines +49 to +59
while "Loading your bin days..." in r.text and poll < max_polls:
time.sleep(delay)
delay = min(delay * 2, 30) # grow delay but cap it
r = s.get(URI, timeout=20)
if r.status_code == 429:
# manual respect if upstream Retry didn’t catch (e.g., no header)
retry_after = int(r.headers.get("Retry-After", "10"))
time.sleep(min(retry_after, 60))
r.raise_for_status()
poll += 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

429 handling: int() on HTTP-date and immediate raise abort polling.

  • Retry-After can be an HTTP-date or seconds. int() will raise ValueError on dates.
  • After sleeping, r.raise_for_status() will still raise on 429, exiting the loop instead of continuing.

Fix by parsing both formats and continuing the loop after sleeping. Also replace the smart quote to satisfy RUF003.

Apply this diff:

@@
-            r = s.get(URI, timeout=20)
-            if r.status_code == 429:
-                # manual respect if upstream Retry didn’t catch (e.g., no header)
-                retry_after = int(r.headers.get("Retry-After", "10"))
-                time.sleep(min(retry_after, 60))
-            r.raise_for_status()
-            poll += 1
+            r = s.get(URI, timeout=20)
+            if r.status_code == 429:
+                # manual respect if upstream Retry didn't catch (e.g., no header)
+                hdr = r.headers.get("Retry-After")
+                sleep_for = 10
+                if hdr:
+                    try:
+                        sleep_for = int(hdr)
+                    except ValueError:
+                        try:
+                            dt = parsedate_to_datetime(hdr)
+                            sleep_for = max(0, int(dt.timestamp() - time.time()))
+                        except Exception:
+                            sleep_for = 10
+                time.sleep(min(sleep_for, 60))
+                poll += 1
+                continue
+            r.raise_for_status()
+            poll += 1

And add the required import:

@@
 import re
 import time
+from email.utils import parsedate_to_datetime

Based on static analysis hints.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.2)

54-54: Comment contains ambiguous (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?

(RUF003)

@robbrad robbrad changed the base branch from master to November_release November 8, 2025 09:03
@robbrad robbrad merged commit ec69c8c into robbrad:November_release Nov 8, 2025
11 of 13 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Nov 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants