
fix: BexleyCouncil - adapt to async page loading#2001

Open

InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/BexleyCouncil-async-loading

Conversation

InertiaUK (Contributor) commented May 3, 2026

Bexley's WasteWorks site moved from server-side meta-refresh to client-side JS polling (via refresh.auto.min.js with data-every="2"). The old approach of requesting ?page_loading=1 without a session or the correct headers no longer returns bin data.

Changes:

  • Use a requests.Session for cookie persistence (the server tracks computation state per session)
  • Hit the main URL first to trigger server-side schedule computation
  • Poll ?page_loading=1 with x-requested-with: fetch header, matching what the site's JS does
  • Bumped timeout from 30s to 60s per request
  • Up to 30 poll attempts at 2s intervals (60s total) instead of 5 attempts at 3s

Tested against multiple UPRNs on the live site.
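The flow described in the bullets can be sketched as a self-contained helper (the function name, parameters, and the duck-typed session argument are illustrative assumptions; the actual parser inlines this logic with its own constants):

```python
import time


def poll_for_schedule(session, page, attempts=30, interval=2.0):
    """Bootstrap a WasteWorks page, then poll its ?page_loading=1
    fragment until the bin schedule appears.

    `session` is any requests.Session-like object with a .get method.
    """
    # Initial request sets the session cookie and kicks off the
    # server-side schedule computation
    session.get(page, timeout=60)

    for _ in range(attempts):
        # Same fragment endpoint and header the site's own JS poller uses
        response = session.get(
            f"{page}?page_loading=1",
            headers={"x-requested-with": "fetch"},
            timeout=60,
        )
        response.raise_for_status()
        if "waste-service-name" in response.text:
            return response.text
        time.sleep(interval)
    raise TimeoutError("Bexley page never finished loading")
```

Passing interval=0 makes the helper testable without real delays; the values in this PR are 30 attempts at 2-second intervals.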

Summary by CodeRabbit

  • Bug Fixes

    • More accurate date handling for “collected today” and similar messages so collection dates display correctly.
    • No longer fails when no collections are found; app handles empty results gracefully.
  • Refactor

    • Improved retrieval process for Bexley council bin schedules using a more reliable session-based polling approach for consistent updates.

coderabbitai (Bot) commented May 3, 2026

📝 Walkthrough

The Bexley Council data parser now uses a persistent requests.Session with custom headers to establish server-side state before polling for page readiness with up to 30 iterations. Date parsing adds special handling for "collected today" and "could not be collected today" strings, mapping them to the current datetime instead of parsing ordinal date formats.

Changes

Bexley Council Page Fetch and Date Parsing

Session & Polling Infrastructure — uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py:
Replaced simple requests.get with a requests.Session, performed an initial page request to establish server-side state, and added a 30-iteration polling loop checking ?page_loading=1 responses for waste-service-name.

Parsing adjustments (bin type, dates, validation) — uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py:
Preserved bin-type extraction semantics; added case-insensitive handling of "collected today" and "could not be collected today" (sets parsed_date = datetime.now()); retained ordinal cleanup and two-format parsing fallback; removed the guard that raised when no bins were found.
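The "collected today" mapping described above can be sketched as follows (the function name and the two fallback formats are illustrative assumptions; the real module delegates ordinal stripping to a remove_ordinal_indicator_from_date_string helper, inlined here as a regex):

```python
import re
from datetime import datetime


def parse_collection_date(raw: str) -> datetime:
    """Map 'collected today' style strings to now; otherwise strip
    ordinal suffixes and try a couple of date formats."""
    text = raw.strip().lower()
    if text in ("collected today", "could not be collected today"):
        return datetime.now()
    # Inline stand-in for remove_ordinal_indicator_from_date_string:
    # "Wednesday 3rd June 2026" -> "Wednesday 3 June 2026"
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", raw.strip())
    for fmt in ("%A %d %B %Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unsupported Bexley collection date: {raw!r}")
```

Raising on an unknown format, rather than printing and continuing, matches the reviewers' suggestions in this conversation.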

Sequence Diagram

sequenceDiagram
    actor User
    participant Parser as BexleyCouncil.parse_data
    participant Session as requests.Session
    participant Server as Bexley Server

    User->>Parser: parse_data(page)
    Parser->>Session: create Session, set headers (User-Agent, x-requested-with)
    Parser->>Session: initial GET (establish state/cookies)
    activate Parser
    loop Polling (up to 30 iterations)
        Parser->>Session: GET page?page_loading=1
        Session->>Server: Fetch with x-requested-with: fetch
        Server-->>Session: HTML response
        Parser->>Parser: Check for waste-service-name
        alt Found
            Parser->>Parser: Break polling loop
        else Not found
            Parser->>Parser: Sleep, retry
        end
    end
    Parser->>Parser: Parse bin types and collection dates (handle "collected today")
    Parser->>User: Return parsed bin collection data

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • dp247

Poem

🐰 A rabbit peeked at Bexley's page,
Sessions set to calm the page-load rage,
If "today" appears in text so bright,
I stamp it now — the date is right,
Polling gently through the night.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
Title Check — ✅ Passed: the title accurately describes the main change, adapting BexleyCouncil to async page loading instead of the previous synchronous approach.
Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.





codecov (Bot) commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (8ecf878) to head (2f9dfbc).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2001   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152           


coderabbitai (Bot) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py (1)

73-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow unexpected date formats here.

print(...) drops the bin entry and returns partial data, which makes Bexley markup changes easy to miss. This parser should raise with the raw date string instead.

Proposed fix
-                    except ValueError as e:
-                        print(f"Error parsing date for {bin_type}: {e}")
+                    except ValueError as e:
+                        raise ValueError(
+                            f"Unexpected Bexley collection date for {bin_type}: {next_collection}"
+                        ) from e
Based on learnings: In `uk_bin_collection/**/*.py`, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines
73 - 90, The code currently swallows date parsing errors (in the try/except
around parsing next_collection) and just prints them, which drops the bin entry;
instead, replace the print swallow with an explicit raise so failures surface:
in the block that uses remove_ordinal_indicator_from_date_string, parsed_date,
date_format and data["bins"].append (inside the try/except), remove the
print(...) in the except ValueError and raise a new ValueError (or re-raise)
that includes the bin_type and the raw next_collection string plus the original
exception message so callers see the unexpected format and the raw input that
failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py`:
- Around line 35-45: The polling loop that fetches the council page (using
session.get and inspecting response.text for "waste-service-name") must raise an
explicit error when all 30 attempts still return the loader fragment instead of
falling through to parsing; modify the logic around the for-attempts loop to
check after the loop if "waste-service-name" is not in response.text and raise a
clear RuntimeError (or a project-standard exception) with context (e.g. last
response.status_code and a brief snippet or note about loader/timeout) so
BeautifulSoup(...) is never invoked on a still-loading page; update the code
paths around the variables response, session.get and the subsequent
BeautifulSoup call to use this raised error.
- Around line 30-45: The current polling in BexleyCouncil.py uses timeout=60 per
request which can exceed a 60s overall deadline when combined with the initial
session.get, multiple polls and sleeps; change this to track a single deadline
(use time.monotonic()) set to now + 60s before the initial request, compute
remaining = deadline - now before each network call, and pass
timeout=min(remaining, 60) to session.get (for both the initial bootstrap
request and the "{page}?page_loading=1" polls), abort/raise a timeout if
remaining <= 0, and also cap the time.sleep() between attempts to not sleep past
the remaining budget. Ensure identifiers referenced are the existing session.get
calls and the polling loop around response =
session.get(f"{page}?page_loading=1", ...).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a1ce570-1916-4ab3-83ba-b787de63b659

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecf878 and 2055b98.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py

Comment on lines +30 to +45

        # Initial request triggers server-side schedule computation and sets session cookie
        session.get(page, timeout=60)

        # Poll for results — server computes async, JS client polls ?page_loading=1
        # with x-requested-with header to get the content fragment
        response = None
        for attempt in range(30):
            response = session.get(
                f"{page}?page_loading=1",
                headers={"x-requested-with": "fetch"},
                timeout=60,
            )
            response.raise_for_status()
            if "waste-service-name" in response.text:
                break
            time.sleep(2)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce a real 60-second overall deadline.

timeout=60 is per request, so this can block for far longer than 60 seconds once you include the bootstrap request, 30 polls, and sleeps. Track a deadline and cap each request by the remaining budget.

Proposed fix
-        # Initial request triggers server-side schedule computation and sets session cookie
-        session.get(page, timeout=60)
+        deadline = time.monotonic() + 60
+
+        # Initial request triggers server-side schedule computation and sets session cookie
+        bootstrap = session.get(page, timeout=10)
+        bootstrap.raise_for_status()

         # Poll for results — server computes async, JS client polls ?page_loading=1
         # with x-requested-with header to get the content fragment
         response = None
-        for attempt in range(30):
+        for _ in range(30):
+            remaining = deadline - time.monotonic()
+            if remaining <= 0:
+                raise TimeoutError("Timed out waiting for Bexley waste data to load")
             response = session.get(
                 f"{page}?page_loading=1",
                 headers={"x-requested-with": "fetch"},
-                timeout=60,
+                timeout=min(10, remaining),
             )
             response.raise_for_status()
             if "waste-service-name" in response.text:
                 break
-            time.sleep(2)
+            time.sleep(min(2, max(0, deadline - time.monotonic())))
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 36-36: Loop control variable attempt not used within loop body

Rename unused attempt to _attempt

(B007)


InertiaUK force-pushed the fix/BexleyCouncil-async-loading branch from 2055b98 to 2f9dfbc on May 12, 2026 15:22

coderabbitai (Bot) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py (1)

73-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not swallow date parse failures for collection rows.

Line 89-Line 90 logs and continues, which silently drops data when Bexley changes text format. Raise with context (bin_type, raw next_collection) instead of print.

Proposed fix
-                    except ValueError as e:
-                        print(f"Error parsing date for {bin_type}: {e}")
+                    except ValueError as e:
+                        raise ValueError(
+                            f"Unsupported date format for '{bin_type}': '{next_collection}'"
+                        ) from e

Based on learnings: In uk_bin_collection/**/*.py, prefer explicit failures on unexpected formats over silent defaults/swallowed errors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines
73 - 90, The code in BexleyCouncil.py currently catches ValueError when parsing
dates and only prints an error, which silently drops rows; instead re-raise a
descriptive exception so callers can see failures. Replace the except block that
catches ValueError around the datetime.strptime logic (related to parsed_date,
remove_ordinal_indicator_from_date_string and the data["bins"] append) with code
that raises a new ValueError (or re-raises) including bin_type and the raw
next_collection string and the original exception message so the failure is
explicit.
♻️ Duplicate comments (1)
uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py (1)

30-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce a real overall deadline and stop parsing when async loading never completes.

Line 31 and Line 40 use 60s per request, so total runtime can far exceed one minute; and if all polls return loader content, Line 47 still parses it. Please track a single monotonic deadline, cap each request/sleep by remaining time, and raise on exhaustion.

Proposed fix
+        deadline = time.monotonic() + 60
+
         # Initial request triggers server-side schedule computation and sets session cookie
-        session.get(page, timeout=60)
+        remaining = deadline - time.monotonic()
+        if remaining <= 0:
+            raise TimeoutError("Timed out before Bexley bootstrap request")
+        bootstrap = session.get(page, timeout=min(10, remaining))
+        bootstrap.raise_for_status()
 
         # Poll for results — server computes async, JS client polls ?page_loading=1
         # with x-requested-with header to get the content fragment
         response = None
-        for attempt in range(30):
+        for _ in range(30):
+            remaining = deadline - time.monotonic()
+            if remaining <= 0:
+                raise TimeoutError("Timed out waiting for Bexley waste data to load")
             response = session.get(
                 f"{page}?page_loading=1",
                 headers={"x-requested-with": "fetch"},
-                timeout=60,
+                timeout=min(10, remaining),
             )
             response.raise_for_status()
             if "waste-service-name" in response.text:
                 break
-            time.sleep(2)
+            time.sleep(min(2, max(0, deadline - time.monotonic())))
+        else:
+            raise TimeoutError("Timed out waiting for Bexley waste data to load")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines
30 - 47, The async-poll loop (where session.get is called repeatedly and
response is parsed into BeautifulSoup) currently allows unbounded total runtime
and will parse a loader page if polling never completes; change it to use a
single monotonic deadline (time.monotonic()) and compute remaining_time for each
request, pass that as the timeout to session.get and cap time.sleep to
remaining_time, break and raise a timeout/error (stop parsing) when
remaining_time <= 0; update the loop that checks "waste-service-name" to raise
on exhaustion instead of proceeding to BeautifulSoup on a non-final response.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: befa4971-9568-48ff-b1b1-7277286624fb

📥 Commits

Reviewing files that changed from the base of the PR and between 2055b98 and 2f9dfbc.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py
