fix: DurhamCouncil — switch to Selenium for JS-rendered bin data#2047

Open
InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/durham-selenium-render

Conversation

@InertiaUK
Contributor

@InertiaUK InertiaUK commented May 12, 2026

The bin collection page at durham.gov.uk/bincollections?uprn= renders data client-side via JavaScript. The existing scraper used requests.get() which only returned the empty HTML shell — the binsrubbish/binsrecycling/binsgardenwaste divs were present but unpopulated.

Switched to Selenium with create_webdriver() to render the JS. Waits for the bin data divs to appear, then parses with BeautifulSoup as before. The CSS class selectors and date parsing are unchanged.

Added web_driver to input.json config.
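
For reference, the resulting input.json entry might look like the sketch below. The web_driver value (http://selenium:4444) and the wiki_note wording are taken from the test config described in this PR; the remaining fields and their values are illustrative placeholders, not the actual entry:

```json
{
  "DurhamCouncil": {
    "uprn": "<your-uprn>",
    "url": "https://www.durham.gov.uk/bincollections?uprn=",
    "web_driver": "http://selenium:4444",
    "wiki_name": "Durham Council",
    "wiki_note": "Selenium is required to render the JS-populated bin data."
  }
}
```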

Summary by CodeRabbit

  • Refactor
    • Durham Council bin collection retrieval updated to use a JavaScript-rendering workflow for more reliable schedule and date extraction.
  • Documentation
    • Test configuration and council note updated to reflect the new rendering approach used for Durham's data.


@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

DurhamCouncil scraping is migrated from direct HTTP requests to Selenium WebDriver to render JavaScript-populated bin collection data. Test config now documents a web_driver endpoint; the implementation creates a WebDriver, waits for bin elements, parses driver.page_source with BeautifulSoup, extracts dates via regex, and ensures driver.quit() in a finally block.

Changes

DurhamCouncil Selenium Migration

Layer / File(s) Summary
Test configuration and Selenium requirement documentation
uk_bin_collection/tests/input.json
Test configuration adds web_driver endpoint (http://selenium:4444) and updates wiki_note to indicate Selenium is required for rendering JS-populated bin data.
Selenium WebDriver-based parsing implementation
uk_bin_collection/uk_bin_collection/councils/DurhamCouncil.py
The parse_data method is rewritten to create a WebDriver from optional config, construct the target URL using uprn, wait for .binsrubbish and .binsrecycling elements to render, parse the rendered page with BeautifulSoup, extract collection dates with a raw-string regex, and ensure the driver is quit in a finally block.

Sequence Diagram

sequenceDiagram
  participant parse_data as parse_data method
  participant webdriver as WebDriver
  participant javascript as Durham website JS
  participant beautifulsoup as BeautifulSoup
  participant bins as bins data
  parse_data->>webdriver: create_webdriver(headless, web_driver)
  webdriver->>javascript: navigate to URL with uprn
  javascript->>webdriver: render .binsrubbish and .binsrecycling
  webdriver->>webdriver: wait for elements to be present
  parse_data->>beautifulsoup: parse driver.page_source
  beautifulsoup->>parse_data: return rendered HTML tree
  parse_data->>bins: extract dates with regex and append {type, collectionDate}
  parse_data->>webdriver: finally: driver.quit()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • robbrad/UKBinCollectionData#1765: Updates create_webdriver in common.py with single driver return and window positioning, directly supporting the Selenium migration in DurhamCouncil.

Suggested reviewers

  • dp247

Poem

🐰 Durham's bins now shimmer on the screen,
With Selenium's magic, rendered so clean,
JS wakes up, and WebDriver peeks to see,
Rubbish and recycling revealed with glee,
The rabbit hops: "Tests pass — tea for me!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the title accurately describes the main change: switching DurhamCouncil from requests to Selenium to handle JS-rendered bin data.
  • Linked Issues Check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: skipped because no linked issues were found for this pull request.




@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (8ecf878) to head (e81ef13).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2047   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152           


@InertiaUK force-pushed the fix/durham-selenium-render branch from 1d207b3 to e81ef13 on May 12, 2026 at 15:24
Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/DurhamCouncil.py`:
- Around line 44-53: The current parsing silently ignores non-matching but
non-empty collection_text; update the logic in the DurhamCouncil parser (the
block using variables collection_text, results, bin_type, date_format, and
appending to data["bins"]) to explicitly handle known non-date states (e.g.
recognised phrases like "no collections" or other council-specific tokens) and
for any other non-matching collection_text raise an exception (e.g. ValueError)
including the raw collection_text so scrapers fail noisily; keep the existing
successful path (re.search -> datetime.strptime -> append) unchanged but add
explicit checks before falling through to ensure unexpected formats are
surfaced.
- Around line 26-32: The current wait uses
WebDriverWait(...).until(EC.presence_of_element_located((By.CSS_SELECTOR,
".binsrubbish, .binsrecycling"))) which only ensures the placeholder divs exist
and can race before Durham's JS fills them; change the wait to assert that those
elements contain non-empty text (e.g., use EC.text_to_be_present_in_element for
a known substring or a custom lambda that checks element.text.strip() != "")
before creating BeautifulSoup(driver.page_source) in DurhamCouncil.py so soup
parses populated bin data rather than empty shells.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3cbe3cf6-c2d9-46b9-8b93-3161c049bf10

📥 Commits

Reviewing files that changed from the base of the PR and between 1d207b3 and e81ef13.

📒 Files selected for processing (2)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/DurhamCouncil.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • uk_bin_collection/tests/input.json

Comment on lines +26 to +32

            WebDriverWait(driver, 30).until(
                EC.presence_of_element_located(
                    (By.CSS_SELECTOR, ".binsrubbish, .binsrecycling")
                )
            )

            # Make a BS4 object
-           soup = BeautifulSoup(page.text, features="html.parser")
+           soup = BeautifulSoup(driver.page_source, features="html.parser")
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for populated bin text, not just the placeholder nodes.

The empty shell already contains these divs, so presence_of_element_located can succeed before Durham’s JS has written any collection data. That makes driver.page_source racey and can still parse empty bins.
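
The race can be demonstrated without a browser. In the sketch below, FakeElement and FakeDriver are illustrative stand-ins for Selenium's element and driver objects (not part of this project), used to compare a presence-style predicate against a populated-text predicate:

```python
# Sketch: why waiting for element *presence* races against JS population.
# FakeElement/FakeDriver are illustrative stand-ins for Selenium objects.

class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, selector):
        return self._elements

# Predicate equivalent to EC.presence_of_element_located: any element exists.
def presence(d):
    return bool(d.find_elements("css selector", ".binsrubbish, .binsrecycling"))

# Stricter predicate: at least one matched element has non-empty text.
def populated(d):
    return any(
        el.text.strip()
        for el in d.find_elements("css selector", ".binsrubbish, .binsrecycling")
    )

# Empty shell: the divs exist but JS has not written any collection data yet.
shell = FakeDriver([FakeElement(""), FakeElement("  ")])
# Rendered page: JS has filled in a collection date.
rendered = FakeDriver([FakeElement("Tuesday 13 May 2026"), FakeElement("")])

print(presence(shell))      # True  -> presence check passes too early
print(populated(shell))     # False -> text check keeps waiting
print(populated(rendered))  # True  -> text check passes once data arrives
```

Passing a callable like populated to WebDriverWait(...).until(...) works because until accepts any predicate that takes the driver and returns a truthy value.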

Suggested fix
-            WebDriverWait(driver, 30).until(
-                EC.presence_of_element_located(
-                    (By.CSS_SELECTOR, ".binsrubbish, .binsrecycling")
-                )
-            )
+            WebDriverWait(driver, 30).until(
+                lambda d: any(
+                    el.text.strip()
+                    for el in d.find_elements(
+                        By.CSS_SELECTOR,
+                        ".binsrubbish, .binsrecycling, .binsgardenwaste",
+                    )
+                )
+            )

Comment on lines 44 to +53
if collection_text:
results = re.search("\\d\\d? [A-Za-z]+ \\d{4}", collection_text)
results = re.search(r"\d\d? [A-Za-z]+ \d{4}", collection_text)
if results:
date = datetime.strptime(results[0], "%d %B %Y")
if date:
data["bins"].append(
{
"type": bin_type,
"collectionDate": date.strftime(date_format),
}
)
data["bins"].append(
{
"type": bin_type,
"collectionDate": date.strftime(date_format),
}
)
Contributor

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Don't silently drop non-empty collection text that stops matching the date regex.

If Durham changes the copy here, this returns partial data and looks like “no collections” instead of surfacing a scraper break. Handle any known non-date states explicitly, and raise on anything else.

Suggested fix
                 if collection_text:
                     results = re.search(r"\d\d? [A-Za-z]+ \d{4}", collection_text)
-                    if results:
-                        date = datetime.strptime(results[0], "%d %B %Y")
-                        data["bins"].append(
-                            {
-                                "type": bin_type,
-                                "collectionDate": date.strftime(date_format),
-                            }
-                        )
+                    if not results:
+                        raise ValueError(
+                            f"Unexpected Durham collection text for {bin_type}: {collection_text!r}"
+                        )
+                    date = datetime.strptime(results[0], "%d %B %Y")
+                    data["bins"].append(
+                        {
+                            "type": bin_type,
+                            "collectionDate": date.strftime(date_format),
+                        }
+                    )

Based on learnings: In uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors.
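
The strict behaviour can be exercised in isolation. This sketch mirrors the suggested logic using the regex and date format from the diff; parse_collection is a hypothetical helper name, and the date_format value shown is an assumption rather than the project's actual constant:

```python
import re
from datetime import datetime

# Assumed output format; the real value comes from project constants.
date_format = "%d/%m/%Y"

def parse_collection(bin_type, collection_text):
    """Hypothetical helper mirroring the suggested strict parsing logic."""
    results = re.search(r"\d\d? [A-Za-z]+ \d{4}", collection_text)
    if not results:
        # Fail noisily instead of silently dropping unexpected copy.
        raise ValueError(
            f"Unexpected Durham collection text for {bin_type}: {collection_text!r}"
        )
    date = datetime.strptime(results[0], "%d %B %Y")
    return {"type": bin_type, "collectionDate": date.strftime(date_format)}

print(parse_collection("Rubbish", "Your next collection is 13 May 2026"))
# -> {'type': 'Rubbish', 'collectionDate': '13/05/2026'}

try:
    parse_collection("Recycling", "Collections are suspended")
except ValueError as exc:
    print(exc)  # surfaces the scraper break instead of returning partial data
```

With this shape, a wording change on Durham's page raises immediately rather than producing a partial result that looks like an empty schedule.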

