fix: DurhamCouncil — switch to Selenium for JS-rendered bin data #2047
InertiaUK wants to merge 1 commit into
Conversation
📝 Walkthrough

DurhamCouncil scraping is migrated from direct HTTP requests to Selenium WebDriver to render JavaScript-populated bin collection data. The test config now documents a web_driver endpoint; the implementation creates a WebDriver, waits for the bin elements, parses driver.page_source with BeautifulSoup, extracts dates via regex, and ensures driver.quit() in a finally block.

Changes: DurhamCouncil Selenium migration
Sequence Diagram

```mermaid
sequenceDiagram
    participant parse_data as parse_data method
    participant webdriver as WebDriver
    participant javascript as Durham website JS
    participant beautifulsoup as BeautifulSoup
    participant bins as bins data
    parse_data->>webdriver: create_webdriver(headless, web_driver)
    webdriver->>javascript: navigate to URL with uprn
    javascript->>webdriver: render .binsrubbish and .binsrecycling
    webdriver->>webdriver: wait for elements to be present
    parse_data->>beautifulsoup: parse driver.page_source
    beautifulsoup->>parse_data: return rendered HTML tree
    parse_data->>bins: extract dates with regex and append {type, collectionDate}
    parse_data->>webdriver: finally: driver.quit()
```
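The sequence above can be sketched end to end in plain Python. Everything here is an illustrative stand-in, not the project's actual code: `FakeDriver` replaces the real Selenium WebDriver, the HTML extraction uses the stdlib `re` module instead of BeautifulSoup, and the `date_format` default is assumed. Only the overall shape (create driver, navigate, read `page_source`, always `quit()` in `finally`) mirrors the walkthrough.

```python
import re
from datetime import datetime


class FakeDriver:
    """Illustrative stand-in for a Selenium WebDriver."""

    def __init__(self, page_source):
        self.page_source = page_source
        self.quit_called = False

    def get(self, url):
        pass  # a real driver would navigate and run the page's JS here

    def quit(self):
        self.quit_called = True


def parse_data(driver, uprn, date_format="%d/%m/%Y"):
    """Sketch of the Selenium-based flow: navigate, parse, always quit."""
    data = {"bins": []}
    try:
        driver.get(f"https://www.durham.gov.uk/bincollections?uprn={uprn}")
        # Real code inserts a WebDriverWait(...) here before reading page_source.
        for css_class, bin_type in [("binsrubbish", "Rubbish"), ("binsrecycling", "Recycling")]:
            m = re.search(rf'class="{css_class}"[^>]*>([^<]*)<', driver.page_source)
            if m:
                date_match = re.search(r"\d\d? [A-Za-z]+ \d{4}", m[1])
                if date_match:
                    date = datetime.strptime(date_match[0], "%d %B %Y")
                    data["bins"].append(
                        {"type": bin_type, "collectionDate": date.strftime(date_format)}
                    )
        return data
    finally:
        driver.quit()  # runs even if parsing raises


driver = FakeDriver('<div class="binsrubbish">Next: 3 March 2025</div>')
print(parse_data(driver, "100110"))
# -> {'bins': [{'type': 'Rubbish', 'collectionDate': '03/03/2025'}]}
```

The `finally` block is the key design point: the WebDriver is released whether parsing succeeds or raises.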
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master    #2047   +/- ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
```

☔ View full report in Codecov by Sentry.
Force-pushed 1d207b3 to e81ef13
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/DurhamCouncil.py`:
- Around line 44-53: The current parsing silently ignores non-matching but
non-empty collection_text; update the logic in the DurhamCouncil parser (the
block using variables collection_text, results, bin_type, date_format, and
appending to data["bins"]) to explicitly handle known non-date states (e.g.
recognised phrases like "no collections" or other council-specific tokens) and
for any other non-matching collection_text raise an exception (e.g. ValueError)
including the raw collection_text so scrapers fail noisily; keep the existing
successful path (re.search -> datetime.strptime -> append) unchanged but add
explicit checks before falling through to ensure unexpected formats are
surfaced.
- Around line 26-32: The current wait uses
WebDriverWait(...).until(EC.presence_of_element_located((By.CSS_SELECTOR,
".binsrubbish, .binsrecycling"))) which only ensures the placeholder divs exist
and can race before Durham's JS fills them; change the wait to assert that those
elements contain non-empty text (e.g., use EC.text_to_be_present_in_element for
a known substring or a custom lambda that checks element.text.strip() != "")
before creating BeautifulSoup(driver.page_source) in DurhamCouncil.py so soup
parses populated bin data rather than empty shells.
📒 Files selected for processing (2)

- uk_bin_collection/tests/input.json
- uk_bin_collection/uk_bin_collection/councils/DurhamCouncil.py

🚧 Files skipped from review as they are similar to previous changes (1)

- uk_bin_collection/tests/input.json
```diff
 WebDriverWait(driver, 30).until(
     EC.presence_of_element_located(
         (By.CSS_SELECTOR, ".binsrubbish, .binsrecycling")
     )
 )

 # Make a BS4 object
-soup = BeautifulSoup(page.text, features="html.parser")
+soup = BeautifulSoup(driver.page_source, features="html.parser")
```
Wait for populated bin text, not just the placeholder nodes.

The empty shell already contains these divs, so presence_of_element_located can succeed before Durham's JS has written any collection data. That makes driver.page_source racy and can still parse empty bins.
Suggested fix

```diff
-        WebDriverWait(driver, 30).until(
-            EC.presence_of_element_located(
-                (By.CSS_SELECTOR, ".binsrubbish, .binsrecycling")
-            )
-        )
+        WebDriverWait(driver, 30).until(
+            lambda d: any(
+                el.text.strip()
+                for el in d.find_elements(
+                    By.CSS_SELECTOR,
+                    ".binsrubbish, .binsrecycling, .binsgardenwaste",
+                )
+            )
+        )
```
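The lambda passed to `until(...)` is just a callable over the driver, so its logic can be exercised without a browser. A minimal sketch under stated assumptions: `FakeElement` and `FakeDriver` are hypothetical stand-ins for a WebElement and WebDriver, not part of Selenium or this project.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement exposing only .text."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Stand-in for a WebDriver whose find_elements returns canned elements."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, selector):
        return self._elements


def bins_are_populated(driver):
    """Wait predicate: True once any bin div carries non-empty text.

    With real Selenium this callable would be handed to
    WebDriverWait(driver, 30).until(bins_are_populated).
    """
    return any(
        el.text.strip()
        for el in driver.find_elements(
            "css selector", ".binsrubbish, .binsrecycling, .binsgardenwaste"
        )
    )


# Empty placeholder divs: the wait would keep polling.
assert not bins_are_populated(FakeDriver([FakeElement(""), FakeElement("  ")]))
# JS has filled in a date: the wait can return.
assert bins_are_populated(FakeDriver([FakeElement("Next collection: 3 March 2025")]))
```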
```diff
 if collection_text:
-    results = re.search("\\d\\d? [A-Za-z]+ \\d{4}", collection_text)
+    results = re.search(r"\d\d? [A-Za-z]+ \d{4}", collection_text)
     if results:
         date = datetime.strptime(results[0], "%d %B %Y")
-        if date:
-            data["bins"].append(
-                {
-                    "type": bin_type,
-                    "collectionDate": date.strftime(date_format),
-                }
-            )
+        data["bins"].append(
+            {
+                "type": bin_type,
+                "collectionDate": date.strftime(date_format),
+            }
+        )
```
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Don't silently drop non-empty collection text that stops matching the date regex.
If Durham changes the copy here, this returns partial data and looks like “no collections” instead of surfacing a scraper break. Handle any known non-date states explicitly, and raise on anything else.
Suggested fix

```diff
         if collection_text:
             results = re.search(r"\d\d? [A-Za-z]+ \d{4}", collection_text)
-            if results:
-                date = datetime.strptime(results[0], "%d %B %Y")
-                data["bins"].append(
-                    {
-                        "type": bin_type,
-                        "collectionDate": date.strftime(date_format),
-                    }
-                )
+            if not results:
+                raise ValueError(
+                    f"Unexpected Durham collection text for {bin_type}: {collection_text!r}"
+                )
+            date = datetime.strptime(results[0], "%d %B %Y")
+            data["bins"].append(
+                {
+                    "type": bin_type,
+                    "collectionDate": date.strftime(date_format),
+                }
+            )
```

Based on learnings: In uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors.
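The raise-on-mismatch behaviour can be checked in isolation. A minimal, self-contained sketch: `parse_collection_text` is a hypothetical helper (the project parses inline rather than via a function), and the `date_format` default is assumed for illustration; the regex and `%d %B %Y` format mirror the review comment.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d\d? [A-Za-z]+ \d{4}")


def parse_collection_text(collection_text, bin_type, date_format="%d/%m/%Y"):
    """Extract a collection date, failing loudly on unexpected copy."""
    results = DATE_RE.search(collection_text)
    if not results:
        # Surface scraper breaks instead of silently returning partial data.
        raise ValueError(
            f"Unexpected Durham collection text for {bin_type}: {collection_text!r}"
        )
    date = datetime.strptime(results[0], "%d %B %Y")
    return {"type": bin_type, "collectionDate": date.strftime(date_format)}


print(parse_collection_text("Your next collection is 3 March 2025", "Rubbish"))
# -> {'type': 'Rubbish', 'collectionDate': '03/03/2025'}
```

If Durham's copy changes to something like "No collections scheduled", this version raises with the raw text in the message instead of quietly returning partial data.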
The bin collection page at durham.gov.uk/bincollections?uprn= renders data client-side via JavaScript. The existing scraper used requests.get(), which only returned the empty HTML shell: the binsrubbish/binsrecycling/binsgardenwaste divs were present but unpopulated.

Switched to Selenium with create_webdriver() to render the JS. Waits for the bin data divs to appear, then parses with BeautifulSoup as before. The CSS class selectors and date parsing are unchanged.

Added web_driver to input.json config.
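Based on the description, the new entry in uk_bin_collection/tests/input.json presumably follows the existing per-council pattern. A hypothetical sketch only — the UPRN, URL, endpoint, and surrounding keys are illustrative placeholders, not the actual config:

```json
"DurhamCouncil": {
    "uprn": "100110123456",
    "url": "https://www.durham.gov.uk/bincollections?uprn=100110123456",
    "web_driver": "http://selenium:4444",
    "wiki_name": "Durham Council"
}
```

The web_driver value would point at a remote Selenium endpoint used when tests are not run against a local browser.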