
fix: EastRidingCouncil - replace Selenium with direct JSON API #1989

Closed
InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/EastRidingCouncil-json-api


Conversation

@InertiaUK (Contributor) commented May 1, 2026

Replaces the Selenium-based dropdown scraper with a direct call to the council's public JSON API.

Problem: The old scraper required the exact full address string (e.g. 14 THE LEASES BEVERLEY HU17 8LG) in the paon parameter for select_by_visible_text(). Any mismatch in formatting caused failures.

Fix: Calls wasterecyclingapi.eastriding.gov.uk/api/RecyclingData/CollectionsData directly with the postcode, then matches by house number against the HouseNameOrder field. Three-tier matching: exact HouseNameOrder, Address startswith, Address contains.

Changes:

  • EastRidingCouncil.py — full rewrite, requests-only (no Selenium)
  • input.json — updated to use house_number + postcode (was paon with full address)

Testing: Verified with HU17 8LG (14 The Leases) — returns 3 bins (Blue, Green, Brown).
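The extraction step described above can be sketched as follows. The BlueDate/GreenDate/BrownDate field names come from the PR description; the helper name, the sample record, and the output date format are illustrative assumptions, not the actual implementation:

```python
from datetime import datetime

# Field-to-bin-type mapping named in the PR (BlueDate/GreenDate/BrownDate).
BIN_DATE_FIELDS = {"BlueDate": "Blue", "GreenDate": "Green", "BrownDate": "Brown"}

def parse_collections(entry, date_format="%d/%m/%Y"):
    """Extract and sort bin collection dates from one matched address record."""
    bins = []
    for field, bin_type in BIN_DATE_FIELDS.items():
        date_str = entry.get(field)
        if not date_str:
            continue
        # API timestamps look like "2026-05-07T00:00:00" per the reviewed code.
        collection_date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S")
        bins.append(
            {"type": bin_type, "collectionDate": collection_date.strftime(date_format)}
        )
    bins.sort(key=lambda b: datetime.strptime(b["collectionDate"], date_format))
    return bins

# Illustrative record shaped like the API response described in the PR.
sample = {
    "Address": "14 THE LEASES BEVERLEY HU17 8LG",
    "BlueDate": "2026-05-07T00:00:00",
    "GreenDate": "2026-05-14T00:00:00",
    "BrownDate": "2026-05-07T00:00:00",
}
print(parse_collections(sample))
```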

Summary by CodeRabbit

  • Refactor

    • Bin collection schedule retrieval has been reimplemented to use direct API integration instead of web scraping, improving reliability, retrieval speed, and error handling.
  • Tests

    • Test configuration updated for the new API endpoint and the revised address field format.

The old scraper used Selenium to interact with a dropdown, requiring the
exact full address string in paon (e.g. "14 THE LEASES BEVERLEY HU17 8LG").
This was fragile and broke when the address format didn't match exactly.

Rewritten to call the council's public JSON API directly:
- No Selenium dependency
- Accepts standard postcode + house_number parameters
- Three-tier address matching: HouseNameOrder exact → Address startswith → Address contains
- Handles single-result postcodes without house_number
- Returns Blue/Green/Brown bin dates from API response
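The three-tier matching described in the commit message can be sketched roughly as below; the record shapes are illustrative and the real code lives in EastRidingCouncil.py:

```python
from typing import Optional

def match_address(entries: list, user_paon: str) -> Optional[dict]:
    """Three-tier match: exact HouseNameOrder, then Address startswith, then contains."""
    paon_upper = user_paon.strip().upper()
    # Tier 1: exact match on HouseNameOrder (most reliable)
    for entry in entries:
        if (entry.get("HouseNameOrder") or "").strip().upper() == paon_upper:
            return entry
    # Tier 2: Address starts with the supplied house number/name
    for entry in entries:
        if (entry.get("Address") or "").upper().startswith(paon_upper):
            return entry
    # Tier 3: Address contains it anywhere
    for entry in entries:
        if paon_upper in (entry.get("Address") or "").upper():
            return entry
    return None

entries = [
    {"HouseNameOrder": "12", "Address": "12 THE LEASES BEVERLEY HU17 8LG"},
    {"HouseNameOrder": "14", "Address": "14 THE LEASES BEVERLEY HU17 8LG"},
]
print(match_address(entries, "14")["Address"])
```

Note that the review on this PR flags the startswith/contains tiers as risky for short inputs, since "1" would match "10 ..." as well.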

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

The EastRidingCouncil bin collection scraper has been migrated from Selenium-driven HTML scraping to a direct HTTP API call. Test input data has been updated with a simplified house number and corrected API endpoint URL to reflect the new implementation approach.

Changes

Test Data Updates: uk_bin_collection/tests/input.json
Simplified house_number from the full address string to "14" and updated the url endpoint from wasterecyclingapi to www.eastriding.gov.uk; removed a trailing newline.

API Migration: uk_bin_collection/uk_bin_collection/councils/EastRidingCouncil.py
Replaced Selenium/HTML parsing with a direct HTTP API integration using a fixed endpoint and credentials (APIKey, Licensee). Implements postcode normalization and address matching (HouseNameOrder exact match with Address fallback), and extracts BlueDate/GreenDate/BrownDate fields into typed bin entries with parsed collection dates; removed all browser automation and HTML parsing code.
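The postcode normalization mentioned in this summary is presumably along these lines; this is an assumption, and the actual helper in EastRidingCouncil.py may differ:

```python
def normalise_postcode(postcode: str) -> str:
    """Collapse internal whitespace and upper-case, e.g. 'hu17 8lg' -> 'HU178LG'."""
    return "".join(postcode.split()).upper()

print(normalise_postcode("hu17 8lg"))  # → HU178LG
```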

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • #1715: Converts council scraper from Selenium/HTML scraping to API-based implementation with similar CouncilClass.parse_data rewrite pattern.
  • #1742: Refactors another council's parse_data from browser automation to direct API calls with JSON response handling.

Suggested reviewers

  • dp247

Poem

🐰 From tangled webs of Selenium dreams,
To API calls with cleaner streams,
East Riding's bins now answer quick,
No more browser tricks to click!
Dates cascade in JSON light! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title check: ✅ Passed. The title accurately summarizes the main change, replacing Selenium-based scraping with a direct JSON API call for EastRidingCouncil, and reflects the core modification in both the test data and the implementation.
  • Linked Issues check: ✅ Passed (check skipped; no linked issues were found for this pull request).
  • Out of Scope Changes check: ✅ Passed (check skipped; no linked issues were found).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/tests/input.json`:
- Around line 809-815: The fixture's guidance is stale: update the wiki_note to
reflect the new input contract that uses discrete fields like house_number
(instead of the old full-dropdown Selenium address) and remove the unused
Selenium metadata; specifically change the wiki_note for the East Riding entry
(wiki_name "East Riding of Yorkshire") to instruct contributors to supply the
discrete address fields (house_number, postcode, etc.) consistent with
skip_get_url: true and delete the web_driver property if the site no longer
requires Selenium.

In `@uk_bin_collection/uk_bin_collection/councils/EastRidingCouncil.py`:
- Around line 97-117: The loop over BIN_DATE_FIELDS currently swallows
ValueError/TypeError and silently drops malformed dates; instead, stop
swallowing errors—when datetime.strptime for a given field fails, raise a new
exception (or re-raise) with a clear message including the field name, the
offending date_str and the matched record so format drift is visible; update the
code around BIN_DATE_FIELDS, matched, collection_date and the assembly of
data["bins"] to validate/convert dates and fail loudly rather than continue, and
ensure the subsequent sort still assumes all collectionDate values are valid (so
no silent omissions occur).
- Around line 57-78: The current PAON fallback matching in EastRidingCouncil.py
(the user_paon handling that iterates entries and checks HouseNameOrder and
Address) is too permissive and can wrongly match short inputs; tighten it by
first validating user_paon (ignore empty/whitespace, require a minimum
meaningful length or digits when numeric), then change the Address fallback
checks to use bounded/whole-token matching (e.g., word-boundary or token
equality against split Address tokens) instead of plain startswith/contains, and
only accept startswith/contains if user_paon meets the minimum length check;
update the three matching blocks (HouseNameOrder exact, Address startswith,
Address contains) to apply these validation rules so you fail-safe (leave
matched None) rather than returning a likely-wrong first hit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 27539691-7fe4-4bd0-8f67-c9400dd45e6d

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and bc6640c.

📒 Files selected for processing (2)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/EastRidingCouncil.py

Comment on lines +809 to 815
"house_number": "14",
"postcode": "HU17 8LG",
"skip_get_url": true,
-"url": "https://wasterecyclingapi.eastriding.gov.uk",
+"url": "https://www.eastriding.gov.uk",
"web_driver": "http://selenium:4444",
"wiki_name": "East Riding of Yorkshire",
"wiki_note": "Put the full address as it displays on the council website dropdown when you do the check manually."

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the East Riding fixture guidance to match the new input contract.

This block now uses house_number: "14", but the wiki_note still tells users to pass the full dropdown address from the old Selenium flow. That mismatch will send users down the wrong path. If East Riding no longer needs Selenium at all, this is also a good place to drop the stale web_driver metadata.


Comment on lines +57 to +78
        if user_paon:
            paon_upper = user_paon.strip().upper()
            # Try exact match on HouseNameOrder first (most reliable)
            for entry in entries:
                house_name = (entry.get("HouseNameOrder") or "").strip().upper()
                if house_name == paon_upper:
                    matched = entry
                    break
            # Fall back to Address startswith
            if not matched:
                for entry in entries:
                    address = (entry.get("Address") or "").upper()
                    if address.startswith(paon_upper):
                        matched = entry
                        break
            # Fall back to Address contains
            if not matched:
                for entry in entries:
                    address = (entry.get("Address") or "").upper()
                    if paon_upper in address:
                        matched = entry
                        break

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten the fallback address match before selecting a property.

The startswith/contains fallbacks can silently pick the wrong address for short or blank-ish inputs, e.g. "1" matching "10 ..." or "11 ...". At that point returning the first hit is worse than failing, because it gives the user another household’s collection dates.

Suggested direction
+import re
...
         if user_paon:
             paon_upper = user_paon.strip().upper()
+            if not paon_upper:
+                raise ValueError("House number/name cannot be blank")

             # Try exact match on HouseNameOrder first (most reliable)
             for entry in entries:
                 house_name = (entry.get("HouseNameOrder") or "").strip().upper()
                 if house_name == paon_upper:
                     matched = entry
                     break
-            # Fall back to Address startswith
+
+            def address_starts_with_paon(entry):
+                address = (entry.get("Address") or "").upper()
+                return re.match(rf"^{re.escape(paon_upper)}(?:\b|[, ])", address)
+
             if not matched:
-                for entry in entries:
-                    address = (entry.get("Address") or "").upper()
-                    if address.startswith(paon_upper):
-                        matched = entry
-                        break
-            # Fall back to Address contains
-            if not matched:
-                for entry in entries:
-                    address = (entry.get("Address") or "").upper()
-                    if paon_upper in address:
-                        matched = entry
-                        break
+                candidates = [entry for entry in entries if address_starts_with_paon(entry)]
+                if len(candidates) == 1:
+                    matched = candidates[0]
+                elif len(candidates) > 1:
+                    raise ValueError(
+                        f"Ambiguous address match for '{user_paon}' at postcode {user_postcode}"
+                    )

Comment on lines +97 to +117
        # Extract bin collection dates from the matched entry
        for field, bin_type in BIN_DATE_FIELDS.items():
            date_str = matched.get(field)
            if not date_str:
                continue
            try:
                collection_date = datetime.strptime(
                    date_str, "%Y-%m-%dT%H:%M:%S"
                )
                data["bins"].append(
                    {
                        "type": bin_type,
                        "collectionDate": collection_date.strftime(date_format),
                    }
                )
            except (ValueError, TypeError):
                continue

        data["bins"].sort(
            key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
        )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t swallow invalid collection dates here.

If East Riding changes one of these fields, the parser currently drops that bin silently and can return a partial or empty schedule. This project usually benefits from failing loudly on unexpected council payloads so format drift is caught immediately.

Suggested change
         for field, bin_type in BIN_DATE_FIELDS.items():
             date_str = matched.get(field)
             if not date_str:
                 continue
             try:
                 collection_date = datetime.strptime(
                     date_str, "%Y-%m-%dT%H:%M:%S"
                 )
-                data["bins"].append(
-                    {
-                        "type": bin_type,
-                        "collectionDate": collection_date.strftime(date_format),
-                    }
-                )
-            except (ValueError, TypeError):
-                continue
+            except (ValueError, TypeError) as exc:
+                raise ValueError(
+                    f"Unexpected {field} value {date_str!r} for postcode {user_postcode}"
+                ) from exc
+
+            data["bins"].append(
+                {
+                    "type": bin_type,
+                    "collectionDate": collection_date.strftime(date_format),
+                }
+            )
+
+        if not data["bins"]:
+            raise ValueError(
+                f"No collection dates found for matched address at postcode {user_postcode}"
+            )

Based on learnings: in uk_bin_collection/**/*.py, prefer explicit failures over silent error handling when parsing council data so upstream format changes are detected early.


@robbrad robbrad mentioned this pull request May 1, 2026
@robbrad (Owner) commented May 1, 2026

Included in May 2026 Release PR #1992. Closing.

@robbrad robbrad closed this May 1, 2026
pull Bot pushed a commit to mrw298/UKBinCollectionData that referenced this pull request May 1, 2026
@robbrad robbrad mentioned this pull request May 2, 2026