fix: EastLothianCouncil - rewrite for Drupal migration (ASP site gone)#1942

Closed
InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/EastLothianCouncil

Conversation

@InertiaUK
Contributor

@InertiaUK InertiaUK commented Apr 10, 2026

East Lothian migrated from their old ASP.NET site to a Drupal-based platform. The old scraper relied on ASP form state that no longer exists.

The new site uses a 3-step form flow with CSRF tokens. The rewrite follows the form steps: POST the postcode, select the address from the results, then parse the schedule table from the final page. CSRF tokens are extracted from hidden inputs on each step.

Tested with an EH postcode in East Lothian.
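
The token handling described above can be sketched in isolation. This is a minimal stdlib-only illustration of pulling a Drupal form_build_id out of a step's HTML; the hidden-input name and the helper itself are assumptions based on this description, and the actual scraper uses requests and BeautifulSoup rather than html.parser.

```python
from html.parser import HTMLParser


class FormBuildIdExtractor(HTMLParser):
    """Collect the value of the hidden form_build_id input, if present."""

    def __init__(self):
        super().__init__()
        self.form_build_id = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "form_build_id":
            self.form_build_id = attr_map.get("value")


def extract_form_build_id(html: str) -> str:
    """Return the Drupal form token for the current step, or raise."""
    parser = FormBuildIdExtractor()
    parser.feed(html)
    if parser.form_build_id is None:
        raise ValueError("form_build_id not found - page structure may have changed")
    return parser.form_build_id
```

Each of the three POSTs would re-run this extraction against the previous response before submitting the next step.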

Summary by CodeRabbit

Bug Fixes

  • Improved reliability of East Lothian Council bin collection schedule retrieval with enhanced address matching accuracy.
  • More robust extraction of collection dates and schedule information from council systems.
  • Enhanced error reporting when schedules cannot be retrieved.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

Walkthrough

The EastLothianCouncil module's collection data parsing was refactored to replace a two-step direct AJAX endpoint flow with a session-based multi-request interaction against a new base URL. The new logic captures form identifiers, fetches an address selection list via postcode, matches addresses using normalized text comparison, and parses collection dates from <time datetime> elements instead of previously parsed human-readable strings.
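
The date-source change mentioned here can be shown on its own. A minimal sketch, assuming the project's output date_format is the usual UK %d/%m/%Y; the helper name is hypothetical:

```python
from datetime import datetime

# Assumed output format; the project defines its own date_format constant
DATE_FORMAT = "%d/%m/%Y"


def parse_collection_date(datetime_attr: str) -> str:
    # The <time datetime="..."> attribute is machine-readable ISO, so no
    # ordinal-suffix stripping ("3rd", "21st") is needed, unlike the old
    # human-readable strings
    return datetime.strptime(datetime_attr, "%Y-%m-%d").strftime(DATE_FORMAT)
```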

Changes

Cohort / File(s): EastLothianCouncil request and parsing logic
uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py
Summary: Rewrote the parse_data() method to implement a session-based flow: captures form_build_id from the base page, posts the postcode to retrieve the address list, performs normalized address matching with special token handling, derives the action URL from the form, posts the uprn to fetch the schedule, and parses collection dates from <time> elements rather than parsed date strings. Added form/build identifier validation and enhanced error handling.
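
The "normalized address matching" step can be sketched like this. A hypothetical stand-alone version; the real method's normalization rules and special token handling may differ:

```python
import re


def normalise(address: str) -> str:
    # Lowercase, drop punctuation, collapse runs of whitespace
    stripped = re.sub(r"[^\w\s]", "", address.lower())
    return re.sub(r"\s+", " ", stripped).strip()


def match_address(target: str, options: dict) -> str:
    """Map a user-supplied address to a UPRN from the council's option list.

    options maps display text to UPRN. An exact normalized match is tried
    first, then a substring fallback.
    """
    want = normalise(target)
    for text, uprn in options.items():
        if normalise(text) == want:
            return uprn
    for text, uprn in options.items():
        if want in normalise(text):
            return uprn
    raise ValueError(f"No address matching '{target}' in council address list")
```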

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • dp247

Poem

🐰 Hop! Skip! The forms are rearranged,
Session flows now gracefully exchanged,
Dates jump from strings to datetime's embrace,
Addresses matched with normalization's grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly describes the main change: a rewrite of the EastLothianCouncil scraper to handle the migration from ASP to Drupal, with the parenthetical note explaining why the change was necessary.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py (2)

95-97: Consider extracting the domain as a constant.

The domain https://collectiondates.eastlothian.gov.uk appears at both line 22 (in base_url) and line 97. Extracting it would reduce duplication.

♻️ Suggested refactor
+        domain = "https://collectiondates.eastlothian.gov.uk"
+        base_url = f"{domain}/waste-collection-schedule"
         ...
         action_url = form.get("action", "")
         if action_url.startswith("/"):
-            action_url = "https://collectiondates.eastlothian.gov.uk" + action_url
+            action_url = domain + action_url
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py` around
lines 95 - 97, Extract the repeated domain string
"https://collectiondates.eastlothian.gov.uk" into a single constant (e.g.,
EASTLOTHIAN_DOMAIN or BASE_DOMAIN) at module scope in EastLothianCouncil.py and
replace the hard-coded occurrences used to build base_url and to prefix
action_url in the form-processing code; update the logic that currently does
action_url = "https://collectiondates.eastlothian.gov.uk" + action_url to use
the new constant (preserve the leading-slash handling) and replace the base_url
construction to reference the same constant so both locations share the single
source of truth.

27-31: Consider using a context manager for the session.

The requests.Session() is created but never explicitly closed. While Python's garbage collector will eventually handle it, using a context manager ensures connections are properly released.

♻️ Suggested refactor using context manager

The session usage spans the entire method, so wrap the logic in a with block:

-        session = requests.Session()
-
-        # Step 1: GET the page to obtain session cookie and form_build_id
-        response = session.get(base_url, headers=headers)
+        with requests.Session() as session:
+            # Step 1: GET the page to obtain session cookie and form_build_id
+            response = session.get(base_url, headers=headers)

Then indent the remaining session-dependent code within the with block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py` around
lines 27 - 31, The session created with requests.Session() in
EastLothianCouncil.py should be used as a context manager to ensure it is
closed: replace the standalone session = requests.Session() with a with
requests.Session() as session: block and indent all subsequent session-dependent
calls (e.g., session.get(base_url, headers=headers), response handling, and any
further requests) inside that block so connections are properly released when
the method (or function) finishes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2dcd0dc2-834e-4c6c-baa0-4eb9868d61b6

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and 6cf1af0.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py

Comment on lines +118 to 142

        # Parse collection dates from <time datetime="YYYY-MM-DD"> elements
        days = soup.find_all("li", class_="waste-collection__day")
        for day in days:
            time_el = day.find("time")
            type_el = day.find("span", class_="waste-collection__day--type")

            if not time_el or not type_el:
                continue

            date_str = time_el.get("datetime", "")
            waste_type = type_el.text.strip()

            if not date_str or not waste_type:
                continue

            try:
                collection_date = datetime.strptime(date_str, "%Y-%m-%d")
                bindata["bins"].append(
                    {
                        "type": waste_type,
                        "collectionDate": collection_date.strftime(date_format),
                    }
                )
            except ValueError:
                continue
Contributor


⚠️ Potential issue | 🟡 Minor

Silent failures may mask parsing issues.

The code silently continues when elements are missing (lines 124-125, 130-131) or when date parsing fails (lines 141-142). If the council changes their HTML structure, this would return an empty result with no indication of what went wrong.

Based on learnings from this repository: prefer explicit failures over silent defaults to ensure format changes are detected early.

🛡️ Suggested improvement: fail if no valid entries are parsed
         # Parse collection dates from <time datetime="YYYY-MM-DD"> elements
         days = soup.find_all("li", class_="waste-collection__day")
+        if not days:
+            raise ValueError("No collection day elements found - page structure may have changed")
+
         for day in days:
             time_el = day.find("time")
             type_el = day.find("span", class_="waste-collection__day--type")

             if not time_el or not type_el:
                 continue

             date_str = time_el.get("datetime", "")
             waste_type = type_el.text.strip()

             if not date_str or not waste_type:
                 continue

             try:
                 collection_date = datetime.strptime(date_str, "%Y-%m-%d")
                 bindata["bins"].append(
                     {
                         "type": waste_type,
                         "collectionDate": collection_date.strftime(date_format),
                     }
                 )
             except ValueError:
                 continue

+        if not bindata["bins"]:
+            raise ValueError(
+                f"No valid collection dates parsed for {user_postcode} - page structure may have changed"
+            )
+
         bindata["bins"].sort(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/EastLothianCouncil.py` around
lines 118 - 142, The parsing loop over days (variable days) currently swallows
missing elements (time_el/type_el) and parsing errors (datetime.strptime
ValueError) which can hide HTML changes; update the loop in
EastLothianCouncil.py to record/log parsing problems and fail fast: when time_el
or type_el are missing or date_str/waste_type are empty, append a descriptive
error entry or raise an exception (including the offending day's HTML/snippet)
instead of continue, and when datetime.strptime raises ValueError capture the
error and include the raw date_str in the error/exception; after the loop, if
bindata["bins"] is empty, raise an explicit error indicating no valid collection
entries were parsed so CI/consumers surface format changes.

@robbrad robbrad mentioned this pull request May 1, 2026
@robbrad
Copy link
Copy Markdown
Owner

robbrad commented May 1, 2026

Included in May 2026 Release PR #1992. Closing.

@robbrad robbrad closed this May 1, 2026
@robbrad robbrad mentioned this pull request May 2, 2026