
fix: EnvironmentFirst/Eastbourne/Lewes - handle restructured page #1979

Closed
InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/environmentfirst-page-restructure


Conversation

@InertiaUK
Contributor

@InertiaUK InertiaUK commented Apr 25, 2026

The environmentfirst.co.uk results page (used directly by EnvironmentFirst and via the same parser by EastbourneBoroughCouncil + LewesDistrictCouncil) was restructured in April 2026:

  • The bin dates are now wrapped in a `<strong>` inside the same `<p>` that carries the label. The previous separator paragraphs are gone, so the fixed indices `page_text[2]` / `[4]` / `[6]` no longer line up — they're now `[1]`/`[2]`/`[3]`.
  • The date string also gained a leading weekday name. Values like `Tuesday 28th April 2026` no longer match `'%d %B %Y'` and the parser raises `ValueError: time data 'Tuesday 28 April 2026' does not match format '%d %B %Y'`.

Rewritten to identify each row by the surrounding label (rubbish / recycling / garden) instead of a fixed paragraph index, and to extract the date with a regex (\d{1,2}(?:st|nd|rd|th)?\s+\w+\s+\d{4}) so the weekday prefix is harmlessly skipped.
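For illustration, a minimal sketch of the date extraction described above. The inline `re.sub` stands in for the project's `remove_ordinal_indicator_from_date_string` helper, and the output format string is an assumption:

```python
import re
from datetime import datetime

# Regex from the PR: day with optional ordinal suffix, month word, 4-digit year.
# The search skips any leading weekday name ("Tuesday ...") automatically.
DATE_PATTERN = re.compile(r"(\d{1,2}(?:st|nd|rd|th)?\s+\w+\s+\d{4})")


def extract_date(text: str) -> str:
    """Pull the date out of strings like 'Tuesday 28th April 2026'."""
    match = DATE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    # Stand-in for remove_ordinal_indicator_from_date_string.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", match.group(1))
    return datetime.strptime(cleaned, "%d %B %Y").strftime("%d/%m/%Y")


print(extract_date("Tuesday 28th April 2026"))  # 28/04/2026
```

Because the pattern searches rather than anchors, it tolerates both the old (`28 April 2026`) and new (`Tuesday 28th April 2026`) shapes of the page.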

Verified against UPRN 100060011258 (Eastbourne) and 100061930155 (Lewes) — both return their full schedules including the garden waste line.

Summary by CodeRabbit

Bug Fixes

  • Bin collection date extraction for Eastbourne Borough Council, Environment First, and Lewes District Council now tolerates variations in the results page's HTML structure and date formats (leading weekday names, ordinal suffixes), with better recovery when entries are missing or malformed.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough


Three council bin collection parsers were refactored from rigid index-based HTML extraction to flexible, content-driven parsing. They now dynamically iterate over paragraph elements within div.collect containers, determine bin types via keyword matching, and handle ordinal date suffixes with regex normalization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Council Bin Collection Parser Refactoring**<br>`uk_bin_collection/councils/EastbourneBoroughCouncil.py`, `EnvironmentFirst.py`, `LewesDistrictCouncil.py` | Replaced fragile index-based extraction (`page_text[2]`, `[4]`, `[6]`) with iteration over `<p>` tags in `div.collect`. Bin types are now detected via keyword matching (rubbish, recycling, garden), dates are extracted using a regex supporting ordinal suffixes, and normalization/parsing is applied per entry with graceful handling of missing or invalid data. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • dp247

Poem

🐰 Three parsers hop with newfound grace,
No more index-based disgrace,
With regex and keywords in hand,
Our bin dates now parse oh so grand!
Ordinals normalized—what a sight,
Robust parsing done right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing parsers for three councils to handle a restructured results page. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/EnvironmentFirst.py (1)

1-1: Recommend extracting the shared parser to avoid maintaining three copies.

The leading comment ("Legacy script. Copied to Lewes and Eastbourne.") and the byte-for-byte identical parse blocks across EnvironmentFirst.py, LewesDistrictCouncil.py and EastbourneBoroughCouncil.py mean this fix has had to be applied three times — and any future page change will need three identical edits. Consider hoisting the `div.collect` → `<p>` → `<strong>` parsing into a single helper (e.g. in common.py or a new _environmentfirst.py module) and have all three council classes delegate to it. The PR description itself notes the page layout has shifted twice in twelve months, so this is a recurring cost.
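A sketch of what that shared helper might look like. The function name, module placement, error messages, and default date format are all illustrative, not the project's actual API, and the inline ordinal regex stands in for `remove_ordinal_indicator_from_date_string`:

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"(\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4})")
BIN_KEYWORDS = (("rubbish", "Rubbish"), ("recycling", "Recycling"), ("garden", "Garden"))


def parse_environmentfirst_schedule(soup, date_format="%d/%m/%Y"):
    """Walk div.collect -> <p> -> <strong> and build the {'bins': [...]} payload."""
    data = {"bins": []}
    collect_div = soup.find("div", {"class": "collect"})
    if collect_div is None:
        raise ValueError("div.collect not found - page layout may have changed")
    for p in collect_div.find_all("p"):
        strong = p.find("strong")
        if not strong:
            continue
        label = p.get_text(" ", strip=True).lower()
        # Map the paragraph's label text to a bin type via keyword matching.
        bin_type = next((name for key, name in BIN_KEYWORDS if key in label), None)
        if bin_type is None:
            continue
        match = DATE_PATTERN.search(strong.get_text(" ", strip=True))
        if not match:
            continue
        # Stand-in for remove_ordinal_indicator_from_date_string.
        cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", match.group(1))
        data["bins"].append({
            "type": bin_type,
            "collectionDate": datetime.strptime(cleaned, "%d %B %Y").strftime(date_format),
        })
    if not data["bins"]:
        raise ValueError("no collection entries parsed from div.collect")
    return data
```

Each council's parse method would then reduce to a single delegating call, so any future page restructure needs one edit instead of three.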

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/EnvironmentFirst.py` at line 1,
The three council modules (EnvironmentFirst.py, LewesDistrictCouncil.py,
EastbourneBoroughCouncil.py) duplicate the same DOM parsing logic that walks
div.collect → <p> → <strong>; extract that block into a single helper function
(e.g. parse_environmentfirst_schedule) placed in a shared module (common.py or
_environmentfirst.py) and have each council class call that helper instead of
duplicating the parse code; update the council classes to import and delegate to
the new function (keep the same input/output contract so callers in
EnvironmentFirst, LewesDistrictCouncil, and EastbourneBoroughCouncil need only
replace the inline parse block with a single function call).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/LewesDistrictCouncil.py`:
- Around line 35-71: The parser currently returns an empty data["bins"] silently
when parsing fails; update LewesDistrictCouncil.py so that if the collect_div
variable is None you raise an informative exception (e.g.,
RuntimeError("div.collect not found")), and after the parsing loop if
data["bins"] is still empty raise another informative exception (e.g.,
RuntimeError("no collection entries parsed from div.collect")). Also stop
swallowing date parse errors: replace the except ValueError block around
datetime.strptime(...) so it re-raises (or raises a new RuntimeError) including
the offending cleaned string and context (mentioning
remove_ordinal_indicator_from_date_string and the date_pattern) instead of
continue, so format changes are surfaced immediately.

---

Nitpick comments:
In `@uk_bin_collection/uk_bin_collection/councils/EnvironmentFirst.py`:
- Line 1: The three council modules (EnvironmentFirst.py,
LewesDistrictCouncil.py, EastbourneBoroughCouncil.py) duplicate the same DOM
parsing logic that walks div.collect → <p> → <strong>; extract that block into a
single helper function (e.g. parse_environmentfirst_schedule) placed in a shared
module (common.py or _environmentfirst.py) and have each council class call that
helper instead of duplicating the parse code; update the council classes to
import and delegate to the new function (keep the same input/output contract so
callers in EnvironmentFirst, LewesDistrictCouncil, and EastbourneBoroughCouncil
need only replace the inline parse block with a single function call).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9dee2f8-c62f-452f-9ec8-5310a89c84c6

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and c3182a6.

📒 Files selected for processing (3)
  • uk_bin_collection/uk_bin_collection/councils/EastbourneBoroughCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/EnvironmentFirst.py
  • uk_bin_collection/uk_bin_collection/councils/LewesDistrictCouncil.py

Comment on lines +35 to +71
+        collect_div = soup.find("div", {"class": "collect"})
+        if collect_div is None:
+            return data

-        # Parse the correct lines (find them, remove the ordinal indicator and make them the correct format date) and
-        # then add them to the dictionary
-        rubbish_day = datetime.strptime(
-            remove_ordinal_indicator_from_date_string(
-                page_text[2].find_next("strong").text
-            ),
-            "%d %B %Y",
-        ).strftime(date_format)
-        dict_data = {
-            "type": "Rubbish",
-            "collectionDate": rubbish_day,
-        }
-        data["bins"].append(dict_data)
-        recycling_day = datetime.strptime(
-            remove_ordinal_indicator_from_date_string(
-                page_text[4].find_next("strong").text
-            ),
-            "%d %B %Y",
-        ).strftime(date_format)
-        dict_data = {
-            "type": "Recycling",
-            "collectionDate": recycling_day,
-        }
-        data["bins"].append(dict_data)
+        date_pattern = re.compile(
+            r"(\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4})"
+        )

-        if len(page_text) > 5:
-            garden_day = datetime.strptime(
-                remove_ordinal_indicator_from_date_string(
-                    page_text[6].find_next("strong").text
-                ),
-                "%d %B %Y",
-            ).strftime(date_format)
-            dict_data = {
-                "type": "Garden",
-                "collectionDate": garden_day,
-            }
-            data["bins"].append(dict_data)
+        for p in collect_div.find_all("p"):
+            strong = p.find("strong")
+            if not strong:
+                continue
+            label = p.get_text(" ", strip=True).lower()
+            if "rubbish" in label:
+                bin_type = "Rubbish"
+            elif "recycling" in label:
+                bin_type = "Recycling"
+            elif "garden" in label:
+                bin_type = "Garden"
+            else:
+                continue
+            match = date_pattern.search(strong.get_text(" ", strip=True))
+            if not match:
+                continue
+            cleaned = remove_ordinal_indicator_from_date_string(match.group(1))
+            try:
+                collection_date = datetime.strptime(
+                    cleaned, "%d %B %Y"
+                ).strftime(date_format)
+            except ValueError:
+                continue
+            data["bins"].append(
+                {
+                    "type": bin_type,
+                    "collectionDate": collection_date,
+                }
+            )
Contributor


⚠️ Potential issue | 🟠 Major

Avoid silently returning empty schedules when parsing fails.

If the site changes shape again, this implementation degrades to an empty bins list without raising — users will see "no upcoming collections" instead of a clear error. Consider raising when div.collect is found but no entries could be parsed (and arguably when div.collect itself is missing). The per-<p> skip on missing <strong>/non-matching keyword is fine, but the ValueError on date parsing is exactly the kind of format-shift signal that should not be swallowed silently.

Based on learnings: in uk_bin_collection/**/*.py, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors so that format changes are detected early.

🛡️ Suggested minimal hardening
         data = {"bins": []}
         collect_div = soup.find("div", {"class": "collect"})
         if collect_div is None:
-            return data
+            raise ValueError(
+                "Could not locate <div class='collect'> – page layout may have changed."
+            )
@@
             cleaned = remove_ordinal_indicator_from_date_string(match.group(1))
-            try:
-                collection_date = datetime.strptime(
-                    cleaned, "%d %B %Y"
-                ).strftime(date_format)
-            except ValueError:
-                continue
+            collection_date = datetime.strptime(
+                cleaned, "%d %B %Y"
+            ).strftime(date_format)
             data["bins"].append(
                 {
                     "type": bin_type,
                     "collectionDate": collection_date,
                 }
             )
+
+        if not data["bins"]:
+            raise ValueError(
+                "No bin collections parsed from div.collect – page layout may have changed."
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        collect_div = soup.find("div", {"class": "collect"})
        if collect_div is None:
            raise ValueError(
                "Could not locate <div class='collect'> – page layout may have changed."
            )
        date_pattern = re.compile(
            r"(\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4})"
        )
        for p in collect_div.find_all("p"):
            strong = p.find("strong")
            if not strong:
                continue
            label = p.get_text(" ", strip=True).lower()
            if "rubbish" in label:
                bin_type = "Rubbish"
            elif "recycling" in label:
                bin_type = "Recycling"
            elif "garden" in label:
                bin_type = "Garden"
            else:
                continue
            match = date_pattern.search(strong.get_text(" ", strip=True))
            if not match:
                continue
            cleaned = remove_ordinal_indicator_from_date_string(match.group(1))
            collection_date = datetime.strptime(
                cleaned, "%d %B %Y"
            ).strftime(date_format)
            data["bins"].append(
                {
                    "type": bin_type,
                    "collectionDate": collection_date,
                }
            )
        if not data["bins"]:
            raise ValueError(
                "No bin collections parsed from div.collect – page layout may have changed."
            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/LewesDistrictCouncil.py` around
lines 35 - 71, The parser currently returns an empty data["bins"] silently when
parsing fails; update LewesDistrictCouncil.py so that if the collect_div
variable is None you raise an informative exception (e.g.,
RuntimeError("div.collect not found")), and after the parsing loop if
data["bins"] is still empty raise another informative exception (e.g.,
RuntimeError("no collection entries parsed from div.collect")). Also stop
swallowing date parse errors: replace the except ValueError block around
datetime.strptime(...) so it re-raises (or raises a new RuntimeError) including
the offending cleaned string and context (mentioning
remove_ordinal_indicator_from_date_string and the date_pattern) instead of
continue, so format changes are surfaced immediately.

@robbrad robbrad mentioned this pull request May 1, 2026
@robbrad
Copy link
Copy Markdown
Owner

robbrad commented May 1, 2026

Included in May 2026 Release PR #1992. Closing.

@robbrad robbrad closed this May 1, 2026
@robbrad robbrad mentioned this pull request May 2, 2026