Skip to content

fix: SouthOxfordshireCouncil - session establishment + bin type cleanup#1983

Closed
InertiaUK wants to merge 2 commits into
robbrad:masterfrom
InertiaUK:fix/south-oxfordshire-session-and-bin-type
Closed

fix: SouthOxfordshireCouncil - session establishment + bin type cleanup#1983
InertiaUK wants to merge 2 commits into
robbrad:masterfrom
InertiaUK:fix/south-oxfordshire-session-and-bin-type

Conversation

@InertiaUK
Copy link
Copy Markdown
Contributor

@InertiaUK InertiaUK commented Apr 27, 2026

What

Two fixes for South Oxfordshire District Council:

1. Intermittent 403 from Azure App Gateway

Direct requests to eform.southoxon.gov.uk/ebase/BINZONE_DESKTOP.eb are intermittently rejected (HTTP 403) without a valid JSESSIONID session cookie. Fix: use requests.Session() to visit the endpoint first (establishing the session), then set the SVBINZONE UPRN cookie and fetch the bin data.

2. Bin type strings including supplementary notes

The page's binextra divs sometimes contain extra notes after the bin description (e.g. "Don't forget - you can put out double your garden waste...", "Extra garden waste collected for brown bin subscriber."). The previous join of all stripped_strings elements included these notes in the bin type. Fix: stop building the type string at the first element containing "don't" or starting with "Extra".

Test

Tested against UPRN 10033002851 (from input.json). Returns clean bin types:

  • Green bin, textiles, food bin and garden waste bin
  • Grey bin, small electrical items and food bin

Checklist

  • input.json already has "skip_get_url": true — no change needed

Summary by CodeRabbit

Release Notes

  • Updates

    • Rotherham: Collection lookup now uses postcode and house number instead of UPRN.
  • Bug Fixes

    • South Oxfordshire: Fixed intermittent connection issues affecting bin collection lookups and improved bin type descriptions for clarity.
    • Rotherham: Enhanced reliability and robustness through improved data source integration.

Rotherham's own bin-day page directs residents to PDF calendars only —
there is no usable web lookup at rotherham.gov.uk. The Rotherham Bins
mobile app uses the shared Imactivate backend at
`bins.azurewebsites.net`. Earlier this year that endpoint had no
Rotherham data deployed; as of April 2026 it does:

- `getaddress?postcode=S60+1JD&localauthority=Rotherham` returns the
  full address list with `PremiseID` per row.
- `getcollections?premisesid=<id>&localauthority=Rotherham` returns
  6 future collections (PINK / BLACK / GREEN bins) per property.

The previous scraper only accepted a `premisesid` (or treated a numeric
`uprn` as one), but standard UPRNs are not Imactivate PremiseIDs and
yield empty results. Rewritten to accept `postcode` + `paon`, resolve
the PremiseID via getaddress (matching against Address1 / Address2 /
Street with a substring fallback for compound addresses like
"Flat 3, 22A"), then fetch the schedule. `premisesid` is still
honoured if explicitly passed.

`input.json` updated to use postcode `S60 1JD` + paon `77` so the
parity check stays green and the integration test can hit the live API.

Verified: returns the full upcoming bin schedule.
…strip supplementary notes from bin type

Azure App Gateway intermittently returns 403 on direct requests without
a JSESSIONID session cookie. Fix: use requests.Session() to visit the
page first (obtaining JSESSIONID), then set the SVBINZONE cookie and
fetch the bin data.

Also fixes bin type strings incorrectly including supplementary notes
(e.g. "Don't forget...", "Extra garden waste...") that appear after
the actual bin description in the binextra divs.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

The changes migrate two council bin collection scrapers to improved data sources: Rotherham switches from UPRN-based endpoint to postcode/house-number resolution via Imactivate backend with address lookup, while SouthOxfordshire improves session persistence and bin-type parsing logic.

Changes

Cohort / File(s) Summary
Rotherham Configuration
uk_bin_collection/tests/input.json
Updates RotherhamCouncil entry from UPRN-based URL pattern to postcode/house-number approach with Imactivate backend; sets skip_get_url to true and rewrites wiki_note describing new input requirements.
Rotherham Implementation
uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py
Switches data source to Imactivate's shared backend, adds BASE/LA/UA constants and request headers/timeouts, implements premise resolution via postcode+paon lookup through /getaddress endpoint, maintains backward UPRN compatibility, improves payload handling, and sorts results by collection date.
SouthOxfordshire Implementation
uk_bin_collection/uk_bin_collection/councils/SouthOxfordshireCouncil.py
Introduces persistent requests.Session with two-step fetch to handle session state, refines bin-type parsing by truncating supplementary notes before capitalization.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Imactivate as Imactivate API
    participant AddressAPI as Address Resolution
    participant CollectionAPI as Collection Data API
    
    Client->>Imactivate: POST /getaddress<br/>(postcode, paon)
    Imactivate->>AddressAPI: Resolve address
    AddressAPI-->>Imactivate: Return matched addresses
    Imactivate-->>Client: Return PremiseID<br/>(from matching fields)
    
    Client->>CollectionAPI: GET /collection<br/>(localauthority=Rotherham,<br/>PremiseID)
    CollectionAPI-->>Client: Return collection schedules
    Client->>Client: Sort by collectionDate<br/>& return bins
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • dp247

Poem

🐰 A rabbit's rejoicing!

From UPRN to postcode we spring,
Imactivate's service now takes the ring,
Sessions persist and addresses align,
Two councils improved, their data divine! 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title focuses exclusively on SouthOxfordshireCouncil, but the changeset includes significant updates to RotherhamCouncil as well. Update the title to reflect all major changes, such as: 'fix: Update RotherhamCouncil to Imactivate backend and fix SouthOxfordshireCouncil session handling' or split into separate PRs.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py (2)

54-61: Comment vs. code drift.

The comment says "Match against Address2 (house number/name) first, then Street", but the loop also checks Address1. Either drop Address1 or update the comment so the precedence is documented accurately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py` around
lines 54 - 61, The comment above the loop no longer matches the checked keys:
the loop iterates for key in ("Address2", "Address1", "Street") but the comment
only mentions Address2 then Street; update the comment to accurately describe
the precedence (Address2, then Address1, then Street) or remove Address1 from
the tuple if it was added accidentally; locate the loop that iterates over rows
and keys ("Address2","Address1","Street") (which returns row.get("PremiseID") on
match) and either change the comment to state "Match against Address2 (house
number/name) first, then Address1, then Street" or drop "Address1" from the
tuple to preserve the original comment.

134-139: Silently dropping rows on date-parse failure can hide upstream format changes.

If Imactivate ever changes its CollectionDate format, every row will silently continue and parse_data will return an empty bins list with no signal. Tighten the catch to (ValueError, TypeError) and at minimum log the offending payload so a format change is noticed.

-            try:
-                iso_date = date_str.split("T")[0]
-                parsed = datetime.strptime(iso_date, "%Y-%m-%d")
-                formatted = parsed.strftime(date_format)
-            except Exception:
-                continue
+            try:
+                iso_date = date_str.split("T")[0]
+                parsed = datetime.strptime(iso_date, "%Y-%m-%d")
+                formatted = parsed.strftime(date_format)
+            except (ValueError, TypeError) as exc:
+                print(f"Rotherham: skipping unparseable date {date_str!r}: {exc}")
+                continue

Based on learnings: in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors so format changes are detected early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py` around
lines 134 - 139, The current broad except in the date parsing block swallows all
errors and silently skips rows; tighten the except to catch only (ValueError,
TypeError) for the date parsing in the parse_data routine (the block that
computes iso_date, parsed, formatted) and, when caught, log the offending
payload/row (include the raw date_str and the full row or item) via the
module/logger used in this file and then re-raise or explicitly raise a
descriptive exception so upstream callers detect a format change instead of
returning empty bins; this ensures issues in CollectionDate format are visible
while still only catching parse-related errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py`:
- Around line 63-75: The substring fallback in the rows loop (building blob from
Address1/Address2/Street and checking `target in blob`) can falsely match short
paon values (e.g., "5" matching "15 High Street"); replace this raw substring
test with a token/word-boundary match (e.g., split blob into
whitespace/punctuation-separated tokens or use a regex with \b around the
normalized `target`) so only whole PAON tokens match, remove the redundant
`target and` guard, and keep the explicit ValueError raise if no proper
token-boundary match for `paon` is found (refer to the loop over `rows`, the
`blob` construction, `paon`/`target`, and the returned `PremiseID`).

In `@uk_bin_collection/uk_bin_collection/councils/SouthOxfordshireCouncil.py`:
- Around line 91-98: The stop-marker check is inconsistent: it uses
case-insensitive `"don't" in part.lower()` but case-sensitive
`part.startswith("Extra")`, so normalize the part before checking; in the loop
that builds type_parts (variables: bin_info, type_start, type_parts, bin_type)
convert each part to lowercase once (e.g., part_lower = part.lower()) and use
part_lower.startswith("extra") alongside the existing `"don't" in part_lower`
check to break, ensuring supplemental notes like "extra…" are removed regardless
of case.

---

Nitpick comments:
In `@uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py`:
- Around line 54-61: The comment above the loop no longer matches the checked
keys: the loop iterates for key in ("Address2", "Address1", "Street") but the
comment only mentions Address2 then Street; update the comment to accurately
describe the precedence (Address2, then Address1, then Street) or remove
Address1 from the tuple if it was added accidentally; locate the loop that
iterates over rows and keys ("Address2","Address1","Street") (which returns
row.get("PremiseID") on match) and either change the comment to state "Match
against Address2 (house number/name) first, then Address1, then Street" or drop
"Address1" from the tuple to preserve the original comment.
- Around line 134-139: The current broad except in the date parsing block
swallows all errors and silently skips rows; tighten the except to catch only
(ValueError, TypeError) for the date parsing in the parse_data routine (the
block that computes iso_date, parsed, formatted) and, when caught, log the
offending payload/row (include the raw date_str and the full row or item) via
the module/logger used in this file and then re-raise or explicitly raise a
descriptive exception so upstream callers detect a format change instead of
returning empty bins; this ensures issues in CollectionDate format are visible
while still only catching parse-related errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5ae0774f-1c16-4dee-9f11-5174ceb29910

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and bf6dd1f.

📒 Files selected for processing (3)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/SouthOxfordshireCouncil.py

Comment on lines +63 to +75
# Looser substring fallback so addresses like "Flat 3, 22A" match
# against a paon of "22A".
for row in rows:
blob = " ".join(
str(row.get(k, "")).strip()
for k in ("Address1", "Address2", "Street")
).lower()
if target and target in blob:
return str(row.get("PremiseID"))

raise ValueError(
f"No address matching '{paon}' for postcode {postcode}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Substring fallback can silently match the wrong property.

target in blob does a raw substring test, so a paon of "5" will match an address blob like "15 High Street" and the first row wins. Because the upstream loop already handles exact matches, this fallback should be tightened (token / word‑boundary match) to avoid silently returning a wrong PremiseID for users with short numeric house numbers. The target and guard on line 70 is also redundant — target was already verified non‑empty at line 47.

Per the retrieved learning, prefer explicit failure over a silent wrong‑match here.

♻️ Suggested tighter match
-        # Looser substring fallback so addresses like "Flat 3, 22A" match
-        # against a paon of "22A".
-        for row in rows:
-            blob = " ".join(
-                str(row.get(k, "")).strip()
-                for k in ("Address1", "Address2", "Street")
-            ).lower()
-            if target and target in blob:
-                return str(row.get("PremiseID"))
+        # Looser token-based fallback so addresses like "Flat 3, 22A" match
+        # against a paon of "22A", without "5" matching "15"/"25"/"50".
+        import re
+        for row in rows:
+            tokens = re.findall(
+                r"[\w]+",
+                " ".join(
+                    str(row.get(k, "")) for k in ("Address1", "Address2", "Street")
+                ).lower(),
+            )
+            if target in tokens:
+                return str(row.get("PremiseID"))

Based on learnings: in the UKBinCollectionData project, prefer explicit failures over silent error handling when parsing council bin collection data, to ensure format changes (or here, ambiguous matches) are detected immediately rather than potentially going unnoticed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Looser substring fallback so addresses like "Flat 3, 22A" match
# against a paon of "22A".
for row in rows:
blob = " ".join(
str(row.get(k, "")).strip()
for k in ("Address1", "Address2", "Street")
).lower()
if target and target in blob:
return str(row.get("PremiseID"))
raise ValueError(
f"No address matching '{paon}' for postcode {postcode}"
)
# Looser token-based fallback so addresses like "Flat 3, 22A" match
# against a paon of "22A", without "5" matching "15"/"25"/"50".
import re
for row in rows:
tokens = re.findall(
r"[\w]+",
" ".join(
str(row.get(k, "")) for k in ("Address1", "Address2", "Street")
).lower(),
)
if target in tokens:
return str(row.get("PremiseID"))
raise ValueError(
f"No address matching '{paon}' for postcode {postcode}"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/RotherhamCouncil.py` around
lines 63 - 75, The substring fallback in the rows loop (building blob from
Address1/Address2/Street and checking `target in blob`) can falsely match short
paon values (e.g., "5" matching "15 High Street"); replace this raw substring
test with a token/word-boundary match (e.g., split blob into
whitespace/punctuation-separated tokens or use a regex with \b around the
normalized `target`) so only whole PAON tokens match, remove the redundant
`target and` guard, and keep the explicit ValueError raise if no proper
token-boundary match for `paon` is found (refer to the loop over `rows`, the
`blob` construction, `paon`/`target`, and the returned `PremiseID`).

Comment on lines +91 to +98
# Strip supplementary notes (e.g. "Don't forget...", "Extra garden waste...")
# that follow the bin-type description.
type_parts = []
for part in bin_info[type_start:]:
if "don't" in part.lower() or part.startswith("Extra"):
break
type_parts.append(part)
bin_type = str.capitalize(" ".join(type_parts))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make the "Extra" stop-marker check case-insensitive for consistency.

The "don't" in part.lower() check is case-insensitive, but part.startswith("Extra") is case-sensitive. If the site ever returns "extra garden waste…" or "EXTRA…", the supplementary note will leak back into bin_type and silently regress the cleanup this PR is introducing. Align both checks on a single normalized form.

🧹 Proposed fix
                 type_parts = []
                 for part in bin_info[type_start:]:
-                    if "don't" in part.lower() or part.startswith("Extra"):
+                    lowered = part.lower()
+                    if "don't" in lowered or lowered.startswith("extra"):
                         break
                     type_parts.append(part)
                 bin_type = str.capitalize(" ".join(type_parts))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/SouthOxfordshireCouncil.py`
around lines 91 - 98, The stop-marker check is inconsistent: it uses
case-insensitive `"don't" in part.lower()` but case-sensitive
`part.startswith("Extra")`, so normalize the part before checking; in the loop
that builds type_parts (variables: bin_info, type_start, type_parts, bin_type)
convert each part to lowercase once (e.g., part_lower = part.lower()) and use
part_lower.startswith("extra") alongside the existing `"don't" in part_lower`
check to break, ensuring supplemental notes like "extra…" are removed regardless
of case.

@robbrad robbrad mentioned this pull request May 1, 2026
@robbrad
Copy link
Copy Markdown
Owner

robbrad commented May 1, 2026

Included in May 2026 Release PR #1992. Closing.

@robbrad robbrad closed this May 1, 2026
@robbrad robbrad mentioned this pull request May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants