Fix HuntingdonDistrictCouncil scraper crash #1833
📝 Walkthrough
Reworks the Huntingdonshire scraper to fetch pages with requests (timeout/status handling), enforce UPRN or URL, parse collection dates from tags, extract/normalize bin types via regex, return a dict
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
- Fix AttributeError when 'does not receive X collection' messages lack <strong> tags
- Use HTTPS instead of HTTP for secure requests
- Add timeout and error handling for HTTP requests
- Add null check for page structure changes
- Handle malformed dates gracefully
- Improve bin type extraction using regex instead of word position indexing
- Add support for food waste collection (different text pattern)
- Add comprehensive docstrings

Results now show proper types: "Domestic waste", "Dry recycling waste", "Garden waste", "Food waste"

Fixes crash: 'NoneType' object has no attribute 'get_text'
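The crash described in this commit comes from calling a method on the result of a tag lookup that returned nothing. A minimal sketch of the guard follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup so it runs without third-party packages; the sample markup and the `extract_date_texts` helper are hypothetical, and skipping entries that have no `<strong>` tag is an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page structure is not shown in this PR.
SAMPLE = """
<ul>
  <li>Domestic waste collection is on <strong>Monday 2 February 2026</strong></li>
  <li>This property does not receive garden waste collection</li>
</ul>
""".strip()


def extract_date_texts(markup: str) -> list[str]:
    """Return the text of each <strong> tag, skipping items that have none."""
    dates = []
    for item in ET.fromstring(markup).iter("li"):
        strong_tag = item.find("strong")
        if strong_tag is None:
            # Without this guard, reading strong_tag.text would raise
            # AttributeError on the "does not receive ..." entry.
            continue
        dates.append((strong_tag.text or "").strip())
    return dates
```

The same `is None` check applies to BeautifulSoup's `find`, which also returns `None` when no matching tag exists.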
e1ea6f9 to e8b9814
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@uk_bin_collection/uk_bin_collection/councils/HuntingdonDistrictCouncil.py`:
- Around line 71-90: The parser currently falls back to "Unknown" for bin_type
and silently continues on ValueError when parsing collection_date; update the
logic in HuntingdonDistrictCouncil.py (the block using result, type_match,
bin_type, strong_tag, collection_date, date_format) to raise a clear exception
(e.g., ValueError or a custom ParseError) when the type regex fails or when
datetime.strptime fails instead of assigning "Unknown" or using continue, so
format changes surface immediately; include a descriptive message mentioning the
raw text or tag content to aid debugging.
- Around line 39-47: The current try block in HuntingdonDistrictCouncil (where
user_uprn = kwargs.get("uprn"), check_uprn(user_uprn), and url is built) can
silently continue when check_uprn does not raise; change it to explicitly
validate and raise if neither a valid UPRN nor a legacy URL is present: call
check_uprn and if it returns False (or if user_uprn is falsy) attempt to read
kwargs.get("url") into url_fallback, and if that is also falsy then raise
ValueError("Missing or invalid UPRN and no URL provided"); ensure the raised
error replaces the broad catch so the constructor/method in
HuntingdonDistrictCouncil fails fast on invalid input.
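The fail-fast input check requested in the second comment could look roughly like the sketch below. It is self-contained and does not call the project's `check_uprn`: the digits-only rule, the `resolve_url` helper, and the URL template are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical URL template; the real endpoint is not shown in this PR.
URL_TEMPLATE = "https://example.invalid/bins?uprn={uprn}"


def resolve_url(uprn: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return the page URL to fetch, or raise if no usable input was given."""
    # Stand-in validity rule (digits only); the project's check_uprn may differ.
    if uprn and str(uprn).isdigit():
        return URL_TEMPLATE.format(uprn=uprn)
    if url:
        # Legacy fallback: use the caller-supplied URL unchanged.
        return url
    raise ValueError("Missing or invalid UPRN and no URL provided")
```

Raising before any request is made keeps the failure close to the bad input rather than letting it surface later as a parsing error.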
🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/HuntingdonDistrictCouncil.py (1)
49-53: Replace wildcard import with explicit imports for better code clarity and explicit dependencies.

The code currently relies on `from ... import *` to access `requests`, `check_uprn`, and `date_format`. While this works because `requests` is imported in `common.py`, explicit imports make dependencies visible and are used consistently across other council modules in the codebase.

Suggested refactor

     import re
    +import requests
     from datetime import datetime
     from bs4 import BeautifulSoup
    -from uk_bin_collection.uk_bin_collection.common import *
    +from uk_bin_collection.uk_bin_collection.common import check_uprn, date_format
     from uk_bin_collection.uk_bin_collection.get_bin_data import AbstractGetBinDataClass
Replace silent failures with explicit exceptions:
- Raise ValueError with raw text when bin type regex fails (instead of "Unknown")
- Raise ValueError with date text when datetime parsing fails (instead of continue)
- Validate UPRN/URL upfront and fail fast with clear message if neither provided

These changes ensure page format changes surface immediately with debugging info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
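The strict parsing described in this commit might be sketched as follows. The regex, the page wording, the input date format, and the `parse_entry` helper are hypothetical, since the real page text is not shown here; only the `DD/MM/YYYY` output format is taken from the results reported in this PR.

```python
import re
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"          # output format, matching the dates shown in this PR
PAGE_DATE_FORMAT = "%A %d %B %Y"  # assumed input format, e.g. "Monday 2 February 2026"
# Hypothetical pattern; the real page wording may differ.
TYPE_PATTERN = re.compile(r"next (.+?) collection", re.IGNORECASE)


def parse_entry(text: str, date_text: str) -> dict:
    """Parse one collection entry, raising ValueError instead of guessing."""
    type_match = TYPE_PATTERN.search(text)
    if type_match is None:
        # Surface the raw text so a page format change is easy to diagnose.
        raise ValueError(f"Could not extract bin type from: {text!r}")
    try:
        collection_date = datetime.strptime(date_text.strip(), PAGE_DATE_FORMAT)
    except ValueError as exc:
        raise ValueError(f"Could not parse collection date: {date_text!r}") from exc
    return {
        "type": type_match.group(1).capitalize(),
        "collectionDate": collection_date.strftime(DATE_FORMAT),
    }
```

Including the offending text in the message means a log line alone is usually enough to see what changed on the page.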
… resolved - using PR #1833 comprehensive solution)
This PR has been merged into the February 2026 consolidated release PR #1837. Thank you for your contribution!
Summary
- Fix `AttributeError: 'NoneType' object has no attribute 'get_text'` crash
- Handle missing `<strong>` date tags (e.g., "does not receive X collection" messages)

Problem
The scraper crashes when the page includes "does not receive X collection" messages that lack `<strong>` tags. This affects properties without certain waste services.

Changes
Crash Fix & Parsing
- Check `<strong>` tag existence before calling `.get_text()`

Error Handling Improvements
- Raise `ValueError` with raw text when bin type regex fails (instead of assigning "Unknown")
- Raise `ValueError` with date text when datetime parsing fails (instead of silently skipping)

Testing
Tested with multiple UPRNs - output now returns correct bin types without crashing:
    {
      "bins": [
        {"type": "Domestic waste", "collectionDate": "02/02/2026"},
        {"type": "Dry recycling waste", "collectionDate": "09/02/2026"},
        {"type": "Food waste", "collectionDate": "30/03/2026"}
      ]
    }

Error handling verified:

    ValueError: Missing or invalid UPRN and no URL provided...

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Improvements