Fix HuntingdonDistrictCouncil scraper crash #1833

Closed
daaaaan wants to merge 2 commits into robbrad:master from daaaaan:fix/huntingdon-district-council-scraper


@daaaaan daaaaan commented Jan 30, 2026

Summary

  • Fix AttributeError: 'NoneType' object has no attribute 'get_text' crash
  • Skip collection items without <strong> date tags (e.g., "does not receive X collection" messages)
  • Improve bin type parsing using regex instead of word position indexing
  • Add support for food waste collection pattern
  • Fail fast with clear exceptions when page format changes (instead of silent failures)

Problem

The scraper crashes when the page includes "does not receive X collection" messages that lack <strong> tags. This affects properties without certain waste services.

Changes

Crash Fix & Parsing

  • Check for <strong> tag existence before calling .get_text()
  • Use regex to extract bin type reliably from standard format
  • Handle food waste's different text pattern: "Your next weekly food collection is on..."
  • Results now show proper types: "Domestic waste", "Dry recycling waste", "Garden waste", "Food waste"
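
A minimal sketch of the regex-based bin type extraction described above. The patterns, the function name `extract_bin_type`, and the sample sentences are assumptions for illustration; the regexes actually used in the PR are not shown in this conversation.

```python
import re
from typing import Optional

# Assumed patterns: the standard sentence is taken to look like
# "Your next <type> waste collection is on ...", and food waste like
# "Your next weekly food collection is on ...".
STANDARD_PATTERN = re.compile(r"next\s+(.+?)\s+waste\s+collection", re.IGNORECASE)
FOOD_PATTERN = re.compile(r"weekly\s+food\s+collection", re.IGNORECASE)


def extract_bin_type(text: str) -> Optional[str]:
    """Return a normalized bin type, or None if the text matches neither pattern."""
    if FOOD_PATTERN.search(text):
        return "Food waste"
    match = STANDARD_PATTERN.search(text)
    if match:
        # Normalize to sentence case, e.g. "dry recycling" -> "Dry recycling waste".
        return f"{match.group(1).strip().capitalize()} waste"
    return None


print(extract_bin_type("Your next dry recycling waste collection is on Monday 9 February 2026"))
print(extract_bin_type("Your next weekly food collection is on Monday 30 March 2026"))
```

Matching on the sentence shape rather than on word position means a longer type such as "Dry recycling" does not shift the result, which is the failure mode that position indexing is prone to.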

Error Handling Improvements

  • Raise ValueError with raw text when bin type regex fails (instead of assigning "Unknown")
  • Raise ValueError with date text when datetime parsing fails (instead of silently skipping)
  • Validate UPRN/URL upfront and fail fast with clear message if neither provided
  • Error messages include debugging info (raw text, expected format) to aid troubleshooting
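
The fail-fast date handling can be sketched roughly as follows. The input format `"%A %d %B %Y"`, the `DATE_FORMAT` constant, and the function name are assumptions; the output format is chosen to match the `DD/MM/YYYY` values in the test output below, not taken from the PR's code.

```python
from datetime import datetime

# Assumed output format, matching the "02/02/2026" style shown in the PR's test output.
DATE_FORMAT = "%d/%m/%Y"


def parse_collection_date(date_text: str, input_format: str = "%A %d %B %Y") -> str:
    """Parse a date string from the page, raising ValueError with the raw text on failure."""
    try:
        parsed = datetime.strptime(date_text.strip(), input_format)
    except ValueError as exc:
        # Include the raw text and expected format so a page change is easy to diagnose.
        raise ValueError(
            f"Could not parse collection date {date_text!r}; "
            f"expected format {input_format!r}"
        ) from exc
    return parsed.strftime(DATE_FORMAT)


print(parse_collection_date("Monday 02 February 2026"))
```

Re-raising with `from exc` keeps the original `strptime` error in the traceback while the new message carries the offending text.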

Testing

Tested with multiple UPRNs. The scraper now returns the correct bin types without crashing:

{
  "bins": [
    {"type": "Domestic waste", "collectionDate": "02/02/2026"},
    {"type": "Dry recycling waste", "collectionDate": "09/02/2026"},
    {"type": "Food waste", "collectionDate": "30/03/2026"}
  ]
}

Error handling verified:

  • Missing UPRN/URL: ValueError: Missing or invalid UPRN and no URL provided...
  • Format changes will raise clear exceptions with the problematic text

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • More reliable extraction of bin types and collection dates; skips incomplete or malformed entries to avoid incorrect displays.
    • Improved detection of missing or changed page sections to prevent silent failures and surface clear error messages.
  • Improvements

    • Stronger validation with explicit identifier (UPRN) support and URL fallback, plus more robust network and parsing error handling.
    • Normalized bin type labels and consistent date formatting for clearer schedule display.


coderabbitai Bot commented Jan 30, 2026

Walkthrough

Reworks the Huntingdonshire scraper to fetch pages with requests (timeout/status handling), enforce UPRN or URL, parse collection dates from <strong> tags, extract/normalize bin types via regex, return a dict {"bins":[...]}, and raise ValueError on fetch or parse failures (adds re import).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| HuntingdonDistrictCouncil parser: uk_bin_collection/uk_bin_collection/councils/HuntingdonDistrictCouncil.py | Replaces previous parsing with a robust flow: uses requests.get(..., timeout) and HTTP status checks; enforces UPRN or URL via check_uprn; prefers UPRN-based URLs with legacy fallback; validates presence of results container; iterates items, skips entries missing a <strong> date, parses/normalizes bin type with regex (e.g., "X waste" or "Food waste") and parses date into collectionDate; raises ValueError on missing container, fetch failures, parsing errors; returns {"bins":[{type,collectionDate}, ...]}; adds import re and updates parse_data(self, page, **kwargs) -> dict signature. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • dp247

Poem

🐇 I nibble through regex threads tonight,
I chase each date from strong-tag light.
URLs and UPRNs align,
Bins and dates in tidy line.
Hooray — the parser hops just right!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fix HuntingdonDistrictCouncil scraper crash' accurately summarizes the main objective of the PR: fixing a crash in the Huntingdon District Council scraper caused by missing <strong> tags. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


- Fix AttributeError when 'does not receive X collection' messages lack <strong> tags
- Use HTTPS instead of HTTP for secure requests
- Add timeout and error handling for HTTP requests
- Add null check for page structure changes
- Handle malformed dates gracefully
- Improve bin type extraction using regex instead of word position indexing
- Add support for food waste collection (different text pattern)
- Add comprehensive docstrings

Results now show proper types: "Domestic waste", "Dry recycling waste",
"Garden waste", "Food waste"

Fixes crash: 'NoneType' object has no attribute 'get_text'
@daaaaan daaaaan force-pushed the fix/huntingdon-district-council-scraper branch from e1ea6f9 to e8b9814 Compare January 30, 2026 13:17

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@uk_bin_collection/uk_bin_collection/councils/HuntingdonDistrictCouncil.py`:
- Around line 71-90: The parser currently falls back to "Unknown" for bin_type
and silently continues on ValueError when parsing collection_date; update the
logic in HuntingdonDistrictCouncil.py (the block using result, type_match,
bin_type, strong_tag, collection_date, date_format) to raise a clear exception
(e.g., ValueError or a custom ParseError) when the type regex fails or when
datetime.strptime fails instead of assigning "Unknown" or using continue, so
format changes surface immediately; include a descriptive message mentioning the
raw text or tag content to aid debugging.
- Around line 39-47: The current try block in HuntingdonDistrictCouncil (where
user_uprn = kwargs.get("uprn"), check_uprn(user_uprn), and url is built) can
silently continue when check_uprn does not raise; change it to explicitly
validate and raise if neither a valid UPRN nor a legacy URL is present: call
check_uprn and if it returns False (or if user_uprn is falsy) attempt to read
kwargs.get("url") into url_fallback, and if that is also falsy then raise
ValueError("Missing or invalid UPRN and no URL provided"); ensure the raised
error replaces the broad catch so the constructor/method in
HuntingdonDistrictCouncil fails fast on invalid input.
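
One way the upfront validation requested above might look, as a sketch only. `BASE_URL`, `resolve_request_url`, and the digits-only UPRN check are hypothetical stand-ins; the real council endpoint and the behavior of the project's `check_uprn` helper are not shown in this conversation.

```python
from typing import Optional

# Hypothetical endpoint used only for illustration; not the real council URL.
BASE_URL = "https://example.invalid/bin-collections"


def resolve_request_url(uprn: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return the URL to fetch, preferring a UPRN and falling back to a legacy URL."""
    candidate = str(uprn).strip() if uprn is not None else ""
    # Assumption: a usable UPRN is a non-empty string of digits.
    if candidate.isdigit():
        return f"{BASE_URL}/{candidate}"
    if url:
        return url
    # Neither input is usable, so fail before any network request is made.
    raise ValueError("Missing or invalid UPRN and no URL provided")


print(resolve_request_url(uprn="100090123456"))
```

Resolving the target in one place, before any request is made, avoids the broad `try` block silently continuing with an unusable value.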
🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/HuntingdonDistrictCouncil.py (1)

49-53: Replace wildcard import with explicit imports for better code clarity and explicit dependencies.

The code currently relies on from ... import * to access requests, check_uprn, and date_format. While this works because requests is imported in common.py, explicit imports make dependencies visible and are used consistently across other council modules in the codebase.

Suggested refactor
 import re
+import requests
 from datetime import datetime
 
 from bs4 import BeautifulSoup
 
-from uk_bin_collection.uk_bin_collection.common import *
+from uk_bin_collection.uk_bin_collection.common import check_uprn, date_format
 from uk_bin_collection.uk_bin_collection.get_bin_data import AbstractGetBinDataClass

Replace silent failures with explicit exceptions:
- Raise ValueError with raw text when bin type regex fails (instead of "Unknown")
- Raise ValueError with date text when datetime parsing fails (instead of continue)
- Validate UPRN/URL upfront and fail fast with clear message if neither provided

These changes ensure page format changes surface immediately with debugging info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
robbrad added a commit that referenced this pull request Feb 1, 2026
… resolved - using PR #1833 comprehensive solution)

robbrad commented Feb 1, 2026

This PR has been merged into the February 2026 consolidated release PR #1837.

Thank you for your contribution!

@robbrad robbrad closed this Feb 1, 2026