fix: add User-Agent header to KingsLynnandWestNorfolkBC scraper #1733
Conversation
The Kings Lynn and West Norfolk council scraper was returning empty bin data because the website (https://www.west-norfolk.gov.uk) was blocking requests that lacked a proper User-Agent header, resulting in an HTTP 403 Forbidden error.

Root cause:
- The scraper was sending HTTP requests with only a Cookie header
- The council website's server requires a User-Agent header to identify the client
- Without this header, the server rejected the request with HTTP 403 Forbidden
- This caused BeautifulSoup to parse an error page instead of bin collection data
- The scraper found zero bin_date_container divs, resulting in an empty bins array

Solution (illustrated in the sketch below):
- Added a standard Chrome User-Agent string to the request headers
- The website now accepts the request and returns the expected HTML content
- The scraper successfully parses bin collection dates from the response

Testing:
- Verified with a test UPRN: now returns bin collections successfully
- Integration test passes
- All unit tests continue to pass (76/77; the single failure is an unrelated Chrome driver issue)
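For illustration, a minimal sketch of the failure mode and the fix, assuming the `requests` library and BeautifulSoup; the endpoint URL, UPRN value, and exact parsing call are placeholders, not taken verbatim from the scraper:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoint: the real scraper targets a page on
# https://www.west-norfolk.gov.uk (exact path not shown in this PR).
URL = "https://www.west-norfolk.gov.uk"
UPRN = "100012345678"  # hypothetical UPRN, for illustration only

# Before the fix: Cookie-only headers. The council server rejected such
# requests with HTTP 403 Forbidden, so BeautifulSoup ended up parsing an
# error page and found zero bin_date_container divs.
before = requests.get(URL, headers={"Cookie": f"bcklwn_uprn={UPRN}"})

# After the fix: adding a standard Chrome User-Agent string makes the
# server return the expected HTML containing the bin collection dates.
after = requests.get(
    URL,
    headers={
        "Cookie": f"bcklwn_uprn={UPRN}",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/91.0.4472.124 Safari/537.36"
        ),
    },
)
soup = BeautifulSoup(after.text, "html.parser")
containers = soup.find_all("div", {"class": "bin_date_container"})
```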
Walkthrough

A User-Agent header is added to the HTTP request in the Kings Lynn and West Norfolk Borough Council bin collection module, while preserving the existing Cookie header carrying the UPRN value. There are no changes to parsing, query logic, or data extraction.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~5 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/KingsLynnandWestNorfolkBC.py (1)
25-28: User-Agent header fix looks good; consider extracting the UA to a constant.

The added User-Agent together with the existing Cookie header is a sensible, minimal fix for the 403s. To make future maintenance easier (and to reuse this UA across any other scrapers if needed), you could move the UA string to a module-level constant and reference it here:
```diff
@@
-from uk_bin_collection.uk_bin_collection.common import *
-from uk_bin_collection.uk_bin_collection.get_bin_data import AbstractGetBinDataClass
-
-
-# import the wonderful Beautiful Soup and the URL grabber
-class CouncilClass(AbstractGetBinDataClass):
+from uk_bin_collection.uk_bin_collection.common import *
+from uk_bin_collection.uk_bin_collection.get_bin_data import AbstractGetBinDataClass
+
+USER_AGENT = (
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+    "AppleWebKit/537.36 (KHTML, like Gecko) "
+    "Chrome/91.0.4472.124 Safari/537.36"
+)
+
+
+# import the wonderful Beautiful Soup and the URL grabber
+class CouncilClass(AbstractGetBinDataClass):
@@
-        headers = {
-            "Cookie": f"bcklwn_uprn={user_uprn}",
-            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
-        }
+        headers = {
+            "Cookie": f"bcklwn_uprn={user_uprn}",
+            "User-Agent": USER_AGENT,
+        }
```
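For context, a minimal sketch of how a request helper might use the extracted constant; the function name and URL parameter are hypothetical, and only the headers dict mirrors the diff above:

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)


def fetch_collection_page(url: str, user_uprn: str) -> str:
    """Hypothetical helper: fetch a council page with the headers from the diff."""
    headers = {
        "Cookie": f"bcklwn_uprn={user_uprn}",
        "User-Agent": USER_AGENT,
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # a missing User-Agent previously surfaced here as a 403
    return response.text
```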
Any update on this please? Keen to get it implemented to fix my local council. Thank you.