Conversation
Caution: Review failed. The pull request is closed.

📝 Walkthrough

Refactors Firecrawl-specific scraping into a pluggable GeneralScraper with Firecrawl and Zyte backends; renames config keys to GENERAL_SCRAPING_* and FIRECRAWL_WAIT_FOR; introduces an async Firecrawl client, Zyte integration, generalized data processors/items, parsing sanitization updates, and bot config wiring changes.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Caller as Caller
participant General as GeneralScraper
participant Registry as SCRAPER_REGISTRY
participant Backend as BackendScraper
participant Processor as DataProcessor
Caller->>General: request processor for URL
General->>Registry: lookup scraper type (GENERAL_SCRAPING_API)
Registry-->>General: returns BackendScraper class
General->>Backend: instantiate & call get_processor_by_url(url)
Backend->>Processor: create processor with url
Processor->>Processor: async _get_page_content()
Processor->>Processor: _process_*_result → _build_item_data()
Processor-->>General: DataProcessor ready (get_item)
General-->>Caller: returns Processor
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly Related PRs
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
app/config.py (1)
233-233: ⚠️ Potential issue | 🟡 Minor

HTTP_REQUEST_TIMEOUT has inconsistent typing — it will be a str when set from the environment and an int when falling back to the default.

env.get() returns a string for set environment variables, but the default 30 is an int. This will cause type mismatches downstream.

Proposed fix

-HTTP_REQUEST_TIMEOUT = env.get("HTTP_REQUEST_TIMEOUT", 30)
+HTTP_REQUEST_TIMEOUT = int(env.get("HTTP_REQUEST_TIMEOUT", 30))

app/services/telegram_bot/__init__.py (2)
209-221: ⚠️ Potential issue | 🔴 Critical

Bug: a successful general scrape is immediately overwritten with "no supported url found".

When GENERAL_SCRAPING_ON is True and the URL is processed successfully (lines 210–217), execution falls through to line 218, which overwrites the process message with "no supported url found" and returns. The user sees the error text even though processing succeeded.

Add a return (or restructure with else) after the successful scrape path:

Proposed fix

 if url_metadata.source == "unknown":
     if GENERAL_SCRAPING_ON:
         await process_message.edit_text(
             text=f"Uncategorized url found. General webpage parser is on, Processing..."
         )
         metadata_item = await content_process_function(url_metadata=url_metadata)
         await send_item_message(
             metadata_item, chat_id=message.chat_id
         )
+        await process_message.delete()
+        return
     await process_message.edit_text(
         text=f"For the {i + 1} th url, no supported url found."
     )
     return
351-358: ⚠️ Potential issue | 🟠 Major

After a successful general scrape, the "no supported url found" branch still executes.

On line 351, when GENERAL_SCRAPING_ON is True and source == "unknown", the URL is processed and sent. But execution continues to line 356, which matches source == "unknown" again, logs "no supported url found", and returns — preventing any further processing. Consider adding continue after the successful scrape to avoid the misleading log and early return.

Proposed fix

 if url_metadata.source == "unknown" and GENERAL_SCRAPING_ON:
     metadata_item = await content_process_function(url_metadata=url_metadata)
     await send_item_message(
         metadata_item, chat_id=message.chat_id, message=message
     )
+    continue
 if url_metadata.source == "unknown" or url_metadata.source == "banned":
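The fall-through pattern both comments describe reduces to a small, self-contained sketch (names here are illustrative, not the bot's actual handlers): a guard branch that handles a case must also terminate that iteration, or later branches matching the same condition will run too.

```python
def route(sources):
    """Dispatch sources to handlers; 'continue' stops fall-through."""
    handled = []
    for source in sources:
        if source == "unknown":
            handled.append("general-scrape")
            continue  # without this, the "banned/unknown" branch below also fires
        if source in ("banned",):
            handled.append("skipped")
            continue
        handled.append(f"scraper:{source}")
    return handled

print(route(["unknown", "twitter", "banned"]))
# ['general-scrape', 'scraper:twitter', 'skipped']
```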
🤖 Fix all issues with AI agents
In `@app/services/scrapers/general/base.py`:
- Around line 84-88: The code can raise TypeError when description is falsy and
markdown_content is None because None[:500] is invalid; update the processing
block in base.py to defensively handle None by coalescing markdown_content to an
empty string (or using conditional slicing) before taking [:500], then pass the
resulting string into BeautifulSoup and set item_data["text"]; reference the
variables description, markdown_content, text, BeautifulSoup, and item_data when
making the change.
- Around line 133-136: The loop that removes DOCTYPE nodes mutates soup.contents
while iterating which can skip nodes; change the iteration to iterate over a
snapshot (e.g. wrap soup.contents in list(...)) so every Doctype is visited and
extracted — update the loop that uses soup and Doctype in
app/services/scrapers/general/base.py to iterate over list(soup.contents) (or
use soup.find_all/explicit copy) before calling item.extract().
- Around line 89-96: The else branch can pass None into wrap_text_into_html when
markdown_content is None; update the logic in the block using
parsing_article_body_by_llm, sanitize_html and wrap_text_into_html so
markdown_content is guarded with a fallback (e.g., use markdown_content or "" )
before calling wrap_text_into_html(markdown_content, is_html=False), ensuring
wrap_text_into_html always receives a string.
In `@app/services/scrapers/general/firecrawl.py`:
- Around line 24-36: The call to the synchronous Firecrawl client in
_get_page_content is blocking the event loop; change the call to
self._client.scrape_url so it runs on a background thread (e.g., use
asyncio.to_thread or loop.run_in_executor) and await that result before passing
it to _process_firecrawl_result; also add the necessary import for asyncio at
the top and keep the existing try/except and logger.error behavior unchanged.
In `@app/services/scrapers/general/scraper.py`:
- Around line 36-42: self.scraper_type can be None which causes
self.scraper_type.upper() in _init_scraper to raise AttributeError; ensure a
safe default string is used when both the constructor argument and
GENERAL_SCRAPING_API are None by normalizing scraper_type before calling
.upper(). Update the assignment or _init_scraper so it uses something like
scraper_type = (scraper_type or GENERAL_SCRAPING_API or "GENERAL") (or otherwise
coerce to str) and then call .upper() against that value; reference symbols:
scraper_type, GENERAL_SCRAPING_API, _init_scraper, SCRAPER_REGISTRY, and the
.upper() call to locate the change.
In `@app/services/scrapers/general/zyte.py`:
- Around line 22-35: The AsyncZyteAPI instance (AsyncZyteAPI) creates an
internal aiohttp session that must be closed via its async context manager;
update the Zyte scraping block to open the client's session with "async with
client.session():" and call the existing AsyncZyteAPI.get(...) inside that
context so the underlying connection pool is always closed even on exceptions,
then pass the returned result to self._process_zyte_result(result) as before and
keep the current exception logging/raising.
In `@app/services/scrapers/scraper_manager.py`:
- Around line 54-57: init_general_scraper currently constructs and returns a
GeneralScraper but never assigns it to the class attribute, so the guard
checking cls.general_scraper in init_scraper will always be true; modify
init_general_scraper to assign the new instance to cls.general_scraper (e.g.,
cls.general_scraper = GeneralScraper()) and then return cls.general_scraper so
subsequent calls reuse the same scraper instance.
In `@app/utils/parse.py`:
- Around line 130-133: The loop mutates the live list soup.contents while
iterating (using Doctype and item.extract()), which can skip nodes; fix by
iterating over a snapshot of the contents instead (e.g., iterate over
list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.
- Around line 151-154: The loop that unwraps <p> tags (the for tag in
soup.find_all("p") block) currently calls tag.unwrap() which merges consecutive
paragraphs; before unwrapping each <p> insert a paragraph separator (e.g., a
newline or double-newline as a NavigableString or a <br> element) so paragraph
boundaries are preserved in the resulting Telegram text, then call tag.unwrap();
ensure any required import (NavigableString) is added if using it.
🧹 Nitpick comments (2)
app/services/scrapers/scraper_manager.py (1)
18-21: scrapers dict captures None at class-definition time, not a live reference.

The dict values for "other" and "unknown" are bound to the current value of general_scraper (which is None) when the class body executes. After init_scraper updates the dict entry, subsequent lookups via cls.scrapers["other"] will work, but this pattern is fragile and misleading. Consider converting scrapers into a @classmethod property or populating it lazily to avoid this gotcha.

app/services/scrapers/general/base.py (1)
116-150: Near-duplicate sanitization logic with telegram_message_html_trim in parse.py.

Both methods remove DOCTYPEs, decompose script/style/etc., and unwrap structural tags, but with slightly different tag lists. This is a potential source of drift. Consider extracting a shared sanitization utility that both call sites can reuse, or at minimum add a comment cross-referencing the two implementations.
# Remove DOCTYPE declarations
for item in soup.contents:
    if isinstance(item, Doctype):
        item.extract()
Same live-list mutation issue — iterate over a snapshot of soup.contents.
Same issue as in parse.py: calling .extract() while iterating over soup.contents can skip DOCTYPE nodes.
Proposed fix
 # Remove DOCTYPE declarations
-for item in soup.contents:
+for item in list(soup.contents):
     if isinstance(item, Doctype):
         item.extract()

🤖 Prompt for AI Agents
In `@app/services/scrapers/general/base.py` around lines 133 - 136, The loop that
removes DOCTYPE nodes mutates soup.contents while iterating which can skip
nodes; change the iteration to iterate over a snapshot (e.g. wrap soup.contents
in list(...)) so every Doctype is visited and extracted — update the loop that
uses soup and Doctype in app/services/scrapers/general/base.py to iterate over
list(soup.contents) (or use soup.find_all/explicit copy) before calling
item.extract().
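The live-list pitfall flagged above is general Python behavior, not specific to BeautifulSoup. A minimal stand-alone sketch, using a plain list in place of soup.contents:

```python
# Removing items from a list while iterating over it skips elements:
# after removing index i, the next element shifts into slot i and the
# iterator moves on to i + 1, never visiting the shifted element.
items = ["doctype", "doctype", "html"]
for item in items:          # buggy: mutates the list being iterated
    if item == "doctype":
        items.remove(item)
print(items)  # ['doctype', 'html'] -- the second doctype survived

items = ["doctype", "doctype", "html"]
for item in list(items):    # fixed: iterate over a snapshot copy
    if item == "doctype":
        items.remove(item)
print(items)  # ['html']
```

The same reasoning applies to soup.contents, which is why list(soup.contents) fixes the DOCTYPE loop.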
async def _get_page_content(self) -> None:
    try:
        result = self._client.scrape_url(
            url=self.url,
            formats=["markdown", "html"],
            only_main_content=True,
            exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
            wait_for=FIRECRAWL_WAIT_FOR,
        )
        await self._process_firecrawl_result(result)
    except Exception as e:
        logger.error(f"Failed to scrape URL with Firecrawl: {e}")
        raise
Synchronous scrape_url blocks the async event loop.
FirecrawlClient.scrape_url() is a synchronous method (it calls self._app.scrape(...) from the Firecrawl SDK). Calling it directly inside this async def _get_page_content will block the event loop for the duration of the HTTP request to Firecrawl.
Wrap the blocking call with asyncio.to_thread (or loop.run_in_executor):
Proposed fix
+import asyncio
+from functools import partial
+
...
async def _get_page_content(self) -> None:
try:
- result = self._client.scrape_url(
- url=self.url,
- formats=["markdown", "html"],
- only_main_content=True,
- exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
- wait_for=FIRECRAWL_WAIT_FOR,
- )
+ result = await asyncio.to_thread(
+ partial(
+ self._client.scrape_url,
+ url=self.url,
+ formats=["markdown", "html"],
+ only_main_content=True,
+ exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
+ wait_for=FIRECRAWL_WAIT_FOR,
+ )
+ )
await self._process_firecrawl_result(result)
except Exception as e:
logger.error(f"Failed to scrape URL with Firecrawl: {e}")
raise🤖 Prompt for AI Agents
In `@app/services/scrapers/general/firecrawl.py` around lines 24 - 36, The call to
the synchronous Firecrawl client in _get_page_content is blocking the event
loop; change the call to self._client.scrape_url so it runs on a background
thread (e.g., use asyncio.to_thread or loop.run_in_executor) and await that
result before passing it to _process_firecrawl_result; also add the necessary
import for asyncio at the top and keep the existing try/except and logger.error
behavior unchanged.
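The offloading pattern the review recommends can be sketched with a stand-in blocking function (blocking_scrape is a placeholder, not the real Firecrawl SDK call):

```python
import asyncio
import time

def blocking_scrape(url: str) -> str:
    """Stand-in for a synchronous SDK call such as scrape_url."""
    time.sleep(0.1)  # simulates network I/O that would otherwise block the loop
    return f"content of {url}"

async def main() -> str:
    # asyncio.to_thread runs the blocking function on a worker thread;
    # the event loop stays free to serve other tasks while it waits.
    return await asyncio.to_thread(blocking_scrape, "https://example.com")

result = asyncio.run(main())
print(result)  # content of https://example.com
```

asyncio.to_thread requires Python 3.9+; on older versions, loop.run_in_executor(None, func, *args) is the equivalent.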
    self.scraper_type = scraper_type or GENERAL_SCRAPING_API
    self._scraper: Optional[BaseGeneralScraper] = None
    self._init_scraper()

def _init_scraper(self) -> None:
    """Initialize the underlying scraper based on scraper_type."""
    scraper_class = self.SCRAPER_REGISTRY.get(self.scraper_type.upper())
AttributeError if both scraper_type and GENERAL_SCRAPING_API are None.
self.scraper_type can be None when the caller passes nothing and the config variable is unset. Calling .upper() on None at line 42 will raise AttributeError.
Proposed fix — default to a known value
- self.scraper_type = scraper_type or GENERAL_SCRAPING_API
+ self.scraper_type = scraper_type or GENERAL_SCRAPING_API or "FIRECRAWL"

🤖 Prompt for AI Agents
In `@app/services/scrapers/general/scraper.py` around lines 36 - 42,
self.scraper_type can be None which causes self.scraper_type.upper() in
_init_scraper to raise AttributeError; ensure a safe default string is used when
both the constructor argument and GENERAL_SCRAPING_API are None by normalizing
scraper_type before calling .upper(). Update the assignment or _init_scraper so
it uses something like scraper_type = (scraper_type or GENERAL_SCRAPING_API or
"GENERAL") (or otherwise coerce to str) and then call .upper() against that
value; reference symbols: scraper_type, GENERAL_SCRAPING_API, _init_scraper,
SCRAPER_REGISTRY, and the .upper() call to locate the change.
try:
    client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
    result = await client.get(
        {
            "url": self.url,
            "browserHtml": True,
            "article": True,
            "articleOptions": {"extractFrom": "browserHtml"},
        }
    )
    await self._process_zyte_result(result)
except Exception as e:
    logger.error(f"Failed to scrape URL with Zyte: {e}")
    raise
🧩 Analysis chain
🌐 Web query:
Zyte API AsyncZyteAPI Python context manager resource cleanup
💡 Result:
For resource cleanup with zyte_api.AsyncZyteAPI, the main thing you need to close is the session object returned by client.session() (it manages an internal aiohttp.ClientSession connection pool).
Recommended: use it as an async context manager
from zyte_api import AsyncZyteAPI
async def main():
    client = AsyncZyteAPI()
    async with client.session() as session:
        r = await session.get({"url": "https://example.com", "httpResponseBody": True})
        ...

Exiting the async with block performs the cleanup automatically. [1] [2]
Alternative: manual cleanup (equivalent, but you must remember it)
session = client.session()
try:
    r = await session.get({"url": "https://example.com"})
finally:
    await session.close()

This is explicitly documented as sufficient — you don't have to use the context manager as long as you await session.close() when done. [1]
If you pass your own aiohttp.ClientSession
If you call await client.get(..., session=your_aiohttp_session), then you own that aiohttp session lifecycle and must close it yourself (Zyte won’t). [1]
Sources:
[1] python-zyte-api API reference (AsyncZyteAPI, session(), close())
[2] Zyte docs Python examples using async with client.session()
🏁 Script executed:
# Search for AsyncZyteAPI imports and usage in the codebase
rg "AsyncZyteAPI" --type py -B 2 -A 5

Repository: aturret/FastFetchBot
🏁 Script executed:
# Check if there are any existing Zyte API client usage patterns
rg "ZYTE_API_KEY" --type py -B 3 -A 3

Repository: aturret/FastFetchBot
AsyncZyteAPI session requires proper async context management — resource leak risk.
AsyncZyteAPI.get() creates an internal session that manages an aiohttp.ClientSession connection pool. The current code does not close this session, causing connection leaks.
The client must use client.session() as an async context manager:
Proposed fix — use session context manager
try:
client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
- result = await client.get(
- {
- "url": self.url,
- "browserHtml": True,
- "article": True,
- "articleOptions": {"extractFrom": "browserHtml"},
- }
- )
+ async with client.session() as session:
+ result = await session.get(
+ {
+ "url": self.url,
+ "browserHtml": True,
+ "article": True,
+ "articleOptions": {"extractFrom": "browserHtml"},
+ }
+ )
await self._process_zyte_result(result)🤖 Prompt for AI Agents
In `@app/services/scrapers/general/zyte.py` around lines 22 - 35, The AsyncZyteAPI
instance (AsyncZyteAPI) creates an internal aiohttp session that must be closed
via its async context manager; update the Zyte scraping block to open the
client's session with "async with client.session():" and call the existing
AsyncZyteAPI.get(...) inside that context so the underlying connection pool is
always closed even on exceptions, then pass the returned result to
self._process_zyte_result(result) as before and keep the current exception
logging/raising.
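The guarantee the fix relies on — cleanup runs even when the request raises — is the async context manager protocol itself. A stdlib-only sketch with a stand-in session (no Zyte dependency; the real client's session() behaves analogously):

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def session():
    """Stand-in for client.session(): owns a connection pool."""
    events.append("pool opened")
    try:
        yield "pool"
    finally:
        # Runs even when the body raises -- the guarantee that
        # "async with client.session():" provides for the Zyte client.
        events.append("pool closed")

async def main():
    try:
        async with session():
            raise RuntimeError("request failed")
    except RuntimeError:
        pass

asyncio.run(main())
print(events)  # ['pool opened', 'pool closed']
```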
# Remove DOCTYPE declarations
for item in soup.contents:
    if isinstance(item, Doctype):
        item.extract()
Mutating soup.contents while iterating over it can skip nodes.
soup.contents is a live list; calling .extract() during iteration may cause elements to be skipped. Snapshot the list first.
Proposed fix
# Remove DOCTYPE declarations
- for item in soup.contents:
+ for item in list(soup.contents):
if isinstance(item, Doctype):
item.extract()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Remove DOCTYPE declarations | |
| for item in soup.contents: | |
| if isinstance(item, Doctype): | |
| item.extract() | |
| # Remove DOCTYPE declarations | |
| for item in list(soup.contents): | |
| if isinstance(item, Doctype): | |
| item.extract() |
🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 130 - 133, The loop mutates the live list
soup.contents while iterating (using Doctype and item.extract()), which can skip
nodes; fix by iterating over a snapshot of the contents instead (e.g., iterate
over list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.
# Unwrap <p> tags (keep text content)
for tag in soup.find_all("p"):
    tag.unwrap()
Unwrapping <p> tags loses paragraph separation.
When <p> tags are unwrapped without inserting a separator (e.g., <br> or newline), consecutive paragraphs will merge into a single run of text, harming readability in the Telegram message.
Proposed fix — insert a line break before unwrapping
 # Unwrap <p> tags (keep text content)
 for tag in soup.find_all("p"):
+    tag.insert_after(soup.new_string("\n"))
     tag.unwrap()

🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 151 - 154, The loop that unwraps <p> tags
(the for tag in soup.find_all("p") block) currently calls tag.unwrap() which
merges consecutive paragraphs; before unwrapping each <p> insert a paragraph
separator (e.g., a newline or double-newline as a NavigableString or a <br>
element) so paragraph boundaries are preserved in the resulting Telegram text,
then call tag.unwrap(); ensure any required import (NavigableString) is added if
using it.
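The effect of inserting a separator before dropping <p> tags can be shown with the stdlib alone — a tiny HTMLParser that strips tags but appends "\n" at each closing </p>, the same boundary the suggested tag.insert_after("\n") fix preserves (a simplified sketch, not the project's BeautifulSoup code):

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Strips markup but keeps paragraph boundaries as newlines."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")  # separator instead of silently merging

parser = ParagraphText()
parser.feed("<p>first paragraph</p><p>second paragraph</p>")
print("".join(parser.parts))
# first paragraph
# second paragraph
```

Without the handle_endtag separator, the two paragraphs would collapse into "first paragraphsecond paragraph" — exactly the readability problem flagged for the Telegram output.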
@codex review

To use Codex here, create a Codex account and connect to github.

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad6ec6cb10
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    raise

async def _process_zyte_result(self, result: dict) -> None:
    article = result.get("article", {})
Guard against null Zyte article payloads
If Zyte returns "article": null (which can happen on pages where article extraction fails), result.get("article", {}) yields None, and subsequent calls like article.get(...) raise AttributeError before the browserHtml fallback can run. This makes general scraping fail for those URLs when the Zyte backend is selected.
 print(text)
-split_pivot = "\n" if is_html is False else "<br>"
-text_list = text.split(split_pivot)
+text_list = text.split("\n")
Preserve HTML structure when wrapping HTML content
Splitting all HTML input on raw newlines when is_html=True breaks pretty-printed markup into line fragments and wraps each fragment in <p>, which can corrupt nested structures (e.g., lists/tables become invalid paragraph-wrapped fragments). This degrades rendered content quality for scraped pages that include formatting newlines, not just explicit <br> line breaks.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@app/services/scrapers/scraper_manager.py`:
- Around line 18-21: The class-level scrapers dict is populated with None at
import time causing "other" and "unknown" to fall out of sync when init_scraper
sets only one key; update init_scraper (the method updating cls.general_scraper
and cls.scrapers) so that when you assign the general_scraper instance you also
set both cls.scrapers["other"] and cls.scrapers["unknown"] to that instance (or
change scrapers to be a property/method that returns {"other":
cls.general_scraper, "unknown": cls.general_scraper, ...} based on the current
class attributes); edit the init_scraper function and the scrapers reference to
ensure both keys always reference the same non-None scraper instance.
🧹 Nitpick comments (5)
app/services/scrapers/scraper_manager.py (2)
18-21: Annotate mutable class attribute with ClassVar.

Ruff (RUF012) flags that scrapers is a mutable class attribute without a ClassVar annotation. This helps type checkers distinguish class-level state from instance-level state.

Proposed fix

+from typing import ClassVar, Optional
 ...
-    scrapers = {"bluesky": bluesky_scraper,
+    scrapers: ClassVar[dict] = {"bluesky": bluesky_scraper,
44-47: Partially initialized scraper persists if init() throws.

If await cls.bluesky_scraper.init() raises, cls.bluesky_scraper is already assigned (line 45). Future calls to init_scraper("bluesky") will see it's not None and skip re-initialization, leaving a broken instance in place.

Proposed fix

 `@classmethod`
 async def init_bluesky_scraper(cls) -> BlueskyScraper:
-    cls.bluesky_scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
-    await cls.bluesky_scraper.init()
-    return cls.bluesky_scraper
+    scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
+    await scraper.init()
+    cls.bluesky_scraper = scraper
+    return cls.bluesky_scraper

app/services/scrapers/general/firecrawl_client.py (1)
26-57: Thread-safe singleton holding an async client — consider event-loop affinity.

The threading.Lock-based singleton is fine for creation, but AsyncFirecrawl likely holds an internal httpx.AsyncClient (or similar) that is bound to the event loop active at creation time. If get_instance() is first called from a different loop (e.g., during tests or worker startup) than the one that later awaits scrape_url, you may get "attached to a different loop" errors. This is a low-risk concern for a single-loop server but worth noting.

app/services/scrapers/general/base.py (2)
170-171: AsyncOpenAI client instantiated on every LLM call.

A new AsyncOpenAI is created on each invocation of parsing_article_body_by_llm, which allocates a new HTTP connection pool every time. Consider creating it once (e.g., as a module-level or class-level singleton) and reusing it.

Proposed refactor sketch

+_openai_client: Optional[AsyncOpenAI] = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+    return _openai_client
+
 ...
 `@staticmethod`
 async def parsing_article_body_by_llm(html_content: str) -> str:
     ...
     try:
-        client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+        client = _get_openai_client()
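The lazy module-level singleton sketched above can be demonstrated end to end with a stand-in client class (FakeAsyncClient is illustrative; the real code would hold an AsyncOpenAI instance):

```python
class FakeAsyncClient:
    """Stand-in for an expensive client that owns a connection pool."""
    instances = 0

    def __init__(self):
        FakeAsyncClient.instances += 1

_client = None

def get_client() -> FakeAsyncClient:
    # Build the client on first use, then reuse it on every later call,
    # so only one connection pool is ever allocated.
    global _client
    if _client is None:
        _client = FakeAsyncClient()
    return _client

a, b = get_client(), get_client()
print(a is b, FakeAsyncClient.instances)  # True 1
```

Note this plain check-then-set is only safe when all callers share one event loop or thread; under multi-threaded access it would need the same locking discipline as the Firecrawl client singleton discussed above.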
46-46: MD5 used for URL-based ID generation (static analysis flag S324).

The static analysis tool flags hashlib.md5 as insecure. Since this is only used to derive a short deterministic ID from a URL (not for security/integrity), the risk is negligible. If you want to silence the linter, you can use hashlib.md5(url.encode(), usedforsecurity=False) (Python 3.9+).
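The suggested silencing can be sketched as a small helper (short_id and the 12-character truncation are illustrative choices, not the project's actual scheme):

```python
import hashlib

def short_id(url: str) -> str:
    # usedforsecurity=False (Python 3.9+) tells linters and FIPS-mode
    # builds that this MD5 is a fingerprint, not a security primitive.
    digest = hashlib.md5(url.encode(), usedforsecurity=False)
    return digest.hexdigest()[:12]

print(short_id("https://example.com"))       # deterministic 12-hex-char id
print(len(short_id("https://example.com")))  # 12
```

The same input always yields the same ID, which is the property the URL-keyed item lookup relies on; collision resistance against an adversary is not required here.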
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Chores