
Feat: Refactor general webpage scraping and fix HTML sanitizing #49

Merged
aturret merged 9 commits into main from zyte-api on Feb 15, 2026
Conversation

@aturret (Owner) commented Feb 14, 2026

Summary by CodeRabbit

  • New Features

    • New, configurable general scraping framework with selectable backends (including Zyte); new scraping-related environment options.
  • Improvements

    • Cleaner HTML sanitization and text extraction for better message formatting.
    • Unified, asynchronous scraping flow for more reliable page processing and media handling.
  • Bug Fixes

    • Fixed typo in a user-facing message ("Occurred").
  • Chores

    • Added top-level .claude to .gitignore.

@coderabbitai bot (Contributor) commented Feb 14, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Refactors Firecrawl-specific scraping into a pluggable GeneralScraper with Firecrawl and Zyte backends; renames config keys to GENERAL_SCRAPING_* and FIRECRAWL_WAIT_FOR; introduces async Firecrawl client, Zyte integration, generalized data processors/items, parsing sanitization updates, and bot config wiring changes.

Changes

  • Configuration & Ignore (/.gitignore, template.env, pyproject.toml, app/config.py)
    Added .claude to .gitignore; new env keys: GENERAL_SCRAPING_ON, GENERAL_SCRAPING_API, FIRECRAWL_WAIT_FOR, ZYTE_API_KEY; removed FIRECRAWL_ON and FIRECRAWL_TIMEOUT_SECONDS; added HTTP_REQUEST_TIMEOUT and the zyte-api dependency.
  • Core abstraction & models (app/services/scrapers/general/__init__.py, app/services/scrapers/general/base.py)
    Renamed FirecrawlItem→GeneralItem (added scraper_type); introduced BaseGeneralDataProcessor/BaseGeneralScraper with abstract _get_page_content and _build_item_data, sanitize_html, LLM parsing, GENERAL_TEXT_LIMIT, and DEFAULT_OPENAI_MODEL.
  • Firecrawl backend & client (app/services/scrapers/general/firecrawl.py, app/services/scrapers/general/firecrawl_client.py)
    Added FirecrawlDataProcessor and FirecrawlScraper; switched the client to AsyncFirecrawl, made scrape_url async, removed timeout_seconds and extra_params, added timeout, exclude_tags, and wait_for parameters; added FIRECRAWL_EXCLUDE_TAGS.
  • Zyte backend (app/services/scrapers/general/zyte.py)
    New ZyteDataProcessor and ZyteScraper using AsyncZyteAPI, extracting metadata/body and feeding _build_item_data.
  • Scraper factory & manager (app/services/scrapers/general/scraper.py, app/services/scrapers/scraper_manager.py)
    Added GeneralScraper with SCRAPER_REGISTRY (includes FIRECRAWL and ZYTE) plus registration/discovery APIs; ScraperManager now initializes and uses GeneralScraper for "other"/"unknown"; replaced init_firecrawl_scraper with init_general_scraper.
  • Integration & parsing utilities (app/services/telegram_bot/__init__.py, app/utils/parse.py)
    Replaced FIRECRAWL_ON with GENERAL_SCRAPING_ON in bot flows; fixed "Occured"→"Occurred"; commented out the direct bot.send_message call in the error path. Overhauled telegram_message_html_trim (DOCTYPE removal, broad tag removal/unwrapping, headings→bold) and simplified the newline handling in wrap_text_into_html.
  • Small edits (app/services/scrapers/scraper_manager.py, other minor files)
    Updated imports and usages to the new general scraper types and async client signatures; minor refactors and logging updates across scrapers.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant General as GeneralScraper
    participant Registry as SCRAPER_REGISTRY
    participant Backend as BackendScraper
    participant Processor as DataProcessor

    Caller->>General: request processor for URL
    General->>Registry: lookup scraper type (GENERAL_SCRAPING_API)
    Registry-->>General: returns BackendScraper class
    General->>Backend: instantiate & call get_processor_by_url(url)
    Backend->>Processor: create processor with url
    Processor->>Processor: async _get_page_content()
    Processor->>Processor: _process_*_result → _build_item_data()
    Processor-->>General: DataProcessor ready (get_item)
    General-->>Caller: returns Processor

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped from code to string and vine,
New scrapers join the registry line.
Firecrawl, Zyte — backends in a row,
Data blooms where HTML used to grow.
A rabbit cheers: "Scrape on, off we go!"

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (3 passed)
  • Description Check: skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: the title 'Refactor general webpage scraping and fix HTML santizing' accurately describes the main changes: refactoring scraping infrastructure and addressing HTML sanitization issues.
  • Merge Conflict Detection: no merge conflicts detected when merging into main.


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
app/config.py (1)

233-233: ⚠️ Potential issue | 🟡 Minor

HTTP_REQUEST_TIMEOUT has inconsistent typing — will be str when set from env, int when falling back to default.

env.get() returns a string for set environment variables but the default 30 is an int. This will cause type mismatches downstream.

Proposed fix
-HTTP_REQUEST_TIMEOUT = env.get("HTTP_REQUEST_TIMEOUT", 30)
+HTTP_REQUEST_TIMEOUT = int(env.get("HTTP_REQUEST_TIMEOUT", 30))
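The same coercion concern applies to any numeric setting read from the environment. A minimal sketch of the pattern (get_int_env is a hypothetical helper, not part of the project's config module):

```python
import os

def get_int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment with a typed fallback.

    os.environ values are always strings, so explicit int() coercion keeps
    the result's type consistent whether or not the variable is set.
    """
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

os.environ["HTTP_REQUEST_TIMEOUT"] = "45"
print(get_int_env("HTTP_REQUEST_TIMEOUT", 30))  # 45 (an int, not "45")
```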
app/services/telegram_bot/__init__.py (2)

209-221: ⚠️ Potential issue | 🔴 Critical

Bug: successful general scrape is immediately overwritten with "no supported url found".

When GENERAL_SCRAPING_ON is True and the URL is processed successfully (lines 210–217), execution falls through to line 218 which overwrites the process message with "no supported url found" and returns. The user sees the error text even though processing succeeded.

Add a return (or restructure with else) after the successful scrape path:

Proposed fix
         if url_metadata.source == "unknown":
             if GENERAL_SCRAPING_ON:
                 await process_message.edit_text(
                     text=f"Uncategorized url found. General webpage parser is on, Processing..."
                 )
                 metadata_item = await content_process_function(url_metadata=url_metadata)
                 await send_item_message(
                     metadata_item, chat_id=message.chat_id
                 )
+                await process_message.delete()
+                return
             await process_message.edit_text(
                 text=f"For the {i + 1} th url, no supported url found."
             )
             return

351-358: ⚠️ Potential issue | 🟠 Major

After a successful general scrape, the "no supported url found" branch still executes.

On line 351, when GENERAL_SCRAPING_ON is True and source == "unknown", the URL is processed and sent. But execution continues to line 356 which matches source == "unknown" again, logs "no supported url found", and returns — preventing any further processing. Consider adding continue after the successful scrape to avoid the misleading log and early return.

Proposed fix
         if url_metadata.source == "unknown" and GENERAL_SCRAPING_ON:
             metadata_item = await content_process_function(url_metadata=url_metadata)
             await send_item_message(
                 metadata_item, chat_id=message.chat_id, message=message
             )
+            continue
         if url_metadata.source == "unknown" or url_metadata.source == "banned":
🤖 Fix all issues with AI agents
In `@app/services/scrapers/general/base.py`:
- Around line 84-88: The code can raise TypeError when description is falsy and
markdown_content is None because None[:500] is invalid; update the processing
block in base.py to defensively handle None by coalescing markdown_content to an
empty string (or using conditional slicing) before taking [:500], then pass the
resulting string into BeautifulSoup and set item_data["text"]; reference the
variables description, markdown_content, text, BeautifulSoup, and item_data when
making the change.
- Around line 133-136: The loop that removes DOCTYPE nodes mutates soup.contents
while iterating which can skip nodes; change the iteration to iterate over a
snapshot (e.g. wrap soup.contents in list(...)) so every Doctype is visited and
extracted — update the loop that uses soup and Doctype in
app/services/scrapers/general/base.py to iterate over list(soup.contents) (or
use soup.find_all/explicit copy) before calling item.extract().
- Around line 89-96: The else branch can pass None into wrap_text_into_html when
markdown_content is None; update the logic in the block using
parsing_article_body_by_llm, sanitize_html and wrap_text_into_html so
markdown_content is guarded with a fallback (e.g., use markdown_content or "" )
before calling wrap_text_into_html(markdown_content, is_html=False), ensuring
wrap_text_into_html always receives a string.

In `@app/services/scrapers/general/firecrawl.py`:
- Around line 24-36: The call to the synchronous Firecrawl client in
_get_page_content is blocking the event loop; change the call to
self._client.scrape_url so it runs on a background thread (e.g., use
asyncio.to_thread or loop.run_in_executor) and await that result before passing
it to _process_firecrawl_result; also add the necessary import for asyncio at
the top and keep the existing try/except and logger.error behavior unchanged.

In `@app/services/scrapers/general/scraper.py`:
- Around line 36-42: self.scraper_type can be None which causes
self.scraper_type.upper() in _init_scraper to raise AttributeError; ensure a
safe default string is used when both the constructor argument and
GENERAL_SCRAPING_API are None by normalizing scraper_type before calling
.upper(). Update the assignment or _init_scraper so it uses something like
scraper_type = (scraper_type or GENERAL_SCRAPING_API or "GENERAL") (or otherwise
coerce to str) and then call .upper() against that value; reference symbols:
scraper_type, GENERAL_SCRAPING_API, _init_scraper, SCRAPER_REGISTRY, and the
.upper() call to locate the change.

In `@app/services/scrapers/general/zyte.py`:
- Around line 22-35: The AsyncZyteAPI instance (AsyncZyteAPI) creates an
internal aiohttp session that must be closed via its async context manager;
update the Zyte scraping block to open the client's session with "async with
client.session():" and call the existing AsyncZyteAPI.get(...) inside that
context so the underlying connection pool is always closed even on exceptions,
then pass the returned result to self._process_zyte_result(result) as before and
keep the current exception logging/raising.

In `@app/services/scrapers/scraper_manager.py`:
- Around line 54-57: init_general_scraper currently constructs and returns a
GeneralScraper but never assigns it to the class attribute, so the guard
checking cls.general_scraper in init_scraper will always be true; modify
init_general_scraper to assign the new instance to cls.general_scraper (e.g.,
cls.general_scraper = GeneralScraper()) and then return cls.general_scraper so
subsequent calls reuse the same scraper instance.

In `@app/utils/parse.py`:
- Around line 130-133: The loop mutates the live list soup.contents while
iterating (using Doctype and item.extract()), which can skip nodes; fix by
iterating over a snapshot of the contents instead (e.g., iterate over
list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.
- Around line 151-154: The loop that unwraps <p> tags (the for tag in
soup.find_all("p") block) currently calls tag.unwrap() which merges consecutive
paragraphs; before unwrapping each <p> insert a paragraph separator (e.g., a
newline or double-newline as a NavigableString or a <br> element) so paragraph
boundaries are preserved in the resulting Telegram text, then call tag.unwrap();
ensure any required import (NavigableString) is added if using it.
🧹 Nitpick comments (2)
app/services/scrapers/scraper_manager.py (1)

18-21: scrapers dict captures None at class-definition time, not a live reference.

The dict values for "other" and "unknown" are bound to the current value of general_scraper (which is None) when the class body executes. After init_scraper updates the dict entry, subsequent lookups via cls.scrapers["other"] will work, but this pattern is fragile and misleading. Consider converting scrapers into a @classmethod property or populating it lazily to avoid this gotcha.
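The gotcha generalizes beyond scrapers: a dict literal in a class body freezes the value an attribute holds at definition time. A toy sketch of the pattern (Manager and its members are illustrative, not the project's code):

```python
class Manager:
    scraper = None
    # The dict captures the *current* value of scraper (None) while the
    # class body executes; it is not a live reference to the attribute.
    registry = {"other": scraper}

    @classmethod
    def init_scraper(cls):
        cls.scraper = object()
        # Without this explicit re-assignment, registry["other"] would
        # stay None even after cls.scraper is set.
        cls.registry["other"] = cls.scraper

print(Manager.registry["other"])  # None until init_scraper() rewires it
Manager.init_scraper()
print(Manager.registry["other"] is Manager.scraper)  # True
```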

app/services/scrapers/general/base.py (1)

116-150: Near-duplicate sanitization logic with telegram_message_html_trim in parse.py.

Both methods remove DOCTYPEs, decompose script/style/etc., and unwrap structural tags, but with slightly different tag lists. This is a potential source of drift. Consider extracting a shared sanitization utility that both call sites can reuse, or at minimum add a comment cross-referencing the two implementations.

Comment on lines +133 to +136
        # Remove DOCTYPE declarations
        for item in soup.contents:
            if isinstance(item, Doctype):
                item.extract()

⚠️ Potential issue | 🟡 Minor

Same live-list mutation issue — iterate over a snapshot of soup.contents.

Same issue as in parse.py: calling .extract() while iterating over soup.contents can skip DOCTYPE nodes.

Proposed fix
         # Remove DOCTYPE declarations
-        for item in soup.contents:
+        for item in list(soup.contents):
             if isinstance(item, Doctype):
                 item.extract()
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/base.py` around lines 133 - 136, The loop that
removes DOCTYPE nodes mutates soup.contents while iterating which can skip
nodes; change the iteration to iterate over a snapshot (e.g. wrap soup.contents
in list(...)) so every Doctype is visited and extracted — update the loop that
uses soup and Doctype in app/services/scrapers/general/base.py to iterate over
list(soup.contents) (or use soup.find_all/explicit copy) before calling
item.extract().
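The skip is a property of mutating any live sequence during iteration, not of BeautifulSoup specifically. A plain-list sketch of the failure mode and the snapshot fix:

```python
def remove_evens_buggy(items):
    # Removing from the list being iterated shifts later elements left,
    # so the iterator's internal index steps past the element that moved
    # into the freed slot.
    for x in items:
        if x % 2 == 0:
            items.remove(x)
    return items

def remove_evens_fixed(items):
    # Iterating a snapshot (list(...)) visits every original element,
    # mirroring the list(soup.contents) fix suggested in the review.
    for x in list(items):
        if x % 2 == 0:
            items.remove(x)
    return items

print(remove_evens_buggy([2, 2, 3]))  # [2, 3] -- the second 2 was skipped
print(remove_evens_fixed([2, 2, 3]))  # [3]
```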

Comment on lines +24 to +36
    async def _get_page_content(self) -> None:
        try:
            result = self._client.scrape_url(
                url=self.url,
                formats=["markdown", "html"],
                only_main_content=True,
                exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
                wait_for=FIRECRAWL_WAIT_FOR,
            )
            await self._process_firecrawl_result(result)
        except Exception as e:
            logger.error(f"Failed to scrape URL with Firecrawl: {e}")
            raise

⚠️ Potential issue | 🟠 Major

Synchronous scrape_url blocks the async event loop.

FirecrawlClient.scrape_url() is a synchronous method (it calls self._app.scrape(...) from the Firecrawl SDK). Calling it directly inside this async def _get_page_content will block the event loop for the duration of the HTTP request to Firecrawl.

Wrap the blocking call with asyncio.to_thread (or loop.run_in_executor):

Proposed fix
+import asyncio
+from functools import partial
+
 ...
 
     async def _get_page_content(self) -> None:
         try:
-            result = self._client.scrape_url(
-                url=self.url,
-                formats=["markdown", "html"],
-                only_main_content=True,
-                exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
-                wait_for=FIRECRAWL_WAIT_FOR,
-            )
+            result = await asyncio.to_thread(
+                partial(
+                    self._client.scrape_url,
+                    url=self.url,
+                    formats=["markdown", "html"],
+                    only_main_content=True,
+                    exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
+                    wait_for=FIRECRAWL_WAIT_FOR,
+                )
+            )
             await self._process_firecrawl_result(result)
         except Exception as e:
             logger.error(f"Failed to scrape URL with Firecrawl: {e}")
             raise
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/firecrawl.py` around lines 24 - 36, The call to
the synchronous Firecrawl client in _get_page_content is blocking the event
loop; change the call to self._client.scrape_url so it runs on a background
thread (e.g., use asyncio.to_thread or loop.run_in_executor) and await that
result before passing it to _process_firecrawl_result; also add the necessary
import for asyncio at the top and keep the existing try/except and logger.error
behavior unchanged.
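For reference, the offloading pattern in isolation (blocking_fetch stands in for the synchronous SDK call; it is not the project's client):

```python
import asyncio
import time

def blocking_fetch(url: str) -> str:
    """Stand-in for a synchronous SDK call that would stall the event loop."""
    time.sleep(0.1)  # simulates network latency
    return f"content of {url}"

async def main() -> None:
    # asyncio.to_thread runs the blocking function on a worker thread and
    # returns an awaitable, so the event loop stays free for other tasks.
    result = await asyncio.to_thread(blocking_fetch, "https://example.com")
    print(result)

asyncio.run(main())  # prints: content of https://example.com
```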

Comment on lines +36 to +42
        self.scraper_type = scraper_type or GENERAL_SCRAPING_API
        self._scraper: Optional[BaseGeneralScraper] = None
        self._init_scraper()

    def _init_scraper(self) -> None:
        """Initialize the underlying scraper based on scraper_type."""
        scraper_class = self.SCRAPER_REGISTRY.get(self.scraper_type.upper())

⚠️ Potential issue | 🟠 Major

AttributeError if both scraper_type and GENERAL_SCRAPING_API are None.

self.scraper_type can be None when the caller passes nothing and the config variable is unset. Calling .upper() on None at line 42 will raise AttributeError.

Proposed fix — default to a known value
-        self.scraper_type = scraper_type or GENERAL_SCRAPING_API
+        self.scraper_type = scraper_type or GENERAL_SCRAPING_API or "FIRECRAWL"
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/scraper.py` around lines 36 - 42,
self.scraper_type can be None which causes self.scraper_type.upper() in
_init_scraper to raise AttributeError; ensure a safe default string is used when
both the constructor argument and GENERAL_SCRAPING_API are None by normalizing
scraper_type before calling .upper(). Update the assignment or _init_scraper so
it uses something like scraper_type = (scraper_type or GENERAL_SCRAPING_API or
"GENERAL") (or otherwise coerce to str) and then call .upper() against that
value; reference symbols: scraper_type, GENERAL_SCRAPING_API, _init_scraper,
SCRAPER_REGISTRY, and the .upper() call to locate the change.
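The or-chain normalization in isolation (resolve_scraper_type and its fallback value are illustrative; the actual default would come from the project's config):

```python
def resolve_scraper_type(explicit=None, configured=None, fallback="FIRECRAWL"):
    # `or` returns the first truthy operand, so None (and "") fall through
    # to the fallback before .upper() is ever called on a None value.
    return (explicit or configured or fallback).upper()

print(resolve_scraper_type(None, None))           # FIRECRAWL
print(resolve_scraper_type(None, "zyte"))         # ZYTE
print(resolve_scraper_type("firecrawl", "zyte"))  # FIRECRAWL
```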

Comment on lines +22 to +35
        try:
            client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
            result = await client.get(
                {
                    "url": self.url,
                    "browserHtml": True,
                    "article": True,
                    "articleOptions": {"extractFrom": "browserHtml"},
                }
            )
            await self._process_zyte_result(result)
        except Exception as e:
            logger.error(f"Failed to scrape URL with Zyte: {e}")
            raise

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Zyte API AsyncZyteAPI Python context manager resource cleanup

💡 Result:

For resource cleanup with zyte_api.AsyncZyteAPI, the main thing you need to close is the session object returned by client.session() (it manages an internal aiohttp.ClientSession connection pool).

Recommended: use it as an async context manager

from zyte_api import AsyncZyteAPI

async def main():
    client = AsyncZyteAPI()
    async with client.session() as session:
        r = await session.get({"url": "https://example.com", "httpResponseBody": True})
        ...

Exiting the async with block performs the cleanup automatically. [1] [2]

Alternative: manual cleanup (equivalent, but you must remember it)

session = client.session()
try:
    r = await session.get({"url": "https://example.com"})
finally:
    await session.close()

This is explicitly documented as sufficient—you don’t have to use the context manager as long as you await session.close() when done. [1]

If you pass your own aiohttp.ClientSession

If you call await client.get(..., session=your_aiohttp_session), then you own that aiohttp session lifecycle and must close it yourself (Zyte won’t). [1]

Sources:
[1] python-zyte-api API reference (AsyncZyteAPI, session(), close())
[2] Zyte docs Python examples using async with client.session()


🏁 Script executed:

# Search for AsyncZyteAPI imports and usage in the codebase
rg "AsyncZyteAPI" --type py -B 2 -A 5

Repository: aturret/FastFetchBot

Length of output: 1095


🏁 Script executed:

# Check if there are any existing Zyte API client usage patterns
rg "ZYTE_API_KEY" --type py -B 3 -A 3

Repository: aturret/FastFetchBot

Length of output: 1573


AsyncZyteAPI session requires proper async context management — resource leak risk.

AsyncZyteAPI.get() creates an internal session that manages an aiohttp.ClientSession connection pool. The current code does not close this session, causing connection leaks.

The client must use client.session() as an async context manager:

Proposed fix — use session context manager
         try:
             client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
-            result = await client.get(
-                {
-                    "url": self.url,
-                    "browserHtml": True,
-                    "article": True,
-                    "articleOptions": {"extractFrom": "browserHtml"},
-                }
-            )
+            async with client.session() as session:
+                result = await session.get(
+                    {
+                        "url": self.url,
+                        "browserHtml": True,
+                        "article": True,
+                        "articleOptions": {"extractFrom": "browserHtml"},
+                    }
+                )
             await self._process_zyte_result(result)
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/zyte.py` around lines 22 - 35, The AsyncZyteAPI
instance (AsyncZyteAPI) creates an internal aiohttp session that must be closed
via its async context manager; update the Zyte scraping block to open the
client's session with "async with client.session():" and call the existing
AsyncZyteAPI.get(...) inside that context so the underlying connection pool is
always closed even on exceptions, then pass the returned result to
self._process_zyte_result(result) as before and keep the current exception
logging/raising.

Comment on lines +130 to +133
    # Remove DOCTYPE declarations
    for item in soup.contents:
        if isinstance(item, Doctype):
            item.extract()

⚠️ Potential issue | 🟡 Minor

Mutating soup.contents while iterating over it can skip nodes.

soup.contents is a live list; calling .extract() during iteration may cause elements to be skipped. Snapshot the list first.

Proposed fix
     # Remove DOCTYPE declarations
-    for item in soup.contents:
+    for item in list(soup.contents):
         if isinstance(item, Doctype):
             item.extract()
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code and has no issues with indentation.

Suggested change
     # Remove DOCTYPE declarations
-    for item in soup.contents:
+    for item in list(soup.contents):
         if isinstance(item, Doctype):
             item.extract()
🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 130 - 133, The loop mutates the live list
soup.contents while iterating (using Doctype and item.extract()), which can skip
nodes; fix by iterating over a snapshot of the contents instead (e.g., iterate
over list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.

Comment on lines +151 to +154
    # Unwrap <p> tags (keep text content)
    for tag in soup.find_all("p"):
        tag.unwrap()


⚠️ Potential issue | 🟠 Major

Unwrapping <p> tags loses paragraph separation.

When <p> tags are unwrapped without inserting a separator (e.g., <br> or newline), consecutive paragraphs will merge into a single run of text, harming readability in the Telegram message.

Proposed fix — insert a line break before unwrapping
     # Unwrap <p> tags (keep text content)
     for tag in soup.find_all("p"):
+        tag.insert_after(soup.new_string("\n"))
         tag.unwrap()
🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 151 - 154, The loop that unwraps <p> tags
(the for tag in soup.find_all("p") block) currently calls tag.unwrap() which
merges consecutive paragraphs; before unwrapping each <p> insert a paragraph
separator (e.g., a newline or double-newline as a NavigableString or a <br>
element) so paragraph boundaries are preserved in the resulting Telegram text,
then call tag.unwrap(); ensure any required import (NavigableString) is added if
using it.

@aturret (Owner, Author) commented Feb 14, 2026

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

@aturret (Owner, Author) commented Feb 14, 2026

@codex review

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad6ec6cb10


            raise

    async def _process_zyte_result(self, result: dict) -> None:
        article = result.get("article", {})


P1 Badge Guard against null Zyte article payloads

If Zyte returns "article": null (which can happen on pages where article extraction fails), result.get("article", {}) yields None, and subsequent calls like article.get(...) raise AttributeError before the browserHtml fallback can run. This makes general scraping fail for those URLs when the Zyte backend is selected.
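The distinction between an absent key and an explicit null is easy to check (a sketch with made-up payloads, not real Zyte responses):

```python
# dict.get(key, default) only falls back when the key is *absent*;
# an explicit None value is returned as-is, so a later article.get(...)
# would raise AttributeError. Coalescing with `or {}` covers both cases.
payload_missing = {"browserHtml": "<html>...</html>"}
payload_null = {"article": None, "browserHtml": "<html>...</html>"}

print(payload_missing.get("article", {}))  # {}
print(payload_null.get("article", {}))     # None -- the default is not used
print(payload_null.get("article") or {})   # {} -- safe to call .get() on
```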


     print(text)
-    split_pivot = "\n" if is_html is False else "<br>"
-    text_list = text.split(split_pivot)
+    text_list = text.split("\n")


P2 Badge Preserve HTML structure when wrapping HTML content

Splitting all HTML input on raw newlines when is_html=True breaks pretty-printed markup into line fragments and wraps each fragment in <p>, which can corrupt nested structures (e.g., lists/tables become invalid paragraph-wrapped fragments). This degrades rendered content quality for scraped pages that include formatting newlines, not just explicit <br> line breaks.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/services/scrapers/scraper_manager.py`:
- Around line 18-21: The class-level scrapers dict is populated with None at
import time causing "other" and "unknown" to fall out of sync when init_scraper
sets only one key; update init_scraper (the method updating cls.general_scraper
and cls.scrapers) so that when you assign the general_scraper instance you also
set both cls.scrapers["other"] and cls.scrapers["unknown"] to that instance (or
change scrapers to be a property/method that returns {"other":
cls.general_scraper, "unknown": cls.general_scraper, ...} based on the current
class attributes); edit the init_scraper function and the scrapers reference to
ensure both keys always reference the same non-None scraper instance.
🧹 Nitpick comments (5)
app/services/scrapers/scraper_manager.py (2)

18-21: Annotate mutable class attribute with ClassVar.

Ruff (RUF012) flags that scrapers is a mutable class attribute without a ClassVar annotation. This helps type checkers distinguish class-level state from instance-level state.

Proposed fix
+from typing import ClassVar, Optional
 ...
-    scrapers = {"bluesky": bluesky_scraper,
+    scrapers: ClassVar[dict] = {"bluesky": bluesky_scraper,

44-47: Partially initialized scraper persists if init() throws.

If await cls.bluesky_scraper.init() raises, cls.bluesky_scraper is already assigned (line 45). Future calls to init_scraper("bluesky") will see it's not None and skip re-initialization, leaving a broken instance in place.

Proposed fix
     `@classmethod`
     async def init_bluesky_scraper(cls) -> BlueskyScraper:
-        cls.bluesky_scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
-        await cls.bluesky_scraper.init()
-        return cls.bluesky_scraper
+        scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
+        await scraper.init()
+        cls.bluesky_scraper = scraper
+        return cls.bluesky_scraper
app/services/scrapers/general/firecrawl_client.py (1)

26-57: Thread-safe singleton holding an async client — consider event-loop affinity.

The threading.Lock-based singleton is fine for creation, but AsyncFirecrawl likely holds an internal httpx.AsyncClient (or similar) that is bound to the event loop active at creation time. If get_instance() is first called from a different loop (e.g., during tests or worker startup) than the one that later awaits scrape_url, you may get "attached to a different loop" errors. This is a low-risk concern for a single-loop server but worth noting.

app/services/scrapers/general/base.py (2)

170-171: AsyncOpenAI client instantiated on every LLM call.

A new AsyncOpenAI is created each invocation of parsing_article_body_by_llm, which allocates a new HTTP connection pool every time. Consider creating it once (e.g., as a module-level or class-level singleton) and reusing it.

Proposed refactor sketch
+_openai_client: Optional[AsyncOpenAI] = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+    return _openai_client
+
 ...
     `@staticmethod`
     async def parsing_article_body_by_llm(html_content: str) -> str:
         ...
         try:
-            client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+            client = _get_openai_client()

46-46: MD5 used for URL-based ID generation (static analysis flag S324).

The static analysis tool flags hashlib.md5 as insecure. Since this is only used to derive a short deterministic ID from a URL (not for security/integrity), the risk is negligible. If you want to silence the linter, you can add hashlib.md5(url.encode(), usedforsecurity=False) (Python 3.9+).
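A sketch of the suggested call (short_id is a hypothetical name; the usedforsecurity flag requires Python 3.9+):

```python
import hashlib

def short_id(url: str) -> str:
    # usedforsecurity=False signals to linters and FIPS-restricted builds
    # that this digest is a deterministic identifier, not a security
    # primitive.
    return hashlib.md5(url.encode(), usedforsecurity=False).hexdigest()[:12]

print(short_id("https://example.com/article"))  # same 12 hex chars every run
```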

@aturret aturret merged commit cef9849 into main Feb 15, 2026
1 of 2 checks passed
@aturret aturret deleted the zyte-api branch February 16, 2026 04:59