
Feat: Refactor general webpage scraping and fix HTML sanitizing #49

Merged
aturret merged 9 commits into main from zyte-api on Feb 15, 2026
Conversation

@aturret (Owner) commented Feb 14, 2026

Summary by CodeRabbit

  • New Features

    • New, configurable general scraping framework with selectable backends (including Zyte); new scraping-related environment options.
  • Improvements

    • Cleaner HTML sanitization and text extraction for better message formatting.
    • Unified, asynchronous scraping flow for more reliable page processing and media handling.
  • Bug Fixes

    • Fixed typo in a user-facing message ("Occurred").
  • Chores

    • Added top-level .claude to .gitignore.

@coderabbitai bot (Contributor) commented Feb 14, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Refactors Firecrawl-specific scraping into a pluggable GeneralScraper with Firecrawl and Zyte backends; renames config keys to GENERAL_SCRAPING_* and FIRECRAWL_WAIT_FOR; introduces async Firecrawl client, Zyte integration, generalized data processors/items, parsing sanitization updates, and bot config wiring changes.

Changes

  • Configuration & Ignore (/.gitignore, template.env, pyproject.toml, app/config.py)
    Added .claude to .gitignore; new env keys: GENERAL_SCRAPING_ON, GENERAL_SCRAPING_API, FIRECRAWL_WAIT_FOR, ZYTE_API_KEY; removed FIRECRAWL_ON and FIRECRAWL_TIMEOUT_SECONDS; added HTTP_REQUEST_TIMEOUT and the zyte-api dependency.
  • Core abstraction & models (app/services/scrapers/general/__init__.py, app/services/scrapers/general/base.py)
    Renamed FirecrawlItem→GeneralItem (added scraper_type); introduced BaseGeneralDataProcessor/BaseGeneralScraper with abstract _get_page_content and _build_item_data, sanitize_html, LLM parsing, GENERAL_TEXT_LIMIT, and DEFAULT_OPENAI_MODEL.
  • Firecrawl backend & client (app/services/scrapers/general/firecrawl.py, app/services/scrapers/general/firecrawl_client.py)
    Added FirecrawlDataProcessor and FirecrawlScraper; switched the client to AsyncFirecrawl, made scrape_url async, removed timeout_seconds and extra_params, added timeout, exclude_tags, and wait_for parameters; added FIRECRAWL_EXCLUDE_TAGS.
  • Zyte backend (app/services/scrapers/general/zyte.py)
    New ZyteDataProcessor and ZyteScraper using AsyncZyteAPI, extracting metadata/body and feeding _build_item_data.
  • Scraper factory & manager (app/services/scrapers/general/scraper.py, app/services/scrapers/scraper_manager.py)
    Added GeneralScraper with SCRAPER_REGISTRY (includes FIRECRAWL and ZYTE) plus registration/discovery APIs; ScraperManager now initializes and uses GeneralScraper for "other"/"unknown"; replaced init_firecrawl_scraper with init_general_scraper.
  • Integration & parsing utilities (app/services/telegram_bot/__init__.py, app/utils/parse.py)
    Replaced FIRECRAWL_ON with GENERAL_SCRAPING_ON in bot flows; fixed "Occured"→"Occurred"; commented out the direct bot.send_message call in the error path. Overhauled telegram_message_html_trim (DOCTYPE removal, broad tag removal/unwrapping, headings→bold) and simplified the newline handling in wrap_text_into_html.
  • Small edits (app/services/scrapers/scraper_manager.py, other minor files)
    Updated imports and usages to the new general scraper types and async client signatures; minor refactors and logging updates across scrapers.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant General as GeneralScraper
    participant Registry as SCRAPER_REGISTRY
    participant Backend as BackendScraper
    participant Processor as DataProcessor

    Caller->>General: request processor for URL
    General->>Registry: lookup scraper type (GENERAL_SCRAPING_API)
    Registry-->>General: returns BackendScraper class
    General->>Backend: instantiate & call get_processor_by_url(url)
    Backend->>Processor: create processor with url
    Processor->>Processor: async _get_page_content()
    Processor->>Processor: _process_*_result → _build_item_data()
    Processor-->>General: DataProcessor ready (get_item)
    General-->>Caller: returns Processor

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped from code to string and vine,
New scrapers join the registry line.
Firecrawl, Zyte — backends in a row,
Data blooms where HTML used to grow.
A rabbit cheers: "Scrape on, off we go!"

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (3 passed)
  • Description Check: skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: the title 'Refactor general webpage scraping and fix HTML santizing' accurately describes the main changes: refactoring scraping infrastructure and addressing HTML sanitization issues.
  • Merge Conflict Detection: no merge conflicts detected when merging into main.


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
app/config.py (1)

233-233: ⚠️ Potential issue | 🟡 Minor

HTTP_REQUEST_TIMEOUT has inconsistent typing — will be str when set from env, int when falling back to default.

env.get() returns a string for set environment variables but the default 30 is an int. This will cause type mismatches downstream.

Proposed fix
-HTTP_REQUEST_TIMEOUT = env.get("HTTP_REQUEST_TIMEOUT", 30)
+HTTP_REQUEST_TIMEOUT = int(env.get("HTTP_REQUEST_TIMEOUT", 30))
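The same coercion concern applies to any numeric setting read from the environment. A minimal sketch of the pattern (get_int_env is a hypothetical helper, not part of the project's config module):

```python
import os

def get_int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment with a typed fallback.

    os.environ values are always strings, so explicit int() coercion keeps
    the result's type consistent whether or not the variable is set.
    """
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

os.environ["HTTP_REQUEST_TIMEOUT"] = "45"
print(get_int_env("HTTP_REQUEST_TIMEOUT", 30))  # 45 (an int, not "45")
```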
app/services/telegram_bot/__init__.py (2)

209-221: ⚠️ Potential issue | 🔴 Critical

Bug: successful general scrape is immediately overwritten with "no supported url found".

When GENERAL_SCRAPING_ON is True and the URL is processed successfully (lines 210–217), execution falls through to line 218 which overwrites the process message with "no supported url found" and returns. The user sees the error text even though processing succeeded.

Add a return (or restructure with else) after the successful scrape path:

Proposed fix
         if url_metadata.source == "unknown":
             if GENERAL_SCRAPING_ON:
                 await process_message.edit_text(
                     text=f"Uncategorized url found. General webpage parser is on, Processing..."
                 )
                 metadata_item = await content_process_function(url_metadata=url_metadata)
                 await send_item_message(
                     metadata_item, chat_id=message.chat_id
                 )
+                await process_message.delete()
+                return
             await process_message.edit_text(
                 text=f"For the {i + 1} th url, no supported url found."
             )
             return

351-358: ⚠️ Potential issue | 🟠 Major

After a successful general scrape, the "no supported url found" branch still executes.

On line 351, when GENERAL_SCRAPING_ON is True and source == "unknown", the URL is processed and sent. But execution continues to line 356 which matches source == "unknown" again, logs "no supported url found", and returns — preventing any further processing. Consider adding continue after the successful scrape to avoid the misleading log and early return.

Proposed fix
         if url_metadata.source == "unknown" and GENERAL_SCRAPING_ON:
             metadata_item = await content_process_function(url_metadata=url_metadata)
             await send_item_message(
                 metadata_item, chat_id=message.chat_id, message=message
             )
+            continue
         if url_metadata.source == "unknown" or url_metadata.source == "banned":
🤖 Fix all issues with AI agents
In `@app/services/scrapers/general/base.py`:
- Around line 84-88: The code can raise TypeError when description is falsy and
markdown_content is None because None[:500] is invalid; update the processing
block in base.py to defensively handle None by coalescing markdown_content to an
empty string (or using conditional slicing) before taking [:500], then pass the
resulting string into BeautifulSoup and set item_data["text"]; reference the
variables description, markdown_content, text, BeautifulSoup, and item_data when
making the change.
- Around line 133-136: The loop that removes DOCTYPE nodes mutates soup.contents
while iterating which can skip nodes; change the iteration to iterate over a
snapshot (e.g. wrap soup.contents in list(...)) so every Doctype is visited and
extracted — update the loop that uses soup and Doctype in
app/services/scrapers/general/base.py to iterate over list(soup.contents) (or
use soup.find_all/explicit copy) before calling item.extract().
- Around line 89-96: The else branch can pass None into wrap_text_into_html when
markdown_content is None; update the logic in the block using
parsing_article_body_by_llm, sanitize_html and wrap_text_into_html so
markdown_content is guarded with a fallback (e.g., use markdown_content or "" )
before calling wrap_text_into_html(markdown_content, is_html=False), ensuring
wrap_text_into_html always receives a string.

In `@app/services/scrapers/general/firecrawl.py`:
- Around line 24-36: The call to the synchronous Firecrawl client in
_get_page_content is blocking the event loop; change the call to
self._client.scrape_url so it runs on a background thread (e.g., use
asyncio.to_thread or loop.run_in_executor) and await that result before passing
it to _process_firecrawl_result; also add the necessary import for asyncio at
the top and keep the existing try/except and logger.error behavior unchanged.

In `@app/services/scrapers/general/scraper.py`:
- Around line 36-42: self.scraper_type can be None which causes
self.scraper_type.upper() in _init_scraper to raise AttributeError; ensure a
safe default string is used when both the constructor argument and
GENERAL_SCRAPING_API are None by normalizing scraper_type before calling
.upper(). Update the assignment or _init_scraper so it uses something like
scraper_type = (scraper_type or GENERAL_SCRAPING_API or "GENERAL") (or otherwise
coerce to str) and then call .upper() against that value; reference symbols:
scraper_type, GENERAL_SCRAPING_API, _init_scraper, SCRAPER_REGISTRY, and the
.upper() call to locate the change.

In `@app/services/scrapers/general/zyte.py`:
- Around line 22-35: The AsyncZyteAPI instance (AsyncZyteAPI) creates an
internal aiohttp session that must be closed via its async context manager;
update the Zyte scraping block to open the client's session with "async with
client.session():" and call the existing AsyncZyteAPI.get(...) inside that
context so the underlying connection pool is always closed even on exceptions,
then pass the returned result to self._process_zyte_result(result) as before and
keep the current exception logging/raising.

In `@app/services/scrapers/scraper_manager.py`:
- Around line 54-57: init_general_scraper currently constructs and returns a
GeneralScraper but never assigns it to the class attribute, so the guard
checking cls.general_scraper in init_scraper will always be true; modify
init_general_scraper to assign the new instance to cls.general_scraper (e.g.,
cls.general_scraper = GeneralScraper()) and then return cls.general_scraper so
subsequent calls reuse the same scraper instance.

In `@app/utils/parse.py`:
- Around line 130-133: The loop mutates the live list soup.contents while
iterating (using Doctype and item.extract()), which can skip nodes; fix by
iterating over a snapshot of the contents instead (e.g., iterate over
list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.
- Around line 151-154: The loop that unwraps <p> tags (the for tag in
soup.find_all("p") block) currently calls tag.unwrap() which merges consecutive
paragraphs; before unwrapping each <p> insert a paragraph separator (e.g., a
newline or double-newline as a NavigableString or a <br> element) so paragraph
boundaries are preserved in the resulting Telegram text, then call tag.unwrap();
ensure any required import (NavigableString) is added if using it.
🧹 Nitpick comments (2)
app/services/scrapers/scraper_manager.py (1)

18-21: scrapers dict captures None at class-definition time, not a live reference.

The dict values for "other" and "unknown" are bound to the current value of general_scraper (which is None) when the class body executes. After init_scraper updates the dict entry, subsequent lookups via cls.scrapers["other"] will work, but this pattern is fragile and misleading. Consider converting scrapers into a @classmethod property or populating it lazily to avoid this gotcha.
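The gotcha generalizes beyond scrapers: a dict literal in a class body freezes the value an attribute holds at definition time. A toy sketch of the pattern (Manager and its members are illustrative, not the project's code):

```python
class Manager:
    scraper = None
    # The dict captures the *current* value of scraper (None) while the
    # class body executes; it is not a live reference to the attribute.
    registry = {"other": scraper}

    @classmethod
    def init_scraper(cls):
        cls.scraper = object()
        # Without this explicit re-assignment, registry["other"] would
        # stay None even after cls.scraper is set.
        cls.registry["other"] = cls.scraper

print(Manager.registry["other"])  # None until init_scraper() rewires it
Manager.init_scraper()
print(Manager.registry["other"] is Manager.scraper)  # True
```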

app/services/scrapers/general/base.py (1)

116-150: Near-duplicate sanitization logic with telegram_message_html_trim in parse.py.

Both methods remove DOCTYPEs, decompose script/style/etc., and unwrap structural tags, but with slightly different tag lists. This is a potential source of drift. Consider extracting a shared sanitization utility that both call sites can reuse, or at minimum add a comment cross-referencing the two implementations.

Comment on lines +133 to +136
        # Remove DOCTYPE declarations
        for item in soup.contents:
            if isinstance(item, Doctype):
                item.extract()

⚠️ Potential issue | 🟡 Minor

Same live-list mutation issue — iterate over a snapshot of soup.contents.

Same issue as in parse.py: calling .extract() while iterating over soup.contents can skip DOCTYPE nodes.

Proposed fix
         # Remove DOCTYPE declarations
-        for item in soup.contents:
+        for item in list(soup.contents):
             if isinstance(item, Doctype):
                 item.extract()
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/base.py` around lines 133 - 136, The loop that
removes DOCTYPE nodes mutates soup.contents while iterating which can skip
nodes; change the iteration to iterate over a snapshot (e.g. wrap soup.contents
in list(...)) so every Doctype is visited and extracted — update the loop that
uses soup and Doctype in app/services/scrapers/general/base.py to iterate over
list(soup.contents) (or use soup.find_all/explicit copy) before calling
item.extract().
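The skip is a property of mutating any live sequence during iteration, not of BeautifulSoup specifically. A plain-list sketch of the failure mode and the snapshot fix:

```python
def remove_evens_buggy(items):
    # Removing from the list being iterated shifts later elements left,
    # so the iterator's internal index steps past the element that moved
    # into the freed slot.
    for x in items:
        if x % 2 == 0:
            items.remove(x)
    return items

def remove_evens_fixed(items):
    # Iterating a snapshot (list(...)) visits every original element,
    # mirroring the list(soup.contents) fix suggested in the review.
    for x in list(items):
        if x % 2 == 0:
            items.remove(x)
    return items

print(remove_evens_buggy([2, 2, 3]))  # [2, 3] -- the second 2 was skipped
print(remove_evens_fixed([2, 2, 3]))  # [3]
```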

Comment on lines +24 to +36
    async def _get_page_content(self) -> None:
        try:
            result = self._client.scrape_url(
                url=self.url,
                formats=["markdown", "html"],
                only_main_content=True,
                exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
                wait_for=FIRECRAWL_WAIT_FOR,
            )
            await self._process_firecrawl_result(result)
        except Exception as e:
            logger.error(f"Failed to scrape URL with Firecrawl: {e}")
            raise

⚠️ Potential issue | 🟠 Major

Synchronous scrape_url blocks the async event loop.

FirecrawlClient.scrape_url() is a synchronous method (it calls self._app.scrape(...) from the Firecrawl SDK). Calling it directly inside this async def _get_page_content will block the event loop for the duration of the HTTP request to Firecrawl.

Wrap the blocking call with asyncio.to_thread (or loop.run_in_executor):

Proposed fix
+import asyncio
+from functools import partial
+
 ...
 
     async def _get_page_content(self) -> None:
         try:
-            result = self._client.scrape_url(
-                url=self.url,
-                formats=["markdown", "html"],
-                only_main_content=True,
-                exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
-                wait_for=FIRECRAWL_WAIT_FOR,
-            )
+            result = await asyncio.to_thread(
+                partial(
+                    self._client.scrape_url,
+                    url=self.url,
+                    formats=["markdown", "html"],
+                    only_main_content=True,
+                    exclude_tags=FIRECRAWL_EXCLUDE_TAGS,
+                    wait_for=FIRECRAWL_WAIT_FOR,
+                )
+            )
             await self._process_firecrawl_result(result)
         except Exception as e:
             logger.error(f"Failed to scrape URL with Firecrawl: {e}")
             raise
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/firecrawl.py` around lines 24 - 36, The call to
the synchronous Firecrawl client in _get_page_content is blocking the event
loop; change the call to self._client.scrape_url so it runs on a background
thread (e.g., use asyncio.to_thread or loop.run_in_executor) and await that
result before passing it to _process_firecrawl_result; also add the necessary
import for asyncio at the top and keep the existing try/except and logger.error
behavior unchanged.
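For reference, the offloading pattern in isolation (blocking_fetch stands in for the synchronous SDK call; it is not the project's client):

```python
import asyncio
import time

def blocking_fetch(url: str) -> str:
    """Stand-in for a synchronous SDK call that would stall the event loop."""
    time.sleep(0.1)  # simulates network latency
    return f"content of {url}"

async def main() -> None:
    # asyncio.to_thread runs the blocking function on a worker thread and
    # returns an awaitable, so the event loop stays free for other tasks.
    result = await asyncio.to_thread(blocking_fetch, "https://example.com")
    print(result)

asyncio.run(main())  # prints: content of https://example.com
```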

Comment on lines +36 to +42
        self.scraper_type = scraper_type or GENERAL_SCRAPING_API
        self._scraper: Optional[BaseGeneralScraper] = None
        self._init_scraper()

    def _init_scraper(self) -> None:
        """Initialize the underlying scraper based on scraper_type."""
        scraper_class = self.SCRAPER_REGISTRY.get(self.scraper_type.upper())

⚠️ Potential issue | 🟠 Major

AttributeError if both scraper_type and GENERAL_SCRAPING_API are None.

self.scraper_type can be None when the caller passes nothing and the config variable is unset. Calling .upper() on None at line 42 will raise AttributeError.

Proposed fix — default to a known value
-        self.scraper_type = scraper_type or GENERAL_SCRAPING_API
+        self.scraper_type = scraper_type or GENERAL_SCRAPING_API or "FIRECRAWL"
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/scraper.py` around lines 36 - 42,
self.scraper_type can be None which causes self.scraper_type.upper() in
_init_scraper to raise AttributeError; ensure a safe default string is used when
both the constructor argument and GENERAL_SCRAPING_API are None by normalizing
scraper_type before calling .upper(). Update the assignment or _init_scraper so
it uses something like scraper_type = (scraper_type or GENERAL_SCRAPING_API or
"GENERAL") (or otherwise coerce to str) and then call .upper() against that
value; reference symbols: scraper_type, GENERAL_SCRAPING_API, _init_scraper,
SCRAPER_REGISTRY, and the .upper() call to locate the change.
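The or-chain normalization in isolation (resolve_scraper_type and its fallback value are illustrative; the actual default would come from the project's config):

```python
def resolve_scraper_type(explicit=None, configured=None, fallback="FIRECRAWL"):
    # `or` returns the first truthy operand, so None (and "") fall through
    # to the fallback before .upper() is ever called on a None value.
    return (explicit or configured or fallback).upper()

print(resolve_scraper_type(None, None))           # FIRECRAWL
print(resolve_scraper_type(None, "zyte"))         # ZYTE
print(resolve_scraper_type("firecrawl", "zyte"))  # FIRECRAWL
```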

Comment on lines +22 to +35
        try:
            client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
            result = await client.get(
                {
                    "url": self.url,
                    "browserHtml": True,
                    "article": True,
                    "articleOptions": {"extractFrom": "browserHtml"},
                }
            )
            await self._process_zyte_result(result)
        except Exception as e:
            logger.error(f"Failed to scrape URL with Zyte: {e}")
            raise

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Zyte API AsyncZyteAPI Python context manager resource cleanup

💡 Result:

For resource cleanup with zyte_api.AsyncZyteAPI, the main thing you need to close is the session object returned by client.session() (it manages an internal aiohttp.ClientSession connection pool).

Recommended: use it as an async context manager

from zyte_api import AsyncZyteAPI

async def main():
    client = AsyncZyteAPI()
    async with client.session() as session:
        r = await session.get({"url": "https://example.com", "httpResponseBody": True})
        ...

Exiting the async with block performs the cleanup automatically. [1] [2]

Alternative: manual cleanup (equivalent, but you must remember it)

session = client.session()
try:
    r = await session.get({"url": "https://example.com"})
finally:
    await session.close()

This is explicitly documented as sufficient—you don’t have to use the context manager as long as you await session.close() when done. [1]

If you pass your own aiohttp.ClientSession

If you call await client.get(..., session=your_aiohttp_session), then you own that aiohttp session lifecycle and must close it yourself (Zyte won’t). [1]

Sources:
[1] python-zyte-api API reference (AsyncZyteAPI, session(), close())
[2] Zyte docs Python examples using async with client.session()


🏁 Script executed:

# Search for AsyncZyteAPI imports and usage in the codebase
rg "AsyncZyteAPI" --type py -B 2 -A 5

Repository: aturret/FastFetchBot

Length of output: 1095


🏁 Script executed:

# Check if there are any existing Zyte API client usage patterns
rg "ZYTE_API_KEY" --type py -B 3 -A 3

Repository: aturret/FastFetchBot

Length of output: 1573


AsyncZyteAPI session requires proper async context management — resource leak risk.

AsyncZyteAPI.get() creates an internal session that manages an aiohttp.ClientSession connection pool. The current code does not close this session, causing connection leaks.

The client must use client.session() as an async context manager:

Proposed fix — use session context manager
         try:
             client = AsyncZyteAPI(api_key=ZYTE_API_KEY)
-            result = await client.get(
-                {
-                    "url": self.url,
-                    "browserHtml": True,
-                    "article": True,
-                    "articleOptions": {"extractFrom": "browserHtml"},
-                }
-            )
+            async with client.session() as session:
+                result = await session.get(
+                    {
+                        "url": self.url,
+                        "browserHtml": True,
+                        "article": True,
+                        "articleOptions": {"extractFrom": "browserHtml"},
+                    }
+                )
             await self._process_zyte_result(result)
🤖 Prompt for AI Agents
In `@app/services/scrapers/general/zyte.py` around lines 22 - 35, The AsyncZyteAPI
instance (AsyncZyteAPI) creates an internal aiohttp session that must be closed
via its async context manager; update the Zyte scraping block to open the
client's session with "async with client.session():" and call the existing
AsyncZyteAPI.get(...) inside that context so the underlying connection pool is
always closed even on exceptions, then pass the returned result to
self._process_zyte_result(result) as before and keep the current exception
logging/raising.

Comment on lines +130 to +133
    # Remove DOCTYPE declarations
    for item in soup.contents:
        if isinstance(item, Doctype):
            item.extract()

⚠️ Potential issue | 🟡 Minor

Mutating soup.contents while iterating over it can skip nodes.

soup.contents is a live list; calling .extract() during iteration may cause elements to be skipped. Snapshot the list first.

Proposed fix
     # Remove DOCTYPE declarations
-    for item in soup.contents:
+    for item in list(soup.contents):
         if isinstance(item, Doctype):
             item.extract()
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code and has no issues with indentation.

Suggested change
     # Remove DOCTYPE declarations
-    for item in soup.contents:
+    for item in list(soup.contents):
         if isinstance(item, Doctype):
             item.extract()
🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 130 - 133, The loop mutates the live list
soup.contents while iterating (using Doctype and item.extract()), which can skip
nodes; fix by iterating over a snapshot of the contents instead (e.g., iterate
over list(soup.contents) or use soup.find_all/children snapshot) and call
item.extract() on the snapshot so no elements are skipped; update the block that
references soup.contents, Doctype, and item.extract() accordingly.

Comment on lines +151 to +154
    # Unwrap <p> tags (keep text content)
    for tag in soup.find_all("p"):
        tag.unwrap()


⚠️ Potential issue | 🟠 Major

Unwrapping <p> tags loses paragraph separation.

When <p> tags are unwrapped without inserting a separator (e.g., <br> or newline), consecutive paragraphs will merge into a single run of text, harming readability in the Telegram message.

Proposed fix — insert a line break before unwrapping
     # Unwrap <p> tags (keep text content)
     for tag in soup.find_all("p"):
+        tag.insert_after(soup.new_string("\n"))
         tag.unwrap()
🤖 Prompt for AI Agents
In `@app/utils/parse.py` around lines 151 - 154, The loop that unwraps <p> tags
(the for tag in soup.find_all("p") block) currently calls tag.unwrap() which
merges consecutive paragraphs; before unwrapping each <p> insert a paragraph
separator (e.g., a newline or double-newline as a NavigableString or a <br>
element) so paragraph boundaries are preserved in the resulting Telegram text,
then call tag.unwrap(); ensure any required import (NavigableString) is added if
using it.

@aturret (Owner, Author) commented Feb 14, 2026

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

@aturret (Owner, Author) commented Feb 14, 2026

@codex review

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad6ec6cb10


            raise

    async def _process_zyte_result(self, result: dict) -> None:
        article = result.get("article", {})


P1 Badge Guard against null Zyte article payloads

If Zyte returns "article": null (which can happen on pages where article extraction fails), result.get("article", {}) yields None, and subsequent calls like article.get(...) raise AttributeError before the browserHtml fallback can run. This makes general scraping fail for those URLs when the Zyte backend is selected.
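The distinction between an absent key and an explicit null is easy to check (a sketch with made-up payloads, not real Zyte responses):

```python
# dict.get(key, default) only falls back when the key is *absent*;
# an explicit None value is returned as-is, so a later article.get(...)
# would raise AttributeError. Coalescing with `or {}` covers both cases.
payload_missing = {"browserHtml": "<html>...</html>"}
payload_null = {"article": None, "browserHtml": "<html>...</html>"}

print(payload_missing.get("article", {}))  # {}
print(payload_null.get("article", {}))     # None -- the default is not used
print(payload_null.get("article") or {})   # {} -- safe to call .get() on
```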


     print(text)
-    split_pivot = "\n" if is_html is False else "<br>"
-    text_list = text.split(split_pivot)
+    text_list = text.split("\n")


P2 Badge Preserve HTML structure when wrapping HTML content

Splitting all HTML input on raw newlines when is_html=True breaks pretty-printed markup into line fragments and wraps each fragment in <p>, which can corrupt nested structures (e.g., lists/tables become invalid paragraph-wrapped fragments). This degrades rendered content quality for scraped pages that include formatting newlines, not just explicit <br> line breaks.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/services/scrapers/scraper_manager.py`:
- Around line 18-21: The class-level scrapers dict is populated with None at
import time causing "other" and "unknown" to fall out of sync when init_scraper
sets only one key; update init_scraper (the method updating cls.general_scraper
and cls.scrapers) so that when you assign the general_scraper instance you also
set both cls.scrapers["other"] and cls.scrapers["unknown"] to that instance (or
change scrapers to be a property/method that returns {"other":
cls.general_scraper, "unknown": cls.general_scraper, ...} based on the current
class attributes); edit the init_scraper function and the scrapers reference to
ensure both keys always reference the same non-None scraper instance.
🧹 Nitpick comments (5)
app/services/scrapers/scraper_manager.py (2)

18-21: Annotate mutable class attribute with ClassVar.

Ruff (RUF012) flags that scrapers is a mutable class attribute without a ClassVar annotation. This helps type checkers distinguish class-level state from instance-level state.

Proposed fix
+from typing import ClassVar, Optional
 ...
-    scrapers = {"bluesky": bluesky_scraper,
+    scrapers: ClassVar[dict] = {"bluesky": bluesky_scraper,

44-47: Partially initialized scraper persists if init() throws.

If await cls.bluesky_scraper.init() raises, cls.bluesky_scraper is already assigned (line 45). Future calls to init_scraper("bluesky") will see it's not None and skip re-initialization, leaving a broken instance in place.

Proposed fix
     `@classmethod`
     async def init_bluesky_scraper(cls) -> BlueskyScraper:
-        cls.bluesky_scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
-        await cls.bluesky_scraper.init()
-        return cls.bluesky_scraper
+        scraper = BlueskyScraper(username=BLUESKY_USERNAME, password=BLUESKY_PASSWORD)
+        await scraper.init()
+        cls.bluesky_scraper = scraper
+        return cls.bluesky_scraper
app/services/scrapers/general/firecrawl_client.py (1)

26-57: Thread-safe singleton holding an async client — consider event-loop affinity.

The threading.Lock-based singleton is fine for creation, but AsyncFirecrawl likely holds an internal httpx.AsyncClient (or similar) that is bound to the event loop active at creation time. If get_instance() is first called from a different loop (e.g., during tests or worker startup) than the one that later awaits scrape_url, you may get "attached to a different loop" errors. This is a low-risk concern for a single-loop server but worth noting.

app/services/scrapers/general/base.py (2)

170-171: AsyncOpenAI client instantiated on every LLM call.

A new AsyncOpenAI is created each invocation of parsing_article_body_by_llm, which allocates a new HTTP connection pool every time. Consider creating it once (e.g., as a module-level or class-level singleton) and reusing it.

Proposed refactor sketch
+_openai_client: Optional[AsyncOpenAI] = None
+
+def _get_openai_client() -> AsyncOpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+    return _openai_client
+
 ...
     `@staticmethod`
     async def parsing_article_body_by_llm(html_content: str) -> str:
         ...
         try:
-            client = AsyncOpenAI(api_key=OPENAI_API_KEY)
+            client = _get_openai_client()

46-46: MD5 used for URL-based ID generation (static analysis flag S324).

The static analysis tool flags hashlib.md5 as insecure. Since this is only used to derive a short deterministic ID from a URL (not for security/integrity), the risk is negligible. If you want to silence the linter, you can add hashlib.md5(url.encode(), usedforsecurity=False) (Python 3.9+).
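A sketch of the suggested call (short_id is a hypothetical name; the usedforsecurity flag requires Python 3.9+):

```python
import hashlib

def short_id(url: str) -> str:
    # usedforsecurity=False signals to linters and FIPS-restricted builds
    # that this digest is a deterministic identifier, not a security
    # primitive.
    return hashlib.md5(url.encode(), usedforsecurity=False).hexdigest()[:12]

print(short_id("https://example.com/article"))  # same 12 hex chars every run
```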

@aturret aturret merged commit cef9849 into main Feb 15, 2026
1 of 2 checks passed
@aturret aturret deleted the zyte-api branch February 16, 2026 04:59