
feat: add Firecrawl scraping feature#45

Merged
aturret merged 1 commit into main from firecrawl-update on Jan 18, 2026

Conversation

@aturret (Owner) commented Jan 18, 2026

Summary by CodeRabbit

  • New Features
    • Automatic content extraction now covers URLs from previously unsupported sources, extracting and formatting comprehensive metadata including titles, authors, summaries, and the main article content.
    • Short-form messages now display extracted titles for improved readability and better visual hierarchy.
    • Enhanced content processing ensures consistent, high-quality formatting across diverse content sources and formats.


coderabbitai bot (Contributor) commented Jan 18, 2026

📝 Walkthrough

This pull request introduces Firecrawl-based web scraping integration into the application. It adds configuration constants, implements a thread-safe Firecrawl client wrapper, creates a scraping pipeline with LLM-powered article extraction, integrates the new scraper into the existing architecture, and enables conditional URL processing in the Telegram bot.

Changes

  • Configuration & Dependencies (.gitignore, app/config.py, pyproject.toml): Added macOS .DS_Store to ignored files. Introduced Firecrawl configuration variables (ON flag, API URL/key, timeout). Updated the openai dependency to ^2.15.0 and added firecrawl-py ^4.13.0.
  • Firecrawl Client Infrastructure (app/services/scrapers/firecrawl_client/__init__.py, app/services/scrapers/firecrawl_client/client.py): Created a FirecrawlItem dataclass extending MetadataItem with id and raw_content fields. Implemented a thread-safe FirecrawlClient singleton wrapper around the Firecrawl SDK with a scrape_url method and error handling.
  • Firecrawl Scraping Pipeline (app/services/scrapers/firecrawl_client/scraper.py): Added FirecrawlDataProcessor with async content retrieval and LLM-based article extraction using OpenAI (GPT-4o-mini). Processes Firecrawl results into structured items with markdown/HTML selection and media extraction. Includes fallback mechanisms for a missing API key or extraction failures.
  • Scraper Integration (app/services/scrapers/common.py, app/services/scrapers/scraper_manager.py): Expanded the early-scraper workaround to include the "other" and "unknown" categories alongside existing patterns. Extended ScraperManager with FirecrawlScraper support, lazy initialization, and registry mappings for the new categories.
  • Telegram Bot Integration (app/services/telegram_bot/__init__.py): Added the FIRECRAWL_ON flag import and conditional URL processing logic. When the flag is enabled and the URL is unknown, the bot processes it via Firecrawl before returning the standard "no supported URL" response.
  • Template Updates (app/templates/social_media_message.jinja2): Modified short-message rendering to conditionally display a bold title before the text when a title exists.
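The thread-safe FirecrawlClient singleton described in the client-infrastructure cohort can be sketched roughly as follows. This is an illustration only: the real wrapper holds a Firecrawl SDK instance and differs in detail, and all names here beyond FirecrawlClient.get_instance are assumptions.

```python
import threading

class FirecrawlClient:
    """Minimal thread-safe singleton sketch; attribute names are illustrative."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self, api_key: str = ""):
        self.api_key = api_key  # the real wrapper would hold a Firecrawl SDK app here

    @classmethod
    def get_instance(cls, api_key: str = "") -> "FirecrawlClient":
        # Double-checked locking: cheap fast path, lock held only on first creation.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(api_key)
        return cls._instance

a = FirecrawlClient.get_instance("fc-test-key")
b = FirecrawlClient.get_instance()
print(a is b)  # → True
```

The double-checked locking keeps the common path lock-free while guaranteeing only one instance is created under concurrent first calls.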

Sequence Diagram(s)

sequenceDiagram
    participant TelegramBot as Telegram Bot
    participant ScraperMgr as ScraperManager
    participant FirecrawlScraper as FirecrawlScraper
    participant FirecrawlClient as FirecrawlClient
    participant FirecrawlAPI as Firecrawl API
    participant LLM as OpenAI LLM
    participant Database as Item Storage

    TelegramBot->>TelegramBot: Receive unknown URL
    TelegramBot->>TelegramBot: Check FIRECRAWL_ON flag
    alt FIRECRAWL_ON enabled
        TelegramBot->>ScraperMgr: get_scraper("unknown" or "other")
        ScraperMgr->>FirecrawlScraper: init_firecrawl_scraper()
        FirecrawlScraper->>FirecrawlClient: get_instance()
        FirecrawlScraper->>FirecrawlClient: scrape_url(url)
        FirecrawlClient->>FirecrawlAPI: POST /scrape with URL
        FirecrawlAPI-->>FirecrawlClient: HTML, markdown, metadata
        FirecrawlClient-->>FirecrawlScraper: scrape result dict
        FirecrawlScraper->>LLM: parsing_article_body_by_llm(html)
        LLM-->>FirecrawlScraper: extracted article HTML
        FirecrawlScraper->>FirecrawlScraper: _process_firecrawl_result()
        FirecrawlScraper-->>TelegramBot: FirecrawlItem (structured)
        TelegramBot->>TelegramBot: send_item_message(item)
        TelegramBot->>Database: Store/forward item
    else FIRECRAWL_ON disabled
        TelegramBot->>TelegramBot: Return "unsupported URL" message
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Firecrawl spins its web with grace,
Unknown URLs find their place,
LLM whispers what's true inside,
New content flows with proper stride! 🕸️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 22.73%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'feat: add Firecrawl scraping feature' accurately captures the main objective of the pull request, which is to add Firecrawl-based web scraping functionality across multiple files.



@aturret aturret merged commit 8a99c84 into main Jan 18, 2026
1 of 2 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/services/telegram_bot/__init__.py (1)

209-221: Control flow issue: success message overwritten with failure message.

When FIRECRAWL_ON is enabled and the URL is successfully processed (lines 214-217), the code continues to execute lines 218-221, which overwrites the "Processing..." message with "no supported url found" and returns. This contradicts the successful Firecrawl processing.

The return statement at line 221 should likely be inside the if FIRECRAWL_ON block after successful processing, or use else to separate the two paths.

Proposed fix
         if url_metadata.source == "unknown":
             if FIRECRAWL_ON:
                 await process_message.edit_text(
-                    text=f"Uncategorized url found. General webpage parser is on, Processing..."
+                    text="Uncategorized url found. General webpage parser is on, Processing..."
                 )
                 metadata_item = await content_process_function(url_metadata=url_metadata)
                 await send_item_message(
                     metadata_item, chat_id=message.chat_id
                 )
-            await process_message.edit_text(
-                text=f"For the {i + 1} th url, no supported url found."
-            )
-            return
+                await process_message.delete()
+            else:
+                await process_message.edit_text(
+                    text=f"For the {i + 1} th url, no supported url found."
+                )
+            return
🤖 Fix all issues with AI agents
In `@app/config.py`:
- Around line 214-218: FIRECRAWL_TIMEOUT_SECONDS is currently assigned from
env.get and may be a string; convert it to an int for type safety by parsing the
value (e.g., wrap the retrieved env value with int(...) or use an existing
helper like get_env_int) so downstream code receives an integer; update the
assignment of FIRECRAWL_TIMEOUT_SECONDS (the symbol in the diff) to parse/coerce
the env value to int and handle a missing/invalid value by falling back to the
default 60.

In `@app/services/scrapers/firecrawl_client/client.py`:
- Around line 70-93: scrape_url builds a params dict (including extra_params)
and accepts timeout_seconds but never uses either; update the call to
self._app.scrape to pass the assembled params (e.g., params=params) and wire the
timeout_seconds through — converting seconds to milliseconds if the SDK expects
ms (timeout_ms = int(timeout_seconds * 1000)) or passing raw seconds when
appropriate; modify the scrape invocation in scrape_url to use these values
instead of the current arguments (refer to scrape_url, params, timeout_seconds,
and self._app.scrape) so callers' options are honored.

In `@app/services/scrapers/firecrawl_client/scraper.py`:
- Around line 51-61: The _get_page_content coroutine is calling the synchronous
self._client.scrape_url which blocks the event loop; switch to Firecrawl's async
API by replacing the blocking call with the AsyncFirecrawl async client and its
async scrape method (e.g., create or ensure self._client is an AsyncFirecrawl
instance and call await self._client.scrape(...) with the same parameters), then
await the existing _process_firecrawl_result(result) call; update error handling
to catch exceptions from the awaited async call and rethrow as before.

In `@app/services/scrapers/scraper_manager.py`:
- Around line 14-21: The class-level scraper attributes (bluesky_scraper,
weibo_scraper, firecrawl_scraper) are declared but never set after creating
instances, causing repeated re-initialization in init_scraper(); update
init_scraper() so that when you create a scraper instance you assign it back to
the corresponding class attribute (e.g., cls.firecrawl_scraper = instance,
cls.bluesky_scraper = instance, cls.weibo_scraper = instance) and ensure the
scrapers mapping (cls.scrapers) points to that same instance for all relevant
keys (update both "other" and "unknown" to reference cls.firecrawl_scraper or
rebuild cls.scrapers from the class attrs after initialization) so subsequent
calls use the cached instances.

In `@app/services/telegram_bot/__init__.py`:
- Around line 351-358: The code processes "unknown" URLs via Firecrawl when
FIRECRAWL_ON is true but then immediately hits the subsequent check that logs
and returns for url_metadata.source == "unknown", negating the Firecrawl result;
to fix, after successfully calling content_process_function and
send_item_message in the Firecrawl branch (the block using url_metadata,
FIRECRAWL_ON, content_process_function, and send_item_message) add an early
return so execution does not continue to the later logger.debug/return, or
alternatively change the second condition to skip returning when Firecrawl was
performed (e.g., only return if url_metadata.source == "unknown" and not
FIRECRAWL_ON or if a flag indicates Firecrawl didn't run).

In `@app/templates/social_media_message.jinja2`:
- Around line 4-7: The template outputs user-provided data.title without
escaping; update the social_media_message.jinja2 template to explicitly escape
the title (e.g., use the Jinja2 escape/filter on data.title) so HTML/Telegram
markup cannot be injected; modify the conditional block around data.title (the
line rendering <b>{{ data.title }}</b>) to render an escaped version of
data.title using the appropriate Jinja2 escape/filter.
🧹 Nitpick comments (5)
.gitignore (1)

259-259: LGTM! Standard macOS system file exclusion.

Adding .DS_Store to .gitignore is a best practice to prevent macOS folder metadata files from being committed to version control.

📁 Optional: Consider organizing OS-specific entries

For better organization, you could group OS-specific files in a dedicated section near the top of the file or with other OS/IDE-specific entries. However, the current placement at the end is perfectly acceptable.

Example organization:

+# macOS
+.DS_Store
+
 # Byte-compiled / optimized / DLL files
 __pycache__/

This is purely a stylistic preference and not necessary.

app/services/telegram_bot/__init__.py (1)

71-72: Minor formatting nit: consider separating imports for readability.

The FIRECRAWL_ON constant is appended to the same line as other imports. For better readability, consider placing it on its own line.

Suggested change
-    TEMPLATE_LANGUAGE, TELEBOT_MAX_RETRY, FIRECRAWL_ON,
+    TEMPLATE_LANGUAGE,
+    TELEBOT_MAX_RETRY,
+    FIRECRAWL_ON,
app/services/scrapers/firecrawl_client/scraper.py (2)

81-96: Consider reusing AsyncOpenAI client instance.

Creating a new AsyncOpenAI client on every call adds overhead. Consider instantiating it once at the module level or as a class attribute.

Suggested refactor
# At module level or in __init__
_openai_client: Optional[AsyncOpenAI] = None

@staticmethod
def _get_openai_client() -> AsyncOpenAI:
    global _openai_client
    if _openai_client is None and OPENAI_API_KEY:
        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return _openai_client

84-95: Consider making the model name configurable via environment variable.

gpt-4o-mini is a valid OpenAI model, but it's hardcoded. Since OPENAI_API_KEY is already externalized to environment configuration, consider adding an OPENAI_MODEL setting to app/config.py and using it here. This would align with the existing configuration pattern and allow flexibility to switch models without code changes.
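A minimal sketch of the suggested pattern. OPENAI_MODEL is a proposed config name, not one that exists in app/config.py today, and the helper name here is purely illustrative:

```python
def resolve_openai_model(env) -> str:
    """Mirror the existing config pattern: env override with the current hardcoded default."""
    return env.get("OPENAI_MODEL", "gpt-4o-mini")

print(resolve_openai_model({}))                          # → gpt-4o-mini
print(resolve_openai_model({"OPENAI_MODEL": "gpt-4o"}))  # → gpt-4o
```

The call site would then pass model=OPENAI_MODEL instead of the string literal.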

app/services/scrapers/scraper_manager.py (1)

18-21: Consider annotating mutable class attribute with ClassVar.

The scrapers dictionary is a mutable class attribute. For type safety and clarity, consider annotating it with typing.ClassVar.

Suggested change
+from typing import ClassVar, Dict, Optional
+
 class ScraperManager:
 
     bluesky_scraper: Optional[BlueskyScraper] = None
     weibo_scraper: Optional[WeiboScraper] = None
     firecrawl_scraper: Optional[FirecrawlScraper] = None
 
-    scrapers = {"bluesky": bluesky_scraper,
-                "weibo": weibo_scraper,
-                "other": firecrawl_scraper,
-                "unknown": firecrawl_scraper}
+    scrapers: ClassVar[Dict[str, Optional[Scraper]]] = {
+        "bluesky": None,
+        "weibo": None,
+        "other": None,
+        "unknown": None,
+    }

Comment on lines +214 to +218
# Firecrawl API environment variables
FIRECRAWL_ON = get_env_bool(env, "FIRECRAWL_ON", False)
FIRECRAWL_API_URL = os.getenv("FIRECRAWL_API_URL", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)

⚠️ Potential issue | 🟡 Minor

Coerce FIRECRAWL_TIMEOUT_SECONDS to int for type safety.
Environment values are strings; downstream expects an int.

🔧 Suggested fix
-FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)
+FIRECRAWL_TIMEOUT_SECONDS = int(env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)) or 60
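A defensive coercion along those lines can be sketched as below. Note this is a standalone illustration: a get_env_int helper may or may not already exist in the config module, and unlike the one-liner above it also survives non-numeric values rather than raising.

```python
def get_env_int(env, key: str, default: int) -> int:
    """Coerce an environment value to int, falling back on missing or invalid input."""
    try:
        return int(env.get(key, default))
    except (TypeError, ValueError):
        return default

# Hypothetical usage in app/config.py:
# FIRECRAWL_TIMEOUT_SECONDS = get_env_int(env, "FIRECRAWL_TIMEOUT_SECONDS", 60)
print(get_env_int({"FIRECRAWL_TIMEOUT_SECONDS": "90"}, "FIRECRAWL_TIMEOUT_SECONDS", 60))      # → 90
print(get_env_int({}, "FIRECRAWL_TIMEOUT_SECONDS", 60))                                       # → 60
print(get_env_int({"FIRECRAWL_TIMEOUT_SECONDS": "ninety"}, "FIRECRAWL_TIMEOUT_SECONDS", 60))  # → 60
```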

Comment on lines +70 to +93
def scrape_url(
    self,
    url: str,
    formats: Optional[List[str]] = None,
    only_main_content: bool = True,
    timeout_seconds: Optional[int] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Single-page scrape (the most common operation).
    """
    params: Dict[str, Any] = {
        "formats": formats or ["markdown"],
        "onlyMainContent": only_main_content,
    }
    if extra_params:
        params.update(extra_params)

    # if timeout_seconds is None:
    #     timeout_seconds = self._settings.timeout_seconds

    try:
        return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
            exclude_none=True)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "client.py" | grep firecrawl

Repository: aturret/FastFetchBot

Length of output: 114


🏁 Script executed:

cat -n app/services/scrapers/firecrawl_client/client.py

Repository: aturret/FastFetchBot

Length of output: 3450


🏁 Script executed:

python3 << 'EOF'
import subprocess
result = subprocess.run(['pip', 'show', 'firecrawl'], capture_output=True, text=True)
print(result.stdout)
EOF

Repository: aturret/FastFetchBot

Length of output: 150


🏁 Script executed:

# Search for how firecrawl scrape is used in the codebase
rg "scrape_url" --type py -A 3 -B 1

Repository: aturret/FastFetchBot

Length of output: 1059


🏁 Script executed:

# Check if there are other usages or tests of the firecrawl SDK
rg "_app\.scrape|Firecrawl" --type py

Repository: aturret/FastFetchBot

Length of output: 3710


🌐 Web query:

firecrawl-py SDK scrape method parameters API documentation

💡 Result:

Summary of firecrawl-py /scrape method parameters (concise):

  • url / url string — the target page to scrape. (docs.firecrawl.dev)
  • formats — array of output formats (e.g., "markdown", "html", "rawHtml", "summary", "json", "links", "images", "screenshot", "branding"). (docs.firecrawl.dev)
  • maxAge — cache freshness in ms (default ~172800000 ms = 2 days); set 0 to force fresh. (docs.firecrawl.dev)
  • storeInCache — boolean to prevent storing results in cache. (docs.firecrawl.dev)
  • location — object with country (ISO 3166-1 alpha-2) and languages to control proxy/language emulation. (docs.firecrawl.dev)
  • pageOptions — object for page parsing behavior (examples include onlyMainContent, parsePDF, etc.). (docs.firecrawl.dev)
  • parsers — array to force specific parsers (e.g., ["pdf"]). (docs.firecrawl.dev)
  • screenshot (when formats includes "screenshot") — options like fullPage, quality, viewport. (docs.firecrawl.dev)
  • JSON extraction / LLM schema — support for structured extraction using a JSON format or pydantic schema. (docs.firecrawl.dev)
  • Additional SDK params — SDK methods accept named params/objects (scrape(), scrape_url(), scrape_options / ScrapeOptions) and support sync/async clients. (docs.firecrawl.dev)

Sources:
[1] Firecrawl Scrape docs (Scrape endpoint, formats, cache, location, screenshots). (docs.firecrawl.dev)
[2] Firecrawl Advanced Scraping Guide (pageOptions, parsers, scrape options). (docs.firecrawl.dev)
[3] firecrawl-py SDK / PyPI / GitHub pages (SDK method names, ScrapeOptions, examples, async). (github.com)

If you want, I can paste exact parameter names and example Python signatures from the SDK docs.

🌐 Web query:

firecrawl-py GitHub repository scrape method signature source code

💡 Result:

The Firecrawl Python SDK scrape signature (synchronous) — from the SDK docs/source — is:

def scrape(
    url: str,
    *,
    formats: Optional[List[str]] = None,
    headers: Optional[Dict[str, str]] = None,
    include_tags: Optional[List[str]] = None,
    exclude_tags: Optional[List[str]] = None,
    wait_for: Optional[int] = None,
    screenshot: Optional[bool] = None,
    full_page_screenshot: Optional[bool] = None,
    mobile: Optional[bool] = None,
    **kwargs
) -> Document

(Example usage and parameter details in the Firecrawl docs and the package docs mirror.) [1][2]

Sources:

  • Firecrawl docs — Scrape endpoint / examples. [1]
  • firecrawl-py package docs (scraping API / signature listing). [2]

🏁 Script executed:

# Check if there's a requirements.txt or setup.py that shows firecrawl version
find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" \) -exec cat {} \;

Repository: aturret/FastFetchBot

Length of output: 1668


🏁 Script executed:

python3 << 'EOF'
import subprocess
import json

# Try to get info about firecrawl package
result = subprocess.run(['pip', 'show', 'firecrawl-py', '-f'], capture_output=True, text=True)
print("Package info:")
print(result.stdout)
print("\nStderr:")
print(result.stderr)
EOF

Repository: aturret/FastFetchBot

Length of output: 132


🌐 Web query:

firecrawl-py scrape method pageOptions onlyMainContent parameter example

💡 Result:

Example — Python (firecrawl-py) using pageOptions.onlyMainContent:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")
resp = app.scrape_url(
    "https://example.com",
    params={
        "pageOptions": {
            "onlyMainContent": True,
            "includeHtml": False
        }
    }
)
print(resp)  # resp will contain the main page content without headers/navs/footers

References: Firecrawl docs — Page Options and Scrape examples. [1][2]


🏁 Script executed:

# Check if scraper.py shows how scrape_url is being called
sed -n '1,50p' app/services/scrapers/firecrawl_client/scraper.py

Repository: aturret/FastFetchBot

Length of output: 2048


🌐 Web query:

firecrawl-py SDK scrape method timeout parameter how to pass

💡 Result:

Pass the timeout as the timeout keyword when calling scrape; it expects milliseconds. Example:

Python
from firecrawl import Firecrawl
fc = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = fc.scrape("https://example.com", formats=["markdown"], timeout=15000) # 15,000 ms = 15s

Docs: scrape endpoint timeout is in milliseconds and the Python SDK forwards that parameter. [1][2]

Sources
[1] Firecrawl Python SDK usage (scrape examples).
[2] Firecrawl Scrape endpoint — timeout (integer, milliseconds).


🌐 Web query:

site:github.com firecrawl-py scrape method implementation kwargs

💡 Result:

The firecrawl Python SDK's scrape method accepts options via keyword args (examples: formats, scrape_options, poll_interval). See the SDK docs and examples in the repository README. [1][2]

Sources:

  • firecrawl (main repo) README — example usage of firecrawl.scrape(url, formats=[...]). [1]
  • firecrawl-py (old Python SDK) README — scrape_url / scrape_url params examples. [2]

Wire unused params dict and timeout_seconds into the SDK call.

The method builds a params dict that includes extra_params, but then ignores it entirely when calling scrape(). Additionally, the timeout_seconds parameter is accepted but never applied—the code to use it is commented out. This causes callers' configuration to be silently dropped.

The timeout_seconds parameter is in seconds but the firecrawl SDK expects milliseconds. Consider either converting on input or passing the unconverted value if the SDK handles both.

🔧 Suggested direction (adjust based on SDK version's params handling)
         params: Dict[str, Any] = {
             "formats": formats or ["markdown"],
             "onlyMainContent": only_main_content,
         }
+        if timeout_seconds is None:
+            timeout_seconds = self._settings.timeout_seconds
+        if timeout_seconds is not None:
+            params["timeout"] = timeout_seconds * 1000  # Convert seconds to milliseconds
         if extra_params:
             params.update(extra_params)

         try:
-            return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
-                exclude_none=True)
+            return self._app.scrape(url, **params).model_dump(exclude_none=True)
🧰 Tools
🪛 Ruff (0.14.13)

75-75: Unused method argument: timeout_seconds

(ARG002)


79-79: Docstring contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF002)


79-79: Docstring contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF002)


Comment on lines +51 to +61
async def _get_page_content(self) -> None:
    try:
        result = self._client.scrape_url(
            url=self.url,
            formats=["markdown", "html"],
            only_main_content=True,
        )
        await self._process_firecrawl_result(result)
    except Exception as e:
        logger.error(f"Failed to scrape URL with Firecrawl: {e}")
        raise

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's find where FirecrawlClient is defined or imported
find . -type f -name "*.py" | xargs grep -l "class FirecrawlClient" 2>/dev/null || echo "No local FirecrawlClient class found"

# Search for scrape_url method definition in the repo
rg "def scrape_url|async def scrape_url" --type py -B 2 -A 10

# Also check the imports in the scraper.py file to understand where _client comes from
cat -n app/services/scrapers/firecrawl_client/scraper.py | head -50

Repository: aturret/FastFetchBot

Length of output: 3450


🏁 Script executed:

#!/bin/bash
# Read the scrape_url method from client.py
cat -n app/services/scrapers/firecrawl_client/client.py | grep -A 30 "def scrape_url"

Repository: aturret/FastFetchBot

Length of output: 1153


🏁 Script executed:

#!/bin/bash
# Read the FirecrawlClient class to see how _app is initialized
cat -n app/services/scrapers/firecrawl_client/client.py | head -70

Repository: aturret/FastFetchBot

Length of output: 2387


🌐 Web query:

Firecrawl Python SDK async support AsyncFirecrawl

💡 Result:

Yes — the Firecrawl Python SDK includes async support via the AsyncFirecrawl class. Its methods mirror Firecrawl (awaitable versions of scrape, search, start_crawl, batch_scrape, watcher, etc.). Example usage and docs are on the official SDK pages. [1][2]

Sources:

  • Firecrawl Python SDK docs (AsyncFirecrawl examples and description). [1]
  • Firecrawl docs (Python SDK overview, async class usage and watcher example). [2]

Use AsyncFirecrawl instead of blocking the event loop with a synchronous call.

self._client.scrape_url() is a synchronous, blocking call that performs network I/O to Firecrawl. Calling it without await in this async method blocks the event loop, preventing other async operations from executing concurrently and defeating the purpose of async. The Firecrawl Python SDK provides an AsyncFirecrawl class with async methods (e.g., scrape()); use it to avoid blocking during the scraping request.

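If migrating to AsyncFirecrawl is not immediate, a stopgap that keeps the event loop responsive is to offload the blocking call with asyncio.to_thread. A self-contained sketch, with a stub standing in for the real client (StubClient and its return shape are illustrative, not the actual SDK):

```python
import asyncio
import time

class StubClient:
    """Stand-in for the synchronous FirecrawlClient."""
    def scrape_url(self, url: str) -> dict:
        time.sleep(0.1)  # simulate blocking network I/O
        return {"markdown": f"# scraped {url}"}

async def get_page_content(client: StubClient, url: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(client.scrape_url, url)

result = asyncio.run(get_page_content(StubClient(), "https://example.com"))
print(result["markdown"])  # → # scraped https://example.com
```

asyncio.to_thread (Python 3.9+) does not make the request itself concurrent, but it stops one slow scrape from stalling every other coroutine in the bot.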

Comment on lines 14 to +21
 bluesky_scraper: Optional[BlueskyScraper] = None
 weibo_scraper: Optional[WeiboScraper] = None
+firecrawl_scraper: Optional[FirecrawlScraper] = None

 scrapers = {"bluesky": bluesky_scraper,
-            "weibo": bluesky_scraper}
+            "weibo": weibo_scraper,
+            "other": firecrawl_scraper,
+            "unknown": firecrawl_scraper}

⚠️ Potential issue | 🟠 Major

Class attributes never updated after scraper initialization - causes repeated re-initialization.

The class attributes bluesky_scraper, weibo_scraper, and firecrawl_scraper are used as guards (e.g., not cls.firecrawl_scraper) but are never assigned after initialization. This means every call to init_scraper() will re-initialize the scraper.

Additionally, when initializing for "other" category, cls.scrapers["other"] is updated but cls.scrapers["unknown"] still points to None, causing separate initializations.

Proposed fix
     @classmethod
     async def init_scraper(cls, category: str) -> None:
         if category in cls.scrapers.keys():
             scraper = None
             if category == "bluesky" and not cls.bluesky_scraper:
                 scraper = await cls.init_bluesky_scraper()
+                cls.bluesky_scraper = scraper
             elif category == "weibo" and not cls.weibo_scraper:
                 scraper = await cls.init_weibo_scraper()
+                cls.weibo_scraper = scraper
             elif category in ["other", "unknown"] and not cls.firecrawl_scraper:
                 scraper = await cls.init_firecrawl_scraper()
+                cls.firecrawl_scraper = scraper
+                # Update both keys to use same instance
+                cls.scrapers["other"] = scraper
+                cls.scrapers["unknown"] = scraper
             if scraper:
                 cls.scrapers[category] = scraper
🧰 Tools
🪛 Ruff (0.14.13)

18-21: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


Comment on lines +351 to 358
if url_metadata.source == "unknown" and FIRECRAWL_ON:
    metadata_item = await content_process_function(url_metadata=url_metadata)
    await send_item_message(
        metadata_item, chat_id=message.chat_id, message=message
    )
if url_metadata.source == "unknown" or url_metadata.source == "banned":
    logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
    return

⚠️ Potential issue | 🟠 Major

Control flow issue: Firecrawl processing followed by early return for "unknown" source.

After successfully processing the URL with Firecrawl (lines 352-355), the code falls through to line 356-358 which checks url_metadata.source == "unknown" again and returns early with a debug log stating "no supported url found." This negates the Firecrawl processing that just occurred.

The return should be handled inside the Firecrawl block, or the "unknown" check should exclude the Firecrawl-processed case.

Proposed fix
         if url_metadata.source == "unknown" and FIRECRAWL_ON:
             metadata_item = await content_process_function(url_metadata=url_metadata)
             await send_item_message(
                 metadata_item, chat_id=message.chat_id, message=message
             )
-        if url_metadata.source == "unknown" or url_metadata.source == "banned":
+            return
+        if url_metadata.source == "unknown" or url_metadata.source == "banned":
             logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
             return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if url_metadata.source == "unknown" and FIRECRAWL_ON:
     metadata_item = await content_process_function(url_metadata=url_metadata)
     await send_item_message(
         metadata_item, chat_id=message.chat_id, message=message
     )
+    return
 if url_metadata.source == "unknown" or url_metadata.source == "banned":
     logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
     return

Comment on lines +4 to +7
{% if data.title %}
<b>{{ data.title }}</b>
{% endif %}
{{ data.text }}

⚠️ Potential issue | 🟡 Minor

Escape title output to avoid HTML/Telegram markup injection.
The Jinja2 environment is not configured for autoescape, so titles should be explicitly escaped to prevent formatting issues or unsafe markup.

🔧 Suggested fix
-<b>{{ data.title }}</b>
+<b>{{ data.title | e }}</b>
📝 Committable suggestion


Suggested change
 {% if data.title %}
-<b>{{ data.title }}</b>
+<b>{{ data.title | e }}</b>
 {% endif %}
 {{ data.text }}
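Jinja2's `e` filter HTML-escapes the value much like the stdlib's html.escape (markupsafe, which Jinja2 actually uses, differs only in how it encodes quotes). The effect on a hostile title, with an illustrative value:

```python
import html

title = '<b>fake</b> & "injected" title'
print(html.escape(title))
# → &lt;b&gt;fake&lt;/b&gt; &amp; &quot;injected&quot; title
```

After escaping, the string renders as literal text inside the surrounding <b> tag instead of being parsed as Telegram HTML markup.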
