
feat: add Firecrawl scraping feature#45

Merged
aturret merged 1 commit into main from firecrawl-update on Jan 18, 2026

Conversation

@aturret (Owner) commented Jan 18, 2026

Summary by CodeRabbit

  • New Features
    • Automatic content extraction now covers URLs from previously unsupported sources, extracting and formatting comprehensive metadata including titles, authors, summaries, and the main article content.
    • Short-form messages now display extracted titles for improved readability and better visual hierarchy.
    • Enhanced content processing ensures consistent, high-quality formatting across diverse content sources and formats.


coderabbitai bot (Contributor) commented Jan 18, 2026

📝 Walkthrough

This pull request introduces Firecrawl-based web scraping integration into the application. It adds configuration constants, implements a thread-safe Firecrawl client wrapper, creates a scraping pipeline with LLM-powered article extraction, integrates the new scraper into the existing architecture, and enables conditional URL processing in the Telegram bot.

Changes

  • Configuration & Dependencies (.gitignore, app/config.py, pyproject.toml): Added macOS .DS_Store to ignored files. Introduced Firecrawl configuration variables (ON flag, API URL/key, timeout). Updated the openai dependency to ^2.15.0 and added firecrawl-py ^4.13.0.
  • Firecrawl Client Infrastructure (app/services/scrapers/firecrawl_client/__init__.py, app/services/scrapers/firecrawl_client/client.py): Created a FirecrawlItem dataclass extending MetadataItem with id and raw_content fields. Implemented a thread-safe FirecrawlClient singleton wrapper around the Firecrawl SDK with a scrape_url method and error handling.
  • Firecrawl Scraping Pipeline (app/services/scrapers/firecrawl_client/scraper.py): Added FirecrawlDataProcessor with async content retrieval and LLM-based article extraction using OpenAI (GPT-4o-mini). Processes Firecrawl results into structured items with markdown/HTML selection and media extraction. Includes fallback mechanisms for a missing API key or extraction failures.
  • Scraper Integration (app/services/scrapers/common.py, app/services/scrapers/scraper_manager.py): Expanded the early-scraper workaround to include the "other" and "unknown" categories alongside existing patterns. Extended ScraperManager with FirecrawlScraper support, lazy initialization, and registry mappings for the new categories.
  • Telegram Bot Integration (app/services/telegram_bot/__init__.py): Added the FIRECRAWL_ON flag import and conditional URL processing logic. When the flag is enabled and the URL is unknown, the bot processes it via Firecrawl before returning the standard "no supported URL" response.
  • Template Updates (app/templates/social_media_message.jinja2): Modified short-message rendering to conditionally display a bold title before the text when a title exists.
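The thread-safe FirecrawlClient singleton described in the client-infrastructure cohort can be sketched roughly as follows. This is an illustration only: the real wrapper holds a Firecrawl SDK instance and differs in detail, and all names here beyond FirecrawlClient.get_instance are assumptions.

```python
import threading

class FirecrawlClient:
    """Minimal thread-safe singleton sketch; attribute names are illustrative."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self, api_key: str = ""):
        self.api_key = api_key  # the real wrapper would hold a Firecrawl SDK app here

    @classmethod
    def get_instance(cls, api_key: str = "") -> "FirecrawlClient":
        # Double-checked locking: cheap fast path, lock held only on first creation.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(api_key)
        return cls._instance

a = FirecrawlClient.get_instance("fc-test-key")
b = FirecrawlClient.get_instance()
print(a is b)  # → True
```

The double-checked locking keeps the common path lock-free while guaranteeing only one instance is created under concurrent first calls.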

Sequence Diagram(s)

sequenceDiagram
    participant TelegramBot as Telegram Bot
    participant ScraperMgr as ScraperManager
    participant FirecrawlScraper as FirecrawlScraper
    participant FirecrawlClient as FirecrawlClient
    participant FirecrawlAPI as Firecrawl API
    participant LLM as OpenAI LLM
    participant Database as Item Storage

    TelegramBot->>TelegramBot: Receive unknown URL
    TelegramBot->>TelegramBot: Check FIRECRAWL_ON flag
    alt FIRECRAWL_ON enabled
        TelegramBot->>ScraperMgr: get_scraper("unknown" or "other")
        ScraperMgr->>FirecrawlScraper: init_firecrawl_scraper()
        FirecrawlScraper->>FirecrawlClient: get_instance()
        FirecrawlScraper->>FirecrawlClient: scrape_url(url)
        FirecrawlClient->>FirecrawlAPI: POST /scrape with URL
        FirecrawlAPI-->>FirecrawlClient: HTML, markdown, metadata
        FirecrawlClient-->>FirecrawlScraper: scrape result dict
        FirecrawlScraper->>LLM: parsing_article_body_by_llm(html)
        LLM-->>FirecrawlScraper: extracted article HTML
        FirecrawlScraper->>FirecrawlScraper: _process_firecrawl_result()
        FirecrawlScraper-->>TelegramBot: FirecrawlItem (structured)
        TelegramBot->>TelegramBot: send_item_message(item)
        TelegramBot->>Database: Store/forward item
    else FIRECRAWL_ON disabled
        TelegramBot->>TelegramBot: Return "unsupported URL" message
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Firecrawl spins its web with grace,
Unknown URLs find their place,
LLM whispers what's true inside,
New content flows with proper stride! 🕸️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 22.73%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'feat: add Firecrawl scraping feature' accurately captures the main objective of the pull request, which is to add Firecrawl-based web scraping functionality across multiple files.



@aturret aturret merged commit 8a99c84 into main Jan 18, 2026
1 of 2 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/services/telegram_bot/__init__.py (1)

209-221: Control flow issue: success message overwritten with failure message.

When FIRECRAWL_ON is enabled and the URL is successfully processed (lines 214-217), the code continues to execute lines 218-221, which overwrites the "Processing..." message with "no supported url found" and returns. This contradicts the successful Firecrawl processing.

The return statement at line 221 should likely be inside the if FIRECRAWL_ON block after successful processing, or use else to separate the two paths.

Proposed fix
         if url_metadata.source == "unknown":
             if FIRECRAWL_ON:
                 await process_message.edit_text(
-                    text=f"Uncategorized url found. General webpage parser is on, Processing..."
+                    text="Uncategorized url found. General webpage parser is on, Processing..."
                 )
                 metadata_item = await content_process_function(url_metadata=url_metadata)
                 await send_item_message(
                     metadata_item, chat_id=message.chat_id
                 )
-            await process_message.edit_text(
-                text=f"For the {i + 1} th url, no supported url found."
-            )
-            return
+                await process_message.delete()
+            else:
+                await process_message.edit_text(
+                    text=f"For the {i + 1} th url, no supported url found."
+                )
+            return
🤖 Fix all issues with AI agents
In `@app/config.py`:
- Around line 214-218: FIRECRAWL_TIMEOUT_SECONDS is currently assigned from
env.get and may be a string; convert it to an int for type safety by parsing the
value (e.g., wrap the retrieved env value with int(...) or use an existing
helper like get_env_int) so downstream code receives an integer; update the
assignment of FIRECRAWL_TIMEOUT_SECONDS (the symbol in the diff) to parse/coerce
the env value to int and handle a missing/invalid value by falling back to the
default 60.

In `@app/services/scrapers/firecrawl_client/client.py`:
- Around line 70-93: scrape_url builds a params dict (including extra_params)
and accepts timeout_seconds but never uses either; update the call to
self._app.scrape to pass the assembled params (e.g., params=params) and wire the
timeout_seconds through — converting seconds to milliseconds if the SDK expects
ms (timeout_ms = int(timeout_seconds * 1000)) or passing raw seconds when
appropriate; modify the scrape invocation in scrape_url to use these values
instead of the current arguments (refer to scrape_url, params, timeout_seconds,
and self._app.scrape) so callers' options are honored.

In `@app/services/scrapers/firecrawl_client/scraper.py`:
- Around line 51-61: The _get_page_content coroutine is calling the synchronous
self._client.scrape_url which blocks the event loop; switch to Firecrawl's async
API by replacing the blocking call with the AsyncFirecrawl async client and its
async scrape method (e.g., create or ensure self._client is an AsyncFirecrawl
instance and call await self._client.scrape(...) with the same parameters), then
await the existing _process_firecrawl_result(result) call; update error handling
to catch exceptions from the awaited async call and rethrow as before.

In `@app/services/scrapers/scraper_manager.py`:
- Around line 14-21: The class-level scraper attributes (bluesky_scraper,
weibo_scraper, firecrawl_scraper) are declared but never set after creating
instances, causing repeated re-initialization in init_scraper(); update
init_scraper() so that when you create a scraper instance you assign it back to
the corresponding class attribute (e.g., cls.firecrawl_scraper = instance,
cls.bluesky_scraper = instance, cls.weibo_scraper = instance) and ensure the
scrapers mapping (cls.scrapers) points to that same instance for all relevant
keys (update both "other" and "unknown" to reference cls.firecrawl_scraper or
rebuild cls.scrapers from the class attrs after initialization) so subsequent
calls use the cached instances.

In `@app/services/telegram_bot/__init__.py`:
- Around line 351-358: The code processes "unknown" URLs via Firecrawl when
FIRECRAWL_ON is true but then immediately hits the subsequent check that logs
and returns for url_metadata.source == "unknown", negating the Firecrawl result;
to fix, after successfully calling content_process_function and
send_item_message in the Firecrawl branch (the block using url_metadata,
FIRECRAWL_ON, content_process_function, and send_item_message) add an early
return so execution does not continue to the later logger.debug/return, or
alternatively change the second condition to skip returning when Firecrawl was
performed (e.g., only return if url_metadata.source == "unknown" and not
FIRECRAWL_ON or if a flag indicates Firecrawl didn't run).

In `@app/templates/social_media_message.jinja2`:
- Around line 4-7: The template outputs user-provided data.title without
escaping; update the social_media_message.jinja2 template to explicitly escape
the title (e.g., use the Jinja2 escape/filter on data.title) so HTML/Telegram
markup cannot be injected; modify the conditional block around data.title (the
line rendering <b>{{ data.title }}</b>) to render an escaped version of
data.title using the appropriate Jinja2 escape/filter.
🧹 Nitpick comments (5)
.gitignore (1)

259-259: LGTM! Standard macOS system file exclusion.

Adding .DS_Store to .gitignore is a best practice to prevent macOS folder metadata files from being committed to version control.

📁 Optional: Consider organizing OS-specific entries

For better organization, you could group OS-specific files in a dedicated section near the top of the file or with other OS/IDE-specific entries. However, the current placement at the end is perfectly acceptable.

Example organization:

+# macOS
+.DS_Store
+
 # Byte-compiled / optimized / DLL files
 __pycache__/

This is purely a stylistic preference and not necessary.

app/services/telegram_bot/__init__.py (1)

71-72: Minor formatting nit: consider separating imports for readability.

The FIRECRAWL_ON constant is appended to the same line as other imports. For better readability, consider placing it on its own line.

Suggested change
-    TEMPLATE_LANGUAGE, TELEBOT_MAX_RETRY, FIRECRAWL_ON,
+    TEMPLATE_LANGUAGE,
+    TELEBOT_MAX_RETRY,
+    FIRECRAWL_ON,
app/services/scrapers/firecrawl_client/scraper.py (2)

81-96: Consider reusing AsyncOpenAI client instance.

Creating a new AsyncOpenAI client on every call adds overhead. Consider instantiating it once at the module level or as a class attribute.

Suggested refactor
# At module level or in __init__
_openai_client: Optional[AsyncOpenAI] = None

@staticmethod
def _get_openai_client() -> AsyncOpenAI:
    global _openai_client
    if _openai_client is None and OPENAI_API_KEY:
        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return _openai_client

84-95: Consider making the model name configurable via environment variable.

gpt-4o-mini is a valid OpenAI model, but it's hardcoded. Since OPENAI_API_KEY is already externalized to environment configuration, consider adding an OPENAI_MODEL setting to app/config.py and using it here. This would align with the existing configuration pattern and allow flexibility to switch models without code changes.
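A minimal sketch of the suggested pattern. OPENAI_MODEL is a proposed config name, not one that exists in app/config.py today, and the helper name here is purely illustrative:

```python
def resolve_openai_model(env) -> str:
    """Mirror the existing config pattern: env override with the current hardcoded default."""
    return env.get("OPENAI_MODEL", "gpt-4o-mini")

print(resolve_openai_model({}))                          # → gpt-4o-mini
print(resolve_openai_model({"OPENAI_MODEL": "gpt-4o"}))  # → gpt-4o
```

The call site would then pass model=OPENAI_MODEL instead of the string literal.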

app/services/scrapers/scraper_manager.py (1)

18-21: Consider annotating mutable class attribute with ClassVar.

The scrapers dictionary is a mutable class attribute. For type safety and clarity, consider annotating it with typing.ClassVar.

Suggested change
+from typing import ClassVar, Dict, Optional
+
 class ScraperManager:
 
     bluesky_scraper: Optional[BlueskyScraper] = None
     weibo_scraper: Optional[WeiboScraper] = None
     firecrawl_scraper: Optional[FirecrawlScraper] = None
 
-    scrapers = {"bluesky": bluesky_scraper,
-                "weibo": weibo_scraper,
-                "other": firecrawl_scraper,
-                "unknown": firecrawl_scraper}
+    scrapers: ClassVar[Dict[str, Optional[Scraper]]] = {
+        "bluesky": None,
+        "weibo": None,
+        "other": None,
+        "unknown": None,
+    }

Comment on lines +214 to +218
# Firecrawl API environment variables
FIRECRAWL_ON = get_env_bool(env, "FIRECRAWL_ON", False)
FIRECRAWL_API_URL = os.getenv("FIRECRAWL_API_URL", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)

⚠️ Potential issue | 🟡 Minor

Coerce FIRECRAWL_TIMEOUT_SECONDS to int for type safety.
Environment values are strings; downstream expects an int.

🔧 Suggested fix
-FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)
+FIRECRAWL_TIMEOUT_SECONDS = int(env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)) or 60
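A defensive coercion along those lines can be sketched as below. Note this is a standalone illustration: a get_env_int helper may or may not already exist in the config module, and unlike the one-liner above it also survives non-numeric values rather than raising.

```python
def get_env_int(env, key: str, default: int) -> int:
    """Coerce an environment value to int, falling back on missing or invalid input."""
    try:
        return int(env.get(key, default))
    except (TypeError, ValueError):
        return default

# Hypothetical usage in app/config.py:
# FIRECRAWL_TIMEOUT_SECONDS = get_env_int(env, "FIRECRAWL_TIMEOUT_SECONDS", 60)
print(get_env_int({"FIRECRAWL_TIMEOUT_SECONDS": "90"}, "FIRECRAWL_TIMEOUT_SECONDS", 60))      # → 90
print(get_env_int({}, "FIRECRAWL_TIMEOUT_SECONDS", 60))                                       # → 60
print(get_env_int({"FIRECRAWL_TIMEOUT_SECONDS": "ninety"}, "FIRECRAWL_TIMEOUT_SECONDS", 60))  # → 60
```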

Comment on lines +70 to +93
def scrape_url(
    self,
    url: str,
    formats: Optional[List[str]] = None,
    only_main_content: bool = True,
    timeout_seconds: Optional[int] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Single-page scrape (the most common operation).
    """
    params: Dict[str, Any] = {
        "formats": formats or ["markdown"],
        "onlyMainContent": only_main_content,
    }
    if extra_params:
        params.update(extra_params)

    # if timeout_seconds is None:
    #     timeout_seconds = self._settings.timeout_seconds

    try:
        return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
            exclude_none=True)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "client.py" | grep firecrawl

Repository: aturret/FastFetchBot

Length of output: 114


🏁 Script executed:

cat -n app/services/scrapers/firecrawl_client/client.py

Repository: aturret/FastFetchBot

Length of output: 3450


🏁 Script executed:

python3 << 'EOF'
import subprocess
result = subprocess.run(['pip', 'show', 'firecrawl'], capture_output=True, text=True)
print(result.stdout)
EOF

Repository: aturret/FastFetchBot

Length of output: 150


🏁 Script executed:

# Search for how firecrawl scrape is used in the codebase
rg "scrape_url" --type py -A 3 -B 1

Repository: aturret/FastFetchBot

Length of output: 1059


🏁 Script executed:

# Check if there are other usages or tests of the firecrawl SDK
rg "_app\.scrape|Firecrawl" --type py

Repository: aturret/FastFetchBot

Length of output: 3710


🌐 Web query:

firecrawl-py SDK scrape method parameters API documentation

💡 Result:

Summary of firecrawl-py /scrape method parameters (concise):

  • url / url string — the target page to scrape. (docs.firecrawl.dev)
  • formats — array of output formats (e.g., "markdown", "html", "rawHtml", "summary", "json", "links", "images", "screenshot", "branding"). (docs.firecrawl.dev)
  • maxAge — cache freshness in ms (default ~172800000 ms = 2 days); set 0 to force fresh. (docs.firecrawl.dev)
  • storeInCache — boolean to prevent storing results in cache. (docs.firecrawl.dev)
  • location — object with country (ISO 3166-1 alpha-2) and languages to control proxy/language emulation. (docs.firecrawl.dev)
  • pageOptions — object for page parsing behavior (examples include onlyMainContent, parsePDF, etc.). (docs.firecrawl.dev)
  • parsers — array to force specific parsers (e.g., ["pdf"]). (docs.firecrawl.dev)
  • screenshot (when formats includes "screenshot") — options like fullPage, quality, viewport. (docs.firecrawl.dev)
  • JSON extraction / LLM schema — support for structured extraction using a JSON format or pydantic schema. (docs.firecrawl.dev)
  • Additional SDK params — SDK methods accept named params/objects (scrape(), scrape_url(), scrape_options / ScrapeOptions) and support sync/async clients. (docs.firecrawl.dev)

Sources:
[1] Firecrawl Scrape docs (Scrape endpoint, formats, cache, location, screenshots). (docs.firecrawl.dev)
[2] Firecrawl Advanced Scraping Guide (pageOptions, parsers, scrape options). (docs.firecrawl.dev)
[3] firecrawl-py SDK / PyPI / GitHub pages (SDK method names, ScrapeOptions, examples, async). (github.com)

If you want, I can paste exact parameter names and example Python signatures from the SDK docs.

🌐 Web query:

firecrawl-py GitHub repository scrape method signature source code

💡 Result:

The Firecrawl Python SDK scrape signature (synchronous) — from the SDK docs/source — is:

def scrape(
    url: str,
    *,
    formats: Optional[List[str]] = None,
    headers: Optional[Dict[str, str]] = None,
    include_tags: Optional[List[str]] = None,
    exclude_tags: Optional[List[str]] = None,
    wait_for: Optional[int] = None,
    screenshot: Optional[bool] = None,
    full_page_screenshot: Optional[bool] = None,
    mobile: Optional[bool] = None,
    **kwargs
) -> Document

(Example usage and parameter details in the Firecrawl docs and the package docs mirror.) [1][2]

Sources:

  • Firecrawl docs — Scrape endpoint / examples. [1]
  • firecrawl-py package docs (scraping API / signature listing). [2]

🏁 Script executed:

# Check if there's a requirements.txt or setup.py that shows firecrawl version
find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" \) -exec cat {} \;

Repository: aturret/FastFetchBot

Length of output: 1668


🏁 Script executed:

python3 << 'EOF'
import subprocess
import json

# Try to get info about firecrawl package
result = subprocess.run(['pip', 'show', 'firecrawl-py', '-f'], capture_output=True, text=True)
print("Package info:")
print(result.stdout)
print("\nStderr:")
print(result.stderr)
EOF

Repository: aturret/FastFetchBot

Length of output: 132


🌐 Web query:

firecrawl-py scrape method pageOptions onlyMainContent parameter example

💡 Result:

Example — Python (firecrawl-py) using pageOptions.onlyMainContent:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")
resp = app.scrape_url(
    "https://example.com",
    params={
        "pageOptions": {
            "onlyMainContent": True,
            "includeHtml": False
        }
    }
)
print(resp)  # resp will contain the main page content without headers/navs/footers

References: Firecrawl docs — Page Options and Scrape examples. [1][2]


🏁 Script executed:

# Check if scraper.py shows how scrape_url is being called
sed -n '1,50p' app/services/scrapers/firecrawl_client/scraper.py

Repository: aturret/FastFetchBot

Length of output: 2048


🌐 Web query:

firecrawl-py SDK scrape method timeout parameter how to pass

💡 Result:

Pass the timeout as the timeout keyword when calling scrape; it expects milliseconds. Example:

Python
from firecrawl import Firecrawl
fc = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = fc.scrape("https://example.com", formats=["markdown"], timeout=15000) # 15,000 ms = 15s

Docs: scrape endpoint timeout is in milliseconds and the Python SDK forwards that parameter. [1][2]

Sources
[1] Firecrawl Python SDK usage (scrape examples).
[2] Firecrawl Scrape endpoint — timeout (integer, milliseconds).


🌐 Web query:

site:github.com firecrawl-py scrape method implementation kwargs

💡 Result:

The firecrawl Python SDK's scrape method accepts options via keyword args (examples: formats, scrape_options, poll_interval). See the SDK docs and examples in the repository README. [1][2]

Sources:

  • firecrawl (main repo) README — example usage of firecrawl.scrape(url, formats=[...]). [1]
  • firecrawl-py (old Python SDK) README — scrape_url / scrape_url params examples. [2]

Wire unused params dict and timeout_seconds into the SDK call.

The method builds a params dict that includes extra_params, but then ignores it entirely when calling scrape(). Additionally, the timeout_seconds parameter is accepted but never applied—the code to use it is commented out. This causes callers' configuration to be silently dropped.

The timeout_seconds parameter is in seconds but the firecrawl SDK expects milliseconds. Consider either converting on input or passing the unconverted value if the SDK handles both.

🔧 Suggested direction (adjust based on SDK version's params handling)
         params: Dict[str, Any] = {
             "formats": formats or ["markdown"],
             "onlyMainContent": only_main_content,
         }
+        if timeout_seconds is None:
+            timeout_seconds = self._settings.timeout_seconds
+        if timeout_seconds is not None:
+            params["timeout"] = timeout_seconds * 1000  # Convert seconds to milliseconds
         if extra_params:
             params.update(extra_params)

         try:
-            return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
-                exclude_none=True)
+            return self._app.scrape(url, **params).model_dump(exclude_none=True)
🧰 Tools
🪛 Ruff (0.14.13)

75-75: Unused method argument: timeout_seconds

(ARG002)


79-79: Docstring contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF002)


79-79: Docstring contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF002)


Comment on lines +51 to +61
async def _get_page_content(self) -> None:
    try:
        result = self._client.scrape_url(
            url=self.url,
            formats=["markdown", "html"],
            only_main_content=True,
        )
        await self._process_firecrawl_result(result)
    except Exception as e:
        logger.error(f"Failed to scrape URL with Firecrawl: {e}")
        raise

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's find where FirecrawlClient is defined or imported
find . -type f -name "*.py" | xargs grep -l "class FirecrawlClient" 2>/dev/null || echo "No local FirecrawlClient class found"

# Search for scrape_url method definition in the repo
rg "def scrape_url|async def scrape_url" --type py -B 2 -A 10

# Also check the imports in the scraper.py file to understand where _client comes from
cat -n app/services/scrapers/firecrawl_client/scraper.py | head -50

Repository: aturret/FastFetchBot

Length of output: 3450


🏁 Script executed:

#!/bin/bash
# Read the scrape_url method from client.py
cat -n app/services/scrapers/firecrawl_client/client.py | grep -A 30 "def scrape_url"

Repository: aturret/FastFetchBot

Length of output: 1153


🏁 Script executed:

#!/bin/bash
# Read the FirecrawlClient class to see how _app is initialized
cat -n app/services/scrapers/firecrawl_client/client.py | head -70

Repository: aturret/FastFetchBot

Length of output: 2387


🌐 Web query:

Firecrawl Python SDK async support AsyncFirecrawl

💡 Result:

Yes — the Firecrawl Python SDK includes async support via the AsyncFirecrawl class. Its methods mirror Firecrawl (awaitable versions of scrape, search, start_crawl, batch_scrape, watcher, etc.). Example usage and docs are on the official SDK pages. [1][2]

Sources:

  • Firecrawl Python SDK docs (AsyncFirecrawl examples and description). [1]
  • Firecrawl docs (Python SDK overview, async class usage and watcher example). [2]

Use AsyncFirecrawl instead of blocking the event loop with a synchronous call.

self._client.scrape_url() is a synchronous, blocking call that performs network I/O to Firecrawl. Calling it without await in this async method blocks the event loop, preventing other async operations from executing concurrently and defeating the purpose of async. The Firecrawl Python SDK provides an AsyncFirecrawl class with async methods (e.g., scrape()); use it to avoid blocking during the scraping request.

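If migrating to AsyncFirecrawl is not immediate, a stopgap that keeps the event loop responsive is to offload the blocking call with asyncio.to_thread. A self-contained sketch, with a stub standing in for the real client (StubClient and its return shape are illustrative, not the actual SDK):

```python
import asyncio
import time

class StubClient:
    """Stand-in for the synchronous FirecrawlClient."""
    def scrape_url(self, url: str) -> dict:
        time.sleep(0.1)  # simulate blocking network I/O
        return {"markdown": f"# scraped {url}"}

async def get_page_content(client: StubClient, url: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(client.scrape_url, url)

result = asyncio.run(get_page_content(StubClient(), "https://example.com"))
print(result["markdown"])  # → # scraped https://example.com
```

asyncio.to_thread (Python 3.9+) does not make the request itself concurrent, but it stops one slow scrape from stalling every other coroutine in the bot.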

Comment on lines 14 to +21
 bluesky_scraper: Optional[BlueskyScraper] = None
 weibo_scraper: Optional[WeiboScraper] = None
+firecrawl_scraper: Optional[FirecrawlScraper] = None

 scrapers = {"bluesky": bluesky_scraper,
-            "weibo": bluesky_scraper}
+            "weibo": weibo_scraper,
+            "other": firecrawl_scraper,
+            "unknown": firecrawl_scraper}

⚠️ Potential issue | 🟠 Major

Class attributes never updated after scraper initialization - causes repeated re-initialization.

The class attributes bluesky_scraper, weibo_scraper, and firecrawl_scraper are used as guards (e.g., not cls.firecrawl_scraper) but are never assigned after initialization. This means every call to init_scraper() will re-initialize the scraper.

Additionally, when initializing for "other" category, cls.scrapers["other"] is updated but cls.scrapers["unknown"] still points to None, causing separate initializations.

Proposed fix
     @classmethod
     async def init_scraper(cls, category: str) -> None:
         if category in cls.scrapers.keys():
             scraper = None
             if category == "bluesky" and not cls.bluesky_scraper:
                 scraper = await cls.init_bluesky_scraper()
+                cls.bluesky_scraper = scraper
             elif category == "weibo" and not cls.weibo_scraper:
                 scraper = await cls.init_weibo_scraper()
+                cls.weibo_scraper = scraper
             elif category in ["other", "unknown"] and not cls.firecrawl_scraper:
                 scraper = await cls.init_firecrawl_scraper()
+                cls.firecrawl_scraper = scraper
+                # Update both keys to use same instance
+                cls.scrapers["other"] = scraper
+                cls.scrapers["unknown"] = scraper
             if scraper:
                 cls.scrapers[category] = scraper
🧰 Tools
🪛 Ruff (0.14.13)

18-21: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


Comment on lines +351 to 358
if url_metadata.source == "unknown" and FIRECRAWL_ON:
    metadata_item = await content_process_function(url_metadata=url_metadata)
    await send_item_message(
        metadata_item, chat_id=message.chat_id, message=message
    )
if url_metadata.source == "unknown" or url_metadata.source == "banned":
    logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
    return

⚠️ Potential issue | 🟠 Major

Control flow issue: Firecrawl processing followed by early return for "unknown" source.

After successfully processing the URL with Firecrawl (lines 352-355), the code falls through to line 356-358 which checks url_metadata.source == "unknown" again and returns early with a debug log stating "no supported url found." This negates the Firecrawl processing that just occurred.

The return should be handled inside the Firecrawl block, or the "unknown" check should exclude the Firecrawl-processed case.

Proposed fix
         if url_metadata.source == "unknown" and FIRECRAWL_ON:
             metadata_item = await content_process_function(url_metadata=url_metadata)
             await send_item_message(
                 metadata_item, chat_id=message.chat_id, message=message
             )
-        if url_metadata.source == "unknown" or url_metadata.source == "banned":
+            return
+        if url_metadata.source == "unknown" or url_metadata.source == "banned":
             logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
             return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if url_metadata.source == "unknown" and FIRECRAWL_ON:
     metadata_item = await content_process_function(url_metadata=url_metadata)
     await send_item_message(
         metadata_item, chat_id=message.chat_id, message=message
     )
+    return
 if url_metadata.source == "unknown" or url_metadata.source == "banned":
     logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
     return

Comment on lines +4 to +7
{% if data.title %}
<b>{{ data.title }}</b>
{% endif %}
{{ data.text }}

⚠️ Potential issue | 🟡 Minor

Escape title output to avoid HTML/Telegram markup injection.
The Jinja2 environment is not configured for autoescape, so titles should be explicitly escaped to prevent formatting issues or unsafe markup.

🔧 Suggested fix
-<b>{{ data.title }}</b>
+<b>{{ data.title | e }}</b>
📝 Committable suggestion


Suggested change
 {% if data.title %}
-<b>{{ data.title }}</b>
+<b>{{ data.title | e }}</b>
 {% endif %}
 {{ data.text }}
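Jinja2's `e` filter HTML-escapes the value much like the stdlib's html.escape (markupsafe, which Jinja2 actually uses, differs only in how it encodes quotes). The effect on a hostile title, with an illustrative value:

```python
import html

title = '<b>fake</b> & "injected" title'
print(html.escape(title))
# → &lt;b&gt;fake&lt;/b&gt; &amp; &quot;injected&quot; title
```

After escaping, the string renders as literal text inside the surrounding <b> tag instead of being parsed as Telegram HTML markup.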
