📝 Walkthrough

This pull request introduces Firecrawl-based web scraping integration into the application. It adds configuration constants, implements a thread-safe Firecrawl client wrapper, creates a scraping pipeline with LLM-powered article extraction, integrates the new scraper into the existing architecture, and enables conditional URL processing in the Telegram bot.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TelegramBot as Telegram Bot
    participant ScraperMgr as ScraperManager
    participant FirecrawlScraper as FirecrawlScraper
    participant FirecrawlClient as FirecrawlClient
    participant FirecrawlAPI as Firecrawl API
    participant LLM as OpenAI LLM
    participant Database as Item Storage
    TelegramBot->>TelegramBot: Receive unknown URL
    TelegramBot->>TelegramBot: Check FIRECRAWL_ON flag
    alt FIRECRAWL_ON enabled
        TelegramBot->>ScraperMgr: get_scraper("unknown" or "other")
        ScraperMgr->>FirecrawlScraper: init_firecrawl_scraper()
        FirecrawlScraper->>FirecrawlClient: get_instance()
        FirecrawlScraper->>FirecrawlClient: scrape_url(url)
        FirecrawlClient->>FirecrawlAPI: POST /scrape with URL
        FirecrawlAPI-->>FirecrawlClient: HTML, markdown, metadata
        FirecrawlClient-->>FirecrawlScraper: scrape result dict
        FirecrawlScraper->>LLM: parsing_article_body_by_llm(html)
        LLM-->>FirecrawlScraper: extracted article HTML
        FirecrawlScraper->>FirecrawlScraper: _process_firecrawl_result()
        FirecrawlScraper-->>TelegramBot: FirecrawlItem (structured)
        TelegramBot->>TelegramBot: send_item_message(item)
        TelegramBot->>Database: Store/forward item
    else FIRECRAWL_ON disabled
        TelegramBot->>TelegramBot: Return "unsupported URL" message
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/services/telegram_bot/__init__.py (1)
209-221: Control flow issue: success message overwritten with failure message.

When `FIRECRAWL_ON` is enabled and the URL is successfully processed (lines 214-217), the code continues to execute lines 218-221, which overwrites the "Processing..." message with "no supported url found" and returns. This contradicts the successful Firecrawl processing. The `return` statement at line 221 should likely be inside the `if FIRECRAWL_ON` block after successful processing, or use `else` to separate the two paths.

Proposed fix

```diff
 if url_metadata.source == "unknown":
     if FIRECRAWL_ON:
         await process_message.edit_text(
-            text=f"Uncategorized url found. General webpage parser is on, Processing..."
+            text="Uncategorized url found. General webpage parser is on, Processing..."
         )
         metadata_item = await content_process_function(url_metadata=url_metadata)
         await send_item_message(
             metadata_item, chat_id=message.chat_id
         )
-    await process_message.edit_text(
-        text=f"For the {i + 1} th url, no supported url found."
-    )
-    return
+        await process_message.delete()
+    else:
+        await process_message.edit_text(
+            text=f"For the {i + 1} th url, no supported url found."
+        )
+        return
```
🤖 Fix all issues with AI agents
In `@app/config.py`:
- Around line 214-218: FIRECRAWL_TIMEOUT_SECONDS is currently assigned from
env.get and may be a string; convert it to an int for type safety by parsing the
value (e.g., wrap the retrieved env value with int(...) or use an existing
helper like get_env_int) so downstream code receives an integer; update the
assignment of FIRECRAWL_TIMEOUT_SECONDS (the symbol in the diff) to parse/coerce
the env value to int and handle a missing/invalid value by falling back to the
default 60.
In `@app/services/scrapers/firecrawl_client/client.py`:
- Around line 70-93: scrape_url builds a params dict (including extra_params)
and accepts timeout_seconds but never uses either; update the call to
self._app.scrape to pass the assembled params (e.g., params=params) and wire the
timeout_seconds through — converting seconds to milliseconds if the SDK expects
ms (timeout_ms = int(timeout_seconds * 1000)) or passing raw seconds when
appropriate; modify the scrape invocation in scrape_url to use these values
instead of the current arguments (refer to scrape_url, params, timeout_seconds,
and self._app.scrape) so callers' options are honored.
In `@app/services/scrapers/firecrawl_client/scraper.py`:
- Around line 51-61: The _get_page_content coroutine is calling the synchronous
self._client.scrape_url which blocks the event loop; switch to Firecrawl's async
API by replacing the blocking call with the AsyncFirecrawl async client and its
async scrape method (e.g., create or ensure self._client is an AsyncFirecrawl
instance and call await self._client.scrape(...) with the same parameters), then
await the existing _process_firecrawl_result(result) call; update error handling
to catch exceptions from the awaited async call and rethrow as before.
In `@app/services/scrapers/scraper_manager.py`:
- Around line 14-21: The class-level scraper attributes (bluesky_scraper,
weibo_scraper, firecrawl_scraper) are declared but never set after creating
instances, causing repeated re-initialization in init_scraper(); update
init_scraper() so that when you create a scraper instance you assign it back to
the corresponding class attribute (e.g., cls.firecrawl_scraper = instance,
cls.bluesky_scraper = instance, cls.weibo_scraper = instance) and ensure the
scrapers mapping (cls.scrapers) points to that same instance for all relevant
keys (update both "other" and "unknown" to reference cls.firecrawl_scraper or
rebuild cls.scrapers from the class attrs after initialization) so subsequent
calls use the cached instances.
In `@app/services/telegram_bot/__init__.py`:
- Around line 351-358: The code processes "unknown" URLs via Firecrawl when
FIRECRAWL_ON is true but then immediately hits the subsequent check that logs
and returns for url_metadata.source == "unknown", negating the Firecrawl result;
to fix, after successfully calling content_process_function and
send_item_message in the Firecrawl branch (the block using url_metadata,
FIRECRAWL_ON, content_process_function, and send_item_message) add an early
return so execution does not continue to the later logger.debug/return, or
alternatively change the second condition to skip returning when Firecrawl was
performed (e.g., only return if url_metadata.source == "unknown" and not
FIRECRAWL_ON or if a flag indicates Firecrawl didn't run).
In `@app/templates/social_media_message.jinja2`:
- Around line 4-7: The template outputs user-provided data.title without
escaping; update the social_media_message.jinja2 template to explicitly escape
the title (e.g., use the Jinja2 escape/filter on data.title) so HTML/Telegram
markup cannot be injected; modify the conditional block around data.title (the
line rendering <b>{{ data.title }}</b>) to render an escaped version of
data.title using the appropriate Jinja2 escape/filter.
🧹 Nitpick comments (5)
.gitignore (1)

259-259: LGTM! Standard macOS system file exclusion.

Adding `.DS_Store` to `.gitignore` is a best practice to prevent macOS folder metadata files from being committed to version control.

📁 Optional: Consider organizing OS-specific entries

For better organization, you could group OS-specific files in a dedicated section near the top of the file or with other OS/IDE-specific entries. However, the current placement at the end is perfectly acceptable.

Example organization:

```diff
+# macOS
+.DS_Store
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
```

This is purely a stylistic preference and not necessary.
app/services/telegram_bot/__init__.py (1)
71-72: Minor formatting nit: consider separating imports for readability.

The `FIRECRAWL_ON` constant is appended to the same line as other imports. For better readability, consider placing it on its own line.

Suggested change

```diff
-    TEMPLATE_LANGUAGE, TELEBOT_MAX_RETRY, FIRECRAWL_ON,
+    TEMPLATE_LANGUAGE,
+    TELEBOT_MAX_RETRY,
+    FIRECRAWL_ON,
```

app/services/scrapers/firecrawl_client/scraper.py (2)
81-96: Consider reusing the AsyncOpenAI client instance.

Creating a new `AsyncOpenAI` client on every call adds overhead. Consider instantiating it once at the module level or as a class attribute.

Suggested refactor

```python
# At module level or in __init__
_openai_client: Optional[AsyncOpenAI] = None

@staticmethod
def _get_openai_client() -> AsyncOpenAI:
    global _openai_client
    if _openai_client is None and OPENAI_API_KEY:
        _openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return _openai_client
```
84-95: Consider making the model name configurable via environment variable.

`gpt-4o-mini` is a valid OpenAI model, but it's hardcoded. Since `OPENAI_API_KEY` is already externalized to environment configuration, consider adding an `OPENAI_MODEL` setting to `app/config.py` and using it here. This would align with the existing configuration pattern and allow flexibility to switch models without code changes.

app/services/scrapers/scraper_manager.py (1)
18-21: Consider annotating the mutable class attribute with `ClassVar`.

The `scrapers` dictionary is a mutable class attribute. For type safety and clarity, consider annotating it with `typing.ClassVar`.

Suggested change

```diff
+from typing import ClassVar, Dict, Optional
+
 class ScraperManager:
     bluesky_scraper: Optional[BlueskyScraper] = None
     weibo_scraper: Optional[WeiboScraper] = None
     firecrawl_scraper: Optional[FirecrawlScraper] = None
-    scrapers = {"bluesky": bluesky_scraper,
-                "weibo": weibo_scraper,
-                "other": firecrawl_scraper,
-                "unknown": firecrawl_scraper}
+    scrapers: ClassVar[Dict[str, Optional[Scraper]]] = {
+        "bluesky": None,
+        "weibo": None,
+        "other": None,
+        "unknown": None,
+    }
```
```python
# Firecrawl API environment variables
FIRECRAWL_ON = get_env_bool(env, "FIRECRAWL_ON", False)
FIRECRAWL_API_URL = os.getenv("FIRECRAWL_API_URL", "")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)
```
Coerce FIRECRAWL_TIMEOUT_SECONDS to int for type safety.
Environment values are strings; downstream expects an int.
🔧 Suggested fix

```diff
-FIRECRAWL_TIMEOUT_SECONDS = env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)
+FIRECRAWL_TIMEOUT_SECONDS = int(env.get("FIRECRAWL_TIMEOUT_SECONDS", 60)) or 60
```
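Since the config module already uses a `get_env_bool` helper, a matching `get_env_int` could look like the sketch below. This helper is not shown in the diff; its name and exact behavior are assumptions, and it simply falls back to the default on a missing or unparsable value:

```python
from typing import Mapping


def get_env_int(env: Mapping[str, str], key: str, default: int) -> int:
    """Read an integer env var, falling back to `default` when missing or invalid."""
    raw = env.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default


# Hypothetical usage in app/config.py:
# FIRECRAWL_TIMEOUT_SECONDS = get_env_int(env, "FIRECRAWL_TIMEOUT_SECONDS", 60)
```

Unlike a bare `int(env.get(...))`, this never raises at import time when the variable is set to a non-numeric string.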
```python
def scrape_url(
    self,
    url: str,
    formats: Optional[List[str]] = None,
    only_main_content: bool = True,
    timeout_seconds: Optional[int] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Single-page scrape (the most common case).
    """
    params: Dict[str, Any] = {
        "formats": formats or ["markdown"],
        "onlyMainContent": only_main_content,
    }
    if extra_params:
        params.update(extra_params)

    # if timeout_seconds is None:
    #     timeout_seconds = self._settings.timeout_seconds

    try:
        return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
            exclude_none=True)
```
🧩 Analysis chain

Verified against the repository and the firecrawl-py SDK documentation:

- `scrape_url` in `app/services/scrapers/firecrawl_client/client.py` assembles a `params` dict, but the `self._app.scrape(...)` call only forwards `formats` and `only_main_content`.
- The synchronous SDK `scrape` method accepts keyword options (`formats`, `headers`, `include_tags`, `exclude_tags`, `wait_for`, plus `**kwargs`), so extra page options can be forwarded as keyword arguments.
- The scrape endpoint's `timeout` parameter is an integer in milliseconds, and the Python SDK forwards it as-is (e.g., `fc.scrape("https://example.com", formats=["markdown"], timeout=15000)` for 15 seconds).
Wire the unused `params` dict and `timeout_seconds` into the SDK call.

The method builds a `params` dict that includes `extra_params`, but then ignores it entirely when calling `scrape()`. Additionally, the `timeout_seconds` parameter is accepted but never applied; the code to use it is commented out. This causes callers' configuration to be silently dropped.

The `timeout_seconds` parameter is in seconds, but the firecrawl SDK expects milliseconds. Consider either converting on input or passing the unconverted value if the SDK handles both.

🔧 Suggested direction (adjust based on the SDK version's params handling)

```diff
 params: Dict[str, Any] = {
     "formats": formats or ["markdown"],
     "onlyMainContent": only_main_content,
 }
+if timeout_seconds is None:
+    timeout_seconds = self._settings.timeout_seconds
+if timeout_seconds is not None:
+    params["timeout"] = timeout_seconds * 1000  # Convert seconds to milliseconds
 if extra_params:
     params.update(extra_params)
 try:
-    return self._app.scrape(url, formats=formats, only_main_content=only_main_content).model_dump(
-        exclude_none=True)
+    return self._app.scrape(url, **params).model_dump(exclude_none=True)
```

🧰 Tools
🪛 Ruff (0.14.13)
75-75: Unused method argument: timeout_seconds
(ARG002)
79-79: Docstring contains ambiguous `（` (FULLWIDTH LEFT PARENTHESIS). Did you mean `(` (LEFT PARENTHESIS)?

(RUF002)

79-79: Docstring contains ambiguous `）` (FULLWIDTH RIGHT PARENTHESIS). Did you mean `)` (RIGHT PARENTHESIS)?

(RUF002)
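The wiring above can be isolated into a small pure helper, which keeps the seconds-to-milliseconds conversion and the `extra_params` merge testable without the SDK. This is a sketch: the helper name `build_scrape_params` is hypothetical, and the exact option key casing (`only_main_content` vs. `onlyMainContent`) depends on the firecrawl-py version in use:

```python
from typing import Any, Dict, List, Optional


def build_scrape_params(
    formats: Optional[List[str]] = None,
    only_main_content: bool = True,
    timeout_seconds: Optional[int] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a scrape call, converting the timeout
    from seconds to the milliseconds the Firecrawl API expects."""
    params: Dict[str, Any] = {
        "formats": formats or ["markdown"],
        "only_main_content": only_main_content,
    }
    if timeout_seconds is not None:
        params["timeout"] = int(timeout_seconds * 1000)
    if extra_params:
        params.update(extra_params)
    return params
```

The client call would then become `self._app.scrape(url, **build_scrape_params(...)).model_dump(exclude_none=True)`, so nothing the caller passes is silently dropped.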
```python
async def _get_page_content(self) -> None:
    try:
        result = self._client.scrape_url(
            url=self.url,
            formats=["markdown", "html"],
            only_main_content=True,
        )
        await self._process_firecrawl_result(result)
    except Exception as e:
        logger.error(f"Failed to scrape URL with Firecrawl: {e}")
        raise
```
🧩 Analysis chain

Verified: `FirecrawlClient.scrape_url` in `client.py` is a synchronous method, and the Firecrawl Python SDK provides async support via the `AsyncFirecrawl` class, whose methods are awaitable mirrors of the synchronous client (`scrape`, `search`, `start_crawl`, `batch_scrape`, etc.).
Use AsyncFirecrawl instead of blocking the event loop with a synchronous call.
self._client.scrape_url() is a synchronous, blocking call that performs network I/O to Firecrawl. Calling it without await in this async method blocks the event loop, preventing other async operations from executing concurrently and defeating the purpose of async. The Firecrawl Python SDK provides an AsyncFirecrawl class with async methods (e.g., scrape()); use it to avoid blocking during the scraping request.
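If migrating to `AsyncFirecrawl` is not immediately feasible, a stopgap that keeps the event loop responsive is to run the blocking call in a worker thread via `asyncio.to_thread`. A minimal sketch; `blocking_scrape` here is a stand-in for the synchronous client method, not the real SDK call:

```python
import asyncio
import time


def blocking_scrape(url: str) -> dict:
    """Stand-in for the synchronous FirecrawlClient.scrape_url call."""
    time.sleep(0.1)  # simulates network I/O that would otherwise block the loop
    return {"url": url, "markdown": "# stub"}


async def get_page_content(url: str) -> dict:
    # Runs the blocking call in a worker thread; other coroutines keep running.
    return await asyncio.to_thread(blocking_scrape, url)


result = asyncio.run(get_page_content("https://example.com"))
```

This preserves the existing synchronous client while removing the event-loop stall; switching to `AsyncFirecrawl` remains the cleaner long-term fix.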
```diff
 bluesky_scraper: Optional[BlueskyScraper] = None
 weibo_scraper: Optional[WeiboScraper] = None
 firecrawl_scraper: Optional[FirecrawlScraper] = None

 scrapers = {"bluesky": bluesky_scraper,
-            "weibo": bluesky_scraper}
+            "weibo": weibo_scraper,
+            "other": firecrawl_scraper,
+            "unknown": firecrawl_scraper}
```
Class attributes are never updated after scraper initialization, causing repeated re-initialization.

The class attributes `bluesky_scraper`, `weibo_scraper`, and `firecrawl_scraper` are used as guards (e.g., `not cls.firecrawl_scraper`) but are never assigned after initialization. This means every call to `init_scraper()` will re-initialize the scraper.

Additionally, when initializing for the "other" category, `cls.scrapers["other"]` is updated but `cls.scrapers["unknown"]` still points to `None`, causing separate initializations.
Proposed fix

```diff
 @classmethod
 async def init_scraper(cls, category: str) -> None:
     if category in cls.scrapers.keys():
         scraper = None
         if category == "bluesky" and not cls.bluesky_scraper:
             scraper = await cls.init_bluesky_scraper()
+            cls.bluesky_scraper = scraper
         elif category == "weibo" and not cls.weibo_scraper:
             scraper = await cls.init_weibo_scraper()
+            cls.weibo_scraper = scraper
         elif category in ["other", "unknown"] and not cls.firecrawl_scraper:
             scraper = await cls.init_firecrawl_scraper()
+            cls.firecrawl_scraper = scraper
+            # Update both keys to use the same instance
+            cls.scrapers["other"] = scraper
+            cls.scrapers["unknown"] = scraper
         if scraper:
             cls.scrapers[category] = scraper
```
cls.scrapers[category] = scraper🧰 Tools
🪛 Ruff (0.14.13)
18-21: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
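The cache-on-init pattern can be sketched in isolation: the created instance is written back to the class attribute, so a second call is a no-op and both categories share one instance. Here `Manager` is a toy stand-in for `ScraperManager`, and `object()` replaces the real `await cls.init_firecrawl_scraper()`:

```python
import asyncio
from typing import ClassVar, Dict, Optional


class Manager:
    """Minimal sketch of caching a lazily-initialized scraper on the class."""
    firecrawl_scraper: ClassVar[Optional[object]] = None
    scrapers: ClassVar[Dict[str, Optional[object]]] = {"other": None, "unknown": None}
    init_calls: ClassVar[int] = 0  # instrumentation to show init runs once

    @classmethod
    async def init_scraper(cls, category: str) -> None:
        if category in ("other", "unknown") and cls.firecrawl_scraper is None:
            cls.init_calls += 1
            cls.firecrawl_scraper = object()  # stand-in for the real async init
            # Point both category keys at the same cached instance.
            cls.scrapers["other"] = cls.firecrawl_scraper
            cls.scrapers["unknown"] = cls.firecrawl_scraper


asyncio.run(Manager.init_scraper("other"))
asyncio.run(Manager.init_scraper("unknown"))  # guard hit: no second init
```

After both calls, `init_calls` is 1 and `scrapers["other"] is scrapers["unknown"]` holds, which is the behavior the review asks for.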
```python
if url_metadata.source == "unknown" and FIRECRAWL_ON:
    metadata_item = await content_process_function(url_metadata=url_metadata)
    await send_item_message(
        metadata_item, chat_id=message.chat_id, message=message
    )
if url_metadata.source == "unknown" or url_metadata.source == "banned":
    logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
    return
```
Control flow issue: Firecrawl processing followed by early return for "unknown" source.
After successfully processing the URL with Firecrawl (lines 352-355), the code falls through to line 356-358 which checks url_metadata.source == "unknown" again and returns early with a debug log stating "no supported url found." This negates the Firecrawl processing that just occurred.
The return should be handled inside the Firecrawl block, or the "unknown" check should exclude the Firecrawl-processed case.
Proposed fix

```diff
 if url_metadata.source == "unknown" and FIRECRAWL_ON:
     metadata_item = await content_process_function(url_metadata=url_metadata)
     await send_item_message(
         metadata_item, chat_id=message.chat_id, message=message
     )
+    return
 if url_metadata.source == "unknown" or url_metadata.source == "banned":
     logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
     return
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if url_metadata.source == "unknown" and FIRECRAWL_ON:
    metadata_item = await content_process_function(url_metadata=url_metadata)
    await send_item_message(
        metadata_item, chat_id=message.chat_id, message=message
    )
    return
if url_metadata.source == "unknown" or url_metadata.source == "banned":
    logger.debug(f"for the {i + 1}th url {url}, no supported url found.")
    return
```
```jinja2
{% if data.title %}
<b>{{ data.title }}</b>
{% endif %}
{{ data.text }}
```
Escape title output to avoid HTML/Telegram markup injection.
The Jinja2 environment is not configured for autoescape, so titles should be explicitly escaped to prevent formatting issues or unsafe markup.
🔧 Suggested fix

```diff
-<b>{{ data.title }}</b>
+<b>{{ data.title | e }}</b>
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```jinja2
{% if data.title %}
<b>{{ data.title | e }}</b>
{% endif %}
{{ data.text }}
```
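Jinja2's `e` filter applies MarkupSafe escaping; the stdlib `html.escape` behaves the same way for the characters that matter here (`&`, `<`, `>`, quotes), though MarkupSafe emits numeric entities for quotes. A quick illustration of what escaping does to a hostile title:

```python
from html import escape

# A title containing markup that would otherwise be interpreted by
# Telegram's HTML parse mode.
title = '<script>alert("hi")</script> & more'

# Roughly what {{ data.title | e }} produces in the template.
safe = escape(title)
```

Rendered with the escaped value, the `<b>...</b>` wrapper in the template stays the only markup Telegram sees.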