📝 Walkthrough

The changes enhance content extraction robustness in the Firecrawl scraper: a truncation detection step compares the extracted content length against the raw HTML, a sanitization fallback is used when truncation is detected, the extraction prompts are updated to enforce complete content delivery, and the banned-domain pattern matching is refined.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/api/src/services/scrapers/general/firecrawl.py (1)
Lines 141-153: DRY — extract the common wrapping logic outside the conditional. Lines 150 and 153 are identical. Consider refactoring to avoid the duplication:
♻️ Proposed refactor

```diff
 # Sanitize and wrap content HTML, with truncation detection fallback
 raw_html = full_result.get("html", "")
 if not content_html or (raw_html and _is_content_truncated(content_html, raw_html)):
     if content_html:
         logger.warning(
             "Firecrawl JSON extraction appears truncated, "
             "falling back to raw HTML"
         )
     content_html = self.sanitize_html(raw_html) if raw_html else ""
-    content = wrap_text_into_html(content_html, is_html=True)
 else:
     content_html = self.sanitize_html(content_html)
-    content = wrap_text_into_html(content_html, is_html=True)
+content = wrap_text_into_html(content_html, is_html=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 141 - 153, The block duplicates the final sanitize+wrap logic; refactor by selecting the source HTML first (use raw_html when content_html is empty or when _is_content_truncated(content_html, raw_html) is true), then perform a single sanitize_html(...) call and one wrap_text_into_html(..., is_html=True) assignment to content; keep the existing warning when falling back but remove the duplicated wrap/sanitize in both branches (refer to variables content_html/raw_html and functions _is_content_truncated, sanitize_html, wrap_text_into_html).
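Applied to the snippet in the nitpick, the refactored flow could read as follows. This is a self-contained sketch: `sanitize_html`, `wrap_text_into_html`, `_is_content_truncated`, and `logger` are stand-ins for the helpers in the surrounding firecrawl.py context, with trivial stub bodies so the control flow can be run in isolation:

```python
import logging

logger = logging.getLogger(__name__)

def sanitize_html(html: str) -> str:
    # Stand-in for self.sanitize_html.
    return html.strip()

def wrap_text_into_html(html: str, is_html: bool = True) -> str:
    # Stand-in for the project's wrapper.
    return f"<article>{html}</article>"

def _is_content_truncated(content_html: str, raw_html: str) -> bool:
    # Stand-in heuristic: extracted content under half the raw length.
    return len(content_html) * 2 < len(raw_html)

def build_content(content_html: str, full_result: dict) -> str:
    # Pick the source HTML first, then sanitize and wrap exactly once.
    raw_html = full_result.get("html", "")
    if not content_html or (raw_html and _is_content_truncated(content_html, raw_html)):
        if content_html:
            logger.warning(
                "Firecrawl JSON extraction appears truncated, falling back to raw HTML"
            )
        content_html = sanitize_html(raw_html) if raw_html else ""
    else:
        content_html = sanitize_html(content_html)
    return wrap_text_into_html(content_html, is_html=True)
```

Both branches now converge on a single `wrap_text_into_html` call, which is the point of the refactor.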
ℹ️ Review info

⚙️ Run configuration
- Configuration used: defaults
- Review profile: CHILL
- Plan: Pro
- Run ID: 35b1c818-ed9c-440d-92d7-9bec8cce6b45
📒 Files selected for processing (3)
- apps/api/src/services/scrapers/general/firecrawl.py
- apps/api/src/services/scrapers/general/firecrawl_schema.py
- packages/shared/fastfetchbot_shared/utils/config.py
| "message_type": MessageType.LONG, | ||
| # ( | ||
| # MessageType.LONG | ||
| # if get_html_text_length(content) > GENERAL_TEXT_LIMIT | ||
| # else MessageType.SHORT | ||
| # ), | ||
| # TODO: For now, we classify all JSON-extracted content as LONG to improve Telegram users' reading experience. |
There was a problem hiding this comment.
Hardcoding MessageType.LONG changes storage behavior for all JSON-extracted content.
Per context snippet 4 in apps/api/src/services/scrapers/common.py, MessageType.LONG triggers Telegraph storage for all articles. This means even short articles (e.g., a 200-character post) will now be stored in Telegraph unnecessarily, adding latency and external service dependency.
If the goal is better Telegram reading experience, consider a lower threshold than the original 800 characters rather than removing the threshold entirely. Alternatively, document this decision more explicitly for future maintainers.
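The threshold-based alternative the reviewer describes could look like the following. This is a self-contained sketch under stated assumptions: `MessageType`, `GENERAL_TEXT_LIMIT`, and `get_html_text_length` stand in for the project's own definitions in scrapers.common, and the 800-character value mirrors the original threshold mentioned in the comment:

```python
import enum
import re

class MessageType(enum.Enum):
    SHORT = "short"
    LONG = "long"  # LONG routes the article to Telegraph storage

# Original threshold cited in the review; a lower value (e.g. 200-800)
# would shift more content to Telegraph without removing the check.
GENERAL_TEXT_LIMIT = 800

def get_html_text_length(html: str) -> int:
    # Approximate visible-text length by stripping tags (illustrative stub).
    return len(re.sub(r"<[^>]+>", "", html).strip())

def choose_message_type(content: str, limit: int = GENERAL_TEXT_LIMIT) -> MessageType:
    # Only long articles need Telegraph; short posts stay inline in Telegram.
    return MessageType.LONG if get_html_text_length(content) > limit else MessageType.SHORT
```

Keeping the conditional preserves the fast path for short posts while still offloading genuinely long articles.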
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 166 - 172,
The hardcoded "message_type": MessageType.LONG forces Telegraph storage for all
JSON-extracted content; revert to a conditional that uses
get_html_text_length(content) compared to GENERAL_TEXT_LIMIT (from
scrapers.common) or pick a lower threshold (e.g., 200–800 chars) if you want
shorter content treated as LONG for Telegram; update the logic where
"message_type" is set in firecrawl.py to compute length via
get_html_text_length(content) and choose MessageType.LONG only when above the
chosen threshold, and add a short comment explaining the chosen threshold
decision for future maintainers.