feat: update Firecrawl prompt #63

Merged

aturret merged 2 commits into main from firecrawl-prompt-fix on Mar 12, 2026

Conversation

@aturret
Owner

@aturret aturret commented Mar 12, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved content extraction robustness with truncation detection and automatic fallback handling to ensure complete article bodies are captured
    • Enhanced extraction validation and completeness verification
  • Chores

    • Updated URL filtering patterns for improved domain-level security

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

The changes enhance content extraction robustness in the Firecrawl scraper by introducing truncation detection logic that compares extracted content length against the raw HTML, implementing a sanitization fallback when truncation is detected, updating the extraction prompts to enforce complete content delivery, and refining banned domain pattern matching.

Changes

  • Truncation Detection & Fallback (apps/api/src/services/scrapers/general/firecrawl.py)
    Adds a _TRUNCATION_RATIO_THRESHOLD constant and an _is_content_truncated() function to detect whether extracted HTML is significantly shorter than the raw HTML. Implements a conditional fallback in JSON extraction: if content is missing or truncated, the raw HTML is sanitized and wrapped, with a warning log; otherwise the extracted content is sanitized. Changes message_type from dynamic computation to a fixed LONG classification with a TODO note.
  • Extraction Schema & Prompts (apps/api/src/services/scrapers/general/firecrawl_schema.py)
    Updates the ExtractedArticle.content description and FIRECRAWL_EXTRACTION_PROMPT to emphasize a complete, full article body as clean HTML with no truncation, summarization, or skipping. Adds explicit prohibitions against partial content and editorial insertions; mandates inclusion of every paragraph and section in full.
  • Banned Pattern Configuration (packages/shared/fastfetchbot_shared/utils/config.py)
    Updates the BANNED_PATTERNS constant: replaces the Gemini share pattern gemini/share/[A-Za-z0-9]+ with gemini\.google\.com\/share\/[A-Za-z0-9]+, refines the Discord invite pattern to discord\.gg\/[A-Za-z0-9]+, and adds a new pattern for telegra\.ph.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit's ode to truncation's cure:
When HTML grows too lean and bare,
Our clever detector rings the flare—
Fallback paths ensure complete care,
No snippet left incomplete or spare,
Full content flows, forever fair! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning: The title 'update Firecrawl prompt' only partially describes the changeset. While it addresses the prompt changes in firecrawl_schema.py, it ignores significant changes to firecrawl.py (truncation detection, fallback logic) and config.py (banned pattern updates), making it incomplete. Resolution: revise the title to reflect the full scope of changes, for example 'feat: enhance Firecrawl extraction with truncation detection and update validation patterns'.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/api/src/services/scrapers/general/firecrawl.py (1)

141-153: DRY: Extract common wrapping logic outside the conditional.

Lines 150 and 153 are identical. Consider refactoring to avoid duplication:

♻️ Proposed refactor
         # Sanitize and wrap content HTML, with truncation detection fallback
         raw_html = full_result.get("html", "")
         if not content_html or (raw_html and _is_content_truncated(content_html, raw_html)):
             if content_html:
                 logger.warning(
                     "Firecrawl JSON extraction appears truncated, "
                     "falling back to raw HTML"
                 )
             content_html = self.sanitize_html(raw_html) if raw_html else ""
-            content = wrap_text_into_html(content_html, is_html=True)
         else:
             content_html = self.sanitize_html(content_html)
-            content = wrap_text_into_html(content_html, is_html=True)
+        content = wrap_text_into_html(content_html, is_html=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 141 - 153,
The block duplicates the final sanitize+wrap logic; refactor by selecting the
source HTML first (use raw_html when content_html is empty or when
_is_content_truncated(content_html, raw_html) is true), then perform a single
sanitize_html(...) call and one wrap_text_into_html(..., is_html=True)
assignment to content; keep the existing warning when falling back but remove
the duplicated wrap/sanitize in both branches (refer to variables
content_html/raw_html and functions _is_content_truncated, sanitize_html,
wrap_text_into_html).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35b1c818-ed9c-440d-92d7-9bec8cce6b45

📥 Commits

Reviewing files that changed from the base of the PR and between 00f2ffb and 0251c6f.

📒 Files selected for processing (3)
  • apps/api/src/services/scrapers/general/firecrawl.py
  • apps/api/src/services/scrapers/general/firecrawl_schema.py
  • packages/shared/fastfetchbot_shared/utils/config.py

Comment on lines +166 to +172
"message_type": MessageType.LONG,
# (
# MessageType.LONG
# if get_html_text_length(content) > GENERAL_TEXT_LIMIT
# else MessageType.SHORT
# ),
# TODO: For now, we classify all JSON-extracted content as LONG to improve Telegram users' reading experience.
⚠️ Potential issue | 🟡 Minor

Hardcoding MessageType.LONG changes storage behavior for all JSON-extracted content.

Per context snippet 4 in apps/api/src/services/scrapers/common.py, MessageType.LONG triggers Telegraph storage for all articles. This means even short articles (e.g., a 200-character post) will now be stored in Telegraph unnecessarily, adding latency and external service dependency.

If the goal is better Telegram reading experience, consider a lower threshold than the original 800 characters rather than removing the threshold entirely. Alternatively, document this decision more explicitly for future maintainers.
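The conditional the reviewer suggests restoring can be sketched like this. MessageType, GENERAL_TEXT_LIMIT, and get_html_text_length are stand-ins for the real definitions in scrapers.common; the 800-character limit is the original threshold the review mentions, not verified against the code.

```python
import re
from enum import Enum


# Stand-in for the real MessageType in the codebase.
class MessageType(Enum):
    SHORT = "short"
    LONG = "long"


# The original threshold mentioned in the review; an assumption here.
GENERAL_TEXT_LIMIT = 800


def get_html_text_length(html: str) -> int:
    # Crude tag-stripping length, a stand-in for the real helper.
    return len(re.sub(r"<[^>]+>", "", html))


def classify_message_type(content: str) -> MessageType:
    """Classify content as LONG (Telegraph storage) only above the threshold."""
    return (
        MessageType.LONG
        if get_html_text_length(content) > GENERAL_TEXT_LIMIT
        else MessageType.SHORT
    )
```

Under this scheme, a 200-character post stays SHORT and is delivered inline, while long articles still go through Telegraph.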

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 166 - 172,
The hardcoded "message_type": MessageType.LONG forces Telegraph storage for all
JSON-extracted content; revert to a conditional that uses
get_html_text_length(content) compared to GENERAL_TEXT_LIMIT (from
scrapers.common) or pick a lower threshold (e.g., 200–800 chars) if you want
shorter content treated as LONG for Telegram; update the logic where
"message_type" is set in firecrawl.py to compute length via
get_html_text_length(content) and choose MessageType.LONG only when above the
chosen threshold, and add a short comment explaining the chosen threshold
decision for future maintainers.

@aturret aturret merged commit 9be42c3 into main Mar 12, 2026
2 checks passed