feat: update Firecrawl prompt #63

Merged

aturret merged 2 commits into main from firecrawl-prompt-fix on Mar 12, 2026

Conversation

@aturret
Owner

@aturret aturret commented Mar 12, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved content extraction robustness with truncation detection and automatic fallback handling to ensure complete article bodies are captured
    • Enhanced extraction validation and completeness verification
  • Chores

    • Updated URL filtering patterns for improved domain-level security

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

The changes enhance content extraction robustness in the Firecrawl scraper by introducing truncation detection logic that compares extracted content length against the raw HTML, implementing a sanitization fallback when truncation is detected, updating the extraction prompts to enforce complete content delivery, and refining banned domain pattern matching.

Changes

  • Truncation Detection & Fallback (apps/api/src/services/scrapers/general/firecrawl.py)
    Adds a _TRUNCATION_RATIO_THRESHOLD constant and an _is_content_truncated() function to detect whether extracted HTML is significantly shorter than the raw HTML. Implements a conditional fallback in JSON extraction: if content is missing or truncated, the raw HTML is sanitized and wrapped, with a warning log; otherwise the extracted content is sanitized. Changes message_type from dynamic computation to a fixed LONG classification with a TODO note.
  • Extraction Schema & Prompts (apps/api/src/services/scrapers/general/firecrawl_schema.py)
    Updates the ExtractedArticle.content description and FIRECRAWL_EXTRACTION_PROMPT to emphasize a complete, full article body as clean HTML with no truncation, summarization, or skipping. Adds explicit prohibitions against partial content and editorial insertions; mandates inclusion of every paragraph and section in full.
  • Banned Pattern Configuration (packages/shared/fastfetchbot_shared/utils/config.py)
    Updates the BANNED_PATTERNS constant: replaces the Gemini share pattern gemini/share/[A-Za-z0-9]+ with gemini\.google\.com\/share\/[A-Za-z0-9]+, refines the Discord invite pattern to discord\.gg\/[A-Za-z0-9]+, and adds a new pattern for telegra\.ph.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit's ode to truncation's cure:
When HTML grows too lean and bare,
Our clever detector rings the flare—
Fallback paths ensure complete care,
No snippet left incomplete or spare,
Full content flows, forever fair! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning: The title 'update Firecrawl prompt' only partially describes the changeset. While it addresses the prompt changes in firecrawl_schema.py, it ignores significant changes to firecrawl.py (truncation detection, fallback logic) and config.py (banned pattern updates), making it incomplete. Resolution: revise the title to reflect the full scope of changes, for example 'feat: enhance Firecrawl extraction with truncation detection and update validation patterns'.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/api/src/services/scrapers/general/firecrawl.py (1)

141-153: DRY: Extract common wrapping logic outside the conditional.

Lines 150 and 153 are identical. Consider refactoring to avoid duplication:

♻️ Proposed refactor
         # Sanitize and wrap content HTML, with truncation detection fallback
         raw_html = full_result.get("html", "")
         if not content_html or (raw_html and _is_content_truncated(content_html, raw_html)):
             if content_html:
                 logger.warning(
                     "Firecrawl JSON extraction appears truncated, "
                     "falling back to raw HTML"
                 )
             content_html = self.sanitize_html(raw_html) if raw_html else ""
-            content = wrap_text_into_html(content_html, is_html=True)
         else:
             content_html = self.sanitize_html(content_html)
-            content = wrap_text_into_html(content_html, is_html=True)
+        content = wrap_text_into_html(content_html, is_html=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 141 - 153,
The block duplicates the final sanitize+wrap logic; refactor by selecting the
source HTML first (use raw_html when content_html is empty or when
_is_content_truncated(content_html, raw_html) is true), then perform a single
sanitize_html(...) call and one wrap_text_into_html(..., is_html=True)
assignment to content; keep the existing warning when falling back but remove
the duplicated wrap/sanitize in both branches (refer to variables
content_html/raw_html and functions _is_content_truncated, sanitize_html,
wrap_text_into_html).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35b1c818-ed9c-440d-92d7-9bec8cce6b45

📥 Commits

Reviewing files that changed from the base of the PR and between 00f2ffb and 0251c6f.

📒 Files selected for processing (3)
  • apps/api/src/services/scrapers/general/firecrawl.py
  • apps/api/src/services/scrapers/general/firecrawl_schema.py
  • packages/shared/fastfetchbot_shared/utils/config.py

Comment on lines +166 to +172
"message_type": MessageType.LONG,
# (
# MessageType.LONG
# if get_html_text_length(content) > GENERAL_TEXT_LIMIT
# else MessageType.SHORT
# ),
# TODO: For now, we classify all JSON-extracted content as LONG to improve Telegram users' reading experience.
⚠️ Potential issue | 🟡 Minor

Hardcoding MessageType.LONG changes storage behavior for all JSON-extracted content.

Per context snippet 4 in apps/api/src/services/scrapers/common.py, MessageType.LONG triggers Telegraph storage for all articles. This means even short articles (e.g., a 200-character post) will now be stored in Telegraph unnecessarily, adding latency and external service dependency.

If the goal is better Telegram reading experience, consider a lower threshold than the original 800 characters rather than removing the threshold entirely. Alternatively, document this decision more explicitly for future maintainers.
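The conditional the reviewer suggests restoring can be sketched like this. MessageType, GENERAL_TEXT_LIMIT, and get_html_text_length are stand-ins for the real definitions in scrapers.common; the 800-character limit is the original threshold the review mentions, not verified against the code.

```python
import re
from enum import Enum


# Stand-in for the real MessageType in the codebase.
class MessageType(Enum):
    SHORT = "short"
    LONG = "long"


# The original threshold mentioned in the review; an assumption here.
GENERAL_TEXT_LIMIT = 800


def get_html_text_length(html: str) -> int:
    # Crude tag-stripping length, a stand-in for the real helper.
    return len(re.sub(r"<[^>]+>", "", html))


def classify_message_type(content: str) -> MessageType:
    """Classify content as LONG (Telegraph storage) only above the threshold."""
    return (
        MessageType.LONG
        if get_html_text_length(content) > GENERAL_TEXT_LIMIT
        else MessageType.SHORT
    )
```

Under this scheme, a 200-character post stays SHORT and is delivered inline, while long articles still go through Telegraph.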

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/general/firecrawl.py` around lines 166 - 172,
The hardcoded "message_type": MessageType.LONG forces Telegraph storage for all
JSON-extracted content; revert to a conditional that uses
get_html_text_length(content) compared to GENERAL_TEXT_LIMIT (from
scrapers.common) or pick a lower threshold (e.g., 200–800 chars) if you want
shorter content treated as LONG for Telegram; update the logic where
"message_type" is set in firecrawl.py to compute length via
get_html_text_length(content) and choose MessageType.LONG only when above the
chosen threshold, and add a short comment explaining the chosen threshold
decision for future maintainers.

@aturret aturret merged commit 9be42c3 into main Mar 12, 2026
2 checks passed