feat: Enhance URL Component with HTML Link Processing by zhangsichu · Pull Request #9388 · langflow-ai/langflow

zhangsichu · 2025-08-14T07:14:02Z

Summary

This PR enhances the URL component by adding comprehensive HTML link processing capabilities, allowing users to convert relative URLs to absolute URLs in crawled content. This feature is particularly useful for maintaining link integrity when processing web content for analysis or storage.

Changes Made

1. New Feature: HTML Link Processing

Added process_links boolean input parameter to control link processing
Implemented _process_html_links() method for comprehensive URL conversion
Added validation to ensure link processing only occurs with HTML format

2. Enhanced URL Processing

HTML Tags: Processes href, src, action attributes in various HTML elements
CSS URLs: Converts relative URLs in CSS url() references to absolute URLs
Data Attributes: Handles data attributes that may contain relative paths
Comprehensive Coverage: Supports a, img, link, script, iframe, form, video, audio, source, track tags

3. Improved Content Handling

Refactored content processing loop for better readability and maintainability
Added conditional link processing based on format and user preference
Enhanced error handling with graceful fallback for link processing failures

4. Code Quality Improvements

Added proper import for urllib.parse utilities
Updated component documentation to reflect new capabilities
Improved code structure and readability

Technical Details

New Input Parameter

BoolInput(
    name="process_links",
    display_name="Process Links",
    info="If enabled and format is HTML, converts relative links to absolute URLs in the output.",
    value=True,
    required=False,
    advanced=True,
)

Link Processing Method

The _process_html_links() method:

Uses BeautifulSoup for robust HTML parsing
Handles multiple attribute types (href, src, action)
Processes CSS url() references
Maintains data attribute integrity
Provides graceful error handling

Validation Logic

# Validate that process_links is only used with HTML format
if self.process_links and self.format != "HTML":
    logger.warning("process_links is only effective when format is set to 'HTML'")

Benefits

Link Integrity: Maintains proper URL references in processed content
Content Portability: Makes crawled content self-contained with absolute URLs
User Control: Optional feature that doesn't affect existing functionality
Performance: Efficient processing with minimal overhead
Robustness: Graceful error handling prevents processing failures

Testing

The changes maintain backward compatibility and include:

Input validation for the new parameter
Conditional processing based on format selection
Comprehensive error handling for malformed HTML
Logging for debugging and monitoring

Backward Compatibility

✅ Fully backward compatible - All existing functionality remains unchanged ✅ New feature is opt-in - Users must explicitly enable process_links ✅ No breaking changes - Existing flows continue to work without modification

Files Changed

src/backend/base/langflow/components/data/url.py - Main component enhancement

Checklist

Code follows Langflow backend development guidelines
New feature is properly documented
Input validation and error handling implemented
Backward compatibility maintained
Code is readable and maintainable
No breaking changes introduced

Related Issues

This enhancement addresses the need for maintaining link integrity when processing web content, making the URL component more useful for content analysis and storage use cases.

Summary by CodeRabbit

New Features
- Added “Process Links” option to URL components (default on) that converts relative links to absolute URLs in HTML output, improving link reliability. Applies to Blog Writer, Knowledge Ingestion, and Simple Agent starter projects.
- Handles links in href/src/action, data-* attributes, and CSS url() inside style tags.
Bug Fixes
- Improved error handling during URL fetching with clearer messages and safer fallbacks.
- Warns when “Process Links” is enabled for non-HTML formats.
Documentation
- Updated component descriptions to reflect link-processing capability.

### Summary This PR enhances the URL component by adding comprehensive HTML link processing capabilities, allowing users to convert relative URLs to absolute URLs in crawled content. This feature is particularly useful for maintaining link integrity when processing web content for analysis or storage. ### Changes Made #### 1. **New Feature: HTML Link Processing** - Added `process_links` boolean input parameter to control link processing - Implemented `_process_html_links()` method for comprehensive URL conversion - Added validation to ensure link processing only occurs with HTML format #### 2. **Enhanced URL Processing** - **HTML Tags**: Processes `href`, `src`, `action` attributes in various HTML elements - **CSS URLs**: Converts relative URLs in CSS `url()` references to absolute URLs - **Data Attributes**: Handles data attributes that may contain relative paths - **Comprehensive Coverage**: Supports `a`, `img`, `link`, `script`, `iframe`, `form`, `video`, `audio`, `source`, `track` tags #### 3. **Improved Content Handling** - Refactored content processing loop for better readability and maintainability - Added conditional link processing based on format and user preference - Enhanced error handling with graceful fallback for link processing failures #### 4. **Code Quality Improvements** - Added proper import for `urllib.parse` utilities - Updated component documentation to reflect new capabilities - Improved code structure and readability ### Technical Details #### New Input Parameter ```python BoolInput( name="process_links", display_name="Process Links", info="If enabled and format is HTML, converts relative links to absolute URLs in the output.", value=True, required=False, advanced=True, ) ``` #### Link Processing Method The `_process_html_links()` method: - Uses BeautifulSoup for robust HTML parsing - Handles multiple attribute types (`href`, `src`, `action`) - Processes CSS `url()` references - Maintains data attribute integrity - Provides graceful error handling #### Validation Logic ```python # Validate that process_links is only used with HTML format if self.process_links and self.format != "HTML": logger.warning("process_links is only effective when format is set to 'HTML'") ``` ### Benefits 1. **Link Integrity**: Maintains proper URL references in processed content 2. **Content Portability**: Makes crawled content self-contained with absolute URLs 3. **User Control**: Optional feature that doesn't affect existing functionality 4. **Performance**: Efficient processing with minimal overhead 5. **Robustness**: Graceful error handling prevents processing failures ### Testing The changes maintain backward compatibility and include: - Input validation for the new parameter - Conditional processing based on format selection - Comprehensive error handling for malformed HTML - Logging for debugging and monitoring ### Backward Compatibility ✅ **Fully backward compatible** - All existing functionality remains unchanged ✅ **New feature is opt-in** - Users must explicitly enable `process_links` ✅ **No breaking changes** - Existing flows continue to work without modification ### Files Changed - `src/backend/base/langflow/components/data/url.py` - Main component enhancement ### Checklist - [x] Code follows Langflow backend development guidelines - [x] New feature is properly documented - [x] Input validation and error handling implemented - [x] Backward compatibility maintained - [x] Code is readable and maintainable - [x] No breaking changes introduced ### Related Issues This enhancement addresses the need for maintaining link integrity when processing web content, making the URL component more useful for content analysis and storage use cases.

coderabbitai · 2025-08-14T07:14:10Z

Walkthrough

Adds optional HTML link normalization to URLComponent. Introduces a process_links boolean input and a private _process_html_links method to convert relative URLs to absolute in HTML outputs. Applies processing conditionally in fetch_url_contents. Updates three starter project templates to expose the new input and updated code payloads.

Changes

Cohort / File(s)	Summary of Changes
Core URL Component `src/backend/base/langflow/components/data/url.py`	Added process_links BoolInput and _process_html_links method using BeautifulSoup and urljoin to rewrite relative links (href/src/action/data-*, CSS url()). fetch_url_contents conditionally applies processing for HTML format, adjusts per-document handling, logs warnings on misuse, and improves error handling. Updated docstring.
Starter Projects `src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json`, `src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json`, `src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json`	Updated URLComponent code payloads to include process_links input (default true) and link-processing logic. Integrated conditional HTML link rewriting and loader/base URL adjustments where applicable. Updated metadata (e.g., code_hash).

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant URLComponent
  participant Loader
  participant Fetcher
  User->>URLComponent: Run with url(s), format, process_links
  URLComponent->>Loader: Create loader (base_url, format)
  Loader->>Fetcher: Fetch documents
  Fetcher-->>Loader: Documents (content, metadata)
  Loader-->>URLComponent: Documents
  alt format == HTML and process_links == true
    URLComponent->>URLComponent: _process_html_links(content, base_url)
  end
  URLComponent-->>User: Data dicts (url, title, description, content_type, language, content)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

refactor(urlcomponent): rename output display names and method references #8448: Also modifies URLComponent in the same path; overlapping changes to outputs/display and message handling suggest coordination with this link-processing update.

Suggested labels

size:L, lgtm

Suggested reviewers

Yukiyukiyeah
erichare
ogabrielluiz

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

sonarqubecloud · 2025-08-14T07:17:34Z

Quality Gate failed

Failed conditions
5 Security Hotspots

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (7)

src/backend/base/langflow/components/data/url.py (4)
167-174: New input looks good; consider syncing UI visibility with format selection

The new advanced BoolInput process_links is well-scoped and backwards-compatible.

Optional: Hide/disable this input dynamically when format != "HTML" via update_build_config to prevent user confusion (in addition to the runtime warning you already log). I can draft that update if useful.

233-258: URL rewriting coverage is solid; consider extending support for srcset/poster and scheme-relative URLs

Current handling covers href, src, and action well. Two optional improvements:

Add support for img/srcset and source/srcset attributes (comma-separated URLs) and video/poster.

Scheme-relative URLs (//cdn.example.com/foo.css) are already resolved correctly by urljoin; keeping them as-is is fine, but consider skipping rewrite if you prefer not to collapse to an explicit scheme.

I can provide a small helper to parse and normalize srcset lists if you want to include it.

274-290: Avoid re-import and compile the CSS regex once; handle whitespace around url()

Remove the inner import re and reuse the module import.

Precompile a robust pattern once to avoid recompilation and to handle whitespace and optional quotes.

Apply this diff here:
-            for style_tag in soup.find_all("style"):
-                if style_tag.string:
-                    # Simple regex to find url() references in CSS
-                    import re
-
-                    css_content = style_tag.string
-                    url_pattern = r'url\([\'"]?([^\'"]+)[\'"]?\)'
-
-                    def replace_url(match):
-                        url = match.group(1)
-                        if url and not url.startswith(("http://", "https://", "data:", "#")):
-                            absolute_url = urljoin(base_url, url)
-                            return f'url("{absolute_url}")'
-                        return match.group(0)
-
-                    style_tag.string = re.sub(url_pattern, replace_url, css_content)
+            for style_tag in soup.find_all("style"):
+                if style_tag.string:
+                    css_content = style_tag.string
+
+                    def replace_url(match):
+                        url = match.group(2)
+                        if url and not url.startswith(("http://", "https://", "data:", "#")):
+                            absolute_url = urljoin(base_url, url)
+                            return f'url("{absolute_url}")'
+                        return match.group(0)
+
+                    style_tag.string = CSS_URL_PATTERN.sub(replace_url, css_content)
And add this once at the top-level (outside this range):
# at module level, near imports
CSS_URL_PATTERN = re.compile(r'url\(\s*([\'"]?)([^\'")]+)\1\s*\)', re.IGNORECASE)
387-391: Simplify error message extraction; avoid deprecated Exception.message attribute

Python exceptions don’t reliably expose a .message attribute. Use str(e).

Apply this diff:
-        except Exception as e:
-            error_msg = e.message if hasattr(e, "message") else e
-            msg = f"Error loading documents: {error_msg!s}"
+        except Exception as e:
+            msg = f"Error loading documents: {e!s}"
             logger.exception(msg)
             raise ValueError(msg) from e
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (1)

1755-1792: New process_links input is good; consider adding to field_order

Adding process_links is great. For deterministic UI ordering, consider including it in field_order after "format". Not required, but improves UX consistency across templates.

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

599-616: process_links input inclusion looks correct

The new input is well-described and advanced by default. Consider listing it in field_order for a predictable UI position (optional).

src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (1)

1239-1256: process_links input added; optional UI order tweak

Good addition. Optionally add process_links to field_order (e.g., after "format") for stable ordering in the template.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e68f6a4 and 15e35a5.

📒 Files selected for processing (4)

src/backend/base/langflow/components/data/url.py (6 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (3 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (3 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (3 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

src/backend/base/langflow/components/data/url.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/base/langflow/components/data/url.py

src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/base/langflow/components/data/url.py

🧬 Code Graph Analysis (1)

src/backend/base/langflow/components/data/url.py (2)

src/backend/base/langflow/inputs/inputs.py (1)

BoolInput (413-425)

src/backend/base/langflow/helpers/data.py (1)

safe_convert (166-192)

🪛 GitHub Check: Ruff Style Check (3.13)

src/backend/base/langflow/components/data/url.py

[failure] 292-292: Ruff (BLE001)
src/backend/base/langflow/components/data/url.py:292:16: BLE001 Do not catch blind exception: Exception

[failure] 265-269: Ruff (SIM102)
src/backend/base/langflow/components/data/url.py:265:25: SIM102 Use a single if statement instead of nested if statements

🪛 GitHub Actions: Ruff Style Check

src/backend/base/langflow/components/data/url.py

[error] 265-265: SIM102 Use a single if statement instead of nested if statements. Command: uv run --only-dev ruff check --output-format=github .

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: test-starter-projects

🔇 Additional comments (4)

src/backend/base/langflow/components/data/url.py (1)

318-322: No action needed — RecursiveUrlLoader supports base_url, exclude_dirs, and link_regex

Verified against the LangChain API docs and module source: RecursiveUrlLoader accepts the keyword arguments base_url, exclude_dirs, and link_regex, so the code is valid as written.

File to note: src/backend/base/langflow/components/data/url.py (lines 318–322) — keep the kwargs as-is.

src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (1)

1528-1530: Code hash update acknowledged

The code_hash change reflects the updated URLComponent implementation.

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

342-344: Code hash update acknowledged

This reflects the updated URLComponent code in the template.

src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (1)

981-983: Code hash update acknowledged

Reflects the new URLComponent implementation. No issues here.

coderabbitai · 2025-08-14T07:24:42Z

+                for attr_name, attr_value in tag.attrs.items():
+                    if attr_name.startswith("data-") and isinstance(attr_value, str):
+                        if any(url_indicator in attr_value.lower() for url_indicator in ["http://", "https://", "//"]):
+                            # This might contain a URL, but be careful not to break data attributes
+                            continue
+                        if attr_value and not attr_value.startswith(("#", "javascript:", "data:")):
+                            # Check if it looks like a relative path
+                            if "/" in attr_value or attr_value.endswith(
+                                (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")
+                            ):
+                                absolute_url = urljoin(base_url, attr_value)
+                                tag[attr_name] = absolute_url


🛠️ Refactor suggestion

Ruff SIM102: combine nested ifs for data- attributes and harden detection*

Combine the nested conditions into one and handle extensions case-insensitively. This also resolves the pipeline failure.

Apply this diff:

- # Process data attributes that might contain URLs - for attr_name, attr_value in tag.attrs.items(): - if attr_name.startswith("data-") and isinstance(attr_value, str): - if any(url_indicator in attr_value.lower() for url_indicator in ["http://", "https://", "//"]): - # This might contain a URL, but be careful not to break data attributes - continue - if attr_value and not attr_value.startswith(("#", "javascript:", "data:")): - # Check if it looks like a relative path - if "/" in attr_value or attr_value.endswith( - (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif") - ): - absolute_url = urljoin(base_url, attr_value) - tag[attr_name] = absolute_url + # Process data attributes that might contain URLs + for attr_name, attr_value in tag.attrs.items(): + if ( + attr_name.startswith("data-") + and isinstance(attr_value, str) + and attr_value + and not attr_value.startswith(("#", "javascript:", "data:")) + and not any(ind in attr_value.lower() for ind in ("http://", "https://", "//")) + and ("/" in attr_value or attr_value.lower().endswith((".html", ".htm", ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"))) + ): + absolute_url = urljoin(base_url, attr_value) + tag[attr_name] = absolute_url

This satisfies SIM102 and reduces false positives by normalizing extension matching.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for attr_name, attr_value in tag.attrs.items():

if attr_name.startswith("data-") and isinstance(attr_value, str):

if any(url_indicator in attr_value.lower() for url_indicator in ["http://", "https://", "//"]):

# This might contain a URL, but be careful not to break data attributes

continue

if attr_value and not attr_value.startswith(("#", "javascript:", "data:")):

# Check if it looks like a relative path

if "/" in attr_value or attr_value.endswith(

(".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")

):

absolute_url = urljoin(base_url, attr_value)

tag[attr_name] = absolute_url

# Process data attributes that might contain URLs

for attr_name, attr_value in tag.attrs.items():

if (

attr_name.startswith("data-")

and isinstance(attr_value, str)

and attr_value

and not attr_value.startswith(("#", "javascript:", "data:"))

and not any(ind in attr_value.lower() for ind in ("http://", "https://", "//"))

and ("/" in attr_value or attr_value.lower().endswith((".html", ".htm", ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp")))

):

absolute_url = urljoin(base_url, attr_value)

tag[attr_name] = absolute_url

🧰 Tools

🪛 GitHub Check: Ruff Style Check (3.13)

[failure] 265-269: Ruff (SIM102)
src/backend/base/langflow/components/data/url.py:265:25: SIM102 Use a single if statement instead of nested if statements

🪛 GitHub Actions: Ruff Style Check

[error] 265-265: SIM102 Use a single if statement instead of nested if statements. Command: uv run --only-dev ruff check --output-format=github .

🤖 Prompt for AI Agents

In src/backend/base/langflow/components/data/url.py around lines 260 to 271, combine the nested ifs that process data-* attributes into a single conditional and make extension checks case-insensitive: ensure attr_name startswith("data-") and isinstance(attr_value, str), reject values that already contain URL indicators (http://, https://, //) or start with ("#", "javascript:", "data:"), then normalize attr_value.lower() and check for "/" or endswith a tuple of lowercased extensions (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif") before calling urljoin(base_url, attr_value) and assigning tag[attr_name] = absolute_url; this removes nested branches and avoids false positives by using lowercased extension matching.

coderabbitai · 2025-08-14T07:24:42Z

+        except Exception as e:
+            logger.warning(f"Error processing HTML links: {e}")
+            return html_content
+


🛠️ Refactor suggestion

Narrow broad exception (Ruff BLE001) in HTML processing

Catching bare Exception triggers BLE001. Restrict to specific, expected exceptions. BeautifulSoup may raise FeatureNotFound if the parser is missing; regex may raise re.error; general type/value/attr errors are sufficient.

Apply this diff:

- except Exception as e: - logger.warning(f"Error processing HTML links: {e}") - return html_content + except (FeatureNotFound, re.error, TypeError, ValueError, AttributeError) as e: + logger.warning(f"Error processing HTML links: {e!s}") + return html_content

And update imports (outside this range) to include FeatureNotFound:

from bs4 import BeautifulSoup, FeatureNotFound

If you prefer to keep the broad catch for safety, add # noqa: BLE001 to satisfy Ruff, but the narrowed set is recommended.

🧰 Tools

🪛 GitHub Check: Ruff Style Check (3.13)

[failure] 292-292: Ruff (BLE001)
src/backend/base/langflow/components/data/url.py:292:16: BLE001 Do not catch blind exception: Exception

🤖 Prompt for AI Agents

In src/backend/base/langflow/components/data/url.py around lines 292 to 295, replace the broad "except Exception as e" with a narrowed set of expected exceptions (e.g. FeatureNotFound, re.error, AttributeError, TypeError, ValueError) so you only catch parser, regex, and common type/value errors when processing HTML links, and update the module imports to include FeatureNotFound from bs4 (from bs4 import BeautifulSoup, FeatureNotFound); if you intentionally want to keep a catch-all for safety, annotate the broad except with "# noqa: BLE001" instead of leaving it unannotated.

coderabbitai · 2025-08-14T07:24:42Z

+                "value": "import re\nfrom urllib.parse import urljoin\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n    r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n    re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n    \"\"\"A component that loads and parses content from web pages recursively.\n\n    This component allows fetching content from one or more URLs, with options to:\n    - Control crawl depth\n    - Prevent crawling outside the root domain\n    - Use async loading for better performance\n    - Extract either raw HTML or clean text\n    - Configure request headers and timeouts\n    - Process HTML links to convert relative URLs to absolute URLs\n    \"\"\"\n\n    display_name = \"URL\"\n    description = \"Fetch content from one or more web pages, following links recursively.\"\n    documentation: str = \"https://docs.langflow.org/components-data#url\"\n    icon = \"layout-template\"\n    name = \"URLComponent\"\n\n    inputs = [\n        MessageTextInput(\n            name=\"urls\",\n            display_name=\"URLs\",\n            info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n            is_list=True,\n            tool_mode=True,\n            placeholder=\"Enter a URL...\",\n            list_add_label=\"Add URL\",\n            input_types=[],\n        ),\n        SliderInput(\n            name=\"max_depth\",\n            display_name=\"Depth\",\n            info=(\n                \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n                \"- depth 1: only the initial page\\n\"\n                \"- depth 2: initial page + all pages linked directly from it\\n\"\n                \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n                \"Note: This is about link traversal, not URL path depth.\"\n            ),\n            value=DEFAULT_MAX_DEPTH,\n            range_spec=RangeSpec(min=1, max=5, step=1),\n            required=False,\n            min_label=\" \",\n            max_label=\" \",\n            min_label_icon=\"None\",\n            max_label_icon=\"None\",\n            # slider_input=True\n        ),\n        BoolInput(\n            name=\"prevent_outside\",\n            display_name=\"Prevent Outside\",\n            info=(\n                \"If enabled, only crawls URLs within the same domain as the root URL. \"\n                \"This helps prevent the crawler from going to external websites.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"use_async\",\n            display_name=\"Use Async\",\n            info=(\n                \"If enabled, uses asynchronous loading which can be significantly faster \"\n                \"but might use more system resources.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        DropdownInput(\n            name=\"format\",\n            display_name=\"Output Format\",\n            info=\"Output Format. Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n            options=[\"Text\", \"HTML\"],\n            value=DEFAULT_FORMAT,\n            advanced=True,\n        ),\n        IntInput(\n            name=\"timeout\",\n            display_name=\"Timeout\",\n            info=\"Timeout for the request in seconds.\",\n            value=DEFAULT_TIMEOUT,\n            required=False,\n            advanced=True,\n        ),\n        TableInput(\n            name=\"headers\",\n            display_name=\"Headers\",\n            info=\"The headers to send with the request\",\n            table_schema=[\n                {\n                    \"name\": \"key\",\n                    \"display_name\": \"Header\",\n                    \"type\": \"str\",\n                    \"description\": \"Header name\",\n                },\n                {\n                    \"name\": \"value\",\n                    \"display_name\": \"Value\",\n                    \"type\": \"str\",\n                    \"description\": \"Header value\",\n                },\n            ],\n            value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n            advanced=True,\n            input_types=[\"DataFrame\"],\n        ),\n        BoolInput(\n            name=\"filter_text_html\",\n            display_name=\"Filter Text/HTML\",\n            info=\"If enabled, filters out text/css content type from the results.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"continue_on_failure\",\n            display_name=\"Continue on Failure\",\n            info=\"If enabled, continues crawling even if some requests fail.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"check_response_status\",\n            display_name=\"Check Response Status\",\n            info=\"If enabled, checks the response status of the request.\",\n            value=False,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"autoset_encoding\",\n            display_name=\"Autoset Encoding\",\n            info=\"If enabled, automatically sets the encoding of the request.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"process_links\",\n            display_name=\"Process Links\",\n            info=\"If enabled and format is HTML, converts relative links to absolute URLs in the output.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n    ]\n\n    outputs = [\n        Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n        Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n    ]\n\n    @staticmethod\n    def validate_url(url: str) -> bool:\n        \"\"\"Validates if the given string matches URL pattern.\n\n        Args:\n            url: The URL string to validate\n\n        Returns:\n            bool: True if the URL is valid, False otherwise\n        \"\"\"\n        return bool(URL_REGEX.match(url))\n\n    def ensure_url(self, url: str) -> str:\n        \"\"\"Ensures the given string is a valid URL.\n\n        Args:\n            url: The URL string to validate and normalize\n\n        Returns:\n            str: The normalized URL\n\n        Raises:\n            ValueError: If the URL is invalid\n        \"\"\"\n        url = url.strip()\n        if not url.startswith((\"http://\", \"https://\")):\n            url = \"https://\" + url\n\n        if not self.validate_url(url):\n            msg = f\"Invalid URL: {url}\"\n            raise ValueError(msg)\n\n        return url\n\n    def _process_html_links(self, html_content: str, base_url: str) -> str:\n        \"\"\"Process HTML content and convert relative links to absolute URLs.\n\n        Args:\n            html_content: The raw HTML content\n            base_url: The base URL to resolve relative links against\n\n        Returns:\n            str: HTML content with relative links converted to absolute URLs\n        \"\"\"\n        if not html_content or not base_url:\n            return html_content\n\n        try:\n            soup = BeautifulSoup(html_content, \"lxml\")\n\n            # Process various types of links and resources\n            for tag in soup.find_all(\n                [\"a\", \"img\", \"link\", \"script\", \"iframe\", \"form\", \"video\", \"audio\", \"source\", \"track\"]\n            ):\n                # Process href attributes\n                if tag.has_attr(\"href\"):\n                    href = tag[\"href\"]\n                    if href and not href.startswith(\n                        (\"http://\", \"https://\", \"mailto:\", \"tel:\", \"#\", \"javascript:\", \"data:\")\n                    ):\n                        absolute_url = urljoin(base_url, href)\n                        tag[\"href\"] = absolute_url\n\n                # Process src attributes\n                if tag.has_attr(\"src\"):\n                    src = tag[\"src\"]\n                    if src and not src.startswith((\"http://\", \"https://\", \"data:\", \"#\", \"javascript:\")):\n                        absolute_url = urljoin(base_url, src)\n                        tag[\"src\"] = absolute_url\n\n                # Process action attributes (for forms)\n                if tag.has_attr(\"action\"):\n                    action = tag[\"action\"]\n                    if action and not action.startswith((\"http://\", \"https://\")):\n                        absolute_url = urljoin(base_url, action)\n                        tag[\"action\"] = absolute_url\n\n                # Process data attributes that might contain URLs\n                for attr_name, attr_value in tag.attrs.items():\n                    if attr_name.startswith(\"data-\") and isinstance(attr_value, str):\n                        if any(url_indicator in attr_value.lower() for url_indicator in [\"http://\", \"https://\", \"//\"]):\n                            # This might contain a URL, but be careful not to break data attributes\n                            continue\n                        if attr_value and not attr_value.startswith((\"#\", \"javascript:\", \"data:\")):\n                            # Check if it looks like a relative path\n                            if \"/\" in attr_value or attr_value.endswith(\n                                (\".html\", \".htm\", \".css\", \".js\", \".jpg\", \".png\", \".gif\")\n                            ):\n                                absolute_url = urljoin(base_url, attr_value)\n                                tag[attr_name] = absolute_url\n\n            # Process CSS content for url() references\n            for style_tag in soup.find_all(\"style\"):\n                if style_tag.string:\n                    # Simple regex to find url() references in CSS\n                    import re\n\n                    css_content = style_tag.string\n                    url_pattern = r'url\\([\\'\"]?([^\\'\"]+)[\\'\"]?\\)'\n\n                    def replace_url(match):\n                        url = match.group(1)\n                        if url and not url.startswith((\"http://\", \"https://\", \"data:\", \"#\")):\n                            absolute_url = urljoin(base_url, url)\n                            return f'url(\"{absolute_url}\")'\n                        return match.group(0)\n\n                    style_tag.string = re.sub(url_pattern, replace_url, css_content)\n\n            return str(soup)\n        except Exception as e:\n            logger.warning(f\"Error processing HTML links: {e}\")\n            return html_content\n\n    def _create_loader(self, url: str) -> RecursiveUrlLoader:\n        \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n        Args:\n            url: The URL to load\n\n        Returns:\n            RecursiveUrlLoader: Configured loader instance\n        \"\"\"\n        headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers}\n        extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n        return RecursiveUrlLoader(\n            url=url,\n            max_depth=self.max_depth,\n            prevent_outside=self.prevent_outside,\n            use_async=self.use_async,\n            extractor=extractor,\n            timeout=self.timeout,\n            headers=headers_dict,\n            check_response_status=self.check_response_status,\n            continue_on_failure=self.continue_on_failure,\n            base_url=url,  # Add base_url to ensure consistent domain crawling\n            autoset_encoding=self.autoset_encoding,  # Enable automatic encoding detection\n            exclude_dirs=[],  # Allow customization of excluded directories\n            link_regex=None,  # Allow customization of link filtering\n        )\n\n    def fetch_url_contents(self) -> list[dict]:\n        \"\"\"Load documents from the configured URLs.\n\n        Returns:\n            List[Data]: List of Data objects containing the fetched content\n\n        Raises:\n            ValueError: If no valid URLs are provided or if there's an error loading documents\n        \"\"\"\n        try:\n            urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n            logger.debug(f\"URLs: {urls}\")\n            if not urls:\n                msg = \"No valid URLs provided.\"\n                raise ValueError(msg)\n\n            # Validate that process_links is only used with HTML format\n            if self.process_links and self.format != \"HTML\":\n                logger.warning(\"process_links is only effective when format is set to 'HTML'\")\n\n            all_docs = []\n            for url in urls:\n                logger.debug(f\"Loading documents from {url}\")\n\n                try:\n                    loader = self._create_loader(url)\n                    docs = loader.load()\n\n                    if not docs:\n                        logger.warning(f\"No documents found for {url}\")\n                        continue\n\n                    logger.debug(f\"Found {len(docs)} documents from {url}\")\n                    all_docs.extend(docs)\n\n                except requests.exceptions.RequestException as e:\n                    logger.exception(f\"Error loading documents from {url}: {e}\")\n                    continue\n\n            if not all_docs:\n                msg = \"No documents were successfully loaded from any URL\"\n                raise ValueError(msg)\n\n            # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n            data = []\n            for doc in all_docs:\n                content = doc.page_content\n                source_url = doc.metadata.get(\"source\", \"\")\n\n                # Process HTML links if format is HTML and process_links is enabled\n                if self.format == \"HTML\" and self.process_links and source_url:\n                    content = self._process_html_links(content, source_url)\n\n                data.append(\n                    {\n                        \"text\": safe_convert(content, clean_data=True),\n                        \"url\": source_url,\n                        \"title\": doc.metadata.get(\"title\", \"\"),\n                        \"description\": doc.metadata.get(\"description\", \"\"),\n                        \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n                        \"language\": doc.metadata.get(\"language\", \"\"),\n                    }\n                )\n        except Exception as e:\n            error_msg = e.message if hasattr(e, \"message\") else e\n            msg = f\"Error loading documents: {error_msg!s}\"\n            logger.exception(msg)\n            raise ValueError(msg) from e\n        return data\n\n    def fetch_content(self) -> DataFrame:\n        \"\"\"Convert the documents to a DataFrame.\"\"\"\n        return DataFrame(data=self.fetch_url_contents())\n\n    def fetch_content_as_message(self) -> Message:\n        \"\"\"Convert the documents to a Message.\"\"\"\n        url_contents = self.fetch_url_contents()\n        return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"
              },


🛠️ Refactor suggestion

Mirror linter and robustness fixes in the embedded URLComponent code

As with the main module, please:

Replace nested ifs for data-* processing (SIM102).

Narrow the broad except in _process_html_links (BLE001) or annotate with noqa.

Use a precompiled CSS URL regex and remove the inner import.

Happy to provide a ready-to-paste JSON-safe code string with these updates.

🤖 Prompt for AI Agents

In src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json around lines 1072-1073, update the embedded URLComponent code to (1) move the CSS URL regex out of _process_html_links and precompile it at module scope (remove the inner import of re), (2) replace the nested ifs inside the data-* attributes loop with a flattened conditional (combine checks into a single if that verifies attr_name startswith("data-"), attr_value is str, attr_value not starting with ("#", "javascript:", "data:"), and that it looks like a relative path before rewriting), and (3) tighten the broad exception handler in _process_html_links by catching specific exception types (e.g., AttributeError, TypeError, ValueError, requests-related exceptions if any) or add a precise noqa comment only if narrowing isn't feasible; make sure logging still records the exception message.

coderabbitai · 2025-08-14T07:24:42Z

                "type": "code",
-                "value": "import re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n    r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n    re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n    \"\"\"A component that loads and parses content from web pages recursively.\n\n    This component allows fetching content from one or more URLs, with options to:\n    - Control crawl depth\n    - Prevent crawling outside the root domain\n    - Use async loading for better performance\n    - Extract either raw HTML or clean text\n    - Configure request headers and timeouts\n    \"\"\"\n\n    display_name = \"URL\"\n    description = \"Fetch content from one or more web pages, following links recursively.\"\n    documentation: str = \"https://docs.langflow.org/components-data#url\"\n    icon = \"layout-template\"\n    name = \"URLComponent\"\n\n    inputs = [\n        MessageTextInput(\n            name=\"urls\",\n            display_name=\"URLs\",\n            info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n            is_list=True,\n            tool_mode=True,\n            placeholder=\"Enter a URL...\",\n            list_add_label=\"Add URL\",\n            input_types=[],\n        ),\n        SliderInput(\n            name=\"max_depth\",\n            display_name=\"Depth\",\n            info=(\n                \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n                \"- depth 1: only the initial page\\n\"\n                \"- depth 2: initial page + all pages linked directly from it\\n\"\n                \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n                \"Note: This is about link traversal, not URL path depth.\"\n            ),\n            value=DEFAULT_MAX_DEPTH,\n            range_spec=RangeSpec(min=1, max=5, step=1),\n            required=False,\n            min_label=\" \",\n            max_label=\" \",\n            min_label_icon=\"None\",\n            max_label_icon=\"None\",\n            # slider_input=True\n        ),\n        BoolInput(\n            name=\"prevent_outside\",\n            display_name=\"Prevent Outside\",\n            info=(\n                \"If enabled, only crawls URLs within the same domain as the root URL. \"\n                \"This helps prevent the crawler from going to external websites.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"use_async\",\n            display_name=\"Use Async\",\n            info=(\n                \"If enabled, uses asynchronous loading which can be significantly faster \"\n                \"but might use more system resources.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        DropdownInput(\n            name=\"format\",\n            display_name=\"Output Format\",\n            info=\"Output Format. Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n            options=[\"Text\", \"HTML\"],\n            value=DEFAULT_FORMAT,\n            advanced=True,\n        ),\n        IntInput(\n            name=\"timeout\",\n            display_name=\"Timeout\",\n            info=\"Timeout for the request in seconds.\",\n            value=DEFAULT_TIMEOUT,\n            required=False,\n            advanced=True,\n        ),\n        TableInput(\n            name=\"headers\",\n            display_name=\"Headers\",\n            info=\"The headers to send with the request\",\n            table_schema=[\n                {\n                    \"name\": \"key\",\n                    \"display_name\": \"Header\",\n                    \"type\": \"str\",\n                    \"description\": \"Header name\",\n                },\n                {\n                    \"name\": \"value\",\n                    \"display_name\": \"Value\",\n                    \"type\": \"str\",\n                    \"description\": \"Header value\",\n                },\n            ],\n            value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n            advanced=True,\n            input_types=[\"DataFrame\"],\n        ),\n        BoolInput(\n            name=\"filter_text_html\",\n            display_name=\"Filter Text/HTML\",\n            info=\"If enabled, filters out text/css content type from the results.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"continue_on_failure\",\n            display_name=\"Continue on Failure\",\n            info=\"If enabled, continues crawling even if some requests fail.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"check_response_status\",\n            display_name=\"Check Response Status\",\n            info=\"If enabled, checks the response status of the request.\",\n            value=False,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"autoset_encoding\",\n            display_name=\"Autoset Encoding\",\n            info=\"If enabled, automatically sets the encoding of the request.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n    ]\n\n    outputs = [\n        Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n        Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n    ]\n\n    @staticmethod\n    def validate_url(url: str) -> bool:\n        \"\"\"Validates if the given string matches URL pattern.\n\n        Args:\n            url: The URL string to validate\n\n        Returns:\n            bool: True if the URL is valid, False otherwise\n        \"\"\"\n        return bool(URL_REGEX.match(url))\n\n    def ensure_url(self, url: str) -> str:\n        \"\"\"Ensures the given string is a valid URL.\n\n        Args:\n            url: The URL string to validate and normalize\n\n        Returns:\n            str: The normalized URL\n\n        Raises:\n            ValueError: If the URL is invalid\n        \"\"\"\n        url = url.strip()\n        if not url.startswith((\"http://\", \"https://\")):\n            url = \"https://\" + url\n\n        if not self.validate_url(url):\n            msg = f\"Invalid URL: {url}\"\n            raise ValueError(msg)\n\n        return url\n\n    def _create_loader(self, url: str) -> RecursiveUrlLoader:\n        \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n        Args:\n            url: The URL to load\n\n        Returns:\n            RecursiveUrlLoader: Configured loader instance\n        \"\"\"\n        headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers}\n        extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n        return RecursiveUrlLoader(\n            url=url,\n            max_depth=self.max_depth,\n            prevent_outside=self.prevent_outside,\n            use_async=self.use_async,\n            extractor=extractor,\n            timeout=self.timeout,\n            headers=headers_dict,\n            check_response_status=self.check_response_status,\n            continue_on_failure=self.continue_on_failure,\n            base_url=url,  # Add base_url to ensure consistent domain crawling\n            autoset_encoding=self.autoset_encoding,  # Enable automatic encoding detection\n            exclude_dirs=[],  # Allow customization of excluded directories\n            link_regex=None,  # Allow customization of link filtering\n        )\n\n    def fetch_url_contents(self) -> list[dict]:\n        \"\"\"Load documents from the configured URLs.\n\n        Returns:\n            List[Data]: List of Data objects containing the fetched content\n\n        Raises:\n            ValueError: If no valid URLs are provided or if there's an error loading documents\n        \"\"\"\n        try:\n            urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n            logger.debug(f\"URLs: {urls}\")\n            if not urls:\n                msg = \"No valid URLs provided.\"\n                raise ValueError(msg)\n\n            all_docs = []\n            for url in urls:\n                logger.debug(f\"Loading documents from {url}\")\n\n                try:\n                    loader = self._create_loader(url)\n                    docs = loader.load()\n\n                    if not docs:\n                        logger.warning(f\"No documents found for {url}\")\n                        continue\n\n                    logger.debug(f\"Found {len(docs)} documents from {url}\")\n                    all_docs.extend(docs)\n\n                except requests.exceptions.RequestException as e:\n                    logger.exception(f\"Error loading documents from {url}: {e}\")\n                    continue\n\n            if not all_docs:\n                msg = \"No documents were successfully loaded from any URL\"\n                raise ValueError(msg)\n\n            # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n            data = [\n                {\n                    \"text\": safe_convert(doc.page_content, clean_data=True),\n                    \"url\": doc.metadata.get(\"source\", \"\"),\n                    \"title\": doc.metadata.get(\"title\", \"\"),\n                    \"description\": doc.metadata.get(\"description\", \"\"),\n                    \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n                    \"language\": doc.metadata.get(\"language\", \"\"),\n                }\n                for doc in all_docs\n            ]\n        except Exception as e:\n            error_msg = e.message if hasattr(e, \"message\") else e\n            msg = f\"Error loading documents: {error_msg!s}\"\n            logger.exception(msg)\n            raise ValueError(msg) from e\n        return data\n\n    def fetch_content(self) -> DataFrame:\n        \"\"\"Convert the documents to a DataFrame.\"\"\"\n        return DataFrame(data=self.fetch_url_contents())\n\n    def fetch_content_as_message(self) -> Message:\n        \"\"\"Convert the documents to a Message.\"\"\"\n        url_contents = self.fetch_url_contents()\n        return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"
+                "value": "import re\nfrom urllib.parse import urljoin\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n    r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n    re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n    \"\"\"A component that loads and parses content from web pages recursively.\n\n    This component allows fetching content from one or more URLs, with options to:\n    - Control crawl depth\n    - Prevent crawling outside the root domain\n    - Use async loading for better performance\n    - Extract either raw HTML or clean text\n    - Configure request headers and timeouts\n    - Process HTML links to convert relative URLs to absolute URLs\n    \"\"\"\n\n    display_name = \"URL\"\n    description = \"Fetch content from one or more web pages, following links recursively.\"\n    documentation: str = \"https://docs.langflow.org/components-data#url\"\n    icon = \"layout-template\"\n    name = \"URLComponent\"\n\n    inputs = [\n        MessageTextInput(\n            name=\"urls\",\n            display_name=\"URLs\",\n            info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n            is_list=True,\n            tool_mode=True,\n            placeholder=\"Enter a URL...\",\n            list_add_label=\"Add URL\",\n            input_types=[],\n        ),\n        SliderInput(\n            name=\"max_depth\",\n            display_name=\"Depth\",\n            info=(\n                \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n                \"- depth 1: only the initial page\\n\"\n                \"- depth 2: initial page + all pages linked directly from it\\n\"\n                \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n                \"Note: This is about link traversal, not URL path depth.\"\n            ),\n            value=DEFAULT_MAX_DEPTH,\n            range_spec=RangeSpec(min=1, max=5, step=1),\n            required=False,\n            min_label=\" \",\n            max_label=\" \",\n            min_label_icon=\"None\",\n            max_label_icon=\"None\",\n            # slider_input=True\n        ),\n        BoolInput(\n            name=\"prevent_outside\",\n            display_name=\"Prevent Outside\",\n            info=(\n                \"If enabled, only crawls URLs within the same domain as the root URL. \"\n                \"This helps prevent the crawler from going to external websites.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"use_async\",\n            display_name=\"Use Async\",\n            info=(\n                \"If enabled, uses asynchronous loading which can be significantly faster \"\n                \"but might use more system resources.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        DropdownInput(\n            name=\"format\",\n            display_name=\"Output Format\",\n            info=\"Output Format. Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n            options=[\"Text\", \"HTML\"],\n            value=DEFAULT_FORMAT,\n            advanced=True,\n        ),\n        IntInput(\n            name=\"timeout\",\n            display_name=\"Timeout\",\n            info=\"Timeout for the request in seconds.\",\n            value=DEFAULT_TIMEOUT,\n            required=False,\n            advanced=True,\n        ),\n        TableInput(\n            name=\"headers\",\n            display_name=\"Headers\",\n            info=\"The headers to send with the request\",\n            table_schema=[\n                {\n                    \"name\": \"key\",\n                    \"display_name\": \"Header\",\n                    \"type\": \"str\",\n                    \"description\": \"Header name\",\n                },\n                {\n                    \"name\": \"value\",\n                    \"display_name\": \"Value\",\n                    \"type\": \"str\",\n                    \"description\": \"Header value\",\n                },\n            ],\n            value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n            advanced=True,\n            input_types=[\"DataFrame\"],\n        ),\n        BoolInput(\n            name=\"filter_text_html\",\n            display_name=\"Filter Text/HTML\",\n            info=\"If enabled, filters out text/css content type from the results.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"continue_on_failure\",\n            display_name=\"Continue on Failure\",\n            info=\"If enabled, continues crawling even if some requests fail.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"check_response_status\",\n            display_name=\"Check Response Status\",\n            info=\"If enabled, checks the response status of the request.\",\n            value=False,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"autoset_encoding\",\n            display_name=\"Autoset Encoding\",\n            info=\"If enabled, automatically sets the encoding of the request.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"process_links\",\n            display_name=\"Process Links\",\n            info=\"If enabled and format is HTML, converts relative links to absolute URLs in the output.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n    ]\n\n    outputs = [\n        Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n        Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n    ]\n\n    @staticmethod\n    def validate_url(url: str) -> bool:\n        \"\"\"Validates if the given string matches URL pattern.\n\n        Args:\n            url: The URL string to validate\n\n        Returns:\n            bool: True if the URL is valid, False otherwise\n        \"\"\"\n        return bool(URL_REGEX.match(url))\n\n    def ensure_url(self, url: str) -> str:\n        \"\"\"Ensures the given string is a valid URL.\n\n        Args:\n            url: The URL string to validate and normalize\n\n        Returns:\n            str: The normalized URL\n\n        Raises:\n            ValueError: If the URL is invalid\n        \"\"\"\n        url = url.strip()\n        if not url.startswith((\"http://\", \"https://\")):\n            url = \"https://\" + url\n\n        if not self.validate_url(url):\n            msg = f\"Invalid URL: {url}\"\n            raise ValueError(msg)\n\n        return url\n\n    def _process_html_links(self, html_content: str, base_url: str) -> str:\n        \"\"\"Process HTML content and convert relative links to absolute URLs.\n\n        Args:\n            html_content: The raw HTML content\n            base_url: The base URL to resolve relative links against\n\n        Returns:\n            str: HTML content with relative links converted to absolute URLs\n        \"\"\"\n        if not html_content or not base_url:\n            return html_content\n\n        try:\n            soup = BeautifulSoup(html_content, \"lxml\")\n\n            # Process various types of links and resources\n            for tag in soup.find_all(\n                [\"a\", \"img\", \"link\", \"script\", \"iframe\", \"form\", \"video\", \"audio\", \"source\", \"track\"]\n            ):\n                # Process href attributes\n                if tag.has_attr(\"href\"):\n                    href = tag[\"href\"]\n                    if href and not href.startswith(\n                        (\"http://\", \"https://\", \"mailto:\", \"tel:\", \"#\", \"javascript:\", \"data:\")\n                    ):\n                        absolute_url = urljoin(base_url, href)\n                        tag[\"href\"] = absolute_url\n\n                # Process src attributes\n                if tag.has_attr(\"src\"):\n                    src = tag[\"src\"]\n                    if src and not src.startswith((\"http://\", \"https://\", \"data:\", \"#\", \"javascript:\")):\n                        absolute_url = urljoin(base_url, src)\n                        tag[\"src\"] = absolute_url\n\n                # Process action attributes (for forms)\n                if tag.has_attr(\"action\"):\n                    action = tag[\"action\"]\n                    if action and not action.startswith((\"http://\", \"https://\")):\n                        absolute_url = urljoin(base_url, action)\n                        tag[\"action\"] = absolute_url\n\n                # Process data attributes that might contain URLs\n                for attr_name, attr_value in tag.attrs.items():\n                    if attr_name.startswith(\"data-\") and isinstance(attr_value, str):\n                        if any(url_indicator in attr_value.lower() for url_indicator in [\"http://\", \"https://\", \"//\"]):\n                            # This might contain a URL, but be careful not to break data attributes\n                            continue\n                        if attr_value and not attr_value.startswith((\"#\", \"javascript:\", \"data:\")):\n                            # Check if it looks like a relative path\n                            if \"/\" in attr_value or attr_value.endswith(\n                                (\".html\", \".htm\", \".css\", \".js\", \".jpg\", \".png\", \".gif\")\n                            ):\n                                absolute_url = urljoin(base_url, attr_value)\n                                tag[attr_name] = absolute_url\n\n            # Process CSS content for url() references\n            for style_tag in soup.find_all(\"style\"):\n                if style_tag.string:\n                    # Simple regex to find url() references in CSS\n                    import re\n\n                    css_content = style_tag.string\n                    url_pattern = r'url\\([\\'\"]?([^\\'\"]+)[\\'\"]?\\)'\n\n                    def replace_url(match):\n                        url = match.group(1)\n                        if url and not url.startswith((\"http://\", \"https://\", \"data:\", \"#\")):\n                            absolute_url = urljoin(base_url, url)\n                            return f'url(\"{absolute_url}\")'\n                        return match.group(0)\n\n                    style_tag.string = re.sub(url_pattern, replace_url, css_content)\n\n            return str(soup)\n        except Exception as e:\n            logger.warning(f\"Error processing HTML links: {e}\")\n            return html_content\n\n    def _create_loader(self, url: str) -> RecursiveUrlLoader:\n        \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n        Args:\n            url: The URL to load\n\n        Returns:\n            RecursiveUrlLoader: Configured loader instance\n        \"\"\"\n        headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers}\n        extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n        return RecursiveUrlLoader(\n            url=url,\n            max_depth=self.max_depth,\n            prevent_outside=self.prevent_outside,\n            use_async=self.use_async,\n            extractor=extractor,\n            timeout=self.timeout,\n            headers=headers_dict,\n            check_response_status=self.check_response_status,\n            continue_on_failure=self.continue_on_failure,\n            base_url=url,  # Add base_url to ensure consistent domain crawling\n            autoset_encoding=self.autoset_encoding,  # Enable automatic encoding detection\n            exclude_dirs=[],  # Allow customization of excluded directories\n            link_regex=None,  # Allow customization of link filtering\n        )\n\n    def fetch_url_contents(self) -> list[dict]:\n        \"\"\"Load documents from the configured URLs.\n\n        Returns:\n            List[Data]: List of Data objects containing the fetched content\n\n        Raises:\n            ValueError: If no valid URLs are provided or if there's an error loading documents\n        \"\"\"\n        try:\n            urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n            logger.debug(f\"URLs: {urls}\")\n            if not urls:\n                msg = \"No valid URLs provided.\"\n                raise ValueError(msg)\n\n            # Validate that process_links is only used with HTML format\n            if self.process_links and self.format != \"HTML\":\n                logger.warning(\"process_links is only effective when format is set to 'HTML'\")\n\n            all_docs = []\n            for url in urls:\n                logger.debug(f\"Loading documents from {url}\")\n\n                try:\n                    loader = self._create_loader(url)\n                    docs = loader.load()\n\n                    if not docs:\n                        logger.warning(f\"No documents found for {url}\")\n                        continue\n\n                    logger.debug(f\"Found {len(docs)} documents from {url}\")\n                    all_docs.extend(docs)\n\n                except requests.exceptions.RequestException as e:\n                    logger.exception(f\"Error loading documents from {url}: {e}\")\n                    continue\n\n            if not all_docs:\n                msg = \"No documents were successfully loaded from any URL\"\n                raise ValueError(msg)\n\n            # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n            data = []\n            for doc in all_docs:\n                content = doc.page_content\n                source_url = doc.metadata.get(\"source\", \"\")\n\n                # Process HTML links if format is HTML and process_links is enabled\n                if self.format == \"HTML\" and self.process_links and source_url:\n                    content = self._process_html_links(content, source_url)\n\n                data.append(\n                    {\n                        \"text\": safe_convert(content, clean_data=True),\n                        \"url\": source_url,\n                        \"title\": doc.metadata.get(\"title\", \"\"),\n                        \"description\": doc.metadata.get(\"description\", \"\"),\n                        \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n                        \"language\": doc.metadata.get(\"language\", \"\"),\n                    }\n                )\n        except Exception as e:\n            error_msg = e.message if hasattr(e, \"message\") else e\n            msg = f\"Error loading documents: {error_msg!s}\"\n            logger.exception(msg)\n            raise ValueError(msg) from e\n        return data\n\n    def fetch_content(self) -> DataFrame:\n        \"\"\"Convert the documents to a DataFrame.\"\"\"\n        return DataFrame(data=self.fetch_url_contents())\n\n    def fetch_content_as_message(self) -> Message:\n        \"\"\"Convert the documents to a Message.\"\"\"\n        url_contents = self.fetch_url_contents()\n        return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"
              },


🛠️ Refactor suggestion

Sync URLComponent snippet with module linter fixes

Please replicate the same adjustments as suggested for the Python module:

Collapse nested data-* attribute conditionals (SIM102).

Narrow the broad except in _process_html_links (BLE001) or add noqa.

Use a precompiled CSS URL regex and remove inner import re.

I can help generate the escaped JSON value for a drop-in replacement.

github-actions Bot added the enhancement New feature or request label Aug 14, 2025

[autofix.ci] apply automated fixes

02938ad

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

[autofix.ci] apply automated fixes (attempt 2/3)

15e35a5

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

coderabbitai Bot reviewed Aug 14, 2025

View reviewed changes

zhangsichu closed this Aug 14, 2025

codeflash-ai Bot mentioned this pull request Aug 14, 2025

⚡️ Speed up function calculate_text_metrics by 302% in PR #9088 (feat-knowledge-bases) #9293

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Enhance URL Component with HTML Link Processing#9388

feat: Enhance URL Component with HTML Link Processing#9388
zhangsichu wants to merge 3 commits into
langflow-ai:mainfrom
zhangsichu:main

zhangsichu commented Aug 14, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Aug 14, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

sonarqubecloud Bot commented Aug 14, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Uh oh!

coderabbitai Bot Aug 14, 2025

Uh oh!

coderabbitai Bot Aug 14, 2025

Uh oh!

coderabbitai Bot Aug 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		"value": "import re\nfrom urllib.parse import urljoin\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s])?$\",\n re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n - Process HTML links to convert relative URLs to absolute URLs\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/components-data#url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n options=[\"Text\", \"HTML\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"process_links\",\n display_name=\"Process Links\",\n info=\"If enabled and format is HTML, converts relative links to absolute URLs in the output.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n return url\n\n def _process_html_links(self, html_content: str, base_url: str) -> str:\n \"\"\"Process HTML content and convert relative links to absolute URLs.\n\n Args:\n html_content: The raw HTML content\n base_url: The base URL to resolve relative links against\n\n Returns:\n str: HTML content with relative links converted to absolute URLs\n \"\"\"\n if not html_content or not base_url:\n return html_content\n\n try:\n soup = BeautifulSoup(html_content, \"lxml\")\n\n # Process various types of links and resources\n for tag in soup.find_all(\n [\"a\", \"img\", \"link\", \"script\", \"iframe\", \"form\", \"video\", \"audio\", \"source\", \"track\"]\n ):\n # Process href attributes\n if tag.has_attr(\"href\"):\n href = tag[\"href\"]\n if href and not href.startswith(\n (\"http://\", \"https://\", \"mailto:\", \"tel:\", \"#\", \"javascript:\", \"data:\")\n ):\n absolute_url = urljoin(base_url, href)\n tag[\"href\"] = absolute_url\n\n # Process src attributes\n if tag.has_attr(\"src\"):\n src = tag[\"src\"]\n if src and not src.startswith((\"http://\", \"https://\", \"data:\", \"#\", \"javascript:\")):\n absolute_url = urljoin(base_url, src)\n tag[\"src\"] = absolute_url\n\n # Process action attributes (for forms)\n if tag.has_attr(\"action\"):\n action = tag[\"action\"]\n if action and not action.startswith((\"http://\", \"https://\")):\n absolute_url = urljoin(base_url, action)\n tag[\"action\"] = absolute_url\n\n # Process data attributes that might contain URLs\n for attr_name, attr_value in tag.attrs.items():\n if attr_name.startswith(\"data-\") and isinstance(attr_value, str):\n if any(url_indicator in attr_value.lower() for url_indicator in [\"http://\", \"https://\", \"//\"]):\n # This might contain a URL, but be careful not to break data attributes\n continue\n if attr_value and not attr_value.startswith((\"#\", \"javascript:\", \"data:\")):\n # Check if it looks like a relative path\n if \"/\" in attr_value or attr_value.endswith(\n (\".html\", \".htm\", \".css\", \".js\", \".jpg\", \".png\", \".gif\")\n ):\n absolute_url = urljoin(base_url, attr_value)\n tag[attr_name] = absolute_url\n\n # Process CSS content for url() references\n for style_tag in soup.find_all(\"style\"):\n if style_tag.string:\n # Simple regex to find url() references in CSS\n import re\n\n css_content = style_tag.string\n url_pattern = r'url\\([\\'\"]?([^\\'\"]+)[\\'\"]?\\)'\n\n def replace_url(match):\n url = match.group(1)\n if url and not url.startswith((\"http://\", \"https://\", \"data:\", \"#\")):\n absolute_url = urljoin(base_url, url)\n return f'url(\"{absolute_url}\")'\n return match.group(0)\n\n style_tag.string = re.sub(url_pattern, replace_url, css_content)\n\n return str(soup)\n except Exception as e:\n logger.warning(f\"Error processing HTML links: {e}\")\n return html_content\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers}\n extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n # Validate that process_links is only used with HTML format\n if self.process_links and self.format != \"HTML\":\n logger.warning(\"process_links is only effective when format is set to 'HTML'\")\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, *doc.metadata) for doc in all_docs]\n data = []\n for doc in all_docs:\n content = doc.page_content\n source_url = doc.metadata.get(\"source\", \"\")\n\n # Process HTML links if format is HTML and process_links is enabled\n if self.format == \"HTML\" and self.process_links and source_url:\n content = self._process_html_links(content, source_url)\n\n data.append(\n {\n \"text\": safe_convert(content, clean_data=True),\n \"url\": source_url,\n \"title\": doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n )\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"
		},

Conversation

zhangsichu commented Aug 14, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes Made

1. New Feature: HTML Link Processing

2. Enhanced URL Processing

3. Improved Content Handling

4. Code Quality Improvements

Technical Details

New Input Parameter

Link Processing Method

Validation Logic

Benefits

Testing

Backward Compatibility

Files Changed

Checklist

Related Issues

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

sonarqubecloud Bot commented Aug 14, 2025

Quality Gate failed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhangsichu commented Aug 14, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Aug 14, 2025 •

edited

Loading