feat: Enhance URL Component with HTML Link Processing #9388
### Summary
This PR enhances the URL component by adding comprehensive HTML link processing capabilities, allowing users to convert relative URLs to absolute URLs in crawled content. This feature is particularly useful for maintaining link integrity when processing web content for analysis or storage.
### Changes Made
#### 1. **New Feature: HTML Link Processing**
- Added `process_links` boolean input parameter to control link processing
- Implemented `_process_html_links()` method for comprehensive URL conversion
- Added validation to ensure link processing only occurs with HTML format
#### 2. **Enhanced URL Processing**
- **HTML Tags**: Processes `href`, `src`, `action` attributes in various HTML elements
- **CSS URLs**: Converts relative URLs in CSS `url()` references to absolute URLs
- **Data Attributes**: Handles data attributes that may contain relative paths
- **Comprehensive Coverage**: Supports `a`, `img`, `link`, `script`, `iframe`, `form`, `video`, `audio`, `source`, `track` tags
#### 3. **Improved Content Handling**
- Refactored content processing loop for better readability and maintainability
- Added conditional link processing based on format and user preference
- Enhanced error handling with graceful fallback for link processing failures
#### 4. **Code Quality Improvements**
- Added proper import for `urllib.parse` utilities
- Updated component documentation to reflect new capabilities
- Improved code structure and readability
### Technical Details
#### New Input Parameter
```python
BoolInput(
    name="process_links",
    display_name="Process Links",
    info="If enabled and format is HTML, converts relative links to absolute URLs in the output.",
    value=True,
    required=False,
    advanced=True,
)
```
#### Link Processing Method
The `_process_html_links()` method:
- Uses BeautifulSoup for robust HTML parsing
- Handles multiple attribute types (`href`, `src`, `action`)
- Processes CSS `url()` references
- Maintains data attribute integrity
- Provides graceful error handling
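The attribute rewriting described above rests on `urllib.parse.urljoin`. The following is a minimal sketch of that single step, not the PR's actual method: `to_absolute` is a hypothetical helper name, the example URLs are made up, and the skip list is copied from the `href` branch of the component code.

```python
from urllib.parse import urljoin

# Prefixes the href branch leaves untouched (already absolute, or not a fetchable path).
SKIP_PREFIXES = ("http://", "https://", "mailto:", "tel:", "#", "javascript:", "data:")


def to_absolute(base_url: str, value: str) -> str:
    """Hypothetical helper: resolve one attribute value against base_url."""
    if not value or value.startswith(SKIP_PREFIXES):
        return value
    return urljoin(base_url, value)


base = "https://example.com/docs/page.html"  # assumed page URL, for illustration only
print(to_absolute(base, "img/logo.png"))      # https://example.com/docs/img/logo.png
print(to_absolute(base, "/static/app.js"))    # https://example.com/static/app.js
print(to_absolute(base, "../style.css"))      # https://example.com/style.css
print(to_absolute(base, "#top"))              # #top
```

Because `//` is not in the skip list, a scheme-relative value such as `//cdn.example.com/foo.css` is also passed to `urljoin` and comes back with the base URL's scheme.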
#### Validation Logic
```python
# Validate that process_links is only used with HTML format
if self.process_links and self.format != "HTML":
    logger.warning("process_links is only effective when format is set to 'HTML'")
```
### Benefits
1. **Link Integrity**: Maintains proper URL references in processed content
2. **Content Portability**: Makes crawled content self-contained with absolute URLs
3. **User Control**: Optional feature that doesn't affect existing functionality
4. **Performance**: Efficient processing with minimal overhead
5. **Robustness**: Graceful error handling prevents processing failures
### Testing
The changes maintain backward compatibility and include:
- Input validation for the new parameter
- Conditional processing based on format selection
- Comprehensive error handling for malformed HTML
- Logging for debugging and monitoring
### Backward Compatibility
✅ **Fully backward compatible** - All existing functionality remains unchanged
✅ **New feature is scoped to HTML output** - `process_links` defaults to `True`, but it only takes effect when `format` is set to `HTML` (the default format is `Text`)
✅ **No breaking changes** - Existing flows continue to work without modification
### Files Changed
- `src/backend/base/langflow/components/data/url.py` - Main component enhancement
### Checklist
- [x] Code follows Langflow backend development guidelines
- [x] New feature is properly documented
- [x] Input validation and error handling implemented
- [x] Backward compatibility maintained
- [x] Code is readable and maintainable
- [x] No breaking changes introduced
### Related Issues
This enhancement addresses the need for maintaining link integrity when processing web content, making the URL component more useful for content analysis and storage use cases.
Walkthrough
Adds optional HTML link normalization to `URLComponent`. Introduces a `process_links` boolean input and a private `_process_html_links` method to convert relative URLs to absolute in HTML outputs. Applies processing conditionally in `fetch_url_contents`. Updates three starter project templates to expose the new input and updated code payloads.
Sequence Diagram(s)
sequenceDiagram
    actor User
    participant URLComponent
    participant Loader
    participant Fetcher
    User->>URLComponent: Run with url(s), format, process_links
    URLComponent->>Loader: Create loader (base_url, format)
    Loader->>Fetcher: Fetch documents
    Fetcher-->>Loader: Documents (content, metadata)
    Loader-->>URLComponent: Documents
    alt format == HTML and process_links == true
        URLComponent->>URLComponent: _process_html_links(content, base_url)
    end
    URLComponent-->>User: Data dicts (url, title, description, content_type, language, content)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 5
🧹 Nitpick comments (7)
src/backend/base/langflow/components/data/url.py (4)
167-174: New input looks good; consider syncing UI visibility with format selection
The new advanced BoolInput `process_links` is well-scoped and backwards-compatible.
Optional: Hide/disable this input dynamically when format != "HTML" via update_build_config to prevent user confusion (in addition to the runtime warning you already log). I can draft that update if useful.
233-258: URL rewriting coverage is solid; consider extending support for srcset/poster and scheme-relative URLs
Current handling covers href, src, and action well. Two optional improvements:
- Add support for img/srcset and source/srcset attributes (comma-separated URLs) and video/poster.
- Scheme-relative URLs (//cdn.example.com/foo.css) are already resolved correctly by urljoin; keeping them as-is is fine, but consider skipping rewrite if you prefer not to collapse to an explicit scheme.
I can provide a small helper to parse and normalize srcset lists if you want to include it.
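A helper along those lines might look like the sketch below. It is illustrative only: `absolutize_srcset` is a hypothetical name, and splitting on commas is a simplification that assumes candidate URLs contain no commas themselves (so it would mishandle `data:` URIs and some CDN URLs).

```python
from urllib.parse import urljoin


def absolutize_srcset(srcset: str, base_url: str) -> str:
    """Hypothetical helper: resolve each URL in a srcset value against base_url.

    Each comma-separated candidate is "<url> [descriptor]", e.g. "img/a.png 2x".
    """
    candidates = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty candidates such as a trailing comma
        url, descriptors = parts[0], parts[1:]
        if not url.startswith("data:"):
            url = urljoin(base_url, url)
        candidates.append(" ".join([url, *descriptors]))
    return ", ".join(candidates)


print(absolutize_srcset("img/a.png 1x, /img/b.png 2x", "https://example.com/docs/"))
# https://example.com/docs/img/a.png 1x, https://example.com/img/b.png 2x
```

The same call would cover `source[srcset]`; `video[poster]` holds a single URL, so a plain `urljoin` is enough there.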
274-290: Avoid re-import and compile the CSS regex once; handle whitespace around url()
- Remove the inner `import re` and reuse the module import.
- Precompile a robust pattern once to avoid recompilation and to handle whitespace and optional quotes.
Apply this diff here:
-            for style_tag in soup.find_all("style"):
-                if style_tag.string:
-                    # Simple regex to find url() references in CSS
-                    import re
-
-                    css_content = style_tag.string
-                    url_pattern = r'url\([\'"]?([^\'"]+)[\'"]?\)'
-
-                    def replace_url(match):
-                        url = match.group(1)
-                        if url and not url.startswith(("http://", "https://", "data:", "#")):
-                            absolute_url = urljoin(base_url, url)
-                            return f'url("{absolute_url}")'
-                        return match.group(0)
-
-                    style_tag.string = re.sub(url_pattern, replace_url, css_content)
+            for style_tag in soup.find_all("style"):
+                if style_tag.string:
+                    css_content = style_tag.string
+
+                    def replace_url(match):
+                        url = match.group(2)
+                        if url and not url.startswith(("http://", "https://", "data:", "#")):
+                            absolute_url = urljoin(base_url, url)
+                            return f'url("{absolute_url}")'
+                        return match.group(0)
+
+                    style_tag.string = CSS_URL_PATTERN.sub(replace_url, css_content)
And add this once at the top-level (outside this range):
# at module level, near imports
CSS_URL_PATTERN = re.compile(r'url\(\s*([\'"]?)([^\'")]+)\1\s*\)', re.IGNORECASE)
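To see the suggested pattern in isolation, here is a standalone sketch that applies it to a plain CSS string rather than a BeautifulSoup `<style>` tag. `rewrite_css_urls` is a hypothetical wrapper, and the `.strip()` call is an addition not present in the diff: with an unquoted value, the second group can capture trailing whitespace before the closing parenthesis.

```python
import re
from urllib.parse import urljoin

# Pattern from the suggestion: optional quote (group 1), URL (group 2), same quote again.
CSS_URL_PATTERN = re.compile(r'url\(\s*([\'"]?)([^\'")]+)\1\s*\)', re.IGNORECASE)
ABSOLUTE_PREFIXES = ("http://", "https://", "data:", "#")


def rewrite_css_urls(css: str, base_url: str) -> str:
    """Hypothetical wrapper: make relative url() references in css absolute."""

    def replace_url(match: re.Match) -> str:
        url = match.group(2).strip()  # assumption: trim whitespace in unquoted values
        if not url or url.startswith(ABSOLUTE_PREFIXES):
            return match.group(0)  # leave absolute and data: references untouched
        return f'url("{urljoin(base_url, url)}")'

    return CSS_URL_PATTERN.sub(replace_url, css)


css = "a { background: url( 'img/bg.png' ); }"
print(rewrite_css_urls(css, "https://example.com/css/site.css"))
# a { background: url("https://example.com/css/img/bg.png"); }
```

Note that the rewritten reference is always emitted with double quotes, regardless of the quoting style in the input.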
387-391: Simplify error message extraction; avoid deprecated Exception.message attribute
Python exceptions don’t reliably expose a .message attribute. Use str(e).
Apply this diff:
-        except Exception as e:
-            error_msg = e.message if hasattr(e, "message") else e
-            msg = f"Error loading documents: {error_msg!s}"
+        except Exception as e:
+            msg = f"Error loading documents: {e!s}"
             logger.exception(msg)
             raise ValueError(msg) from e
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (1)
1755-1792: New process_links input is good; consider adding to field_order
Adding process_links is great. For deterministic UI ordering, consider including it in field_order after "format". Not required, but improves UX consistency across templates.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
599-616: process_links input inclusion looks correct
The new input is well-described and advanced by default. Consider listing it in field_order for a predictable UI position (optional).
src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (1)
1239-1256: process_links input added; optional UI order tweak
Good addition. Optionally add process_links to field_order (e.g., after "format") for stable ordering in the template.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/backend/base/langflow/components/data/url.py (6 hunks)
- src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (3 hunks)
- src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (3 hunks)
- src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately
Files:
src/backend/base/langflow/components/data/url.py
{src/backend/**/*.py,tests/**/*.py,Makefile}
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code
Files:
src/backend/base/langflow/components/data/url.py
src/backend/**/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)
In your Python component class, set the `icon` attribute to a string matching the frontend icon mapping exactly (case-sensitive).
Files:
src/backend/base/langflow/components/data/url.py
🧬 Code Graph Analysis (1)
src/backend/base/langflow/components/data/url.py (2)
- src/backend/base/langflow/inputs/inputs.py (1): BoolInput (413-425)
- src/backend/base/langflow/helpers/data.py (1): safe_convert (166-192)
🪛 GitHub Check: Ruff Style Check (3.13)
src/backend/base/langflow/components/data/url.py
[failure] 292-292: Ruff (BLE001)
src/backend/base/langflow/components/data/url.py:292:16: BLE001 Do not catch blind exception: Exception
[failure] 265-269: Ruff (SIM102)
src/backend/base/langflow/components/data/url.py:265:25: SIM102 Use a single if statement instead of nested if statements
🪛 GitHub Actions: Ruff Style Check
src/backend/base/langflow/components/data/url.py
[error] 265-265: SIM102 Use a single if statement instead of nested if statements. Command: uv run --only-dev ruff check --output-format=github .
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test-starter-projects
🔇 Additional comments (4)
src/backend/base/langflow/components/data/url.py (1)
318-322: No action needed: RecursiveUrlLoader supports base_url, exclude_dirs, and link_regex
Verified against the LangChain API docs and module source: RecursiveUrlLoader accepts the keyword arguments base_url, exclude_dirs, and link_regex, so the code is valid as written.
- File to note: src/backend/base/langflow/components/data/url.py (lines 318-322). Keep the kwargs as-is.
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json (1)
1528-1530: Code hash update acknowledged
The code_hash change reflects the updated URLComponent implementation.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
342-344: Code hash update acknowledged
This reflects the updated URLComponent code in the template.
src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json (1)
981-983: Code hash update acknowledged
Reflects the new URLComponent implementation. No issues here.
for attr_name, attr_value in tag.attrs.items():
    if attr_name.startswith("data-") and isinstance(attr_value, str):
        if any(url_indicator in attr_value.lower() for url_indicator in ["http://", "https://", "//"]):
            # This might contain a URL, but be careful not to break data attributes
            continue
        if attr_value and not attr_value.startswith(("#", "javascript:", "data:")):
            # Check if it looks like a relative path
            if "/" in attr_value or attr_value.endswith(
                (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")
            ):
                absolute_url = urljoin(base_url, attr_value)
                tag[attr_name] = absolute_url
🛠️ Refactor suggestion
Ruff SIM102: combine nested ifs for data-* attributes and harden detection
Combine the nested conditions into one and handle extensions case-insensitively. This also resolves the pipeline failure.
Apply this diff:
-                # Process data attributes that might contain URLs
-                for attr_name, attr_value in tag.attrs.items():
-                    if attr_name.startswith("data-") and isinstance(attr_value, str):
-                        if any(url_indicator in attr_value.lower() for url_indicator in ["http://", "https://", "//"]):
-                            # This might contain a URL, but be careful not to break data attributes
-                            continue
-                        if attr_value and not attr_value.startswith(("#", "javascript:", "data:")):
-                            # Check if it looks like a relative path
-                            if "/" in attr_value or attr_value.endswith(
-                                (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")
-                            ):
-                                absolute_url = urljoin(base_url, attr_value)
-                                tag[attr_name] = absolute_url
+                # Process data attributes that might contain URLs
+                for attr_name, attr_value in tag.attrs.items():
+                    if (
+                        attr_name.startswith("data-")
+                        and isinstance(attr_value, str)
+                        and attr_value
+                        and not attr_value.startswith(("#", "javascript:", "data:"))
+                        and not any(ind in attr_value.lower() for ind in ("http://", "https://", "//"))
+                        and ("/" in attr_value or attr_value.lower().endswith((".html", ".htm", ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp")))
+                    ):
+                        absolute_url = urljoin(base_url, attr_value)
+                        tag[attr_name] = absolute_url
This satisfies SIM102 and reduces false positives by normalizing extension matching.
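The flattened condition is easier to exercise when pulled into a predicate. The sketch below does that under assumptions: `should_rewrite_data_attr` and the three constant names are hypothetical, while the checks themselves mirror the suggested diff.

```python
URL_INDICATORS = ("http://", "https://", "//")
SKIP_PREFIXES = ("#", "javascript:", "data:")
PATH_EXTENSIONS = (".html", ".htm", ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp")


def should_rewrite_data_attr(attr_name: str, attr_value: object) -> bool:
    """Hypothetical predicate: True if a data-* value looks like a relative path."""
    return (
        attr_name.startswith("data-")
        and isinstance(attr_value, str)  # BeautifulSoup returns lists for multi-valued attrs
        and bool(attr_value)
        and not attr_value.startswith(SKIP_PREFIXES)
        and not any(ind in attr_value.lower() for ind in URL_INDICATORS)
        and ("/" in attr_value or attr_value.lower().endswith(PATH_EXTENSIONS))
    )


print(should_rewrite_data_attr("data-src", "images/Photo.JPG"))         # True
print(should_rewrite_data_attr("data-id", "42"))                        # False
print(should_rewrite_data_attr("data-src", "https://cdn.example/a.png"))  # False
```

One caveat worth weighing: any value containing `/` passes, so a non-URL attribute such as `data-ratio="16/9"` would also be treated as a path and rewritten.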
🤖 Prompt for AI Agents
In src/backend/base/langflow/components/data/url.py around lines 260 to 271,
combine the nested ifs that process data-* attributes into a single conditional
and make extension checks case-insensitive: ensure attr_name startswith("data-")
and isinstance(attr_value, str), reject values that already contain URL
indicators (http://, https://, //) or start with ("#", "javascript:", "data:"),
then normalize attr_value.lower() and check for "/" or endswith a tuple of
lowercased extensions (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")
before calling urljoin(base_url, attr_value) and assigning tag[attr_name] =
absolute_url; this removes nested branches and avoids false positives by using
lowercased extension matching.
except Exception as e:
    logger.warning(f"Error processing HTML links: {e}")
    return html_content
🛠️ Refactor suggestion
Narrow broad exception (Ruff BLE001) in HTML processing
Catching bare Exception triggers BLE001. Restrict to specific, expected exceptions. BeautifulSoup may raise FeatureNotFound if the parser is missing; regex may raise re.error; general type/value/attr errors are sufficient.
Apply this diff:
-        except Exception as e:
-            logger.warning(f"Error processing HTML links: {e}")
-            return html_content
+        except (FeatureNotFound, re.error, TypeError, ValueError, AttributeError) as e:
+            logger.warning(f"Error processing HTML links: {e!s}")
+            return html_content
And update imports (outside this range) to include FeatureNotFound:
from bs4 import BeautifulSoup, FeatureNotFound
If you prefer to keep the broad catch for safety, add # noqa: BLE001 to satisfy Ruff, but the narrowed set is recommended.
🤖 Prompt for AI Agents
In src/backend/base/langflow/components/data/url.py around lines 292 to 295,
replace the broad "except Exception as e" with a narrowed set of expected
exceptions (e.g. FeatureNotFound, re.error, AttributeError, TypeError,
ValueError) so you only catch parser, regex, and common type/value errors when
processing HTML links, and update the module imports to include FeatureNotFound
from bs4 (from bs4 import BeautifulSoup, FeatureNotFound); if you intentionally
want to keep a catch-all for safety, annotate the broad except with "# noqa:
BLE001" instead of leaving it unannotated.
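The fallback behaviour that the narrowed handler preserves can be sketched without BeautifulSoup. This is an illustration under assumptions, not the component's code: `process_or_fallback` is a hypothetical wrapper, the standard `logging` module stands in for loguru, and `FeatureNotFound` is left out of the tuple so the block has no third-party imports.

```python
import logging
import re
from collections.abc import Callable

logger = logging.getLogger(__name__)

# Expected failure types; the review's suggestion also includes bs4.FeatureNotFound.
EXPECTED_ERRORS = (re.error, TypeError, ValueError, AttributeError)


def process_or_fallback(html: str, transform: Callable[[str], str]) -> str:
    """Hypothetical wrapper: return transform(html), or html unchanged on expected errors."""
    try:
        return transform(html)
    except EXPECTED_ERRORS as e:
        logger.warning("Error processing HTML links: %s", e)
        return html


def failing_transform(_: str) -> str:
    raise ValueError("malformed input")


print(process_or_fallback("<a href='x'>", str.upper))          # <A HREF='X'>
print(process_or_fallback("<a href='x'>", failing_transform))  # <a href='x'>
```

The trade-off of narrowing is that an exception outside the tuple (for example a `KeyError`) now propagates to the caller instead of being logged and swallowed.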
| "value": "import re\nfrom urllib.parse import urljoin\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n - Process HTML links to convert relative URLs to absolute URLs\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/components-data#url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away 
from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. 
Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n options=[\"Text\", \"HTML\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"process_links\",\n display_name=\"Process Links\",\n info=\"If enabled and format is HTML, converts relative links to absolute URLs in the output.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", 
method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n return url\n\n def _process_html_links(self, html_content: str, base_url: str) -> str:\n \"\"\"Process HTML content and convert relative links to absolute URLs.\n\n Args:\n html_content: The raw HTML content\n base_url: The base URL to resolve relative links against\n\n Returns:\n str: HTML content with relative links converted to absolute URLs\n \"\"\"\n if not html_content or not base_url:\n return html_content\n\n try:\n soup = BeautifulSoup(html_content, \"lxml\")\n\n # Process various types of links and resources\n for tag in soup.find_all(\n [\"a\", \"img\", \"link\", \"script\", \"iframe\", \"form\", \"video\", \"audio\", \"source\", \"track\"]\n ):\n # Process href attributes\n if tag.has_attr(\"href\"):\n href = tag[\"href\"]\n if href and not href.startswith(\n (\"http://\", \"https://\", \"mailto:\", \"tel:\", \"#\", \"javascript:\", \"data:\")\n ):\n absolute_url = urljoin(base_url, href)\n tag[\"href\"] = absolute_url\n\n # Process src attributes\n if tag.has_attr(\"src\"):\n src = tag[\"src\"]\n if src and not src.startswith((\"http://\", \"https://\", \"data:\", \"#\", \"javascript:\")):\n absolute_url = 
urljoin(base_url, src)\n tag[\"src\"] = absolute_url\n\n # Process action attributes (for forms)\n if tag.has_attr(\"action\"):\n action = tag[\"action\"]\n if action and not action.startswith((\"http://\", \"https://\")):\n absolute_url = urljoin(base_url, action)\n tag[\"action\"] = absolute_url\n\n # Process data attributes that might contain URLs\n for attr_name, attr_value in tag.attrs.items():\n if attr_name.startswith(\"data-\") and isinstance(attr_value, str):\n if any(url_indicator in attr_value.lower() for url_indicator in [\"http://\", \"https://\", \"//\"]):\n # This might contain a URL, but be careful not to break data attributes\n continue\n if attr_value and not attr_value.startswith((\"#\", \"javascript:\", \"data:\")):\n # Check if it looks like a relative path\n if \"/\" in attr_value or attr_value.endswith(\n (\".html\", \".htm\", \".css\", \".js\", \".jpg\", \".png\", \".gif\")\n ):\n absolute_url = urljoin(base_url, attr_value)\n tag[attr_name] = absolute_url\n\n # Process CSS content for url() references\n for style_tag in soup.find_all(\"style\"):\n if style_tag.string:\n # Simple regex to find url() references in CSS\n import re\n\n css_content = style_tag.string\n url_pattern = r'url\\([\\'\"]?([^\\'\"]+)[\\'\"]?\\)'\n\n def replace_url(match):\n url = match.group(1)\n if url and not url.startswith((\"http://\", \"https://\", \"data:\", \"#\")):\n absolute_url = urljoin(base_url, url)\n return f'url(\"{absolute_url}\")'\n return match.group(0)\n\n style_tag.string = re.sub(url_pattern, replace_url, css_content)\n\n return str(soup)\n except Exception as e:\n logger.warning(f\"Error processing HTML links: {e}\")\n return html_content\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in 
self.headers}\n extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n # Validate that process_links is only used with HTML format\n if self.process_links and self.format != \"HTML\":\n logger.warning(\"process_links is only effective when format is set to 'HTML'\")\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, **doc.metadata) 
for doc in all_docs]\n data = []\n for doc in all_docs:\n content = doc.page_content\n source_url = doc.metadata.get(\"source\", \"\")\n\n # Process HTML links if format is HTML and process_links is enabled\n if self.format == \"HTML\" and self.process_links and source_url:\n content = self._process_html_links(content, source_url)\n\n data.append(\n {\n \"text\": safe_convert(content, clean_data=True),\n \"url\": source_url,\n \"title\": doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n )\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n" | ||
| }, |
🛠️ Refactor suggestion
Mirror linter and robustness fixes in the embedded URLComponent code
As with the main module, please:
- Replace the nested `if` statements in the `data-*` processing loop (SIM102).
- Narrow the broad `except` in `_process_html_links` (BLE001), or annotate it with `noqa`.
- Use a precompiled CSS URL regex and remove the inner `import re`.
Happy to provide a ready-to-paste JSON-safe code string with these updates.
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json
around lines 1072-1073, update the embedded URLComponent code to (1) move the
CSS URL regex out of _process_html_links and precompile it at module scope
(remove the inner import of re), (2) replace the nested ifs inside the data-*
attributes loop with a flattened conditional (combine checks into a single if
that verifies attr_name startswith("data-"), attr_value is str, attr_value not
starting with ("#", "javascript:", "data:"), and that it looks like a relative
path before rewriting), and (3) tighten the broad exception handler in
_process_html_links by catching specific exception types (e.g., AttributeError,
TypeError, ValueError, requests-related exceptions if any) or add a precise noqa
comment only if narrowing isn't feasible; make sure logging still records the
exception message.
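A minimal sketch of what the three changes could look like, written against the standard library only so it can be tried outside the component. The helper names (`rewrite_data_attrs`, `absolutize_css_urls`, `process_safely`) and the constant names are hypothetical and do not appear in the PR; `logging` stands in for loguru, and the exception tuple is only the example list from the prompt above, not a verified set of what BeautifulSoup raises. The regex also excludes `)` from the captured group, which is a small tightening beyond what the review asks for.

```python
import logging
import re
from urllib.parse import urljoin

logger = logging.getLogger(__name__)  # stand-in for loguru's logger (assumption)

# (1) Compiled once at module scope instead of inside _process_html_links.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")

SKIP_PREFIXES = ("#", "javascript:", "data:")
ABSOLUTE_MARKERS = ("http://", "https://", "//")
PATH_SUFFIXES = (".html", ".htm", ".css", ".js", ".jpg", ".png", ".gif")


def rewrite_data_attrs(attrs: dict, base_url: str) -> dict:
    """Return a copy of attrs with relative data-* paths made absolute."""
    result = dict(attrs)
    for name, value in attrs.items():
        # (2) One flattened condition replaces the nested ifs (SIM102).
        if (
            name.startswith("data-")
            and isinstance(value, str)
            and value
            and not any(marker in value.lower() for marker in ABSOLUTE_MARKERS)
            and not value.startswith(SKIP_PREFIXES)
            and ("/" in value or value.endswith(PATH_SUFFIXES))
        ):
            result[name] = urljoin(base_url, value)
    return result


def absolutize_css_urls(css: str, base_url: str) -> str:
    """Rewrite relative url() references in a CSS string."""

    def _replace(match: re.Match) -> str:
        url = match.group(1).strip()
        if url.startswith(("http://", "https://", "data:", "#")):
            return match.group(0)  # already absolute or not a path
        return f'url("{urljoin(base_url, url)}")'

    return CSS_URL_PATTERN.sub(_replace, css)


def process_safely(html: str, base_url: str, processor) -> str:
    """Run processor and fall back to the original HTML on known errors."""
    try:
        return processor(html, base_url)
    # (3) Specific exception types instead of a bare `except Exception` (BLE001).
    except (AttributeError, TypeError, ValueError) as exc:
        logger.warning("Error processing HTML links: %s", exc)
        return html
```

Inside the component, the same flattened condition would sit in the `tag.attrs.items()` loop and the `try` body would be the existing BeautifulSoup pass; the sketch separates them only to keep each change visible on its own.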
| "type": "code", | ||
| "value": "import re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/components-data#url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 
2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. 
Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n options=[\"Text\", \"HTML\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if 
the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n return url\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers}\n extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in 
self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n data = [\n {\n \"text\": safe_convert(doc.page_content, clean_data=True),\n \"url\": doc.metadata.get(\"source\", \"\"),\n \"title\": doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n for doc in all_docs\n ]\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n" | ||
| "value": "import re\nfrom urllib.parse import urljoin\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom loguru import logger\n\nfrom langflow.custom.custom_component.component import Component\nfrom langflow.field_typing.range_spec import RangeSpec\nfrom langflow.helpers.data import safe_convert\nfrom langflow.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\nfrom langflow.services.deps import get_settings_service\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n re.IGNORECASE,\n)\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n - Process HTML links to convert relative URLs to absolute URLs\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/components-data#url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away 
from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. 
Use 'Text' to extract the text from the HTML or 'HTML' for the raw HTML content.\",\n options=[\"Text\", \"HTML\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": get_settings_service().settings.user_agent}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"process_links\",\n display_name=\"Process Links\",\n info=\"If enabled and format is HTML, converts relative links to absolute URLs in the output.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", 
method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n return url\n\n def _process_html_links(self, html_content: str, base_url: str) -> str:\n \"\"\"Process HTML content and convert relative links to absolute URLs.\n\n Args:\n html_content: The raw HTML content\n base_url: The base URL to resolve relative links against\n\n Returns:\n str: HTML content with relative links converted to absolute URLs\n \"\"\"\n if not html_content or not base_url:\n return html_content\n\n try:\n soup = BeautifulSoup(html_content, \"lxml\")\n\n # Process various types of links and resources\n for tag in soup.find_all(\n [\"a\", \"img\", \"link\", \"script\", \"iframe\", \"form\", \"video\", \"audio\", \"source\", \"track\"]\n ):\n # Process href attributes\n if tag.has_attr(\"href\"):\n href = tag[\"href\"]\n if href and not href.startswith(\n (\"http://\", \"https://\", \"mailto:\", \"tel:\", \"#\", \"javascript:\", \"data:\")\n ):\n absolute_url = urljoin(base_url, href)\n tag[\"href\"] = absolute_url\n\n # Process src attributes\n if tag.has_attr(\"src\"):\n src = tag[\"src\"]\n if src and not src.startswith((\"http://\", \"https://\", \"data:\", \"#\", \"javascript:\")):\n absolute_url = 
urljoin(base_url, src)\n tag[\"src\"] = absolute_url\n\n # Process action attributes (for forms)\n if tag.has_attr(\"action\"):\n action = tag[\"action\"]\n if action and not action.startswith((\"http://\", \"https://\")):\n absolute_url = urljoin(base_url, action)\n tag[\"action\"] = absolute_url\n\n # Process data attributes that might contain URLs\n for attr_name, attr_value in tag.attrs.items():\n if attr_name.startswith(\"data-\") and isinstance(attr_value, str):\n if any(url_indicator in attr_value.lower() for url_indicator in [\"http://\", \"https://\", \"//\"]):\n # This might contain a URL, but be careful not to break data attributes\n continue\n if attr_value and not attr_value.startswith((\"#\", \"javascript:\", \"data:\")):\n # Check if it looks like a relative path\n if \"/\" in attr_value or attr_value.endswith(\n (\".html\", \".htm\", \".css\", \".js\", \".jpg\", \".png\", \".gif\")\n ):\n absolute_url = urljoin(base_url, attr_value)\n tag[attr_name] = absolute_url\n\n # Process CSS content for url() references\n for style_tag in soup.find_all(\"style\"):\n if style_tag.string:\n # Simple regex to find url() references in CSS\n import re\n\n css_content = style_tag.string\n url_pattern = r'url\\([\\'\"]?([^\\'\"]+)[\\'\"]?\\)'\n\n def replace_url(match):\n url = match.group(1)\n if url and not url.startswith((\"http://\", \"https://\", \"data:\", \"#\")):\n absolute_url = urljoin(base_url, url)\n return f'url(\"{absolute_url}\")'\n return match.group(0)\n\n style_tag.string = re.sub(url_pattern, replace_url, css_content)\n\n return str(soup)\n except Exception as e:\n logger.warning(f\"Error processing HTML links: {e}\")\n return html_content\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in 
self.headers}\n extractor = (lambda x: x) if self.format == \"HTML\" else (lambda x: BeautifulSoup(x, \"lxml\").get_text())\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n # Validate that process_links is only used with HTML format\n if self.process_links and self.format != \"HTML\":\n logger.warning(\"process_links is only effective when format is set to 'HTML'\")\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, **doc.metadata) 
for doc in all_docs]\n data = []\n for doc in all_docs:\n content = doc.page_content\n source_url = doc.metadata.get(\"source\", \"\")\n\n # Process HTML links if format is HTML and process_links is enabled\n if self.format == \"HTML\" and self.process_links and source_url:\n content = self._process_html_links(content, source_url)\n\n data.append(\n {\n \"text\": safe_convert(content, clean_data=True),\n \"url\": source_url,\n \"title\": doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n )\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n" | ||
| }, |
🛠️ Refactor suggestion
Sync URLComponent snippet with module linter fixes
Please replicate the same adjustments as suggested for the Python module:
- Collapse the nested `data-*` attribute conditionals (SIM102).
- Narrow the broad `except` in `_process_html_links` (BLE001), or add `noqa`.
- Use a precompiled CSS URL regex and remove the inner `import re`.
I can help generate the escaped JSON value for a drop-in replacement.
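For reference, one way to produce the escaped value is to let `json.dumps` do the quoting rather than escaping by hand. This is only a sketch: the function name `escape_for_json_value` and the sample source string are hypothetical, and it assumes the starter-project files accept the default `json.dumps` escaping.

```python
import json


def escape_for_json_value(python_source: str) -> str:
    """Return the source as a JSON string literal, surrounding quotes included."""
    return json.dumps(python_source)


# Hypothetical two-line fragment standing in for the full component source.
source = 'logger.debug(f"Loading documents from {url}")\nreturn data\n'
escaped = escape_for_json_value(source)

# Round-trip check: decoding the literal must give back the exact source.
assert json.loads(escaped) == source
print(escaped)
```

The printed literal can be pasted after `"value": ` in the embedded component entry; the round-trip assertion guards against a hand-edited escape silently changing the code.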

