fix: add SSRF protection to URL component (PVR0699081) #11996
Walkthrough
This PR adds SSRF (Server-Side Request Forgery) protection to URL handling across the codebase. The URLComponent in starter projects, component indexes, and the core implementation is updated to validate URLs against SSRF protections, raising errors when a URL is blocked. Hash values are updated to reflect the functional changes.
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 3
❌ Failed checks (3 warnings)
✅ Passed checks (4 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (48.79%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## release-1.9.0 #11996 +/- ##
=================================================
- Coverage 49.48% 49.25% -0.24%
=================================================
Files 1929 1929
Lines 171262 171189 -73
Branches 25038 23735 -1303
=================================================
- Hits 84754 84322 -432
- Misses 85481 85840 +359
Partials 1027 1027
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lfx/src/lfx/components/data_source/url.py (1)
240-247: Avoid pinning SSRF validation to permanent warn-only mode.
Line 243 hard-codes `warn_only=True`, which makes enforcement depend on a future manual edit. Calling with the utility default keeps current behavior today and automatically follows the planned default flip in v2.0.
♻️ Proposed refactor
-        # TODO: In next major version (2.0), remove warn_only=True to enforce blocking
         try:
-            validate_url_for_ssrf(url, warn_only=True)
+            validate_url_for_ssrf(url)
         except SSRFProtectionError as e:
-            # This will only raise if SSRF protection is enabled and warn_only=False
+            # Raised when SSRF protection is in enforcement mode
             msg = f"SSRF Protection: {e}"
             raise ValueError(msg) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lfx/src/lfx/components/data_source/url.py` around lines 240 - 247, The code pins SSRF validation to warn-only by calling validate_url_for_ssrf(url, warn_only=True); remove the hard-coded warn_only=True so the call uses the utility's default behavior (i.e., call validate_url_for_ssrf(url)), preserving the existing except SSRFProtectionError as e block and re-raising ValueError(msg) from e; this ensures enforcement follows the utility's future default without changing the current exception handling around validate_url_for_ssrf and SSRFProtectionError.
src/backend/tests/unit/components/data_source/test_url_component.py (1)
283-337: Reduce mock-only SSRF assertions in the new tests.
Most new cases patch `validate_url_for_ssrf` directly, so they mostly verify mocked behavior rather than actual SSRF rule evaluation. Consider keeping at least one path that exercises the real validator (mocking only settings) to prove localhost/private/metadata blocking behavior end-to-end.
As per coding guidelines `**/test_*.py`: “Ensure mocks are used appropriately for external dependencies only, not for core logic.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/tests/unit/components/data_source/test_url_component.py` around lines 283 - 337, The new tests over-mock the core SSRF logic by patching validate_url_for_ssrf everywhere; update tests to include at least one end-to-end case that does not patch validate_url_for_ssrf so the real validator runs (use the existing mock_ssrf_settings fixture to enable protection), e.g., in test_ssrf_blocks_private_ip_when_enabled or add a new test that calls URLComponent.ensure_url("http://127.0.0.1:8080") and asserts a ValueError/SSRFProtectionError, while limiting mocks to external deps only; ensure references to validate_url_for_ssrf, URLComponent.ensure_url, and mock_ssrf_settings are used so the validator itself is exercised.
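The concern about mock-only assertions can be illustrated with a toy sketch. Everything here is hypothetical and stands in for the real code: `Guard.validate` plays the role of `validate_url_for_ssrf`, `BlockedError` the role of `SSRFProtectionError`, and `ensure_host` the role of `URLComponent.ensure_url`. The validator is deliberately incomplete (it only rejects loopback addresses), which a mocked check cannot detect but an unmocked one can.

```python
import ipaddress
from unittest.mock import patch


class BlockedError(Exception):
    """Hypothetical stand-in for SSRFProtectionError."""


class Guard:
    """Hypothetical holder so the validator can be patched in a test."""

    @staticmethod
    def validate(host: str) -> None:
        # Deliberately incomplete rule: only loopback addresses are rejected.
        if ipaddress.ip_address(host).is_loopback:
            raise BlockedError(f"Access to IP address {host} is blocked")


def ensure_host(host: str) -> str:
    """Wrap validator errors in ValueError, mirroring the pattern in the PR."""
    try:
        Guard.validate(host)
    except BlockedError as e:
        raise ValueError(f"SSRF Protection: {e}") from e
    return host


def mocked_check_passes() -> bool:
    # Mock-only: shows the wrapper re-raises, but says nothing about the rules.
    with patch.object(Guard, "validate", side_effect=BlockedError("blocked")):
        try:
            ensure_host("192.168.1.1")
        except ValueError:
            return True
    return False


def unmocked_check_passes() -> bool:
    # The real validator runs, so the missing private-range rule is exposed.
    try:
        ensure_host("192.168.1.1")
    except ValueError:
        return True
    return False
```

In this sketch the mocked check succeeds while the unmocked one fails for `192.168.1.1`, which is the kind of gap an end-to-end case is meant to catch.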
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/tests/unit/components/data_source/test_url_component.py`:
- Around line 279-331: The tests are failing Ruff ARG002 because the injected
fixture mock_ssrf_settings is never used in several tests; to fix, use the
fixture in each affected test (e.g., test_ssrf_blocks_localhost_when_enabled,
test_ssrf_blocks_private_ip_when_enabled,
test_ssrf_blocks_metadata_endpoint_when_enabled, test_ssrf_allows_public_urls,
test_ssrf_protection_in_fetch_content) by adding a no-op reference at the start
of the test body like "_ = mock_ssrf_settings" (or rename the parameter to
"_mock_ssrf_settings") so the fixture is considered used and the linter error is
resolved.
In `@src/lfx/src/lfx/_assets/component_index.json`:
- Line 57210: The SSRF check in ensure_url currently calls
validate_url_for_ssrf(url, warn_only=True) unconditionally; make warn_only
configurable by adding a BoolInput (e.g., name="ssrf_warn_only",
display_name="SSRF Warn Only", value=True, advanced=True) to the component's
inputs (so default remains True for compatibility) and replace the hardcoded
call with validate_url_for_ssrf(url, warn_only=self.ssrf_warn_only) inside
ensure_url; ensure the BoolInput name matches the attribute used so operators
can toggle strict blocking without a major release.
📒 Files selected for processing (7)
src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json
src/backend/tests/unit/components/data_source/test_url_component.py
src/lfx/src/lfx/_assets/component_index.json
src/lfx/src/lfx/_assets/stable_hash_history.json
src/lfx/src/lfx/components/data_source/url.py
    def test_ssrf_blocks_localhost_when_enabled(self, mock_ssrf_settings):
        """Test that localhost is blocked when SSRF protection is enabled."""
        component = URLComponent()

        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
            mock_validate.side_effect = SSRFProtectionError("Access to IP address 127.0.0.1 is blocked")

            with pytest.raises(ValueError, match="SSRF Protection"):
                component.ensure_url("http://127.0.0.1:8080")

    def test_ssrf_blocks_private_ip_when_enabled(self, mock_ssrf_settings):
        """Test that private IPs are blocked when SSRF protection is enabled."""
        component = URLComponent()

        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
            mock_validate.side_effect = SSRFProtectionError("Access to IP address 192.168.1.1 is blocked")

            with pytest.raises(ValueError, match="SSRF Protection"):
                component.ensure_url("http://192.168.1.1/admin")

    def test_ssrf_blocks_metadata_endpoint_when_enabled(self, mock_ssrf_settings):
        """Test that cloud metadata endpoints are blocked when SSRF protection is enabled."""
        component = URLComponent()

        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
            mock_validate.side_effect = SSRFProtectionError("Access to IP address 169.254.169.254 is blocked")

            with pytest.raises(ValueError, match="SSRF Protection"):
                component.ensure_url("http://169.254.169.254/latest/meta-data/")

    def test_ssrf_allows_public_urls(self, mock_ssrf_settings):
        """Test that public URLs are allowed."""
        component = URLComponent()

        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
            # No exception means it's allowed
            mock_validate.return_value = None

            url = component.ensure_url("https://www.google.com")
            assert url == "https://www.google.com"
            mock_validate.assert_called_once()

    def test_ssrf_warn_only_mode(self):
        """Test that warn_only=True is passed to validation."""
        component = URLComponent()

        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
            component.ensure_url("https://example.com")

            # Verify warn_only=True is passed (current behavior for backwards compatibility)
            mock_validate.assert_called_with("https://example.com", warn_only=True)

    def test_ssrf_protection_in_fetch_content(self, mock_ssrf_settings):
Fix unused fixture arguments to unblock Ruff CI.
mock_ssrf_settings is injected but never referenced in five test methods, which is failing lint with ARG002.
Minimal lint-safe fix
- def test_ssrf_blocks_localhost_when_enabled(self, mock_ssrf_settings):
+ def test_ssrf_blocks_localhost_when_enabled(self, _mock_ssrf_settings):
...
- def test_ssrf_blocks_private_ip_when_enabled(self, mock_ssrf_settings):
+ def test_ssrf_blocks_private_ip_when_enabled(self, _mock_ssrf_settings):
...
- def test_ssrf_blocks_metadata_endpoint_when_enabled(self, mock_ssrf_settings):
+ def test_ssrf_blocks_metadata_endpoint_when_enabled(self, _mock_ssrf_settings):
...
- def test_ssrf_allows_public_urls(self, mock_ssrf_settings):
+ def test_ssrf_allows_public_urls(self, _mock_ssrf_settings):
...
- def test_ssrf_protection_in_fetch_content(self, mock_ssrf_settings):
+ def test_ssrf_protection_in_fetch_content(self, _mock_ssrf_settings):
🧰 Tools
🪛 GitHub Actions: Ruff Style Check
[error] 279-279: Command 'uv run --only-dev ruff check --output-format=github .' failed: ARG002 Unused method argument: mock_ssrf_settings.
🪛 GitHub Check: Ruff Style Check (3.13)
[failure] 331-331: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:331:53: ARG002 Unused method argument: mock_ssrf_settings
[failure] 309-309: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:309:44: ARG002 Unused method argument: mock_ssrf_settings
[failure] 299-299: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:299:63: ARG002 Unused method argument: mock_ssrf_settings
[failure] 289-289: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:289:56: ARG002 Unused method argument: mock_ssrf_settings
[failure] 279-279: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:279:55: ARG002 Unused method argument: mock_ssrf_settings
| "title_case": false, | ||
| "type": "code", | ||
| "value": "import importlib\nimport io\nimport re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom markitdown import MarkItDown\n\nfrom lfx.custom.custom_component.component import Component\nfrom lfx.field_typing.range_spec import RangeSpec\nfrom lfx.helpers.data import safe_convert\nfrom lfx.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom lfx.log.logger import logger\nfrom lfx.schema.dataframe import DataFrame\nfrom lfx.schema.message import Message\nfrom lfx.utils.request_utils import get_user_agent\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\n\n\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n re.IGNORECASE,\n)\n\nUSER_AGENT = None\n# Check if langflow is installed using importlib.util.find_spec(name))\nif importlib.util.find_spec(\"langflow\"):\n langflow_installed = True\n USER_AGENT = get_user_agent()\nelse:\n langflow_installed = False\n USER_AGENT = \"lfx\"\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n 
placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=(\n \"Output Format. 
Use 'Text' to extract the text from the HTML, \"\n \"'Markdown' to parse the HTML into Markdown format, or 'HTML' \"\n \"for the raw HTML content.\"\n ),\n options=[\"Text\", \"HTML\", \"Markdown\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": USER_AGENT}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def 
_html_extractor(x: str) -> str:\n \"\"\"Extract raw HTML content.\"\"\"\n return x\n\n @staticmethod\n def _text_extractor(x: str) -> str:\n \"\"\"Extract clean text from HTML.\"\"\"\n return BeautifulSoup(x, \"lxml\").get_text()\n\n @staticmethod\n def _markdown_extractor(x: str) -> str:\n \"\"\"Convert HTML to Markdown format.\"\"\"\n stream = io.BytesIO(x.encode(\"utf-8\"))\n result = MarkItDown(enable_plugins=False).convert_stream(stream)\n return result.markdown\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n return url\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers if header[\"value\"] is not None}\n extractors = {\n \"HTML\": self._html_extractor,\n \"Markdown\": self._markdown_extractor,\n \"Text\": self._text_extractor,\n }\n extractor = extractors.get(self.format, self._text_extractor)\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n 
check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n data = [\n {\n \"text\": safe_convert(doc.page_content, clean_data=True),\n \"url\": doc.metadata.get(\"source\", \"\"),\n \"title\": doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n for doc in all_docs\n ]\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n 
logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n" | ||
| "value": "import importlib\nimport io\nimport re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom markitdown import MarkItDown\n\nfrom lfx.custom.custom_component.component import Component\nfrom lfx.field_typing.range_spec import RangeSpec\nfrom lfx.helpers.data import safe_convert\nfrom lfx.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom lfx.log.logger import logger\nfrom lfx.schema.dataframe import DataFrame\nfrom lfx.schema.message import Message\nfrom lfx.utils.request_utils import get_user_agent\nfrom lfx.utils.ssrf_protection import SSRFProtectionError, validate_url_for_ssrf\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\n\n\nURL_REGEX = re.compile(\n r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n re.IGNORECASE,\n)\n\nUSER_AGENT = None\n# Check if langflow is installed using importlib.util.find_spec(name))\nif importlib.util.find_spec(\"langflow\"):\n langflow_installed = True\n USER_AGENT = get_user_agent()\nelse:\n langflow_installed = False\n USER_AGENT = \"lfx\"\n\n\nclass URLComponent(Component):\n \"\"\"A component that loads and parses content from web pages recursively.\n\n This component allows fetching content from one or more URLs, with options to:\n - Control crawl depth\n - Prevent crawling outside the root domain\n - Use async loading for better performance\n - Extract either raw HTML or clean text\n - Configure request headers and timeouts\n \"\"\"\n\n display_name = \"URL\"\n description = \"Fetch content from one or more web pages, following links recursively.\"\n documentation: str = \"https://docs.langflow.org/url\"\n icon = \"layout-template\"\n name = \"URLComponent\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs to crawl recursively, 
by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n placeholder=\"Enter a URL...\",\n list_add_label=\"Add URL\",\n input_types=[],\n ),\n SliderInput(\n name=\"max_depth\",\n display_name=\"Depth\",\n info=(\n \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n \"- depth 1: only the initial page\\n\"\n \"- depth 2: initial page + all pages linked directly from it\\n\"\n \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n \"Note: This is about link traversal, not URL path depth.\"\n ),\n value=DEFAULT_MAX_DEPTH,\n range_spec=RangeSpec(min=1, max=5, step=1),\n required=False,\n min_label=\" \",\n max_label=\" \",\n min_label_icon=\"None\",\n max_label_icon=\"None\",\n # slider_input=True\n ),\n BoolInput(\n name=\"prevent_outside\",\n display_name=\"Prevent Outside\",\n info=(\n \"If enabled, only crawls URLs within the same domain as the root URL. \"\n \"This helps prevent the crawler from going to external websites.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"use_async\",\n display_name=\"Use Async\",\n info=(\n \"If enabled, uses asynchronous loading which can be significantly faster \"\n \"but might use more system resources.\"\n ),\n value=True,\n required=False,\n advanced=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=(\n \"Output Format. 
Use 'Text' to extract the text from the HTML, \"\n \"'Markdown' to parse the HTML into Markdown format, or 'HTML' \"\n \"for the raw HTML content.\"\n ),\n options=[\"Text\", \"HTML\", \"Markdown\"],\n value=DEFAULT_FORMAT,\n advanced=True,\n ),\n IntInput(\n name=\"timeout\",\n display_name=\"Timeout\",\n info=\"Timeout for the request in seconds.\",\n value=DEFAULT_TIMEOUT,\n required=False,\n advanced=True,\n ),\n TableInput(\n name=\"headers\",\n display_name=\"Headers\",\n info=\"The headers to send with the request\",\n table_schema=[\n {\n \"name\": \"key\",\n \"display_name\": \"Header\",\n \"type\": \"str\",\n \"description\": \"Header name\",\n },\n {\n \"name\": \"value\",\n \"display_name\": \"Value\",\n \"type\": \"str\",\n \"description\": \"Header value\",\n },\n ],\n value=[{\"key\": \"User-Agent\", \"value\": USER_AGENT}],\n advanced=True,\n input_types=[\"DataFrame\"],\n ),\n BoolInput(\n name=\"filter_text_html\",\n display_name=\"Filter Text/HTML\",\n info=\"If enabled, filters out text/css content type from the results.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"continue_on_failure\",\n display_name=\"Continue on Failure\",\n info=\"If enabled, continues crawling even if some requests fail.\",\n value=True,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"check_response_status\",\n display_name=\"Check Response Status\",\n info=\"If enabled, checks the response status of the request.\",\n value=False,\n required=False,\n advanced=True,\n ),\n BoolInput(\n name=\"autoset_encoding\",\n display_name=\"Autoset Encoding\",\n info=\"If enabled, automatically sets the encoding of the request.\",\n value=True,\n required=False,\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n ]\n\n @staticmethod\n def 
_html_extractor(x: str) -> str:\n \"\"\"Extract raw HTML content.\"\"\"\n return x\n\n @staticmethod\n def _text_extractor(x: str) -> str:\n \"\"\"Extract clean text from HTML.\"\"\"\n return BeautifulSoup(x, \"lxml\").get_text()\n\n @staticmethod\n def _markdown_extractor(x: str) -> str:\n \"\"\"Convert HTML to Markdown format.\"\"\"\n stream = io.BytesIO(x.encode(\"utf-8\"))\n result = MarkItDown(enable_plugins=False).convert_stream(stream)\n return result.markdown\n\n @staticmethod\n def validate_url(url: str) -> bool:\n \"\"\"Validates if the given string matches URL pattern.\n\n Args:\n url: The URL string to validate\n\n Returns:\n bool: True if the URL is valid, False otherwise\n \"\"\"\n return bool(URL_REGEX.match(url))\n\n def ensure_url(self, url: str) -> str:\n \"\"\"Ensures the given string is a valid URL.\n\n Args:\n url: The URL string to validate and normalize\n\n Returns:\n str: The normalized URL\n\n Raises:\n ValueError: If the URL is invalid or blocked by SSRF protection\n \"\"\"\n url = url.strip()\n if not url.startswith((\"http://\", \"https://\")):\n url = \"https://\" + url\n\n if not self.validate_url(url):\n msg = f\"Invalid URL: {url}\"\n raise ValueError(msg)\n\n # SSRF Protection: Validate URL to prevent access to internal resources\n # TODO: In next major version (2.0), remove warn_only=True to enforce blocking\n try:\n validate_url_for_ssrf(url, warn_only=True)\n except SSRFProtectionError as e:\n # This will only raise if SSRF protection is enabled and warn_only=False\n msg = f\"SSRF Protection: {e}\"\n raise ValueError(msg) from e\n\n return url\n\n def _create_loader(self, url: str) -> RecursiveUrlLoader:\n \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n Args:\n url: The URL to load\n\n Returns:\n RecursiveUrlLoader: Configured loader instance\n \"\"\"\n headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers if header[\"value\"] is not None}\n extractors = {\n \"HTML\": 
self._html_extractor,\n \"Markdown\": self._markdown_extractor,\n \"Text\": self._text_extractor,\n }\n extractor = extractors.get(self.format, self._text_extractor)\n\n return RecursiveUrlLoader(\n url=url,\n max_depth=self.max_depth,\n prevent_outside=self.prevent_outside,\n use_async=self.use_async,\n extractor=extractor,\n timeout=self.timeout,\n headers=headers_dict,\n check_response_status=self.check_response_status,\n continue_on_failure=self.continue_on_failure,\n base_url=url, # Add base_url to ensure consistent domain crawling\n autoset_encoding=self.autoset_encoding, # Enable automatic encoding detection\n exclude_dirs=[], # Allow customization of excluded directories\n link_regex=None, # Allow customization of link filtering\n )\n\n def fetch_url_contents(self) -> list[dict]:\n \"\"\"Load documents from the configured URLs.\n\n Returns:\n List[Data]: List of Data objects containing the fetched content\n\n Raises:\n ValueError: If no valid URLs are provided or if there's an error loading documents\n \"\"\"\n try:\n urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n logger.debug(f\"URLs: {urls}\")\n if not urls:\n msg = \"No valid URLs provided.\"\n raise ValueError(msg)\n\n all_docs = []\n for url in urls:\n logger.debug(f\"Loading documents from {url}\")\n\n try:\n loader = self._create_loader(url)\n docs = loader.load()\n\n if not docs:\n logger.warning(f\"No documents found for {url}\")\n continue\n\n logger.debug(f\"Found {len(docs)} documents from {url}\")\n all_docs.extend(docs)\n\n except requests.exceptions.RequestException as e:\n logger.exception(f\"Error loading documents from {url}: {e}\")\n continue\n\n if not all_docs:\n msg = \"No documents were successfully loaded from any URL\"\n raise ValueError(msg)\n\n # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n data = [\n {\n \"text\": safe_convert(doc.page_content, clean_data=True),\n \"url\": doc.metadata.get(\"source\", \"\"),\n \"title\": 
doc.metadata.get(\"title\", \"\"),\n \"description\": doc.metadata.get(\"description\", \"\"),\n \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n \"language\": doc.metadata.get(\"language\", \"\"),\n }\n for doc in all_docs\n ]\n except Exception as e:\n error_msg = e.message if hasattr(e, \"message\") else e\n msg = f\"Error loading documents: {error_msg!s}\"\n logger.exception(msg)\n raise ValueError(msg) from e\n return data\n\n def fetch_content(self) -> DataFrame:\n \"\"\"Convert the documents to a DataFrame.\"\"\"\n return DataFrame(data=self.fetch_url_contents())\n\n def fetch_content_as_message(self) -> Message:\n \"\"\"Convert the documents to a Message.\"\"\"\n url_contents = self.fetch_url_contents()\n return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n" |
Hardcoded warn-only mode prevents strict SSRF enforcement.
At Line 57210, validate_url_for_ssrf(url, warn_only=True) is fixed to warning mode, so this component path cannot block internal URLs even when stricter SSRF enforcement is desired.
Please make warn_only configurable (defaulting to True for compatibility) so operators can enable blocking without waiting for a major-version code change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lfx/src/lfx/_assets/component_index.json` at line 57210, The SSRF check
in ensure_url currently calls validate_url_for_ssrf(url, warn_only=True)
unconditionally; make warn_only configurable by adding a BoolInput (e.g.,
name="ssrf_warn_only", display_name="SSRF Warn Only", value=True, advanced=True)
to the component's inputs (so default remains True for compatibility) and
replace the hardcoded call with validate_url_for_ssrf(url,
warn_only=self.ssrf_warn_only) inside ensure_url; ensure the BoolInput name
matches the attribute used so operators can toggle strict blocking without a
major release.
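A minimal sketch of the suggestion above, written as plain Python so it runs outside Langflow. Only `validate_url_for_ssrf`, `SSRFProtectionError`, `warn_only`, and the proposed `ssrf_warn_only` name come from this thread; the bodies of the two stand-ins, the tiny host set, and the `UrlNormalizer` class are hypothetical and are not the project's implementation.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the project's exception type.
class SSRFProtectionError(Exception):
    pass

# Hypothetical stand-in for the project's validator. It flags a tiny,
# hard-coded set of hosts purely so the toggle below can be exercised.
_DEMO_INTERNAL_HOSTS = {"localhost", "127.0.0.1", "169.254.169.254"}

def validate_url_for_ssrf(url: str, *, warn_only: bool = True) -> None:
    host = urlparse(url).hostname or ""
    if host not in _DEMO_INTERNAL_HOSTS:
        return
    if warn_only:
        # Warning mode: log and let the request proceed.
        logger.warning("SSRF warning: %s targets an internal host", url)
        return
    # Blocking mode: refuse the URL.
    raise SSRFProtectionError(f"{host} is an internal host")

class UrlNormalizer:
    """Hypothetical holder for the setting a BoolInput would expose."""

    def __init__(self, ssrf_warn_only: bool = True) -> None:
        # Defaults to True so existing flows keep their current behavior.
        self.ssrf_warn_only = ssrf_warn_only

    def ensure_url(self, url: str) -> str:
        url = url.strip()
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        try:
            validate_url_for_ssrf(url, warn_only=self.ssrf_warn_only)
        except SSRFProtectionError as e:
            msg = f"SSRF Protection: {e}"
            raise ValueError(msg) from e
        return url
```

With the default, `UrlNormalizer().ensure_url("127.0.0.1:8080")` only logs a warning and returns the normalized URL; with `ssrf_warn_only=False` the same call raises `ValueError`.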
Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints. The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).
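For illustration only, here is a rough sketch of the kind of check this description refers to, using Python's standard `ipaddress` module. The function name `is_internal_target` is hypothetical, and this is not the logic inside `validate_url_for_ssrf`; in particular it only inspects IP-literal hosts and the name `localhost`, and does not resolve DNS, which a real check would likely need to do.

```python
import ipaddress
from urllib.parse import urlparse

def is_internal_target(url: str) -> bool:
    """Return True if the URL's host is localhost or a non-public IP literal."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (a regular hostname); this sketch does not
        # resolve it, so it is treated as external.
        return False
    # Loopback (127.0.0.0/8, ::1), private ranges (e.g. 10.0.0.0/8),
    # and link-local (169.254.0.0/16, used by cloud metadata endpoints).
    return ip.is_loopback or ip.is_private or ip.is_link_local
```

Under these assumptions, both `http://127.0.0.1:8080` and `http://169.254.169.254/latest/meta-data/` are classified as internal, while a public address such as `https://8.8.8.8/` is not.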
df49e43 to abf388f
- Change warn_only=False to actually block internal URLs when SSRF protection is enabled
- Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
- Update tests to reflect blocking mode

When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.
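A small sketch of how the two variables named in this commit might be read. The variable names and the "disabled unless set to true" default come from this thread; the comma-separated format for LANGFLOW_SSRF_ALLOWED_HOSTS and the helper name `ssrf_settings` are assumptions made for illustration, not the project's actual parsing.

```python
import os
from collections.abc import Mapping

def ssrf_settings(env: Mapping[str, str] = os.environ) -> tuple[bool, list[str]]:
    """Return (enabled, allowed_hosts) from an environment mapping."""
    # Protection stays off unless the variable is explicitly "true".
    raw_enabled = env.get("LANGFLOW_SSRF_PROTECTION_ENABLED", "")
    enabled = raw_enabled.strip().lower() == "true"
    # Assumption: allowed hosts are a comma-separated list.
    raw_hosts = env.get("LANGFLOW_SSRF_ALLOWED_HOSTS", "")
    allowed = [h.strip() for h in raw_hosts.split(",") if h.strip()]
    return enabled, allowed
```

An empty value, as in the .env.example default, leaves protection disabled with no allowed hosts.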
…m/langflow-ai/langflow into fix/ssrf-url-component-PVR0699081
* fix: add SSRF protection to URL component (PVR0699081)

  Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints.

  The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* fix: enforce SSRF blocking and add env variables to .env.example

  - Change warn_only=False to actually block internal URLs when SSRF protection is enabled
  - Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
  - Update tests to reflect blocking mode

  When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.

* fix: correct .env.example to show empty default for SSRF protection

  The default is false, so .env.example should be empty (not true).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* fix: add SSRF protection to URL component (PVR0699081)

  Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints.

  The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).

* fix: enforce SSRF blocking and add env variables to .env.example

  - Change warn_only=False to actually block internal URLs when SSRF protection is enabled
  - Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
  - Update tests to reflect blocking mode

  When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.

* fix: correct .env.example to show empty default for SSRF protection

  The default is false, so .env.example should be empty (not true).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Summary
validate_url_for_ssrf function (already used by API Request component)
Security Issue
The URL component was vulnerable to Server-Side Request Forgery (SSRF) because it only validated URL syntax but did not check if the target was an internal resource. An attacker could supply URLs like http://127.0.0.1:8080 or http://169.254.169.254/latest/meta-data/ to access internal services.
Changes
ensure_url() method using validate_url_for_ssrf()
Behavior
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests