fix: add SSRF protection to URL component (PVR0699081) by RamGopalSrikar · Pull Request #11996 · langflow-ai/langflow

RamGopalSrikar · 2026-03-03T15:25:21Z

Summary

Add SSRF protection to URL component to prevent unauthorized access to internal resources
Integrates existing validate_url_for_ssrf function (already used by API Request component)
Blocks access to localhost, private IP ranges (10.x, 172.16-31.x, 192.168.x), and cloud metadata endpoints (169.254.169.254)

Security Issue

The URL component was vulnerable to Server-Side Request Forgery (SSRF) because it validated only the URL syntax and did not check whether the target was an internal resource. An attacker could supply URLs such as http://127.0.0.1:8080 or http://169.254.169.254/latest/meta-data/ to reach internal services.
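
To make the gap concrete, here is a minimal sketch of the kind of check that syntax-only validation misses. It uses only Python's standard `ipaddress` and `urllib.parse` modules. `is_internal_address` is a hypothetical helper written for illustration; it is not Langflow's `validate_url_for_ssrf`, whose implementation is not shown in this description. The sketch handles IP literals and the name `localhost` only; hostnames that resolve to internal addresses would need a DNS lookup, which is omitted here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_address(url: str) -> bool:
    """Return True if the URL's host is localhost or an IP literal in a non-public range.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a fuller check would resolve the name first.
        return False
    # Loopback (127.0.0.0/8, ::1), private ranges (10.x, 172.16-31.x, 192.168.x),
    # and link-local (169.254.x, which includes the cloud metadata address).
    return ip.is_loopback or ip.is_private or ip.is_link_local


for candidate in (
    "http://127.0.0.1:8080",
    "http://169.254.169.254/latest/meta-data/",
    "http://192.168.1.1/admin",
    "https://93.184.216.34/",
):
    print(candidate, is_internal_address(candidate))
```

Both example URLs from the paragraph above pass a regex-based syntax check, yet this kind of address classification flags them as internal.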

Changes

url.py: Added SSRF validation in ensure_url() method using validate_url_for_ssrf()
test_url_component.py: Added 7 unit tests for SSRF protection

Behavior

SSRF Protection Setting	Behavior
Disabled (default)	All URLs allowed
Enabled + warn_only=True	Logs warning for internal URLs
Enabled + warn_only=False	Blocks internal URLs (future v2.0)
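
The three rows above can be sketched as a small decision function. This is an illustration of the described behavior under stated assumptions, not the code in `lfx.utils.ssrf_protection`: `check_url`, `BlockedURLError`, and the `is_internal`/`enabled` parameters are hypothetical names, and the internal-address decision is passed in as a flag to keep the example short.

```python
import logging

logger = logging.getLogger("ssrf_sketch")


class BlockedURLError(Exception):
    """Hypothetical stand-in for SSRFProtectionError."""


def check_url(url: str, *, is_internal: bool, enabled: bool, warn_only: bool) -> None:
    """Apply the behavior table: allow, warn, or block."""
    if not enabled or not is_internal:
        # Protection disabled (the default) or a public target: allow.
        return
    if warn_only:
        # Enabled + warn_only=True: log and continue.
        logger.warning("SSRF warning: %s targets an internal resource", url)
        return
    # Enabled + warn_only=False: block.
    msg = f"Access to {url} is blocked"
    raise BlockedURLError(msg)


check_url("http://127.0.0.1:8080", is_internal=True, enabled=False, warn_only=True)
check_url("http://127.0.0.1:8080", is_internal=True, enabled=True, warn_only=True)
```

Only the last row raises, which is why the component keeps working for existing flows while warn-only mode is in effect.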

Test plan

All existing URL component tests pass
New SSRF protection tests pass (7 tests)
Matches API Request component implementation

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Enhanced URL security with SSRF protection that validates and blocks potentially dangerous URLs, including localhost, private IP addresses, and metadata service endpoints.
- Improved error handling to provide clear user-facing messages when URLs are blocked by security validation.
Tests
- Added comprehensive unit tests for SSRF protection mechanisms in URL handling.

coderabbitai · 2026-03-03T15:25:56Z

Walkthrough

This PR adds SSRF (Server-Side Request Forgery) protection to URL handling across the codebase. The URLComponent in starter projects, component indexes, and the core implementation are updated to validate URLs against SSRF protections, raising errors when blocked. Hash values are updated to reflect functional changes.

Changes

Cohort / File(s)	Summary
Starter Project JSONs `src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json`, `src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json`, `src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json`	Updated URLComponent code blocks to import SSRFProtectionError and validate_url_for_ssrf; ensure_url now performs SSRF validation with warn_only=True and converts SSRF exceptions to ValueError; code_hash metadata updated for affected nodes.
Component Index Files `src/lfx/src/lfx/_assets/component_index.json`, `src/lfx/src/lfx/_assets/stable_hash_history.json`	Updated inlined URLComponent code to add SSRF validation imports and checks in ensure_url; hash entries updated to reflect functional changes in component code.
URL Component Implementation `src/lfx/src/lfx/components/data_source/url.py`	Added SSRFProtectionError and validate_url_for_ssrf imports; ensure_url enhanced to call validate_url_for_ssrf with warn_only=True before returning normalized URL; SSRF-related exceptions converted to ValueError with descriptive messages; docstrings updated.
URL Component Tests `src/backend/tests/unit/components/data_source/test_url_component.py`	Added TestURLComponentSSRFProtection test class with comprehensive SSRF protection coverage including mocked SSRF settings; validates ensure_url calls validate_url_for_ssrf correctly, blocks private IPs/localhost/metadata endpoints, passes public URLs, and propagates SSRF exceptions through fetch_content.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 3

❌ Failed checks (3 warnings)

Check name	Status	Explanation	Resolution
Test Quality And Coverage	⚠️ Warning	SSRF protection tests exist but have significant quality issues preventing comprehensive validation of the feature implementation.	Fix unused fixture parameters by prefixing with underscores, add assertions for warn_only parameter propagation, expand edge-case coverage (IPv6, ports, malformed URLs), and strengthen integration tests between fetch_content and ensure_url components.
Test File Naming And Structure	⚠️ Warning	Test file follows proper naming, structure, and comprehensive test coverage conventions but contains five unused fixture arguments that violate ARG002 linting rules.	Prefix unused mock_ssrf_settings parameters with underscore (_mock_ssrf_settings) in five test methods to signal intentional non-usage or remove the unused parameters entirely.
Excessive Mock Usage Warning	⚠️ Warning	SSRF protection tests use excessive mocking of validate_url_for_ssrf function instead of testing real validation logic, with five tests containing unused mock_ssrf_settings fixture parameters causing linting violations.	Refactor tests to call real validate_url_for_ssrf function with only external dependencies mocked, following the pattern in test_ssrf_protection.py, and verify actual SSRF blocking behavior in component flow.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding SSRF protection to the URL component, matching the primary objective of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Coverage For New Implementations	✅ Passed	PR includes comprehensive test coverage for SSRF protection with 7 test methods covering all critical scenarios including localhost, private IPs, metadata endpoints, and public URLs.

github-actions · 2026-03-03T15:32:55Z

Frontend Unit Test Coverage Report

Coverage Summary

Lines	Statements	Branches	Functions
	28.06% (29449/104920)	64.69% (3758/5809)	30.05% (691/2299)

Unit Test Results

Tests	Skipped	Failures	Errors	Time
3065	0 💤	0 ❌	0 🔥	4m 46s ⏱️

codecov · 2026-03-03T15:33:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.25%. Comparing base (68642a8) to head (56f0208).
⚠️ Report is 1 commit behind head on release-1.9.0.

❌ Your project status has failed because the head coverage (48.79%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@                Coverage Diff                @@
##           release-1.9.0   #11996      +/-   ##
=================================================
- Coverage          49.48%   49.25%   -0.24%     
=================================================
  Files               1929     1929              
  Lines             171262   171189      -73     
  Branches           25038    23735    -1303     
=================================================
- Hits               84754    84322     -432     
- Misses             85481    85840     +359     
  Partials            1027     1027

Flag	Coverage Δ
backend	`55.71% <ø> (-0.06%)`	⬇️
frontend	`47.85% <ø> (-0.34%)`	⬇️
lfx	`48.79% <ø> (ø)`

see 53 files with indirect coverage changes

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/lfx/src/lfx/components/data_source/url.py (1)
240-247: Avoid pinning SSRF validation to permanent warn-only mode.

Line 243 hard-codes warn_only=True, which makes enforcement depend on a future manual edit. Calling with the utility default keeps current behavior today and automatically follows the planned default flip in v2.0.
♻️ Proposed refactor
-        # TODO: In next major version (2.0), remove warn_only=True to enforce blocking
         try:
-            validate_url_for_ssrf(url, warn_only=True)
+            validate_url_for_ssrf(url)
         except SSRFProtectionError as e:
-            # This will only raise if SSRF protection is enabled and warn_only=False
+            # Raised when SSRF protection is in enforcement mode
             msg = f"SSRF Protection: {e}"
             raise ValueError(msg) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lfx/src/lfx/components/data_source/url.py` around lines 240 - 247, The
code pins SSRF validation to warn-only by calling validate_url_for_ssrf(url,
warn_only=True); remove the hard-coded warn_only=True so the call uses the
utility's default behavior (i.e., call validate_url_for_ssrf(url)), preserving
the existing except SSRFProtectionError as e block and re-raising
ValueError(msg) from e; this ensures enforcement follows the utility's future
default without changing the current exception handling around
validate_url_for_ssrf and SSRFProtectionError.
src/backend/tests/unit/components/data_source/test_url_component.py (1)
283-337: Reduce mock-only SSRF assertions in the new tests.

Most new cases patch validate_url_for_ssrf directly, so they mostly verify mocked behavior rather than actual SSRF rule evaluation. Consider keeping at least one path that exercises the real validator (mocking only settings) to prove localhost/private/metadata blocking behavior end-to-end.

As per coding guidelines **/test_*.py: “Ensure mocks are used appropriately for external dependencies only, not for core logic.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/tests/unit/components/data_source/test_url_component.py` around
lines 283 - 337, The new tests over-mock the core SSRF logic by patching
validate_url_for_ssrf everywhere; update tests to include at least one
end-to-end case that does not patch validate_url_for_ssrf so the real validator
runs (use the existing mock_ssrf_settings fixture to enable protection), e.g.,
in test_ssrf_blocks_private_ip_when_enabled or add a new test that calls
URLComponent.ensure_url("http://127.0.0.1:8080") and asserts a
ValueError/SSRFProtectionError, while limiting mocks to external deps only;
ensure references to validate_url_for_ssrf, URLComponent.ensure_url, and
mock_ssrf_settings are used so the validator itself is exercised.
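
The difference between a mock-only test and an end-to-end test can be sketched with the standard library alone. Everything below is a stand-in under stated assumptions: `SETTINGS`, `validate_url`, and this `ensure_url` are hypothetical simplifications, not the lfx settings object, `validate_url_for_ssrf`, or `URLComponent.ensure_url`. The point is that only the settings are patched, so the validator's own rules decide the outcome. In the real test suite, `pytest.raises` would replace the try/except.

```python
import ipaddress
from unittest.mock import patch
from urllib.parse import urlsplit

# Hypothetical stand-in for the settings that the real fixture would mock.
SETTINGS = {"ssrf_protection_enabled": False}


class SSRFProtectionError(Exception):
    """Stand-in for the exception of the same name used in the PR."""


def validate_url(url: str) -> None:
    """Stand-in validator: raise for internal IP-literal hosts when enabled."""
    if not SETTINGS["ssrf_protection_enabled"]:
        return
    host = urlsplit(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return  # Not an IP literal; DNS resolution is out of scope here.
    if ip.is_loopback or ip.is_private or ip.is_link_local:
        msg = f"Access to IP address {ip} is blocked"
        raise SSRFProtectionError(msg)


def ensure_url(url: str) -> str:
    """Mirror the PR's pattern of converting the SSRF error to ValueError."""
    try:
        validate_url(url)
    except SSRFProtectionError as e:
        msg = f"SSRF Protection: {e}"
        raise ValueError(msg) from e
    return url


def test_blocks_loopback_end_to_end() -> None:
    # Only the settings are patched; the validator's logic runs for real.
    with patch.dict(SETTINGS, {"ssrf_protection_enabled": True}):
        try:
            ensure_url("http://127.0.0.1:8080")
        except ValueError as exc:
            assert "SSRF Protection" in str(exc)
        else:
            raise AssertionError("expected ValueError for a loopback address")


def test_allows_public_address() -> None:
    with patch.dict(SETTINGS, {"ssrf_protection_enabled": True}):
        assert ensure_url("https://93.184.216.34/") == "https://93.184.216.34/"


test_blocks_loopback_end_to_end()
test_allows_public_address()
```

If `validate_url` were patched instead, the first test would pass for any URL, because the assertion would only observe the mock's `side_effect`.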

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/tests/unit/components/data_source/test_url_component.py`:
- Around line 279-331: The tests are failing Ruff ARG002 because the injected
fixture mock_ssrf_settings is never used in several tests; to fix, use the
fixture in each affected test (e.g., test_ssrf_blocks_localhost_when_enabled,
test_ssrf_blocks_private_ip_when_enabled,
test_ssrf_blocks_metadata_endpoint_when_enabled, test_ssrf_allows_public_urls,
test_ssrf_protection_in_fetch_content) by adding a no-op reference at the start
of the test body like "_ = mock_ssrf_settings" (or rename the parameter to
"_mock_ssrf_settings") so the fixture is considered used and the linter error is
resolved.

In `@src/lfx/src/lfx/_assets/component_index.json`:
- Line 57210: The SSRF check in ensure_url currently calls
validate_url_for_ssrf(url, warn_only=True) unconditionally; make warn_only
configurable by adding a BoolInput (e.g., name="ssrf_warn_only",
display_name="SSRF Warn Only", value=True, advanced=True) to the component's
inputs (so default remains True for compatibility) and replace the hardcoded
call with validate_url_for_ssrf(url, warn_only=self.ssrf_warn_only) inside
ensure_url; ensure the BoolInput name matches the attribute used so operators
can toggle strict blocking without a major release.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a18aaa and fae6eb1.

📒 Files selected for processing (7)

src/backend/base/langflow/initial_setup/starter_projects/Blog Writer.json
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json
src/backend/base/langflow/initial_setup/starter_projects/Simple Agent.json
src/backend/tests/unit/components/data_source/test_url_component.py
src/lfx/src/lfx/_assets/component_index.json
src/lfx/src/lfx/_assets/stable_hash_history.json
src/lfx/src/lfx/components/data_source/url.py

coderabbitai · 2026-03-03T15:43:48Z

+    def test_ssrf_blocks_localhost_when_enabled(self, mock_ssrf_settings):
+        """Test that localhost is blocked when SSRF protection is enabled."""
+        component = URLComponent()
+
+        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
+            mock_validate.side_effect = SSRFProtectionError("Access to IP address 127.0.0.1 is blocked")
+
+            with pytest.raises(ValueError, match="SSRF Protection"):
+                component.ensure_url("http://127.0.0.1:8080")
+
+    def test_ssrf_blocks_private_ip_when_enabled(self, mock_ssrf_settings):
+        """Test that private IPs are blocked when SSRF protection is enabled."""
+        component = URLComponent()
+
+        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
+            mock_validate.side_effect = SSRFProtectionError("Access to IP address 192.168.1.1 is blocked")
+
+            with pytest.raises(ValueError, match="SSRF Protection"):
+                component.ensure_url("http://192.168.1.1/admin")
+
+    def test_ssrf_blocks_metadata_endpoint_when_enabled(self, mock_ssrf_settings):
+        """Test that cloud metadata endpoints are blocked when SSRF protection is enabled."""
+        component = URLComponent()
+
+        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
+            mock_validate.side_effect = SSRFProtectionError("Access to IP address 169.254.169.254 is blocked")
+
+            with pytest.raises(ValueError, match="SSRF Protection"):
+                component.ensure_url("http://169.254.169.254/latest/meta-data/")
+
+    def test_ssrf_allows_public_urls(self, mock_ssrf_settings):
+        """Test that public URLs are allowed."""
+        component = URLComponent()
+
+        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
+            # No exception means it's allowed
+            mock_validate.return_value = None
+
+            url = component.ensure_url("https://www.google.com")
+            assert url == "https://www.google.com"
+            mock_validate.assert_called_once()
+
+    def test_ssrf_warn_only_mode(self):
+        """Test that warn_only=True is passed to validation."""
+        component = URLComponent()
+
+        with patch("lfx.components.data_source.url.validate_url_for_ssrf") as mock_validate:
+            component.ensure_url("https://example.com")
+
+            # Verify warn_only=True is passed (current behavior for backwards compatibility)
+            mock_validate.assert_called_with("https://example.com", warn_only=True)
+
+    def test_ssrf_protection_in_fetch_content(self, mock_ssrf_settings):


⚠️ Potential issue | 🟠 Major

Fix unused fixture arguments to unblock Ruff CI.

mock_ssrf_settings is injected but never referenced in five test methods, which is failing lint with ARG002.

Minimal lint-safe fix

-    def test_ssrf_blocks_localhost_when_enabled(self, mock_ssrf_settings):
+    def test_ssrf_blocks_localhost_when_enabled(self, _mock_ssrf_settings):
 ...
-    def test_ssrf_blocks_private_ip_when_enabled(self, mock_ssrf_settings):
+    def test_ssrf_blocks_private_ip_when_enabled(self, _mock_ssrf_settings):
 ...
-    def test_ssrf_blocks_metadata_endpoint_when_enabled(self, mock_ssrf_settings):
+    def test_ssrf_blocks_metadata_endpoint_when_enabled(self, _mock_ssrf_settings):
 ...
-    def test_ssrf_allows_public_urls(self, mock_ssrf_settings):
+    def test_ssrf_allows_public_urls(self, _mock_ssrf_settings):
 ...
-    def test_ssrf_protection_in_fetch_content(self, mock_ssrf_settings):
+    def test_ssrf_protection_in_fetch_content(self, _mock_ssrf_settings):
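
One caveat when applying a rename like this: pytest resolves fixtures by parameter name, so a parameter called `_mock_ssrf_settings` requests a fixture with that exact name, and the fixture definition would need to be renamed to match. An alternative that keeps the fixture active without any unused argument is `pytest.mark.usefixtures`. The sketch below assumes a placeholder fixture body and a hypothetical `SSRF_SETTINGS` dict, since the PR's actual fixture is not shown in this thread.

```python
import pytest

# Hypothetical stand-in for the settings the real fixture would mock.
SSRF_SETTINGS = {"enabled": False}


@pytest.fixture
def mock_ssrf_settings():
    """Enable SSRF protection for the duration of one test (placeholder body)."""
    SSRF_SETTINGS["enabled"] = True
    yield
    SSRF_SETTINGS["enabled"] = False


@pytest.mark.usefixtures("mock_ssrf_settings")
def test_ssrf_enabled_without_unused_argument():
    # The fixture runs before this test, and the signature carries no
    # unused parameter for the ARG002 rule to flag.
    assert SSRF_SETTINGS["enabled"] is True
```

The same decorator works on test methods and on a whole test class, which would cover all five affected tests with a single line.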

🧰 Tools

🪛 GitHub Actions: Ruff Style Check

[error] 279-279: Command 'uv run --only-dev ruff check --output-format=github .' failed: ARG002 Unused method argument: mock_ssrf_settings.

🪛 GitHub Check: Ruff Style Check (3.13)

[failure] 331-331: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:331:53: ARG002 Unused method argument: mock_ssrf_settings

[failure] 309-309: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:309:44: ARG002 Unused method argument: mock_ssrf_settings

[failure] 299-299: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:299:63: ARG002 Unused method argument: mock_ssrf_settings

[failure] 289-289: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:289:56: ARG002 Unused method argument: mock_ssrf_settings

[failure] 279-279: Ruff (ARG002)
src/backend/tests/unit/components/data_source/test_url_component.py:279:55: ARG002 Unused method argument: mock_ssrf_settings

coderabbitai · 2026-03-03T15:43:48Z

              "title_case": false,
              "type": "code",
-              "value": "import importlib\nimport io\nimport re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom markitdown import MarkItDown\n\nfrom lfx.custom.custom_component.component import Component\nfrom lfx.field_typing.range_spec import RangeSpec\nfrom lfx.helpers.data import safe_convert\nfrom lfx.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom lfx.log.logger import logger\nfrom lfx.schema.dataframe import DataFrame\nfrom lfx.schema.message import Message\nfrom lfx.utils.request_utils import get_user_agent\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\n\n\nURL_REGEX = re.compile(\n    r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n    re.IGNORECASE,\n)\n\nUSER_AGENT = None\n# Check if langflow is installed using importlib.util.find_spec(name))\nif importlib.util.find_spec(\"langflow\"):\n    langflow_installed = True\n    USER_AGENT = get_user_agent()\nelse:\n    langflow_installed = False\n    USER_AGENT = \"lfx\"\n\n\nclass URLComponent(Component):\n    \"\"\"A component that loads and parses content from web pages recursively.\n\n    This component allows fetching content from one or more URLs, with options to:\n    - Control crawl depth\n    - Prevent crawling outside the root domain\n    - Use async loading for better performance\n    - Extract either raw HTML or clean text\n    - Configure request headers and timeouts\n    \"\"\"\n\n    display_name = \"URL\"\n    description = \"Fetch content from one or more web pages, following links recursively.\"\n    documentation: str = \"https://docs.langflow.org/url\"\n    icon = \"layout-template\"\n    name = \"URLComponent\"\n\n    inputs = [\n        MessageTextInput(\n            name=\"urls\",\n            display_name=\"URLs\",\n            info=\"Enter one or 
more URLs to crawl recursively, by clicking the '+' button.\",\n            is_list=True,\n            tool_mode=True,\n            placeholder=\"Enter a URL...\",\n            list_add_label=\"Add URL\",\n            input_types=[],\n        ),\n        SliderInput(\n            name=\"max_depth\",\n            display_name=\"Depth\",\n            info=(\n                \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n                \"- depth 1: only the initial page\\n\"\n                \"- depth 2: initial page + all pages linked directly from it\\n\"\n                \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n                \"Note: This is about link traversal, not URL path depth.\"\n            ),\n            value=DEFAULT_MAX_DEPTH,\n            range_spec=RangeSpec(min=1, max=5, step=1),\n            required=False,\n            min_label=\" \",\n            max_label=\" \",\n            min_label_icon=\"None\",\n            max_label_icon=\"None\",\n            # slider_input=True\n        ),\n        BoolInput(\n            name=\"prevent_outside\",\n            display_name=\"Prevent Outside\",\n            info=(\n                \"If enabled, only crawls URLs within the same domain as the root URL. 
\"\n                \"This helps prevent the crawler from going to external websites.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"use_async\",\n            display_name=\"Use Async\",\n            info=(\n                \"If enabled, uses asynchronous loading which can be significantly faster \"\n                \"but might use more system resources.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        DropdownInput(\n            name=\"format\",\n            display_name=\"Output Format\",\n            info=(\n                \"Output Format. Use 'Text' to extract the text from the HTML, \"\n                \"'Markdown' to parse the HTML into Markdown format, or 'HTML' \"\n                \"for the raw HTML content.\"\n            ),\n            options=[\"Text\", \"HTML\", \"Markdown\"],\n            value=DEFAULT_FORMAT,\n            advanced=True,\n        ),\n        IntInput(\n            name=\"timeout\",\n            display_name=\"Timeout\",\n            info=\"Timeout for the request in seconds.\",\n            value=DEFAULT_TIMEOUT,\n            required=False,\n            advanced=True,\n        ),\n        TableInput(\n            name=\"headers\",\n            display_name=\"Headers\",\n            info=\"The headers to send with the request\",\n            table_schema=[\n                {\n                    \"name\": \"key\",\n                    \"display_name\": \"Header\",\n                    \"type\": \"str\",\n                    \"description\": \"Header name\",\n                },\n                {\n                    \"name\": \"value\",\n                    \"display_name\": \"Value\",\n                    \"type\": \"str\",\n                    \"description\": \"Header value\",\n                },\n            ],\n            value=[{\"key\": \"User-Agent\", 
\"value\": USER_AGENT}],\n            advanced=True,\n            input_types=[\"DataFrame\"],\n        ),\n        BoolInput(\n            name=\"filter_text_html\",\n            display_name=\"Filter Text/HTML\",\n            info=\"If enabled, filters out text/css content type from the results.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"continue_on_failure\",\n            display_name=\"Continue on Failure\",\n            info=\"If enabled, continues crawling even if some requests fail.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"check_response_status\",\n            display_name=\"Check Response Status\",\n            info=\"If enabled, checks the response status of the request.\",\n            value=False,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"autoset_encoding\",\n            display_name=\"Autoset Encoding\",\n            info=\"If enabled, automatically sets the encoding of the request.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n    ]\n\n    outputs = [\n        Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n        Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n    ]\n\n    @staticmethod\n    def _html_extractor(x: str) -> str:\n        \"\"\"Extract raw HTML content.\"\"\"\n        return x\n\n    @staticmethod\n    def _text_extractor(x: str) -> str:\n        \"\"\"Extract clean text from HTML.\"\"\"\n        return BeautifulSoup(x, \"lxml\").get_text()\n\n    @staticmethod\n    def _markdown_extractor(x: str) -> str:\n        \"\"\"Convert HTML to Markdown format.\"\"\"\n        stream = io.BytesIO(x.encode(\"utf-8\"))\n        result = 
MarkItDown(enable_plugins=False).convert_stream(stream)\n        return result.markdown\n\n    @staticmethod\n    def validate_url(url: str) -> bool:\n        \"\"\"Validates if the given string matches URL pattern.\n\n        Args:\n            url: The URL string to validate\n\n        Returns:\n            bool: True if the URL is valid, False otherwise\n        \"\"\"\n        return bool(URL_REGEX.match(url))\n\n    def ensure_url(self, url: str) -> str:\n        \"\"\"Ensures the given string is a valid URL.\n\n        Args:\n            url: The URL string to validate and normalize\n\n        Returns:\n            str: The normalized URL\n\n        Raises:\n            ValueError: If the URL is invalid\n        \"\"\"\n        url = url.strip()\n        if not url.startswith((\"http://\", \"https://\")):\n            url = \"https://\" + url\n\n        if not self.validate_url(url):\n            msg = f\"Invalid URL: {url}\"\n            raise ValueError(msg)\n\n        return url\n\n    def _create_loader(self, url: str) -> RecursiveUrlLoader:\n        \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n        Args:\n            url: The URL to load\n\n        Returns:\n            RecursiveUrlLoader: Configured loader instance\n        \"\"\"\n        headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers if header[\"value\"] is not None}\n        extractors = {\n            \"HTML\": self._html_extractor,\n            \"Markdown\": self._markdown_extractor,\n            \"Text\": self._text_extractor,\n        }\n        extractor = extractors.get(self.format, self._text_extractor)\n\n        return RecursiveUrlLoader(\n            url=url,\n            max_depth=self.max_depth,\n            prevent_outside=self.prevent_outside,\n            use_async=self.use_async,\n            extractor=extractor,\n            timeout=self.timeout,\n            headers=headers_dict,\n            
check_response_status=self.check_response_status,\n            continue_on_failure=self.continue_on_failure,\n            base_url=url,  # Add base_url to ensure consistent domain crawling\n            autoset_encoding=self.autoset_encoding,  # Enable automatic encoding detection\n            exclude_dirs=[],  # Allow customization of excluded directories\n            link_regex=None,  # Allow customization of link filtering\n        )\n\n    def fetch_url_contents(self) -> list[dict]:\n        \"\"\"Load documents from the configured URLs.\n\n        Returns:\n            List[Data]: List of Data objects containing the fetched content\n\n        Raises:\n            ValueError: If no valid URLs are provided or if there's an error loading documents\n        \"\"\"\n        try:\n            urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n            logger.debug(f\"URLs: {urls}\")\n            if not urls:\n                msg = \"No valid URLs provided.\"\n                raise ValueError(msg)\n\n            all_docs = []\n            for url in urls:\n                logger.debug(f\"Loading documents from {url}\")\n\n                try:\n                    loader = self._create_loader(url)\n                    docs = loader.load()\n\n                    if not docs:\n                        logger.warning(f\"No documents found for {url}\")\n                        continue\n\n                    logger.debug(f\"Found {len(docs)} documents from {url}\")\n                    all_docs.extend(docs)\n\n                except requests.exceptions.RequestException as e:\n                    logger.exception(f\"Error loading documents from {url}: {e}\")\n                    continue\n\n            if not all_docs:\n                msg = \"No documents were successfully loaded from any URL\"\n                raise ValueError(msg)\n\n            # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n            data = [\n      
          {\n                    \"text\": safe_convert(doc.page_content, clean_data=True),\n                    \"url\": doc.metadata.get(\"source\", \"\"),\n                    \"title\": doc.metadata.get(\"title\", \"\"),\n                    \"description\": doc.metadata.get(\"description\", \"\"),\n                    \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n                    \"language\": doc.metadata.get(\"language\", \"\"),\n                }\n                for doc in all_docs\n            ]\n        except Exception as e:\n            error_msg = e.message if hasattr(e, \"message\") else e\n            msg = f\"Error loading documents: {error_msg!s}\"\n            logger.exception(msg)\n            raise ValueError(msg) from e\n        return data\n\n    def fetch_content(self) -> DataFrame:\n        \"\"\"Convert the documents to a DataFrame.\"\"\"\n        return DataFrame(data=self.fetch_url_contents())\n\n    def fetch_content_as_message(self) -> Message:\n        \"\"\"Convert the documents to a Message.\"\"\"\n        url_contents = self.fetch_url_contents()\n        return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"
+              "value": "import importlib\nimport io\nimport re\n\nimport requests\nfrom bs4 import BeautifulSoup\nfrom langchain_community.document_loaders import RecursiveUrlLoader\nfrom markitdown import MarkItDown\n\nfrom lfx.custom.custom_component.component import Component\nfrom lfx.field_typing.range_spec import RangeSpec\nfrom lfx.helpers.data import safe_convert\nfrom lfx.io import BoolInput, DropdownInput, IntInput, MessageTextInput, Output, SliderInput, TableInput\nfrom lfx.log.logger import logger\nfrom lfx.schema.dataframe import DataFrame\nfrom lfx.schema.message import Message\nfrom lfx.utils.request_utils import get_user_agent\nfrom lfx.utils.ssrf_protection import SSRFProtectionError, validate_url_for_ssrf\n\n# Constants\nDEFAULT_TIMEOUT = 30\nDEFAULT_MAX_DEPTH = 1\nDEFAULT_FORMAT = \"Text\"\n\n\nURL_REGEX = re.compile(\n    r\"^(https?:\\/\\/)?\" r\"(www\\.)?\" r\"([a-zA-Z0-9.-]+)\" r\"(\\.[a-zA-Z]{2,})?\" r\"(:\\d+)?\" r\"(\\/[^\\s]*)?$\",\n    re.IGNORECASE,\n)\n\nUSER_AGENT = None\n# Check if langflow is installed using importlib.util.find_spec(name))\nif importlib.util.find_spec(\"langflow\"):\n    langflow_installed = True\n    USER_AGENT = get_user_agent()\nelse:\n    langflow_installed = False\n    USER_AGENT = \"lfx\"\n\n\nclass URLComponent(Component):\n    \"\"\"A component that loads and parses content from web pages recursively.\n\n    This component allows fetching content from one or more URLs, with options to:\n    - Control crawl depth\n    - Prevent crawling outside the root domain\n    - Use async loading for better performance\n    - Extract either raw HTML or clean text\n    - Configure request headers and timeouts\n    \"\"\"\n\n    display_name = \"URL\"\n    description = \"Fetch content from one or more web pages, following links recursively.\"\n    documentation: str = \"https://docs.langflow.org/url\"\n    icon = \"layout-template\"\n    name = \"URLComponent\"\n\n    inputs = [\n        MessageTextInput(\n            
name=\"urls\",\n            display_name=\"URLs\",\n            info=\"Enter one or more URLs to crawl recursively, by clicking the '+' button.\",\n            is_list=True,\n            tool_mode=True,\n            placeholder=\"Enter a URL...\",\n            list_add_label=\"Add URL\",\n            input_types=[],\n        ),\n        SliderInput(\n            name=\"max_depth\",\n            display_name=\"Depth\",\n            info=(\n                \"Controls how many 'clicks' away from the initial page the crawler will go:\\n\"\n                \"- depth 1: only the initial page\\n\"\n                \"- depth 2: initial page + all pages linked directly from it\\n\"\n                \"- depth 3: initial page + direct links + links found on those direct link pages\\n\"\n                \"Note: This is about link traversal, not URL path depth.\"\n            ),\n            value=DEFAULT_MAX_DEPTH,\n            range_spec=RangeSpec(min=1, max=5, step=1),\n            required=False,\n            min_label=\" \",\n            max_label=\" \",\n            min_label_icon=\"None\",\n            max_label_icon=\"None\",\n            # slider_input=True\n        ),\n        BoolInput(\n            name=\"prevent_outside\",\n            display_name=\"Prevent Outside\",\n            info=(\n                \"If enabled, only crawls URLs within the same domain as the root URL. 
\"\n                \"This helps prevent the crawler from going to external websites.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"use_async\",\n            display_name=\"Use Async\",\n            info=(\n                \"If enabled, uses asynchronous loading which can be significantly faster \"\n                \"but might use more system resources.\"\n            ),\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        DropdownInput(\n            name=\"format\",\n            display_name=\"Output Format\",\n            info=(\n                \"Output Format. Use 'Text' to extract the text from the HTML, \"\n                \"'Markdown' to parse the HTML into Markdown format, or 'HTML' \"\n                \"for the raw HTML content.\"\n            ),\n            options=[\"Text\", \"HTML\", \"Markdown\"],\n            value=DEFAULT_FORMAT,\n            advanced=True,\n        ),\n        IntInput(\n            name=\"timeout\",\n            display_name=\"Timeout\",\n            info=\"Timeout for the request in seconds.\",\n            value=DEFAULT_TIMEOUT,\n            required=False,\n            advanced=True,\n        ),\n        TableInput(\n            name=\"headers\",\n            display_name=\"Headers\",\n            info=\"The headers to send with the request\",\n            table_schema=[\n                {\n                    \"name\": \"key\",\n                    \"display_name\": \"Header\",\n                    \"type\": \"str\",\n                    \"description\": \"Header name\",\n                },\n                {\n                    \"name\": \"value\",\n                    \"display_name\": \"Value\",\n                    \"type\": \"str\",\n                    \"description\": \"Header value\",\n                },\n            ],\n            value=[{\"key\": \"User-Agent\", 
\"value\": USER_AGENT}],\n            advanced=True,\n            input_types=[\"DataFrame\"],\n        ),\n        BoolInput(\n            name=\"filter_text_html\",\n            display_name=\"Filter Text/HTML\",\n            info=\"If enabled, filters out text/css content type from the results.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"continue_on_failure\",\n            display_name=\"Continue on Failure\",\n            info=\"If enabled, continues crawling even if some requests fail.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"check_response_status\",\n            display_name=\"Check Response Status\",\n            info=\"If enabled, checks the response status of the request.\",\n            value=False,\n            required=False,\n            advanced=True,\n        ),\n        BoolInput(\n            name=\"autoset_encoding\",\n            display_name=\"Autoset Encoding\",\n            info=\"If enabled, automatically sets the encoding of the request.\",\n            value=True,\n            required=False,\n            advanced=True,\n        ),\n    ]\n\n    outputs = [\n        Output(display_name=\"Extracted Pages\", name=\"page_results\", method=\"fetch_content\"),\n        Output(display_name=\"Raw Content\", name=\"raw_results\", method=\"fetch_content_as_message\", tool_mode=False),\n    ]\n\n    @staticmethod\n    def _html_extractor(x: str) -> str:\n        \"\"\"Extract raw HTML content.\"\"\"\n        return x\n\n    @staticmethod\n    def _text_extractor(x: str) -> str:\n        \"\"\"Extract clean text from HTML.\"\"\"\n        return BeautifulSoup(x, \"lxml\").get_text()\n\n    @staticmethod\n    def _markdown_extractor(x: str) -> str:\n        \"\"\"Convert HTML to Markdown format.\"\"\"\n        stream = io.BytesIO(x.encode(\"utf-8\"))\n        result = 
MarkItDown(enable_plugins=False).convert_stream(stream)\n        return result.markdown\n\n    @staticmethod\n    def validate_url(url: str) -> bool:\n        \"\"\"Validates if the given string matches URL pattern.\n\n        Args:\n            url: The URL string to validate\n\n        Returns:\n            bool: True if the URL is valid, False otherwise\n        \"\"\"\n        return bool(URL_REGEX.match(url))\n\n    def ensure_url(self, url: str) -> str:\n        \"\"\"Ensures the given string is a valid URL.\n\n        Args:\n            url: The URL string to validate and normalize\n\n        Returns:\n            str: The normalized URL\n\n        Raises:\n            ValueError: If the URL is invalid or blocked by SSRF protection\n        \"\"\"\n        url = url.strip()\n        if not url.startswith((\"http://\", \"https://\")):\n            url = \"https://\" + url\n\n        if not self.validate_url(url):\n            msg = f\"Invalid URL: {url}\"\n            raise ValueError(msg)\n\n        # SSRF Protection: Validate URL to prevent access to internal resources\n        # TODO: In next major version (2.0), remove warn_only=True to enforce blocking\n        try:\n            validate_url_for_ssrf(url, warn_only=True)\n        except SSRFProtectionError as e:\n            # This will only raise if SSRF protection is enabled and warn_only=False\n            msg = f\"SSRF Protection: {e}\"\n            raise ValueError(msg) from e\n\n        return url\n\n    def _create_loader(self, url: str) -> RecursiveUrlLoader:\n        \"\"\"Creates a RecursiveUrlLoader instance with the configured settings.\n\n        Args:\n            url: The URL to load\n\n        Returns:\n            RecursiveUrlLoader: Configured loader instance\n        \"\"\"\n        headers_dict = {header[\"key\"]: header[\"value\"] for header in self.headers if header[\"value\"] is not None}\n        extractors = {\n            \"HTML\": self._html_extractor,\n            
\"Markdown\": self._markdown_extractor,\n            \"Text\": self._text_extractor,\n        }\n        extractor = extractors.get(self.format, self._text_extractor)\n\n        return RecursiveUrlLoader(\n            url=url,\n            max_depth=self.max_depth,\n            prevent_outside=self.prevent_outside,\n            use_async=self.use_async,\n            extractor=extractor,\n            timeout=self.timeout,\n            headers=headers_dict,\n            check_response_status=self.check_response_status,\n            continue_on_failure=self.continue_on_failure,\n            base_url=url,  # Add base_url to ensure consistent domain crawling\n            autoset_encoding=self.autoset_encoding,  # Enable automatic encoding detection\n            exclude_dirs=[],  # Allow customization of excluded directories\n            link_regex=None,  # Allow customization of link filtering\n        )\n\n    def fetch_url_contents(self) -> list[dict]:\n        \"\"\"Load documents from the configured URLs.\n\n        Returns:\n            List[Data]: List of Data objects containing the fetched content\n\n        Raises:\n            ValueError: If no valid URLs are provided or if there's an error loading documents\n        \"\"\"\n        try:\n            urls = list({self.ensure_url(url) for url in self.urls if url.strip()})\n            logger.debug(f\"URLs: {urls}\")\n            if not urls:\n                msg = \"No valid URLs provided.\"\n                raise ValueError(msg)\n\n            all_docs = []\n            for url in urls:\n                logger.debug(f\"Loading documents from {url}\")\n\n                try:\n                    loader = self._create_loader(url)\n                    docs = loader.load()\n\n                    if not docs:\n                        logger.warning(f\"No documents found for {url}\")\n                        continue\n\n                    logger.debug(f\"Found {len(docs)} documents from {url}\")\n                    
all_docs.extend(docs)\n\n                except requests.exceptions.RequestException as e:\n                    logger.exception(f\"Error loading documents from {url}: {e}\")\n                    continue\n\n            if not all_docs:\n                msg = \"No documents were successfully loaded from any URL\"\n                raise ValueError(msg)\n\n            # data = [Data(text=doc.page_content, **doc.metadata) for doc in all_docs]\n            data = [\n                {\n                    \"text\": safe_convert(doc.page_content, clean_data=True),\n                    \"url\": doc.metadata.get(\"source\", \"\"),\n                    \"title\": doc.metadata.get(\"title\", \"\"),\n                    \"description\": doc.metadata.get(\"description\", \"\"),\n                    \"content_type\": doc.metadata.get(\"content_type\", \"\"),\n                    \"language\": doc.metadata.get(\"language\", \"\"),\n                }\n                for doc in all_docs\n            ]\n        except Exception as e:\n            error_msg = e.message if hasattr(e, \"message\") else e\n            msg = f\"Error loading documents: {error_msg!s}\"\n            logger.exception(msg)\n            raise ValueError(msg) from e\n        return data\n\n    def fetch_content(self) -> DataFrame:\n        \"\"\"Convert the documents to a DataFrame.\"\"\"\n        return DataFrame(data=self.fetch_url_contents())\n\n    def fetch_content_as_message(self) -> Message:\n        \"\"\"Convert the documents to a Message.\"\"\"\n        url_contents = self.fetch_url_contents()\n        return Message(text=\"\\n\\n\".join([x[\"text\"] for x in url_contents]), data={\"data\": url_contents})\n"


⚠️ Potential issue | 🟠 Major

Hardcoded warn-only mode prevents strict SSRF enforcement.

At line 57210, the call validate_url_for_ssrf(url, warn_only=True) hardcodes warning mode, so this component path cannot block internal URLs even when stricter SSRF enforcement is desired.

Please make warn_only configurable (defaulting to True for compatibility) so operators can enable blocking without waiting for a major-version code change.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lfx/src/lfx/_assets/component_index.json` at line 57210, the SSRF check in ensure_url calls validate_url_for_ssrf(url, warn_only=True) unconditionally. Make warn_only configurable: add a BoolInput (e.g., name="ssrf_warn_only", display_name="SSRF Warn Only", value=True, advanced=True) to the component's inputs so the default remains True for compatibility, then replace the hardcoded call inside ensure_url with validate_url_for_ssrf(url, warn_only=self.ssrf_warn_only). Ensure the BoolInput name matches the attribute used, so operators can toggle strict blocking without a major release.
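The reviewer's suggestion can be sketched in isolation. This is a minimal illustration, not the component's actual code: `URLComponentSketch` is a hypothetical stand-in for `URLComponent`, and the `validate_url_for_ssrf` and `SSRFProtectionError` defined here are simplified stand-ins for the real helpers in `lfx.utils.ssrf_protection`. The stand-in ignores the global SSRF protection setting and only inspects literal IP addresses and `localhost`. The `ssrf_warn_only` attribute name comes from the review comment above.

```python
import ipaddress
import logging
from dataclasses import dataclass
from urllib.parse import urlparse

logger = logging.getLogger("ssrf_sketch")


class SSRFProtectionError(Exception):
    """Stand-in for lfx.utils.ssrf_protection.SSRFProtectionError (assumption)."""


def validate_url_for_ssrf(url: str, *, warn_only: bool = True) -> None:
    """Simplified stand-in: flags 'localhost' and literal internal IP addresses."""
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
        internal = ip.is_loopback or ip.is_private or ip.is_link_local
    except ValueError:
        # Not a literal IP; this sketch does no DNS resolution.
        internal = host.lower() == "localhost"
    if not internal:
        return
    if warn_only:
        logger.warning("SSRF warning: %s targets an internal address", url)
        return
    raise SSRFProtectionError(f"{url} targets an internal address")


@dataclass
class URLComponentSketch:
    # Hypothetical input proposed in the review; defaults to True for compatibility.
    ssrf_warn_only: bool = True

    def ensure_url(self, url: str) -> str:
        url = url.strip()
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        try:
            # The flag is read from the component instead of being hardcoded.
            validate_url_for_ssrf(url, warn_only=self.ssrf_warn_only)
        except SSRFProtectionError as e:
            msg = f"SSRF Protection: {e}"
            raise ValueError(msg) from e
        return url
```

With the default, an internal URL only produces a log warning and is returned unchanged. With `ssrf_warn_only=False`, the same URL raises `ValueError`, which is the blocking behavior the reviewer wants operators to be able to opt into.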

Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints. The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).

- Change to warn_only=False to actually block internal URLs when SSRF protection is enabled
- Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
- Update tests to reflect blocking mode

When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.
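As a rough sketch of the gating described in this commit message, the following shows how the two environment variables might combine with an address check. The function `is_blocked` is hypothetical, the comma-separated format assumed for LANGFLOW_SSRF_ALLOWED_HOSTS is an assumption, and the real logic lives in `validate_url_for_ssrf`, which may differ (for example, by resolving hostnames via DNS).

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked(url: str, env: dict[str, str]) -> bool:
    """Hypothetical helper: decide whether a URL would be blocked under the given env."""
    # Protection is off unless explicitly enabled (the commit notes the default is false).
    if env.get("LANGFLOW_SSRF_PROTECTION_ENABLED", "").strip().lower() != "true":
        return False

    # Assumption: the allowlist is a comma-separated list of hostnames or IPs.
    raw_allowed = env.get("LANGFLOW_SSRF_ALLOWED_HOSTS", "")
    allowed = {h.strip().lower() for h in raw_allowed.split(",") if h.strip()}

    host = (urlparse(url).hostname or "").lower()
    if host in allowed:
        return False
    if host == "localhost":
        return True

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A non-literal hostname; this sketch does not resolve DNS.
        return False

    # Loopback (127.x), private ranges (10.x, 172.16-31.x, 192.168.x),
    # and link-local addresses such as the 169.254.169.254 metadata endpoint.
    return ip.is_loopback or ip.is_private or ip.is_link_local
```

Passing the environment as a plain dict (for example `dict(os.environ)`) keeps the decision easy to test without mutating process state.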

…m/langflow-ai/langflow into fix/ssrf-url-component-PVR0699081

…-component-PVR0699081

* fix: add SSRF protection to URL component (PVR0699081)

  Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints. The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* fix: enforce SSRF blocking and add env variables to .env.example

  - Change to warn_only=False to actually block internal URLs when SSRF protection is enabled
  - Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
  - Update tests to reflect blocking mode

  When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.

* fix: correct .env.example to show empty default for SSRF protection

  The default is false, so .env.example should be empty (not true).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* fix: add SSRF protection to URL component (PVR0699081)

  Add Server-Side Request Forgery (SSRF) protection to the URL component by integrating the existing validate_url_for_ssrf function. This prevents the component from being used to access internal resources like localhost, private IP ranges, and cloud metadata endpoints. The fix uses warn_only=True for backwards compatibility, matching the behavior of the API Request component. Full blocking will be enabled in the next major version (2.0).

* fix: enforce SSRF blocking and add env variables to .env.example

  - Change to warn_only=False to actually block internal URLs when SSRF protection is enabled
  - Add LANGFLOW_SSRF_PROTECTION_ENABLED and LANGFLOW_SSRF_ALLOWED_HOSTS to .env.example
  - Update tests to reflect blocking mode

  When LANGFLOW_SSRF_PROTECTION_ENABLED=true, requests to private IPs, localhost, and cloud metadata endpoints will be blocked.

* fix: correct .env.example to show empty default for SSRF protection

  The default is false, so .env.example should be empty (not true).

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

github-actions Bot added the community Pull Request from an external contributor label Mar 3, 2026