From cbbee704c5507e0387e68003e970f6ebe79d0bc8 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Fri, 16 Jan 2026 16:54:27 -0500
Subject: [PATCH 1/3] add-markdown-output-format

---
 docs/docs/Components/url.mdx | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/docs/Components/url.mdx b/docs/docs/Components/url.mdx
index 67f527d0810d..845e2d00d762 100644
--- a/docs/docs/Components/url.mdx
+++ b/docs/docs/Components/url.mdx
@@ -10,7 +10,7 @@ import PartialParams from '@site/docs/_partial-hidden-params.mdx';
 import PartialDevModeWindows from '@site/docs/_partial-dev-mode-windows.mdx';
 
 The **URL** component fetches content from one or more URLs, processes the content, and returns it in various formats.
-It follows links recursively to a given depth, and it supports output in plain text or raw HTML.
+It follows links recursively to a given depth, and it supports output in plain text, Markdown, or raw HTML.
 
 ## URL parameters
 
@@ -24,7 +24,7 @@ Some of the available parameters include the following:
 | max_depth | Depth | Input parameter. Controls link traversal: how many "clicks" away from the initial page the crawler will go. A depth of 1 limits the crawl to the first page at the given URL only. A depth of 2 means the crawler crawls the first page plus each page directly linked from the first page, then stops. This setting exclusively controls link traversal; it doesn't limit the number of URL path segments or the domain. |
 | prevent_outside | Prevent Outside | Input parameter. If enabled, only crawls URLs within the same domain as the root URL. This prevents the crawler from accessing sites outside the given URL's domain, even if they are linked from one of the crawled pages. |
 | use_async | Use Async | Input parameter. If enabled, uses asynchronous loading which can be significantly faster but might use more system resources. |
-| format | Output Format | Input parameter. Sets the desired output format as **Text** or **HTML**. The default is **Text**. For more information, see [URL output](#url-output).|
+| format | Output Format | Input parameter. Sets the desired output format as **Text**, **Markdown**, or **HTML**. The default is **Text**. For more information, see [URL output](#url-output).|
 | timeout | Timeout | Input parameter. Timeout for the request in seconds. |
 | headers | Headers | Input parameter. The headers to send with the request if needed for authentication or otherwise. |
 
@@ -37,12 +37,13 @@ There are two settings that control the output of the **URL** component at diffe
 * **Output Format**: This optional parameter controls the content extracted from the crawled pages:
 
     * **Text (default)**: The component extracts only the text from the HTML of the crawled pages.
+    * **Markdown**: The component converts the HTML content to markdown format using [Markitdown](https://github.com/microsoft/markitdown).
     * **HTML**: The component extracts the entire raw HTML content of the crawled pages.
 
 * **Output data type**: In the component's output field (near the output port) you can select the structure of the outgoing data when it is passed to other components:
 
     * **Extracted Pages**: Outputs a [`DataFrame`](/data-types#dataframe) that breaks the crawled pages into columns for the entire page content (`text`) and metadata like `url` and `title`.
-    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text or HTML from the crawled pages, including metadata, in a single block of text.
+    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text, markdown, or HTML from the crawled pages, depending on the selected output format, including metadata, in a single block of text.
 
 When used as a standard component in a flow, the **URL** component must be connected to a component that accepts the selected output data type (`DataFrame` or `Message`).
 You can connect the **URL** component directly to a compatible component, or you can use a [**Type Convert** component](/type-convert) to convert the output to another type before passing the data to other components if the data types aren't directly compatible.

From a20a7b73c712900b7a0d38b1df83338a3a0322d7 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Fri, 16 Jan 2026 16:56:07 -0500
Subject: [PATCH 2/3] raw-content

---
 docs/docs/Components/url.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/Components/url.mdx b/docs/docs/Components/url.mdx
index 845e2d00d762..9f58e7fda2b4 100644
--- a/docs/docs/Components/url.mdx
+++ b/docs/docs/Components/url.mdx
@@ -43,7 +43,7 @@ There are two settings that control the output of the **URL** component at diffe
 * **Output data type**: In the component's output field (near the output port) you can select the structure of the outgoing data when it is passed to other components:
 
     * **Extracted Pages**: Outputs a [`DataFrame`](/data-types#dataframe) that breaks the crawled pages into columns for the entire page content (`text`) and metadata like `url` and `title`.
-    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text, markdown, or HTML from the crawled pages, depending on the selected output format, including metadata, in a single block of text.
+    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text, markdown, or HTML from the crawled pages, including metadata, in a single block of text.
 
 When used as a standard component in a flow, the **URL** component must be connected to a component that accepts the selected output data type (`DataFrame` or `Message`).
 You can connect the **URL** component directly to a compatible component, or you can use a [**Type Convert** component](/type-convert) to convert the output to another type before passing the data to other components if the data types aren't directly compatible.

From aad13e5e11f9f6af4c6a1d9e42213fedfafb964e Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Tue, 20 Jan 2026 11:01:15 -0500
Subject: [PATCH 3/3] Apply suggestions from code review

---
 docs/docs/Components/url.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/Components/url.mdx b/docs/docs/Components/url.mdx
index 9f58e7fda2b4..af1522a71f10 100644
--- a/docs/docs/Components/url.mdx
+++ b/docs/docs/Components/url.mdx
@@ -43,7 +43,7 @@ There are two settings that control the output of the **URL** component at diffe
 * **Output data type**: In the component's output field (near the output port) you can select the structure of the outgoing data when it is passed to other components:
 
     * **Extracted Pages**: Outputs a [`DataFrame`](/data-types#dataframe) that breaks the crawled pages into columns for the entire page content (`text`) and metadata like `url` and `title`.
-    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text, markdown, or HTML from the crawled pages, including metadata, in a single block of text.
+    * **Raw Content**: Outputs a [`Message`](/data-types#message) containing the entire text, Markdown, or HTML from the crawled pages, including metadata, in a single block of text.
 
 When used as a standard component in a flow, the **URL** component must be connected to a component that accepts the selected output data type (`DataFrame` or `Message`).
 You can connect the **URL** component directly to a compatible component, or you can use a [**Type Convert** component](/type-convert) to convert the output to another type before passing the data to other components if the data types aren't directly compatible.
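
Reviewer note: the `max_depth` and `prevent_outside` rows in the parameters table describe a depth-limited crawl in which the root page counts as depth 1. The sketch below is one way to read that description; it is not the component's actual implementation. The function name `crawl_order`, the in-memory `SITE` link map, and the use of `urlparse(...).netloc` as the "same domain" test are all assumptions made for illustration, and no HTTP requests are made.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(root, links, max_depth=1, prevent_outside=False):
    """Return the URLs a depth-limited crawl would visit, in visit order.

    `links` is a hypothetical map of page URL -> list of linked URLs that
    stands in for fetching and parsing real pages.
    """
    root_domain = urlparse(root).netloc
    visited = [root]
    seen = {root}
    queue = deque([(root, 1)])  # the root page counts as depth 1

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are read but not followed
        for target in links.get(url, []):
            if target in seen:
                continue
            # Assumed "same domain" check: compare network locations.
            if prevent_outside and urlparse(target).netloc != root_domain:
                continue
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Hypothetical site: the root links to one internal and one external page.
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://other.example.org/x",
    ],
    "https://example.com/a": ["https://example.com/b"],
}
```

With `max_depth=1` only the root URL is returned. With `max_depth=2` the two pages linked from the root are added, and enabling `prevent_outside` drops the `other.example.org` page from that result.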