fix: switch to docling-serve v1 API #9702

Merged: erichare merged 2 commits into langflow-ai:main from dolfim-ibm:docling-serve-v1-2 on Sep 4, 2025

Conversation

@dolfim-ibm
Contributor

@dolfim-ibm dolfim-ibm commented Sep 4, 2025

Replaces #9634

@erichare here is the clean PR from the latest main.

Summary by CodeRabbit

  • Refactor
    • Migrated the Docling remote integration to the v1 API and aligned request payload structure and naming for compatibility.
  • Bug Fixes
    • Improved reliability of document conversion requests with the updated service.
    • Removed an unused option to prevent confusion in conversion behavior.
  • Chores
    • Internal configuration cleanup with no changes to public interfaces or exported APIs.

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@coderabbitai
Contributor

coderabbitai Bot commented Sep 4, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updates the Docling remote client to target API v1 (from v1alpha), renames the payload field from "file_sources" to "sources" with a per-item "kind": "file", and removes the "return_as_file" option. No public API signatures changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docling Remote API update: src/lfx/src/lfx/components/docling/docling_remote.py | Switch base URL path from /v1alpha to /v1; request payload uses sources (was file_sources) and each source includes kind: "file" with base64_string and filename; removed return_as_file option from Docling options. |
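
As a rough illustration of the payload change summarized above, the sketch below builds a v1-style request body for a single local file. The field names sources, kind, base64_string, and filename come from the walkthrough; the helper name build_v1_payload and the pass-through "options" key are assumptions for illustration and are not taken from docling_remote.py.

```python
import base64
from pathlib import Path
from typing import Optional


def build_v1_payload(file_path: Path, options: Optional[dict] = None) -> dict:
    """Build a v1-style request body for one local file (illustrative only)."""
    encoded_doc = base64.b64encode(file_path.read_bytes()).decode()
    return {
        # Assumption: options are passed through unchanged. The PR only states
        # that "return_as_file" is no longer among them.
        "options": dict(options or {}),
        # v1 uses "sources" (was "file_sources"); each item is tagged kind="file".
        "sources": [
            {
                "kind": "file",
                "base64_string": encoded_doc,
                "filename": file_path.name,
            }
        ],
    }
```

The resulting dictionary would be sent as the JSON body of the POST request; the exact endpoint path under /v1 is not shown in this PR summary, so it is left out here.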

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DoclingRemote
  participant DoclingAPI as Docling API (v1)

  Client->>DoclingRemote: convert_document(file)
  Note right of DoclingRemote: Build payload with<br/>sources: [{ kind: "file", base64_string, filename }]
  DoclingRemote->>DoclingAPI: POST /v1/... with payload
  DoclingAPI-->>DoclingRemote: Conversion result
  DoclingRemote-->>Client: Result (no return_as_file option)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

@github-actions github-actions Bot added and removed the bug (Something isn't working) label Sep 4, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lfx/src/lfx/components/docling/docling_remote.py (2)

139-146: Add backoff and raise after exhausting 5xx retries

Currently this busy-loops every ~2s and returns None silently. Add exponential backoff and raise to surface errors to callers.

                 if retry_status_start <= response.status_code < retry_status_end:
                     http_failures += 1
                     if http_failures > self.MAX_500_RETRIES:
-                        self.log(f"The status requests got a http response {response.status_code} too many times.")
-                        return None
+                        msg = (f"Status polling failed with {response.status_code} "
+                               f"after {self.MAX_500_RETRIES} retries.")
+                        self.log(msg)
+                        raise RuntimeError(msg)
+                    # simple backoff: 2s, 4s, 8s ... (cap at 30s)
+                    backoff = min(2 ** http_failures, 30)
+                    time.sleep(backoff)
                     continue
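
The retry behaviour suggested in the diff above can be sketched as a standalone function. This is a minimal sketch under stated assumptions, not the component's actual code: poll_with_backoff, fetch_status, and the injectable sleep parameter are hypothetical names, the 500-600 retry range is assumed, and the value of MAX_500_RETRIES is a placeholder because the real constant is not shown in the review.

```python
import time
from typing import Callable

MAX_500_RETRIES = 5  # placeholder value; the real constant is not shown in the review


def poll_with_backoff(
    fetch_status: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = MAX_500_RETRIES,
    cap_seconds: int = 30,
) -> int:
    """Call fetch_status until it returns a non-5xx code.

    Raises RuntimeError once more than max_retries 5xx responses were seen.
    """
    http_failures = 0
    while True:
        status_code = fetch_status()
        if 500 <= status_code < 600:  # assumed retry range
            http_failures += 1
            if http_failures > max_retries:
                msg = (
                    f"Status polling failed with {status_code} "
                    f"after {max_retries} retries."
                )
                raise RuntimeError(msg)
            # Capped exponential backoff: 2s, 4s, 8s, 16s, then cap_seconds.
            sleep(min(2 ** http_failures, cap_seconds))
            continue
        return status_code
```

Passing sleep as a parameter keeps the sketch easy to exercise without real delays; raising instead of returning None means callers see the failure rather than a silent empty result.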

172-175: Set explicit HTTP timeouts on the client

Without explicit timeouts, individual requests can hang regardless of max_poll_timeout.

-            httpx.Client(headers=self.api_headers) as client,
+            httpx.Client(
+                headers=self.api_headers,
+                timeout=httpx.Timeout(connect=10.0, read=30.0, write=30.0, pool=10.0),
+            ) as client,
🧹 Nitpick comments (2)
src/lfx/src/lfx/components/docling/docling_remote.py (2)

106-106: Normalize api_url before joining v1 path

Avoid potential double slashes and odd joins if users provide a trailing slash.

-        base_url = f"{self.api_url}/v1"
+        base_url = f"{self.api_url.rstrip('/')}/v1"
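
To show the effect of the suggested change, here is a small sketch of the same normalization as a free function. The name build_base_url and the example host are hypothetical; only the rstrip('/') join itself comes from the suggestion above.

```python
def build_base_url(api_url: str, version: str = "v1") -> str:
    """Join the API root and version segment without producing a double slash."""
    return f"{api_url.rstrip('/')}/{version}"


# Both forms of user input yield the same base URL.
with_slash = build_base_url("http://localhost:5001/")
without_slash = build_base_url("http://localhost:5001")
```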

109-113: Add early file extension validation before encoding

Pre-validate against VALID_EXTENSIONS before reading the file to fail fast and avoid loading large unsupported files.

-            encoded_doc = base64.b64encode(file_path.read_bytes()).decode()
+            ext = file_path.suffix.lower().lstrip(".")
+            if ext not in self.VALID_EXTENSIONS:
+                self.log(f"Unsupported file extension: {ext}")
+                return None
+            encoded_doc = base64.b64encode(file_path.read_bytes()).decode()
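
The check proposed above could be factored into a small predicate, sketched below. The contents of VALID_EXTENSIONS here are invented for illustration (the component's real list is not shown in this review), and has_supported_extension is a hypothetical helper name.

```python
from pathlib import Path
from typing import Collection

# Hypothetical values; the component's actual VALID_EXTENSIONS is not shown here.
VALID_EXTENSIONS = {"pdf", "docx", "html"}


def has_supported_extension(
    file_path: Path, valid_extensions: Collection[str] = VALID_EXTENSIONS
) -> bool:
    """Return True if the file's extension (case-insensitive, no dot) is allowed."""
    ext = file_path.suffix.lower().lstrip(".")
    return ext in valid_extensions
```

Because the check only inspects the path, it can run before read_bytes() and so avoids loading a large unsupported file into memory.
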
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 63ab2ca and 699a90a.

📒 Files selected for processing (1)
  • src/lfx/src/lfx/components/docling/docling_remote.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Update Starter Projects

@erichare erichare self-requested a review September 4, 2025 14:40
@erichare
Collaborator

erichare commented Sep 4, 2025

Looks perfect! Thanks @dolfim-ibm

@erichare erichare enabled auto-merge September 4, 2025 14:41
@erichare erichare added this pull request to the merge queue Sep 4, 2025
@github-actions github-actions Bot added and removed the bug (Something isn't working) label Sep 4, 2025
@sonarqubecloud

sonarqubecloud Bot commented Sep 4, 2025

Merged via the queue into langflow-ai:main with commit 4a1e6de Sep 4, 2025
19 checks passed
Labels

bug Something isn't working

2 participants