fix: move Docling worker to base module and update imports by ogabrielluiz · Pull Request #9471 · langflow-ai/langflow

ogabrielluiz · 2025-08-21T13:55:06Z

Move the docling_worker import to docling_utils to enhance code organization and maintainability. This change simplifies the import structure within the project.

Summary by CodeRabbit

New Features
- Background document-processing worker supporting “standard” and “VLM” pipelines, with optional OCR.
Improvements
- More robust processing with per-file error handling and continued execution.
- Clearer guidance when the optional Docling dependency isn’t installed.
- Enhanced logging and graceful shutdown behavior.
Bug Fixes
- Reduced crashes/hangs during cancellation or interruptions.
Refactor
- Moved the worker to a dedicated utility module and streamlined the Docling component module.
- Updated import paths to the new worker location.

coderabbitai · 2025-08-21T13:55:15Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Moved the Docling processing worker into a new base utility module, removed the worker and related imports from the components package, and updated an inline component to import the worker from the new location. The worker supports “standard” and “vlm” pipelines, per-file processing, signal-aware shutdown, and queue-based result reporting.

Changes

Cohort / File(s)	Summary of changes
Introduce docling worker utility `src/backend/base/langflow/base/data/docling_utils.py`	Added docling_worker with signal handling, lazy dependency loading, “standard”/“vlm” pipeline setup, per-file processing, and queue-based result aggregation and error reporting.
Components cleanup (remove worker from package init) `src/backend/base/langflow/components/docling/__init__.py`	Removed docling_worker and related imports; retained only lazy-import exposure for Docling components.
Update import to new worker location `src/backend/base/langflow/components/docling/docling_inline.py`	Switched docling_worker import to langflow.base.data.docling_utils; call sites unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Parent as Parent Process
  participant Worker as docling_worker (separate process)
  participant Docling as Docling Pipelines

  Parent->>Worker: start(file_paths, pipeline, ocr_engine, queue)
  Note over Worker: Register signal handlers<br/>Check shutdown flag
  Worker->>Worker: Lazy-import Docling modules
  alt pipeline == "standard"
    Worker->>Docling: Configure PdfPipelineOptions (+optional OCR)
  else pipeline == "vlm"
    Worker->>Docling: Configure VlmPipelineOptions
  end
  loop for each file_path
    Worker->>Docling: convert/process(file_path)
    alt success
      Worker-->>Parent: queue.put({file_path, document, status: "ok"})
    else error
      Worker-->>Parent: queue.put({file_path, error, status: "error"})
    end
    opt shutdown signaled
      Worker-->>Parent: queue.put({status: "shutdown"})
      Worker--xParent: exit
    end
  end
  Worker-->>Parent: queue.put({status: "done"})
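
The flow in the diagram can be approximated with a small, dependency-free sketch. Everything named here is hypothetical: `run_worker`, the `convert` callable, the `shutdown_requested` callable, and the message shapes simply mirror the diagram and are not the actual `docling_worker` code. The sketch checks the shutdown flag at the top of each iteration, which is an assumption about ordering.

```python
import queue
from typing import Callable, Iterable


def run_worker(
    file_paths: Iterable[str],
    result_queue: "queue.Queue[dict]",
    convert: Callable[[str], object],
    shutdown_requested: Callable[[], bool] = lambda: False,
) -> None:
    """Process files one by one and report each outcome on the queue."""
    for file_path in file_paths:
        if shutdown_requested():
            # Stop early and tell the parent that no further results will follow.
            result_queue.put({"status": "shutdown"})
            return
        try:
            document = convert(file_path)
        except Exception as exc:  # one bad file must not abort the whole batch
            result_queue.put({"file_path": file_path, "error": str(exc), "status": "error"})
        else:
            result_queue.put({"file_path": file_path, "document": document, "status": "ok"})
    result_queue.put({"status": "done"})


def fake_convert(path: str) -> str:
    """Stand-in for a real converter; fails on names containing 'bad'."""
    if "bad" in path:
        raise ValueError(f"cannot parse {path}")
    return f"converted:{path}"


results: "queue.Queue[dict]" = queue.Queue()
run_worker(["a.pdf", "bad.pdf", "c.pdf"], results, fake_convert)

messages = []
while not results.empty():
    messages.append(results.get())
statuses = [m["status"] for m in messages]
```

In a real deployment the function would be the target of a separate process and the queue would come from the same multiprocessing context; an in-process `queue.Queue` is used here only to keep the sketch easy to run.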

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

refactor(docling): extract processing logic to separate worker process #9393 — Performs a similar extraction of docling_worker into multiprocessing context; overlaps with moving the worker location and updating imports.

Suggested labels

refactor, lgtm

Suggested reviewers

phact
edwinjosechittilappilly
rjordanfrazier
jordanrfrazier


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/backend/base/langflow/components/docling/docling_inline.py (1)

171-180: Bug: mismatch with new worker error string — missing-Docling path won’t raise ImportError

The worker now emits a message starting with “Docling is an optional dependency of Langflow...”, while this code only checks for “Docling is not installed”. Result: you’ll raise RuntimeError instead of ImportError, breaking caller expectations and UX.

Apply this diff to accept both phrasings:

-        if isinstance(result, dict) and "error" in result:
-            msg = result["error"]
-            if msg.startswith("Docling is not installed"):
-                raise ImportError(msg)
+        if isinstance(result, dict) and "error" in result:
+            msg = result["error"]
+            # Normalize missing-Docling errors from the worker
+            if msg.startswith("Docling is not installed") or "optional dependency of Langflow" in msg:
+                raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
             if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)
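
One way to make this string matching easier to test is to isolate it in a small helper. The following is only a sketch of the logic in the diff above: `raise_for_worker_error` is a hypothetical name, and the message fragments are the ones quoted in this comment.

```python
def raise_for_worker_error(result):
    """Return the result unchanged, an empty list on cancellation, or raise."""
    if not (isinstance(result, dict) and "error" in result):
        return result

    msg = result["error"]
    # Accept both phrasings of the missing-Docling message.
    if msg.startswith("Docling is not installed") or "optional dependency of Langflow" in msg:
        raise ImportError(msg)
    # A user-initiated cancellation is not an error for the caller.
    if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
        return []
    raise RuntimeError(msg)


cancelled = raise_for_worker_error({"error": "Worker interrupted by SIGINT"})
passthrough = raise_for_worker_error(["doc-1", "doc-2"])
```

Keeping the mapping in one function means a change to the worker's wording only has to be reflected in one place.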

🧹 Nitpick comments (5)

src/backend/base/langflow/components/docling/docling_inline.py (3)

77-111: Robust process health monitor — minor UX nit: consider configurable timeout

The loop correctly handles both normal completion and crash-before-result. Consider exposing timeout as a component input (advanced section) so users with large batches can tune beyond 300s without code changes.
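
A monitor loop with a caller-supplied timeout might look roughly like the sketch below. It is not the component's code: `wait_for_result`, `is_alive`, and `poll_interval` are hypothetical names, and only the 300-second default is taken from the comment above.

```python
import queue
import time
from typing import Callable


def wait_for_result(
    result_queue: "queue.Queue",
    is_alive: Callable[[], bool],
    timeout: float = 300.0,
    poll_interval: float = 0.05,
):
    """Poll the queue until a result arrives, the worker dies, or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return result_queue.get(timeout=poll_interval)
        except queue.Empty:
            if not is_alive():
                # The worker may have queued a result just before exiting.
                try:
                    return result_queue.get_nowait()
                except queue.Empty:
                    raise RuntimeError("worker exited without producing a result") from None
    raise TimeoutError(f"no result within {timeout} seconds")


ready: "queue.Queue" = queue.Queue()
ready.put(["doc"])
got = wait_for_result(ready, is_alive=lambda: True, timeout=1.0)
```

With a real worker, `is_alive` would be the process object's `is_alive` method, and `timeout` would come from the component input suggested above.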

112-131: Don't send SIGTERM preemptively — try a short join before escalating

You currently send SIGTERM immediately, even after a successful result, which can cut off worker’s finalization/logging unnecessarily. Prefer “wait → TERM → KILL”.

Apply this diff:
     def _terminate_process_gracefully(self, proc, timeout_terminate: int = 10, timeout_kill: int = 5):
         """Terminate process gracefully with escalating signals.
 
         First tries SIGTERM, then SIGKILL if needed.
         """
         if not proc.is_alive():
             return
 
-        self.log("Attempting graceful process termination with SIGTERM")
-        proc.terminate()  # Send SIGTERM
-        proc.join(timeout=timeout_terminate)
+        # Give the process a chance to exit cleanly without signals
+        self.log("Waiting for worker process to exit cleanly")
+        proc.join(timeout=timeout_terminate)
+
+        if not proc.is_alive():
+            return
+
+        self.log("Attempting graceful process termination with SIGTERM")
+        proc.terminate()  # Send SIGTERM
+        proc.join(timeout=timeout_terminate)
 
         if proc.is_alive():
             self.log("Process didn't respond to SIGTERM, using SIGKILL")
             proc.kill()  # Send SIGKILL
             proc.join(timeout=timeout_kill)
 
             if proc.is_alive():
                 self.log("Warning: Process still alive after SIGKILL")
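
The "wait → TERM → KILL" order can also be shown as a standalone function. This is a sketch under assumptions: `stop_process` and `FakeProcess` are hypothetical names, and the function relies only on the `join`, `is_alive`, `terminate`, and `kill` methods that `multiprocessing.Process` provides.

```python
def stop_process(proc, log=print, wait_timeout=10.0, kill_timeout=5.0):
    """Wait for a clean exit, then escalate to terminate() and finally kill()."""
    proc.join(timeout=wait_timeout)
    if not proc.is_alive():
        return "exited"

    log("Attempting graceful process termination with SIGTERM")
    proc.terminate()
    proc.join(timeout=wait_timeout)
    if not proc.is_alive():
        return "terminated"

    log("Process didn't respond to SIGTERM, using SIGKILL")
    proc.kill()
    proc.join(timeout=kill_timeout)
    return "killed" if not proc.is_alive() else "stuck"


class FakeProcess:
    """Test double: 0 exits on its own, 1 needs terminate, 2 needs kill, 3 never dies."""

    def __init__(self, survives: int):
        self.survives = survives
        self.calls = []

    def is_alive(self):
        return self.survives > 0

    def join(self, timeout=None):
        self.calls.append("join")

    def terminate(self):
        self.calls.append("terminate")
        if self.survives == 1:
            self.survives = 0

    def kill(self):
        self.calls.append("kill")
        if self.survives == 2:
            self.survives = 0


outcomes = [stop_process(FakeProcess(n), log=lambda _msg: None) for n in range(4)]
```

The fake process exists only so the escalation order can be exercised without spawning anything.
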
141-144: Optional: name the worker process for easier debugging

A process name shows up in debuggers and logs.

Apply this diff:
         proc = ctx.Process(
-            target=docling_worker,
+            name="DoclingWorker",
+            target=docling_worker,
             args=(file_paths, queue, self.pipeline, self.ocr_engine),
         )

src/backend/base/langflow/base/data/docling_utils.py (2)

114-121: Consider standardizing the missing-Docling message prefix

For downstream consistency, consider starting the message with “Docling is not installed.” This aligns with the check added in the component and reduces string-matching brittleness.

Apply this minimal tweak:
-        msg = (
-            "Docling is an optional dependency of Langflow. "
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
             "Install with `uv pip install 'langflow[docling]'` "
             "or refer to the documentation"
         )

167-173: DocumentConverter provides built-in defaults for missing formats

The DocumentConverter implementation and documentation confirm that any InputFormat not explicitly provided in the format_options mapping will be populated with the library’s built-in default FormatOptions. This means that formats such as DOCX, PPTX, XLSX, and others will automatically receive sensible defaults without any additional code changes.

Mapping InputFormat.IMAGE → PdfFormatOption is technically supported if you want to apply the PDF processing pipeline (OCR, table extraction, etc.) to images, but it’s not the semantic default. Unless your use case specifically requires PDF-style handling for images, you can:

Omit the InputFormat.IMAGE entry entirely and let DocumentConverter assign its default ImageFormatOption.

Or explicitly map InputFormat.IMAGE → an ImageFormatOption instance, which better reflects standard image processing behavior.

If the intent is to reuse the PDF pipeline for images, keep the current mapping but verify that your PdfPipelineOptions and backends are configured appropriately for standalone image files (see examples in Docling issue #576).

Locations to review:

src/backend/base/langflow/base/data/docling_utils.py (lines 167–173):

No changes required for DOCX/PPTX/XLSX support—defaults cover those formats.

Optional: Replace the InputFormat.IMAGE → pdf_format_option mapping with either an omitted key or an image_format_option to use the semantic default.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 43119d0 and 369cf47.

📒 Files selected for processing (3)

src/backend/base/langflow/base/data/docling_utils.py (2 hunks)
src/backend/base/langflow/components/docling/__init__.py (0 hunks)
src/backend/base/langflow/components/docling/docling_inline.py (1 hunks)

💤 Files with no reviewable changes (1)

src/backend/base/langflow/components/docling/__init__.py

🧰 Additional context used

📓 Path-based instructions (3)

src/backend/base/langflow/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

src/backend/base/langflow/components/docling/docling_inline.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/base/langflow/components/docling/docling_inline.py
src/backend/base/langflow/base/data/docling_utils.py

src/backend/**/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/base/langflow/components/docling/docling_inline.py

🧬 Code graph analysis (1)

src/backend/base/langflow/components/docling/docling_inline.py (1)

src/backend/base/langflow/base/data/docling_utils.py (1)

docling_worker (60-245)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Optimize new Python code in this PR

🔇 Additional comments (7)

src/backend/base/langflow/components/docling/docling_inline.py (3)

6-6: Import relocation LGTM — centralizes worker and avoids component-level heavy deps

Switching to from langflow.base.data.docling_utils import docling_worker matches the PR goal and keeps the component lean.

16-17: Icon mapping present and case-correct

icon = "Docling" matches the frontend mapping requirement.

139-170: Approved: spawn context & importable target validated

The use of get_context("spawn") with the docling_worker top-level function ensures correct, cross-platform process spawning (Windows/macOS). Verification confirms there are no remaining imports from the old langflow.components.docling path—everything now references langflow.base.data.docling_utils. Code changes are clean and ready to merge.

src/backend/base/langflow/base/data/docling_utils.py (4)

13-57: Document extraction logic is solid — clear error messages and type checks

Good handling of DataFrame vs Data vs list[Data], with explicit, actionable errors.

91-99: Signal handling path is appropriate and guarded for platform differences

Registering SIGTERM/SIGINT with a guarded try/except is the right call for cross-platform support.

215-225: Result packaging is clear — successful conversions only; preserves positional mapping via None placeholders

This makes downstream rollup straightforward. Logging total successes is a nice touch.

226-245: Good shutdown semantics and error propagation

Graceful handling of KeyboardInterrupt and unexpected exceptions with traceback payloads is appropriate for a worker process.

coderabbitai · 2025-08-21T14:03:07Z

+import traceback
+from contextlib import suppress
+
 from docling_core.types.doc import DoclingDocument


⚠️ Potential issue

Critical: top-level import of Docling types breaks optional-dependency contract

from docling_core.types.doc import DoclingDocument at module import time will raise ModuleNotFoundError if Docling isn’t installed, causing any import of this module (and thus the component) to fail immediately. The worker carefully lazy-loads Docling; this top-level import defeats that.

Move the import inside extract_docling_documents and surface a friendly error:

-import signal
-import sys
-import traceback
-from contextlib import suppress
-
-from docling_core.types.doc import DoclingDocument
-from loguru import logger
+import signal
+import sys
+import traceback
+from contextlib import suppress
+
+from loguru import logger

And at the start of extract_docling_documents:

-def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str) -> list[DoclingDocument]:
+def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
+    # Lazy import to keep Docling optional and avoid module import-time failures
+    try:
+        from docling_core.types.doc import DoclingDocument  # type: ignore
+    except ModuleNotFoundError as e:
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
+            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
+        )
+        raise TypeError(msg) from e

Note: returning a list[DoclingDocument] remains true at runtime; keep the annotation if your tooling supports postponed evaluation, or keep it un-annotated to avoid forward-reference to an optional type.
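
To illustrate the note about annotations, here is a minimal sketch of keeping the type visible to type checkers while deferring the real import to call time. `load_optional` and `extract_documents` are hypothetical helpers, and the sketch raises ImportError rather than the TypeError used in the suggestion above, purely as an illustration choice.

```python
import importlib
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers; never executed, so the package stays optional.
    from docling_core.types.doc import DoclingDocument


def load_optional(module_name: str, attr: str, install_hint: str) -> Any:
    """Import an attribute at call time and fail with an actionable message."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ImportError(f"{module_name} is not installed. {install_hint}") from exc
    return getattr(module, attr)


def extract_documents(items: list) -> "list[DoclingDocument]":
    """Keep only DoclingDocument instances; the string annotation is never evaluated."""
    doc_cls = load_optional(
        "docling_core.types.doc",
        "DoclingDocument",
        "Install with `uv pip install 'langflow[docling]'`.",
    )
    return [item for item in items if isinstance(item, doc_cls)]


# Importing this module succeeds even when Docling is absent;
# only calling extract_documents() triggers the optional import.
dumps = load_optional("json", "dumps", "It ships with Python.")
```

Because the return annotation is a string, defining `extract_documents` does not reference the optional type at import time.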

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-from docling_core.types.doc import DoclingDocument
+import signal
+import sys
+import traceback
+from contextlib import suppress
+from loguru import logger
+def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
+    # Lazy import to keep Docling optional and avoid module import-time failures
+    try:
+        from docling_core.types.doc import DoclingDocument  # type: ignore
+    except ModuleNotFoundError as e:
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
+            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
+        )
+        raise TypeError(msg) from e
+    # …rest of the original function body…

🤖 Prompt for AI Agents

In src/backend/base/langflow/base/data/docling_utils.py around line 6: the top-level import "from docling_core.types.doc import DoclingDocument" will raise ModuleNotFoundError when Docling is not installed and defeats lazy-loading; move that import into extract_docling_documents so the module only tries to import Docling at runtime inside the function, wrap the import in a try/except ImportError and raise or return a clear, friendly error message explaining Docling is optional and needs to be installed (or return an empty list/appropriate fallback), and keep or remove the return type annotation accordingly (use postponed evaluation like "list[DoclingDocument]" only if your tooling supports it, otherwise omit the annotation to avoid referencing the optional type at import time).

codecov · 2025-08-22T00:42:46Z

Codecov Report

❌ Patch coverage is 5.55556% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.41%. Comparing base (a1629a7) to head (8d77eb6).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...c/backend/base/langflow/base/data/docling_utils.py	5.55%	102 Missing ⚠️

❌ Your patch status has failed because the patch coverage (5.55%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (2.67%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9471      +/-   ##
==========================================
- Coverage   33.94%   33.41%   -0.53%     
==========================================
  Files        1196     1186      -10     
  Lines       56116    56218     +102     
  Branches     5331     5363      +32     
==========================================
- Hits        19046    18783     -263     
- Misses      37000    37375     +375     
+ Partials       70       60      -10

Flag	Coverage Δ
backend	`56.32% <5.55%> (-0.17%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...c/backend/base/langflow/base/data/docling_utils.py	`7.04% <5.55%> (-4.73%)`	⬇️

... and 45 files with indirect coverage changes


sonarqubecloud · 2025-08-22T18:43:02Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

refactor: move docling_worker import to docling_utils for better organization Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>

369cf47

ogabrielluiz requested review from erichare, italojohnny and jordanrfrazier August 21, 2025 13:55

ogabrielluiz enabled auto-merge August 21, 2025 13:55

erichare approved these changes Aug 21, 2025

github-actions Bot added the lgtm This PR has been approved by a maintainer label Aug 21, 2025

coderabbitai Bot changed the title ~~@coderabbitai~~ refactor(utils): move Docling worker to base module and update imports Aug 21, 2025

github-actions Bot added the refactor Maintenance tasks and housekeeping label Aug 21, 2025

coderabbitai Bot reviewed Aug 21, 2025

ogabrielluiz changed the title ~~refactor(utils): move Docling worker to base module and update imports~~ refactor: move Docling worker to base module and update imports Aug 22, 2025

ogabrielluiz removed the lgtm This PR has been approved by a maintainer label Aug 22, 2025

github-actions Bot added refactor Maintenance tasks and housekeeping and removed refactor Maintenance tasks and housekeeping labels Aug 22, 2025

ogabrielluiz added the lgtm This PR has been approved by a maintainer label Aug 22, 2025

ogabrielluiz changed the title ~~refactor: move Docling worker to base module and update imports~~ fix: move Docling worker to base module and update imports Aug 22, 2025

github-actions Bot added bug Something isn't working and removed refactor Maintenance tasks and housekeeping labels Aug 22, 2025

italojohnny added lgtm This PR has been approved by a maintainer and removed lgtm This PR has been approved by a maintainer labels Aug 22, 2025

Merge branch 'main' into move-docling-worker

8d77eb6

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 22, 2025

ogabrielluiz added this pull request to the merge queue Aug 22, 2025

Merged via the queue into main with commit 877638b Aug 22, 2025
71 of 73 checks passed

ogabrielluiz deleted the move-docling-worker branch August 22, 2025 19:46

lucaseduoli pushed a commit that referenced this pull request Aug 22, 2025

fix: move Docling worker to base module and update imports (#9471)

bdf4562

refactor: move docling_worker import to docling_utils for better organization Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>

lucaseduoli pushed a commit that referenced this pull request Aug 25, 2025

fix: move Docling worker to base module and update imports (#9471)

7508a6a

refactor: move docling_worker import to docling_utils for better organization Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>

coderabbitai Bot mentioned this pull request Sep 10, 2025

feat: Use LLM components in Docling #9770

Merged

fix: move Docling worker to base module and update imports #9471
ogabrielluiz merged 2 commits into main from move-docling-worker

3 participants
