fix: move Docling worker to base module and update imports #9471

Merged: ogabrielluiz merged 2 commits into main from move-docling-worker, Aug 22, 2025
Conversation

@ogabrielluiz
Contributor

@ogabrielluiz ogabrielluiz commented Aug 21, 2025

Move the docling_worker import to docling_utils to enhance code organization and maintainability. This change simplifies the import structure within the project.

Summary by CodeRabbit

  • New Features

    • Background document-processing worker supporting “standard” and “VLM” pipelines, with optional OCR.
  • Improvements

    • More robust processing with per-file error handling and continued execution.
    • Clearer guidance when the optional Docling dependency isn’t installed.
    • Enhanced logging and graceful shutdown behavior.
  • Bug Fixes

    • Reduced crashes/hangs during cancellation or interruptions.
  • Refactor

    • Moved the worker to a dedicated utility module and streamlined the Docling component module.
    • Updated import paths to the new worker location.

@coderabbitai
Contributor

coderabbitai Bot commented Aug 21, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Moved the Docling processing worker into a new base utility module, removed the worker and related imports from the components package, and updated an inline component to import the worker from the new location. The worker supports “standard” and “vlm” pipelines, per-file processing, signal-aware shutdown, and queue-based result reporting.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Introduce docling worker utility: src/backend/base/langflow/base/data/docling_utils.py | Added docling_worker with signal handling, lazy dependency loading, “standard”/“vlm” pipeline setup, per-file processing, and queue-based result aggregation and error reporting. |
| Components cleanup (remove worker from package init): src/backend/base/langflow/components/docling/__init__.py | Removed docling_worker and related imports; retained only lazy-import exposure for Docling components. |
| Update import to new worker location: src/backend/base/langflow/components/docling/docling_inline.py | Switched docling_worker import to langflow.base.data.docling_utils; call sites unchanged. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Parent as Parent Process
  participant Worker as docling_worker (separate process)
  participant Docling as Docling Pipelines

  Parent->>Worker: start(file_paths, pipeline, ocr_engine, queue)
  Note over Worker: Register signal handlers<br/>Check shutdown flag
  Worker->>Worker: Lazy-import Docling modules
  alt pipeline == "standard"
    Worker->>Docling: Configure PdfPipelineOptions (+optional OCR)
  else pipeline == "vlm"
    Worker->>Docling: Configure VlmPipelineOptions
  end
  loop for each file_path
    Worker->>Docling: convert/process(file_path)
    alt success
      Worker-->>Parent: queue.put({file_path, document, status: "ok"})
    else error
      Worker-->>Parent: queue.put({file_path, error, status: "error"})
    end
    opt shutdown signaled
      Worker-->>Parent: queue.put({status: "shutdown"})
      Worker--xParent: exit
    end
  end
  Worker-->>Parent: queue.put({status: "done"})
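
The loop in the diagram can be sketched roughly as follows. This is a minimal illustration and not the code from docling_utils.py: `worker_sketch`, `convert`, and `should_stop` are hypothetical names, `convert` stands in for whatever Docling conversion call the worker makes, and the shutdown check is placed before each file for simplicity.

```python
import queue
from typing import Any, Callable, Iterable


def worker_sketch(
    file_paths: Iterable[str],
    convert: Callable[[str], Any],  # hypothetical stand-in for the Docling conversion call
    out_queue: "queue.Queue[dict]",
    should_stop: Callable[[], bool] = lambda: False,  # hypothetical shutdown-flag check
) -> None:
    """Process files one by one and report each outcome on the queue."""
    for path in file_paths:
        if should_stop():
            # Tell the parent we stopped early, then exit the loop
            out_queue.put({"status": "shutdown"})
            return
        try:
            document = convert(path)
        except Exception as exc:  # one bad file must not stop the batch
            out_queue.put({"file_path": path, "error": str(exc), "status": "error"})
        else:
            out_queue.put({"file_path": path, "document": document, "status": "ok"})
    # Sentinel so the parent knows no more results are coming
    out_queue.put({"status": "done"})
```

In a real deployment the queue would be a multiprocessing queue shared with the parent process; a plain `queue.Queue` is enough to exercise the control flow.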

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

refactor, lgtm

Suggested reviewers

  • phact
  • edwinjosechittilappilly
  • rjordanfrazier
  • jordanrfrazier

@github-actions github-actions Bot added the lgtm This PR has been approved by a maintainer label Aug 21, 2025
@coderabbitai coderabbitai Bot changed the title from "@coderabbitai" to "refactor(utils): move Docling worker to base module and update imports" Aug 21, 2025
@github-actions github-actions Bot added the refactor Maintenance tasks and housekeeping label Aug 21, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)

171-180: Bug: mismatch with new worker error string — missing-Docling path won’t raise ImportError

The worker now emits a message starting with “Docling is an optional dependency of Langflow...”, while this code only checks for “Docling is not installed”. Result: you’ll raise RuntimeError instead of ImportError, breaking caller expectations and UX.

Apply this diff to accept both phrasings:

-        if isinstance(result, dict) and "error" in result:
-            msg = result["error"]
-            if msg.startswith("Docling is not installed"):
-                raise ImportError(msg)
+        if isinstance(result, dict) and "error" in result:
+            msg = result["error"]
+            # Normalize missing-Docling errors from the worker
+            if msg.startswith("Docling is not installed") or "optional dependency of Langflow" in msg:
+                raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
             if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)
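
The same decision logic can be pulled into a small standalone helper, which makes the string matching easy to test. This is only a sketch: `handle_worker_result` and `MISSING_DOCLING_MARKERS` are hypothetical names, and it uses substring matching for both phrasings, which is slightly looser than the `startswith` check in the diff above.

```python
# Phrasings that indicate Docling itself is missing (taken from the review above)
MISSING_DOCLING_MARKERS = (
    "Docling is not installed",
    "optional dependency of Langflow",
)


def handle_worker_result(result):
    """Return the usable result, or raise the exception type callers expect."""
    if not (isinstance(result, dict) and "error" in result):
        return result
    msg = result["error"]
    if any(marker in msg for marker in MISSING_DOCLING_MARKERS):
        raise ImportError(msg)
    if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
        return []  # cancelled by the user: empty result instead of an error
    raise RuntimeError(msg)
```

Matching on message text stays brittle either way; a structured field in the worker's error payload would avoid it, but that would be a change beyond what this review proposes.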
🧹 Nitpick comments (5)
src/backend/base/langflow/components/docling/docling_inline.py (3)

77-111: Robust process health monitor — minor UX nit: consider configurable timeout

The loop correctly handles both normal completion and crash-before-result. Consider exposing timeout as a component input (advanced section) so users with large batches can tune beyond 300s without code changes.
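
A monitor loop with a tunable timeout might look like the sketch below. The function name `wait_for_result`, its parameters, and the error messages are assumptions for illustration; only the 300-second default comes from the comment above. It works with any queue whose `get` raises `queue.Empty` on timeout, which includes `multiprocessing` queues.

```python
import queue
import time


def wait_for_result(result_queue, is_alive, timeout=300.0, poll_interval=0.5):
    """Poll the queue until a result arrives, the worker dies, or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return result_queue.get(timeout=poll_interval)
        except queue.Empty:
            if not is_alive():
                # The worker may have finished between the poll and the liveness check
                try:
                    return result_queue.get_nowait()
                except queue.Empty:
                    raise RuntimeError("Worker exited without returning a result") from None
    raise TimeoutError(f"No result from worker within {timeout} seconds")
```

A component could pass `proc.is_alive` as `is_alive` and feed `timeout` from an advanced input, so large batches can be given more time without code changes.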


112-131: Don't send SIGTERM preemptively — try a short join before escalating

You currently send SIGTERM immediately, even after a successful result, which can cut off worker’s finalization/logging unnecessarily. Prefer “wait → TERM → KILL”.

Apply this diff:

     def _terminate_process_gracefully(self, proc, timeout_terminate: int = 10, timeout_kill: int = 5):
         """Terminate process gracefully with escalating signals.
 
         First tries SIGTERM, then SIGKILL if needed.
         """
         if not proc.is_alive():
             return
 
-        self.log("Attempting graceful process termination with SIGTERM")
-        proc.terminate()  # Send SIGTERM
-        proc.join(timeout=timeout_terminate)
+        # Give the process a chance to exit cleanly without signals
+        self.log("Waiting for worker process to exit cleanly")
+        proc.join(timeout=timeout_terminate)
+
+        if not proc.is_alive():
+            return
+
+        self.log("Attempting graceful process termination with SIGTERM")
+        proc.terminate()  # Send SIGTERM
+        proc.join(timeout=timeout_terminate)
 
         if proc.is_alive():
             self.log("Process didn't respond to SIGTERM, using SIGKILL")
             proc.kill()  # Send SIGKILL
             proc.join(timeout=timeout_kill)
 
             if proc.is_alive():
                 self.log("Warning: Process still alive after SIGKILL")
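
Outside the component class, the "wait → TERM → KILL" escalation can be sketched as a plain function. `stop_process` and its return values are hypothetical; `proc` is assumed to behave like a `multiprocessing.Process`, whose `join`, `is_alive`, `terminate`, and `kill` methods are the only calls used.

```python
def stop_process(proc, log=print, wait_timeout=10, term_timeout=10, kill_timeout=5):
    """Escalate wait -> SIGTERM -> SIGKILL; return the step that ended the process."""
    # Step 1: give the process a chance to exit on its own
    proc.join(timeout=wait_timeout)
    if not proc.is_alive():
        return "exited"

    # Step 2: ask politely (SIGTERM on POSIX)
    log("Attempting graceful process termination with SIGTERM")
    proc.terminate()
    proc.join(timeout=term_timeout)
    if not proc.is_alive():
        return "terminated"

    # Step 3: force it (SIGKILL on POSIX)
    log("Process didn't respond to SIGTERM, using SIGKILL")
    proc.kill()
    proc.join(timeout=kill_timeout)
    return "killed" if not proc.is_alive() else "stuck"
```

Returning the step that succeeded is a design choice of this sketch, not of the PR; it lets a caller log or assert on how the worker actually stopped.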

141-144: Optional: name the worker process for easier debugging

A process name shows up in debuggers and logs.

Apply this diff:

         proc = ctx.Process(
-            target=docling_worker,
+            name="DoclingWorker",
+            target=docling_worker,
             args=(file_paths, queue, self.pipeline, self.ocr_engine),
         )
src/backend/base/langflow/base/data/docling_utils.py (2)

114-121: Consider standardizing the missing-Docling message prefix

For downstream consistency, consider starting the message with “Docling is not installed.” This aligns with the check added in the component and reduces string-matching brittleness.

Apply this minimal tweak:

-        msg = (
-            "Docling is an optional dependency of Langflow. "
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
             "Install with `uv pip install 'langflow[docling]'` "
             "or refer to the documentation"
         )

167-173: DocumentConverter provides built-in defaults for missing formats

The DocumentConverter implementation and documentation confirm that any InputFormat not explicitly provided in the format_options mapping will be populated with the library’s built-in default FormatOptions. This means that formats such as DOCX, PPTX, XLSX, and others will automatically receive sensible defaults without any additional code changes.

Mapping InputFormat.IMAGE → PdfFormatOption is technically supported if you want to apply the PDF processing pipeline (OCR, table extraction, etc.) to images, but it’s not the semantic default. Unless your use case specifically requires PDF-style handling for images, you can:

  • Omit the InputFormat.IMAGE entry entirely and let DocumentConverter assign its default ImageFormatOption.
  • Or explicitly map InputFormat.IMAGE → an ImageFormatOption instance, which better reflects standard image processing behavior.

If the intent is to reuse the PDF pipeline for images, keep the current mapping but verify that your PdfPipelineOptions and backends are configured appropriately for standalone image files (see examples in Docling issue #576).

Locations to review:

  • src/backend/base/langflow/base/data/docling_utils.py (lines 167–173):
    • No changes required for DOCX/PPTX/XLSX support—defaults cover those formats.
    • Optional: Replace the InputFormat.IMAGE → pdf_format_option mapping with either an omitted key or an image_format_option to use the semantic default.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 43119d0 and 369cf47.

📒 Files selected for processing (3)
  • src/backend/base/langflow/base/data/docling_utils.py (2 hunks)
  • src/backend/base/langflow/components/docling/__init__.py (0 hunks)
  • src/backend/base/langflow/components/docling/docling_inline.py (1 hunks)
💤 Files with no reviewable changes (1)
  • src/backend/base/langflow/components/docling/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

  • src/backend/base/langflow/components/docling/docling_inline.py
{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

  • src/backend/base/langflow/components/docling/docling_inline.py
  • src/backend/base/langflow/base/data/docling_utils.py
src/backend/**/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/base/langflow/components/docling/docling_inline.py
🧬 Code graph analysis (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)
src/backend/base/langflow/base/data/docling_utils.py (1)
  • docling_worker (60-245)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Optimize new Python code in this PR
🔇 Additional comments (7)
src/backend/base/langflow/components/docling/docling_inline.py (3)

6-6: Import relocation LGTM — centralizes worker and avoids component-level heavy deps

Switching to from langflow.base.data.docling_utils import docling_worker matches the PR goal and keeps the component lean.


16-17: Icon mapping present and case-correct

icon = "Docling" matches the frontend mapping requirement.


139-170: Approved: spawn context & importable target validated

The use of get_context("spawn") with the docling_worker top-level function ensures correct, cross-platform process spawning (Windows/macOS). Verification confirms there are no remaining imports from the old langflow.components.docling path—everything now references langflow.base.data.docling_utils. Code changes are clean and ready to merge.

src/backend/base/langflow/base/data/docling_utils.py (4)

13-57: Document extraction logic is solid — clear error messages and type checks

Good handling of DataFrame vs Data vs list[Data], with explicit, actionable errors.


91-99: Signal handling path is appropriate and guarded for platform differences

Registering SIGTERM/SIGINT with a guarded try/except is the right call for cross-platform support.
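
A guarded registration of this kind could look like the following sketch. The helper name `install_shutdown_handlers` and the use of a `threading.Event` as the shutdown flag are assumptions for illustration, not the worker's actual implementation.

```python
import signal
import threading
from contextlib import suppress


def install_shutdown_handlers(shutdown_flag: threading.Event):
    """Register SIGTERM/SIGINT handlers that only set a flag; return the handler."""

    def _handler(signum, frame):
        # Keep the handler tiny: the main loop checks the flag between files
        shutdown_flag.set()

    for name in ("SIGTERM", "SIGINT"):
        sig = getattr(signal, name, None)
        if sig is None:
            continue  # signal not defined on this platform
        # signal.signal raises ValueError outside the main thread
        with suppress(ValueError, OSError):
            signal.signal(sig, _handler)
    return _handler
```

Setting a flag and letting the processing loop notice it keeps shutdown cooperative, which is what allows the worker to report a final status before exiting.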


215-225: Result packaging is clear — successful conversions only; preserves positional mapping via None placeholders

This makes downstream rollup straightforward. Logging total successes is a nice touch.


226-245: Good shutdown semantics and error propagation

Graceful handling of KeyboardInterrupt and unexpected exceptions with traceback payloads is appropriate for a worker process.

import traceback
from contextlib import suppress

from docling_core.types.doc import DoclingDocument
Contributor

⚠️ Potential issue

Critical: top-level import of Docling types breaks optional-dependency contract

from docling_core.types.doc import DoclingDocument at module import time will raise ModuleNotFoundError if Docling isn’t installed, causing any import of this module (and thus the component) to fail immediately. The worker carefully lazy-loads Docling; this top-level import defeats that.

Move the import inside extract_docling_documents and surface a friendly error:

-import signal
-import sys
-import traceback
-from contextlib import suppress
-
-from docling_core.types.doc import DoclingDocument
-from loguru import logger
+import signal
+import sys
+import traceback
+from contextlib import suppress
+
+from loguru import logger

And at the start of extract_docling_documents:

-def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str) -> list[DoclingDocument]:
+def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
+    # Lazy import to keep Docling optional and avoid module import-time failures
+    try:
+        from docling_core.types.doc import DoclingDocument  # type: ignore
+    except ModuleNotFoundError as e:
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
+            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
+        )
+        raise TypeError(msg) from e

Note: returning a list[DoclingDocument] remains true at runtime; keep the annotation if your tooling supports postponed evaluation, or keep it un-annotated to avoid forward-reference to an optional type.
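
One way to keep the annotation without importing Docling at module load is a `TYPE_CHECKING` guard plus a string annotation, combined with a lazy import at call time. The sketch below uses hypothetical helper names (`require_module`, `extract_documents_sketch`) and, unlike the suggestion above, raises ImportError rather than TypeError so that callers checking for a missing dependency see the conventional exception type.

```python
import importlib
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers, so the module still imports without Docling
    from docling_core.types.doc import DoclingDocument


def require_module(module_name: str, install_hint: str) -> Any:
    """Import a module on first use, raising ImportError with a friendly hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        msg = f"{module_name} is not installed. {install_hint}"
        raise ImportError(msg) from exc


def extract_documents_sketch(items: list[Any]) -> "list[DoclingDocument]":
    """Keep only DoclingDocument instances; Docling is imported lazily here."""
    doc_types = require_module(
        "docling_core.types.doc",
        "Install with `uv pip install 'langflow[docling]'` or refer to the documentation",
    )
    return [item for item in items if isinstance(item, doc_types.DoclingDocument)]
```

Because the return annotation is a string, it is never evaluated at import time, so the module loads cleanly whether or not the optional dependency is present.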

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from docling_core.types.doc import DoclingDocument
import signal
import sys
import traceback
from contextlib import suppress
from loguru import logger
def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
    # Lazy import to keep Docling optional and avoid module import-time failures
    try:
        from docling_core.types.doc import DoclingDocument  # type: ignore
    except ModuleNotFoundError as e:
        msg = (
            "Docling is not installed. Docling is an optional dependency of Langflow. "
            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
        )
        raise TypeError(msg) from e
    # …rest of the original function body…
🤖 Prompt for AI Agents
In src/backend/base/langflow/base/data/docling_utils.py around line 6: the
top-level import "from docling_core.types.doc import DoclingDocument" will raise
ModuleNotFoundError when Docling is not installed and defeats lazy-loading; move
that import into extract_docling_documents so the module only tries to import
Docling at runtime inside the function, wrap the import in a try/except
ImportError and raise or return a clear, friendly error message explaining
Docling is optional and needs to be installed (or return an empty
list/appropriate fallback), and keep or remove the return type annotation
accordingly (use postponed evaluation like "list[DoclingDocument]" only if your
tooling supports it, otherwise omit the annotation to avoid referencing the
optional type at import time).

@ogabrielluiz ogabrielluiz changed the title from "refactor(utils): move Docling worker to base module and update imports" to "refactor: move Docling worker to base module and update imports" Aug 22, 2025
@ogabrielluiz ogabrielluiz removed the lgtm This PR has been approved by a maintainer label Aug 22, 2025
@github-actions github-actions Bot added refactor Maintenance tasks and housekeeping and removed refactor Maintenance tasks and housekeeping labels Aug 22, 2025
@ogabrielluiz ogabrielluiz added the lgtm This PR has been approved by a maintainer label Aug 22, 2025
@ogabrielluiz ogabrielluiz changed the title from "refactor: move Docling worker to base module and update imports" to "fix: move Docling worker to base module and update imports" Aug 22, 2025
@github-actions github-actions Bot added bug Something isn't working and removed refactor Maintenance tasks and housekeeping labels Aug 22, 2025
@codecov

codecov Bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 5.55556% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.41%. Comparing base (a1629a7) to head (8d77eb6).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...c/backend/base/langflow/base/data/docling_utils.py | 5.55% | 102 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (5.55%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (2.67%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #9471      +/-   ##
==========================================
- Coverage   33.94%   33.41%   -0.53%     
==========================================
  Files        1196     1186      -10     
  Lines       56116    56218     +102     
  Branches     5331     5363      +32     
==========================================
- Hits        19046    18783     -263     
- Misses      37000    37375     +375     
+ Partials       70       60      -10     
| Flag | Coverage Δ |
| --- | --- |
| backend | 56.32% <5.55%> (-0.17%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...c/backend/base/langflow/base/data/docling_utils.py | 7.04% <5.55%> (-4.73%) ⬇️ |

... and 45 files with indirect coverage changes

@italojohnny italojohnny added lgtm This PR has been approved by a maintainer and removed lgtm This PR has been approved by a maintainer labels Aug 22, 2025
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 22, 2025
@ogabrielluiz ogabrielluiz added this pull request to the merge queue Aug 22, 2025
Merged via the queue into main with commit 877638b Aug 22, 2025
71 of 73 checks passed
@ogabrielluiz ogabrielluiz deleted the move-docling-worker branch August 22, 2025 19:46
lucaseduoli pushed a commit that referenced this pull request Aug 22, 2025
refactor: move docling_worker import to docling_utils for better organization

Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>
lucaseduoli pushed a commit that referenced this pull request Aug 25, 2025
refactor: move docling_worker import to docling_utils for better organization

Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>
Labels

bug Something isn't working lgtm This PR has been approved by a maintainer