fix: move Docling worker to base module and update imports #9471
Review skipped: auto incremental reviews are disabled on this repository.
Walkthrough
Moved the Docling processing worker into a new base utility module, removed the worker and related imports from the components package, and updated an inline component to import the worker from the new location. The worker supports “standard” and “vlm” pipelines, per-file processing, signal-aware shutdown, and queue-based result reporting.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Parent as Parent Process
participant Worker as docling_worker (separate process)
participant Docling as Docling Pipelines
Parent->>Worker: start(file_paths, pipeline, ocr_engine, queue)
Note over Worker: Register signal handlers<br/>Check shutdown flag
Worker->>Worker: Lazy-import Docling modules
alt pipeline == "standard"
Worker->>Docling: Configure PdfPipelineOptions (+optional OCR)
else pipeline == "vlm"
Worker->>Docling: Configure VlmPipelineOptions
end
loop for each file_path
Worker->>Docling: convert/process(file_path)
alt success
Worker-->>Parent: queue.put({file_path, document, status: "ok"})
else error
Worker-->>Parent: queue.put({file_path, error, status: "error"})
end
opt shutdown signaled
Worker-->>Parent: queue.put({status: "shutdown"})
Worker--xParent: exit
end
end
Worker-->>Parent: queue.put({status: "done"})
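The message flow in the diagram above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PR's worker: `run_worker`, `convert`, and `should_stop` are hypothetical names, `convert` stands in for whatever Docling conversion call the real worker makes, and the real worker may package its results differently.

```python
import queue
from typing import Callable, Iterable


def run_worker(
    file_paths: Iterable[str],
    convert: Callable[[str], object],  # hypothetical stand-in for the Docling conversion call
    results: "queue.Queue[dict]",
    should_stop: Callable[[], bool] = lambda: False,
) -> None:
    """Process files one at a time and report each outcome on the queue."""
    for path in file_paths:
        # Check the shutdown flag between files so a signal stops the loop promptly.
        if should_stop():
            results.put({"status": "shutdown"})
            return
        try:
            document = convert(path)
        except Exception as exc:  # report the failure instead of crashing the worker
            results.put({"file_path": path, "error": str(exc), "status": "error"})
        else:
            results.put({"file_path": path, "document": document, "status": "ok"})
    # Sentinel telling the parent that no more results will arrive.
    results.put({"status": "done"})
```

The same function works with a `multiprocessing` queue in place of `queue.Queue`, since both expose `put`.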
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)
171-180: Bug: mismatch with new worker error string, so the missing-Docling path won’t raise ImportError
The worker now emits a message starting with “Docling is an optional dependency of Langflow...”, while this code only checks for “Docling is not installed”. Result: you’ll raise RuntimeError instead of ImportError, breaking caller expectations and UX.
Apply this diff to accept both phrasings:
-        if isinstance(result, dict) and "error" in result:
-            msg = result["error"]
-            if msg.startswith("Docling is not installed"):
-                raise ImportError(msg)
+        if isinstance(result, dict) and "error" in result:
+            msg = result["error"]
+            # Normalize missing-Docling errors from the worker
+            if msg.startswith("Docling is not installed") or "optional dependency of Langflow" in msg:
+                raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
             if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)
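To make the intended behaviour of the diff above easier to check in isolation, the same branching can be pulled into a small standalone function. This is a sketch only: `interpret_worker_result` is a hypothetical name, and the component in the PR performs these checks inline rather than through a helper.

```python
def interpret_worker_result(result):
    """Map a worker result to a return value or the exception the caller expects."""
    if not (isinstance(result, dict) and "error" in result):
        return result  # normal payload, pass through unchanged

    msg = result["error"]
    # Accept both the old and the new phrasing of the missing-Docling message.
    if msg.startswith("Docling is not installed") or "optional dependency of Langflow" in msg:
        raise ImportError(msg)
    # A user-initiated interrupt is not an error: return an empty result.
    if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
        return []
    raise RuntimeError(msg)
```

Keeping the mapping in one function means a unit test can cover all three branches without spawning a process.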
🧹 Nitpick comments (5)
src/backend/base/langflow/components/docling/docling_inline.py (3)
77-111: Robust process health monitor. Minor UX nit: consider a configurable timeout
The loop correctly handles both normal completion and crash-before-result. Consider exposing `timeout` as a component input (advanced section) so users with large batches can tune beyond 300s without code changes.
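A polling loop with a caller-supplied timeout might look like the sketch below. It is an illustration under assumptions, not the component's code: `wait_for_result` and `poll_interval` are hypothetical names, and the only things assumed about `proc` and `result_queue` are `is_alive()`, `get(timeout=...)`, and `get_nowait()`, which both `multiprocessing` and `queue`/`threading` objects provide.

```python
import queue
import time


def wait_for_result(proc, result_queue, timeout: float = 300.0, poll_interval: float = 1.0):
    """Poll the queue until a result arrives, the worker dies, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return result_queue.get(timeout=poll_interval)
        except queue.Empty:
            if not proc.is_alive():
                # One last non-blocking check: the result may have landed just before exit.
                try:
                    return result_queue.get_nowait()
                except queue.Empty:
                    raise RuntimeError("Worker exited without producing a result") from None
    raise TimeoutError(f"Worker did not finish within {timeout} seconds")
```

With the timeout as a parameter, a component input could be passed straight through, for example `wait_for_result(proc, q, timeout=self.timeout)` if such an input were added.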
112-131: Don't send SIGTERM preemptively: try a short join before escalating
You currently send SIGTERM immediately, even after a successful result, which can cut off the worker’s finalization/logging unnecessarily. Prefer “wait → TERM → KILL”.
Apply this diff:
     def _terminate_process_gracefully(self, proc, timeout_terminate: int = 10, timeout_kill: int = 5):
         """Terminate process gracefully with escalating signals.

         First tries SIGTERM, then SIGKILL if needed.
         """
         if not proc.is_alive():
             return

-        self.log("Attempting graceful process termination with SIGTERM")
-        proc.terminate()  # Send SIGTERM
-        proc.join(timeout=timeout_terminate)
+        # Give the process a chance to exit cleanly without signals
+        self.log("Waiting for worker process to exit cleanly")
+        proc.join(timeout=timeout_terminate)
+
+        if not proc.is_alive():
+            return
+
+        self.log("Attempting graceful process termination with SIGTERM")
+        proc.terminate()  # Send SIGTERM
+        proc.join(timeout=timeout_terminate)

         if proc.is_alive():
             self.log("Process didn't respond to SIGTERM, using SIGKILL")
             proc.kill()  # Send SIGKILL
             proc.join(timeout=timeout_kill)

             if proc.is_alive():
                 self.log("Warning: Process still alive after SIGKILL")
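The “wait → TERM → KILL” order can also be written as a standalone function that reports which stage ended the process, which makes the escalation easy to test. This is a sketch under assumptions: `stop_process` and its `wait` parameter are hypothetical, and `proc` is assumed to behave like `multiprocessing.Process` (`join`, `terminate`, `kill`, `is_alive`).

```python
def stop_process(proc, wait: float = 10.0, timeout_terminate: float = 10.0, timeout_kill: float = 5.0) -> str:
    """Stop a worker with escalating force and return the stage that ended it."""
    # Stage 1: give the process a chance to exit on its own.
    proc.join(timeout=wait)
    if not proc.is_alive():
        return "exited"

    # Stage 2: polite request (SIGTERM on POSIX).
    proc.terminate()
    proc.join(timeout=timeout_terminate)
    if not proc.is_alive():
        return "terminated"

    # Stage 3: forceful stop (SIGKILL on POSIX).
    proc.kill()
    proc.join(timeout=timeout_kill)
    return "killed" if not proc.is_alive() else "still-alive"
```

Returning the stage instead of only logging it lets the caller decide whether a forced kill should be surfaced to the user.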
141-144: Optional: name the worker process for easier debugging
A process name shows up in debuggers and logs.
Apply this diff:
 proc = ctx.Process(
-    target=docling_worker,
+    name="DoclingWorker",
+    target=docling_worker,
     args=(file_paths, queue, self.pipeline, self.ocr_engine),
 )
src/backend/base/langflow/base/data/docling_utils.py (2)
114-121: Consider standardizing the missing-Docling message prefix
For downstream consistency, consider starting the message with “Docling is not installed.” This aligns with the check added in the component and reduces string-matching brittleness.
Apply this minimal tweak:
-        msg = (
-            "Docling is an optional dependency of Langflow. "
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
             "Install with `uv pip install 'langflow[docling]'` "
             "or refer to the documentation"
         )
167-173: DocumentConverter provides built-in defaults for missing formats
The `DocumentConverter` implementation and documentation confirm that any `InputFormat` not explicitly provided in the `format_options` mapping will be populated with the library’s built-in default `FormatOptions`. This means that formats such as DOCX, PPTX, XLSX, and others will automatically receive sensible defaults without any additional code changes.
Mapping `InputFormat.IMAGE` → `PdfFormatOption` is technically supported if you want to apply the PDF processing pipeline (OCR, table extraction, etc.) to images, but it’s not the semantic default. Unless your use case specifically requires PDF-style handling for images, you can:
- Omit the `InputFormat.IMAGE` entry entirely and let `DocumentConverter` assign its default `ImageFormatOption`.
- Or explicitly map `InputFormat.IMAGE` → an `ImageFormatOption` instance, which better reflects standard image processing behavior.
If the intent is to reuse the PDF pipeline for images, keep the current mapping but verify that your `PdfPipelineOptions` and backends are configured appropriately for standalone image files (see examples in Docling issue #576).
Locations to review:
- src/backend/base/langflow/base/data/docling_utils.py (lines 167–173):
  - No changes required for DOCX/PPTX/XLSX support; defaults cover those formats.
  - Optional: Replace the `InputFormat.IMAGE` → `pdf_format_option` mapping with either an omitted key or an `image_format_option` to use the semantic default.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
src/backend/base/langflow/base/data/docling_utils.py (2 hunks)
src/backend/base/langflow/components/docling/__init__.py (0 hunks)
src/backend/base/langflow/components/docling/docling_inline.py (1 hunks)
💤 Files with no reviewable changes (1)
- src/backend/base/langflow/components/docling/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)
src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately
Files:
src/backend/base/langflow/components/docling/docling_inline.py
{src/backend/**/*.py,tests/**/*.py,Makefile}
📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)
{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code
Files:
src/backend/base/langflow/components/docling/docling_inline.py
src/backend/base/langflow/base/data/docling_utils.py
src/backend/**/components/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)
In your Python component class, set the `icon` attribute to a string matching the frontend icon mapping exactly (case-sensitive).
Files:
src/backend/base/langflow/components/docling/docling_inline.py
🧬 Code graph analysis (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)
src/backend/base/langflow/base/data/docling_utils.py (1)
docling_worker(60-245)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Optimize new Python code in this PR
🔇 Additional comments (7)
src/backend/base/langflow/components/docling/docling_inline.py (3)
6-6: Import relocation LGTM: centralizes the worker and avoids component-level heavy deps
Switching to `from langflow.base.data.docling_utils import docling_worker` matches the PR goal and keeps the component lean.
16-17: Icon mapping present and case-correct
`icon = "Docling"` matches the frontend mapping requirement.
139-170: Approved: spawn context & importable target validated
The use of `get_context("spawn")` with the `docling_worker` top-level function ensures correct, cross-platform process spawning (Windows/macOS). Verification confirms there are no remaining imports from the old `langflow.components.docling` path; everything now references `langflow.base.data.docling_utils`. Code changes are clean and ready to merge.
src/backend/base/langflow/base/data/docling_utils.py (4)
13-57: Document extraction logic is solid, with clear error messages and type checks
Good handling of DataFrame vs Data vs list[Data], with explicit, actionable errors.
91-99: Signal handling path is appropriate and guarded for platform differences
Registering SIGTERM/SIGINT with a guarded try/except is the right call for cross-platform support.
215-225: Result packaging is clear: successful conversions only, with positional mapping preserved via None placeholders
This makes downstream rollup straightforward. Logging total successes is a nice touch.
226-245: Good shutdown semantics and error propagation
Graceful handling of KeyboardInterrupt and unexpected exceptions with traceback payloads is appropriate for a worker process.
import traceback
from contextlib import suppress

from docling_core.types.doc import DoclingDocument
Critical: top-level import of Docling types breaks optional-dependency contract
`from docling_core.types.doc import DoclingDocument` at module import time will raise ModuleNotFoundError if Docling isn’t installed, causing any import of this module (and thus the component) to fail immediately. The worker carefully lazy-loads Docling; this top-level import defeats that.
Move the import inside `extract_docling_documents` and surface a friendly error:
-import signal
-import sys
-import traceback
-from contextlib import suppress
-
-from docling_core.types.doc import DoclingDocument
-from loguru import logger
+import signal
+import sys
+import traceback
+from contextlib import suppress
+
+from loguru import logger
And at the start of `extract_docling_documents`:
-def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str) -> list[DoclingDocument]:
+def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
+    # Lazy import to keep Docling optional and avoid module import-time failures
+    try:
+        from docling_core.types.doc import DoclingDocument  # type: ignore
+    except ModuleNotFoundError as e:
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
+            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
+        )
+        raise TypeError(msg) from e
Note: returning a list[DoclingDocument] remains true at runtime; keep the annotation if your tooling supports postponed evaluation, or keep it un-annotated to avoid forward-reference to an optional type.
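The lazy-import idea generalizes to any optional dependency. The sketch below is a generic illustration, not code from the PR: `require_optional` and its `extra` parameter are hypothetical, and it raises ImportError where the suggested diff raises TypeError, which is an assumption about what callers would expect.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency on demand, with an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        msg = (
            f"{module_name} is not installed. It is an optional dependency; "
            f"install it with the '{extra}' extra."
        )
        raise ImportError(msg) from exc
```

Because the import runs inside the function, importing the module that defines the helper never fails; the error appears only when the optional feature is actually used.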
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-from docling_core.types.doc import DoclingDocument
+import signal
+import sys
+import traceback
+from contextlib import suppress
+
+from loguru import logger
+
+
+def extract_docling_documents(data_inputs: Data | list[Data] | DataFrame, doc_key: str):
+    # Lazy import to keep Docling optional and avoid module import-time failures
+    try:
+        from docling_core.types.doc import DoclingDocument  # type: ignore
+    except ModuleNotFoundError as e:
+        msg = (
+            "Docling is not installed. Docling is an optional dependency of Langflow. "
+            "Install with `uv pip install 'langflow[docling]'` or refer to the documentation"
+        )
+        raise TypeError(msg) from e
+    # …rest of the original function body…
🤖 Prompt for AI Agents
In src/backend/base/langflow/base/data/docling_utils.py around line 6: the
top-level import "from docling_core.types.doc import DoclingDocument" will raise
ModuleNotFoundError when Docling is not installed and defeats lazy-loading; move
that import into extract_docling_documents so the module only tries to import
Docling at runtime inside the function, wrap the import in a try/except
ImportError and raise or return a clear, friendly error message explaining
Docling is optional and needs to be installed (or return an empty
list/appropriate fallback), and keep or remove the return type annotation
accordingly (use postponed evaluation like "list[DoclingDocument]" only if your
tooling supports it, otherwise omit the annotation to avoid referencing the
optional type at import time).
Codecov Report
❌ Your patch status has failed because the patch coverage (5.55%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #9471 +/- ##
==========================================
- Coverage 33.94% 33.41% -0.53%
==========================================
Files 1196 1186 -10
Lines 56116 56218 +102
Branches 5331 5363 +32
==========================================
- Hits 19046 18783 -263
- Misses 37000 37375 +375
+ Partials 70 60 -10
refactor: move docling_worker import to docling_utils for better organization Co-authored-by: Ítalo Johnny <italojohnnydosanjos@gmail.com>



Move the `docling_worker` import to `docling_utils` to enhance code organization and maintainability. This change simplifies the import structure within the project.
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Refactor