refactor(docling): extract processing logic to separate worker process by italojohnny · Pull Request #9393 · langflow-ai/langflow

italojohnny · 2025-08-14T11:53:45Z

Extracts Docling processing to separate worker process while maintaining feature parity with original implementation.

Key changes:

- Preserves _get_standard_opts() and _get_vlm_opts() configuration
- Maintains VlmPipeline and OCR factory setup
- Adds proper error propagation between processes

Summary by CodeRabbit

Refactor
- Moved document conversion to a separate process for improved stability and responsiveness, with centralized OCR/VLM pipeline configuration. Existing interfaces remain unchanged.
Bug Fixes
- Improved error reporting when the conversion engine isn’t available.
- Prevents crashes by isolating heavy conversions and standardizing per-file status handling.

- Move Docling processing to dedicated worker function
- Preserve all original pipeline configuration logic
- Maintain support for standard and VLM pipelines
- Keep complete OCR engine configuration
- Add proper error handling for multiprocessing context

coderabbitai · 2025-08-14T11:53:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Walkthrough

Introduces a separate multiprocessing worker (docling_worker) for Docling conversions. The inline component now spawns a process with a queue, delegates conversion to the worker, receives results/errors via IPC, and maps outputs to existing Data structures. Worker handles lazy imports, pipeline selection (standard/VLM), OCR options, conversion, and error normalization.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docling worker introduction `src/backend/base/langflow/components/docling/__init__.py` | Adds docling_worker to perform Docling conversions in a separate process. Handles lazy imports, pipeline option construction (standard/VLM, OCR), converter setup, convert_all execution, result normalization, and queue-based error/success reporting. |
| Inline refactor to multiprocessing `src/backend/base/langflow/components/docling/docling_inline.py` | Refactors inline processing to spawn a new process using get_context("spawn") and a Queue. Delegates to docling_worker, collects results, raises ImportError on worker-reported errors, and rebuilds processed_data to match prior API outputs. Removes local Docling configuration logic. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller as DoclingInlineComponent
  participant MP as multiprocessing (spawn)
  participant P as Worker Process
  participant Q as Queue
  participant W as docling_worker
  participant D as Docling Converter

  Caller->>MP: create Queue, spawn Process(target=docling_worker, args)
  MP->>P: start
  P->>W: run(file_paths, queue, pipeline, ocr_engine)
  W->>D: lazy import + configure pipeline (standard/VLM, OCR)
  W->>D: convert_all(file_paths)
  D-->>W: results
  W->>Q: put(processed_data or {"error": msg})
  Caller->>Q: get()
  Caller->>MP: join process
  alt error
    Q-->>Caller: {"error": msg}
    Caller->>Caller: raise ImportError(msg)
  else success
    Q-->>Caller: [per-file dicts/None]
    Caller->>Caller: map to Data objects
  end
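
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `demo_worker`, `unpack_result`, and `run_in_worker` are hypothetical names, the worker fakes the conversion instead of calling Docling, and the `timeout` parameter is an assumption. Only the spawn context, the queue, the `{"error": msg}` payload shape, and the `ImportError` on worker-reported errors come from the walkthrough above.

```python
import multiprocessing as mp
import queue as queue_module


def demo_worker(file_paths, result_queue):
    """Stand-in for docling_worker: put per-file dicts or a single error dict."""
    try:
        # A real worker would lazily import Docling and call convert_all here.
        processed = [{"document": f"converted:{p}", "file_path": p} for p in file_paths]
        result_queue.put(processed)
    except Exception as exc:  # normalize any failure into a picklable message
        result_queue.put({"error": str(exc)})


def unpack_result(payload):
    """Raise on a worker-reported error, otherwise return the per-file list."""
    if isinstance(payload, dict) and "error" in payload:
        raise ImportError(payload["error"])
    return payload


def run_in_worker(file_paths, timeout=60.0):
    """Spawn the worker, wait for one payload, then join the process."""
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    proc = ctx.Process(target=demo_worker, args=(file_paths, result_queue))
    proc.start()
    try:
        # Read before join(): a child blocked on a full pipe would never exit.
        payload = result_queue.get(timeout=timeout)
    except queue_module.Empty:
        proc.terminate()
        raise TimeoutError("worker produced no result in time") from None
    finally:
        proc.join()
    return unpack_result(payload)
```

With the spawn start method, `run_in_worker` should be called from code guarded by `if __name__ == "__main__":` when run as a script, because the child process re-imports the main module.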

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (3)

src/backend/base/langflow/components/docling/__init__.py (1)
106-114: Avoid passing heavy/unpicklable objects through multiprocessing.Queue

res.document may be large and/or not picklable, risking timeouts or BrokenPipeError. Consider serializing to a compact form (e.g., JSON) or writing to a temp file and returning a reference, then reconstructing in the parent process. Also, status is not used by the caller; dropping it reduces payload size.

If you stick with pickling the document, please verify stability with realistic inputs. If you prefer, I can draft a serialization-based approach.

Minimal payload tweak (status removal) if you choose to keep pickling:
-        processed_data = [
-            {"document": res.document, "file_path": str(res.input.file), "status": res.status.name}
+        processed_data = [
+            {"document": res.document, "file_path": str(res.input.file)}
             if res.status == ConversionStatus.SUCCESS
             else None
             for res in results
         ]
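
One way to read the "temp file and return a reference" suggestion is sketched below. It assumes the converted document can be represented as a JSON-compatible dict; how that dict is obtained from a Docling document is not shown in this thread, so a plain dict stands in for it. `dump_document` and `load_document` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def dump_document(document_dict, out_dir):
    """Worker side: write the document as JSON and return only its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", dir=out_dir, delete=False, encoding="utf-8"
    ) as handle:
        json.dump(document_dict, handle)
        return handle.name


def load_document(reference):
    """Parent side: rebuild the document from the path, then delete the file."""
    path = Path(reference)
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    finally:
        path.unlink(missing_ok=True)
```

Only the short path string crosses the queue in this variant, so the payload size no longer depends on the size of the document.
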
src/backend/base/langflow/components/docling/docling_inline.py (2)

82-88: Good call on using spawn context; consider naming and reuse

Using get_context("spawn") improves cross-platform stability. Optionally, consider naming the process (name="docling-worker") for easier debugging or reusing a long-lived process if throughput becomes a concern.

If desired, I can sketch a simple single-worker lifecycle manager to amortize spawn costs.
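
A small sketch of the naming suggestion, assuming a hypothetical factory function `make_worker_process`; the process is only constructed here, not started.

```python
import multiprocessing as mp


def make_worker_process(target, args=()):
    """Build (but do not start) a spawn-context process with a readable name."""
    ctx = mp.get_context("spawn")
    # The name is what shows up in Process.name and in multiprocessing log records.
    return ctx.Process(target=target, args=args, name="docling-worker")
```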

97-97: Leverage worker-provided status (or drop it at the source)

You’re discarding the status information returned from the worker. Either log/report failed conversions here for observability, or remove status from the worker payload to reduce IPC payload and serialization overhead.

I can wire basic logging that reports successes/failures per file before rollup_data if you want that visibility.
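
A rough sketch of what such per-file logging could look like. It assumes the worker returns one entry per input file, in input order, with `None` marking a failed conversion (as in the payload shown earlier); `summarize_results` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("docling_inline_sketch")


def summarize_results(file_paths, results):
    """Log each file's outcome and return (succeeded, failed) path lists."""
    succeeded, failed = [], []
    for path, result in zip(file_paths, results):
        if result is None:
            logger.warning("Docling conversion failed: %s", path)
            failed.append(path)
        else:
            logger.info("Docling conversion succeeded: %s", path)
            succeeded.append(path)
    return succeeded, failed
```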

📥 Commits

Reviewing files that changed from the base of the PR and between e68f6a4 and 0d349b6.

📒 Files selected for processing (2)

src/backend/base/langflow/components/docling/__init__.py (1 hunks)
src/backend/base/langflow/components/docling/docling_inline.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

src/backend/base/langflow/components/docling/__init__.py
src/backend/base/langflow/components/docling/docling_inline.py

src/backend/base/langflow/components/**/__init__.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

Update __init__.py with alphabetical imports when adding new components

Files:

src/backend/base/langflow/components/docling/__init__.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/base/langflow/components/docling/__init__.py
src/backend/base/langflow/components/docling/docling_inline.py

src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/base/langflow/components/docling/__init__.py
src/backend/base/langflow/components/docling/docling_inline.py

🧬 Code Graph Analysis (1)

src/backend/base/langflow/components/docling/__init__.py (2)

src/backend/base/langflow/services/task/backends/anyio.py (1)

status (24-27)

src/backend/base/langflow/base/astra_assistants/util.py (1)

name (147-151)

🔇 Additional comments (3)

src/backend/base/langflow/components/docling/__init__.py (1)

95-101: Confirm handling of non-PDF inputs (DOCX, PPTX, etc.)

Only InputFormat.PDF and InputFormat.IMAGE are provided with explicit options. Ensure other formats you advertise (e.g., docx, pptx) are correctly handled by DocumentConverter defaults or add explicit options if required.

Would you like me to scan the codebase/usages to confirm non-PDF formats are covered by defaults, or add explicit FormatOptions for common formats?

src/backend/base/langflow/components/docling/docling_inline.py (2)

1-4: LGTM: Worker import and multiprocessing primitives

Importing docling_worker from the package and using multiprocessing primitives at module scope is appropriate given the spawn-start method usage below.

75-99: Async guideline check: verify whether a synchronous component is acceptable here

Per repository guidelines, async methods are encouraged for components. process_files is synchronous and blocks the caller while waiting on the worker. If the surrounding pipeline is async, consider providing an async variant with asyncio.to_thread or an event-loop-friendly approach. If the component framework expects sync, ignore this.

Would you like me to propose an async process_files implementation using asyncio + run_in_executor, while preserving the spawn-based worker behavior?
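
For illustration, an async wrapper along these lines could look like the sketch below. `process_files_blocking` is a stand-in for the synchronous method (it only sleeps and returns strings), and `process_files_async` is a hypothetical name; whether the component framework would accept an async variant is not settled in this thread.

```python
import asyncio
import time


def process_files_blocking(file_paths):
    """Stand-in for the synchronous process_files; sleeps to mimic the wait."""
    time.sleep(0.05)
    return [f"processed:{p}" for p in file_paths]


async def process_files_async(file_paths):
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(process_files_blocking, file_paths)
```

`asyncio.to_thread` requires Python 3.9 or later; on older versions, `loop.run_in_executor(None, ...)` plays the same role.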

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

codecov · 2025-08-20T14:55:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.25%. Comparing base (e63e879) to head (b4211fb).
⚠️ Report is 1 commits behind head on main.

❌ Your project status has failed because the head coverage (2.67%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9393      +/-   ##
==========================================
- Coverage   33.27%   33.25%   -0.02%     
==========================================
  Files        1209     1209              
  Lines       57545    57545              
  Branches     5363     5363              
==========================================
- Hits        19146    19137       -9     
- Misses      38339    38348       +9     
  Partials       60       60

Flag	Coverage Δ
backend	`55.15% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown.
5 files have indirect coverage changes.

sonarqubecloud · 2025-08-20T14:57:05Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

#9393)

* refactor(docling): extract processing logic to separate worker process
  - Move Docling processing to dedicated worker function
  - Preserve all original pipeline configuration logic
  - Maintain support for standard and VLM pipelines
  - Keep complete OCR engine configuration
  - Add proper error handling for multiprocessing context
* Update src/backend/base/langflow/components/docling/__init__.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Update src/backend/base/langflow/components/docling/__init__.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [autofix.ci] apply automated fixes
* Update src/backend/base/langflow/components/docling/__init__.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Update src/backend/base/langflow/components/docling/docling_inline.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [autofix.ci] apply automated fixes
* feat: add process monitoring and timeout handling
* fix: ruff check
* feat: add graceful signal handling to docling worker
* friendlier error message
* Swallow stack trace on interrupt
* [autofix.ci] apply automated fixes
* fix: ruff error
* fix: mypy error

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Jordan Frazier <jordan.frazier@datastax.com>

italojohnny requested review from edwinjosechittilappilly, jordanrfrazier, ogabrielluiz and phact August 14, 2025 11:53

github-actions Bot added refactor Maintenance tasks and housekeeping and removed refactor Maintenance tasks and housekeeping labels Aug 14, 2025

coderabbitai Bot reviewed Aug 14, 2025

jordanrfrazier requested changes Aug 14, 2025

Comment thread src/backend/base/langflow/components/docling/docling_inline.py

Comment thread src/backend/base/langflow/components/docling/docling_inline.py Outdated

Update src/backend/base/langflow/components/docling/__init__.py

4782152

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>