refactor(docling): extract processing logic to separate worker process #9393

Merged
italojohnny merged 17 commits into main from refactor/docling-multiprocessing-worker
Aug 20, 2025

Conversation


@italojohnny italojohnny commented Aug 14, 2025

Extracts Docling processing into a separate worker process while maintaining feature parity with the original implementation.

Key changes:

  • Preserves _get_standard_opts() and _get_vlm_opts() configuration
  • Maintains VlmPipeline and OCR factory setup
  • Adds proper error propagation between processes

Summary by CodeRabbit

  • Refactor
    • Moved document conversion to a separate process for improved stability and responsiveness, with centralized OCR/VLM pipeline configuration. Existing interfaces remain unchanged.
  • Bug Fixes
    • Improved error reporting when the conversion engine isn’t available.
    • Prevents crashes by isolating heavy conversions and standardizing per-file status handling.

- Move Docling processing to dedicated worker function
- Preserve all original pipeline configuration logic
- Maintain support for standard and VLM pipelines
- Keep complete OCR engine configuration
- Add proper error handling for multiprocessing context

coderabbitai Bot commented Aug 14, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Walkthrough

Introduces a separate multiprocessing worker (docling_worker) for Docling conversions. The inline component now spawns a process with a queue, delegates conversion to the worker, receives results/errors via IPC, and maps outputs to existing Data structures. Worker handles lazy imports, pipeline selection (standard/VLM), OCR options, conversion, and error normalization.
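
The flow described in this walkthrough can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `demo_worker` and `convert_in_subprocess` are hypothetical names, the worker body is a placeholder for the real Docling calls, and the `timeout` parameter is an assumption.

```python
import queue as queue_mod
from multiprocessing import get_context


def demo_worker(file_paths, result_queue):
    """Hypothetical stand-in for docling_worker: puts a result list or an error dict."""
    try:
        # A real worker would lazily import Docling here and run the conversion.
        processed = [{"file_path": str(p), "document": f"converted:{p}"} for p in file_paths]
        result_queue.put(processed)
    except Exception as exc:
        # Normalize any failure into a small, picklable payload.
        result_queue.put({"error": str(exc)})


def convert_in_subprocess(file_paths, timeout=60.0):
    """Spawn the worker, read one message from the queue, then join the process."""
    ctx = get_context("spawn")
    result_queue = ctx.Queue()
    proc = ctx.Process(target=demo_worker, args=(list(file_paths), result_queue))
    proc.start()
    try:
        # Read before joining so a large payload cannot block the worker's exit.
        result = result_queue.get(timeout=timeout)
    except queue_mod.Empty:
        proc.terminate()
        raise TimeoutError("worker produced no result in time") from None
    finally:
        proc.join()
    if isinstance(result, dict) and "error" in result:
        # The PR summary states that worker-reported errors surface as ImportError.
        raise ImportError(result["error"])
    return result
```

With the spawn start method, `convert_in_subprocess` should be called from code guarded by `if __name__ == "__main__":` (or from an importable module), because the child process re-imports the main module.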

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docling worker introduction: src/backend/base/langflow/components/docling/__init__.py | Adds docling_worker to perform Docling conversions in a separate process. Handles lazy imports, pipeline option construction (standard/VLM, OCR), converter setup, convert_all execution, result normalization, and queue-based error/success reporting. |
| Inline refactor to multiprocessing: src/backend/base/langflow/components/docling/docling_inline.py | Refactors inline processing to spawn a new process using get_context("spawn") and a Queue. Delegates to docling_worker, collects results, raises ImportError on worker-reported errors, and rebuilds processed_data to match prior API outputs. Removes local Docling configuration logic. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller as DoclingInlineComponent
  participant MP as multiprocessing (spawn)
  participant P as Worker Process
  participant Q as Queue
  participant W as docling_worker
  participant D as Docling Converter

  Caller->>MP: create Queue, spawn Process(target=docling_worker, args)
  MP->>P: start
  P->>W: run(file_paths, queue, pipeline, ocr_engine)
  W->>D: lazy import + configure pipeline (standard/VLM, OCR)
  W->>D: convert_all(file_paths)
  D-->>W: results
  W->>Q: put(processed_data or {"error": msg})
  Caller->>Q: get()
  Caller->>MP: join process
  alt error
    Q-->>Caller: {"error": msg}
    Caller->>Caller: raise ImportError(msg)
  else success
    Q-->>Caller: [per-file dicts/None]
    Caller->>Caller: map to Data objects
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

@github-actions github-actions Bot added refactor Maintenance tasks and housekeeping and removed refactor Maintenance tasks and housekeeping labels Aug 14, 2025

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
src/backend/base/langflow/components/docling/__init__.py (1)

106-114: Avoid passing heavy/unpicklable objects through multiprocessing.Queue

res.document may be large and/or not picklable, risking timeouts or BrokenPipeError. Consider serializing to a compact form (e.g., JSON) or writing to a temp file and returning a reference, then reconstructing in the parent process. Also, status is not used by the caller; dropping it reduces payload size.

If you stick with pickling the document, please verify stability with realistic inputs. If you prefer, I can draft a serialization-based approach.

Minimal payload tweak (status removal) if you choose to keep pickling:

-        processed_data = [
-            {"document": res.document, "file_path": str(res.input.file), "status": res.status.name}
+        processed_data = [
+            {"document": res.document, "file_path": str(res.input.file)}
             if res.status == ConversionStatus.SUCCESS
             else None
             for res in results
         ]
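
The serialization alternative mentioned above (write to a temp file and return a reference) might look like the sketch below. It assumes the converted document has already been exported to a JSON-compatible dict; the helper names `dump_result_to_tempfile` and `load_result_from_tempfile` and the `result_path` key are hypothetical and do not appear in this PR.

```python
import json
import os
import tempfile


def dump_result_to_tempfile(document_dict, file_path):
    """Worker side: write one document to a temp JSON file and return a small reference."""
    fd, tmp_path = tempfile.mkstemp(suffix=".json", prefix="docling-")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(document_dict, fh)
    # Only this small dict travels through the multiprocessing queue.
    return {"file_path": file_path, "result_path": tmp_path}


def load_result_from_tempfile(ref):
    """Parent side: read the document back, then delete the temp file."""
    try:
        with open(ref["result_path"], encoding="utf-8") as fh:
            return json.load(fh)
    finally:
        os.remove(ref["result_path"])
```

The trade-off is extra disk I/O in exchange for a queue payload whose size no longer depends on the document.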
src/backend/base/langflow/components/docling/docling_inline.py (2)

82-88: Good call on using spawn context; consider naming and reuse

Using get_context("spawn") improves cross-platform stability. Optionally, consider naming the process (name="docling-worker") for easier debugging or reusing a long-lived process if throughput becomes a concern.

If desired, I can sketch a simple single-worker lifecycle manager to amortize spawn costs.
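
For illustration, naming the process is a one-argument change. In this sketch `make_named_worker` is a hypothetical helper; only `get_context("spawn")` and the `name="docling-worker"` value come from the comment above.

```python
from multiprocessing import get_context


def make_named_worker(target, args=()):
    """Create (but do not start) a spawn-context process with a recognizable name."""
    ctx = get_context("spawn")
    # The name appears in process listings and multiprocessing log records.
    return ctx.Process(target=target, args=args, name="docling-worker")
```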


97-97: Leverage worker-provided status (or drop it at the source)

You’re discarding the status information returned from the worker. Either log/report failed conversions here for observability, or remove status from the worker payload to reduce IPC payload and serialization overhead.

I can wire basic logging that reports successes/failures per file before rollup_data if you want that visibility.
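
A minimal sketch of such per-file reporting, assuming the payload shape shown in the diff above (a dict with `file_path` and `status` for successes, `None` for failures). The function name `summarize_results` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("docling_inline_example")


def summarize_results(processed_data):
    """Log per-file outcomes and return (successful items, number of failures)."""
    succeeded = [item for item in processed_data if item is not None]
    failed = len(processed_data) - len(succeeded)
    for item in succeeded:
        logger.info("converted %s (status=%s)", item["file_path"], item.get("status", "unknown"))
    if failed:
        # Failed entries are None in this payload, so their paths are not available here.
        logger.warning("%d of %d files failed to convert", failed, len(processed_data))
    return succeeded, failed
```

Because failures arrive as `None`, reporting which file failed would require the worker to include the path in failure entries as well.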

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e68f6a4 and 0d349b6.

📒 Files selected for processing (2)
  • src/backend/base/langflow/components/docling/__init__.py (1 hunks)
  • src/backend/base/langflow/components/docling/docling_inline.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

  • src/backend/base/langflow/components/docling/__init__.py
  • src/backend/base/langflow/components/docling/docling_inline.py
src/backend/base/langflow/components/**/__init__.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

Update __init__.py with alphabetical imports when adding new components

Files:

  • src/backend/base/langflow/components/docling/__init__.py
{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

  • src/backend/base/langflow/components/docling/__init__.py
  • src/backend/base/langflow/components/docling/docling_inline.py
src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/base/langflow/components/docling/__init__.py
  • src/backend/base/langflow/components/docling/docling_inline.py
🧬 Code Graph Analysis (1)
src/backend/base/langflow/components/docling/__init__.py (2)
src/backend/base/langflow/services/task/backends/anyio.py (1)
  • status (24-27)
src/backend/base/langflow/base/astra_assistants/util.py (1)
  • name (147-151)
🔇 Additional comments (3)
src/backend/base/langflow/components/docling/__init__.py (1)

95-101: Confirm handling of non-PDF inputs (DOCX, PPTX, etc.)

Only InputFormat.PDF and InputFormat.IMAGE are provided with explicit options. Ensure other formats you advertise (e.g., docx, pptx) are correctly handled by DocumentConverter defaults or add explicit options if required.

Would you like me to scan the codebase/usages to confirm non-PDF formats are covered by defaults, or add explicit FormatOptions for common formats?

src/backend/base/langflow/components/docling/docling_inline.py (2)

1-4: LGTM: Worker import and multiprocessing primitives

Importing docling_worker from the package and using multiprocessing primitives at module scope is appropriate given the spawn-start method usage below.


75-99: Async guideline check: verify whether a synchronous component is acceptable here

Per repository guidelines, async methods are encouraged for components. process_files is synchronous and blocks the caller while waiting on the worker. If the surrounding pipeline is async, consider providing an async variant with asyncio.to_thread or an event-loop-friendly approach. If the component framework expects sync, ignore this.

Would you like me to propose an async process_files implementation using asyncio + run_in_executor, while preserving the spawn-based worker behavior?
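
One way the async variant raised in this comment could be approached is to run the blocking call in a thread via `asyncio.to_thread` (Python 3.9+). This is a sketch under assumptions: `process_files_blocking` is a placeholder for the component's synchronous `process_files`, and `process_files_async` is a hypothetical name.

```python
import asyncio


def process_files_blocking(file_paths):
    """Placeholder for the synchronous method that spawns the worker and waits on it."""
    return [f"processed:{p}" for p in file_paths]


async def process_files_async(file_paths):
    """Run the blocking call in a thread so the event loop stays responsive."""
    return await asyncio.to_thread(process_files_blocking, list(file_paths))
```

The spawn-based worker is unchanged in this approach; only the waiting moves off the event loop thread.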

Comment thread src/backend/base/langflow/components/docling/__init__.py Outdated
Comment thread src/backend/base/langflow/components/docling/__init__.py
Comment thread src/backend/base/langflow/components/docling/__init__.py Outdated
Comment thread src/backend/base/langflow/components/docling/docling_inline.py Outdated
Comment thread src/backend/base/langflow/components/docling/docling_inline.py
Comment thread src/backend/base/langflow/components/docling/docling_inline.py Outdated
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
italojohnny and others added 2 commits August 18, 2025 11:01
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions Bot added the lgtm This PR has been approved by a maintainer label Aug 20, 2025

codecov Bot commented Aug 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.25%. Comparing base (e63e879) to head (b4211fb).
⚠️ Report is 1 commit behind head on main.

❌ Your project status has failed because the head coverage (2.67%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #9393      +/-   ##
==========================================
- Coverage   33.27%   33.25%   -0.02%     
==========================================
  Files        1209     1209              
  Lines       57545    57545              
  Branches     5363     5363              
==========================================
- Hits        19146    19137       -9     
- Misses      38339    38348       +9     
  Partials       60       60              

| Flag | Coverage Δ |
| --- | --- |
| backend | 55.15% <ø> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown.
5 files have indirect coverage changes.

@italojohnny italojohnny added this pull request to the merge queue Aug 20, 2025
Merged via the queue into main with commit ea918df Aug 20, 2025
130 of 133 checks passed
@italojohnny italojohnny deleted the refactor/docling-multiprocessing-worker branch August 20, 2025 17:22
lucaseduoli pushed a commit that referenced this pull request Aug 22, 2025
#9393)

* refactor(docling): extract processing logic to separate worker process

- Move Docling processing to dedicated worker function
- Preserve all original pipeline configuration logic
- Maintain support for standard and VLM pipelines
- Keep complete OCR engine configuration
- Add proper error handling for multiprocessing context

* Update src/backend/base/langflow/components/docling/__init__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update src/backend/base/langflow/components/docling/__init__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [autofix.ci] apply automated fixes

* Update src/backend/base/langflow/components/docling/__init__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update src/backend/base/langflow/components/docling/docling_inline.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [autofix.ci] apply automated fixes

* feat: add process monitoring and timeout handling

* fix: ruff check

* feat: add graceful signal handling to docling worker

* friendlier error message

* Swallow stack trace on interrupt

* [autofix.ci] apply automated fixes

* fix: ruff error

* fix: mypy error

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Jordan Frazier <jordan.frazier@datastax.com>
lucaseduoli pushed a commit that referenced this pull request Aug 25, 2025
#9393)

Labels

lgtm This PR has been approved by a maintainer refactor Maintenance tasks and housekeeping

3 participants