fix: Proper support for VLM in Docling by erichare · Pull Request #10094 · langflow-ai/langflow

erichare · 2025-10-02T18:35:34Z

This pull request introduces a new dependency and updates project metadata. The most significant changes are as follows:

Dependency Updates

Added mlx-vlm version >=0.0.7 to the list of dependencies in pyproject.toml.

Project Metadata

Updated the code_hash value in the Document Q&A.json starter project to reflect the latest code state.

Summary by CodeRabbit

New Features
- Advanced document parsing with Docling, including structured and Markdown outputs, OCR selection, and pipeline modes (standard/VLM).
- Dynamic UI controls for advanced parsing, image/page placeholders, doc key, and context-aware outputs.
- Single-file results can be split into individual items for easier analysis.
- Optional VLM acceleration on macOS when available.
Improvements
- More reliable processing via subprocess isolation, safer path handling, and clearer error surfacing.
Chores
- Added runtime dependency: mlx-vlm (>= 0.0.7).

coderabbitai · 2025-10-02T18:35:42Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Walkthrough

Adds mlx-vlm dependency. Updates four starter project templates to embed a Docling-enabled FileComponent that can offload parsing to a subprocess with optional VLM/OCR pipelines and dynamic outputs. Adjusts lfx file component to simplify Docling imports, refine VLM construction (including optional mlx_vlm on macOS), and raise errors on failures.

Changes

Cohort / File(s)	Summary
Dependencies `pyproject.toml`	Adds runtime dependency `mlx-vlm>=0.0.7`.
Starter project templates (Docling-enabled FileComponent) `src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json`, `.../Portfolio Website Code Generator.json`, `.../Text Sentiment Analysis.json`, `.../Vector Store RAG.json`	Replaces embedded FileComponent with version supporting Docling parsing via isolated subprocess, optional VLM/OCR pipelines, dynamic UI fields (advanced mode, pipeline, ocr_engine, placeholders), validation, and adaptive outputs (raw/markdown/structured). Updates code hashes accordingly.
LFX File component Docling/VLM adjustments `src/lfx/src/lfx/components/data/file.py`	Simplifies Docling import flow; constructs VLM pipeline with explicit options; on macOS optionally enables mlx_vlm; raises exceptions on VLM setup/import failures instead of silent fallbacks; maintains standard converter with stricter error propagation.
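The macOS-only `mlx_vlm` switch described in the last row can be sketched as a small selection function. This is a minimal illustration, not the code in `file.py`: the function name `pick_vlm_backend` and the `"mlx"`/`"transformers"` return values are hypothetical, and the only assumption about the real package is that it is importable as `mlx_vlm`.

```python
import importlib.util
import sys
from typing import Optional


def pick_vlm_backend(platform: str = sys.platform, mlx_available: Optional[bool] = None) -> str:
    """Return "mlx" on macOS when mlx_vlm is importable, otherwise "transformers"."""
    if mlx_available is None:
        # find_spec returns None when the package is not installed.
        mlx_available = importlib.util.find_spec("mlx_vlm") is not None
    if platform == "darwin" and mlx_available:
        return "mlx"
    return "transformers"
```

Passing `platform` and `mlx_available` explicitly keeps the decision testable on any operating system, which is relevant to the darwin-specific coverage requested later in this thread.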

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant FC as FileComponent (starter templates)
  participant SP as Subprocess (Docling Runner)
  participant DL as Docling/VLM Pipelines

  U->>FC: Provide file(s) and options (advanced, pipeline, OCR, placeholders)
  alt Single file & Docling-compatible & advanced enabled
    FC->>SP: Launch subprocess with JSON config
    SP->>DL: Initialize converters (standard/VLM) and parse
    DL-->>SP: Parsed result (markdown/structured/meta)
    SP-->>FC: JSON result or error
    opt UNNEST single-file outputs
      FC->>FC: Map to Raw/Markdown/Structured outputs
    end
  else Multi-file or non-compatible
    FC->>FC: Standard/parallel file loading
  end
  FC-->>U: Outputs (Raw Content, Markdown, Structured, File Path, Meta)
  note over FC,SP: Errors are propagated (no silent VLM fallback)
  note over FC: On macOS, mlx_vlm may be enabled if available
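
The FC -> SP -> FC round trip in the diagram can be sketched with the standard library alone. This is an illustration under stated assumptions, not the component's actual runner: `CHILD_SCRIPT`, `run_isolated`, and the `ok`/`error` result keys are hypothetical, and the child only echoes its config where a real runner would call Docling.

```python
import json
import subprocess
import sys

# Hypothetical child script: reads a JSON config from stdin, prints a JSON result.
CHILD_SCRIPT = r"""
import json, sys
cfg = json.loads(sys.stdin.read())
try:
    if cfg.get("pipeline") not in ("standard", "vlm"):
        raise ValueError(f"unknown pipeline: {cfg.get('pipeline')}")
    # A real runner would invoke the document converter here.
    result = {"ok": True, "file_path": cfg["file_path"], "pipeline": cfg["pipeline"]}
except Exception as exc:
    result = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
print(json.dumps(result))
"""


def run_isolated(config: dict, timeout: float = 60.0) -> dict:
    """Run the child in a separate interpreter and surface its errors to the caller."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=json.dumps(config),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"parser subprocess failed: {proc.stderr.strip()}")
    payload = json.loads(proc.stdout)
    if not payload.get("ok"):
        # Propagate the child's error instead of silently falling back.
        raise RuntimeError(payload.get("error", "unknown parser error"))
    return payload
```

Calling `run_isolated({"file_path": "a.pdf", "pipeline": "vlm"})` returns the child's JSON as a dict, while an unsupported pipeline value raises `RuntimeError` in the parent.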

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix: Run docling processing in subprocess #9541 — Introduces subprocess-based Docling processing in FileComponent, matching the offloading and IPC approach here.
feat: Add support for advanced parsing with docling in the File Component #9398 — Modifies Docling/VLM integration paths similar to the lfx file component changes in this PR.
fix: Clean up some more base templates #8706 — Updates the same starter project templates’ File component implementations.

Suggested labels

bug, size:XL, lgtm

Suggested reviewers

jordanrfrazier
ogabrielluiz

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Test Coverage For New Implementations	❌ Error	The PR substantially rewrites multiple starter project FileComponent implementations and introduces new Docling/VLM handling without adding or updating any backend or frontend test files, and no existing tests appear to cover the new subprocess orchestration or VLM pathways, so the required regression and feature coverage is missing.	Please add targeted tests that exercise the new Docling advanced parsing logic, subprocess handling, and VLM pathways—covering both success and failure scenarios—following the project’s naming conventions so the coverage reflects the introduced functionality.
Test Quality And Coverage	⚠️ Warning	I inspected the repository and found no new or updated test files alongside the substantial Docling- and VLM-related logic introduced in `src/lfx/src/lfx/components/data/file.py` and the starter project components. Existing tests only cover legacy behaviors and do not exercise the new subprocess orchestration, Docling compatibility gating, or VLM pipeline branches, leaving the primary new functionality unverified. Because the PR implements complex control flow without corresponding behavioral tests, overall test quality and coverage for the changes are insufficient.	Please add targeted tests that validate Docling subprocess execution paths, VLM pipeline initialization (including darwin-specific mlx-vlm handling), and the updated FileComponent behaviors, ensuring both expected success scenarios and error handling are covered.
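
One possible shape for the requested success and failure tests is sketched below, with `subprocess.run` mocked so no real parser is launched. Everything here is hypothetical: `parse_with_subprocess` stands in for the component's subprocess call, and `docling-runner` is a placeholder command, not a real executable.

```python
import json
import subprocess
from types import SimpleNamespace
from unittest import mock


def parse_with_subprocess(config: dict) -> dict:
    """Hypothetical stand-in for the component's subprocess call."""
    proc = subprocess.run(
        ["docling-runner"],  # placeholder command, never executed under the mock
        input=json.dumps(config),
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr)
    return json.loads(proc.stdout)


def test_success_path():
    fake = SimpleNamespace(returncode=0, stdout='{"ok": true}', stderr="")
    with mock.patch.object(subprocess, "run", return_value=fake) as run:
        assert parse_with_subprocess({"pipeline": "vlm"}) == {"ok": True}
    # The config must reach the child unchanged.
    sent = json.loads(run.call_args.kwargs["input"])
    assert sent["pipeline"] == "vlm"


def test_failure_path():
    fake = SimpleNamespace(returncode=1, stdout="", stderr="NameError: VlmPipelineOptions")
    with mock.patch.object(subprocess, "run", return_value=fake):
        try:
            parse_with_subprocess({"pipeline": "vlm"})
        except RuntimeError as exc:
            assert "NameError" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")
```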

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title “fix: Proper support for VLM in Docling” succinctly conveys the core update of enabling VLM support within Docling and follows conventional commit style without extraneous detail.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test File Naming And Structure	✅ Passed	I reviewed the files changed in this pull request and found that none of them add or modify any backend, frontend, or integration test files, so there are no new or altered tests to evaluate against the required naming or structural patterns. Since the existing test suite remains untouched, the repository’s current test organization and standards are unaffected by this change. Therefore, there is no evidence that the pull request introduces any violations of the prescribed testing structure or conventions.
Excessive Mock Usage Warning	✅ Passed	No test files were added or modified in this pull request, and the existing test suite does not show evidence of heavy or inappropriate mocking that would impede meaningful behavior verification, so this change introduces no new risk of excessive mock usage.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

src/backend/base/langflow/initial_setup/starter_projects/Text Sentiment Analysis.json (1)

2270-2470: Fix missing VLM imports in the Docling subprocess.

When `pipeline == "vlm"`, the child script references `VlmPipelineOptions` and `vlm_model_specs` without ever importing them, so the block raises `NameError` and falls back to the plain `DocumentConverter`. As a result, the advertised VLM path never runs. Pull the proper symbols from Docling (and actually wire the options into `PdfFormatOption`) so the VLM pipeline executes.

-                if pipeline == "vlm":
-                    try:
-                        from docling.pipeline.vlm_pipeline import VlmPipeline
-                        from docling.document_converter import PdfFormatOption  # type: ignore
-
-                        vl_pipe = VlmPipelineOptions(
-                            vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
-                        )
+                if pipeline == "vlm":
+                    try:
+                        from docling.pipeline.vlm_pipeline import VlmPipeline, VlmPipelineOptions  # type: ignore
+                        from docling.datamodel import vlm_model_specs  # type: ignore
+                        from docling.document_converter import PdfFormatOption  # type: ignore
+
+                        vl_pipe = VlmPipelineOptions(
+                            vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
+                        )
@@
-                        if hasattr(input_format, "PDF"):
-                            fmt[getattr(input_format, "PDF")] = PdfFormatOption(pipeline_cls=VlmPipeline)
-                        if hasattr(input_format, "IMAGE"):
-                            fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(pipeline_cls=VlmPipeline)
+                        if hasattr(input_format, "PDF"):
+                            fmt[getattr(input_format, "PDF")] = PdfFormatOption(
+                                pipeline_cls=VlmPipeline,
+                                pipeline_options=vl_pipe,
+                            )
+                        if hasattr(input_format, "IMAGE"):
+                            fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(
+                                pipeline_cls=VlmPipeline,
+                                pipeline_options=vl_pipe,
+                            )
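
For reference, the wiring this comment asks for can be sketched as a standalone function. It is a sketch under assumptions rather than the PR's code: `build_vlm_converter` and its `use_mlx` flag are hypothetical, and it assumes `VlmPipelineOptions` is importable from `docling.datamodel.pipeline_options` and that `vlm_model_specs` exposes `GRANITEDOCLING_MLX` alongside `GRANITEDOCLING_TRANSFORMERS`, which may vary by Docling version.

```python
import sys


def build_vlm_converter(use_mlx: bool = False):
    """Build a DocumentConverter whose PDF and image inputs use the VLM pipeline."""
    # Imports are local so this module still loads when docling is not installed.
    from docling.datamodel import vlm_model_specs
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import VlmPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    # Assumption: the MLX spec is only useful on macOS.
    specs = (
        vlm_model_specs.GRANITEDOCLING_MLX
        if use_mlx and sys.platform == "darwin"
        else vlm_model_specs.GRANITEDOCLING_TRANSFORMERS
    )
    vl_pipe = VlmPipelineOptions(vlm_options=specs)
    # Passing pipeline_options is the step the original code omitted.
    option = PdfFormatOption(pipeline_cls=VlmPipeline, pipeline_options=vl_pipe)
    return DocumentConverter(
        format_options={InputFormat.PDF: option, InputFormat.IMAGE: option}
    )
```

Import or setup errors are deliberately left uncaught here, so a missing dependency surfaces to the caller rather than producing a default converter.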

src/backend/base/langflow/initial_setup/starter_projects/Portfolio Website Code Generator.json (1)

1120-1188: Fix undefined VLM dependencies in the child script

When the user selects `pipeline="vlm"`, the subprocess hits a `NameError` because `VlmPipelineOptions` and `vlm_model_specs` are referenced but never imported inside the child script. That crash makes the new VLM path unusable.

Please import those symbols before they’re used (e.g., `from docling.pipeline.vlm_pipeline import VlmPipeline, VlmPipelineOptions` and `from docling.datamodel import vlm_model_specs`, or pull the concrete constants directly) so the VLM branch can execute.

src/lfx/src/lfx/components/data/file.py (1)

360-377: Wire VLM pipeline options into PdfFormatOption

`vl_pipe` (with GRANITEDOCLING specs and optional MLX switch) is never handed to `PdfFormatOption`, so the VLM pipeline always falls back to defaults and the new `mlx_vlm` path never activates. Please pass `pipeline_options=vl_pipe` when constructing the format options for both PDF and IMAGE inputs.

Apply this diff:

-                        if hasattr(input_format, "PDF"):
-                            fmt[getattr(input_format, "PDF")] = PdfFormatOption(pipeline_cls=VlmPipeline)
-                        if hasattr(input_format, "IMAGE"):
-                            fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(pipeline_cls=VlmPipeline)
+                        if hasattr(input_format, "PDF"):
+                            fmt[getattr(input_format, "PDF")] = PdfFormatOption(
+                                pipeline_cls=VlmPipeline,
+                                pipeline_options=vl_pipe,
+                            )
+                        if hasattr(input_format, "IMAGE"):
+                            fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(
+                                pipeline_cls=VlmPipeline,
+                                pipeline_options=vl_pipe,
+                            )

src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (1)

2800-2885: VLM pipeline never initializes due to missing imports

Inside the embedded Docling subprocess script, `VlmPipelineOptions` and `vlm_model_specs` are used but never imported. At runtime this raises a `NameError`, tripping the `except` block and silently falling back to the default `DocumentConverter`, so the new `vlm` pipeline path is effectively non-functional. Please import the needed symbols (e.g. `from docling.pipeline.vlm_pipeline import VlmPipeline, VlmPipelineOptions` and `from docling.datamodel import vlm_model_specs`) and wire them into the converter setup so the VLM branch actually runs.
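
The failure mode described here, where a missing import is hidden by a broad `except`, can be reproduced without Docling. The two functions below are hypothetical illustrations: both reference `VlmPipelineOptions` without importing it, on purpose, and return plain strings in place of real converters.

```python
def build_converter_silently(pipeline: str) -> str:
    """Mimics the reported bug: the NameError is swallowed and the default is used."""
    if pipeline == "vlm":
        try:
            # VlmPipelineOptions is deliberately not imported, as in the reported bug.
            VlmPipelineOptions()  # noqa: F821
            return "vlm"
        except Exception:
            # NameError is an Exception subclass, so it is swallowed here.
            pass
    return "standard"


def build_converter_strict(pipeline: str) -> str:
    """Mimics the stricter behavior: setup failures are re-raised to the caller."""
    if pipeline == "vlm":
        try:
            VlmPipelineOptions()  # noqa: F821
            return "vlm"
        except NameError as exc:
            raise RuntimeError(f"VLM pipeline setup failed: {exc}") from exc
    return "standard"
```

With the silent variant a caller asking for `"vlm"` receives `"standard"` and has no signal that anything went wrong; the strict variant raises `RuntimeError` with the underlying `NameError` attached as its cause.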

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff6b4d and b0084d8.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

pyproject.toml (1 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json (2 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Portfolio Website Code Generator.json (2 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Text Sentiment Analysis.json (2 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (2 hunks)
src/lfx/src/lfx/components/data/file.py (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
GitHub Check: Lint Backend / Run Mypy (3.13)
GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
GitHub Check: Test Starter Templates
GitHub Check: Optimize new Python code in this PR
GitHub Check: Update Starter Projects
GitHub Check: test-starter-projects

codecov · 2025-10-02T18:53:22Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.17%. Comparing base (a6eb8d2) to head (d7605f5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...rc/backend/base/langflow/api/v1/knowledge_bases.py	0.00%	3 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (47.08%) is below the target coverage (55.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10094      +/-   ##
==========================================
- Coverage   24.21%   24.17%   -0.05%     
==========================================
  Files        1086     1086              
  Lines       40044    40044              
  Branches     5541     5541              
==========================================
- Hits         9696     9679      -17     
- Misses      30177    30194      +17     
  Partials      171      171

Flag	Coverage Δ
backend	`47.08% <0.00%> (-0.12%)`	⬇️

Files with missing lines	Coverage Δ
...rc/backend/base/langflow/api/v1/knowledge_bases.py	`17.37% <0.00%> (ø)`

... and 3 files with indirect coverage changes

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: Tool Mode Support for File Components
* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* feat: Better support for advanced parser in files
* [autofix.ci] apply automated fixes
* Add docling mocked tests
* Update file.py
* Update test_file_component.py
* [autofix.ci] apply automated fixes
* Update News Aggregator.json
* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

sonarqubecloud · 2025-10-03T18:06:38Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

rodrigosnader

LGTM

fix: Proper support for VLM in Docling

b0084d8

erichare requested a review from jordanrfrazier October 2, 2025 18:35

github-actions Bot added the bug Something isn't working label Oct 2, 2025

[autofix.ci] apply automated fixes

0d2d28a