
feat: Better support for advanced parser in File Component #10048

Merged
erichare merged 14 commits into fix-docling-vlm from fix-file-behavior
Oct 3, 2025

Conversation

@erichare
Collaborator

@erichare erichare commented Sep 30, 2025

This pull request updates the advanced document processing ("Docling") feature in the FileComponent to support processing multiple files at once, as long as all selected files are compatible. Previously, advanced processing was limited to a single file. The changes update both the UI logic and backend processing to reflect this expanded capability.

Advanced document processing enhancements:

  • The advanced_mode option is now shown in the UI even when multiple files are selected, as long as all files are compatible with Docling. Previously, it was only available for a single file.
  • The logic in update_build_config now enables advanced processing if all selected files are non-tabular and Docling-compatible, rather than requiring exactly one file.
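The eligibility rule described above can be sketched as a small predicate. This is an illustration only, not the code in file.py: the function name `advanced_mode_allowed` and the `DOCLING_EXTENSIONS` allow-list are hypothetical, and the tabular extensions are taken from the compatibility check shown later in this conversation.

```python
from pathlib import Path

# Tabular extensions as named in the compatibility check in this PR
# (*.csv, *.xlsx, *.parquet).
TABULAR_EXTENSIONS = {".csv", ".xlsx", ".parquet"}
# Hypothetical allow-list for illustration; the real set lives in file.py
# and may differ.
DOCLING_EXTENSIONS = {".pdf", ".docx", ".pptx", ".html", ".md", ".png", ".jpg"}


def advanced_mode_allowed(file_paths: list[str]) -> bool:
    """Return True only if every selected file is non-tabular and Docling-compatible."""
    if not file_paths:
        # Nothing selected: there is nothing to run advanced parsing on.
        return False
    for path in file_paths:
        ext = Path(path).suffix.lower()
        if ext in TABULAR_EXTENSIONS or ext not in DOCLING_EXTENSIONS:
            return False
    return True
```

Under these assumptions, `advanced_mode_allowed(["a.pdf", "b.docx"])` is true, while adding a single `.csv` file to the selection makes it false, which matches the "all files must be compatible" rule rather than the previous "exactly one file" rule.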

Backend processing updates:

  • The backend processing (process_files) now allows advanced processing for multiple compatible files, updating the docstring and logic accordingly.
  • The advanced processing path in process_file_standard now checks that all files are Docling-compatible before enabling advanced processing, and processes each file individually in a subprocess, aggregating the results.
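A minimal sketch of the "one subprocess per file, then aggregate" pattern follows. It is not the component's implementation: `CHILD_SCRIPT`, `run_one`, and `run_all` are hypothetical names, the result keys are assumed, and the child script only echoes a fake result instead of calling Docling.

```python
import json
import subprocess
import sys

# Stand-in child script (assumption): it echoes a fake result rather than
# invoking Docling, so the sketch runs without extra dependencies.
CHILD_SCRIPT = """
import json, sys
path = sys.argv[1]
print(json.dumps({"file_path": path, "markdown": "# parsed " + path}))
"""


def run_one(path: str, timeout: float = 60.0) -> dict:
    """Process a single file in an isolated Python subprocess."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT, path],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
        check=False,
    )
    if proc.returncode != 0:
        # Report the failure for this file without aborting the whole batch.
        return {"file_path": path, "error": proc.stderr.strip() or "subprocess failed"}
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return {"file_path": path, "error": f"invalid JSON from subprocess: {exc}"}


def run_all(paths: list[str]) -> list[dict]:
    """Run each file separately and collect the per-file results in input order."""
    return [run_one(p) for p in paths]
```

Isolating each file in its own process means a crash or malformed output for one file becomes an error entry in the aggregated list instead of taking down the other files.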

Summary by CodeRabbit

  • New Features

    • Advanced Parser is now visible by default across starter projects.
    • Expanded advanced options: pipeline selection, OCR toggle, and markdown/structured export settings.
    • Multi-file advanced parsing when files are compatible.
    • Enhanced outputs, including markdown and structured results.
  • Bug Fixes

    • Safer file-path validation and clearer error reporting.
    • Consistent aggregation of results for multi-file processing.
  • Refactor

    • Unified advanced parsing flow with isolated processing for improved reliability.
    • Streamlined UI with dynamic visibility for related controls.
  • Chores

    • Updated Google dependency to version 1.117.0 in News Aggregator.

@coderabbitai
Contributor

coderabbitai Bot commented Sep 30, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updates starter project JSONs and the core FileComponent to enable Advanced Parser by default, integrate Docling processing via a subprocess, expand UI inputs/visibility logic, adjust multi-file advanced handling, and bump a dependency in News Aggregator. Core file.py now supports advanced processing across multiple files when all are compatible.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core File Component refactor: src/lfx/src/lfx/components/data/file.py | Advanced Parser UI shown by default; eligibility now requires all selected files be Docling-compatible; consolidates advanced multi-file flow; aggregates results; preserves standard multi-file loading. |
| Starter projects: Advanced Parser defaults & Docling subprocess integration: src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json, .../Portfolio Website Code Generator.json, .../Text Sentiment Analysis.json, .../Vector Store RAG.json | Replaces/expands FileComponent code blocks to use Docling via subprocess; adds/expands UI inputs (advanced_mode, pipeline, ocr_engine, markdown placeholders); sets Advanced Parser show=true; updates code_hash to ee645f9c4966; enhances error/output mapping. |
| Starter project: Dependency bump only: src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json | Updates google package version from 0.8.5 to 1.117.0; no functional logic changes. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User/UI
  participant FC as FileComponent
  participant DL as Docling Subprocess
  participant MAP as Result Mapper
  participant OUT as Outputs

  U->>FC: Provide file paths + Advanced Parser options
  alt All files Docling-compatible AND advanced_mode
    FC->>DL: Invoke subprocess with args (pipeline, ocr, markdown, ...)
    DL-->>FC: JSON results (structured/markdown/errors)
    FC->>MAP: Parse and assemble Data/DataFrame
    MAP-->>OUT: Structured/Markdown/Raw outputs (aggregated)
  else Standard path
    FC-->>OUT: Raw Content / File Path via standard loaders
  end
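The Result Mapper step in the diagram above could look roughly like the sketch below. The result shape (`file_path`, `markdown`, `error` keys) and the names `ParsedOutputs` and `map_results` are assumptions for illustration; the actual component assembles Data/DataFrame objects, which are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedOutputs:
    """Aggregated outputs across all processed files (hypothetical shape)."""

    rows: list[dict] = field(default_factory=list)    # one row per successfully parsed file
    errors: list[dict] = field(default_factory=list)  # one entry per failed file

    @property
    def markdown(self) -> str:
        # Join the per-file markdown in input order, skipping empty entries.
        return "\n\n".join(r["markdown"] for r in self.rows if r.get("markdown"))


def map_results(results: list[dict]) -> ParsedOutputs:
    """Split per-file JSON results into successful rows and error entries."""
    out = ParsedOutputs()
    for result in results:
        if result.get("error"):
            out.errors.append(
                {"file_path": result.get("file_path"), "error": result["error"]}
            )
        else:
            out.rows.append(result)
    return out
```

Keeping errors in a separate list lets the aggregated output stay usable when only some files fail.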
sequenceDiagram
  autonumber
  participant FC as FileComponent(process_files)
  participant CHK as Compatibility Check
  participant DL as Docling Subprocess
  participant AGG as Aggregator

  FC->>CHK: Verify all files not *.csv/*.xlsx/*.parquet
  alt Compatible
    FC->>DL: Process all files via Docling
    DL-->>FC: Per-file JSON results
    FC->>AGG: Collect into final list
    AGG-->>FC: final_return
  else Incompatible
    FC-->>FC: Fall back to standard multi-file processing
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

size:XXL, lgtm

Suggested reviewers

  • edwinjosechittilappilly
  • ogabrielluiz
  • italojohnny

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 3 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Coverage For New Implementations | ❌ Error | The PR introduces major changes to FileComponent’s advanced multi-file processing, subprocess Docling workflows, input validation, and output aggregation but omits any new or updated test files for this functionality. The diff shows only a test for the Vector Store RAG starter project and no tests were added or modified for process_files, process_file_standard, update_build_config, or advanced_mode logic in src/lfx/src/lfx/components/data/file.py, nor are there unit or integration tests validating the new subprocess invocation or multi-file aggregation paths. This lack of coverage fails to ensure correctness or guard against regressions in the new feature areas. | Please add comprehensive tests for the updated FileComponent, including unit tests for the new multi-file advanced processing paths, subprocess-based Docling invocation, input validation, and output aggregation behaviors, as well as integration tests to confirm starter project JSON changes for all five templates, following the existing test naming conventions (test_*.py for backend). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Test Quality And Coverage | ⚠️ Warning | I inspected the repository for new or modified tests related to the FileComponent advanced parser changes and could not find any tests added or updated in this PR; no pytest files reference FileComponent, advanced_mode, or Docling, and no Playwright/E2E tests cover the UI toggle changes. The backend changes introduce new multi-file advanced processing and subprocess-based Docling flows, but there are no tests validating success paths, error handling, or aggregation behavior, nor async-specific patterns where applicable. Comparing the PR branch against origin/main shows only JSON starter project updates and src/lfx/src/lfx/components/data/file.py modifications, with no test files changed. Given the scope and complexity of the new behavior, the absence of targeted tests means the main functionality is not adequately covered, and API/UI behavior is not validated beyond smoke-level defaults. | Add pytest coverage for src/lfx/src/lfx/components/data/file.py covering: enabling advanced_mode for multiple compatible files, subprocess invocation success and error cases (mock subprocess to return structured and markdown outputs), mixed compatible/incompatible file sets, and aggregation across multiple files; also verify standard path remains unchanged when advanced_mode is disabled. For UI, add Playwright tests to assert Advanced Parser visibility logic and toggling behavior when multiple files are selected. If there are API endpoints exposing this functionality, include success and error response tests using TestClient/AsyncClient, ensuring proper async patterns and meaningful assertions beyond smoke tests. |
| Excessive Mock Usage Warning | ⚠️ Warning | | |
| Test File Naming And Structure | ❓ Inconclusive | I scanned the PR branch for test files and structures but found no evidence to evaluate against the criteria. The PR modifies JSON starter projects and a backend Python module, yet there are no test files surfaced matching backend pytest patterns (test_*.py) or frontend Playwright patterns (*.test.ts/tsx), nor clear integration/e2e directories or markers, so I cannot verify naming conventions, structure, or scenario coverage. Without locating relevant tests, I cannot assess setup/teardown usage, descriptive test names, or inclusion of edge cases and negative paths. Additional repository inspection is required to conclude this check. | Please run the attached repository scan to list existing tests and share the results, or point me to the test directories for backend (pytest), frontend (Playwright), and integration/e2e tests. If tests are missing for these changes, add pytest files named test_*.py with descriptive test_... function names, include fixtures/setup/teardown where needed, add Playwright *.test.ts/tsx for UI paths, clearly mark integration tests (e.g., tests/integration with @pytest.mark.integration or e2e folders), and ensure both positive and negative cases plus edge conditions are covered. |
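As a rough illustration of the "mock subprocess" suggestion in the table above, the sketch below patches `subprocess.run` so that one call succeeds and one fails. It does not reproduce the tests later added in test_file_component.py: `parse_with_subprocess` is a hypothetical stand-in for the component's Docling call, and the result keys are assumed.

```python
import json
import subprocess
import sys
from unittest import mock


def parse_with_subprocess(path: str) -> dict:
    """Hypothetical stand-in for the component's Docling subprocess call."""
    proc = subprocess.run(
        [sys.executable, "-c", "pass", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return {"file_path": path, "error": proc.stderr.strip()}
    return json.loads(proc.stdout)


def test_success_and_error_paths():
    # Canned results: the first call succeeds with markdown, the second fails.
    ok = subprocess.CompletedProcess(
        args=[],
        returncode=0,
        stdout=json.dumps({"file_path": "a.pdf", "markdown": "# A"}),
        stderr="",
    )
    bad = subprocess.CompletedProcess(
        args=[], returncode=1, stdout="", stderr="conversion failed"
    )
    with mock.patch("subprocess.run", side_effect=[ok, bad]) as fake_run:
        results = [parse_with_subprocess(p) for p in ("a.pdf", "b.pdf")]

    assert fake_run.call_count == 2
    assert results[0]["markdown"] == "# A"
    assert results[1] == {"file_path": "b.pdf", "error": "conversion failed"}
    return results
```

Because the subprocess is mocked, the test exercises only the success and error handling around the call, not Docling itself; a separate integration test would be needed for real conversions.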
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title “feat: Better support for advanced parser in File Component” succinctly and accurately captures the primary change (enhancing the advanced parser functionality within the FileComponent) without extraneous details, making it clear to readers what the pull request addresses. |

@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 30, 2025
@codecov

codecov Bot commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 24.21%. Comparing base (90f4006) to head (58ca5d1).
⚠️ Report is 32 commits behind head on fix-docling-vlm.

❌ Your project check has failed because the head coverage (47.20%) is below the target coverage (55.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@                 Coverage Diff                 @@
##           fix-docling-vlm   #10048      +/-   ##
===================================================
+ Coverage            24.20%   24.21%   +0.01%     
===================================================
  Files                 1091     1091              
  Lines                40038    40037       -1     
  Branches              5543     5542       -1     
===================================================
+ Hits                  9690     9694       +4     
+ Misses               30177    30172       -5     
  Partials               171      171              
Flag | Coverage Δ
backend | 47.20% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 3 files with indirect coverage changes

@erichare erichare enabled auto-merge September 30, 2025 17:44
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 30, 2025
@erichare erichare disabled auto-merge September 30, 2025 18:40
Comment thread src/lfx/src/lfx/components/data/file.py Outdated
@edwinjosechittilappilly
Collaborator

@erichare If time permits, should we add tests that check what happens when a file is not Docling-compatible?

@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Oct 1, 2025
@erichare erichare enabled auto-merge October 1, 2025 22:38
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Oct 2, 2025
@erichare erichare changed the base branch from main to fix-docling-vlm October 3, 2025 17:59
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Oct 3, 2025
auto-merge was automatically disabled October 3, 2025 18:01

Merge commits are not allowed on this repository

@erichare erichare merged commit 65c3734 into fix-docling-vlm Oct 3, 2025
8 of 9 checks passed
@erichare erichare deleted the fix-file-behavior branch October 3, 2025 18:01
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Oct 3, 2025
@sonarqubecloud

sonarqubecloud Bot commented Oct 3, 2025

github-merge-queue Bot pushed a commit that referenced this pull request Oct 3, 2025
* fix: Proper support for VLM in Docling

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* Update file.py

* [autofix.ci] apply automated fixes

* Update pyproject.toml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update uv.lock

* Fix project specs

* Add jpg as accepted file type

* [autofix.ci] apply automated fixes

* Update dep structure

* One more attempt at getting this right

* And again

* Add docling core

* Update pyproject.toml

* Update deps

* [autofix.ci] apply automated fixes

* Update knowledge_bases.py

* Package version bumps

* Add pytest tests for advanced mode

* Update test_file_component.py

* Update test_file_component.py

* Make pipeline a visible option in advanced mode

* Feat tool mode files (#10107)

* feat: Tool Mode Support for File Components

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* feat: Better support for advanced parser in File Component (#10048)

* feat: Better support for advanced parser in files

* [autofix.ci] apply automated fixes

* Add docling mocked tests

* Update file.py

* Update test_file_component.py

* [autofix.ci] apply automated fixes

* Update News Aggregator.json

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* Update file.py

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Labels

enhancement New feature or request

2 participants