feat: Add support for advanced parsing with docling in the File Component by erichare · Pull Request #9398 · langflow-ai/langflow

erichare · 2025-08-14T15:25:45Z

This pull request adds support for an advanced parsing mode, available for documents such as PDFs, that uses docling to parse the document in a structured manner.

Summary by CodeRabbit

New Features
- Advanced file processing option with OCR selection, pipeline choice, and export to Markdown (image and page-break placeholders supported).
- Expanded supported file types for uploads and processing.
- Single-file advanced export with status feedback.
Enhancements
- Dynamic UI shows/hides advanced options based on file count and mode.
- Automatic fallback to standard processing when advanced tools aren’t available, with clearer progress/error messages.
- Improved outputs compatible with tabular views for easier downstream use.

coderabbitai · 2025-08-14T15:25:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Walkthrough

Introduces an optional Docling-based advanced processing path in FileComponent, adding new enums, constants, inputs, dynamic UI behavior, import strategy handling, advanced conversion and export to Markdown, DataFrame-based loading, and fallback to standard processing. Also adjusts outputs and build config updates based on advanced mode and file/path context.

Changes

Cohort / File(s)	Summary of Changes
File component + Docling integration `src/backend/base/langflow/components/data/file.py`	Rebuilt FileComponent to add Docling-driven advanced processing: new public enums and DoclingImports container; expanded VALID_EXTENSIONS and constants (EXPORT_FORMAT, IMAGE_MODE); new inputs (advanced_mode, pipeline, ocr_engine, md_image_placeholder, md_page_break_placeholder, doc_key); dynamic update_build_config/update_outputs; multi-strategy Docling import (_try_import_docling), converter creation, compatibility checks; advanced processing/export paths (_process_with_docling_and_export, _export_document); DataFrame loader (load_files_advanced); process_files orchestrates Docling-first with standard fallback; name changed to "ile".

Sequence Diagram(s)

sequenceDiagram
  participant UI as UI
  participant FC as FileComponent
  participant DI as Docling Imports
  participant DC as Docling Converter
  participant FS as Standard Loader

  UI->>FC: process_files(paths, advanced_mode)
  alt advanced_mode and docling-compatible
    FC->>DI: _try_import_docling()
    alt imports available
      FC->>DC: _create_advanced_converter()
      FC->>DC: convert(file)
      DC-->>FC: document
      FC->>FC: _export_document(document)
      FC-->>UI: DataFrame/Markdown output
    else imports unavailable
      FC->>FS: process_file_standard(file)
      FS-->>FC: text/data
      FC-->>UI: standard output
    end
  else not advanced or incompatible
    FC->>FS: process_file_standard(file)
    FS-->>FC: text/data
    FC-->>UI: standard output
  end
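
The routing in the diagram above can be sketched as a small standalone function. This is an illustrative sketch only, not the code from the PR: `process_file`, the three callables passed into it, and the shortened `DOCLING_EXTENSIONS` set are hypothetical stand-ins for the `FileComponent` methods named in the walkthrough.

```python
from __future__ import annotations

from collections.abc import Callable
from pathlib import Path

# Hypothetical, shortened set of extensions treated as Docling-compatible.
DOCLING_EXTENSIONS = frozenset({".pdf", ".docx", ".pptx", ".html"})


def process_file(
    path: str,
    advanced_mode: bool,
    import_docling: Callable[[], object | None],
    convert_advanced: Callable[[object, str], str],
    process_standard: Callable[[str], str],
) -> tuple[str, str]:
    """Return (route, output), following the branches of the diagram."""
    compatible = Path(path).suffix.lower() in DOCLING_EXTENSIONS
    if advanced_mode and compatible:
        imports = import_docling()  # None means Docling could not be imported
        if imports is not None:
            return "advanced", convert_advanced(imports, path)
    # Not advanced, incompatible file, or imports unavailable: standard path.
    return "standard", process_standard(path)
```

Passing the three steps in as callables keeps the sketch runnable without Docling installed; in the component they would presumably be methods on the class.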

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat: Better multi-file consistency for File Component #8625: Also modifies the same FileComponent, updating inputs/outputs and the update_outputs method.

Suggested labels

size:XXL

Suggested reviewers

rodrigosnader
edwinjosechittilappilly
Yukiyukiyeah
ogabrielluiz

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

src/backend/base/langflow/components/data/file.py (4)

303-373: Consider simplifying the import strategy fallback logic

The multiple import strategies with different fallback paths make the code complex and harder to maintain. Consider consolidating the import logic or using a more robust import resolution mechanism.

Consider creating a dedicated module for handling docling imports that encapsulates all the version compatibility logic. This would:

- Centralize the import resolution logic
- Make it easier to update when docling's API changes
- Reduce the complexity in this component file

Would you like me to help create a separate docling_imports.py module to handle this complexity?
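
One possible shape for such a module is a single helper that tries candidate module paths in order and returns the first one that imports. This is a minimal sketch under assumptions: the helper name `import_first_available` is hypothetical, and the dotted paths in `CANDIDATE_MODULES` are placeholders that should be checked against the installed docling version.

```python
from __future__ import annotations

import importlib
from types import ModuleType

# Candidate module paths, tried in order. These are assumptions for
# illustration; verify them against the docling release in use.
CANDIDATE_MODULES: tuple[str, ...] = ("docling.document_converter", "docling")


def import_first_available(
    candidates: tuple[str, ...] = CANDIDATE_MODULES,
) -> ModuleType | None:
    """Return the first importable module from candidates, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This strategy is unavailable; fall through to the next one.
            continue
    return None
```

The component would then make one call and branch on `None`, instead of carrying several nested `try`/`except` import blocks of its own.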

415-449: Potential performance issue with file extension checking

The _is_docling_compatible method creates a large list of extensions on every call and uses a linear search. This could be optimized using a set for O(1) lookups.

Move the extensions to a class-level constant and use a set:

+    # Class-level constant for better performance
+    DOCLING_EXTENSIONS = frozenset([
+        ".adoc", ".asciidoc", ".asc", ".bmp", ".csv", ".dotx", ".dotm",
+        ".docm", ".docx", ".htm", ".html", ".jpeg", ".json", ".md", ".pdf",
+        ".png", ".potx", ".ppsx", ".pptm", ".potm", ".ppsm", ".pptx",
+        ".tiff", ".txt", ".xls", ".xlsx", ".xhtml", ".xml", ".webp"
+    ])
+
     def _is_docling_compatible(self, file_path: str) -> bool:
         """Check if file is compatible with Docling processing."""
-        # All VALID_EXTENSIONS are Docling compatible (except for TEXT_FILE_TYPES which may overlap)
-        docling_extensions = [
-            ".adoc",
-            ".asciidoc",
-            # ... (all extensions)
-            ".webp",
-        ]
-        return any(file_path.lower().endswith(ext) for ext in docling_extensions)
+        import os
+        _, ext = os.path.splitext(file_path.lower())
+        return ext in self.DOCLING_EXTENSIONS
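
For reference, a standalone and runnable variant of the suggestion might look like the following. It is a sketch rather than the code in the PR: it uses `pathlib` instead of `os.path.splitext`, and the module-level function name `is_docling_compatible` is chosen here for illustration.

```python
from pathlib import Path

# Built once at import time; membership tests are O(1) on average.
DOCLING_EXTENSIONS = frozenset({
    ".adoc", ".asciidoc", ".asc", ".bmp", ".csv", ".dotx", ".dotm",
    ".docm", ".docx", ".htm", ".html", ".jpeg", ".json", ".md", ".pdf",
    ".png", ".potx", ".ppsx", ".pptm", ".potm", ".ppsm", ".pptx",
    ".tiff", ".txt", ".xls", ".xlsx", ".xhtml", ".xml", ".webp",
})


def is_docling_compatible(file_path: str) -> bool:
    """Check the final suffix of file_path against the known extensions."""
    return Path(file_path).suffix.lower() in DOCLING_EXTENSIONS
```

One behavioral difference from the `endswith` version is worth noting: a file named exactly `.pdf` has no suffix according to both `pathlib` and `os.path.splitext`, so it would be reported as incompatible here, whereas `endswith(".pdf")` would accept it.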

620-640: Improve error handling in _export_document method

The nested try-except blocks with multiple fallbacks could mask underlying issues. Consider logging the specific exception types for better debugging.

Improve error handling with specific exception types:

     def _export_document(self, document: Any, image_ref_mode: type[Enum]) -> str:
         """Export document to Markdown format with placeholder images."""
         try:
             image_mode = (
                 image_ref_mode(self.IMAGE_MODE) if hasattr(image_ref_mode, self.IMAGE_MODE) else self.IMAGE_MODE
             )
 
             # Always export to Markdown since it's fixed
             return document.export_to_markdown(
                 image_mode=image_mode,
                 image_placeholder=self.md_image_placeholder,
                 page_break_placeholder=self.md_page_break_placeholder,
             )
 
-        except Exception as e:  # noqa: BLE001
+        except AttributeError as e:
+            self.log(f"Document does not support Markdown export: {e}, trying text export")
+        except Exception as e:  # noqa: BLE001
             self.log(f"Markdown export failed: {e}, using basic text export")
-            # Fallback to basic text export
-            try:
-                return document.export_to_text()
-            except Exception:  # noqa: BLE001
-                return str(document)
+        
+        # Fallback to basic text export
+        try:
+            return document.export_to_text()
+        except AttributeError:
+            self.log("Document does not support text export, using string representation")
+            return str(document)
+        except Exception as e:  # noqa: BLE001
+            self.log(f"Text export failed: {e}, using string representation")
+            return str(document)
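
The same fallback chain can also be written as a loop over exporter names, which keeps the logging of exception types in one place. This is a simplified sketch, not the PR's implementation: `export_with_fallback` and its `log` parameter are hypothetical, and the exporters are called without the image and page-break arguments shown in the diff.

```python
from collections.abc import Callable
from typing import Any


def export_with_fallback(document: Any, log: Callable[[str], None] = print) -> str:
    """Try each exporter in order; fall back to str(document) if all fail."""
    for method_name in ("export_to_markdown", "export_to_text"):
        method = getattr(document, method_name, None)
        if method is None:
            log(f"{method_name} is not available, trying the next exporter")
            continue
        try:
            return method()
        except Exception as exc:  # noqa: BLE001
            # Record the concrete exception type so failures stay diagnosable.
            log(f"{method_name} failed with {type(exc).__name__}: {exc}")
    return str(document)
```

Checking for the method with `getattr` separates "the document has no such exporter" from "the exporter raised", which an `except AttributeError` clause cannot do if the exporter itself raises `AttributeError` internally.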

123-133: Consider handling the docling installation requirement programmatically

The info message mentions that docling requires installation via uv pip install docling, but users might not notice this requirement until runtime when imports fail.

Consider adding a helper method to check and report missing dependencies more clearly:

def _check_docling_availability(self) -> bool:
    """Check if docling is available and provide helpful installation instructions."""
    if self._try_import_docling() is None:
        self.log(
            "Docling is not installed. To enable advanced parsing features, "
            "please install it with: uv pip install docling"
        )
        return False
    return True

Would you like me to help create a more comprehensive dependency checking mechanism?
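
A lighter-weight check that does not import the package at all can be built on `importlib.util.find_spec` from the standard library. This is a sketch under assumptions: the function name `missing_dependency_message` is hypothetical, and it only detects whether a top-level package is installed, not whether its internals import cleanly.

```python
from __future__ import annotations

import importlib.util


def missing_dependency_message(package: str, install_hint: str) -> str | None:
    """Return an actionable message if package is not installed, else None."""
    # find_spec returns None for a top-level package that cannot be found.
    if importlib.util.find_spec(package) is not None:
        return None
    return (
        f"{package} is not installed. To enable advanced parsing features, "
        f"please install it with: {install_hint}"
    )
```

A caller could run `missing_dependency_message("docling", "uv pip install docling")` when advanced mode is toggled and surface the message before any file is processed.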

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75970e5 and cc843e4.

📒 Files selected for processing (1)

src/backend/base/langflow/components/data/file.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

src/backend/base/langflow/components/data/file.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/base/langflow/components/data/file.py

src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/base/langflow/components/data/file.py

🧬 Code Graph Analysis (1)

src/backend/base/langflow/components/data/file.py (6)

src/backend/base/langflow/base/data/base_file.py (7)

data (48-49)

data (52-60)

BaseFileComponent (23-685)

process_files (187-195)

BaseFile (31-100)

rollup_data (411-462)

load_files (363-388)

src/backend/base/langflow/base/data/utils.py (2)

parallel_load_data (185-198)

parse_text_file_to_data (137-166)

src/backend/base/langflow/inputs/inputs.py (5)

BoolInput (413-425)

DropdownInput (467-491)

FileInput (612-622)

MessageTextInput (205-256)

StrInput (128-184)

src/backend/base/langflow/template/field/base.py (1)

Output (181-257)

src/backend/base/langflow/schema/dataframe.py (1)

DataFrame (11-206)

src/backend/base/langflow/schema/data.py (1)

Data (23-277)

🪛 GitHub Check: Ruff Style Check (3.13)

src/backend/base/langflow/components/data/file.py

[failure] 246-246: Ruff (ARG002)
src/backend/base/langflow/components/data/file.py:246:78: ARG002 Unused method argument: field_value

🪛 GitHub Actions: Ruff Style Check

src/backend/base/langflow/components/data/file.py

[error] 246-246: Ruff: ARG002 Unused method argument: field_value. (command: uv run --only-dev ruff check --output-format=github .)
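
When a method signature is dictated by a caller and one argument is simply not needed, a common way to resolve ARG002 is a targeted suppression on that argument. The sketch below is illustrative only: `ExampleComponent` and the body of `update_build_config` are hypothetical and are not taken from the PR.

```python
from __future__ import annotations


class ExampleComponent:
    """Hypothetical component whose hook signature is fixed by the framework."""

    def update_build_config(
        self,
        build_config: dict,
        # The caller always passes field_value, but this hook does not need it.
        field_value: object,  # noqa: ARG002
        field_name: str | None = None,
    ) -> dict:
        if field_name == "advanced_mode":
            # Reveal a dependent field when the toggle is the one that changed.
            build_config.setdefault("pipeline", {})["show"] = True
        return build_config
```

Renaming the argument with a leading underscore is another option Ruff accepts by default, but it changes the keyword name and could break callers that pass it by keyword.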

codecov · 2025-08-22T03:14:25Z

Codecov Report

❌ Patch coverage is 23.44498% with 160 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.93%. Comparing base (59937ee) to head (6843f59).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/backend/base/langflow/components/data/file.py	23.44%	160 Missing ⚠️

❌ Your patch status has failed because the patch coverage (23.44%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (3.80%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.
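
The patch figures can be cross-checked with a little arithmetic. The line counts computed below (total and covered patch lines) are derived from the reported percentage and missing-line count; they are not stated in the Codecov report itself.

```python
import math

missing_lines = 160          # reported: lines missing coverage
patch_coverage = 0.2344498   # reported: 23.44498%
target_coverage = 0.40       # reported: patch target of 40.00%

# coverage = covered / total and missing = total - covered,
# so total = missing / (1 - coverage).
total_lines = round(missing_lines / (1 - patch_coverage))
covered_lines = total_lines - missing_lines

# Covered lines needed to reach the target, and the shortfall from today.
needed_lines = math.ceil(target_coverage * total_lines)
additional_lines = needed_lines - covered_lines

print(total_lines, covered_lines, needed_lines, additional_lines)
```

Under these derived counts, the patch would need tests covering a few dozen more of its changed lines to clear the 40% target.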

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9398      +/-   ##
==========================================
- Coverage   33.96%   33.93%   -0.03%     
==========================================
  Files        1195     1195              
  Lines       55823    55935     +112     
  Branches     5370     5331      -39     
==========================================
+ Hits        18960    18984      +24     
- Misses      36793    36881      +88     
  Partials       70       70

Flag	Coverage Δ
backend	`56.61% <23.44%> (-0.23%)`	⬇️

Files with missing lines	Coverage Δ
src/backend/base/langflow/components/data/file.py	`27.30% <23.44%> (-28.08%)`	⬇️

... and 3 files with indirect coverage changes

sonarqubecloud · 2025-08-22T04:19:14Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

erichare added 3 commits August 14, 2025 08:00

Docling support for file component

1ece9f3

Name as previous

f1eed2d

Update logic of file path value

dc4e0c2

github-actions Bot added the enhancement New feature or request label Aug 14, 2025

[autofix.ci] apply automated fixes

cc843e4

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

coderabbitai Bot reviewed Aug 14, 2025

Comment thread src/backend/base/langflow/components/data/file.py Outdated

Comment thread src/backend/base/langflow/components/data/file.py Outdated

Comment thread src/backend/base/langflow/components/data/file.py Outdated

Comment thread src/backend/base/langflow/components/data/file.py Outdated

Fix two errors in linting

0e12662

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

erichare requested a review from edwinjosechittilappilly August 14, 2025 15:49

erichare added the DO NOT MERGE Don't Merge this PR label Aug 14, 2025

Latest file component updates

963bc98

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

autofix-ci Bot and others added 2 commits August 14, 2025 18:42

[autofix.ci] apply automated fixes

a0b7bf5

Ruff updates

674a67a

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

[autofix.ci] apply automated fixes

8b08f22

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025

Merge branch 'main' into feat-docling

cb3bb81

github-actions Bot removed the enhancement New feature or request label Aug 14, 2025