
feat: Add support for advanced parsing with docling in the File Component#9398

Merged: erichare merged 31 commits into main from feat-docling on Aug 22, 2025
Conversation

@erichare
Collaborator

@erichare erichare commented Aug 14, 2025

This pull request adds support for an advanced parsing mode, available for documents such as PDFs, that uses docling to parse the document in a structured manner.

Summary by CodeRabbit

  • New Features

    • Advanced file processing option with OCR selection, pipeline choice, and export to Markdown (image and page-break placeholders supported).
    • Expanded supported file types for uploads and processing.
    • Single-file advanced export with status feedback.
  • Enhancements

    • Dynamic UI shows/hides advanced options based on file count and mode.
    • Automatic fallback to standard processing when advanced tools aren’t available, with clearer progress/error messages.
    • Improved outputs compatible with tabular views for easier downstream use.

@coderabbitai
Contributor

coderabbitai Bot commented Aug 14, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Introduces an optional Docling-based advanced processing path in FileComponent, adding new enums, constants, inputs, dynamic UI behavior, import strategy handling, advanced conversion and export to Markdown, DataFrame-based loading, and fallback to standard processing. Also adjusts outputs and build config updates based on advanced mode and file/path context.

Changes

| Cohort / File(s) | Summary of Changes |
|---|---|
| File component + Docling integration: src/backend/base/langflow/components/data/file.py | Rebuilt FileComponent to add Docling-driven advanced processing: new public enums and a DoclingImports container; expanded VALID_EXTENSIONS and constants (EXPORT_FORMAT, IMAGE_MODE); new inputs (advanced_mode, pipeline, ocr_engine, md_image_placeholder, md_page_break_placeholder, doc_key); dynamic update_build_config/update_outputs; multi-strategy Docling import (_try_import_docling), converter creation, and compatibility checks; advanced processing/export paths (_process_with_docling_and_export, _export_document); DataFrame loader (load_files_advanced); process_files tries Docling first and falls back to standard processing; name changed to "File". |

Sequence Diagram(s)

sequenceDiagram
  participant UI as UI
  participant FC as FileComponent
  participant DI as Docling Imports
  participant DC as Docling Converter
  participant FS as Standard Loader

  UI->>FC: process_files(paths, advanced_mode)
  alt advanced_mode and docling-compatible
    FC->>DI: _try_import_docling()
    alt imports available
      FC->>DC: _create_advanced_converter()
      FC->>DC: convert(file)
      DC-->>FC: document
      FC->>FC: _export_document(document)
      FC-->>UI: DataFrame/Markdown output
    else imports unavailable
      FC->>FS: process_file_standard(file)
      FS-->>FC: text/data
      FC-->>UI: standard output
    end
  else not advanced or incompatible
    FC->>FS: process_file_standard(file)
    FS-->>FC: text/data
    FC-->>UI: standard output
  end
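
The branching in the diagram above can be sketched in a few lines. This is a minimal illustration rather than the PR's implementation: `process_with_fallback`, `is_compatible`, `advanced_loader`, and `standard_loader` are hypothetical names, and passing `None` as the advanced loader is an assumed way of modelling the "imports unavailable" branch.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type alias: a loader takes a file path and returns text.
Loader = Callable[[str], str]


def process_with_fallback(
    paths: Iterable[str],
    advanced_mode: bool,
    is_compatible: Callable[[str], bool],
    advanced_loader: Optional[Loader],
    standard_loader: Loader,
) -> list[str]:
    """Use the advanced loader when possible, otherwise the standard one."""
    results = []
    for path in paths:
        use_advanced = (
            advanced_mode
            and advanced_loader is not None  # None models "imports unavailable"
            and is_compatible(path)
        )
        loader = advanced_loader if use_advanced else standard_loader
        results.append(loader(path))
    return results
```

Keeping the decision in one boolean makes every path in the diagram end in exactly one loader call per file, so each branch can be exercised in a unit test with stub loaders.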

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

size:XXL

Suggested reviewers

  • rodrigosnader
  • edwinjosechittilappilly
  • Yukiyukiyeah
  • ogabrielluiz

@github-actions github-actions Bot added the enhancement New feature or request label Aug 14, 2025
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 14, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
src/backend/base/langflow/components/data/file.py (4)

303-373: Consider simplifying the import strategy fallback logic

The multiple import strategies with different fallback paths make the code complex and harder to maintain. Consider consolidating the import logic or using a more robust import resolution mechanism.

Consider creating a dedicated module for handling docling imports that encapsulates all the version compatibility logic. This would:

  1. Centralize the import resolution logic
  2. Make it easier to update when docling's API changes
  3. Reduce the complexity in this component file

Would you like me to help create a separate docling_imports.py module to handle this complexity?
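
One possible shape for such a module is a single helper that walks an ordered list of candidate import locations. The sketch below is an assumption about how the consolidation could look, not code from the PR: `import_first_available` is a hypothetical name, and the candidate list a caller passes in would need to be checked against the installed docling version.

```python
import importlib
from typing import Any, Optional, Sequence, Tuple


def import_first_available(candidates: Sequence[Tuple[str, str]]) -> Optional[Any]:
    """Return the first (module, attribute) pair that resolves, or None.

    Each candidate is tried in order, so newer import paths can be listed
    first and older ones kept as fallbacks.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            # Missing module or missing attribute: try the next candidate.
            continue
    return None
```

A caller might pass something like `[("docling.document_converter", "DocumentConverter")]` (an assumed path, to be verified) and treat a `None` result as the signal to fall back to standard processing.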


415-449: Potential performance issue with file extension checking

The _is_docling_compatible method creates a large list of extensions on every call and uses a linear search. This could be optimized using a set for O(1) lookups.

Move the extensions to a class-level constant and use a set:

+    # Class-level constant for better performance
+    DOCLING_EXTENSIONS = frozenset([
+        ".adoc", ".asciidoc", ".asc", ".bmp", ".csv", ".dotx", ".dotm",
+        ".docm", ".docx", ".htm", ".html", ".jpeg", ".json", ".md", ".pdf",
+        ".png", ".potx", ".ppsx", ".pptm", ".potm", ".ppsm", ".pptx",
+        ".tiff", ".txt", ".xls", ".xlsx", ".xhtml", ".xml", ".webp"
+    ])
+
     def _is_docling_compatible(self, file_path: str) -> bool:
         """Check if file is compatible with Docling processing."""
-        # All VALID_EXTENSIONS are Docling compatible (except for TEXT_FILE_TYPES which may overlap)
-        docling_extensions = [
-            ".adoc",
-            ".asciidoc",
-            # ... (all extensions)
-            ".webp",
-        ]
-        return any(file_path.lower().endswith(ext) for ext in docling_extensions)
+        import os
+        _, ext = os.path.splitext(file_path.lower())
+        return ext in self.DOCLING_EXTENSIONS
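
The two approaches are not strictly equivalent, which is worth checking before swapping one for the other. The sketch below compares them on a small, illustrative subset of extensions (the PR's full list is not reproduced); `by_endswith` and `by_splitext` are hypothetical names.

```python
import os

# Illustrative subset only; not the component's full extension list.
DOCLING_EXTENSIONS = frozenset({".pdf", ".docx", ".md"})


def by_endswith(file_path: str) -> bool:
    """Original style: linear scan with a suffix match."""
    return any(file_path.lower().endswith(ext) for ext in DOCLING_EXTENSIONS)


def by_splitext(file_path: str) -> bool:
    """Suggested style: extract the extension once, then do a set lookup."""
    _, ext = os.path.splitext(file_path.lower())
    return ext in DOCLING_EXTENSIONS


# Both agree on ordinary names.
print(by_endswith("Report.PDF"), by_splitext("Report.PDF"))
# They differ on a file named exactly ".pdf": os.path.splitext treats a
# leading dot as part of the base name and returns an empty extension.
print(by_endswith(".pdf"), by_splitext(".pdf"))
```

For typical uploads the difference does not matter, but dotfile-style names are a case where the set-based version is stricter.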

620-640: Improve error handling in _export_document method

The nested try-except blocks with multiple fallbacks could mask underlying issues. Consider logging the specific exception types for better debugging.

Improve error handling with specific exception types:

     def _export_document(self, document: Any, image_ref_mode: type[Enum]) -> str:
         """Export document to Markdown format with placeholder images."""
         try:
             image_mode = (
                 image_ref_mode(self.IMAGE_MODE) if hasattr(image_ref_mode, self.IMAGE_MODE) else self.IMAGE_MODE
             )
 
             # Always export to Markdown since it's fixed
             return document.export_to_markdown(
                 image_mode=image_mode,
                 image_placeholder=self.md_image_placeholder,
                 page_break_placeholder=self.md_page_break_placeholder,
             )
 
-        except Exception as e:  # noqa: BLE001
+        except AttributeError as e:
+            self.log(f"Document does not support Markdown export: {e}, trying text export")
+        except Exception as e:  # noqa: BLE001
             self.log(f"Markdown export failed: {e}, using basic text export")
-            # Fallback to basic text export
-            try:
-                return document.export_to_text()
-            except Exception:  # noqa: BLE001
-                return str(document)
+        
+        # Fallback to basic text export
+        try:
+            return document.export_to_text()
+        except AttributeError:
+            self.log("Document does not support text export, using string representation")
+            return str(document)
+        except Exception as e:  # noqa: BLE001
+            self.log(f"Text export failed: {e}, using string representation")
+            return str(document)
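
The same fallback chain can also be written as a loop over export methods, which avoids nesting entirely. This is a simplified sketch, not the component's code: `export_with_fallback` and the `log` callback are hypothetical, and the Markdown keyword arguments from the diff above are omitted for brevity.

```python
from typing import Any, Callable


def export_with_fallback(document: Any, log: Callable[[str], None]) -> str:
    """Try Markdown, then plain text, then str(), logging each failure."""
    for method_name in ("export_to_markdown", "export_to_text"):
        try:
            return getattr(document, method_name)()
        except AttributeError as exc:
            # Also catches AttributeError raised inside the method itself.
            log(f"{method_name} unavailable: {exc}")
        except Exception as exc:  # noqa: BLE001 - deliberate catch-all
            log(f"{method_name} failed: {type(exc).__name__}: {exc}")
    # Last resort: the object's string representation.
    return str(document)
```

Logging the exception type name alongside the message is one way to address the concern that broad handlers hide the underlying cause.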

123-133: Consider handling the docling installation requirement programmatically

The info message mentions that docling requires installation via uv pip install docling, but users might not notice this requirement until runtime when imports fail.

Consider adding a helper method to check and report missing dependencies more clearly:

def _check_docling_availability(self) -> bool:
    """Check if docling is available and provide helpful installation instructions."""
    if self._try_import_docling() is None:
        self.log(
            "Docling is not installed. To enable advanced parsing features, "
            "please install it with: uv pip install docling"
        )
        return False
    return True

Would you like me to help create a more comprehensive dependency checking mechanism?
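
A lighter-weight alternative is to check whether the package can be located without importing it at all. The sketch below uses the standard library's `importlib.util.find_spec`; `is_package_available` and `availability_message` are hypothetical helper names, and the install command is the one quoted in the comment above.

```python
import importlib.util


def is_package_available(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


def availability_message(package_name: str) -> str:
    """Build a user-facing hint; the wording here is illustrative only."""
    if is_package_available(package_name):
        return f"{package_name} is available."
    return (
        f"{package_name} is not installed. To enable advanced parsing, "
        f"install it with: uv pip install {package_name}"
    )
```

Because nothing is imported, this check is cheap enough to run when the UI toggles advanced mode, before any file is processed. It only confirms that the package can be found, not that its internal API matches what the component expects.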

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 75970e5 and cc843e4.

📒 Files selected for processing (1)
  • src/backend/base/langflow/components/data/file.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

  • src/backend/base/langflow/components/data/file.py
{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

  • src/backend/base/langflow/components/data/file.py
src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/base/langflow/components/data/file.py
🧬 Code Graph Analysis (1)
src/backend/base/langflow/components/data/file.py (6)
src/backend/base/langflow/base/data/base_file.py (7)
  • data (48-49)
  • data (52-60)
  • BaseFileComponent (23-685)
  • process_files (187-195)
  • BaseFile (31-100)
  • rollup_data (411-462)
  • load_files (363-388)
src/backend/base/langflow/base/data/utils.py (2)
  • parallel_load_data (185-198)
  • parse_text_file_to_data (137-166)
src/backend/base/langflow/inputs/inputs.py (5)
  • BoolInput (413-425)
  • DropdownInput (467-491)
  • FileInput (612-622)
  • MessageTextInput (205-256)
  • StrInput (128-184)
src/backend/base/langflow/template/field/base.py (1)
  • Output (181-257)
src/backend/base/langflow/schema/dataframe.py (1)
  • DataFrame (11-206)
src/backend/base/langflow/schema/data.py (1)
  • Data (23-277)
🪛 GitHub Check: Ruff Style Check (3.13)
src/backend/base/langflow/components/data/file.py

[failure] 246-246: Ruff (ARG002)
src/backend/base/langflow/components/data/file.py:246:78: ARG002 Unused method argument: field_value

🪛 GitHub Actions: Ruff Style Check
src/backend/base/langflow/components/data/file.py

[error] 246-246: Ruff: ARG002 Unused method argument: field_value. (command: uv run --only-dev ruff check --output-format=github .)

@erichare erichare added the DO NOT MERGE Don't Merge this PR label Aug 14, 2025
@github-actions github-actions Bot removed the enhancement New feature or request label Aug 14, 2025
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Aug 22, 2025
@codecov

codecov Bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 23.44498% with 160 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.93%. Comparing base (59937ee) to head (6843f59).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/backend/base/langflow/components/data/file.py | 23.44% | 160 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (23.44%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (3.80%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #9398      +/-   ##
==========================================
- Coverage   33.96%   33.93%   -0.03%     
==========================================
  Files        1195     1195              
  Lines       55823    55935     +112     
  Branches     5370     5331      -39     
==========================================
+ Hits        18960    18984      +24     
- Misses      36793    36881      +88     
  Partials       70       70              
| Flag | Coverage Δ |
|---|---|
| backend | 56.61% <23.44%> (-0.23%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| src/backend/base/langflow/components/data/file.py | 27.30% <23.44%> (-28.08%) ⬇️ |

... and 3 files with indirect coverage changes



@erichare erichare added this pull request to the merge queue Aug 22, 2025
Merged via the queue into main with commit 462b630 Aug 22, 2025
129 of 134 checks passed

Labels

enhancement New feature or request lgtm This PR has been approved by a maintainer


3 participants