feat: Add support for advanced parsing with docling in the File Component#9398
Walkthrough
Introduces an optional Docling-based advanced processing path in FileComponent, adding new enums, constants, inputs, dynamic UI behavior, import strategy handling, advanced conversion and export to Markdown, DataFrame-based loading, and fallback to standard processing. Also adjusts outputs and build config updates based on advanced mode and file/path context.
Sequence Diagram(s)
sequenceDiagram
participant UI as UI
participant FC as FileComponent
participant DI as Docling Imports
participant DC as Docling Converter
participant FS as Standard Loader
UI->>FC: process_files(paths, advanced_mode)
alt advanced_mode and docling-compatible
FC->>DI: _try_import_docling()
alt imports available
FC->>DC: _create_advanced_converter()
FC->>DC: convert(file)
DC-->>FC: document
FC->>FC: _export_document(document)
FC-->>UI: DataFrame/Markdown output
else imports unavailable
FC->>FS: process_file_standard(file)
FS-->>FC: text/data
FC-->>UI: standard output
end
else not advanced or incompatible
FC->>FS: process_file_standard(file)
FS-->>FC: text/data
FC-->>UI: standard output
end
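The routing in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the component's code: `process_file`, `is_docling_compatible`, the `try_import` callback, and the abbreviated extension set are all hypothetical stand-ins for the methods named in the diagram.

```python
from typing import Callable, Optional

# Illustrative subset only; the real component supports many more extensions.
DOCLING_EXTENSIONS = frozenset({".pdf", ".docx", ".html"})


def is_docling_compatible(path: str) -> bool:
    """Hypothetical stand-in for the component's compatibility check."""
    return any(path.lower().endswith(ext) for ext in DOCLING_EXTENSIONS)


def process_file(
    path: str,
    advanced_mode: bool,
    try_import: Callable[[], Optional[Callable[[str], str]]],
    standard_loader: Callable[[str], str],
) -> str:
    """Use the advanced path when it is enabled and usable, else fall back."""
    if advanced_mode and is_docling_compatible(path):
        convert = try_import()  # assumed to return None when imports are unavailable
        if convert is not None:
            return convert(path)
    # Reached when advanced mode is off, the file is incompatible,
    # or the optional dependency could not be imported.
    return standard_loader(path)
```

All three "standard output" branches of the diagram collapse into the single final `return`, which keeps the fallback in one place.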
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 4
🧹 Nitpick comments (4)
src/backend/base/langflow/components/data/file.py (4)
303-373: Consider simplifying the import strategy fallback logic
The multiple import strategies with different fallback paths make the code complex and harder to maintain. Consider consolidating the import logic or using a more robust import resolution mechanism.
Consider creating a dedicated module for handling docling imports that encapsulates all the version compatibility logic. This would:
- Centralize the import resolution logic
- Make it easier to update when docling's API changes
- Reduce the complexity in this component file
Would you like me to help create a separate docling_imports.py module to handle this complexity?
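One possible shape for such a consolidated helper is sketched below. It is an assumption about how the fallback could be centralized, not the PR's code: `import_first_available` is a hypothetical name, and the candidate module paths are left to the caller because the review does not list docling's actual import locations.

```python
import importlib
from types import ModuleType
from typing import Iterable, Optional


def import_first_available(candidates: Iterable[str]) -> Optional[ModuleType]:
    """Return the first module in `candidates` that imports, or None.

    Each candidate is tried in order, so version-specific module paths
    can be listed from most to least preferred.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    return None
```

A component could then call this once with its ordered list of module paths and treat a `None` result as the signal to use standard processing.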
415-449: Potential performance issue with file extension checking
The _is_docling_compatible method creates a large list of extensions on every call and uses a linear search. This could be optimized using a set for O(1) lookups.
Move the extensions to a class-level constant and use a set:
+    # Class-level constant for better performance
+    DOCLING_EXTENSIONS = frozenset([
+        ".adoc", ".asciidoc", ".asc", ".bmp", ".csv", ".dotx", ".dotm",
+        ".docm", ".docx", ".htm", ".html", ".jpeg", ".json", ".md", ".pdf",
+        ".png", ".potx", ".ppsx", ".pptm", ".potm", ".ppsm", ".pptx",
+        ".tiff", ".txt", ".xls", ".xlsx", ".xhtml", ".xml", ".webp"
+    ])
+
     def _is_docling_compatible(self, file_path: str) -> bool:
         """Check if file is compatible with Docling processing."""
-        # All VALID_EXTENSIONS are Docling compatible (except for TEXT_FILE_TYPES which may overlap)
-        docling_extensions = [
-            ".adoc",
-            ".asciidoc",
-            # ... (all extensions)
-            ".webp",
-        ]
-        return any(file_path.lower().endswith(ext) for ext in docling_extensions)
+        import os
+        _, ext = os.path.splitext(file_path.lower())
+        return ext in self.DOCLING_EXTENSIONS
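The two lookup styles in this suggestion agree on ordinary file names but are not strictly equivalent. The sketch below compares them on an abbreviated, illustrative extension set; the function names are hypothetical.

```python
import os

# Abbreviated, illustrative subset of the extensions discussed above.
EXTENSIONS = frozenset({".pdf", ".docx", ".md"})


def by_endswith(file_path: str) -> bool:
    """Original style: linear scan with a suffix test."""
    return any(file_path.lower().endswith(ext) for ext in EXTENSIONS)


def by_splitext(file_path: str) -> bool:
    """Suggested style: extract the extension, then do a set lookup."""
    _, ext = os.path.splitext(file_path.lower())
    return ext in EXTENSIONS
```

The difference shows up for a file whose whole name is an extension, such as `.pdf`: `os.path.splitext` treats the leading dot as part of the base name and returns an empty extension, so `by_splitext` rejects it while `by_endswith` accepts it.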
620-640: Improve error handling in _export_document method
The nested try-except blocks with multiple fallbacks could mask underlying issues. Consider logging the specific exception types for better debugging.
Improve error handling with specific exception types:
     def _export_document(self, document: Any, image_ref_mode: type[Enum]) -> str:
         """Export document to Markdown format with placeholder images."""
         try:
             image_mode = (
                 image_ref_mode(self.IMAGE_MODE) if hasattr(image_ref_mode, self.IMAGE_MODE) else self.IMAGE_MODE
             )
             # Always export to Markdown since it's fixed
             return document.export_to_markdown(
                 image_mode=image_mode,
                 image_placeholder=self.md_image_placeholder,
                 page_break_placeholder=self.md_page_break_placeholder,
             )
-        except Exception as e:  # noqa: BLE001
+        except AttributeError as e:
+            self.log(f"Document does not support Markdown export: {e}, trying text export")
+        except Exception as e:  # noqa: BLE001
             self.log(f"Markdown export failed: {e}, using basic text export")
-            # Fallback to basic text export
-            try:
-                return document.export_to_text()
-            except Exception:  # noqa: BLE001
-                return str(document)
+
+        # Fallback to basic text export
+        try:
+            return document.export_to_text()
+        except AttributeError:
+            self.log("Document does not support text export, using string representation")
+            return str(document)
+        except Exception as e:  # noqa: BLE001
+            self.log(f"Text export failed: {e}, using string representation")
+            return str(document)
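The flattened fallback chain proposed here can be exercised in isolation. The following is a simplified sketch under stated assumptions: it calls the export methods without the image and page-break arguments, and `export_with_fallback`, `TextOnlyDoc`, and `BrokenDoc` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def export_with_fallback(document: Any, log: Callable[[str], None]) -> str:
    """Try Markdown export, then text export, then str(), logging each miss."""
    for method_name in ("export_to_markdown", "export_to_text"):
        try:
            return getattr(document, method_name)()
        except AttributeError as e:
            log(f"{method_name} unavailable: {e}")
        except Exception as e:  # noqa: BLE001
            log(f"{method_name} failed: {e}")
    return str(document)


class TextOnlyDoc:
    """Fake document that supports only text export."""

    def export_to_text(self) -> str:
        return "plain"


class BrokenDoc:
    """Fake document whose Markdown export raises and which has no text export."""

    def export_to_markdown(self) -> str:
        raise ValueError("boom")

    def __str__(self) -> str:
        return "broken-doc"
```

One caveat applies to both this sketch and the suggestion above: an `AttributeError` raised inside an export method is indistinguishable from a missing method, so the "unavailable" log message can be misleading in that case.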
123-133: Consider handling the docling installation requirement programmatically
The info message mentions that docling requires installation via uv pip install docling, but users might not notice this requirement until runtime when imports fail.
Consider adding a helper method to check and report missing dependencies more clearly:
def _check_docling_availability(self) -> bool:
    """Check if docling is available and provide helpful installation instructions."""
    if self._try_import_docling() is None:
        self.log(
            "Docling is not installed. To enable advanced parsing features, "
            "please install it with: uv pip install docling"
        )
        return False
    return True
Would you like me to help create a more comprehensive dependency checking mechanism?
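A lighter-weight variant of this check could detect the package without importing it, using the standard library's `importlib.util.find_spec`. This is a sketch of one option, not the PR's approach; `is_installed` and `missing_dependency_message` are hypothetical helpers, and the check is intended for top-level package names only.

```python
import importlib.util
from typing import Optional


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


def missing_dependency_message(package: str, install_command: str) -> Optional[str]:
    """Return an installation hint when `package` is missing, else None."""
    if is_installed(package):
        return None
    return (
        f"{package} is not installed. To enable advanced parsing features, "
        f"please install it with: {install_command}"
    )
```

For example, `missing_dependency_message("docling", "uv pip install docling")` yields the hint only on systems where docling cannot be found, so the message could be surfaced when the advanced option is toggled rather than when conversion fails.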
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/backend/base/langflow/components/data/file.py(5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately
Files:
src/backend/base/langflow/components/data/file.py
{src/backend/**/*.py,tests/**/*.py,Makefile}
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code
Files:
src/backend/base/langflow/components/data/file.py
src/backend/**/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)
In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).
Files:
src/backend/base/langflow/components/data/file.py
🧬 Code Graph Analysis (1)
src/backend/base/langflow/components/data/file.py (6)
src/backend/base/langflow/base/data/base_file.py (7)
data (48-49), data (52-60), BaseFileComponent (23-685), process_files (187-195), BaseFile (31-100), rollup_data (411-462), load_files (363-388)
src/backend/base/langflow/base/data/utils.py (2)
parallel_load_data (185-198), parse_text_file_to_data (137-166)
src/backend/base/langflow/inputs/inputs.py (5)
BoolInput (413-425), DropdownInput (467-491), FileInput (612-622), MessageTextInput (205-256), StrInput (128-184)
src/backend/base/langflow/template/field/base.py (1)
Output (181-257)
src/backend/base/langflow/schema/dataframe.py (1)
DataFrame (11-206)
src/backend/base/langflow/schema/data.py (1)
Data (23-277)
🪛 GitHub Check: Ruff Style Check (3.13)
src/backend/base/langflow/components/data/file.py
[failure] 246-246: Ruff (ARG002)
src/backend/base/langflow/components/data/file.py:246:78: ARG002 Unused method argument: field_value
🪛 GitHub Actions: Ruff Style Check
src/backend/base/langflow/components/data/file.py
[error] 246-246: Ruff: ARG002 Unused method argument: field_value. (command: uv run --only-dev ruff check --output-format=github .)
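ARG002 fires when a method accepts an argument it never reads. If the argument must stay because callers pass it, one common remedy is a targeted suppression on the argument's line. The sketch below uses a hypothetical class and method, since the signature at line 246 is not shown in this report.

```python
from typing import Any, Dict, Optional


class ExampleComponent:
    """Hypothetical component; not the real FileComponent."""

    def update_config(
        self,
        config: Dict[str, Any],
        # Kept for signature compatibility with callers; intentionally unused.
        field_value: Any,  # noqa: ARG002
        field_name: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Record which field changed, ignoring the new value."""
        config["last_changed"] = field_name
        return config
```

Renaming the parameter with a leading underscore also satisfies Ruff's default dummy-variable pattern, but it changes the keyword name and can break callers that pass the argument by keyword.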
Codecov Report
❌ Your patch status has failed because the patch coverage (23.44%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #9398 +/- ##
==========================================
- Coverage 33.96% 33.93% -0.03%
==========================================
Files 1195 1195
Lines 55823 55935 +112
Branches 5370 5331 -39
==========================================
+ Hits 18960 18984 +24
- Misses 36793 36881 +88
Partials 70 70



This pull request adds support for an advanced parsing mode, available for documents such as PDFs, that uses docling to parse the document in a structured manner.
Summary by CodeRabbit
New Features
Enhancements