feat: S3 file size and associations to flows by ricofurtado · Pull Request #10819 · langflow-ai/langflow

ricofurtado · 2025-12-01T19:48:37Z

This pull request introduces significant improvements to the file storage API, focusing on enhancing file metadata, supporting file usage tracking, and updating related tests and documentation. The changes standardize the return value of the list_files method to include file metadata, add an is_used flag to indicate file usage within flows, and update both local and S3 storage implementations. Related tests and documentation have been revised to reflect these updates.

File Metadata & Usage Tracking Enhancements:

The list_files method in both local and S3 storage backends now returns a list of dictionaries containing file metadata (name and size) instead of just file names. The abstract interface and all usages have been updated accordingly. [1] [2] [3] [4] [5]
The file listing API (api/v1/files.py) now includes an is_used flag for each file, indicating whether the file is referenced in the flow's nodes. The logic for determining file usage is implemented in a new utility function, which is also exported for use elsewhere. [1] [2] [3] [4]
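
For orientation, here is a minimal sketch of the new listing shape. It assumes one local directory per flow, and it assumes a file counts as used when its name appears in any string nested under the flow's nodes. The helpers `list_files_with_metadata` and `annotate_usage`, and the node-scanning rule inside this `is_file_used`, are illustrative assumptions and not the PR's actual code.

```python
from __future__ import annotations

import asyncio
from pathlib import Path
from typing import Any


def _contains(value: Any, needle: str) -> bool:
    """Recursively look for `needle` inside strings nested in dicts and lists."""
    if isinstance(value, str):
        return needle in value
    if isinstance(value, dict):
        return any(_contains(v, needle) for v in value.values())
    if isinstance(value, list):
        return any(_contains(v, needle) for v in value)
    return False


def is_file_used(flow_data: dict | None, file_name: str) -> bool:
    """Assumed rule: a file is used if any node mentions its name."""
    if not flow_data or "nodes" not in flow_data:
        return False
    return _contains(flow_data["nodes"], file_name)


async def list_files_with_metadata(folder: Path) -> list[dict]:
    """Return [{'name': ..., 'size': ...}] for regular files in `folder`."""

    def _scan() -> list[dict]:
        entries = (
            {"name": p.name, "size": p.stat().st_size}
            for p in folder.iterdir()
            if p.is_file()
        )
        return sorted(entries, key=lambda f: f["name"])

    # Run the blocking filesystem calls off the event loop.
    return await asyncio.to_thread(_scan)


def annotate_usage(files: list[dict], flow_data: dict | None) -> list[dict]:
    """Attach an `is_used` boolean to each metadata dict."""
    return [{**f, "is_used": is_file_used(flow_data, f["name"])} for f in files]
```

Callers that previously received `list[str]` would read `entry["name"]` from each dict instead of using the entry directly.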

Testing and Validation Updates:

All relevant integration and unit tests have been updated to validate the new file metadata structure and the presence of the is_used flag. Assertions now check for name, size, and is_used fields in API responses, ensuring robust coverage of the new functionality. [1] [2] [3] [4] [5]

Documentation Improvements:

A comprehensive markdown document (langflow-files-api-comparison.md) has been added, comparing v1 and v2 file APIs. It details the differences in file association, metadata, batch operations, and UI/LFX support, providing clear guidance for developers and users.

Internal Refactoring & Maintenance:

Adjusted orphaned record cleanup logic to accommodate the new file metadata structure, ensuring that file deletions reference the correct file name. [1] [2] [3]
Minor code and test maintenance, such as correcting line numbers and adding missing fields in test fixtures. [1] [2]

Together, these changes give file listings name, size, and usage metadata in both the local and S3 storage backends, which later file operations and UI features can build on.

Summary by CodeRabbit

Release Notes

New Features
- Files now display usage status and size metadata in listings
- Files actively used in flows are now protected from deletion
- Improved filename sanitization for file uploads and downloads
- Enhanced batch delete operations with detailed status reporting and error information
- Better support for non-ASCII filenames during file downloads
Documentation
- Added comprehensive File API comparison documentation


…services, updating tests.

… response, and add API comparison documentation.

…ndling APIs

coderabbitai · 2025-12-01T19:48:54Z

Walkthrough

Storage service layer refactored to return file metadata dictionaries with name and size instead of plain strings. File deletion enhanced with in-use detection and batch error handling. Filename sanitization added for security. New utility functions detect file usage within flows. Tests updated to validate new metadata structure and deletion behavior.

Changes

Cohort / File(s)	Change Summary
Storage Service Interface Updates `src/backend/base/langflow/services/storage/service.py`, `src/backend/base/langflow/services/storage/local.py`, `src/backend/base/langflow/services/storage/s3.py`	Storage service list_files signature changed from returning `list[str]` to `list[dict]` with `name` and `size` fields; implementations updated to build metadata dictionaries and compute file sizes via async stat operations or S3 object metadata.
Storage Cleanup Task `src/backend/base/langflow/services/task/temp_flow_cleanup.py`	Orphaned flow cleanup updated to access file names via `file["name"]` from new dict structure; added rmdir for empty flow directories after file deletion; sqlalchemy.delete import added.
File API v2 Enhancements `src/backend/base/langflow/api/v2/files.py`	Added filename sanitization (sanitize_filename, sanitize_content_disposition), file usage detection (is_file_used, is_file_in_use, get_user_flows), storage helpers (try_delete_from_storage, delete_from_storage), and save_file_routine; batch delete refined to skip in-use files and aggregate failures; uploads and edits apply sanitization; downloads use RFC 5987 encoding.
File API v1 Usage Tracking `src/backend/base/langflow/api/v1/files.py`	Added is_file_used helper and integrated into list_files to attach is_used boolean flag to each file entry.
Utility Functions Export `src/backend/base/langflow/api/utils/__init__.py`, `src/backend/base/langflow/api/utils/core.py`	New is_file_used utility function added to core module and exported via the __all__ list in __init__.py for cross-module accessibility.
Integration Tests `src/backend/tests/integration/storage/test_s3_storage_service.py`	Test expectations updated to extract file names from returned dicts and validate name and size fields; flow isolation assertions adapted to dict-based results.
Unit Tests – File APIs `src/backend/tests/unit/api/v1/test_files.py`, `src/backend/tests/unit/api/v2/test_files.py`	v1 tests expanded to validate is_used, size, and structured response; v2 tests updated to handle files_not_deleted field in delete responses, mock flows queries, and verify in-use file rejection; new test_delete_file_in_use added.
Unit Tests – Storage Services `src/backend/tests/unit/services/storage/test_local_storage_service.py`	Test assertions updated to extract file names from dict entries and validate name/size fields; membership and exclusion checks adapted to dict-based list structure.
Configuration & Documentation `.secrets.baseline`, `langflow-files-api-comparison.md`	.secrets.baseline entry for input_mixin.py repositioned and line number adjusted for test_files.py; new documentation file contrasting v1 flow-based and v2 user-based file API architectures.
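
The bodies of `sanitize_filename` and `sanitize_content_disposition` are not shown in this thread. The following is an independent sketch of the general technique the v2 row describes: strip path components and control characters, then emit a Content-Disposition header with an ASCII fallback plus an RFC 5987 `filename*` parameter. The function names `sketch_sanitize_filename` and `sketch_content_disposition` and the exact character rules are assumptions, not the PR's implementation.

```python
import re
import unicodedata
from urllib.parse import quote


def sketch_sanitize_filename(name: str) -> str:
    """Drop path components and control characters from a client-supplied name."""
    # Keep only the last path component, whichever separator was used.
    name = name.replace("\\", "/").split("/")[-1]
    # Remove control and format characters (CR/LF would allow header injection).
    name = "".join(ch for ch in name if unicodedata.category(ch)[0] != "C")
    # Trim surrounding whitespace and leading/trailing dots ("..", ".hidden").
    name = name.strip().strip(".")
    return name or "file"


def sketch_content_disposition(name: str) -> str:
    """Build an attachment header with an ASCII fallback and an RFC 5987 filename*."""
    safe = sketch_sanitize_filename(name)
    # ASCII fallback for older clients: replace anything outside a conservative set.
    fallback = re.sub(r"[^A-Za-z0-9._-]", "_", safe)
    # Percent-encode the UTF-8 bytes; quote() leaves only unreserved characters as-is.
    encoded = quote(safe, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"
```

Percent-encoding with `safe=""` encodes more characters than RFC 5987 strictly requires, which is still valid and keeps the rule simple to audit.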

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

Storage signature cascade: list_files changes propagate across three storage implementations and affect multiple callsites (v1/v2 files.py, cleanup task, tests).
Deletion logic in v2/files.py: New in-use detection, batch error aggregation, and storage failure handling add decision points and require careful state management.
File sanitization: Security-critical path for filename handling; RFC 5987 encoding and ASCII quoting logic should be validated for correctness.
Test coverage density: Multiple test files updated with dict-based assertions; verify all new behavior (is_used flag, files_not_deleted, in-use rejection) is properly covered.
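
The decision points in the batch delete path can be sketched as follows. This is a simplified illustration under stated assumptions: the `in_use` and `delete` callables stand in for the PR's database and storage helpers, the function name `batch_delete` is hypothetical, and the shape of each `files_not_deleted` entry is invented for the example (only the field name `files_not_deleted` comes from the change summary).

```python
import asyncio
from typing import Awaitable, Callable


async def batch_delete(
    names: list[str],
    in_use: Callable[[str], Awaitable[bool]],
    delete: Callable[[str], Awaitable[None]],
) -> dict:
    """Delete each file unless it is in use; collect failures instead of aborting."""
    deleted: list[str] = []
    not_deleted: list[dict] = []
    for name in names:
        # Skip files that are still referenced by a flow.
        if await in_use(name):
            not_deleted.append({"name": name, "reason": "in use"})
            continue
        try:
            await delete(name)
        except Exception as exc:  # aggregate storage errors per file
            not_deleted.append({"name": name, "reason": str(exc)})
        else:
            deleted.append(name)
    return {"deleted": deleted, "files_not_deleted": not_deleted}
```

Handling each file independently means one storage failure does not block the rest of the batch, at the cost of a response that reports partial success.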

Possibly related PRs

feat: add s3 file storage implementation #10526: Directly modifies storage subsystem interfaces and file listing implementations referenced in this PR.
feat: migrate from loguru to structlog #9321: Adds file-usage helpers (is_file_used / is_file_in_use) to v2/files.py, overlapping with this PR's usage detection logic.
release: merge release-v1.6.0 into main #9889: Modifies v2/files.py for per-user MCP file handling; shares file API enhancement context with this PR.

Suggested labels

enhancement, size:M, lgtm

Suggested reviewers

ogabrielluiz
Adam-Aghili
jordanrfrazier

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 3 warnings)

Check name	Status	Explanation	Resolution
Test Coverage For New Implementations	❌ Error	PR lacks comprehensive unit test coverage for new functionality and fails to catch two critical unaddressed issues: ImportError bug and code duplication of is_file_used across three files.	Fix ImportError by adding is_file_used to imports in __init__.py, deduplicate is_file_used by centralizing it, add unit tests for sanitization and helper functions, then verify tests catch the import error.
Docstring Coverage	⚠️ Warning	Docstring coverage is 78.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Quality And Coverage	⚠️ Warning	PR lacks comprehensive test coverage for security-critical functions sanitize_filename() and sanitize_content_disposition(), and new async helpers lack isolated unit tests.	Add dedicated unit tests for sanitize_filename(), sanitize_content_disposition(), is_file_used(), save_file_routine(), and try_delete_from_storage() covering edge cases and error scenarios.
Excessive Mock Usage Warning	⚠️ Warning	Pull request shows excessive mock usage with brittle interdependencies and repeated patterns, indicating poor test design that couples tests to implementation details rather than testing actual behavior.	Refactor using integration test patterns with real session_scope() and model instantiation, or consolidate mocks into shared pytest fixtures and eliminate side_effect chains that assume exact call sequences.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Test File Naming And Structure	✅ Passed	All test files follow test_*.py naming pattern with proper pytest structure, class-based organization across integration and unit directories, and comprehensive coverage of initialization, file operations, streaming, deletion, and edge cases.
Title check	✅ Passed	The pull request title 'feat: S3 file size and associations to flows' directly aligns with the main changes: standardizing list_files to return file metadata (name and size) and adding file usage tracking (is_used flag) to show associations between files and flows.


codecov · 2025-12-01T19:51:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.39%. Comparing base (5226daa) to head (ef63f8d).
⚠️ Report is 406 commits behind head on main.

❌ Your project check has failed because the head coverage (40.04%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10819      +/-   ##
==========================================
- Coverage   32.44%   32.39%   -0.06%     
==========================================
  Files        1367     1367              
  Lines       63315    63235      -80     
  Branches     9357     9358       +1     
==========================================
- Hits        20544    20482      -62     
+ Misses      41738    41720      -18     
  Partials     1033     1033

Flag	Coverage Δ
lfx	`40.04% <ø> (-0.01%)`	⬇️


Files with missing lines	Coverage Δ
src/backend/base/langflow/api/utils/core.py	`62.44% <ø> (ø)`
src/backend/base/langflow/api/v1/files.py	`66.14% <ø> (ø)`
src/backend/base/langflow/api/v2/files.py	`59.24% <ø> (-3.12%)`	⬇️
...rc/backend/base/langflow/services/storage/local.py	`85.54% <ø> (-0.35%)`	⬇️
src/backend/base/langflow/services/storage/s3.py	`11.88% <ø> (-0.53%)`	⬇️
.../backend/base/langflow/services/storage/service.py	`78.12% <ø> (-0.67%)`	⬇️
...d/base/langflow/services/task/temp_flow_cleanup.py	`62.16% <ø> (+0.62%)`	⬆️

... and 4 files with indirect coverage changes


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/backend/tests/integration/storage/test_s3_storage_service.py (2)
450-452: Bug: Assertion incompatible with new return type.

Line 452 compares file_name (a string) directly to files (now a list[dict]). This assertion will always fail since "to_delete.txt" in [{"name": "to_delete.txt", "size": 9}] is False.

Apply this diff:
         # Verify it exists
         files = await s3_storage_service.list_files(test_flow_id)
-        assert file_name in files
+        assert file_name in [f["name"] for f in files]
578-581: Bug: Same assertion incompatibility with new return type.

Similar to the previous issue, this assertion compares strings to a list of dicts and will fail.

Apply this diff:
             # Verify all files exist
             listed = await s3_storage_service.list_files(test_flow_id)
             assert len(listed) == 5
+            listed_names = [f["name"] for f in listed]
             for file_name in file_names:
-                assert file_name in listed
+                assert file_name in listed_names
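
Both failures above come from the same cause, shown here in isolation. The helper name `file_names` is hypothetical; the sample entry mirrors the one quoted in the first comment.

```python
def file_names(files: list[dict]) -> list[str]:
    """Extract the `name` field from list_files-style metadata dicts."""
    return [f["name"] for f in files]


listed = [{"name": "to_delete.txt", "size": 9}]

# A bare string never equals a dict, so this membership test is False.
old_style_check = "to_delete.txt" in listed
# Comparing against the extracted names gives the intended result.
new_style_check = "to_delete.txt" in file_names(listed)
```

A small shared helper like this in the test module would keep the dict-to-name conversion in one place instead of repeating the comprehension in every assertion.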

🧹 Nitpick comments (3)

src/backend/base/langflow/api/utils/core.py (1)
416-433: Code duplication detected across three files.

This exact implementation of is_file_used exists in three locations:

src/backend/base/langflow/api/utils/core.py (this file)

src/backend/base/langflow/api/v1/files.py (lines 217-234)

src/backend/base/langflow/api/v2/files.py (lines 189-206)

Since this utility is now exported from api/utils, the other two files should import from here rather than duplicating the logic.
# In src/backend/base/langflow/api/v1/files.py and v2/files.py:
+from langflow.api.utils import is_file_used
-def is_file_used(flow_data: dict | None, file_name: str) -> bool:
-    """Check if a file is used in the flow."""
-    if not flow_data or "nodes" not in flow_data:
-        return False
-    ...
src/backend/base/langflow/api/v2/files.py (1)
190-207: Duplicate implementation - import from api/utils instead.

This is_file_used function is identical to the one in src/backend/base/langflow/api/utils/core.py. Since it's exported from utils, this file should import it rather than redefine it.
+from langflow.api.utils import is_file_used
+
 async def is_file_in_use(session: DbSession, user_id: uuid.UUID, file_name: str) -> bool:
     """Check if a file is used in any of the user's flows."""
     flows = await get_user_flows(session, user_id)
     return any(is_file_used(flow.data, file_name) for flow in flows)
-
-
-def is_file_used(flow_data: dict | None, file_name: str) -> bool:
-    """Check if a file is used in the flow."""
-    ...
src/backend/base/langflow/services/task/temp_flow_cleanup.py (1)
47-59: Clarify orphaned_flow_ids typing and avoid type: ignore

The cleanup logic is sound, but a couple of details are worth tightening up:

The type: ignore[arg-type] on the delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)) call suggests a mismatch between the inferred type of orphaned_flow_ids and what in_ expects.

The same orphaned_flow_ids iterable is later used as for flow_id in orphaned_flow_ids: and passed as str(flow_id) into list_files / delete_file and for constructing flow_dir. If session.exec(...).all() is returning row objects instead of bare scalar IDs, str(flow_id) will not match the actual flow ID string and both DB delete semantics and storage cleanup targets become brittle.

To make this robust and drop the type: ignore, consider shaping orphaned_flow_ids explicitly as a list of scalar IDs via .scalars().all() and annotating it:
-                orphaned_flow_ids = (
-                    await session.exec(
-                        select(col(table.flow_id).distinct()).where(col(table.flow_id).not_in(flow_ids_subquery))
-                    )
-                ).all()
+                result = await session.exec(
+                    select(col(table.flow_id))
+                    .where(col(table.flow_id).not_in(flow_ids_subquery))
+                    .distinct()
+                )
+                orphaned_flow_ids: list[str] = result.scalars().all()
...
-                    await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))  # type: ignore[arg-type]
+                    await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))
This also guarantees that the flow_id used for file listing/deletion and directory removal is the plain ID value you expect.

The switch to file["name"] for delete_file and logging correctly aligns with the new list_files metadata format.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 056faea and ef63f8d.

📒 Files selected for processing (14)

.secrets.baseline (3 hunks)
langflow-files-api-comparison.md (1 hunks)
src/backend/base/langflow/api/utils/__init__.py (1 hunks)
src/backend/base/langflow/api/utils/core.py (1 hunks)
src/backend/base/langflow/api/v1/files.py (2 hunks)
src/backend/base/langflow/api/v2/files.py (11 hunks)
src/backend/base/langflow/services/storage/local.py (2 hunks)
src/backend/base/langflow/services/storage/s3.py (2 hunks)
src/backend/base/langflow/services/storage/service.py (1 hunks)
src/backend/base/langflow/services/task/temp_flow_cleanup.py (3 hunks)
src/backend/tests/integration/storage/test_s3_storage_service.py (2 hunks)
src/backend/tests/unit/api/v1/test_files.py (4 hunks)
src/backend/tests/unit/api/v2/test_files.py (11 hunks)
src/backend/tests/unit/services/storage/test_local_storage_service.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

src/backend/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/**/*.py: Use FastAPI async patterns with await for async operations in component execution methods
Use asyncio.create_task() for background tasks and implement proper cleanup with try/except for asyncio.CancelledError
Use queue.put_nowait() for non-blocking queue operations and asyncio.wait_for() with timeouts for controlled get operations

Files:

src/backend/base/langflow/api/v1/files.py
src/backend/base/langflow/api/utils/core.py
src/backend/base/langflow/services/storage/local.py
src/backend/base/langflow/services/storage/service.py
src/backend/tests/unit/api/v1/test_files.py
src/backend/tests/unit/api/v2/test_files.py
src/backend/tests/unit/services/storage/test_local_storage_service.py
src/backend/base/langflow/services/task/temp_flow_cleanup.py
src/backend/tests/integration/storage/test_s3_storage_service.py
src/backend/base/langflow/services/storage/s3.py
src/backend/base/langflow/api/utils/__init__.py
src/backend/base/langflow/api/v2/files.py

src/backend/base/langflow/api/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

Backend API endpoints should be organized by version (v1/, v2/) under src/backend/base/langflow/api/ with specific modules for features (chat.py, flows.py, users.py, etc.)

Files:

src/backend/base/langflow/api/v1/files.py
src/backend/base/langflow/api/utils/core.py
src/backend/base/langflow/api/utils/__init__.py
src/backend/base/langflow/api/v2/files.py

src/backend/tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)

src/backend/tests/**/*.py: Place backend unit tests in src/backend/tests/ directory, component tests in src/backend/tests/unit/components/ organized by component subdirectory, and integration tests accessible via make integration_tests
Use same filename as component with appropriate test prefix/suffix (e.g., my_component.py → test_my_component.py)
Use the client fixture (FastAPI Test Client) defined in src/backend/tests/conftest.py for API tests; it provides an async httpx.AsyncClient with automatic in-memory SQLite database and mocked environment variables. Skip client creation by marking test with @pytest.mark.noclient
Inherit from the correct ComponentTestBase family class located in src/backend/tests/base.py based on API access needs: ComponentTestBase (no API), ComponentTestBaseWithClient (needs API), or ComponentTestBaseWithoutClient (pure logic). Provide three required fixtures: component_class, default_kwargs, and file_names_mapping
Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
Use @pytest.mark.asyncio decorator for async component tests and ensure async methods are properly awaited
Test background tasks using asyncio.create_task() and verify completion with asyncio.wait_for() with appropriate timeout constraints
Test queue operations using non-blocking queue.put_nowait() and asyncio.wait_for(queue.get(), timeout=...) to verify queue processing without blocking
Use @pytest.mark.no_blockbuster marker to skip the blockbuster plugin in specific tests
For database tests that may fail in batch runs, run them sequentially using uv run pytest src/backend/tests/unit/test_database.py r...

Files:

src/backend/tests/unit/api/v1/test_files.py
src/backend/tests/unit/api/v2/test_files.py
src/backend/tests/unit/services/storage/test_local_storage_service.py
src/backend/tests/integration/storage/test_s3_storage_service.py

**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (Custom checks)

Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)

Files:

src/backend/tests/unit/api/v1/test_files.py
src/backend/tests/unit/api/v2/test_files.py
src/backend/tests/unit/services/storage/test_local_storage_service.py
src/backend/tests/integration/storage/test_s3_storage_service.py

**/test_*.py

📄 CodeRabbit inference engine (Custom checks)

**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend

Files:

src/backend/tests/unit/api/v1/test_files.py
src/backend/tests/unit/api/v2/test_files.py
src/backend/tests/unit/services/storage/test_local_storage_service.py
src/backend/tests/integration/storage/test_s3_storage_service.py

🧠 Learnings (13)

📚 Learning: 2025-11-24T19:46:09.104Z

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/api/**/*.py : Backend API endpoints should be organized by version (v1/, v2/) under `src/backend/base/langflow/api/` with specific modules for features (chat.py, flows.py, users.py, etc.)

Applied to files:

langflow-files-api-comparison.md