feat: S3 file size and associations to flows#10819
Conversation
…services, updating tests.
… response, and add API comparison documentation.
Walkthrough
Storage service layer refactored to return file metadata dictionaries with name and size instead of plain strings. File deletion enhanced with in-use detection and batch error handling. Filename sanitization added for security. New utility functions detect file usage within flows. Tests updated to validate new metadata structure and deletion behavior.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 error, 3 warnings)
✅ Passed checks (3 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (40.04%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #10819 +/- ##
==========================================
- Coverage 32.44% 32.39% -0.06%
==========================================
Files 1367 1367
Lines 63315 63235 -80
Branches 9357 9358 +1
==========================================
- Hits 20544 20482 -62
+ Misses 41738 41720 -18
Partials 1033 1033
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/backend/tests/integration/storage/test_s3_storage_service.py (2)
450-452: Bug: Assertion incompatible with new return type.
Line 452 compares `file_name` (a string) directly to `files` (now a `list[dict]`). This assertion will always fail since `"to_delete.txt" in [{"name": "to_delete.txt", "size": 9}]` is `False`.
Apply this diff:

      # Verify it exists
      files = await s3_storage_service.list_files(test_flow_id)
    - assert file_name in files
    + assert file_name in [f["name"] for f in files]

578-581: Bug: Same assertion incompatibility with new return type.
Similar to the previous issue, this assertion compares strings to a list of dicts and will fail.
Apply this diff:

      # Verify all files exist
      listed = await s3_storage_service.list_files(test_flow_id)
      assert len(listed) == 5
    + listed_names = [f["name"] for f in listed]
      for file_name in file_names:
    -     assert file_name in listed
    +     assert file_name in listed_names

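The two assertion failures above come down to how `in` behaves on a list of dicts. A minimal sketch, using the metadata shape quoted in the review (the literal values are illustrative only):

```python
# Shape of the new list_files() return value, as quoted in the review above.
files = [{"name": "to_delete.txt", "size": 9}]

# Old-style check: compares a str against dict elements, so it never matches.
old_style_match = "to_delete.txt" in files

# Updated check: extract the names first, then test membership.
names = [f["name"] for f in files]
new_style_match = "to_delete.txt" in names
```

If many membership checks run against the same listing, building a `set` of names once would avoid repeated linear scans, though for test-sized lists the list comprehension is sufficient.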
🧹 Nitpick comments (3)
src/backend/base/langflow/api/utils/core.py (1)
416-433: Code duplication detected across three files.
This exact implementation of `is_file_used` exists in three locations:
- `src/backend/base/langflow/api/utils/core.py` (this file)
- `src/backend/base/langflow/api/v1/files.py` (lines 217-234)
- `src/backend/base/langflow/api/v2/files.py` (lines 189-206)
Since this utility is now exported from `api/utils`, the other two files should import from here rather than duplicating the logic.

    # In src/backend/base/langflow/api/v1/files.py and v2/files.py:
    +from langflow.api.utils import is_file_used
    -def is_file_used(flow_data: dict | None, file_name: str) -> bool:
    -    """Check if a file is used in the flow."""
    -    if not flow_data or "nodes" not in flow_data:
    -        return False
    -    ...

src/backend/base/langflow/api/v2/files.py (1)
190-207: Duplicate implementation - import from api/utils instead.
This `is_file_used` function is identical to the one in `src/backend/base/langflow/api/utils/core.py`. Since it's exported from utils, this file should import it rather than redefine it.

    +from langflow.api.utils import is_file_used
    +
     async def is_file_in_use(session: DbSession, user_id: uuid.UUID, file_name: str) -> bool:
         """Check if a file is used in any of the user's flows."""
         flows = await get_user_flows(session, user_id)
         return any(is_file_used(flow.data, file_name) for flow in flows)
    -
    -
    -def is_file_used(flow_data: dict | None, file_name: str) -> bool:
    -    """Check if a file is used in the flow."""
    -    ...

src/backend/base/langflow/services/task/temp_flow_cleanup.py (1)
47-59: Clarify `orphaned_flow_ids` typing and avoid `type: ignore`
The cleanup logic is sound, but a couple of details are worth tightening up:
- The `type: ignore[arg-type]` on the `delete(table).where(col(table.flow_id).in_(orphaned_flow_ids))` call suggests a mismatch between the inferred type of `orphaned_flow_ids` and what `in_` expects.
- The same `orphaned_flow_ids` iterable is later used as `for flow_id in orphaned_flow_ids:` and passed as `str(flow_id)` into `list_files`/`delete_file` and for constructing `flow_dir`. If `session.exec(...).all()` is returning row objects instead of bare scalar IDs, `str(flow_id)` will not match the actual flow ID string and both DB delete semantics and storage cleanup targets become brittle.
To make this robust and drop the `type: ignore`, consider shaping `orphaned_flow_ids` explicitly as a list of scalar IDs via `.scalars().all()` and annotating it:

    - orphaned_flow_ids = (
    -     await session.exec(
    -         select(col(table.flow_id).distinct()).where(col(table.flow_id).not_in(flow_ids_subquery))
    -     )
    - ).all()
    + result = await session.exec(
    +     select(col(table.flow_id))
    +     .where(col(table.flow_id).not_in(flow_ids_subquery))
    +     .distinct()
    + )
    + orphaned_flow_ids: list[str] = result.scalars().all()
      ...
    - await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))  # type: ignore[arg-type]
    + await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))

This also guarantees that the `flow_id` used for file listing/deletion and directory removal is the plain ID value you expect.
The switch to `file["name"]` for `delete_file` and logging correctly aligns with the new `list_files` metadata format.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- `.secrets.baseline` (3 hunks)
- `langflow-files-api-comparison.md` (1 hunks)
- `src/backend/base/langflow/api/utils/__init__.py` (1 hunks)
- `src/backend/base/langflow/api/utils/core.py` (1 hunks)
- `src/backend/base/langflow/api/v1/files.py` (2 hunks)
- `src/backend/base/langflow/api/v2/files.py` (11 hunks)
- `src/backend/base/langflow/services/storage/local.py` (2 hunks)
- `src/backend/base/langflow/services/storage/s3.py` (2 hunks)
- `src/backend/base/langflow/services/storage/service.py` (1 hunks)
- `src/backend/base/langflow/services/task/temp_flow_cleanup.py` (3 hunks)
- `src/backend/tests/integration/storage/test_s3_storage_service.py` (2 hunks)
- `src/backend/tests/unit/api/v1/test_files.py` (4 hunks)
- `src/backend/tests/unit/api/v2/test_files.py` (11 hunks)
- `src/backend/tests/unit/services/storage/test_local_storage_service.py` (5 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/backend/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)
`src/backend/**/*.py`:
- Use FastAPI async patterns with `await` for async operations in component execution methods
- Use `asyncio.create_task()` for background tasks and implement proper cleanup with try/except for `asyncio.CancelledError`
- Use `queue.put_nowait()` for non-blocking queue operations and `asyncio.wait_for()` with timeouts for controlled get operations
Files:
- `src/backend/base/langflow/api/v1/files.py`
- `src/backend/base/langflow/api/utils/core.py`
- `src/backend/base/langflow/services/storage/local.py`
- `src/backend/base/langflow/services/storage/service.py`
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/base/langflow/services/task/temp_flow_cleanup.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
- `src/backend/base/langflow/services/storage/s3.py`
- `src/backend/base/langflow/api/utils/__init__.py`
- `src/backend/base/langflow/api/v2/files.py`
src/backend/base/langflow/api/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)
Backend API endpoints should be organized by version (v1/, v2/) under `src/backend/base/langflow/api/` with specific modules for features (chat.py, flows.py, users.py, etc.)
Files:
- `src/backend/base/langflow/api/v1/files.py`
- `src/backend/base/langflow/api/utils/core.py`
- `src/backend/base/langflow/api/utils/__init__.py`
- `src/backend/base/langflow/api/v2/files.py`
src/backend/tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)
`src/backend/tests/**/*.py`:
- Place backend unit tests in `src/backend/tests/` directory, component tests in `src/backend/tests/unit/components/` organized by component subdirectory, and integration tests accessible via `make integration_tests`
- Use same filename as component with appropriate test prefix/suffix (e.g., `my_component.py` → `test_my_component.py`)
- Use the `client` fixture (FastAPI Test Client) defined in `src/backend/tests/conftest.py` for API tests; it provides an async `httpx.AsyncClient` with automatic in-memory SQLite database and mocked environment variables. Skip client creation by marking test with `@pytest.mark.noclient`
- Inherit from the correct `ComponentTestBase` family class located in `src/backend/tests/base.py` based on API access needs: `ComponentTestBase` (no API), `ComponentTestBaseWithClient` (needs API), or `ComponentTestBaseWithoutClient` (pure logic). Provide three required fixtures: `component_class`, `default_kwargs`, and `file_names_mapping`
- Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
- Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
- Use `@pytest.mark.asyncio` decorator for async component tests and ensure async methods are properly awaited
- Test background tasks using `asyncio.create_task()` and verify completion with `asyncio.wait_for()` with appropriate timeout constraints
- Test queue operations using non-blocking `queue.put_nowait()` and `asyncio.wait_for(queue.get(), timeout=...)` to verify queue processing without blocking
- Use `@pytest.mark.no_blockbuster` marker to skip the blockbuster plugin in specific tests
- For database tests that may fail in batch runs, run them sequentially using `uv run pytest src/backend/tests/unit/test_database.py` r...
Files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
**/{test_*.py,*.test.ts,*.test.tsx}
📄 CodeRabbit inference engine (Custom checks)
Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)
Files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
**/test_*.py
📄 CodeRabbit inference engine (Custom checks)
**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend
Files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
🧠 Learnings (13)
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/api/**/*.py : Backend API endpoints should be organized by version (v1/, v2/) under `src/backend/base/langflow/api/` with specific modules for features (chat.py, flows.py, users.py, etc.)
Applied to files:
langflow-files-api-comparison.md
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `aiofiles` and `anyio.Path` for async file operations in tests; create temporary test files using `tmp_path` fixture and verify file existence and content
Applied to files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test component versioning and backward compatibility using `file_names_mapping` fixture with `VersionComponentMapping` objects mapping component files across Langflow versions
Applied to files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
- `src/backend/tests/unit/services/storage/test_local_storage_service.py`
- `src/backend/tests/integration/storage/test_s3_storage_service.py`
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `monkeypatch` fixture to mock internal functions for testing error handling scenarios; validate error status codes and error message content in responses
Applied to files:
- `src/backend/tests/unit/api/v1/test_files.py`
- `src/backend/tests/unit/api/v2/test_files.py`
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use async fixtures with proper cleanup using try/finally blocks to ensure resources are properly released after tests complete
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `pytest.mark.api_key_required` and `pytest.mark.no_blockbuster` markers for components that need external APIs; use `MockLanguageModel` from `tests.unit.mock_language_model` for testing without external API keys
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Each test should have a clear docstring explaining its purpose; complex test setups should be commented; mock usage should be documented; expected behaviors should be explicitly stated
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `pytest.mark.asyncio` decorator for async component tests and ensure async methods are properly awaited
Applied to files:
src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/**/*.py : Use `asyncio.create_task()` for background tasks and implement proper cleanup with try/except for `asyncio.CancelledError`
Applied to files:
src/backend/base/langflow/services/task/temp_flow_cleanup.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/components/**/__init__.py : Update `__init__.py` with alphabetically sorted imports when adding new components
Applied to files:
src/backend/base/langflow/services/task/temp_flow_cleanup.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/services/database/models/**/*.py : Database models should be organized by domain (api_key/, flow/, folder/, user/, etc.) under `src/backend/base/langflow/services/database/models/`
Applied to files:
src/backend/base/langflow/services/task/temp_flow_cleanup.py
🧬 Code graph analysis (5)
src/backend/base/langflow/api/v1/files.py (2)
- `src/backend/base/langflow/api/v2/files.py` (1): `is_file_used` (190-207)
- `src/backend/base/langflow/api/utils/core.py` (1): `is_file_used` (416-433)
src/backend/base/langflow/api/utils/core.py (2)
- `src/backend/base/langflow/api/v1/files.py` (1): `is_file_used` (218-235)
- `src/backend/base/langflow/api/v2/files.py` (1): `is_file_used` (190-207)
src/backend/base/langflow/services/storage/local.py (2)
- `src/backend/base/langflow/services/storage/s3.py` (1): `list_files` (225-262)
- `src/backend/base/langflow/services/storage/service.py` (1): `list_files` (43-44)
src/backend/tests/unit/api/v2/test_files.py (4)
- `src/backend/base/langflow/api/v2/files.py` (1): `delete_file` (790-818)
- `src/backend/base/langflow/services/storage/local.py` (1): `delete_file` (157-172)
- `src/backend/base/langflow/services/storage/s3.py` (1): `delete_file` (264-284)
- `src/backend/base/langflow/services/storage/service.py` (1): `delete_file` (51-52)
src/backend/base/langflow/services/task/temp_flow_cleanup.py (6)
- `src/backend/base/langflow/api/v1/files.py` (1): `delete_file` (205-215)
- `src/backend/base/langflow/services/storage/local.py` (1): `delete_file` (157-172)
- `src/backend/base/langflow/services/storage/s3.py` (1): `delete_file` (264-284)
- `src/backend/base/langflow/services/storage/service.py` (1): `delete_file` (51-52)
- `src/lfx/src/lfx/services/storage/service.py` (1): `delete_file` (160-170)
- `src/lfx/src/lfx/services/storage/local.py` (1): `delete_file` (139-154)
🪛 Gitleaks (8.29.1)
.secrets.baseline
[high] 878-878: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 LanguageTool
langflow-files-api-comparison.md
[style] ~13-~13: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e user, with robust error handling. - File Metadata Update: Allows renaming file...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: Lint Backend / Run Mypy (3.12)
- GitHub Check: Test Docker Images / Test docker images
- GitHub Check: Lint Backend / Run Mypy (3.11)
- GitHub Check: Lint Backend / Run Mypy (3.10)
- GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
- GitHub Check: Run Backend Tests / LFX Tests - Python 3.10
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
- GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
- GitHub Check: Test Starter Templates
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Update Component Index
- GitHub Check: Update Starter Projects
🔇 Additional comments (21)
.secrets.baseline (1)
874-883: LGTM - Secrets baseline updated correctly.
The entry for `input_mixin.py` with `is_secret: false` correctly tracks a known false positive. The static analysis warning from Gitleaks is expected here since this baseline file contains hashed representations of detected patterns, not actual secrets.
langflow-files-api-comparison.md (1)
1-64: Good documentation comparing v1 and v2 file APIs.
The comparison tables clearly outline the functional differences between the flow-based v1 and user-based v2 APIs. This will help developers understand which API to use for their use case.
src/backend/tests/integration/storage/test_s3_storage_service.py (1)
395-401: Good test updates for new file metadata structure.
The test correctly validates the new dict structure with `name` and `size` fields, and verifies the expected size of 7 bytes for "content".
src/backend/base/langflow/api/v1/files.py (1)
195-197: Good integration of file usage tracking.
The `is_used` flag is correctly added to each file entry by checking the flow's node templates.
src/backend/base/langflow/services/storage/service.py (1)
42-44: Interface change is correctly implemented across all backend implementations.
The return type change from `list[str]` to `list[dict]` on line 43 aligns with the PR objective to include file metadata. All `StorageService` implementations in the backend (`local.py` and `s3.py`) have been properly updated to match the new return type. This is a breaking change for the abstract interface, but all known implementations are consistent.
src/backend/tests/unit/api/v1/test_files.py (3)
61-61: LGTM!
Adding `optins=None` ensures the fixture aligns with the User model's expected fields, preventing potential validation issues.
208-214: Good test coverage for new file metadata structure.
The assertions correctly validate:
- Presence of required fields (`name`, `size`, `is_used`)
- File name suffix matching
- Correct size calculation (12 bytes for "test content")
- Type validation for `is_used` as boolean
252-254: LGTM!
The test correctly adapts to the new dict-based file listing by extracting file names before performing membership assertions.
Also applies to: 269-271
src/backend/tests/unit/services/storage/test_local_storage_service.py (2)
168-174: LGTM!
The test correctly validates the new file metadata structure with proper assertions for both `name` and `size` fields. The size validation (7 bytes for "content") is accurate.
186-187: LGTM!
All list operation tests are properly adapted to extract file names from the dict entries before performing membership assertions, maintaining test clarity.
Also applies to: 203-204, 221-223, 240-241
src/backend/base/langflow/services/storage/local.py (1)
127-155: LGTM! Clean implementation of enriched file metadata.
The updated `list_files` method:
- Returns consistent structure (`name`, `size`) matching the S3 implementation
- Uses proper async iteration with `folder_path.iterdir()`
- Correctly filters to only include files (not directories)
- Handles edge cases (missing directory, errors) gracefully
The per-file `stat()` call is a reasonable trade-off for providing size metadata.
src/backend/base/langflow/services/storage/s3.py (1)
225-262: LGTM! Efficient S3 implementation leveraging existing metadata.
The S3 implementation efficiently extracts file size from the `list_objects_v2` response (which includes `Size` by default), avoiding any additional API calls compared to the previous implementation.
src/backend/tests/unit/api/v2/test_files.py (3)
202-202: LGTM!
The expected response format correctly reflects the new API contract that includes `files_not_deleted` field for consistency across delete operations.
Also applies to: 762-762
970-974: LGTM!
The mock setup consistently adds `mock_exec_flows` to simulate the additional database query for checking file usage in flows. The `side_effect` pattern correctly returns different results for sequential `session.exec` calls.
Also applies to: 1018-1022, 1070-1074, 1132-1136, 1171-1175, 1220-1224, 1261-1265
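The `side_effect` pattern mentioned above can be sketched with the standard library alone. This is not the PR's test code: `mock_exec_files`, the SQL-like strings, and the row values are hypothetical placeholders, and only the `unittest.mock` behavior (one list item returned per awaited call) is relied upon.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def demo() -> tuple[list, list]:
    # Hypothetical stand-ins for the two query results the endpoint reads:
    # first the file lookup, then the flows used for the in-use check.
    mock_exec_files = MagicMock()
    mock_exec_files.all.return_value = ["file-row"]
    mock_exec_flows = MagicMock()
    mock_exec_flows.all.return_value = ["flow-row"]

    session = MagicMock()
    # With an iterable side_effect, each awaited call returns the next item.
    session.exec = AsyncMock(side_effect=[mock_exec_files, mock_exec_flows])

    first = (await session.exec("select files")).all()
    second = (await session.exec("select flows")).all()
    return first, second


first_result, second_result = asyncio.run(demo())
```

The order of items in `side_effect` has to match the order in which the code under test issues its queries, which is why adding a new query to an endpoint tends to require touching every existing mock setup.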
1292-1339: Good test coverage for file-in-use protection.
This new test verifies the critical behavior that files referenced in flow nodes cannot be deleted, protecting users from accidentally breaking their flows. The test correctly:
- Mocks a flow with a node template referencing the file
- Verifies the appropriate response structure with `files_not_deleted`
- Confirms storage and database delete are NOT called
src/backend/base/langflow/api/v2/files.py (5)
38-73: Good security-conscious filename sanitization.
The implementation handles key security concerns:
- Path traversal prevention via `Path(filename).name`
- Dangerous character replacement with safe subset
- Hidden file prevention by stripping leading dots
- Length limits with extension preservation
One minor note: the regex `[^\w.\- ()]` allows underscores (via `\w`) but the comment mentions them explicitly. This is correct behavior.
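The four sanitization steps listed above can be sketched as follows. This is a reconstruction from the review text, not the PR's implementation: the function name, the underscore replacement character, the 255-character limit, and the `"file"` fallback are all assumptions made for illustration. Only `Path(filename).name` and the regex `[^\w.\- ()]` come from the comment itself.

```python
import re
from pathlib import Path

MAX_LEN = 255  # assumed limit; the review does not state the real value


def sanitize_filename_sketch(filename: str, max_len: int = MAX_LEN) -> str:
    """Hypothetical sketch of the sanitization steps described in the review."""
    # 1. Path traversal prevention: keep only the final path component.
    name = Path(filename).name
    # 2. Replace anything outside the allowed set (word chars, dot, hyphen,
    #    space, parentheses). The underscore replacement is an assumption.
    name = re.sub(r"[^\w.\- ()]", "_", name)
    # 3. Hidden file prevention: strip leading dots.
    name = name.lstrip(".")
    # 4. Length limit, keeping the extension when there is room for it.
    if len(name) > max_len:
        stem, dot, ext = name.rpartition(".")
        keep = max_len - len(ext) - 1
        if dot and stem and keep > 0:
            name = f"{stem[:keep]}.{ext}"
        else:
            name = name[:max_len]
    # Assumed fallback so the result is never empty.
    return name or "file"
```

For example, under these assumptions `sanitize_filename_sketch("../../etc/passwd")` yields `"passwd"` and `sanitize_filename_sketch(".hidden")` yields `"hidden"`.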
76-100: RFC 5987 compliant Content-Disposition handling.
Good implementation supporting both ASCII and non-ASCII filenames with proper encoding fallback. The quote escaping for backslash and double-quote characters prevents header injection.
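A minimal sketch of the header construction described above, assuming the filename has already been sanitized (so it contains no control characters). The function name is hypothetical and the PR's actual fallback behavior is not shown in this thread; the sketch only illustrates the two forms, a quoted `filename=` for ASCII names and the RFC 5987 `filename*=UTF-8''...` form for everything else.

```python
from urllib.parse import quote


def content_disposition_sketch(filename: str) -> str:
    """Hypothetical helper: build an attachment Content-Disposition value."""
    try:
        filename.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII name: percent-encode the UTF-8 bytes (RFC 5987 ext-value).
        encoded = quote(filename, safe="")
        return f"attachment; filename*=UTF-8''{encoded}"
    # ASCII name: emit a quoted-string, escaping backslash first, then quote.
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return f'attachment; filename="{escaped}"'
```

Escaping the backslash before the double quote matters: doing it in the other order would double the backslashes that were just inserted in front of each quote.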
800-804: File-in-use protection returns 200 with informative response.
This is a design choice worth noting: the endpoint returns HTTP 200 (not 4xx) when a file cannot be deleted due to being in use. This allows the client to distinguish between errors and intentional rejections while maintaining a consistent response structure.
536-577: Comprehensive batch delete with proper failure categorization.
The implementation correctly:
- Separates in-use files before attempting deletion
- Categorizes storage failures as transient vs permanent
- Only deletes DB records for files successfully removed from storage (or permanently gone)
- Provides detailed response with deleted, not-deleted, and failure counts
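The categorization described above can be sketched as a small synchronous function. Everything here is hypothetical: the function name, the result keys, and the choice to treat `FileNotFoundError` as "permanently gone" and any other `OSError` as transient are illustrative assumptions, not the PR's actual rules.

```python
from typing import Callable, Iterable


def plan_batch_delete(
    file_names: Iterable[str],
    in_use: set[str],
    delete_from_storage: Callable[[str], None],
) -> dict[str, list[str]]:
    """Hypothetical sketch: split a batch into deleted / in-use / failed."""
    result: dict[str, list[str]] = {"deleted": [], "not_deleted_in_use": [], "failed": []}
    for name in file_names:
        # In-use files are set aside before any storage call is attempted.
        if name in in_use:
            result["not_deleted_in_use"].append(name)
            continue
        try:
            delete_from_storage(name)
        except FileNotFoundError:
            # Assumed "permanent" case: already gone from storage, so the
            # database record can be removed as well.
            result["deleted"].append(name)
        except OSError:
            # Assumed "transient" case: keep the database record for a retry.
            result["failed"].append(name)
        else:
            result["deleted"].append(name)
    return result
```

Only names in `deleted` would then have their database records removed, which matches the third bullet above.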
761-771: Good input validation for file renaming.
The validation ensures:
- Empty names are rejected
- Names containing disallowed characters are rejected with clear feedback
- Only alphanumeric characters, spaces, dots, hyphens, underscores, and parentheses are permitted
The strict comparison (`sanitized_name != name.strip()`) is intentional: it prevents silent modification of user input and explicitly rejects names containing characters that would be converted or removed (e.g., `@`, `#`, `!`). This aligns with the security principle of failing fast rather than implicitly sanitizing. The error message clearly communicates what characters are allowed.
src/backend/base/langflow/services/task/temp_flow_cleanup.py (1)
8-9: SQLAlchemy/sqlmodel imports are appropriate
`delete`, `col`, and `select` are imported consistently with their usage below; no issues here.
        "get_top_level_vertices",
        # Functions
        "has_api_terms",
        "is_file_used",
is_file_used is exported but not imported.
The function is_file_used is added to __all__ but is missing from the import statement at lines 9-39. This will cause an ImportError when attempting to import it from langflow.api.utils.
Add is_file_used to the import statement:
from langflow.api.utils.core import (
API_WORDS,
MAX_PAGE_SIZE,
MIN_PAGE_SIZE,
CurrentActiveMCPUser,
CurrentActiveUser,
DbSession,
EventDeliveryType,
build_and_cache_graph_from_data,
build_graph_from_data,
build_graph_from_db,
build_graph_from_db_no_cache,
build_input_keys_response,
cascade_delete_flow,
check_langflow_version,
custom_params,
extract_global_variables_from_headers,
format_elapsed_time,
format_exception_message,
format_syntax_error_message,
get_causing_exception,
get_is_component_from_data,
get_suggestion_message,
get_top_level_vertices,
has_api_terms,
+ is_file_used,
parse_exception,
parse_value,
remove_api_keys,
validate_is_component,
verify_public_flow_and_get_user,
)
🤖 Prompt for AI Agents
In src/backend/base/langflow/api/utils/__init__.py around line 71, the name
"is_file_used" is included in __all__ but not imported in the import block at
lines 9-39; add "is_file_used" to that import statement (import it from the same
module where the other file utility functions are imported from) so the symbol
exported in __all__ is actually defined and avoid the ImportError.
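The failure mode described in this comment can be reproduced without the Langflow codebase. In the sketch below, `demo_utils` is a hypothetical in-memory module standing in for `langflow.api.utils`; it lists a name in `__all__` that is never bound on the module.

```python
import sys
import types

# Hypothetical stand-in for a package __init__ that lists a name in __all__
# without importing or defining it.
demo = types.ModuleType("demo_utils")
demo.__all__ = ["is_file_used"]
sys.modules["demo_utils"] = demo

try:
    from demo_utils import is_file_used  # noqa: F401
    import_failed = False
except ImportError:
    # "cannot import name ...": __all__ alone does not define the symbol.
    import_failed = True

# After the fix, the name is actually bound on the module (here a dummy).
demo.is_file_used = lambda flow_data, file_name: False
from demo_utils import is_file_used  # noqa: E402

import_works = callable(is_file_used)
```

`__all__` only controls what `from module import *` exports; it does not create the names it lists, so every entry needs a matching import or definition.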
    def is_file_used(flow_data: dict | None, file_name: str) -> bool:
        """Check if a file is used in the flow."""
        if not flow_data or "nodes" not in flow_data:
            return False

        for node in flow_data["nodes"]:
            node_data = node.get("data", {}).get("node", {})
            template = node_data.get("template", {})
            for field in template.values():
                if isinstance(field, dict) and "value" in field:
                    value = field["value"]
                    if isinstance(value, str) and file_name in value:
                        return True
                    if isinstance(value, list):
                        for item in value:
                            if isinstance(item, str) and file_name in item:
                                return True
        return False

🛠️ Refactor suggestion | 🟠 Major
Code duplication: is_file_used is defined in three places.
This function is identically implemented in:
- Here (`api/v1/files.py`)
- `api/v2/files.py` (lines 189-206)
- `api/utils/core.py` (lines 415-432)
Since the function is already exported from langflow.api.utils (once the missing import is fixed), this local definition should be removed in favor of importing from the centralized location.
Remove the local definition and import from utils:
    -from langflow.api.utils import CurrentActiveUser, DbSession
    +from langflow.api.utils import CurrentActiveUser, DbSession, is_file_used
     from langflow.api.v1.schemas import UploadFileResponse

Then remove the local function definition at lines 218-235.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/backend/base/langflow/api/v1/files.py around lines 218 to 235, remove the
local is_file_used function and instead import the canonical implementation from
langflow.api.utils (or langflow.api.utils.core) where it is exported; update the
module imports to add the single import (e.g., from langflow.api.utils import
is_file_used) and delete the duplicated function definition at lines 218-235,
making sure all local references continue to work with the imported function.
DRY, the question is: where to put it utils/core?
@ricofurtado Yes, utils/core.py is the right place for is_file_used! Here's why:
Why utils/core.py is appropriate:
- ✅ API-agnostic utility - The function operates purely on dictionary data structures, with no dependencies on v1 or v2 specific logic
- ✅ Shared across versions - Both v1 and v2 file APIs need this functionality
- ✅ Already exported - It's already in `utils/core.py` and exported from `api/utils/__init__.py`
- ✅ Consistent pattern - Other flow-data utilities already live in that module
Next steps to complete the DRY refactor:
- In `api/v1/files.py` - Remove the local definition (lines 218-235 in the diff) and import: `from langflow.api.utils import CurrentActiveUser, DbSession, is_file_used`
- In `api/v2/files.py` - Remove the duplicate at lines 189-206 and add the same import
This keeps the canonical implementation in utils/core.py where it's version-neutral, tested once, and maintained in a single location.
    for node in flow_data["nodes"]:
        node_data = node.get("data", {}).get("node", {})
        template = node_data.get("template", {})
        for field in template.values():
            if isinstance(field, dict) and "value" in field:
                value = field["value"]
                if isinstance(value, str) and file_name in value:
                    return True
                if isinstance(value, list):
                    for item in value:
                        if isinstance(item, str) and file_name in item:
                            return True
⚡️Codeflash found 21% (0.21x) speedup for is_file_used in src/backend/base/langflow/api/utils/core.py
⏱️ Runtime : 1.74 milliseconds → 1.44 milliseconds (best of 124 runs)
📝 Explanation and details
The optimized code achieves a speedup of roughly 21% (1.74 ms → 1.44 ms) through several micro-optimizations that reduce redundant operations and exit early on incomplete nodes:
Key Optimizations:
- Eliminated chained .get() calls with defaults: The original code used node.get("data", {}).get("node", {}) and node_data.get("template", {}), which builds a temporary empty dictionary for each default on every call, whether or not the key exists. The optimized version uses single .get() calls followed by truthiness checks with continue statements, avoiding that object creation.
- Added intermediate variable storage: Storing val = field.get("value") replaces the "value" in field check plus the separate field["value"] access with a single lookup. Storing nodes = flow_data["nodes"] is mostly cosmetic, since the for statement evaluates that expression only once either way.
- Restructured conditional logic for better short-circuiting: The optimized version uses early continue statements to skip nodes missing required keys (data, node, template), reducing nesting and skipping the inner loop entirely for such nodes. This is particularly effective when dealing with malformed nodes.
- Simplified field validation: Instead of isinstance(field, dict) and "value" in field, the code first checks isinstance(field, dict) with a continue, then reads the value with field.get("value"). A missing key yields None, which is neither a str nor a list, so the result is unchanged.
Performance Impact:
The optimizations are most effective for scenarios with:
- Large node counts (test cases with 1000+ nodes show the biggest gains)
- Nodes with missing or malformed structure (early exits reduce unnecessary processing)
- Complex template hierarchies (reduced dictionary lookups compound savings)
The line profiler shows the optimized version doing less work in the hot paths, where field validation and value extraction run thousands of times. The profiler's total runtime appears slightly higher, which is plausibly an artifact of the additional conditional lines it has to instrument; the benchmark itself measured about 21% faster execution (1.74 ms → 1.44 ms, best of 124 runs).
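The first optimization can be looked at in isolation. The following is a minimal sketch, not code from the PR: the helper names template_chained and template_single and the sample nodes are made up for illustration, and the timings it prints will vary by machine and Python version.

```python
import timeit

# Hypothetical sample nodes in the shape used by the generated tests.
NODE = {"data": {"node": {"template": {"file": {"value": "myfile.txt"}}}}}
EMPTY_NODE = {}  # a node with no "data" key


def template_chained(node):
    # Original style: every .get() call builds a throwaway {} default.
    return node.get("data", {}).get("node", {}).get("template", {})


def template_single(node):
    # Optimized style: one lookup per level, bail out on a falsy result.
    data = node.get("data")
    if not data:
        return None
    node_obj = data.get("node")
    if not node_obj:
        return None
    return node_obj.get("template") or None


if __name__ == "__main__":
    for label, node in (("full node", NODE), ("empty node", EMPTY_NODE)):
        chained = timeit.timeit(lambda: template_chained(node), number=200_000)
        single = timeit.timeit(lambda: template_single(node), number=200_000)
        print(f"{label}: chained={chained:.4f}s single={single:.4f}s")
```

Both helpers find the same template when it exists; they differ only in what they return for an incomplete node (an empty dict versus None), which the caller treats the same way because both are falsy.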
✅ Correctness verification report:
| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 52 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from __future__ import annotations
# imports
import pytest
from langflow.api.utils.core import is_file_used
# unit tests
# -------------------- BASIC TEST CASES --------------------
def test_basic_file_found_as_exact_string():
# File name appears as exact string in template value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "myfile.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_basic_file_not_found():
# File name does not appear in any node
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "otherfile.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_basic_file_found_in_list():
# File name appears in a list of values
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": ["a.txt", "myfile.txt", "b.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_basic_file_found_as_substring():
# File name appears as substring in a value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "folder/myfile.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_basic_multiple_nodes_one_match():
# File name appears in only one of several nodes
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "notit.txt"}
}
}
}
},
{
"data": {
"node": {
"template": {
"file": {"value": "myfile.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
# -------------------- EDGE TEST CASES --------------------
def test_edge_empty_flow_data():
# flow_data is None
codeflash_output = is_file_used(None, "myfile.txt")
# flow_data is empty dict
codeflash_output = is_file_used({}, "myfile.txt")
def test_edge_no_nodes_key():
# flow_data missing 'nodes' key
flow_data = {"something_else": []}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_nodes_empty():
# 'nodes' is empty list
flow_data = {"nodes": []}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_node_missing_data_node_template():
# Node missing 'data', 'node', or 'template'
flow_data = {
"nodes": [
{}, # no data
{"data": {}}, # no node
{"data": {"node": {}}}, # no template
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_template_field_not_dict():
# Template value is not a dict
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": "myfile.txt"
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_field_dict_without_value():
# Template field is dict but has no 'value' key
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"not_value": "myfile.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_value_is_list_with_non_string_items():
# List contains non-string items
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": ["a.txt", 123, None, {"x": 1}, "myfile.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_value_is_empty_string_or_list():
# Value is empty string
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": ""}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
# Value is empty list
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": []}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_edge_file_name_is_empty_string():
# file_name is empty string, should match any non-empty string value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "something"}
}
}
}
}
]
}
# '' in 'something' is always True
codeflash_output = is_file_used(flow_data, "")
def test_edge_file_name_not_in_any_value():
# file_name is not a substring of any value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"file": {"value": "abc.txt"},
"files": {"value": ["def.txt", "ghi.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "xyz.txt")
def test_edge_value_is_list_with_substring_matches():
# file_name is a substring of one of the list items
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": ["folder/myfile.txt", "other.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
# -------------------- LARGE SCALE TEST CASES --------------------
def test_large_scale_many_nodes_file_at_end():
# Many nodes, file name only at the last node
nodes = [
{
"data": {
"node": {
"template": {
"file": {"value": f"file_{i}.txt"}
}
}
}
}
for i in range(999)
]
nodes.append({
"data": {
"node": {
"template": {
"file": {"value": "myfile.txt"}
}
}
}
})
flow_data = {"nodes": nodes}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_large_scale_many_nodes_no_match():
# Many nodes, no file name matches
nodes = [
{
"data": {
"node": {
"template": {
"file": {"value": f"file_{i}.txt"}
}
}
}
}
for i in range(1000)
]
flow_data = {"nodes": nodes}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_large_scale_node_with_large_list():
# One node with a large list, file name in the middle
files = [f"file_{i}.txt" for i in range(500)]
files.insert(250, "myfile.txt")
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": files}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_large_scale_multiple_possible_fields():
# Many nodes, each with multiple template fields, file name in one field
nodes = []
for i in range(500):
nodes.append({
"data": {
"node": {
"template": {
"field1": {"value": f"file_{i}.txt"},
"field2": {"value": f"other_{i}.txt"},
"field3": {"value": [f"list_{i}.txt", "myfile.txt" if i == 123 else f"not_{i}.txt"]}
}
}
}
})
flow_data = {"nodes": nodes}
codeflash_output = is_file_used(flow_data, "myfile.txt")
def test_large_scale_file_name_appears_multiple_times():
# File name appears in multiple nodes and fields
nodes = []
for i in range(10):
nodes.append({
"data": {
"node": {
"template": {
"field": {"value": "myfile.txt" if i % 3 == 0 else f"file_{i}.txt"}
}
}
}
})
flow_data = {"nodes": nodes}
codeflash_output = is_file_used(flow_data, "myfile.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations
# imports
import pytest # used for our unit tests
from langflow.api.utils.core import is_file_used
# unit tests
# -------------------
# Basic Test Cases
# -------------------
def test_basic_file_found_in_single_node():
# File name is directly in a string value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"input_file": {"value": "my_file.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_basic_file_not_found():
# File name is not present
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"input_file": {"value": "other_file.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_basic_file_in_list():
# File name is present in a list of strings
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": ["a.txt", "my_file.txt", "b.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_basic_file_not_in_list():
# File name is not present in the list
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"files": {"value": ["a.txt", "b.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_basic_file_substring_match():
# File name is a substring of the value
flow_data = {
"nodes": [
{
"data": {
"node": {
"template": {
"input_file": {"value": "prefix_my_file.txt_suffix"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
# -------------------
# Edge Test Cases
# -------------------
def test_edge_flow_data_none():
# flow_data is None
codeflash_output = is_file_used(None, "my_file.txt")
def test_edge_flow_data_empty_dict():
# flow_data is empty dict
codeflash_output = is_file_used({}, "my_file.txt")
def test_edge_nodes_missing():
# flow_data missing 'nodes' key
flow_data = {"not_nodes": []}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_nodes_empty_list():
# flow_data with empty nodes list
flow_data = {"nodes": []}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_node_missing_data():
# node missing 'data' key
flow_data = {
"nodes": [
{}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_node_missing_node_key():
# node['data'] missing 'node' key
flow_data = {
"nodes": [
{"data": {}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_node_missing_template_key():
# node['data']['node'] missing 'template' key
flow_data = {
"nodes": [
{"data": {"node": {}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_template_empty():
# template is empty dict
flow_data = {
"nodes": [
{"data": {"node": {"template": {}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_field_not_dict():
# template field is not a dict
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": "my_file.txt"}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_field_dict_missing_value():
# template field dict missing 'value' key
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_value_not_str_or_list():
# 'value' is an int, not a str or list
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": 123}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_value_list_with_non_str_items():
# 'value' is a list with non-str items
flow_data = {
"nodes": [
{"data": {"node": {"template": {"files": {"value": ["a.txt", 123, None]}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_file_name_empty_string():
# file_name is empty string, should match any string value containing ''
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": "something"}}}}}
]
}
codeflash_output = is_file_used(flow_data, "")
def test_edge_file_name_special_characters():
# file_name contains special characters
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": "file@#$.txt"}}}}}
]
}
codeflash_output = is_file_used(flow_data, "file@#$.txt")
def test_edge_multiple_nodes_file_in_second():
# file_name present in second node only
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": "other.txt"}}}}},
{"data": {"node": {"template": {"input_file": {"value": "my_file.txt"}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_multiple_fields_file_in_second_field():
# file_name present in second field only
flow_data = {
"nodes": [
{"data": {"node": {"template": {
"field1": {"value": "other.txt"},
"field2": {"value": "my_file.txt"}
}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_value_list_file_substring():
# file_name is a substring of a list item
flow_data = {
"nodes": [
{"data": {"node": {"template": {"files": {"value": ["prefix_my_file.txt_suffix"]}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_edge_file_name_not_in_any_node():
# file_name not present in any node
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": "other.txt"}}}}},
{"data": {"node": {"template": {"input_file": {"value": "another.txt"}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
# -------------------
# Large Scale Test Cases
# -------------------
def test_large_scale_many_nodes_file_in_last():
# Large number of nodes, file_name present in the last node
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": f"file_{i}.txt"}}}}}
for i in range(999)
] + [
{"data": {"node": {"template": {"input_file": {"value": "my_file.txt"}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_large_scale_many_nodes_file_not_present():
# Large number of nodes, file_name not present
flow_data = {
"nodes": [
{"data": {"node": {"template": {"input_file": {"value": f"file_{i}.txt"}}}}}
for i in range(1000)
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_large_scale_many_fields_per_node_file_in_middle_field():
# Each node has many fields, file_name present in a middle field of one node
fields = {
f"field_{i}": {"value": f"file_{i}.txt"} for i in range(500)
}
fields["field_250"] = {"value": "my_file.txt"}
flow_data = {
"nodes": [
{"data": {"node": {"template": fields}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_large_scale_value_list_file_in_middle():
# 'value' is a large list, file_name present in the middle
values = [f"file_{i}.txt" for i in range(500)]
values[250] = "my_file.txt"
flow_data = {
"nodes": [
{"data": {"node": {"template": {"files": {"value": values}}}}}
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_large_scale_multiple_nodes_and_fields_file_not_present():
# Many nodes and many fields, file_name not present
flow_data = {
"nodes": [
{"data": {"node": {"template": {
f"field_{j}": {"value": f"file_{i}_{j}.txt"}
for j in range(10)
}}}}
for i in range(100)
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
def test_large_scale_multiple_nodes_and_fields_file_in_first_node_last_field():
# Many nodes and many fields, file_name present in first node, last field
flow_data = {
"nodes": [
{"data": {"node": {"template": {
**{f"field_{j}": {"value": f"file_{0}_{j}.txt"} for j in range(9)},
"field_9": {"value": "my_file.txt"}
}}}},
*[
{"data": {"node": {"template": {
f"field_{j}": {"value": f"file_{i}_{j}.txt"}
for j in range(10)
}}}}
for i in range(1, 100)
]
]
}
codeflash_output = is_file_used(flow_data, "my_file.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To test or edit this optimization locally: git merge codeflash/optimize-pr10819-2025-12-01T19.59.30
Suggested changes:
-    for node in flow_data["nodes"]:
-        node_data = node.get("data", {}).get("node", {})
-        template = node_data.get("template", {})
-        for field in template.values():
-            if isinstance(field, dict) and "value" in field:
-                value = field["value"]
-                if isinstance(value, str) and file_name in value:
-                    return True
-                if isinstance(value, list):
-                    for item in value:
-                        if isinstance(item, str) and file_name in item:
-                            return True
+    nodes = flow_data["nodes"]
+    for node in nodes:
+        node_data = node.get("data")
+        if not node_data:
+            continue
+        node_obj = node_data.get("node")
+        if not node_obj:
+            continue
+        template = node_obj.get("template")
+        if not template:
+            continue
+        for field in template.values():
+            # Fastest path: common-case check, avoid double get
+            if not isinstance(field, dict):
+                continue
+            val = field.get("value")
+            if isinstance(val, str):
+                if file_name in val:
+                    return True
+            elif isinstance(val, list):
+                for item in val:
+                    if isinstance(item, str) and file_name in item:
+                        return True
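The suggested change can be checked for behavioral equivalence without the langflow package. This is a sketch under stated assumptions: the two loop bodies are copied from the diff in this thread, but the function names, the guard for empty flow data, the wrap helper, and the sample cases are hypothetical and only mirror the shapes used in the generated tests.

```python
def is_file_used_original(flow_data, file_name):
    # Guard is an assumption; the thread shows only the loop body.
    if not flow_data or "nodes" not in flow_data:
        return False
    for node in flow_data["nodes"]:
        node_data = node.get("data", {}).get("node", {})
        template = node_data.get("template", {})
        for field in template.values():
            if isinstance(field, dict) and "value" in field:
                value = field["value"]
                if isinstance(value, str) and file_name in value:
                    return True
                if isinstance(value, list):
                    for item in value:
                        if isinstance(item, str) and file_name in item:
                            return True
    return False


def is_file_used_optimized(flow_data, file_name):
    # Same assumed guard, followed by the suggested loop body.
    if not flow_data or "nodes" not in flow_data:
        return False
    for node in flow_data["nodes"]:
        node_data = node.get("data")
        if not node_data:
            continue
        node_obj = node_data.get("node")
        if not node_obj:
            continue
        template = node_obj.get("template")
        if not template:
            continue
        for field in template.values():
            if not isinstance(field, dict):
                continue
            val = field.get("value")
            if isinstance(val, str):
                if file_name in val:
                    return True
            elif isinstance(val, list):
                for item in val:
                    if isinstance(item, str) and file_name in item:
                        return True
    return False


def wrap(template):
    # Hypothetical helper: build a one-node flow around a template.
    return {"nodes": [{"data": {"node": {"template": template}}}]}


# Sample inputs covering missing keys, strings, lists, and non-dict fields.
CASES = [
    None,
    {},
    {"nodes": []},
    {"nodes": [{}, {"data": {}}, {"data": {"node": {}}}]},
    wrap({"file": {"value": "myfile.txt"}}),
    wrap({"file": {"value": "folder/myfile.txt"}}),
    wrap({"files": {"value": ["a.txt", 123, None, "myfile.txt"]}}),
    wrap({"file": "myfile.txt"}),
    wrap({"file": {"not_value": "myfile.txt"}}),
    wrap({"file": {"value": 123}}),
]


def versions_agree(file_name="myfile.txt"):
    # True when both versions return the same answer for every case.
    return all(
        is_file_used_original(case, file_name)
        == is_file_used_optimized(case, file_name)
        for case in CASES
    )
```

One difference falls outside these cases: for a node whose "data" key is present but set to None, the original loop raises AttributeError (the {} default only applies when the key is absent), while the suggested version skips the node.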
    for node in flow_data["nodes"]:
        node_data = node.get("data", {}).get("node", {})
        template = node_data.get("template", {})
        for field in template.values():
            if isinstance(field, dict) and "value" in field:
                value = field["value"]
                if isinstance(value, str) and file_name in value:
                    return True
                if isinstance(value, list):
⚡️Codeflash found 18% (0.18x) speedup for is_file_used in src/backend/base/langflow/api/v2/files.py
⏱️ Runtime : 1.78 milliseconds → 1.52 milliseconds (best of 122 runs)
📝 Explanation and details
The optimized code achieves a speedup of roughly 17% to 18% (1.78 ms → 1.52 ms) through several micro-optimizations that reduce object allocations and method calls:
Key Optimizations:
- Eliminated unnecessary dict allocations: The original code used .get("key", {}), which creates an empty dictionary on every call even when it is not needed. The optimized version uses .get("key") and explicit None checks, avoiding these allocations entirely.
- Reduced chained method calls: Instead of node.get("data", {}).get("node", {}), the optimization breaks this into separate calls with early exit conditions, so later lookups are skipped as soon as a level is missing.
- Faster type checking: Replaced isinstance(field, dict) and "value" in field with type(field) is dict followed by .get("value"). The type() identity check is slightly cheaper than isinstance(), and a single .get() replaces the membership test plus dictionary access. Note that type(field) is dict rejects dict subclasses that isinstance() accepts, so the two checks are only equivalent when every field is a plain dict.
- Better control flow structure: Added explicit continue statements for early exit when intermediate objects are None, avoiding unnecessary nested operations on invalid data.
Performance Impact:
The optimizations are most effective for flows with:
- Many nodes with missing or incomplete data structures (benefits from early exits)
- Large templates with mixed field types (benefits from faster type checking)
- Scenarios where the file is found early (benefits from reduced per-iteration overhead)
From the test results, the optimization provides consistent speedups across various workload patterns, from simple single-node cases to large-scale flows with 1000+ nodes. The early-exit optimizations are particularly beneficial when processing malformed or incomplete node data, a case the generated tests cover extensively.
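The switch from isinstance() to type(...) is dict is the one optimization listed here that can change results, not just speed. A small sketch of the difference follows; the two helper functions are hypothetical stand-ins for the field check and are not taken from the PR.

```python
from collections import OrderedDict


def field_matches_isinstance(field, file_name):
    # Accepts dict and any dict subclass.
    if not isinstance(field, dict):
        return False
    val = field.get("value")
    return isinstance(val, str) and file_name in val


def field_matches_type(field, file_name):
    # Accepts plain dict only; subclasses are skipped.
    if type(field) is not dict:
        return False
    val = field.get("value")
    return isinstance(val, str) and file_name in val


plain = {"value": "myfile.txt"}
ordered = OrderedDict(value="myfile.txt")

print(field_matches_isinstance(plain, "myfile.txt"))    # True
print(field_matches_type(plain, "myfile.txt"))          # True
print(field_matches_isinstance(ordered, "myfile.txt"))  # True
print(field_matches_type(ordered, "myfile.txt"))        # False
```

Flow data decoded with json.loads under default settings consists of plain dict objects, so the stricter check is harmless in that case; it only matters if flow data can reach the function as some other mapping type.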
✅ Correctness verification report:
| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 52 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from langflow.api.v2.files import is_file_used

# unit tests
# Basic Test Cases

def test_file_used_simple_string():
    # File name appears in a string value
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_substring():
    # File name is a substring within the value
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "path/to/myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_in_list():
    # File name appears in a list of strings
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": ["other.txt", "myfile.txt"]}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_not_used():
    # File name does not appear anywhere
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_nodes():
    # File name appears in only one node among several
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}},
            {"data": {"node": {"template": {"f2": {"value": "myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")
# Edge Test Cases

def test_empty_flow_data():
    # flow_data is None
    codeflash_output = is_file_used(None, "myfile.txt")

def test_flow_data_missing_nodes():
    # flow_data does not have "nodes"
    flow = {"something_else": []}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_empty_nodes_list():
    # "nodes" is an empty list
    flow = {"nodes": []}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_missing_data():
    # Node missing "data" key
    flow = {"nodes": [{}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_data_missing_node():
    # Node's "data" missing "node"
    flow = {"nodes": [{"data": {}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_template_missing():
    # Node's "node" missing "template"
    flow = {"nodes": [{"data": {"node": {}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_field_not_dict():
    # Template field is not a dict
    flow = {"nodes": [{"data": {"node": {"template": {"f1": "notadict"}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_field_dict_no_value():
    # Template field is dict but missing "value"
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"notvalue": "x"}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_list_of_non_strings():
    # Value is a list, but contains non-strings
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": [1, 2, 3]}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_empty_string():
    # Value is an empty string
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": ""}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_empty_list():
    # Value is an empty list
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": []}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_case_sensitive():
    # File name matching is case sensitive
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "MyFile.txt"}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")  # Should be case sensitive

def test_file_name_is_empty():
    # File name is empty string, should match all non-empty strings
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "something"}}}}}]}
    codeflash_output = is_file_used(flow, "")

def test_file_used_in_list_substring():
    # File name is substring in one of the list items
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": ["abc_myfile.txt_def"]}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_fields():
    # File name appears in multiple template fields
    flow = {
        "nodes": [
            {"data": {"node": {"template": {
                "f1": {"value": "other.txt"},
                "f2": {"value": "myfile.txt"}
            }}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_list_items():
    # File name appears in several items in a value list
    flow = {
        "nodes": [
            {"data": {"node": {"template": {
                "f1": {"value": ["myfile.txt", "anotherfile.txt", "myfile.txt"]}
            }}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_with_special_characters():
    # File name contains special regex characters
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "my[file].txt"}}}}}]}
    codeflash_output = is_file_used(flow, "my[file].txt")
# Large Scale Test Cases

def test_large_number_of_nodes_file_present():
    # Large flow with file present in one of many nodes
    nodes = [{"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}} for _ in range(999)]
    nodes.append({"data": {"node": {"template": {"f1": {"value": "myfile.txt"}}}}})
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_number_of_nodes_file_absent():
    # Large flow with file absent
    nodes = [{"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}} for _ in range(1000)]
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_number_of_fields_in_template():
    # Large number of fields in a single template, file present in one
    template = {f"f{i}": {"value": "other.txt"} for i in range(999)}
    template["target"] = {"value": "myfile.txt"}
    flow = {"nodes": [{"data": {"node": {"template": template}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_list_in_value():
    # Value is a large list, file present in one item
    value_list = ["other.txt"] * 999 + ["myfile.txt"]
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": value_list}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_list_in_value_absent():
    # Value is a large list, file not present
    value_list = ["other.txt"] * 1000
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": value_list}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from langflow.api.v2.files import is_file_used
# unit tests
# ----------------------------- Basic Test Cases -----------------------------
def test_none_flow_data_returns_false():
# flow_data is None
codeflash_output = is_file_used(None, "file.txt")
def test_missing_nodes_key_returns_false():
# flow_data missing 'nodes' key
codeflash_output = is_file_used({}, "file.txt")
def test_empty_nodes_list_returns_false():
# flow_data with empty nodes list
codeflash_output = is_file_used({"nodes": []}, "file.txt")
def test_file_used_in_single_node_string_value():
# file_name present as substring in a string value
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": "some/path/file.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_file_not_used_in_single_node_string_value():
# file_name not present in any value
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": "some/path/other.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_file_used_in_list_of_strings():
# file_name present in a list of strings
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": ["a.txt", "b.txt", "file.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_file_not_used_in_list_of_strings():
# file_name not present in any string in the list
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": ["a.txt", "b.txt", "c.txt"]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_file_used_in_multiple_nodes():
# file_name present in the second node
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": "a.txt"}
}
}
}
},
{
"data": {
"node": {
"template": {
"input2": {"value": "file.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
# ----------------------------- Edge Test Cases -----------------------------
def test_file_name_is_empty_string():
# file_name is empty string, should match any string (since "" in x is always True)
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": "anything"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "")
def test_node_without_data_key():
# node missing 'data' key
flow = {
"nodes": [
{}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_node_data_without_node_key():
# node['data'] missing 'node' key
flow = {
"nodes": [
{"data": {}}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_node_data_node_without_template_key():
# node['data']['node'] missing 'template' key
flow = {
"nodes": [
{"data": {"node": {}}}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_template_field_not_dict():
# template field is not a dict
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": "not_a_dict"
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_template_field_dict_without_value_key():
# template field dict missing 'value'
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"not_value": "file.txt"}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_value_is_list_with_non_string_items():
# value is a list with non-string items
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": ["a.txt", 123, None, {"x": 1}]}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")
def test_value_is_non_string_non_list():
# value is an int, not a string or list
flow = {
"nodes": [
{
"data": {
"node": {
"template": {
"input1": {"value": 123}
}
}
}
}
]
}
codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_is_substring_of_value():
    # file_name is a substring of a longer string
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "somefile.txt.backup"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_is_only_part_of_value():
    # file_name is only part of a string in a list
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": ["xxxfile.txtyyy", "zzz"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_multiple_fields_in_template():
    # file_name present in one of multiple fields
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "a.txt"},
                            "input2": {"value": "file.txt"},
                            "input3": {"value": ["b.txt", "c.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_in_multiple_nodes_and_fields():
    # file_name present in multiple places, should short-circuit on first found
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "not_it"}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "input2": {"value": ["nope", "file.txt", "another"]}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "input3": {"value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

# ----------------------------- Large Scale Test Cases -----------------------------
def test_large_number_of_nodes_no_match():
    # Large flow, file_name not present
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": f"file_{i}.txt"}
                        }
                    }
                }
            } for i in range(1000)
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_nodes_with_match_at_end():
    # Large flow, file_name present in last node
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": f"file_{i}.txt"}
                        }
                    }
                }
            } for i in range(999)
        ] + [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_nodes_with_match_in_middle():
    # Large flow, file_name present in the middle
    nodes = [
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": f"file_{i}.txt"}
                    }
                }
            }
        } for i in range(500)
    ]
    nodes.append(
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": "file.txt"}
                    }
                }
            }
        }
    )
    nodes += [
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": f"file_{i}.txt"}
                    }
                }
            }
        } for i in range(501, 1000)
    ]
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_fields_per_node():
    # Each node has many fields, only one has the file_name
    template = {f"input{i}": {"value": f"file_{i}.txt"} for i in range(50)}
    template["special"] = {"value": "file.txt"}
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": template
                    }
                }
            }
        ] * 20
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_list_of_values():
    # value is a large list, file_name present near the end
    value_list = [f"file_{i}.txt" for i in range(999)] + ["file.txt"]
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": value_list}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run `git merge codeflash/optimize-pr10819-2025-12-01T20.38.21`
Click to see suggested changes
| -    for node in flow_data["nodes"]: | |
| -        node_data = node.get("data", {}).get("node", {}) | |
| -        template = node_data.get("template", {}) | |
| -        for field in template.values(): | |
| -            if isinstance(field, dict) and "value" in field: | |
| -                value = field["value"] | |
| -                if isinstance(value, str) and file_name in value: | |
| -                    return True | |
| -                if isinstance(value, list): | |
| +    nodes = flow_data["nodes"] | |
| +    for node in nodes: | |
| +        data = node.get("data") | |
| +        if not data: | |
| +            continue | |
| +        node_data = data.get("node") | |
| +        if not node_data: | |
| +            continue | |
| +        template = node_data.get("template") | |
| +        if not template: | |
| +            continue | |
| +        # Extract values once to local variable, avoids .values() call per loop iteration | |
| +        for field in template.values(): | |
| +            # Fast path: skip non-dict fields up front | |
| +            if type(field) is dict: | |
| +                value = field.get("value") | |
| +                if isinstance(value, str): | |
| +                    if file_name in value: | |
| +                        return True | |
| +                elif isinstance(value, list): | |
| +                    # For small lists (which is common), fast in-place iteration is enough | |
| +                    # List comprehension is not helpful for early return, so keep as loop | |
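One way to check that the suggested rewrite preserves behavior is to run both versions over the same inputs. The sketch below is only an illustration: the names `is_file_used_original`, `is_file_used_suggested`, and `node` are made up here, the guard at the top of the suggested version is assumed to be unchanged, and the inner list loop is an assumption because the suggestion is cut off at that point.

```python
from __future__ import annotations


def is_file_used_original(flow_data: dict | None, file_name: str) -> bool:
    if not flow_data or "nodes" not in flow_data:
        return False
    for node in flow_data["nodes"]:
        node_data = node.get("data", {}).get("node", {})
        template = node_data.get("template", {})
        for field in template.values():
            if isinstance(field, dict) and "value" in field:
                value = field["value"]
                if isinstance(value, str) and file_name in value:
                    return True
                if isinstance(value, list):
                    for item in value:
                        if isinstance(item, str) and file_name in item:
                            return True
    return False


def is_file_used_suggested(flow_data: dict | None, file_name: str) -> bool:
    # ASSUMPTION: same guard as the original; the diff does not show this line.
    if not flow_data or "nodes" not in flow_data:
        return False
    for node in flow_data["nodes"]:
        data = node.get("data")
        if not data:
            continue
        node_data = data.get("node")
        if not node_data:
            continue
        template = node_data.get("template")
        if not template:
            continue
        for field in template.values():
            if type(field) is dict:
                value = field.get("value")
                if isinstance(value, str):
                    if file_name in value:
                        return True
                elif isinstance(value, list):
                    # ASSUMPTION: the suggestion is truncated here; this loop
                    # mirrors the original implementation.
                    for item in value:
                        if isinstance(item, str) and file_name in item:
                            return True
    return False


def node(template):
    # Hypothetical helper: wrap a template dict in the node structure.
    return {"data": {"node": {"template": template}}}


flows = [
    None,
    {"nodes": []},
    {"nodes": [{"data": {"node": {}}}]},
    {"nodes": [node({"input1": "not_a_dict"})]},
    {"nodes": [node({"input1": {"not_value": "file.txt"}})]},
    {"nodes": [node({"input1": {"value": ["a.txt", 123, None, {"x": 1}]}})]},
    {"nodes": [node({"input1": {"value": 123}})]},
    {"nodes": [node({"input1": {"value": "somefile.txt.backup"}})]},
    {"nodes": [node({"input": {"value": f"file_{i}.txt"}}) for i in range(1000)]},
]
results = [
    (is_file_used_original(f, "file.txt"), is_file_used_suggested(f, "file.txt"))
    for f in flows
]
# Indices of flows where the two versions disagree.
mismatches = [i for i, (a, b) in enumerate(results) if a != b]
print(mismatches)
```

The two versions agree on these plain-dict inputs. They could still diverge elsewhere: `type(field) is dict` skips dict subclasses that `isinstance` accepts, and a node with `"data": None` makes the original raise while the suggested version skips it.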
Adam-Aghili
left a comment
Changes generally make sense. I'm a little confused about how to test this manually. Rabbit brought up some good points.
| "get_top_level_vertices", | ||
| # Functions | ||
| "has_api_terms", | ||
| "is_file_used", |
| def is_file_used(flow_data: dict | None, file_name: str) -> bool: | ||
|     """Check if a file is used in the flow.""" | ||
|     if not flow_data or "nodes" not in flow_data: | ||
|         return False | ||
| | ||
|     for node in flow_data["nodes"]: | ||
|         node_data = node.get("data", {}).get("node", {}) | ||
|         template = node_data.get("template", {}) | ||
|         for field in template.values(): | ||
|             if isinstance(field, dict) and "value" in field: | ||
|                 value = field["value"] | ||
|                 if isinstance(value, str) and file_name in value: | ||
|                     return True | ||
|                 if isinstance(value, list): | ||
|                     for item in value: | ||
|                         if isinstance(item, str) and file_name in item: | ||
|                             return True | ||
|     return False | ||
| MAX_FILENAME_LENGTH = 255 | ||
| # Maximum reasonable extension length | ||
| MAX_EXTENSION_LENGTH = 20 |
why are these the max lengths?
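For context on the question: many common filesystems cap a single file name at 255 bytes or characters, which is a plausible origin for the first constant; the extension limit is described in the diff only as "reasonable". The sketch below shows one way such limits could be applied. `clamp_filename` is a hypothetical helper, not the PR's sanitization code, and it counts characters rather than bytes.

```python
from pathlib import PurePosixPath

MAX_FILENAME_LENGTH = 255
# Maximum reasonable extension length
MAX_EXTENSION_LENGTH = 20


def clamp_filename(name: str) -> str:
    """Hypothetical helper: shorten a file name to the two limits above."""
    # Drop directory components so only the final name is kept.
    base = PurePosixPath(name.replace("\\", "/")).name
    stem, dot, ext = base.rpartition(".")
    if not dot or not stem:
        # No extension, or a dotfile such as ".env": clamp the whole name.
        return base[:MAX_FILENAME_LENGTH]
    ext = ext[:MAX_EXTENSION_LENGTH]
    # Leave room for the dot and the (possibly shortened) extension.
    stem = stem[: MAX_FILENAME_LENGTH - len(ext) - 1]
    return f"{stem}.{ext}"


long_name = clamp_filename("a" * 300 + ".txt")
long_ext = clamp_filename("x." + "e" * 50)
traversal = clamp_filename("../../etc/passwd")
dotfile = clamp_filename(".env")
print(len(long_name), long_ext, traversal, dotfile)
```

A limit measured in characters can still exceed 255 bytes for non-ASCII names, so a real implementation may want to measure the encoded length instead.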
This pull request improves the file storage API: it enriches file metadata, adds file usage tracking, and updates the related tests and documentation. The changes standardize the return value of the `list_files` method to include file metadata, add an `is_used` flag to indicate file usage within flows, and update both the local and S3 storage implementations. Related tests and documentation have been revised to reflect these updates.

File Metadata & Usage Tracking Enhancements:
- The `list_files` method in both local and S3 storage backends now returns a list of dictionaries containing file metadata (`name` and `size`) instead of just file names. The abstract interface and all usages have been updated accordingly.
- The files API (`api/v1/files.py`) now includes an `is_used` flag for each file, indicating whether the file is referenced in the flow's nodes. The logic for determining file usage is implemented in a new utility function, which is also exported for use elsewhere.

Testing and Validation Updates:
- Tests have been updated for the new metadata and the `is_used` flag. Assertions now check for the `name`, `size`, and `is_used` fields in API responses.

Documentation Improvements:
- A new document (`langflow-files-api-comparison.md`) has been added, comparing the v1 and v2 file APIs. It details the differences in file association, metadata, batch operations, and UI/LFX support.

Internal Refactoring & Maintenance:
These changes collectively modernize the file management API, improve metadata richness, and establish a foundation for more advanced file operations and UI features.
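As a rough illustration of the shapes described above, the sketch below takes a storage listing of `name`/`size` dictionaries and adds an `is_used` flag per file. The `FileMetadata` type, the `annotate_files` helper, the sample values, and the assumption that `size` is in bytes are all hypothetical; the PR's actual endpoint code may look different.

```python
from typing import Callable, TypedDict


class FileMetadata(TypedDict):
    name: str
    size: int  # assumed to be bytes


def annotate_files(
    files: list[FileMetadata], is_used: Callable[[str], bool]
) -> list[dict]:
    # Hypothetical helper: add the is_used flag to each storage entry.
    return [{**f, "is_used": is_used(f["name"])} for f in files]


# What list_files is described as returning after this change.
storage_listing: list[FileMetadata] = [
    {"name": "file.txt", "size": 1024},
    {"name": "unused.csv", "size": 2048},
]
# Stand-in for a lookup such as is_file_used(flow_data, name).
referenced = {"file.txt"}
response = annotate_files(storage_listing, lambda name: name in referenced)
print(response)
```

Passing the usage check in as a callable keeps the storage layer unaware of flows, which matches the split the description draws between storage backends and the API layer.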
Summary by CodeRabbit
Release Notes
New Features
Documentation