Skip to content

feat: S3 file size and associations to flows#10819

Open
ricofurtado wants to merge 4 commits into
mainfrom
s3-file-size-and-associations-to-flows
Open

feat: S3 file size and associations to flows#10819
ricofurtado wants to merge 4 commits into
mainfrom
s3-file-size-and-associations-to-flows

Conversation

@ricofurtado
Copy link
Copy Markdown
Contributor

@ricofurtado ricofurtado commented Dec 1, 2025

This pull request introduces significant improvements to the file storage API, focusing on enhancing file metadata, supporting file usage tracking, and updating related tests and documentation. The changes standardize the return value of the list_files method to include file metadata, add an is_used flag to indicate file usage within flows, and update both local and S3 storage implementations. Related tests and documentation have been revised to reflect these updates.

File Metadata & Usage Tracking Enhancements:

  • The list_files method in both local and S3 storage backends now returns a list of dictionaries containing file metadata (name and size) instead of just file names. The abstract interface and all usages have been updated accordingly. [1] [2] [3] [4] [5]
  • The file listing API (api/v1/files.py) now includes an is_used flag for each file, indicating whether the file is referenced in the flow's nodes. The logic for determining file usage is implemented in a new utility function, which is also exported for use elsewhere. [1] [2] [3] [4]

Testing and Validation Updates:

  • All relevant integration and unit tests have been updated to validate the new file metadata structure and the presence of the is_used flag. Assertions now check for name, size, and is_used fields in API responses, ensuring robust coverage of the new functionality. [1] [2] [3] [4] [5]

Documentation Improvements:

  • A comprehensive markdown document (langflow-files-api-comparison.md) has been added, comparing v1 and v2 file APIs. It details the differences in file association, metadata, batch operations, and UI/LFX support, providing clear guidance for developers and users.

Internal Refactoring & Maintenance:

  • Adjusted orphaned record cleanup logic to accommodate the new file metadata structure, ensuring that file deletions reference the correct file name. [1] [2] [3]
  • Minor code and test maintenance, such as correcting line numbers and adding missing fields in test fixtures. [1] [2]

These changes collectively modernize the file management API, improve metadata richness, and establish a foundation for more advanced file operations and UI features.

Summary by CodeRabbit

Release Notes

  • New Features

    • Files now display usage status and size metadata in listings
    • Files actively used in flows are now protected from deletion
    • Improved filename sanitization for file uploads and downloads
    • Enhanced batch delete operations with detailed status reporting and error information
    • Better support for non-ASCII filenames during file downloads
  • Documentation

    • Added comprehensive File API comparison documentation

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions github-actions Bot added the community Pull Request from an external contributor label Dec 1, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 1, 2025

Walkthrough

Storage service layer refactored to return file metadata dictionaries with name and size instead of plain strings. File deletion enhanced with in-use detection and batch error handling. Filename sanitization added for security. New utility functions detect file usage within flows. Tests updated to validate new metadata structure and deletion behavior.

Changes

Cohort / File(s) Change Summary
Storage Service Interface Updates
src/backend/base/langflow/services/storage/service.py, src/backend/base/langflow/services/storage/local.py, src/backend/base/langflow/services/storage/s3.py
Storage service list_files signature changed from returning list[str] to list[dict] with name and size fields; implementations updated to build metadata dictionaries and compute file sizes via async stat operations or S3 object metadata.
Storage Cleanup Task
src/backend/base/langflow/services/task/temp_flow_cleanup.py
Orphaned flow cleanup updated to access file names via file["name"] from new dict structure; added rmdir for empty flow directories after file deletion; sqlalchemy.delete import added.
File API v2 Enhancements
src/backend/base/langflow/api/v2/files.py
Added filename sanitization (sanitize_filename, sanitize_content_disposition), file usage detection (is_file_used, is_file_in_use, get_user_flows), storage helpers (try_delete_from_storage, delete_from_storage), and save_file_routine; batch delete refined to skip in-use files and aggregate failures; uploads and edits apply sanitization; downloads use RFC 5987 encoding.
File API v1 Usage Tracking
src/backend/base/langflow/api/v1/files.py
Added is_file_used helper and integrated into list_files to attach is_used boolean flag to each file entry.
Utility Functions Export
src/backend/base/langflow/api/utils/__init__.py, src/backend/base/langflow/api/utils/core.py
New is_file_used utility function added to core module and exported via all list in init.py for cross-module accessibility.
Integration Tests
src/backend/tests/integration/storage/test_s3_storage_service.py
Test expectations updated to extract file names from returned dicts and validate name and size fields; flow isolation assertions adapted to dict-based results.
Unit Tests – File APIs
src/backend/tests/unit/api/v1/test_files.py, src/backend/tests/unit/api/v2/test_files.py
v1 tests expanded to validate is_used, size, and structured response; v2 tests updated to handle files_not_deleted field in delete responses, mock flows queries, and verify in-use file rejection; new test_delete_file_in_use added.
Unit Tests – Storage Services
src/backend/tests/unit/services/storage/test_local_storage_service.py
Test assertions updated to extract file names from dict entries and validate name/size fields; membership and exclusion checks adapted to dict-based list structure.
Configuration & Documentation
.secrets.baseline, langflow-files-api-comparison.md
.secrets.baseline entry for input_mixin.py repositioned and line number adjusted for test_files.py; new documentation file contrasting v1 flow-based and v2 user-based file API architectures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Storage signature cascade: list_files changes propagate across three storage implementations and affect multiple callsites (v1/v2 files.py, cleanup task, tests).
  • Deletion logic in v2/files.py: New in-use detection, batch error aggregation, and storage failure handling add decision points and require careful state management.
  • File sanitization: Security-critical path for filename handling; RFC 5987 encoding and ASCII quoting logic should be validated for correctness.
  • Test coverage density: Multiple test files updated with dict-based assertions; verify all new behavior (is_used flag, files_not_deleted, in-use rejection) is properly covered.

Possibly related PRs

Suggested labels

enhancement, size:M, lgtm

Suggested reviewers

  • ogabrielluiz
  • Adam-Aghili
  • jordanrfrazier

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 3 warnings)
Check name Status Explanation Resolution
Test Coverage For New Implementations ❌ Error PR lacks comprehensive unit test coverage for new functionality and fails to catch two critical unaddressed issues: ImportError bug and code duplication of is_file_used across three files. Fix ImportError by adding is_file_used to imports in init.py, deduplicate is_file_used by centralizing it, add unit tests for sanitization and helper functions, then verify tests catch the import error.
Docstring Coverage ⚠️ Warning Docstring coverage is 78.43% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Quality And Coverage ⚠️ Warning PR lacks comprehensive test coverage for security-critical functions sanitize_filename() and sanitize_content_disposition(), and new async helpers lack isolated unit tests. Add dedicated unit tests for sanitize_filename(), sanitize_content_disposition(), is_file_used(), save_file_routine(), and try_delete_from_storage() covering edge cases and error scenarios.
Excessive Mock Usage Warning ⚠️ Warning Pull request shows excessive mock usage with brittle interdependencies and repeated patterns, indicating poor test design that couples tests to implementation details rather than testing actual behavior. Refactor using integration test patterns with real session_scope() and model instantiation, or consolidate mocks into shared pytest fixtures and eliminate side_effect chains that assume exact call sequences.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Test File Naming And Structure ✅ Passed All test files follow test_*.py naming pattern with proper pytest structure, class-based organization across integration and unit directories, and comprehensive coverage of initialization, file operations, streaming, deletion, and edge cases.
Title check ✅ Passed The pull request title 'feat: S3 file size and associations to flows' directly aligns with the main changes: standardizing list_files to return file metadata (name and size) and adding file usage tracking (is_used flag) to show associations between files and flows.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch s3-file-size-and-associations-to-flows

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ricofurtado ricofurtado changed the title S3 file size and associations to flows feat: S3 file size and associations to flows Dec 1, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.39%. Comparing base (5226daa) to head (ef63f8d).
⚠️ Report is 406 commits behind head on main.

❌ Your project check has failed because the head coverage (40.04%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #10819      +/-   ##
==========================================
- Coverage   32.44%   32.39%   -0.06%     
==========================================
  Files        1367     1367              
  Lines       63315    63235      -80     
  Branches     9357     9358       +1     
==========================================
- Hits        20544    20482      -62     
+ Misses      41738    41720      -18     
  Partials     1033     1033              
Flag Coverage Δ
lfx 40.04% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/backend/base/langflow/api/utils/core.py 62.44% <ø> (ø)
src/backend/base/langflow/api/v1/files.py 66.14% <ø> (ø)
src/backend/base/langflow/api/v2/files.py 59.24% <ø> (-3.12%) ⬇️
...rc/backend/base/langflow/services/storage/local.py 85.54% <ø> (-0.35%) ⬇️
src/backend/base/langflow/services/storage/s3.py 11.88% <ø> (-0.53%) ⬇️
.../backend/base/langflow/services/storage/service.py 78.12% <ø> (-0.67%) ⬇️
...d/base/langflow/services/task/temp_flow_cleanup.py 62.16% <ø> (+0.62%) ⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions github-actions Bot added the enhancement New feature or request label Dec 1, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/backend/tests/integration/storage/test_s3_storage_service.py (2)

450-452: Bug: Assertion incompatible with new return type.

Line 452 compares file_name (a string) directly to files (now a list[dict]). This assertion will always fail since "to_delete.txt" in [{"name": "to_delete.txt", "size": 9}] is False.

Apply this diff:

         # Verify it exists
         files = await s3_storage_service.list_files(test_flow_id)
-        assert file_name in files
+        assert file_name in [f["name"] for f in files]

578-581: Bug: Same assertion incompatibility with new return type.

Similar to the previous issue, this assertion compares strings to a list of dicts and will fail.

Apply this diff:

             # Verify all files exist
             listed = await s3_storage_service.list_files(test_flow_id)
             assert len(listed) == 5
+            listed_names = [f["name"] for f in listed]
             for file_name in file_names:
-                assert file_name in listed
+                assert file_name in listed_names
🧹 Nitpick comments (3)
src/backend/base/langflow/api/utils/core.py (1)

416-433: Code duplication detected across three files.

This exact implementation of is_file_used exists in three locations:

  • src/backend/base/langflow/api/utils/core.py (this file)
  • src/backend/base/langflow/api/v1/files.py (lines 217-234)
  • src/backend/base/langflow/api/v2/files.py (lines 189-206)

Since this utility is now exported from api/utils, the other two files should import from here rather than duplicating the logic.

# In src/backend/base/langflow/api/v1/files.py and v2/files.py:
+from langflow.api.utils import is_file_used
-def is_file_used(flow_data: dict | None, file_name: str) -> bool:
-    """Check if a file is used in the flow."""
-    if not flow_data or "nodes" not in flow_data:
-        return False
-    ...
src/backend/base/langflow/api/v2/files.py (1)

190-207: Duplicate implementation - import from api/utils instead.

This is_file_used function is identical to the one in src/backend/base/langflow/api/utils/core.py. Since it's exported from utils, this file should import it rather than redefine it.

+from langflow.api.utils import is_file_used
+
 async def is_file_in_use(session: DbSession, user_id: uuid.UUID, file_name: str) -> bool:
     """Check if a file is used in any of the user's flows."""
     flows = await get_user_flows(session, user_id)
     return any(is_file_used(flow.data, file_name) for flow in flows)
-
-
-def is_file_used(flow_data: dict | None, file_name: str) -> bool:
-    """Check if a file is used in the flow."""
-    ...
src/backend/base/langflow/services/task/temp_flow_cleanup.py (1)

47-59: Clarify orphaned_flow_ids typing and avoid type: ignore

The cleanup logic is sound, but a couple of details are worth tightening up:

  • The type: ignore[arg-type] on the delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)) call suggests a mismatch between the inferred type of orphaned_flow_ids and what in_ expects.
  • The same orphaned_flow_ids iterable is later used as for flow_id in orphaned_flow_ids: and passed as str(flow_id) into list_files / delete_file and for constructing flow_dir. If session.exec(...).all() is returning row objects instead of bare scalar IDs, str(flow_id) will not match the actual flow ID string and both DB delete semantics and storage cleanup targets become brittle.

To make this robust and drop the type: ignore, consider shaping orphaned_flow_ids explicitly as a list of scalar IDs via .scalars().all() and annotating it:

-                orphaned_flow_ids = (
-                    await session.exec(
-                        select(col(table.flow_id).distinct()).where(col(table.flow_id).not_in(flow_ids_subquery))
-                    )
-                ).all()
+                result = await session.exec(
+                    select(col(table.flow_id))
+                    .where(col(table.flow_id).not_in(flow_ids_subquery))
+                    .distinct()
+                )
+                orphaned_flow_ids: list[str] = result.scalars().all()
...
-                    await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))  # type: ignore[arg-type]
+                    await session.exec(delete(table).where(col(table.flow_id).in_(orphaned_flow_ids)))

This also guarantees that the flow_id used for file listing/deletion and directory removal is the plain ID value you expect.

The switch to file["name"] for delete_file and logging correctly aligns with the new list_files metadata format.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 056faea and ef63f8d.

📒 Files selected for processing (14)
  • .secrets.baseline (3 hunks)
  • langflow-files-api-comparison.md (1 hunks)
  • src/backend/base/langflow/api/utils/__init__.py (1 hunks)
  • src/backend/base/langflow/api/utils/core.py (1 hunks)
  • src/backend/base/langflow/api/v1/files.py (2 hunks)
  • src/backend/base/langflow/api/v2/files.py (11 hunks)
  • src/backend/base/langflow/services/storage/local.py (2 hunks)
  • src/backend/base/langflow/services/storage/s3.py (2 hunks)
  • src/backend/base/langflow/services/storage/service.py (1 hunks)
  • src/backend/base/langflow/services/task/temp_flow_cleanup.py (3 hunks)
  • src/backend/tests/integration/storage/test_s3_storage_service.py (2 hunks)
  • src/backend/tests/unit/api/v1/test_files.py (4 hunks)
  • src/backend/tests/unit/api/v2/test_files.py (11 hunks)
  • src/backend/tests/unit/services/storage/test_local_storage_service.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/backend/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/**/*.py: Use FastAPI async patterns with await for async operations in component execution methods
Use asyncio.create_task() for background tasks and implement proper cleanup with try/except for asyncio.CancelledError
Use queue.put_nowait() for non-blocking queue operations and asyncio.wait_for() with timeouts for controlled get operations

Files:

  • src/backend/base/langflow/api/v1/files.py
  • src/backend/base/langflow/api/utils/core.py
  • src/backend/base/langflow/services/storage/local.py
  • src/backend/base/langflow/services/storage/service.py
  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/base/langflow/services/task/temp_flow_cleanup.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
  • src/backend/base/langflow/services/storage/s3.py
  • src/backend/base/langflow/api/utils/__init__.py
  • src/backend/base/langflow/api/v2/files.py
src/backend/base/langflow/api/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

Backend API endpoints should be organized by version (v1/, v2/) under src/backend/base/langflow/api/ with specific modules for features (chat.py, flows.py, users.py, etc.)

Files:

  • src/backend/base/langflow/api/v1/files.py
  • src/backend/base/langflow/api/utils/core.py
  • src/backend/base/langflow/api/utils/__init__.py
  • src/backend/base/langflow/api/v2/files.py
src/backend/tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)

src/backend/tests/**/*.py: Place backend unit tests in src/backend/tests/ directory, component tests in src/backend/tests/unit/components/ organized by component subdirectory, and integration tests accessible via make integration_tests
Use same filename as component with appropriate test prefix/suffix (e.g., my_component.pytest_my_component.py)
Use the client fixture (FastAPI Test Client) defined in src/backend/tests/conftest.py for API tests; it provides an async httpx.AsyncClient with automatic in-memory SQLite database and mocked environment variables. Skip client creation by marking test with @pytest.mark.noclient
Inherit from the correct ComponentTestBase family class located in src/backend/tests/base.py based on API access needs: ComponentTestBase (no API), ComponentTestBaseWithClient (needs API), or ComponentTestBaseWithoutClient (pure logic). Provide three required fixtures: component_class, default_kwargs, and file_names_mapping
Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
Use @pytest.mark.asyncio decorator for async component tests and ensure async methods are properly awaited
Test background tasks using asyncio.create_task() and verify completion with asyncio.wait_for() with appropriate timeout constraints
Test queue operations using non-blocking queue.put_nowait() and asyncio.wait_for(queue.get(), timeout=...) to verify queue processing without blocking
Use @pytest.mark.no_blockbuster marker to skip the blockbuster plugin in specific tests
For database tests that may fail in batch runs, run them sequentially using uv run pytest src/backend/tests/unit/test_database.py r...

Files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (Custom checks)

Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)

Files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
**/test_*.py

📄 CodeRabbit inference engine (Custom checks)

**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend

Files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
🧠 Learnings (13)
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/api/**/*.py : Backend API endpoints should be organized by version (v1/, v2/) under `src/backend/base/langflow/api/` with specific modules for features (chat.py, flows.py, users.py, etc.)

Applied to files:

  • langflow-files-api-comparison.md
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `aiofiles` and `anyio.Path` for async file operations in tests; create temporary test files using `tmp_path` fixture and verify file existence and content

Applied to files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test component versioning and backward compatibility using `file_names_mapping` fixture with `VersionComponentMapping` objects mapping component files across Langflow versions

Applied to files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
  • src/backend/tests/unit/services/storage/test_local_storage_service.py
  • src/backend/tests/integration/storage/test_s3_storage_service.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `monkeypatch` fixture to mock internal functions for testing error handling scenarios; validate error status codes and error message content in responses

Applied to files:

  • src/backend/tests/unit/api/v1/test_files.py
  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use async fixtures with proper cleanup using try/finally blocks to ensure resources are properly released after tests complete

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `pytest.mark.api_key_required` and `pytest.mark.no_blockbuster` markers for components that need external APIs; use `MockLanguageModel` from `tests.unit.mock_language_model` for testing without external API keys

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Each test should have a clear docstring explaining its purpose; complex test setups should be commented; mock usage should be documented; expected behaviors should be explicitly stated

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use `pytest.mark.asyncio` decorator for async component tests and ensure async methods are properly awaited

Applied to files:

  • src/backend/tests/unit/api/v2/test_files.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/**/*.py : Use `asyncio.create_task()` for background tasks and implement proper cleanup with try/except for `asyncio.CancelledError`

Applied to files:

  • src/backend/base/langflow/services/task/temp_flow_cleanup.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/components/**/__init__.py : Update `__init__.py` with alphabetically sorted imports when adding new components

Applied to files:

  • src/backend/base/langflow/services/task/temp_flow_cleanup.py
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/base/langflow/services/database/models/**/*.py : Database models should be organized by domain (api_key/, flow/, folder/, user/, etc.) under `src/backend/base/langflow/services/database/models/`

Applied to files:

  • src/backend/base/langflow/services/task/temp_flow_cleanup.py
🧬 Code graph analysis (5)
src/backend/base/langflow/api/v1/files.py (2)
src/backend/base/langflow/api/v2/files.py (1)
  • is_file_used (190-207)
src/backend/base/langflow/api/utils/core.py (1)
  • is_file_used (416-433)
src/backend/base/langflow/api/utils/core.py (2)
src/backend/base/langflow/api/v1/files.py (1)
  • is_file_used (218-235)
src/backend/base/langflow/api/v2/files.py (1)
  • is_file_used (190-207)
src/backend/base/langflow/services/storage/local.py (2)
src/backend/base/langflow/services/storage/s3.py (1)
  • list_files (225-262)
src/backend/base/langflow/services/storage/service.py (1)
  • list_files (43-44)
src/backend/tests/unit/api/v2/test_files.py (4)
src/backend/base/langflow/api/v2/files.py (1)
  • delete_file (790-818)
src/backend/base/langflow/services/storage/local.py (1)
  • delete_file (157-172)
src/backend/base/langflow/services/storage/s3.py (1)
  • delete_file (264-284)
src/backend/base/langflow/services/storage/service.py (1)
  • delete_file (51-52)
src/backend/base/langflow/services/task/temp_flow_cleanup.py (6)
src/backend/base/langflow/api/v1/files.py (1)
  • delete_file (205-215)
src/backend/base/langflow/services/storage/local.py (1)
  • delete_file (157-172)
src/backend/base/langflow/services/storage/s3.py (1)
  • delete_file (264-284)
src/backend/base/langflow/services/storage/service.py (1)
  • delete_file (51-52)
src/lfx/src/lfx/services/storage/service.py (1)
  • delete_file (160-170)
src/lfx/src/lfx/services/storage/local.py (1)
  • delete_file (139-154)
🪛 Gitleaks (8.29.1)
.secrets.baseline

[high] 878-878: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 LanguageTool
langflow-files-api-comparison.md

[style] ~13-~13: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e user, with robust error handling. - File Metadata Update: Allows renaming file...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Lint Backend / Run Mypy (3.12)
  • GitHub Check: Test Docker Images / Test docker images
  • GitHub Check: Lint Backend / Run Mypy (3.11)
  • GitHub Check: Lint Backend / Run Mypy (3.10)
  • GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
  • GitHub Check: Run Backend Tests / LFX Tests - Python 3.10
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
  • GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
  • GitHub Check: Test Starter Templates
  • GitHub Check: Optimize new Python code in this PR
  • GitHub Check: Update Component Index
  • GitHub Check: Update Starter Projects
🔇 Additional comments (21)
.secrets.baseline (1)

874-883: LGTM - Secrets baseline updated correctly.

The entry for input_mixin.py with is_secret: false correctly tracks a known false positive. The static analysis warning from Gitleaks is expected here since this baseline file contains hashed representations of detected patterns, not actual secrets.

langflow-files-api-comparison.md (1)

1-64: Good documentation comparing v1 and v2 file APIs.

The comparison tables clearly outline the functional differences between the flow-based v1 and user-based v2 APIs. This will help developers understand which API to use for their use case.

src/backend/tests/integration/storage/test_s3_storage_service.py (1)

395-401: Good test updates for new file metadata structure.

The test correctly validates the new dict structure with name and size fields, and verifies the expected size of 7 bytes for "content".

src/backend/base/langflow/api/v1/files.py (1)

195-197: Good integration of file usage tracking.

The is_used flag is correctly added to each file entry by checking the flow's node templates.

src/backend/base/langflow/services/storage/service.py (1)

42-44: Interface change is correctly implemented across all backend implementations.

The return type change from list[str] to list[dict] on line 43 aligns with the PR objective to include file metadata. All StorageService implementations in the backend (local.py and s3.py) have been properly updated to match the new return type. This is a breaking change for the abstract interface, but all known implementations are consistent.

src/backend/tests/unit/api/v1/test_files.py (3)

61-61: LGTM!

Adding optins=None ensures the fixture aligns with the User model's expected fields, preventing potential validation issues.


208-214: Good test coverage for new file metadata structure.

The assertions correctly validate:

  • Presence of required fields (name, size, is_used)
  • File name suffix matching
  • Correct size calculation (12 bytes for "test content")
  • Type validation for is_used as boolean

252-254: LGTM!

The test correctly adapts to the new dict-based file listing by extracting file names before performing membership assertions.

Also applies to: 269-271

src/backend/tests/unit/services/storage/test_local_storage_service.py (2)

168-174: LGTM!

The test correctly validates the new file metadata structure with proper assertions for both name and size fields. The size validation (7 bytes for "content") is accurate.


186-187: LGTM!

All list operation tests are properly adapted to extract file names from the dict entries before performing membership assertions, maintaining test clarity.

Also applies to: 203-204, 221-223, 240-241

src/backend/base/langflow/services/storage/local.py (1)

127-155: LGTM! Clean implementation of enriched file metadata.

The updated list_files method:

  • Returns consistent structure (name, size) matching the S3 implementation
  • Uses proper async iteration with folder_path.iterdir()
  • Correctly filters to only include files (not directories)
  • Handles edge cases (missing directory, errors) gracefully

The per-file stat() call is a reasonable trade-off for providing size metadata.

src/backend/base/langflow/services/storage/s3.py (1)

225-262: LGTM! Efficient S3 implementation leveraging existing metadata.

The S3 implementation efficiently extracts file size from the list_objects_v2 response (which includes Size by default), avoiding any additional API calls compared to the previous implementation.

src/backend/tests/unit/api/v2/test_files.py (3)

202-202: LGTM!

The expected response format correctly reflects the new API contract that includes files_not_deleted field for consistency across delete operations.

Also applies to: 762-762


970-974: LGTM!

The mock setup consistently adds mock_exec_flows to simulate the additional database query for checking file usage in flows. The side_effect pattern correctly returns different results for sequential session.exec calls.

Also applies to: 1018-1022, 1070-1074, 1132-1136, 1171-1175, 1220-1224, 1261-1265


1292-1339: Good test coverage for file-in-use protection.

This new test verifies the critical behavior that files referenced in flow nodes cannot be deleted, protecting users from accidentally breaking their flows. The test correctly:

  • Mocks a flow with a node template referencing the file
  • Verifies the appropriate response structure with files_not_deleted
  • Confirms storage and database delete are NOT called
src/backend/base/langflow/api/v2/files.py (5)

38-73: Good security-conscious filename sanitization.

The implementation handles key security concerns:

  • Path traversal prevention via Path(filename).name
  • Dangerous character replacement with safe subset
  • Hidden file prevention by stripping leading dots
  • Length limits with extension preservation

One minor note: the regex [^\w.\- ()] allows underscores (via \w) but the comment mentions them explicitly. This is correct behavior.


76-100: RFC 5987 compliant Content-Disposition handling.

Good implementation supporting both ASCII and non-ASCII filenames with proper encoding fallback. The quote escaping for backslash and double-quote characters prevents header injection.


800-804: File-in-use protection returns 200 with informative response.

This is a design choice worth noting: the endpoint returns HTTP 200 (not 4xx) when a file cannot be deleted due to being in use. This allows the client to distinguish between errors and intentional rejections while maintaining a consistent response structure.


536-577: Comprehensive batch delete with proper failure categorization.

The implementation correctly:

  • Separates in-use files before attempting deletion
  • Categorizes storage failures as transient vs permanent
  • Only deletes DB records for files successfully removed from storage (or permanently gone)
  • Provides detailed response with deleted, not-deleted, and failure counts

761-771: Good input validation for file renaming.

The validation ensures:

  • Empty names are rejected
  • Names containing disallowed characters are rejected with clear feedback
  • Only alphanumeric characters, spaces, dots, hyphens, underscores, and parentheses are permitted

The strict comparison (sanitized_name != name.strip()) is intentional: it prevents silent modification of user input and explicitly rejects names containing characters that would be converted or removed (e.g., @, #, !). This aligns with the security principle of failing fast rather than implicitly sanitizing. The error message clearly communicates what characters are allowed.

src/backend/base/langflow/services/task/temp_flow_cleanup.py (1)

8-9: SQLAlchemy/sqlmodel imports are appropriate

delete, col, and select are imported consistently with their usage below; no issues here.

"get_top_level_vertices",
# Functions
"has_api_terms",
"is_file_used",
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

is_file_used is exported but not imported.

The function is_file_used is added to __all__ but is missing from the import statement at lines 9-39. This will cause an ImportError when attempting to import it from langflow.api.utils.

Add is_file_used to the import statement:

 from langflow.api.utils.core import (
     API_WORDS,
     MAX_PAGE_SIZE,
     MIN_PAGE_SIZE,
     CurrentActiveMCPUser,
     CurrentActiveUser,
     DbSession,
     EventDeliveryType,
     build_and_cache_graph_from_data,
     build_graph_from_data,
     build_graph_from_db,
     build_graph_from_db_no_cache,
     build_input_keys_response,
     cascade_delete_flow,
     check_langflow_version,
     custom_params,
     extract_global_variables_from_headers,
     format_elapsed_time,
     format_exception_message,
     format_syntax_error_message,
     get_causing_exception,
     get_is_component_from_data,
     get_suggestion_message,
     get_top_level_vertices,
     has_api_terms,
+    is_file_used,
     parse_exception,
     parse_value,
     remove_api_keys,
     validate_is_component,
     verify_public_flow_and_get_user,
 )
🤖 Prompt for AI Agents
In src/backend/base/langflow/api/utils/__init__.py around line 71, the name
"is_file_used" is included in __all__ but not imported in the import block at
lines 9-39; add "is_file_used" to that import statement (import it from the same
module where the other file utility functions are imported from) so the symbol
exported in __all__ is actually defined and avoid the ImportError.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricofurtado this seems relevent

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +218 to +235
def is_file_used(flow_data: dict | None, file_name: str) -> bool:
"""Check if a file is used in the flow."""
if not flow_data or "nodes" not in flow_data:
return False

for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
for item in value:
if isinstance(item, str) and file_name in item:
return True
return False
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Code duplication: is_file_used is defined in three places.

This function is identically implemented in:

  1. Here (api/v1/files.py)
  2. api/v2/files.py (lines 189-206)
  3. api/utils/core.py (lines 415-432)

Since the function is already exported from langflow.api.utils (once the missing import is fixed), this local definition should be removed in favor of importing from the centralized location.

Remove the local definition and import from utils:

-from langflow.api.utils import CurrentActiveUser, DbSession
+from langflow.api.utils import CurrentActiveUser, DbSession, is_file_used
 from langflow.api.v1.schemas import UploadFileResponse

Then remove the local function definition at lines 218-235.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/backend/base/langflow/api/v1/files.py around lines 218 to 235, remove the
local is_file_used function and instead import the canonical implementation from
langflow.api.utils (or langflow.api.utils.core) where it is exported; update the
module imports to add the single import (e.g., from langflow.api.utils import
is_file_used) and delete the duplicated function definition at lines 218-235,
making sure all local references continue to work with the imported function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRY, the question is: where to put it utils/core?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricofurtado Yes, utils/core.py is the right place for is_file_used! Here's why:

Why utils/core.py is appropriate:

  1. API-agnostic utility - The function operates purely on dictionary data structures, with no dependencies on v1 or v2 specific logic
  2. Shared across versions - Both v1 and v2 file APIs need this functionality
  3. Already exported - It's already in utils/core.py and exported from api/utils/__init__.py
  4. Consistent pattern - Other flow-data utilities already live in that module

Next steps to complete the DRY refactor:

  1. In api/v1/files.py - Remove the local definition (lines 218-235 in the diff) and import:

    from langflow.api.utils import CurrentActiveUser, DbSession, is_file_used
  2. In api/v2/files.py - Remove the duplicate at lines 189-206 and add the same import

This keeps the canonical implementation in utils/core.py where it's version-neutral, tested once, and maintained in a single location.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up on this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +421 to +432
for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
for item in value:
if isinstance(item, str) and file_name in item:
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚡️Codeflash found 21% (0.21x) speedup for is_file_used in src/backend/base/langflow/api/utils/core.py

⏱️ Runtime : 1.74 milliseconds 1.44 milliseconds (best of 124 runs)

📝 Explanation and details

The optimized code achieves a 20% speedup through several key micro-optimizations that reduce redundant operations and improve early exit behavior:

Key Optimizations:

  1. Eliminated chained .get() calls with defaults: The original code used node.get("data", {}).get("node", {}) and node_data.get("template", {}), which creates temporary empty dictionaries even when keys don't exist. The optimized version uses single .get() calls followed by explicit None checks with continue statements, avoiding unnecessary object creation.

  2. Added intermediate variable storage: Storing nodes = flow_data["nodes"] once avoids repeated dictionary lookups. Similarly, storing val = field.get("value") eliminates the redundant field["value"] access after the "value" in field check.

  3. Restructured conditional logic for better short-circuiting: The optimized version uses early continue statements to skip nodes missing required keys (data, node, template), reducing nesting and improving branch prediction. This is particularly effective when dealing with malformed nodes.

  4. Simplified field validation: Instead of isinstance(field, dict) and "value" in field, the code first checks isinstance(field, dict) with a continue, then directly gets the value, eliminating the redundant "value" in field check followed by field["value"] access.

Performance Impact:
The optimizations are most effective for scenarios with:

  • Large node counts (test cases with 1000+ nodes show the biggest gains)
  • Nodes with missing or malformed structure (early exits reduce unnecessary processing)
  • Complex template hierarchies (reduced dictionary lookups compound savings)

The line profiler shows the optimized version processes the same workload with fewer operations per line, particularly in the hot paths where field validation and value extraction occur thousands of times. While the total runtime appears slightly higher in the profiler due to additional conditional checks, the actual measured runtime is 20% faster, indicating more efficient execution paths and reduced memory allocation overhead.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 52 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

# imports
import pytest
from langflow.api.utils.core import is_file_used

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_basic_file_found_as_exact_string():
    # File name appears as exact string in template value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "myfile.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_basic_file_not_found():
    # File name does not appear in any node
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "otherfile.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_basic_file_found_in_list():
    # File name appears in a list of values
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": ["a.txt", "myfile.txt", "b.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_basic_file_found_as_substring():
    # File name appears as substring in a value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "folder/myfile.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_basic_multiple_nodes_one_match():
    # File name appears in only one of several nodes
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "notit.txt"}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "myfile.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

# -------------------- EDGE TEST CASES --------------------

def test_edge_empty_flow_data():
    # flow_data is None
    codeflash_output = is_file_used(None, "myfile.txt")
    # flow_data is empty dict
    codeflash_output = is_file_used({}, "myfile.txt")

def test_edge_no_nodes_key():
    # flow_data missing 'nodes' key
    flow_data = {"something_else": []}
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_nodes_empty():
    # 'nodes' is empty list
    flow_data = {"nodes": []}
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_node_missing_data_node_template():
    # Node missing 'data', 'node', or 'template'
    flow_data = {
        "nodes": [
            {},  # no data
            {"data": {}},  # no node
            {"data": {"node": {}}},  # no template
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_template_field_not_dict():
    # Template value is not a dict
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": "myfile.txt"
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_field_dict_without_value():
    # Template field is dict but has no 'value' key
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"not_value": "myfile.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_value_is_list_with_non_string_items():
    # List contains non-string items
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": ["a.txt", 123, None, {"x": 1}, "myfile.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_value_is_empty_string_or_list():
    # Value is empty string
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": ""}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

    # Value is empty list
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": []}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_edge_file_name_is_empty_string():
    # file_name is empty string, should match any non-empty string value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "something"}
                        }
                    }
                }
            }
        ]
    }
    # '' in 'something' is always True
    codeflash_output = is_file_used(flow_data, "")

def test_edge_file_name_not_in_any_value():
    # file_name is not a substring of any value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "file": {"value": "abc.txt"},
                            "files": {"value": ["def.txt", "ghi.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "xyz.txt")

def test_edge_value_is_list_with_substring_matches():
    # file_name is a substring of one of the list items
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": ["folder/myfile.txt", "other.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_scale_many_nodes_file_at_end():
    # Many nodes, file name only at the last node
    nodes = [
        {
            "data": {
                "node": {
                    "template": {
                        "file": {"value": f"file_{i}.txt"}
                    }
                }
            }
        }
        for i in range(999)
    ]
    nodes.append({
        "data": {
            "node": {
                "template": {
                    "file": {"value": "myfile.txt"}
                }
            }
        }
    })
    flow_data = {"nodes": nodes}
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_large_scale_many_nodes_no_match():
    # Many nodes, no file name matches
    nodes = [
        {
            "data": {
                "node": {
                    "template": {
                        "file": {"value": f"file_{i}.txt"}
                    }
                }
            }
        }
        for i in range(1000)
    ]
    flow_data = {"nodes": nodes}
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_large_scale_node_with_large_list():
    # One node with a large list, file name in the middle
    files = [f"file_{i}.txt" for i in range(500)]
    files.insert(250, "myfile.txt")
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": files}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_large_scale_multiple_possible_fields():
    # Many nodes, each with multiple template fields, file name in one field
    nodes = []
    for i in range(500):
        nodes.append({
            "data": {
                "node": {
                    "template": {
                        "field1": {"value": f"file_{i}.txt"},
                        "field2": {"value": f"other_{i}.txt"},
                        "field3": {"value": [f"list_{i}.txt", "myfile.txt" if i == 123 else f"not_{i}.txt"]}
                    }
                }
            }
        })
    flow_data = {"nodes": nodes}
    codeflash_output = is_file_used(flow_data, "myfile.txt")

def test_large_scale_file_name_appears_multiple_times():
    # File name appears in multiple nodes and fields
    nodes = []
    for i in range(10):
        nodes.append({
            "data": {
                "node": {
                    "template": {
                        "field": {"value": "myfile.txt" if i % 3 == 0 else f"file_{i}.txt"}
                    }
                }
            }
        })
    flow_data = {"nodes": nodes}
    codeflash_output = is_file_used(flow_data, "myfile.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from langflow.api.utils.core import is_file_used

# unit tests

# -------------------
# Basic Test Cases
# -------------------

def test_basic_file_found_in_single_node():
    # File name is directly in a string value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input_file": {"value": "my_file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_basic_file_not_found():
    # File name is not present
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input_file": {"value": "other_file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_basic_file_in_list():
    # File name is present in a list of strings
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": ["a.txt", "my_file.txt", "b.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_basic_file_not_in_list():
    # File name is not present in the list
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "files": {"value": ["a.txt", "b.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_basic_file_substring_match():
    # File name is a substring of the value
    flow_data = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input_file": {"value": "prefix_my_file.txt_suffix"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

# -------------------
# Edge Test Cases
# -------------------

def test_edge_flow_data_none():
    # flow_data is None
    codeflash_output = is_file_used(None, "my_file.txt")

def test_edge_flow_data_empty_dict():
    # flow_data is empty dict
    codeflash_output = is_file_used({}, "my_file.txt")

def test_edge_nodes_missing():
    # flow_data missing 'nodes' key
    flow_data = {"not_nodes": []}
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_nodes_empty_list():
    # flow_data with empty nodes list
    flow_data = {"nodes": []}
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_node_missing_data():
    # node missing 'data' key
    flow_data = {
        "nodes": [
            {}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_node_missing_node_key():
    # node['data'] missing 'node' key
    flow_data = {
        "nodes": [
            {"data": {}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_node_missing_template_key():
    # node['data']['node'] missing 'template' key
    flow_data = {
        "nodes": [
            {"data": {"node": {}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_template_empty():
    # template is empty dict
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_field_not_dict():
    # template field is not a dict
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": "my_file.txt"}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_field_dict_missing_value():
    # template field dict missing 'value' key
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_value_not_str_or_list():
    # 'value' is an int, not a str or list
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": 123}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_value_list_with_non_str_items():
    # 'value' is a list with non-str items
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"files": {"value": ["a.txt", 123, None]}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_file_name_empty_string():
    # file_name is empty string, should match any string value containing ''
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": "something"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "")

def test_edge_file_name_special_characters():
    # file_name contains special characters
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": "file@#$.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "file@#$.txt")

def test_edge_multiple_nodes_file_in_second():
    # file_name present in second node only
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": "other.txt"}}}}},
            {"data": {"node": {"template": {"input_file": {"value": "my_file.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_multiple_fields_file_in_second_field():
    # file_name present in second field only
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {
                "field1": {"value": "other.txt"},
                "field2": {"value": "my_file.txt"}
            }}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_value_list_file_substring():
    # file_name is a substring of a list item
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"files": {"value": ["prefix_my_file.txt_suffix"]}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_edge_file_name_not_in_any_node():
    # file_name not present in any node
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": "other.txt"}}}}},
            {"data": {"node": {"template": {"input_file": {"value": "another.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

# -------------------
# Large Scale Test Cases
# -------------------

def test_large_scale_many_nodes_file_in_last():
    # Large number of nodes, file_name present in the last node
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": f"file_{i}.txt"}}}}}
            for i in range(999)
        ] + [
            {"data": {"node": {"template": {"input_file": {"value": "my_file.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_large_scale_many_nodes_file_not_present():
    # Large number of nodes, file_name not present
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"input_file": {"value": f"file_{i}.txt"}}}}}
            for i in range(1000)
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_large_scale_many_fields_per_node_file_in_middle_field():
    # Each node has many fields, file_name present in a middle field of one node
    fields = {
        f"field_{i}": {"value": f"file_{i}.txt"} for i in range(500)
    }
    fields["field_250"] = {"value": "my_file.txt"}
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": fields}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_large_scale_value_list_file_in_middle():
    # 'value' is a large list, file_name present in the middle
    values = [f"file_{i}.txt" for i in range(500)]
    values[250] = "my_file.txt"
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {"files": {"value": values}}}}}
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_large_scale_multiple_nodes_and_fields_file_not_present():
    # Many nodes and many fields, file_name not present
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {
                f"field_{j}": {"value": f"file_{i}_{j}.txt"}
                for j in range(10)
            }}}}
            for i in range(100)
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")

def test_large_scale_multiple_nodes_and_fields_file_in_first_node_last_field():
    # Many nodes and many fields, file_name present in first node, last field
    flow_data = {
        "nodes": [
            {"data": {"node": {"template": {
                **{f"field_{j}": {"value": f"file_{0}_{j}.txt"} for j in range(9)},
                "field_9": {"value": "my_file.txt"}
            }}}},
            *[
                {"data": {"node": {"template": {
                    f"field_{j}": {"value": f"file_{i}_{j}.txt"}
                    for j in range(10)
                }}}}
                for i in range(1, 100)
            ]
        ]
    }
    codeflash_output = is_file_used(flow_data, "my_file.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally git merge codeflash/optimize-pr10819-2025-12-01T19.59.30

Click to see suggested changes
Suggested change
for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
for item in value:
if isinstance(item, str) and file_name in item:
return True
nodes = flow_data["nodes"]
for node in nodes:
node_data = node.get("data")
if not node_data:
continue
node_obj = node_data.get("node")
if not node_obj:
continue
template = node_obj.get("template")
if not template:
continue
for field in template.values():
# Fastest path: common-case check, avoid double get
if not isinstance(field, dict):
continue
val = field.get("value")
if isinstance(val, str):
if file_name in val:
return True
elif isinstance(val, list):
for item in val:
if isinstance(item, str) and file_name in item:
return True

Comment on lines +195 to +203
for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚡️Codeflash found 18% (0.18x) speedup for is_file_used in src/backend/base/langflow/api/v2/files.py

⏱️ Runtime : 1.78 milliseconds 1.52 milliseconds (best of 122 runs)

📝 Explanation and details

The optimized code achieves a 17% speedup through several key micro-optimizations that reduce object allocations and method calls:

Key Optimizations:

  1. Eliminated unnecessary dict allocations: The original code used .get("key", {}) which creates empty dictionaries even when not needed. The optimized version uses .get("key") and explicit None checks, avoiding these allocations entirely.

  2. Reduced chained method calls: Instead of node.get("data", {}).get("node", {}), the optimization breaks this into separate calls with early exit conditions, reducing the number of method invocations per iteration.

  3. Faster type checking: Replaced isinstance(field, dict) and "value" in field with type(field) is dict followed by .get("value"). The type() check is faster than isinstance(), and using .get() instead of membership testing followed by dictionary access is more efficient.

  4. Better control flow structure: Added explicit continue statements for early exit when intermediate objects are None, avoiding unnecessary nested operations on invalid data.

Performance Impact:
The optimizations are most effective for flows with:

  • Many nodes with missing or incomplete data structures (benefits from early exits)
  • Large templates with mixed field types (benefits from faster type checking)
  • Scenarios where the file is found early (benefits from reduced per-iteration overhead)

From the test results, the optimization provides consistent speedups across various workload patterns, from simple single-node cases to large-scale flows with 1000+ nodes. The early-exit optimizations are particularly beneficial when processing malformed or incomplete node data, which appears common in real-world usage based on the test coverage.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 52 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import pytest
from langflow.api.v2.files import is_file_used

# unit tests

# Basic Test Cases

def test_file_used_simple_string():
    # File name appears in a string value
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_substring():
    # File name is a substring within the value
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "path/to/myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_in_list():
    # File name appears in a list of strings
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": ["other.txt", "myfile.txt"]}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_not_used():
    # File name does not appear anywhere
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_nodes():
    # File name appears in only one node among several
    flow = {
        "nodes": [
            {"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}},
            {"data": {"node": {"template": {"f2": {"value": "myfile.txt"}}}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

# Edge Test Cases

def test_empty_flow_data():
    # flow_data is None
    codeflash_output = is_file_used(None, "myfile.txt")

def test_flow_data_missing_nodes():
    # flow_data does not have "nodes"
    flow = {"something_else": []}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_empty_nodes_list():
    # "nodes" is an empty list
    flow = {"nodes": []}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_missing_data():
    # Node missing "data" key
    flow = {"nodes": [{}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_data_missing_node():
    # Node's "data" missing "node"
    flow = {"nodes": [{"data": {}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_node_template_missing():
    # Node's "node" missing "template"
    flow = {"nodes": [{"data": {"node": {}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_field_not_dict():
    # Template field is not a dict
    flow = {"nodes": [{"data": {"node": {"template": {"f1": "notadict"}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_field_dict_no_value():
    # Template field is dict but missing "value"
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"notvalue": "x"}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_list_of_non_strings():
    # Value is a list, but contains non-strings
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": [1, 2, 3]}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_empty_string():
    # Value is an empty string
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": ""}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_value_is_empty_list():
    # Value is an empty list
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": []}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_case_sensitive():
    # File name matching is case sensitive
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "MyFile.txt"}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")  # Should be case sensitive

def test_file_name_is_empty():
    # File name is empty string, should match all non-empty strings
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "something"}}}}}]}
    codeflash_output = is_file_used(flow, "")

def test_file_used_in_list_substring():
    # File name is substring in one of the list items
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": ["abc_myfile.txt_def"]}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_fields():
    # File name appears in multiple template fields
    flow = {
        "nodes": [
            {"data": {"node": {"template": {
                "f1": {"value": "other.txt"},
                "f2": {"value": "myfile.txt"}
            }}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_multiple_list_items():
    # File name appears in several items in a value list
    flow = {
        "nodes": [
            {"data": {"node": {"template": {
                "f1": {"value": ["myfile.txt", "anotherfile.txt", "myfile.txt"]}
            }}}}
        ]
    }
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_file_used_with_special_characters():
    # File name contains special regex characters
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": "my[file].txt"}}}}}]}
    codeflash_output = is_file_used(flow, "my[file].txt")

# Large Scale Test Cases

def test_large_number_of_nodes_file_present():
    # Large flow with file present in one of many nodes
    nodes = [{"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}} for _ in range(999)]
    nodes.append({"data": {"node": {"template": {"f1": {"value": "myfile.txt"}}}}})
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_number_of_nodes_file_absent():
    # Large flow with file absent
    nodes = [{"data": {"node": {"template": {"f1": {"value": "other.txt"}}}}} for _ in range(1000)]
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_number_of_fields_in_template():
    # Large number of fields in a single template, file present in one
    template = {f"f{i}": {"value": "other.txt"} for i in range(999)}
    template["target"] = {"value": "myfile.txt"}
    flow = {"nodes": [{"data": {"node": {"template": template}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_list_in_value():
    # Value is a large list, file present in one item
    value_list = ["other.txt"] * 999 + ["myfile.txt"]
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": value_list}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")

def test_large_list_in_value_absent():
    # Value is a large list, file not present
    value_list = ["other.txt"] * 1000
    flow = {"nodes": [{"data": {"node": {"template": {"f1": {"value": value_list}}}}}]}
    codeflash_output = is_file_used(flow, "myfile.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from langflow.api.v2.files import is_file_used

# unit tests

# ----------------------------- Basic Test Cases -----------------------------

def test_none_flow_data_returns_false():
    # flow_data is None
    codeflash_output = is_file_used(None, "file.txt")

def test_missing_nodes_key_returns_false():
    # flow_data missing 'nodes' key
    codeflash_output = is_file_used({}, "file.txt")

def test_empty_nodes_list_returns_false():
    # flow_data with empty nodes list
    codeflash_output = is_file_used({"nodes": []}, "file.txt")

def test_file_used_in_single_node_string_value():
    # file_name present as substring in a string value
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "some/path/file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_not_used_in_single_node_string_value():
    # file_name not present in any value
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "some/path/other.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_used_in_list_of_strings():
    # file_name present in a list of strings
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": ["a.txt", "b.txt", "file.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_not_used_in_list_of_strings():
    # file_name not present in any string in the list
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": ["a.txt", "b.txt", "c.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_used_in_multiple_nodes():
    # file_name present in the second node
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "a.txt"}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "input2": {"value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

# ----------------------------- Edge Test Cases -----------------------------

def test_file_name_is_empty_string():
    # file_name is empty string, should match any string (since "" in x is always True)
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "anything"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "")

def test_node_without_data_key():
    # node missing 'data' key
    flow = {
        "nodes": [
            {}
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_node_data_without_node_key():
    # node['data'] missing 'node' key
    flow = {
        "nodes": [
            {"data": {}}
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_node_data_node_without_template_key():
    # node['data']['node'] missing 'template' key
    flow = {
        "nodes": [
            {"data": {"node": {}}}
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_template_field_not_dict():
    # template field is not a dict
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": "not_a_dict"
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_template_field_dict_without_value_key():
    # template field dict missing 'value'
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"not_value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_value_is_list_with_non_string_items():
    # value is a list with non-string items
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": ["a.txt", 123, None, {"x": 1}]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_value_is_non_string_non_list():
    # value is an int, not a string or list
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": 123}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_is_substring_of_value():
    # file_name is a substring of a longer string
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "somefile.txt.backup"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_is_only_part_of_value():
    # file_name is only part of a string in a list
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": ["xxxfile.txtyyy", "zzz"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_multiple_fields_in_template():
    # file_name present in one of multiple fields
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "a.txt"},
                            "input2": {"value": "file.txt"},
                            "input3": {"value": ["b.txt", "c.txt"]}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_file_name_in_multiple_nodes_and_fields():
    # file_name present in multiple places, should short-circuit on first found
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input1": {"value": "not_it"}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "input2": {"value": ["nope", "file.txt", "another"]}
                        }
                    }
                }
            },
            {
                "data": {
                    "node": {
                        "template": {
                            "input3": {"value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

# ----------------------------- Large Scale Test Cases -----------------------------

def test_large_number_of_nodes_no_match():
    # Large flow, file_name not present
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": f"file_{i}.txt"}
                        }
                    }
                }
            } for i in range(1000)
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_nodes_with_match_at_end():
    # Large flow, file_name present in last node
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": f"file_{i}.txt"}
                        }
                    }
                }
            } for i in range(999)
        ] + [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": "file.txt"}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_nodes_with_match_in_middle():
    # Large flow, file_name present in the middle
    nodes = [
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": f"file_{i}.txt"}
                    }
                }
            }
        } for i in range(500)
    ]
    nodes.append(
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": "file.txt"}
                    }
                }
            }
        }
    )
    nodes += [
        {
            "data": {
                "node": {
                    "template": {
                        "input": {"value": f"file_{i}.txt"}
                    }
                }
            }
        } for i in range(501, 1000)
    ]
    flow = {"nodes": nodes}
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_number_of_fields_per_node():
    # Each node has many fields, only one has the file_name
    template = {f"input{i}": {"value": f"file_{i}.txt"} for i in range(50)}
    template["special"] = {"value": "file.txt"}
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": template
                    }
                }
            }
        ] * 20
    }
    codeflash_output = is_file_used(flow, "file.txt")

def test_large_list_of_values():
    # value is a large list, file_name present near the end
    value_list = [f"file_{i}.txt" for i in range(999)] + ["file.txt"]
    flow = {
        "nodes": [
            {
                "data": {
                    "node": {
                        "template": {
                            "input": {"value": value_list}
                        }
                    }
                }
            }
        ]
    }
    codeflash_output = is_file_used(flow, "file.txt")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally git merge codeflash/optimize-pr10819-2025-12-01T20.38.21

Click to see suggested changes
Suggested change
for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
nodes = flow_data["nodes"]
for node in nodes:
data = node.get("data")
if not data:
continue
node_data = data.get("node")
if not node_data:
continue
template = node_data.get("template")
if not template:
continue
# Extract values once to local variable, avoids .values() call per loop iteration
for field in template.values():
# Fast path: skip non-dict fields up front
if type(field) is dict:
value = field.get("value")
if isinstance(value, str):
if file_name in value:
return True
elif isinstance(value, list):
# For small lists (which is common), fast in-place iteration is enough
# List comprehension is not helpful for early return, so keep as loop

Copy link
Copy Markdown
Collaborator

@Adam-Aghili Adam-Aghili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes generally make sense, Alittle confused how to test this manually. Rabbit brought up some good points

"get_top_level_vertices",
# Functions
"has_api_terms",
"is_file_used",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricofurtado this seems relevent

Comment on lines +218 to +235
def is_file_used(flow_data: dict | None, file_name: str) -> bool:
"""Check if a file is used in the flow."""
if not flow_data or "nodes" not in flow_data:
return False

for node in flow_data["nodes"]:
node_data = node.get("data", {}).get("node", {})
template = node_data.get("template", {})
for field in template.values():
if isinstance(field, dict) and "value" in field:
value = field["value"]
if isinstance(value, str) and file_name in value:
return True
if isinstance(value, list):
for item in value:
if isinstance(item, str) and file_name in item:
return True
return False
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up on this

Comment on lines +33 to +35
MAX_FILENAME_LENGTH = 255
# Maximum reasonable extension length
MAX_EXTENSION_LENGTH = 20
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these the max lengths?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community Pull Request from an external contributor enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants