
feat: add s3 file storage implementation #10526

Merged
jordanrfrazier merged 106 commits into main from s3-file-store
Nov 25, 2025

Conversation

Collaborator

@jordanrfrazier jordanrfrazier commented Nov 6, 2025

Adds S3 as a possible backing file storage service. Includes fixes to the usage of the database session scope.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added S3 storage backend support with async file operations and streaming capabilities for enterprise deployments.
    • Introduced database migration advisory locks to safely support multiple concurrent instances sharing a database.
  • Improvements

    • Enhanced file operation error handling with improved error messages and recovery.
    • Optimized database transaction handling for improved reliability and concurrency.
    • Streamlined profile picture management with filesystem-based access.
  • Documentation

    • Added configuration documentation for PostgreSQL advisory lock namespace in multi-instance setups.

Contributor

coderabbitai Bot commented Nov 6, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces comprehensive changes to database transaction management, storage abstraction, and session lifecycle handling. It replaces explicit commit patterns with flush-based operations and centralizes session management through session_scope. Additionally, it adds S3 storage support alongside local storage, updates file operations with enhanced error handling and cleanup logic, and introduces PostgreSQL migration locking via advisory locks. Dependencies are updated for AWS support.

Changes

Cohort / File(s) Summary
Documentation
docs/docs/Develop/memory.mdx
Added documentation for LANGFLOW_MIGRATION_LOCK_NAMESPACE environment variable describing optional PostgreSQL advisory lock namespace for migrations.
Dependencies
pyproject.toml
Updated langchain-aws from exact version (==0.2.33) to range (>=0.2.33,<1.0.0); added new aioboto3 dependency (>=15.2.0,<16.0.0).
CLI & Migration Management
src/backend/base/langflow/__main__.py, src/backend/base/langflow/alembic/env.py
Modified API key CLI flow to return result directly; enhanced PostgreSQL migrations with advisory lock mechanism (namespace-derived or default), lock timeout configuration, and conditional prepared statement disabling.
Session Management Refactoring
src/backend/base/langflow/api/utils/core.py, src/backend/base/langflow/services/deps.py, src/backend/base/langflow/services/auth/utils.py
Replaced get_session dependency with session_scope for auto-commit; introduced DbSessionReadOnly using session_scope_readonly; added deprecation path for get_session with NotImplementedError.
Database Service Architecture
src/backend/base/langflow/services/database/service.py, src/backend/base/langflow/services/database/models/.../*
Renamed with_session to _with_session; introduced async_session_maker; replaced commit patterns with flush in multiple CRUD operations (api_key, message, user, folder utilities).
API Endpoints – Transaction Handling
src/backend/base/langflow/api/v1/chat.py, src/backend/base/langflow/api/v1/flows.py, src/backend/base/langflow/api/v1/projects.py, src/backend/base/langflow/api/v1/users.py, src/backend/base/langflow/api/v1/mcp_projects.py, src/backend/base/langflow/api/v1/monitor.py
Systematically replaced session.commit() with session.flush(); converted ORM objects to read schemas (FlowRead, FolderRead) within active sessions to prevent detached instance errors; adjusted return types to expose read models.
Storage Services – Local Implementation
src/backend/base/langflow/services/storage/local.py, src/lfx/src/lfx/services/storage/local.py
Refactored LocalStorageService to delegate to lfx backend; added resolve_component_path, get_file_stream, get_file_size methods; switched to async file I/O operations.
Storage Services – S3 Implementation
src/backend/base/langflow/services/storage/s3.py, src/lfx/src/lfx/services/storage/s3.py
Replaced synchronous boto3 with async aioboto3; refactored method signatures to use flow_id/file_name; added error mapping for FileNotFoundError/PermissionError; implemented get_file_stream, get_file_size, resolve_component_path; added configuration validation and tagging support.
Storage Services – Base Architecture
src/backend/base/langflow/services/storage/service.py, src/backend/base/langflow/services/storage/__init__.py, src/lfx/src/lfx/services/storage/service.py
Updated StorageService to inherit from both Service and LfxStorageService; changed constructor to require session_service and settings_service; expanded abstract interface with build_full_path, resolve_component_path, get_file_stream, get_file_size, teardown; exported storage implementations.
File Upload & Download – Enhanced Error Handling
src/backend/base/langflow/api/v1/files.py, src/backend/base/langflow/api/v2/files.py
Replaced storage-backed profile pictures with filesystem references; added FileNotFoundError (404) and PermissionError (403) handling; introduced file size retrieval post-save; added transactional cleanup on DB insert failure; enhanced delete paths with storage failure logging.
Storage-Aware Data Components
src/lfx/src/lfx/base/data/base_file.py, src/lfx/src/lfx/base/data/utils.py, src/lfx/src/lfx/base/data/storage_utils.py, src/lfx/src/lfx/components/data/{csv_to_data, file, json_to_data}.py
Added storage_utils.py with parse_storage_path, read_file_bytes, read_file_text, get_file_size, file_exists; introduced async parsing (parse_text_file_to_data_async, read_docx_file_async, parse_pdf_to_text_async); updated data components for S3-aware file reading with lazy validation.
LangChain Utilities – Storage-Aware Processing
src/lfx/src/lfx/components/langchain_utilities/{csv_agent, json_agent}.py, src/lfx/src/lfx/components/twelvelabs/{split_video, video_file}.py, src/lfx/src/lfx/components/vectorstores/local_db.py, src/lfx/src/lfx/graph/vertex/param_handler.py
Added local path resolution for S3 files with temporary download and cleanup; added S3 guards raising ValueError for incompatible components (video processing, local vector stores); replaced path construction with resolve_component_path.
Memory & Task Management
src/backend/base/langflow/memory.py, src/backend/base/langflow/services/task/temp_flow_cleanup.py, src/backend/base/langflow/services/flow/flow_runner.py, src/backend/base/langflow/services/variable/service.py
Removed batch commits; introduced aadd_messagetables with retry logic for CancelledError; replaced commits with flushes in variable/flow operations; removed explicit rollbacks.
Setup & Initialization
src/backend/base/langflow/initial_setup/setup.py, src/backend/base/langflow/services/utils.py, src/backend/base/langflow/main.py
Replaced commits with flushes in folder/flow/project creation; adjusted folder assignment logic; moved tempfile/FileLock imports to module scope; wrapped cleanup tasks in session_scope context.
Settings Configuration
src/lfx/src/lfx/services/settings/base.py
Added migration_lock_namespace, object_storage_bucket_name, object_storage_prefix, object_storage_tags fields; extended sqlite_pragmas with busy_timeout.
lfx Session Management
src/lfx/src/lfx/services/deps.py
Implemented proper session_scope with commit/rollback semantics; added session_scope_readonly for read-only operations; deprecated get_session with NotImplementedError; added InvalidRequestError handling.
Backend Tests – Session Refactoring
src/backend/tests/conftest.py, src/backend/tests/unit/test_database.py, src/backend/tests/unit/api/v1/{test_files, test_mcp_projects}.py, src/backend/tests/integration/components/mcp/test_mcp_superuser_flow.py
Replaced per-session db_manager usage with session_scope; removed explicit commit calls; updated session_scope import source from langflow to lfx where applicable.
Backend Tests – S3 Integration
src/backend/tests/unit/api/test_s3_endpoints.py, src/backend/tests/unit/api/v2/test_files.py, src/backend/tests/unit/components/data/{test_s3_components, test_s3_uploader_component}.py, src/backend/tests/unit/services/storage/{test_local_storage_service, test_s3_storage_service}.py
Added comprehensive test suites for S3 storage operations including streaming downloads, uploads, deletions, error handling, and metadata retrieval; added fixtures for AWS credential validation; updated S3 uploader test decorator.
Frontend File Handling
src/frontend/src/components/core/parameterRenderComponent/components/inputFileComponent/index.tsx, src/frontend/src/controllers/API/queries/file-management/use-post-upload-file.ts, src/frontend/src/hooks/files/use-upload-file.ts
Added null-safety guards for file arrays; added array validation in cache updates; enhanced error message extraction from response data and fallback handling.
lfx Component & Utility Tests
src/lfx/tests/unit/base/data/test_storage_utils.py, src/lfx/tests/unit/components/langchain_utilities/{test_csv_agent, test_json_agent}.py, src/backend/tests/unit/components/processing/test_save_file_component.py
Added comprehensive test coverage for storage utilities, CSV/JSON agents with local/S3 paths and temp file cleanup, and SaveFileComponent refactoring with async mocks.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant SessionScope as session_scope()
    participant DBService
    participant DB as Database

    Client->>API: Request (write operation)
    API->>SessionScope: Enter context
    SessionScope->>DBService: _with_session()
    DBService->>DB: Begin transaction
    DB-->>SessionScope: AsyncSession
    SessionScope-->>API: Yield session

    API->>API: Perform operation
    API->>SessionScope: flush() instead of commit()
    SessionScope->>DB: Send pending changes (no commit)
    DB-->>SessionScope: Changes staged

    API->>API: Additional operations in same transaction
    
    alt Success
        API->>SessionScope: Exit context normally
        SessionScope->>DB: COMMIT
        DB-->>SessionScope: Transaction committed
    else Exception
        API->>SessionScope: Exit context (exception)
        SessionScope->>DB: ROLLBACK
        DB-->>SessionScope: Transaction rolled back
    end
    
    SessionScope-->>Client: Response
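
The commit-on-exit and rollback-on-exception behavior in the diagram above can be sketched without a database. This is a minimal illustration, not the PR's implementation: `FakeSession` and `session_scope_sketch` are hypothetical names, and the fake session only records which methods were called.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for an AsyncSession; records the calls made on it."""

    def __init__(self):
        self.calls = []

    async def flush(self):
        self.calls.append("flush")

    async def commit(self):
        self.calls.append("commit")

    async def rollback(self):
        self.calls.append("rollback")


@asynccontextmanager
async def session_scope_sketch(session):
    """Commit once on a clean exit, roll back if the body raises."""
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise


async def write_ok(session):
    async with session_scope_sketch(session) as s:
        # In a real session this sends pending SQL without committing.
        await s.flush()


async def write_fail(session):
    async with session_scope_sketch(session) as s:
        await s.flush()
        raise RuntimeError("simulated failure")


ok_session, bad_session = FakeSession(), FakeSession()
asyncio.run(write_ok(ok_session))
try:
    asyncio.run(write_fail(bad_session))
except RuntimeError:
    pass

print(ok_session.calls)
print(bad_session.calls)
```

Endpoints that only call `flush()` inside such a scope leave the single commit or rollback decision to the context manager, which is the pattern the walkthrough describes.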
sequenceDiagram
    participant Component as File Component
    participant Settings as Settings Service
    participant Storage as Storage Service
    participant LocalFS as Local Filesystem
    participant S3 as S3 Storage

    Component->>Settings: get_settings_service()
    Settings-->>Component: storage_type (s3 or local)

    alt storage_type == "s3"
        Component->>Storage: resolve_component_path(path)
        Storage->>S3: Parse S3 key
        S3-->>Storage: flow_id/file_name
        Storage-->>Component: S3 key
        
        Component->>Storage: read_file_bytes(s3_path)
        Storage->>S3: GetObject request
        S3-->>Storage: File bytes
        Storage-->>Component: File bytes
    else storage_type == "local"
        Component->>Storage: resolve_component_path(path)
        Storage->>LocalFS: Resolve path
        LocalFS-->>Storage: Local path
        Storage-->>Component: Local path
        
        Component->>Storage: read_file_bytes(local_path)
        Storage->>LocalFS: Read file
        LocalFS-->>Storage: File bytes
        Storage-->>Component: File bytes
    end

    Component->>Component: Process file content
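
The branch on `storage_type` in the diagram above could look roughly like the following sketch. All names here (`split_storage_path`, `InMemoryObjectStore`, `read_bytes`) are hypothetical, and an in-memory dictionary stands in for S3; the PR's own helpers in `storage_utils.py` may be shaped differently.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def split_storage_path(path: str) -> Tuple[str, str]:
    """Split 'flow_id/file_name' on the first slash (hypothetical helper)."""
    flow_id, _, file_name = path.partition("/")
    if not flow_id or not file_name:
        raise ValueError(f"expected 'flow_id/file_name', got {path!r}")
    return flow_id, file_name


class InMemoryObjectStore:
    """Toy stand-in for an object store: maps (flow_id, file_name) to bytes."""

    def __init__(self, objects):
        self._objects = objects

    async def get_file(self, flow_id: str, file_name: str) -> bytes:
        try:
            return self._objects[(flow_id, file_name)]
        except KeyError:
            raise FileNotFoundError(f"{flow_id}/{file_name}") from None


async def read_bytes(path: str, storage_type: str,
                     store: Optional[InMemoryObjectStore] = None) -> bytes:
    if storage_type == "s3":
        flow_id, file_name = split_storage_path(path)
        return await store.get_file(flow_id, file_name)
    # Local storage: run the blocking read in a worker thread.
    return await asyncio.to_thread(Path(path).read_bytes)


store = InMemoryObjectStore({("flow-1", "notes.txt"): b"from object store"})
s3_bytes = asyncio.run(read_bytes("flow-1/notes.txt", "s3", store))

with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp) / "notes.txt"
    local_file.write_bytes(b"from disk")
    local_bytes = asyncio.run(read_bytes(str(local_file), "local"))
```

Keeping the branch inside one reader function means components call a single entry point regardless of the configured backend.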
sequenceDiagram
    participant Migration as Alembic Migration
    participant Env as env.py
    participant Settings as LANGFLOW Settings
    participant Lock as PostgreSQL Advisory Lock
    participant DB as PostgreSQL DB

    Migration->>Env: run_migrations(PostgreSQL)
    Env->>Settings: Check LANGFLOW_MIGRATION_LOCK_NAMESPACE
    
    alt LANGFLOW_MIGRATION_LOCK_NAMESPACE set
        Settings-->>Env: namespace value
        Env->>Env: Compute lock_key = hash(namespace)
    else LANGFLOW_MIGRATION_LOCK_NAMESPACE not set
        Settings-->>Env: None
        Env->>Env: Use default lock_key
    end
    
    Env->>DB: SET lock_timeout to 180s
    DB-->>Env: Configured
    
    Env->>Lock: SELECT pg_advisory_xact_lock(lock_key)
    Lock->>DB: Acquire lock
    DB-->>Lock: Lock acquired
    Lock-->>Env: Lock held
    
    Env->>Migration: Proceed with migration
    Migration->>DB: Run SQL changes
    DB-->>Migration: Changes applied
    
    Migration-->>Env: Complete
    Env->>Lock: Release lock (transaction end)
    Lock->>DB: Lock released
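
The "Compute lock_key = hash(namespace)" step above needs a key that is identical on every instance and fits PostgreSQL's signed 64-bit advisory lock argument. The thread does not show the PR's hashing scheme, so the following is only one possible sketch: the function name, the SHA-256 choice, and the default key value are all assumptions.

```python
import hashlib
from typing import Optional

# Hypothetical default; the real default used by the PR is not shown in this thread.
DEFAULT_LOCK_KEY = 112233


def advisory_lock_key(namespace: Optional[str]) -> int:
    """Derive a stable signed 64-bit integer from a namespace string."""
    if not namespace:
        return DEFAULT_LOCK_KEY
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    # Take the first 8 bytes and interpret them as a signed big-endian integer,
    # which always lands inside the bigint range.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


staging_key = advisory_lock_key("staging")
production_key = advisory_lock_key("production")
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashing is randomized per process by default, so two instances would otherwise compute different keys for the same namespace.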

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas requiring extra attention during review:

  • Session & Transaction Management Refactoring (src/backend/base/langflow/services/database/service.py, src/lfx/src/lfx/services/deps.py): Critical architectural change from per-session management to centralized session_scope. Requires verification that all transaction boundaries are preserved, especially around flush vs. commit semantics and rollback error handling (InvalidRequestError).

  • Storage Service Abstraction (src/backend/base/langflow/services/storage/s3.py, src/lfx/src/lfx/services/storage/service.py): Complete redesign of storage layer with S3 support. Verify async aioboto3 integration, error mapping correctness, resource cleanup (client context managers), and file size/streaming behavior.

  • Database API Endpoint Changes (src/backend/base/langflow/api/v1/flows.py, src/backend/base/langflow/api/v1/projects.py): Multiple endpoints return different types (ORM models converted to read schemas). Verify in-session conversion prevents detached instance errors, flush points are correct, and all transaction paths maintain consistency.

  • File Operations & Cleanup (src/backend/base/langflow/api/v2/files.py): Enhanced error handling with new exception types and transactional cleanup on failure. Verify orphaned file cleanup logic, error propagation, and that all paths handle both success and failure cleanup correctly.

  • Storage-Aware Components (src/lfx/src/lfx/components/data/{base_file, utils, storage_utils}.py, src/lfx/src/lfx/components/langchain_utilities/*): New conditional logic based on storage_type with lazy validation for S3 and eager validation for local. Verify path resolution, temporary file cleanup on S3, and guard conditions preventing incompatible operations.

  • PostgreSQL Migration Locking (src/backend/base/langflow/alembic/env.py): New advisory lock mechanism with namespace hashing. Verify lock_key computation, timeout configuration, and that locking cannot deadlock when migrations run concurrently.

  • Dependency Version Changes (pyproject.toml): Updated langchain-aws to range constraint and added aioboto3. Verify compatibility across versions and that no breaking changes in boto3/aioboto3 API affect code paths.

Possibly related PRs

Suggested labels

database, storage, session-management, s3-integration, transactions, lfx-integration

Suggested reviewers

  • erichare
  • ogabrielluiz

Pre-merge checks and finishing touches

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 4 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Coverage For New Implementations | ❌ Error | PR adds 7 new test files for S3/storage implementations, but tests contain critical bugs preventing execution and LFX S3StorageService lacks dedicated test coverage. | Fix test implementation bugs: correct module patches in test_storage_utils.py, fix exception types in test_s3_endpoints.py, use AsyncMock for async functions, remove incorrect awaits, replace asyncio.run() with run_until_complete(), add dedicated tests for lfx/services/storage/s3.py. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.36% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Test Quality And Coverage | ⚠️ Warning | Test suite contains critical issues: AsyncMock not used for async functions, incorrect patch targets (langflow vs lfx), wrong error type expectations (FileNotFoundError vs HTTPException), and await on synchronous methods. | Replace MagicMock with AsyncMock for async functions, fix patch targets to lfx.base.data.storage_utils, remove incorrect get_file_size assertions, expect HTTPException 404 instead of FileNotFoundError, remove await from sync methods. |
| Test File Naming And Structure | ⚠️ Warning | Test files contain multiple structural violations: incorrect mock patch module paths (langflow.base instead of lfx.base), missing AsyncMock imports for async patches, improper await calls on sync methods, and incorrect exception type expectations in assertions. | Correct all patch decorators to target proper module paths (lfx.base.data.storage_utils), add AsyncMock imports and use new_callable=AsyncMock for async patches, remove incorrect await calls on sync methods, and update test expectations to match actual FastAPI exception behavior. |
| Excessive Mock Usage Warning | ⚠️ Warning | Test files exhibit excessive mock usage that obscures actual behavior verification. Unit tests mock core logic and dependencies rather than testing real interactions, while using incorrect patch targets and inconsistent async mock patterns. | Replace mocks of core logic with real objects (temp directories, actual components). Fix patch targets to correct module paths (lfx.base not langflow.base). Use AsyncMock for async functions consistently. Remove await calls on synchronous APIs. Reserve mocks for external dependencies only. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add s3 file storage implementation' clearly and concisely summarizes the main change: adding S3 as a file storage backend. |
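
Several of the failed checks above come down to mocking an async method with `MagicMock` instead of `AsyncMock`. The difference can be reproduced with the standard library alone; `fetch_size` below is a hypothetical function under test, not code from the PR.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_size(storage, flow_id, file_name):
    """Hypothetical code under test: awaits an async storage method."""
    return await storage.get_file_size(flow_id, file_name)


# AsyncMock: calling the mock returns an awaitable that resolves to return_value.
good_storage = MagicMock()
good_storage.get_file_size = AsyncMock(return_value=1024)
size = asyncio.run(fetch_size(good_storage, "flow-1", "a.txt"))

# MagicMock: calling the mock returns the plain int, which cannot be awaited.
bad_storage = MagicMock()
bad_storage.get_file_size = MagicMock(return_value=1024)
try:
    asyncio.run(fetch_size(bad_storage, "flow-1", "a.txt"))
    await_error = None
except TypeError as exc:
    await_error = exc
```

`AsyncMock` also provides await-aware assertions such as `assert_awaited_once_with`, which check that the coroutine was actually awaited rather than merely called.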


@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Nov 6, 2025
Contributor

github-actions Bot commented Nov 6, 2025

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 15% | 15.3% (4188/27372) | 8.5% (1778/20915) | 9.6% (579/6029) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 1638 | 0 💤 | 0 ❌ | 0 🔥 | 20.961s ⏱️ |



codecov Bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 47.97297% with 385 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.38%. Comparing base (348b1b8) to head (02f0ee1).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/backend/base/langflow/services/storage/s3.py | 11.27% | 118 Missing ⚠️ |
| src/lfx/src/lfx/base/data/utils.py | 17.64% | 56 Missing ⚠️ |
| src/backend/base/langflow/api/v2/files.py | 69.11% | 42 Missing ⚠️ |
| src/lfx/src/lfx/base/data/base_file.py | 32.65% | 28 Missing and 5 partials ⚠️ |
| src/lfx/src/lfx/services/storage/local.py | 21.05% | 30 Missing ⚠️ |
| src/lfx/src/lfx/services/deps.py | 48.71% | 19 Missing and 1 partial ⚠️ |
| ...rComponent/components/inputFileComponent/index.tsx | 0.00% | 12 Missing ⚠️ |
| src/lfx/src/lfx/base/data/storage_utils.py | 85.29% | 5 Missing and 5 partials ⚠️ |
| ...rc/backend/base/langflow/services/storage/local.py | 78.04% | 9 Missing ⚠️ |
| ...backend/base/langflow/services/variable/service.py | 58.82% | 7 Missing ⚠️ |

... and 18 more

❌ Your project status has failed because the head coverage (40.17%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main   #10526      +/-   ##
==========================================
+ Coverage   32.10%   32.38%   +0.28%     
==========================================
  Files        1364     1366       +2     
  Lines       62528    62943     +415     
  Branches     9266     9304      +38     
==========================================
+ Hits        20077    20387     +310     
- Misses      41437    41531      +94     
- Partials     1014     1025      +11     
| Flag | Coverage Δ |
| --- | --- |
| backend | 51.08% <49.75%> (+0.51%) ⬆️ |
| frontend | 14.14% <10.52%> (-0.01%) ⬇️ |
| lfx | 40.17% <47.86%> (+0.19%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/backend/base/langflow/api/utils/core.py | 62.44% <100.00%> (+0.17%) ⬆️ |
| src/backend/base/langflow/api/v1/chat.py | 39.58% <ø> (+0.41%) ⬆️ |
| src/backend/base/langflow/api/v1/files.py | 66.14% <ø> (ø) |
| src/backend/base/langflow/api/v1/users.py | 66.66% <100.00%> (+1.60%) ⬆️ |
| src/backend/base/langflow/helpers/user.py | 65.00% <100.00%> (ø) |
| src/backend/base/langflow/main.py | 65.99% <100.00%> (+8.06%) ⬆️ |
| src/backend/base/langflow/services/auth/utils.py | 57.14% <100.00%> (-0.82%) ⬇️ |
| ...ase/langflow/services/database/models/user/crud.py | 82.60% <100.00%> (+1.75%) ⬆️ |
| .../backend/base/langflow/services/storage/service.py | 78.78% <100.00%> (ø) |
| ...d/base/langflow/services/task/temp_flow_cleanup.py | 61.53% <ø> (+1.53%) ⬆️ |

... and 31 more

... and 5 files with indirect coverage changes


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
src/backend/base/langflow/api/v1/files.py (1)

128-151: Critical: Path traversal vulnerability allows unauthorized file access.

The function constructs file paths by directly concatenating user-provided folder_name and file_name without validation (lines 138-139). An attacker can use path traversal sequences like ../ to access files outside the intended profile_pictures directory.

Apply this diff to validate that the resolved path stays within the intended directory:

     try:
         # Profile pictures are in the package installation directory
         package_dir = Path(__file__).parent.parent.parent / "initial_setup" / "profile_pictures"
         file_path = package_dir / folder_name / file_name
+        
+        # Prevent path traversal by ensuring resolved path is within package_dir
+        if not file_path.resolve().is_relative_to(package_dir.resolve()):
+            raise HTTPException(status_code=400, detail="Invalid file path")
 
         if not file_path.exists():
             raise HTTPException(status_code=404, detail="Profile picture not found")
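
The containment check suggested in the diff above can be exercised in isolation. This is a standalone sketch, assuming Python 3.9+ for `Path.is_relative_to`; `safe_join` is a hypothetical helper and raises `ValueError` where the endpoint would raise an `HTTPException`.

```python
import tempfile
from pathlib import Path


def safe_join(base_dir: Path, *parts: str) -> Path:
    """Join parts under base_dir and reject results that resolve outside it."""
    base = base_dir.resolve()
    candidate = base.joinpath(*parts).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError("path escapes base directory")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    pictures = Path(tmp) / "profile_pictures"
    (pictures / "People").mkdir(parents=True)

    # A normal folder/file pair stays inside the base directory.
    allowed = safe_join(pictures, "People", "avatar.svg")
    inside = allowed.is_relative_to(pictures.resolve())

    # A '..' segment resolves to a path above the base directory and is rejected.
    try:
        safe_join(pictures, "..", "secrets.txt")
        traversal_rejected = False
    except ValueError:
        traversal_rejected = True
```

Resolving both sides before comparing matters: a purely textual prefix check would miss `..` segments and symlinked directories.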

Additional issue: Redundant exception handling.

The HTTPException raised at line 142 is caught and re-wrapped at lines 149-150. Consider letting HTTPExceptions propagate naturally.

Apply this diff to fix the redundant error handling:

-    except Exception as e:
+    except HTTPException:
+        raise
+    except Exception as e:
         raise HTTPException(status_code=500, detail=str(e)) from e
src/backend/base/langflow/services/database/models/folder/utils.py (1)

26-34: Fix folder reassignment filter

Flow.folder_id is None is evaluated immediately by Python, returning False, so the UPDATE never matches any rows. As a result, flows with a null folder_id are never migrated into the default folder, defeating the purpose of this helper. Switch to SQLAlchemy's .is_(None) (or equivalent) so the predicate is rendered in SQL. Suggested fix:

-        await session.exec(
-            update(Flow)
-            .where(
-                and_(
-                    Flow.folder_id is None,
-                    Flow.user_id == user_id,
-                )
-            )
-            .values(folder_id=folder.id)
-        )
+        await session.exec(
+            update(Flow)
+            .where(
+                and_(
+                    Flow.folder_id.is_(None),
+                    Flow.user_id == user_id,
+                )
+            )
+            .values(folder_id=folder.id)
+        )
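
The reason `is None` never reaches the database can be shown with a toy column class. `ToyColumn` below is a hypothetical stand-in that builds SQL-like strings; it is not SQLAlchemy, but it relies on the same language rule: `==` can be overloaded, whereas `is` always compares object identity.

```python
class ToyColumn:
    """Minimal stand-in for an ORM column that renders comparisons as text."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Operator overloading lets `col == value` return an expression.
        return f"{self.name} = {other!r}"

    def is_(self, other):
        # An explicit method is needed for NULL checks, rendered as IS NULL.
        if other is None:
            return f"{self.name} IS NULL"
        return f"{self.name} IS {other!r}"


folder_id = ToyColumn("flow.folder_id")

identity_check = folder_id is None   # evaluated by Python immediately
method_check = folder_id.is_(None)   # expression built by the column object

print(identity_check)
print(method_check)
```

Because the identity test collapses to a plain boolean before the query is built, the filter carries no column reference at all, which matches the behavior described in the comment above.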
src/backend/tests/unit/api/v1/test_files.py (1)

51-72: Add flush before refresh to ensure object persistence.

At line 63, session.refresh(user) is called immediately after session.add(user) without an intervening flush. The refresh() operation requires the object to be persistent in the database, but without a flush, the INSERT may not have been sent yet.

Apply this diff:

         else:
             session.add(user)
+            await session.flush()
             await session.refresh(user)

The same issue exists in the files_flow fixture at lines 84-85. Apply a similar fix:

     async with session_scope() as session:
         session.add(flow)
+        await session.flush()
         await session.refresh(flow)
src/backend/base/langflow/services/deps.py (1)

152-170: Missing import for asynccontextmanager decorator.

Line 152 uses the @asynccontextmanager decorator, but asynccontextmanager is not imported. This will cause a NameError at runtime.

Add the missing import at the top of the file:

 from __future__ import annotations
 
+from contextlib import asynccontextmanager
 from typing import TYPE_CHECKING
src/backend/tests/conftest.py (1)

515-535: Give the superuser fixture its own username.

Both active_user and active_super_user now create the username "activeuser". If a test requests both fixtures, active_super_user will pick up the user created by active_user, leaving is_superuser=False and breaking the test. Please use a distinct username (or otherwise ensure is_superuser is flipped before yielding) so the superuser fixture always returns a real superuser.

src/backend/base/langflow/services/auth/utils.py (2)

145-150: Fix dependency injection to return AsyncSession.

Depends(session_scope) is pulling in lfx.services.deps.session_scope, which is an @asynccontextmanager. FastAPI will inject the context manager object itself, not an AsyncSession, so every call to get_current_user will pass a _AsyncGeneratorContextManager into downstream CRUD helpers and crash at runtime. Import the backend wrapper that yields the session (e.g., from langflow.services.deps import session_scope) or expose a generator-style function here.

-from lfx.services.deps import session_scope
+from langflow.services.deps import session_scope

586-591: Same dependency bug for MCP path.

The MCP handler still injects the async context manager instead of an AsyncSession, so MCP auth will fail the moment it hits the database. Align this import with the backend wrapper (see comment above) so the dependency actually yields a session.
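
The two comments above turn on the difference between calling an `@asynccontextmanager` function and iterating an async generator that wraps it. The sketch below shows only the Python-level behavior with hypothetical names (`FakeSession`, `session_dependency`); how a given framework treats each callable is as described in the review, not verified here.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for an AsyncSession."""


@asynccontextmanager
async def session_scope():
    yield FakeSession()


async def session_dependency():
    """Generator-style wrapper that yields the session itself."""
    async with session_scope() as session:
        yield session


async def main():
    # Calling the decorated function only builds a context manager object.
    direct = session_scope()

    # Advancing the wrapper generator enters the scope and yields the session.
    gen = session_dependency()
    injected = await gen.__anext__()
    await gen.aclose()  # runs the scope's exit logic
    return direct, injected


direct, injected = asyncio.run(main())
print(type(direct).__name__)
print(type(injected).__name__)
```

Code that receives `direct` would have to enter it with `async with` before using a session, which is why passing it straight into CRUD helpers fails.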

src/backend/base/langflow/api/v2/files.py (1)

467-501: Fix download streaming flow and HTTP error mapping.

Calling await storage_service.get_file(...) before the streaming branch means we always pull the entire payload into memory—even when the backend supports chunked streaming—so large S3 downloads still read the whole file twice. A missing file now bubbles up as an unhandled FileNotFoundError, which our outer except Exception converts into a 500 instead of the expected 404. The fallback path also does await byte_stream_generator(...); that function is an async generator, so the await raises TypeError whenever we hit a storage backend without true streaming support.

Please restructure this block so we only invoke get_file when it’s actually needed (content return or non-streaming fallback), convert FileNotFoundError/PermissionError into 404/403 immediately, and drop the extra await on byte_stream_generator. For example:

-        # Get file stream
-        file_stream = await storage_service.get_file(flow_id=str(current_user.id), file_name=file_name)
-
-        if file_stream is None:
-            raise HTTPException(status_code=404, detail="File stream not available")
-
-        # If return_content is True, read the file content and return it
-        if return_content:
-            # For content return, get the full file
-            file_content = await storage_service.get_file(flow_id=str(current_user.id), file_name=file_name)
-            if file_content is None:
-                raise HTTPException(status_code=404, detail="File not found")
-            return await read_file_content(file_content, decode=True)
-
-        # For streaming, use the appropriate method based on storage type
-        if hasattr(storage_service, "get_file_stream"):
-            # S3 storage - use streaming method
-            file_stream = storage_service.get_file_stream(flow_id=str(current_user.id), file_name=file_name)
-            byte_stream = file_stream
-        else:
-            # Local storage - get file and convert to stream
-            file_content = await storage_service.get_file(flow_id=str(current_user.id), file_name=file_name)
-            if file_content is None:
-                raise HTTPException(status_code=404, detail="File not found")
-            byte_stream = await byte_stream_generator(file_content)
+        try:
+            if return_content:
+                file_content = await storage_service.get_file(flow_id=str(current_user.id), file_name=file_name)
+                return await read_file_content(file_content, decode=True)
+
+            if callable(getattr(storage_service, "get_file_stream", None)):
+                byte_stream = storage_service.get_file_stream(flow_id=str(current_user.id), file_name=file_name)
+            else:
+                file_content = await storage_service.get_file(flow_id=str(current_user.id), file_name=file_name)
+                byte_stream = byte_stream_generator(file_content)
+        except FileNotFoundError as exc:
+            raise HTTPException(status_code=404, detail=str(exc)) from exc
+        except PermissionError as exc:
+            raise HTTPException(status_code=403, detail=str(exc)) from exc

This keeps streaming efficient, preserves memory, and returns the correct status codes for missing or forbidden files. After this change, the tests asserting a 404 can pass without flakiness.
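
The `TypeError` mentioned above is easy to reproduce: calling an async generator function returns an async generator object, which supports `async for` but not `await`. The chunking generator below reuses the name `byte_stream_generator` for readability, but its signature and chunk size are assumptions.

```python
import asyncio


async def byte_stream_generator(data: bytes, chunk_size: int = 4):
    """Yield data in fixed-size chunks (illustrative signature)."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


async def collect(stream):
    """Consume an async iterator into a list."""
    return [chunk async for chunk in stream]


async def main():
    # Correct: pass the generator object on and iterate it.
    chunks = await collect(byte_stream_generator(b"0123456789"))

    # Incorrect: awaiting the generator object itself raises TypeError.
    try:
        await byte_stream_generator(b"0123456789")
        await_error = None
    except TypeError as exc:
        await_error = exc
    return chunks, await_error


chunks, await_error = asyncio.run(main())
print(chunks)
```

A streaming response only needs the generator object, so dropping the `await` is enough for the fallback path to work.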

src/backend/tests/unit/api/v2/test_files.py (1)

1-10: Import the modules you use.

json.dumps and uuid.uuid4 are referenced later in this file, but json and uuid are never imported. As soon as the S3 fixtures run, pytest will raise NameError. Please add the missing imports near the top:

-import asyncio
-import os
+import asyncio
+import json
+import os
 import tempfile
 from contextlib import suppress
 from pathlib import Path
+import uuid
src/backend/base/langflow/api/v1/flows.py (1)

270-275: Paginated branch still returns ORM models

When get_all is False we still return the raw Page of Flow ORM instances. That bypasses the new FlowRead.model_validate(..., from_attributes=True) conversion, so the paginated response reintroduces the same detached-instance/serialization problems and no longer matches the declared Page[FlowRead] response model. Please convert the paginated items to FlowRead before returning.

-            return await apaginate(session, stmt, params=params)
+            page = await apaginate(session, stmt, params=params)
+            flow_reads = [FlowRead.model_validate(flow, from_attributes=True) for flow in page.items]
+            page_dict = page.model_dump()
+            page_dict["items"] = flow_reads
+            return Page(**page_dict)
🧹 Nitpick comments (6)
src/backend/base/langflow/api/v1/files.py (1)

153-174: Consider defensive path validation.

While the folder names are currently hardcoded ("People", "Space"), applying the same path validation pattern as recommended for download_profile_picture would provide defense-in-depth against future modifications or directory structure issues.

Additionally, the same redundant exception handling issue exists here. Consider letting HTTPExceptions propagate:

     try:
         # Profile pictures are in the package installation directory
         package_dir = Path(__file__).parent.parent.parent / "initial_setup" / "profile_pictures"
+        
+        # Validate package_dir exists within expected bounds
+        if not package_dir.exists():
+            raise HTTPException(status_code=500, detail="Profile pictures directory not found")
 
         people_path = package_dir / "People"
         space_path = package_dir / "Space"
 
         # List files from package directory - these are bundled with the container
         people = [f.name for f in people_path.iterdir() if f.is_file()] if people_path.exists() else []
         space = [f.name for f in space_path.iterdir() if f.is_file()] if space_path.exists() else []
-    except Exception as e:
+    except HTTPException:
+        raise
+    except Exception as e:
         raise HTTPException(status_code=500, detail=str(e)) from e
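The defensive path validation mentioned above could look roughly like the following. This is a sketch only: resolve_inside is a hypothetical helper that does not appear in the PR, and it assumes Python 3.9 or later for Path.is_relative_to. It resolves the joined path and rejects anything that ends up outside the base directory.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, *parts: str) -> Path:
    """Join parts onto base_dir and reject results that escape base_dir."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = base.joinpath(*parts).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"Path escapes base directory: {candidate}")
    return candidate
```

An endpoint could catch the ValueError and translate it into a 4xx response, so traversal attempts are not reported as server errors.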
src/frontend/src/hooks/files/use-upload-file.ts (1)

60-65: Preserve the original error for better debugging.

The error message normalization pattern is good and provides a consistent user-facing message. However, re-throwing a new Error discards the original stack trace and error properties, which can complicate debugging.

Apply this diff to preserve the original error context:

-    } catch (e: any) {
+    } catch (e: unknown) {
       const errorMessage =
-        e?.response?.data?.detail ||
-        e?.message ||
+        (e as any)?.response?.data?.detail ||
+        (e as Error)?.message ||
         "An error occurred while uploading the file";
-      throw new Error(errorMessage);
+      throw new Error(errorMessage, { cause: e });
     }

This change:

  • Uses unknown for better type safety (explicit casting required)
  • Preserves the original error as cause, maintaining stack traces and debugging context
  • Keeps the normalized message for user-facing error handling
src/frontend/src/controllers/API/queries/file-management/use-post-upload-file.ts (1)

45-53: Make type annotation consistent.

Line 45 uses any for the old parameter, while lines 33 and 60 use FileType[]. For consistency and better type safety, consider using FileType[] here as well, or use unknown if you need to handle potentially non-array values before the guard clause.

Apply this diff to make the type annotation consistent:

-                queryClient.setQueryData(["useGetFilesV2"], (old: any) => {
+                queryClient.setQueryData(["useGetFilesV2"], (old: FileType[]) => {
src/backend/base/langflow/services/database/models/user/crud.py (1)

30-32: Consider removing commented-out code.

The commented-out username uniqueness check appears to be dead code. If the validation is no longer needed (perhaps enforced at the database level or elsewhere), consider removing these lines to improve code clarity.

src/lfx/src/lfx/components/twelvelabs/video_file.py (1)

146-149: Consider extracting the duplicated error message.

The same error message appears in both process_files (line 103) and load_files (line 148). Consider defining it as a class constant to ensure consistency and easier maintenance.

+    S3_NOT_SUPPORTED_MSG = "Video processing is not supported in S3 mode. Use local storage mode to enable this component."
+
     def process_files(self, file_list: list[BaseFileComponent.BaseFile]) -> list[BaseFileComponent.BaseFile]:
         """Process video files."""
         settings = get_settings_service().settings
         if settings.storage_type == "s3":
-            msg = "Video processing is not supported in S3 mode. Use local storage mode to enable this component."
-            raise ValueError(msg)
+            raise ValueError(self.S3_NOT_SUPPORTED_MSG)
src/backend/base/langflow/api/utils/core.py (1)

13-13: Well-designed session dependency separation.

The introduction of DbSessionReadOnly alongside DbSession is a good architectural improvement. The read-only variant avoids unnecessary locking and commit overhead for read operations.

Consider using DbSessionReadOnly in read-heavy endpoints (like GET operations) to improve performance and reduce database contention.

Also applies to: 38-41

Comment thread src/backend/base/langflow/api/v1/users.py
Comment thread src/backend/base/langflow/services/storage/local.py
Comment thread src/backend/tests/unit/api/test_s3_endpoints.py Outdated
Comment thread src/backend/tests/unit/api/test_s3_endpoints.py
Comment thread src/backend/tests/unit/components/data/test_s3_components.py Outdated
Comment thread src/lfx/src/lfx/components/files_and_knowledge/file.py Outdated
Comment thread src/lfx/src/lfx/components/data_source/json_to_data.py Outdated
Comment thread src/lfx/src/lfx/services/storage/local.py Outdated
Comment thread src/lfx/tests/unit/base/data/test_storage_utils.py
Comment thread src/lfx/tests/unit/components/langchain_utilities/test_csv_agent.py
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Nov 9, 2025
@github-actions
Contributor

github-actions Bot commented Nov 24, 2025

Build successful! ✅
Deploying docs draft.
Deploy successful! View draft

@jordanrfrazier
Collaborator Author

@CodeRabbit review PR for issues

@coderabbitai
Contributor

coderabbitai Bot commented Nov 25, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for potential session.flush() without proper transaction context
rg -n 'await session\.flush\(\)' --type=py -B 3 -A 3 src/backend/base/langflow/ | head -150

Length of output: 12504


🏁 Script executed:

#!/bin/bash
# Look for S3 client usage patterns and potential resource leaks
rg -n '_get_client|get_client' --type=py -A 10 src/backend/base/langflow/services/storage/s3.py

Length of output: 4067


🏁 Script executed:

#!/bin/bash
# Check for missing error handling in file operations
rg -n 'async def (save_file|get_file|delete_file)' --type=py -A 20 src/backend/base/langflow/services/storage/s3.py

Length of output: 4315


🏁 Script executed:

#!/bin/bash
# Look for potential issues with database session scope usage in endpoints
rg -n 'DbSession|session_scope' --type=py -C 5 src/backend/base/langflow/api/v1/flows.py | head -100

Length of output: 3004


🏁 Script executed:

#!/bin/bash
# Check if there are any missing await keywords
rg -n 'session\.(commit|flush|rollback|refresh)\(\)' --type=py src/backend/base/langflow/ | grep -v await | head -50

Length of output: 119


🏁 Script executed:

#!/bin/bash
# Check S3 initialization and bucket validation
rg -n 'def __init__|bucket_name|aioboto3' --type=py -C 5 src/backend/base/langflow/services/storage/s3.py

Length of output: 7368


🏁 Script executed:

#!/bin/bash
# Look for potential race conditions in session management
rg -n 'session_scope|_with_session' --type=py -C 3 src/backend/base/langflow/services/database/service.py

Length of output: 1729


🏁 Script executed:

#!/bin/bash
# Check for proper cleanup in async generators
ast-grep --pattern 'async def $FUNC($$$) -> AsyncIterator[$$$]:
  $$$'

Length of output: 23996


@jordanrfrazier I've completed a comprehensive review of the PR. Here are my findings:

🔴 CRITICAL BUG: S3 Streaming Connection Lifecycle Issue

Location: src/backend/base/langflow/services/storage/s3.py:185-223 - get_file_stream() method

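Problem: The generator yields chunks from inside async with self._get_client(), so the S3 client has to stay open for as long as the consumer keeps iterating. The concern flagged here is that the client is closed before the generator finishes yielding all chunks. Note that an async with block inside an async generator normally stays open while the generator is suspended at yield and exits only when the generator finishes, raises, or is closed, so the actual behavior depends on how _get_client() manages the client and should be confirmed with a test: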

async def get_file_stream(self, flow_id: str, file_name: str, chunk_size: int = 8192) -> AsyncIterator[bytes]:
    key = self.build_full_path(flow_id, file_name)
    
    try:
        async with self._get_client() as s3_client:  # ❌ Context exits after first yield!
            response = await s3_client.get_object(Bucket=self.bucket_name, Key=key)
            body = response["Body"]
            
            try:
                async for chunk in body.iter_chunks(chunk_size):
                    yield chunk  # Function returns here, context manager exits
            finally:
                if hasattr(body, "close"):
                    with contextlib.suppress(Exception):
                        await body.close()

When FastAPI's StreamingResponse consumes this generator, the S3 connection will already be closed, causing incomplete downloads or connection errors.

Fix: Restructure to keep the aioboto3 session alive during the entire generator lifecycle:

@asynccontextmanager
async def _get_streaming_client(self):
    """Get S3 client that stays alive for streaming operations."""
    async with self.session.client("s3") as client:
        yield client

async def get_file_stream(self, flow_id: str, file_name: str, chunk_size: int = 8192) -> AsyncIterator[bytes]:
    key = self.build_full_path(flow_id, file_name)
    
    try:
        # Keep client alive for entire generator lifecycle
        async with self._get_streaming_client() as s3_client:
            response = await s3_client.get_object(Bucket=self.bucket_name, Key=key)
            body = response["Body"]
            
            try:
                # Now the context stays open during iteration
                async for chunk in body.iter_chunks(chunk_size):
                    yield chunk
            finally:
                if hasattr(body, "close"):
                    with contextlib.suppress(Exception):
                        await body.close()
                        
        logger.debug(f"File {file_name} streamed successfully from S3: s3://{self.bucket_name}/{key}")
        
    except Exception as e:
        if hasattr(e, "response") and e.response.get("Error", {}).get("Code") == "NoSuchKey":
            await logger.awarning(f"File {file_name} not found in S3 flow {flow_id}")
            msg = f"File not found: {file_name}"
            raise FileNotFoundError(msg) from e
        
        logger.exception(f"Error streaming file {file_name} from S3 in flow {flow_id}")
        raise

This ensures the aioboto3 session and S3 client remain active throughout the entire streaming operation.
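The lifecycle claim above can be checked in isolation before changing the storage service. The sketch below is an assumption-laden stand-in, not aioboto3: fake_client is a hypothetical context manager that only records when it is entered and exited, and stream_chunks is a hypothetical generator shaped like get_file_stream. Under standard Python semantics, an async with block inside an async generator stays open while the generator is suspended at yield, so the recorded order is a way to test the claim rather than a confirmation of it. The sketch says nothing about how the real _get_client() behaves.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def fake_client() -> AsyncIterator[str]:
    """Hypothetical stand-in for an S3 client context; records enter and exit."""
    events.append("enter")
    try:
        yield "client"
    finally:
        events.append("exit")


async def stream_chunks() -> AsyncIterator[bytes]:
    """Yield chunks from inside the client context, like get_file_stream."""
    async with fake_client():
        for chunk in (b"a", b"b", b"c"):
            events.append(f"yield {chunk.decode()}")
            yield chunk


async def consume() -> list[str]:
    """Drain the stream and return the recorded event order."""
    events.clear()
    async for chunk in stream_chunks():
        events.append(f"got {chunk.decode()}")
    return list(events)


if __name__ == "__main__":
    print(asyncio.run(consume()))
```

In this sketch the exit event is recorded only after the last chunk has been consumed. If a consumer stops iterating early, the context exits when the generator is closed, which is worth covering in a separate test against the real client.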


🟡 Medium Priority Issues

1. Migration Lock Collision Risk

Location: src/backend/base/langflow/alembic/env.py:49-50

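The lock key is derived from the first 16 hex characters of a SHA-256 digest, which is 64 bits, and then reduced modulo 2**63 - 1:

lock_key = int(hashlib.sha256(namespace.encode()).hexdigest()[:16], 16) % (2**63 - 1)

Recommendation: Reading the first 8 bytes of the raw digest directly is arguably clearer, although 16 hex characters already encode the same 8 bytes, so the resulting key and its collision probability are unchanged: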

hash_bytes = hashlib.sha256(namespace.encode()).digest()[:8]
lock_key = int.from_bytes(hash_bytes, 'big') % (2**63 - 1)
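The two derivations can be compared directly. In the sketch below the function names and the sample namespaces are hypothetical; the expressions themselves are copied from the snippets above. Because 16 hex characters encode exactly 8 bytes, both functions should return the same integer for any namespace, and the modulo keeps the key within a signed 64-bit range.

```python
import hashlib

MAX_KEY = 2**63 - 1  # upper bound used by both snippets above


def lock_key_from_hex(namespace: str) -> int:
    """Key from the first 16 hex characters of the SHA-256 digest."""
    return int(hashlib.sha256(namespace.encode()).hexdigest()[:16], 16) % MAX_KEY


def lock_key_from_bytes(namespace: str) -> int:
    """Key from the first 8 raw bytes of the SHA-256 digest."""
    digest = hashlib.sha256(namespace.encode()).digest()[:8]
    return int.from_bytes(digest, "big") % MAX_KEY


if __name__ == "__main__":
    for name in ("langflow", "tenant-a", ""):  # hypothetical namespaces
        print(name, lock_key_from_hex(name) == lock_key_from_bytes(name))
```

Choosing between the two is therefore a readability question, not a change in how keys are distributed.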

2. Missing Exception Handling in Read-Only Sessions

Location: src/lfx/src/lfx/services/deps.py:59-64 - session_scope_readonly()

No rollback on exceptions, which could leave locks held:

Recommendation:

from sqlalchemy.exc import InvalidRequestError  # new import needed by the rollback guard below

@asynccontextmanager
async def session_scope_readonly() -> AsyncGenerator[AsyncSession, None]:
    db_service = get_db_service()
    async with db_service._with_session() as session:
        try:
            yield session
        except Exception:
            # Rollback even for read-only to release locks
            if session.is_active:
                try:
                    await session.rollback()
                except InvalidRequestError:
                    # Session is already in a state where rollback is not possible
                    pass
            raise

🟢 Low Priority Observations

Temporary File Cleanup in Components

Location: src/lfx/src/lfx/components/langchain_utilities/csv_agent.py, json_agent.py

Components create temp files for S3 with try/finally cleanup, but process crashes will orphan files. Consider using tempfile.TemporaryDirectory or a background cleanup task.
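One way to scope the temporary copy is a small context manager around tempfile.TemporaryDirectory. This is a sketch, assuming the component already holds the file's bytes: materialize_locally and the s3-scratch- prefix are hypothetical names that do not appear in the PR. It removes the directory on normal exit and on exceptions, but a hard process kill would still skip cleanup, so it narrows the orphaned-file window rather than closing it.

```python
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def materialize_locally(data: bytes, file_name: str) -> Iterator[Path]:
    """Write bytes to a scratch directory and remove the directory on exit."""
    with tempfile.TemporaryDirectory(prefix="s3-scratch-") as tmp_dir:
        # Keep only the final path component so the file stays inside tmp_dir.
        path = Path(tmp_dir) / Path(file_name).name
        path.write_bytes(data)
        yield path
```

A path-based consumer such as a CSV reader can be called inside the with block, and the scratch directory is gone once the block exits.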


Verified Working Correctly

  • ✅ Session scope commit/flush patterns - properly wrapped in session_scope() dependency
  • ✅ Database transaction management - flush() writes pending changes inside the open transaction, and session_scope() commits on exit
  • ✅ S3 error handling for standard operations - proper exception mapping
  • ✅ File upload cleanup on DB failure - deletes uploaded file if DB insert fails
  • ✅ No missing await keywords detected
  • ✅ S3 initialization and configuration validation

Summary

The S3 streaming bug is critical and will cause production issues with file downloads from S3. This must be fixed before merging. The session management refactoring looks solid - the commit → flush pattern correctly relies on session_scope() to handle final commits.

The medium-priority issues are improvements but not blockers. Overall, the PR implements a significant architectural change well, aside from the streaming lifecycle bug.
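The commit → flush pattern described in the summary can be sketched with an in-memory stand-in. Everything here is hypothetical: FakeSession replaces the real database session, and this session_scope takes the session as an argument purely to keep the example self-contained, which differs from the project's own helper. The point is only the division of labor: handlers flush, and the scope commits once on clean exit or rolls back on error.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical in-memory stand-in for an async database session."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def flush(self) -> None:
        self.calls.append("flush")

    async def commit(self) -> None:
        self.calls.append("commit")

    async def rollback(self) -> None:
        self.calls.append("rollback")


@asynccontextmanager
async def session_scope(session: FakeSession) -> AsyncIterator[FakeSession]:
    """Commit once on clean exit; roll back and re-raise on any error."""
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise


async def create_item(session: FakeSession, fail: bool = False) -> None:
    """Handler-style code: flush inside the scope, never commit directly."""
    async with session_scope(session) as db:
        await db.flush()  # pending changes are sent; the transaction stays open
        if fail:
            raise RuntimeError("simulated failure after flush")
```

With this shape, a failure after flush() leaves nothing committed, because the only commit happens when the scope exits cleanly.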


🧠 Learnings used
Learnt from: edwinjosechittilappilly
Repo: langflow-ai/langflow PR: 8504
File: src/backend/base/langflow/initial_setup/starter_projects/Image Sentiment Analysis.json:391-393
Timestamp: 2025-06-12T15:25:01.072Z
Learning: The repository owner prefers CodeRabbit not to review or comment on JSON files because they are autogenerated.


Labels

enhancement New feature or request lgtm This PR has been approved by a maintainer

4 participants