
fix: implement weak references for vertex and custom component to pre…#11299

Open
ricofurtado wants to merge 2 commits into main from lfoss-2044-out-of-memory-after-multiple-executions-of-custom-component

Conversation

@ricofurtado
Contributor

@ricofurtado ricofurtado commented Jan 14, 2026

This pull request introduces improvements to memory management and resource cleanup in the custom_component and vertex modules, and enhances the handling of asynchronous cleanup tasks in the graph module. The main focus is on using weak references to avoid reference cycles and potential memory leaks, and ensuring async tasks are properly cleaned up when objects are destroyed.
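The cycle-breaking idea described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Graph` and `Vertex` classes here are simplified stand-ins, `add_vertex` is a hypothetical helper, and the error message is an assumption.

```python
import gc
import weakref


class Graph:
    """Simplified stand-in: owns its vertices through strong references."""

    def __init__(self) -> None:
        self.vertices: list["Vertex"] = []

    def add_vertex(self) -> "Vertex":  # hypothetical helper for this sketch
        vertex = Vertex(self)
        self.vertices.append(vertex)  # Graph -> Vertex stays strong
        return vertex


class Vertex:
    """Simplified stand-in: points back at its graph through a weak reference."""

    def __init__(self, graph: Graph) -> None:
        # Vertex -> Graph is weak, so the pair no longer forms a reference cycle.
        self._graph_ref = weakref.ref(graph)

    @property
    def graph(self) -> Graph:
        graph = self._graph_ref()  # returns None once the Graph is gone
        if graph is None:
            raise ValueError("Graph has been garbage collected")
        return graph


graph = Graph()
vertex = graph.add_vertex()
assert vertex.graph is graph  # the public attribute still behaves as before

del graph  # drop the last strong reference to the Graph
gc.collect()  # only matters on interpreters without reference counting
try:
    vertex.graph
    graph_alive = True
except ValueError:
    graph_alive = False
```

With a strong back-reference, the Graph would stay reachable from its own vertices until the cyclic garbage collector ran; with the weak one it can be released as soon as its owner lets go of it.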

Memory management improvements:
The applied fixes reduce Langflow backend RSS growth by approximately 30–50% per run, eliminating the primary leak-like behaviour while leaving allocator-driven VSZ growth unchanged.

Tests were run on a 140-component flow with 25 invocations of a 12-component subflow.
The flow was provided by Shuhei Kochi.


Langflow Backend Memory Regression Report

Original Code vs Fixed Code

Summary

We ran repeated Langflow flow executions via the REST API and measured backend process memory usage after each run. Compared to the original implementation, the fixed version shows a significant reduction in RSS (resident memory) growth per run, indicating reduced memory retention across repeated executions.

While VSZ (virtual memory) growth remains largely unchanged, this behavior is consistent with allocator and arena behavior rather than a true memory leak.


Test Methodology

  • Flows executed repeatedly using Langflow REST API (stream=false) within a single backend process

  • Memory collected from /proc/<PID>/status:

    • RSS (MiB): VmRSS (primary signal for real memory retention)

    • VSZ (MiB): VmSize (secondary signal; allocator-related)

  • Each test consists of:

    1. Warm-up runs (lazy loading, cache initialization)

    2. A 15-run steady-state block using a fresh baseline_before

  • Both successful (200) and failed (500) runs were included to observe failure-path behavior

  • Memory deltas are computed relative to the baseline_before for each block
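The memory collection step in the methodology above could look roughly like this. It is a sketch assuming a Linux host; the function names are hypothetical and the sample text is invented for illustration, not captured from the test runs.

```python
import re
from pathlib import Path


def parse_status_mib(status_text: str) -> dict[str, float]:
    """Extract VmRSS and VmSize (reported in kB) and convert them to MiB."""
    values: dict[str, float] = {}
    for field in ("VmRSS", "VmSize"):
        match = re.search(rf"^{field}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        if match:
            values[field] = int(match.group(1)) / 1024
    return values


def read_process_memory(pid: int) -> dict[str, float]:
    """Read the current RSS and VSZ of a live process (Linux only)."""
    return parse_status_mib(Path(f"/proc/{pid}/status").read_text())


# Invented sample in the layout of /proc/<PID>/status.
sample = "Name:\tpython\nVmSize:\t 2097152 kB\nVmRSS:\t 1048576 kB\n"
memory = parse_status_mib(sample)
```

Calling `read_process_memory(pid)` once for `baseline_before` and again after each run gives the per-block deltas by simple subtraction.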


Data Sets Analyzed

Original Code

  • Run 1

    • Baseline RSS: 1327 MiB

    • RSS after run 15: 1649 MiB

  • Run 2

    • Baseline RSS: 1314 MiB

    • RSS after run 15: 1715 MiB

Fixed Code

  • Run 1

    • Baseline RSS: 1374 MiB

    • RSS after run 15: 1564 MiB

  • Run 2

    • Baseline RSS: 1335 MiB

    • RSS after run 15: 1617 MiB


Results

1️⃣ RSS Growth Reduction (Primary Improvement)

RSS represents actual resident memory and is the strongest indicator of leak-like behavior.

Original Code

  • Run 1: +322 MiB over 15 runs → ~21.5 MiB/run

  • Run 2: +401 MiB over 15 runs → ~26.7 MiB/run

Fixed Code

  • Run 1: +190 MiB over 15 runs → ~12.7 MiB/run

  • Run 2: +282 MiB over 15 runs → ~18.8 MiB/run

Net Improvement

  • RSS growth reduced from ~21–27 MiB/run to ~13–19 MiB/run

  • Equivalent to approximately 30–50% reduction in resident memory growth per run

This strongly indicates that the fix reduced retained memory across repeated flow executions.
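The per-run figures above can be reproduced from the baseline and run-15 values in the data sets. The sketch below pairs fixed Run N with original Run N, which is an assumption; the report does not say how runs were paired.

```python
RUNS_PER_BLOCK = 15

# (baseline RSS, RSS after run 15) in MiB, copied from the data sets above.
rss = {
    "original": {"Run 1": (1327, 1649), "Run 2": (1314, 1715)},
    "fixed": {"Run 1": (1374, 1564), "Run 2": (1335, 1617)},
}

# Total RSS growth over each 15-run block.
growth = {
    version: {run: after - before for run, (before, after) in runs.items()}
    for version, runs in rss.items()
}
# Average growth per run, rounded to one decimal place.
per_run = {
    version: {run: round(delta / RUNS_PER_BLOCK, 1) for run, delta in runs.items()}
    for version, runs in growth.items()
}
# Like-for-like reduction: fixed Run N compared with original Run N (assumed pairing).
reduction_pct = {
    run: round(100 * (1 - growth["fixed"][run] / growth["original"][run]), 1)
    for run in growth["original"]
}
```

Under that pairing the reduction works out to about 41% for Run 1 and about 30% for Run 2.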


2️⃣ VSZ Behavior (Largely Unchanged)

VSZ growth remained similar across original and fixed versions:

  • Original: ~+5419 to +5476 MiB over 15 runs
    (~361–365 MiB/run)

  • Fixed: ~+5285 to +5386 MiB over 15 runs
    (~352–359 MiB/run)

Interpretation:

  • VSZ growth is consistent with Python allocator behavior (arenas, fragmentation, mmap)

  • The fact that RSS improved while VSZ did not suggests:

    • fewer live objects retained

    • allocator behavior remains the dominant VSZ factor

  • This does not indicate an active memory leak


3️⃣ Failure-Path Behavior (HTTP 500)

Failures were intentionally included in the run sets:

| Version | Run | Failures |
| -- | -- | -- |
| Original | Run 1 | 4 / 15 |
| Original | Run 2 | 4 / 15 |
| Fixed | Run 1 | 5 / 15 |
| Fixed | Run 2 | 4 / 15 |

Even with intermittent 500 responses:

  • The fixed version consistently shows lower RSS growth

  • Failure paths no longer cause disproportionate memory retention

This suggests improved cleanup even under error conditions.


Key Conclusions

  1. Resident memory retention per run improved significantly

    • ~30–50% reduction in RSS growth rate

  2. VSZ growth remains allocator-driven

    • No evidence of additional live-object leaks

  3. Failure paths are safer

    • Errors no longer amplify memory growth trends


Recommendations / Next Steps

To further improve stability and observability:

  1. Differentiate allocator behavior from true leaks

    • Add tracemalloc snapshots (post-warmup vs post-run 15)

    • Diff top allocation sites

  2. Harden failure paths

    • Review resource cleanup on 500 responses

    • Ensure request-scoped objects and buffers are released

  3. Standardize benchmarks

    • Fixed warm-up count

    • Fixed delay between runs

    • Consistent baseline capture
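For the first recommendation (tracemalloc snapshots and a diff of top allocation sites), a minimal sketch using only the standard library might look like this. The `top_growth` helper is hypothetical and the list of byte strings merely stands in for memory retained across 15 flow executions.

```python
import tracemalloc


def top_growth(
    before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 10
) -> list[tracemalloc.StatisticDiff]:
    """Return the allocation sites whose traced size grew the most."""
    stats = after.compare_to(before, "lineno")  # largest differences first
    return [stat for stat in stats if stat.size_diff > 0][:limit]


tracemalloc.start(25)  # keep up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()  # e.g. right after the warm-up runs

retained = [bytes(64 * 1024) for _ in range(15)]  # stand-in for 15 flow executions

final = tracemalloc.take_snapshot()  # e.g. after run 15
tracemalloc.stop()

growth_sites = top_growth(baseline, final)
for stat in growth_sites:
    print(stat)  # file, line, size delta, and allocation-count delta
```

Because tracemalloc only sees allocations made through Python's allocators, sites that grow here point at retained Python objects, while VSZ growth that has no counterpart in the diff is more likely allocator or native-library behavior.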


One-Line Takeaway

The applied fixes reduce Langflow backend RSS growth by approximately 30–50% per run, eliminating the primary leak-like behavior while leaving allocator-driven VSZ growth unchanged.


@ricofurtado ricofurtado requested a review from Copilot January 14, 2026 16:16
@github-actions github-actions Bot added the community Pull Request from an external contributor label Jan 14, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jan 14, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces weak references to break circular dependencies between CustomComponent and Vertex, and between Vertex and Graph. The Graph class gains async cleanup logic via a new __del__ destructor and cleanup_async_tasks method to properly await and terminate pending traces during garbage collection.

Changes

| Cohort / File(s) | Summary |
| -- | -- |
| Weak reference introduction: src/lfx/src/lfx/custom/custom_component/custom_component.py, src/lfx/src/lfx/graph/vertex/base.py | Replaced direct object storage with weak references: CustomComponent._vertex now uses _vertex_ref internally; Vertex.graph now uses _graph_ref internally. Public interface preserved via properties that dereference on access. Added weakref import. |
| Async cleanup and destructor: src/lfx/src/lfx/graph/graph/base.py | Added cleanup_async_tasks() async method to flush and await end-trace tasks with timeout handling. Added __del__ destructor to trigger cleanup when event loop is running. Modified _end_all_traces_async() to prefer get_running_loop() and improve event loop scheduling logic. |

Sequence Diagram

sequenceDiagram
    participant GC as Garbage Collector
    participant Graph
    participant EventLoop as Event Loop
    participant TaskSet as end_trace_tasks

    GC->>Graph: __del__() called
    activate Graph
    Graph->>EventLoop: asyncio.get_running_loop()
    alt Loop running
        EventLoop-->>Graph: loop instance
        Graph->>Graph: cleanup_async_tasks()
        activate Graph
        Graph->>TaskSet: flush all tasks
        Graph->>TaskSet: await with timeout
        TaskSet-->>Graph: completion
        Graph->>TaskSet: clear()
        deactivate Graph
    else No running loop
        EventLoop-->>Graph: RuntimeError
        Graph->>TaskSet: cancel pending tasks
    end
    deactivate Graph

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 5
❌ Failed checks (1 error, 4 warnings)
| Check name | Status | Explanation | Resolution |
| -- | -- | -- | -- |
| Test Coverage For New Implementations | ❌ Error | PR introduces weak reference implementations and async cleanup functionality without corresponding test coverage. | Add test files verifying weak reference dereferencing, async cleanup cancellation, and destructor behavior across event loop states. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Quality And Coverage | ⚠️ Warning | Pull request introduces memory-management features but lacks test coverage for weak references, async cleanup, garbage collection, and serialization behavior. | Add comprehensive pytest tests covering weak reference behavior, async cleanup_async_tasks() execution, Graph/Vertex __del__ cleanup, serialization with weak references, error consistency, and integration tests for memory leak fixes. |
| Test File Naming And Structure | ⚠️ Warning | Pull request introduces significant memory management changes without test coverage for weak references, garbage collection, async cleanup, or destructor logic. | Add comprehensive tests for weak reference behavior, garbage collection scenarios, cleanup_async_tasks() method, __del__ destructor, and event loop handling logic in existing test files. |
| Excessive Mock Usage Warning | ⚠️ Warning | Existing test files use mocks appropriately for external dependencies. However, PR introduces significant changes lacking dedicated test coverage for weak references, async cleanup, destructor logic, and serialization bugs. | Add tests for weak reference property behavior, async cleanup (cleanup_async_tasks), destructor logic (__del__), and fix critical serialization bug in Vertex's __getstate__ method for _graph_ref weak reference. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| -- | -- | -- |
| Title check | ✅ Passed | The title is partially related to the changeset. It mentions 'weak references for vertex and custom component' which are real changes, but it's truncated (ends with 'pre…') and omits the significant async cleanup and Graph weak reference changes that address memory leak concerns. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


Contributor

Copilot AI left a comment

Pull request overview

This PR improves memory management by introducing weak references to prevent circular reference chains between Graph, Vertex, and CustomComponent objects. It also adds proper cleanup for asynchronous trace tasks to prevent resource leaks during object destruction.

Changes:

  • Introduced weak references in Vertex and CustomComponent to break circular dependencies with Graph and Vertex respectively
  • Added async task cleanup mechanism in Graph.__del__ to handle pending trace tasks during garbage collection
  • Enhanced _end_all_traces_async to gracefully handle event loop availability

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| -- | -- |
| src/lfx/src/lfx/graph/vertex/base.py | Converted strong Graph reference to weak reference with property-based access |
| src/lfx/src/lfx/graph/graph/base.py | Added async task cleanup in destructor and improved event loop handling in trace methods |
| src/lfx/src/lfx/custom/custom_component/custom_component.py | Converted strong Vertex reference to weak reference with property getter/setter |


Comment thread src/lfx/src/lfx/graph/graph/base.py Outdated
        """Cleanup when graph is destroyed."""
        if self._end_trace_tasks:
            try:
                loop = asyncio.get_event_loop()
Copilot AI Jan 14, 2026

Using asyncio.get_event_loop() in __del__ is deprecated in Python 3.10+ and can cause issues. Consider using asyncio.get_running_loop() with a try/except, or check if a loop exists before attempting to use it. The behavior of get_event_loop() in destructors is unreliable.
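The suggested `get_running_loop()` with try/except can be sketched in isolation like this. The helper name `running_loop_or_none` is hypothetical; the point is only that `asyncio.get_running_loop()` raises `RuntimeError` when no loop is running in the current thread, which makes the check explicit.

```python
import asyncio
from typing import Optional


def running_loop_or_none() -> Optional[asyncio.AbstractEventLoop]:
    """Return the loop running in this thread, or None in synchronous code."""
    try:
        return asyncio.get_running_loop()
    except RuntimeError:  # raised when no event loop is running in this thread
        return None


async def inside_coroutine() -> bool:
    return running_loop_or_none() is not None


seen_outside = running_loop_or_none() is not None  # plain synchronous context
seen_inside = asyncio.run(inside_coroutine())  # called from a running loop
```

Unlike `get_event_loop()`, this never creates a loop as a side effect, so a caller can decide for itself what to do when none is running.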

Contributor Author

Changing...

Comment thread src/lfx/src/lfx/graph/graph/base.py Outdated
Comment on lines +669 to +696
    def __del__(self):
        """Cleanup when graph is destroyed."""
        if self._end_trace_tasks:
            try:
                loop = asyncio.get_event_loop()
                if loop.is_running():
                    cleanup_task = loop.create_task(self.cleanup_async_tasks())

                    def _log_cleanup(t: asyncio.Task) -> None:
                        try:
                            exc = t.exception()
                        except Exception:  # noqa: BLE001
                            logger.exception("Cleanup task inspection failed")
                            return
                        if exc:
                            logger.warning("Cleanup task failed during __del__", exc_info=exc)
                        else:
                            logger.info("Cleanup task completed cleanly")

                    cleanup_task.add_done_callback(_log_cleanup)
                elif loop.is_closed():
                    for task in list(self._end_trace_tasks):
                        task.cancel()
                    self._end_trace_tasks.clear()
                else:
                    loop.run_until_complete(self.cleanup_async_tasks())
            except Exception:  # noqa: BLE001
                logger.exception("Error cleaning up async tasks")
Copilot AI Jan 14, 2026

The __del__ method creates a task at line 675 but doesn't await it when the loop is running. This fire-and-forget task may not complete before the object is fully destroyed, potentially leaving resources uncleaned. Additionally, calling run_until_complete at line 694 when the loop is not running could block if the loop has been stopped but not yet closed, which is problematic in a destructor.

Contributor Author

It is unlikely to happen, but I'll refactor it...

Contributor Author

On second thought: Graph keeps _end_trace_tasks (a set of asyncio Tasks), and each task has a done callback bound to self._end_trace_tasks.discard, while the task coroutine itself closes over self. That creates a cycle Graph -> set -> task -> callback -> Graph. Because Graph defines __del__, any such cycle is non-collectable, so a graph cached in ChatService.set_cache or kept alive by pending trace tasks will leak until the process exits.

Contributor Author

Python’s garbage collection improvements suggest that, in the long term, we should avoid using __del__, as it can interfere with optimal GC behavior and potentially introduce subtle issues. I’m removing it because the performance gains were negligible.

@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Jan 14, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/lfx/src/lfx/custom/custom_component/custom_component.py`:
- Around line 118-125: The _vertex property currently returns None when the
weakref target was garbage-collected, causing downstream AttributeError in
callers like the graph property; change the _vertex getter to mirror
Vertex.graph by detecting when self._vertex_ref is set but self._vertex_ref() is
None and raise ValueError("Vertex has been garbage collected") instead of
returning None, leaving the setter (def _vertex) as-is so callers that
intentionally check `if self._vertex:` still work when no ref exists.
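A sketch of the getter described in this prompt, with both classes reduced to stand-ins. It assumes `_vertex_ref` starts out as `None`; the error message is the one quoted above.

```python
import gc
import weakref
from typing import Optional


class Vertex:
    """Stand-in for the real Vertex class."""


class CustomComponent:
    """Stand-in showing only the weak-reference property."""

    def __init__(self) -> None:
        self._vertex_ref: Optional[weakref.ReferenceType] = None

    @property
    def _vertex(self) -> Optional[Vertex]:
        if self._vertex_ref is None:
            return None  # never assigned, so `if self._vertex:` keeps working
        vertex = self._vertex_ref()
        if vertex is None:
            # A reference was set, but its target no longer exists.
            raise ValueError("Vertex has been garbage collected")
        return vertex

    @_vertex.setter
    def _vertex(self, value: Optional[Vertex]) -> None:
        self._vertex_ref = weakref.ref(value) if value is not None else None


component = CustomComponent()
unset_is_none = component._vertex is None

vertex = Vertex()
component._vertex = vertex
resolved_same = component._vertex is vertex

del vertex  # drop the only strong reference
gc.collect()  # only matters on interpreters without reference counting
try:
    component._vertex
    raised = False
except ValueError:
    raised = True
```

Separating "never set" from "set but dead" means callers get the explicit `ValueError` at the property instead of an `AttributeError` on `None` somewhere further down.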
🧹 Nitpick comments (1)
src/lfx/src/lfx/graph/graph/base.py (1)

657-667: Consider reducing log verbosity for normal operation.

INFO-level logging for every graph cleanup may produce excessive logs in production. Consider DEBUG level, or make logging conditional on whether there were actually tasks to clean up.

♻️ Suggested adjustment
 async def cleanup_async_tasks(self) -> None:
     """Ensure all async tasks complete before cleanup."""
-    logger.info(f"Cleaning up async tasks for run_id: {self._run_id}")
+    logger.debug(f"Cleaning up async tasks for run_id: {self._run_id}")
     if self._end_trace_tasks:
-        logger.info(f"Waiting for {len(self._end_trace_tasks)} trace tasks to complete")
+        logger.debug(f"Waiting for {len(self._end_trace_tasks)} trace tasks to complete")
         try:
             await asyncio.wait_for(asyncio.gather(*self._end_trace_tasks, return_exceptions=True), timeout=30.0)
         except asyncio.TimeoutError:
             logger.warning("Some trace tasks did not complete within timeout")
         finally:
             self._end_trace_tasks.clear()
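The wait-with-timeout pattern in the suggestion above can be exercised on its own. In this sketch the function takes the task set and timeout as parameters (an assumption made for testability; the method in the diff uses `self._end_trace_tasks` and a fixed 30-second timeout), and returns a flag instead of logging.

```python
import asyncio


async def cleanup_async_tasks(tasks: set, timeout: float) -> bool:
    """Await all tracked tasks; return False if the timeout expired first."""
    if not tasks:
        return True
    try:
        await asyncio.wait_for(
            asyncio.gather(*tasks, return_exceptions=True), timeout=timeout
        )
        return True
    except asyncio.TimeoutError:
        # wait_for cancels the gather, which in turn cancels unfinished tasks.
        return False
    finally:
        tasks.clear()  # drop references whether or not the wait succeeded


async def demo() -> tuple:
    fast = {asyncio.create_task(asyncio.sleep(0.01))}
    slow_task = asyncio.create_task(asyncio.sleep(10))
    slow = {slow_task}
    fast_ok = await cleanup_async_tasks(fast, timeout=1.0)
    slow_ok = await cleanup_async_tasks(slow, timeout=0.05)
    return fast_ok, slow_ok, slow_task.cancelled(), len(fast), len(slow)


result = asyncio.run(demo())
```

The `finally` clause matters for the leak discussion: the set is emptied even on timeout, so finished or cancelled tasks do not keep their owner reachable.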
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20680cd and 2ac2e39.

📒 Files selected for processing (3)
  • src/lfx/src/lfx/custom/custom_component/custom_component.py
  • src/lfx/src/lfx/graph/graph/base.py
  • src/lfx/src/lfx/graph/vertex/base.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to src/backend/**/*.py : Use `asyncio.create_task()` for background tasks and implement proper cleanup with try/except for `asyncio.CancelledError`

Applied to files:

  • src/lfx/src/lfx/graph/graph/base.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use async fixtures with proper cleanup using try/finally blocks to ensure resources are properly released after tests complete

Applied to files:

  • src/lfx/src/lfx/graph/graph/base.py
🧬 Code graph analysis (2)
src/lfx/src/lfx/graph/graph/base.py (2)
src/backend/base/langflow/services/task/backends/anyio.py (1)
  • cleanup_task (108-117)
src/backend/base/langflow/server.py (1)
  • error (53-61)
src/lfx/src/lfx/custom/custom_component/custom_component.py (1)
src/lfx/src/lfx/graph/vertex/base.py (1)
  • Vertex (50-885)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Update Starter Projects
🔇 Additional comments (5)
src/lfx/src/lfx/graph/vertex/base.py (2)

7-7: LGTM: Weak reference import for breaking reference cycles.

The import aligns with the PR objective to prevent memory leaks by avoiding reference cycles between Vertex and Graph.


118-126: LGTM: Property correctly dereferences weak reference with explicit error handling.

The property pattern is well-implemented:

  • Safely dereferences the weak reference
  • Raises a clear ValueError if the Graph has been garbage collected
  • Maintains the same public interface
src/lfx/src/lfx/graph/graph/base.py (1)

698-707: LGTM: Correct pattern for async trace scheduling.

The implementation correctly:

  • Uses get_running_loop() to detect if we're in an async context
  • Falls back to synchronous execution via run_until_complete when no loop is running
  • Tracks tasks for later cleanup via _end_trace_tasks
  • Uses add_done_callback with discard for automatic cleanup
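The four points above can be combined into one small sketch. All names are hypothetical, and `asyncio.run` is swapped in here for the `run_until_complete` fallback that the review mentions, to keep the example self-contained.

```python
import asyncio

pending: set[asyncio.Task] = set()  # plays the role of _end_trace_tasks
finished: list[str] = []


async def end_trace(name: str) -> None:
    await asyncio.sleep(0)
    finished.append(name)


def schedule_end_trace(name: str) -> None:
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # Synchronous caller: no loop is running, so run to completion here.
        asyncio.run(end_trace(name))
        return
    task = loop.create_task(end_trace(name))
    pending.add(task)  # strong reference keeps the task alive until it is done
    task.add_done_callback(pending.discard)  # remove it automatically afterwards


async def main() -> int:
    schedule_end_trace("async-path")
    tracked = len(pending)
    await asyncio.gather(*pending)
    await asyncio.sleep(0)  # give done callbacks a turn
    return tracked


schedule_end_trace("sync-path")
tracked_while_running = asyncio.run(main())
```

After both calls the tracking set is empty again, which is the property that lets a later cleanup step skip work when nothing is outstanding.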
src/lfx/src/lfx/custom/custom_component/custom_component.py (2)

4-4: LGTM: Weak reference import for breaking reference cycles.

The import aligns with the PR objective to prevent memory leaks.


91-91: LGTM: Weak reference storage for vertex.

Correctly uses weakref.ReferenceType[Vertex] type annotation for the weak reference.


Comment thread src/lfx/src/lfx/custom/custom_component/custom_component.py
Comment thread src/lfx/src/lfx/graph/graph/base.py Outdated
Comment thread src/lfx/src/lfx/graph/vertex/base.py
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Comment thread src/lfx/src/lfx/custom/custom_component/custom_component.py
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Jan 19, 2026
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Jan 19, 2026