fix: serialize concurrent MCP session access to prevent race conditions by erichare · Pull Request #12761 · langflow-ai/langflow

erichare · 2026-04-17T19:01:49Z

Summary

Fixes intermittent failures (~40–50%) when two MCPTools components in the same flow point at the same SSE URL and multiple flow executions run concurrently, as reported in #9860 and re-confirmed in v1.9.0 / v1.10.0.

Root Cause

Three concurrent-access bugs in MCPSessionManager (src/lfx/src/lfx/base/mcp/util.py):

Unserialized get_session() — concurrent callers for the same server both iterated sessions_by_server[server_key]["sessions"], both passed the health check, and both invoked _cleanup_session_by_id() on the same session.
Unsafe cleanup delete — _cleanup_session_by_id() used del sessions[session_id] in a finally block. Two callers that both passed the if session_id not in sessions guard raced on the delete; the loser raised KeyError: 'streamable_http_<hash>_0' — matching the exact reported error.
Session-ID collision — session_id = f"{server_key}_{len(sessions)}" could recycle IDs after cleanup, silently overwriting an existing session entry.

Fix

Added a per-server asyncio.Lock (created under a module-level guard) to serialize the reuse / create / cleanup sections of get_session().
_cleanup_session_by_id() now pop()s the session entry up front — only the winning caller tears the session down; the rest no-op.
Session IDs now come from a monotonic per-server counter (_session_id_counters[server_key]), so recycled slots don't produce colliding IDs.

Test plan

3 new regression tests in test_mcp_util.py::TestMCPSessionManager:
- test_concurrent_get_session_same_server_reuses_one_session — 10 concurrent get_session() calls share exactly one underlying session / one create call.
- test_concurrent_cleanup_same_session_is_idempotent — 10 concurrent _cleanup_session_by_id() calls on the same session do not raise KeyError; task.cancel() runs exactly once.
- test_session_ids_are_monotonic_not_len_based — the counter keeps advancing after cleanup; {server_key}_0 is not recycled for new sessions.
Existing test_session_caching, test_session_cleanup, test_server_switch_detection still pass.
ruff check + ruff format pass.
Manual verification: run the flow topology from Inconsistent behaviors / Flow failures when using two MCP tool connections with same tool but different agents #9860 (2 MCPTools → same SSE URL → 2 different Agents) with 10+ concurrent executions. No 'streamable_http_..._0' KeyError, no Timeout updating tool list.

Fixes #9860

Two MCPTools components pointing at the same SSE URL share a single MCPSessionManager via the component cache. Under concurrent flow execution (10+ runs against the same server), ~40-50% of runs failed with one of: Error updating tool list: 'streamable_http_<hash>_0' Timeout updating tool list: ... Error updating tool list: <connection error> Three races caused this: 1. get_session() was not serialized per-server. Concurrent callers iterating self.sessions_by_server[server_key]["sessions"] could both pass the health check, then both invoke _cleanup_session_by_id(). 2. _cleanup_session_by_id() used del sessions[session_id] in a finally block. Two callers that both passed the `if session_id not in sessions` guard would race on the delete — the loser raised KeyError: 'streamable_http_<hash>_0', matching the reported symptom. 3. Session ids were generated from len(sessions), so removing and re-adding sessions could silently produce colliding ids. Fixes: - Per-server asyncio.Lock (guarded by a module-level lock for creation) serializes session reuse/creation/cleanup. - _cleanup_session_by_id() now pops the session entry up front; only the winning caller runs teardown, the rest no-op. - Session ids come from a monotonic per-server counter. Regression tests cover all three races: 10 concurrent get_session calls must share a single created session; 10 concurrent cleanups must not raise; session ids must not recycle "_0" after cleanup. Fixes #9860

coderabbitai · 2026-04-17T19:01:57Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31276d28-cff2-4af8-ab39-12a976761413

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/mcp-session-concurrent-access

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-04-17T19:04:49Z

Codecov Report

❌ Patch coverage is 8.77193% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.60%. Comparing base (16432b6) to head (69d8d34).
⚠️ Report is 9 commits behind head on release-1.10.0.

Files with missing lines	Patch %	Lines
src/lfx/src/lfx/base/mcp/util.py	8.77%	104 Missing ⚠️

❌ Your patch status has failed because the patch coverage (8.77%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (49.77%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@                Coverage Diff                 @@
##           release-1.10.0   #12761      +/-   ##
==================================================
- Coverage           52.90%   52.60%   -0.30%     
==================================================
  Files                2025     2025              
  Lines              184223   183549     -674     
  Branches            28873    26179    -2694     
==================================================
- Hits                97460    96559     -901     
- Misses              85669    85895     +226     
- Partials             1094     1095       +1

Flag	Coverage Δ
backend	`55.87% <ø> (ø)`
frontend	`52.53% <ø> (-0.43%)`	⬇️
lfx	`49.77% <8.77%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/lfx/src/lfx/base/mcp/util.py	`10.88% <8.77%> (+0.34%)`	⬆️

... and 220 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-04-17T19:10:15Z

Frontend Unit Test Coverage Report

Coverage Summary

Lines	Statements	Branches	Functions
	34.74% (39851/114703)	67.39% (5443/8076)	35.73% (932/2608)

Unit Test Results

Tests	Skipped	Failures	Errors	Time
3818	0 💤	0 ❌	0 🔥	7m 48s ⏱️

Addresses review findings on the previous commit: HIGH: cleanup paths bypassed the per-server lock. Both `_cleanup_session(context_id)` (invoked from client disconnect) and `_cleanup_idle_sessions()` (invoked from the periodic background task) still mutated `sessions_by_server` without holding the lock. A concurrent `get_session()` could finish validating a session while idle-cleanup popped and cancelled its task; `get_session()` then returned a dead session and registered a dangling refcount entry. MEDIUM: per-server maps (`_server_locks`, `_session_id_counters`) grew unboundedly. The previous patch only reclaimed server entries from `sessions_by_server`; rotating auth/session headers (which change `server_key` via `_get_server_key`) leaked entries in the other two maps forever in long-lived processes. Changes: - Replace `_get_server_lock()` with `_server_lock()`, an async context manager that pin-counts the entry so reclamation can't race a task that holds or is about to acquire the lock. - Wrap the mutating regions of `_cleanup_session()` and `_cleanup_idle_sessions()` in `_server_lock(server_key)`. - On lock release, reclaim both `_server_locks[server_key]` and `_session_id_counters[server_key]` once pins drop to 0, the lock is unheld, and the server has no remaining sessions. - `cleanup_all()` also clears the two maps under `_locks_guard`. New regression tests: - `test_cleanup_idle_vs_get_session_are_serialized` — an in-flight `get_session` blocks a concurrent idle-cleanup pass; the session is returned live, not torn down underneath the caller. - `test_concurrent_cleanup_session_and_get_session_safe` — refcount transitions stay consistent across concurrent connect/disconnect. - `test_server_lock_and_counter_reclaimed_when_unused` — after all sessions for a `server_key` are removed, both `_server_locks` and `_session_id_counters` entries go away.

erichare · 2026-04-17T19:19:44Z

Thanks for the sharp review — both findings confirmed and fixed in cea0156.

HIGH — cleanup paths now honor the per-server lock

Both _cleanup_session(context_id) (client disconnect) and _cleanup_idle_sessions() (periodic task) now wrap their mutating regions in async with self._server_lock(server_key):. The race you described — cleanup cancelling a session while get_session() is mid-validation, then get_session returning the dead session and creating a dangling refcount — is gone.

New regression tests exercise both angles:

test_cleanup_idle_vs_get_session_are_serialized — starts a get_session() that blocks inside _validate_session_connectivity while holding the lock, fires _cleanup_idle_sessions() concurrently, and asserts the session isn't torn down until the getter finishes. Because get_session bumps last_used on reuse, the cleaner correctly leaves the just-used session alone afterward.
test_concurrent_cleanup_session_and_get_session_safe — races a disconnect (_cleanup_session("ctx_a")) against a second get_session("ctx_b"); refcount transitions stay consistent, session isn't cancelled.

MEDIUM — `_server_locks` and `_session_id_counters` are now reclaimed

Converted _get_server_lock() into an async context manager _server_lock(server_key) that pin-counts its entry. On release, when pins == 0, the lock isn't held, and server_key not in sessions_by_server, both maps drop the entry. The combined condition avoids two failure modes:

Aggressive reclamation (pins==0 only) would churn locks for live servers — a lock entry bouncing create/destroy on every get_session call.
Leaking (previous behavior) meant rotating auth headers grew the maps forever, since server_key includes the hashed headers.

cleanup_all() also clears the two maps under _locks_guard.

New test:

test_server_lock_and_counter_reclaimed_when_unused — after the only context disconnects and the idle cleanup removes the empty server entry, both _server_locks[server_key] and _session_id_counters[server_key] are gone.

Happy to rename the pin-counting helper if you'd prefer something different — _server_lock felt natural alongside _locks_guard.

Addresses the review finding on cross-server reuse of `context_id`: `_cleanup_session(context_id)` runs under the per-server lock of the server the mapping pointed at when cleanup started. That lock does NOT serialize a concurrent `get_session(context_id, different_server)` which runs under a *different* per-server lock and atomically re-points `_context_to_session[context_id]` at the new session. The old cleanup path then ran `self._context_to_session.pop(context_id, None)` unconditionally, wiping out the fresh mapping. The new session on the other server was left with refcount 1 and no context mapping, so the next disconnect was a no-op and the session leaked indefinitely. Fix: CAS the pop. After the refcount decrement / teardown, only drop the mapping if `_context_to_session[context_id]` still points at `(server_key, session_id)` — the pair we just cleaned up. The `get()` and `pop()` run synchronously (no `await` between them) so asyncio cannot interleave another coroutine between the check and the mutation. New regression test `test_cleanup_does_not_wipe_cross_server_handoff` slows down `_cleanup_session_by_id` for server A while a concurrent `get_session(ctx, serverB)` re-points the context at server B. Without the CAS, the assertion that `_context_to_session[ctx] == (server_B, ...)` fails. With the CAS, the fresh B mapping survives and its refcount is intact.

erichare · 2026-04-17T19:29:10Z

Good catch — fixed in 0d4239c.

The per-server lock does serialize same-server cleanup/reuse, but you're right that _cleanup_session(context_id) acquires only the lock for the server the mapping currently points at. A concurrent get_session(context_id, different_server) runs under a different lock and atomically re-points _context_to_session[context_id] at the new session. The old cleanup path then ran _context_to_session.pop(context_id, None) unconditionally, destroying the fresh mapping and stranding the new session with refcount 1 and no mapping — so the next disconnect was a no-op and the session leaked forever.

Fix: CAS the pop. After the refcount decrement and teardown for the session we were cleaning, only drop the mapping if _context_to_session.get(context_id) == (server_key, session_id) still — the pair we actually closed. .get() and .pop() run synchronously with no await between them, so asyncio cannot interleave another coroutine to re-point the mapping after our check but before our pop.

New regression test test_cleanup_does_not_wipe_cross_server_handoff patches _cleanup_session_by_id to block inside server A's lock while a concurrent get_session(ctx, serverB) re-points the context at B. Verified locally: the test fails (AssertionError on the fresh-B-mapping assertion) without the CAS, passes with it. Server A's session is cleaned up, server B's session + mapping + refcount are all intact.

Cristhianzl

🧭 TL;DR Verdict

Dimension	Result
🔴 CRITICAL blockers	None
🟠 IMPORTANT (must-fix before merge)	2
🟡 RECOMMENDED	5
🟢 NICE TO HAVE	3
Overall	Approve with minor changes — the bug fix itself is surgically correct and well-tested; findings below are quality refinements, not design defects

This is a high-quality concurrency fix. The root-cause analysis is sharp, the fix is minimal and targeted, the three commits form a clean progression (fix → extend lock coverage → close cross-server CAS), and the regression tests actually reproduce the races they’re guarding against. The PR exemplifies the security-mindset principles in DEVELOPMENT_RULE.md §Security Mindset: each commit explicitly names an assumption that was wrong and demonstrates the blast radius before fixing it.

🔴 CRITICAL — Security & PII Blockers

None found.

Verifications performed:

No print(...), console.log, or PII in any new logger.* call. The only values logged are session_id and server_key — both internal, non-reversible identifiers (server_key = f"streamable_http_{hash(url|headers)}", see util.py:877, 885).
No secrets or tokens appear in any log statement, test fixture, or error message.
No eval, exec, os.system, subprocess injection, SQL string concat, or innerHTML-equivalents.
Test fixtures use http://example.test/sse — an IANA reserved test domain. No real endpoints.
The fix does not alter any authentication/authorization boundary; it only serializes already-authenticated session reuse.

🟠 IMPORTANT — Must Fix Before Merge

1. `dict[str, Any]` typing on `_server_locks`

File: src/lfx/src/lfx/base/mcp/util.py:758

self._server_locks: dict[str, dict[str, Any]] = {}

The comment above documents the shape precisely — {"lock": asyncio.Lock(), "pins": int} — but the type system sees Any, defeating static checks on every callsite that touches entry["lock"] / entry["pins"].

Why it violates the rules: DEVELOPMENT_RULE.md §Code Quality – General Rules: "Use strong typing; avoid loose types like any, object, or dynamic." And REVIEWER_RULE.md §Clean Code — Any is the Python equivalent of TS any.
Fix: Introduce a TypedDict (no runtime cost) next to the attribute:
```
from typing import TypedDict

class _ServerLockEntry(TypedDict):
    lock: asyncio.Lock
    pins: int

self._server_locks: dict[str, _ServerLockEntry] = {}
```
Or a slotted dataclass if mutation ergonomics matter. Either option preserves the pin-count invariant encoded in the comment.

2. Legacy file-size violation is being extended

File: src/lfx/src/lfx/base/mcp/util.py — 2,127 lines after this PR (was 2,112).
File: src/backend/tests/unit/base/mcp/test_mcp_util.py — 3,666 lines after this PR (was 3,465).

Both files massively exceed the hard 500-line limit documented in REVIEWER_RULE.md §File Structure Limits / DEVELOPMENT_RULE.md §1.

This PR did not create the problem and the fix legitimately must live on MCPSessionManager, so I am not blocking merge on this alone. However, REVIEWER_RULE.md §Legacy Code Awareness is explicit:

Do NOT prolong bad patterns — even if surrounding code is bad, write good code. Flag legacy patterns you encounter for future cleanup.

Why it matters here: this PR adds four new methods (_server_lock, _release_server_lock_if_idle, _next_session_id, plus a modified _cleanup_session) to an already-2k-line file. Each addition makes the eventual split harder.
Recommended action (not a blocker): open a follow-up ticket to extract MCPSessionManager into its own module (mcp/session_manager.py + mcp/session_manager_test.py), grouping the new concurrency primitives in a helpers/concurrency.py submodule. Reference this PR as the trigger.
For this PR specifically: add a # TODO(#<ticket>): marker immediately above the class MCPSessionManager: declaration so the debt is visible to the next reader.

🟡 RECOMMENDED — Should Fix

3. Time-based sleeps make the regression tests flaky-prone

File: src/backend/tests/unit/base/mcp/test_mcp_util.py

test_concurrent_get_session_same_server_reuses_one_session → await asyncio.sleep(0.05) inside fake_create (line 90 of diff)
test_cleanup_idle_vs_get_session_are_serialized → await asyncio.sleep(0.02) (line 528 of diff) to "give it a chance to run if it were going to race"
test_cleanup_does_not_wipe_cross_server_handoff → await asyncio.sleep(0) (line 936 of diff)
Why: REVIEWER_RULE.md §Test Quality — "Deterministic tests (no flaky tests)". On slow CI runners or under load, 20 ms can expire before the task reaches the awaited point.

Fix: replace the sleep-based waits with asyncio.Event / asyncio.Barrier synchronization. Example:

cleaner_entered = asyncio.Event()
# monkeypatch _cleanup_idle_sessions to set cleaner_entered.set() before blocking
cleaner = asyncio.create_task(session_manager._cleanup_idle_sessions())
await cleaner_entered.wait()   # deterministic rendezvous

The first test (fake_create with 0.05 s) can just use an asyncio.Event() per caller and gather — the point is to overlap, not to measure a latency.

4. Law-of-Demeter leaks of internal dict shape

File: src/lfx/src/lfx/base/mcp/util.py

New code repeatedly digs into shared dicts:

self.sessions_by_server[server_key]["sessions"]
self._server_locks[server_key]["lock"]
self._server_locks[server_key]["pins"]

Also in the regression tests (test_mcp_util.py:104, 171, 587, 947 etc.).

Why: DEVELOPMENT_RULE.md §Law of Demeter and REVIEWER_RULE.md §KISS. The caller knows the full shape, which is exactly what the dict[str, Any] hides. Tests becoming coupled to internal representation is a well-known smell.

Fix: lightweight accessors on MCPSessionManager:

def _sessions_for(self, server_key: str) -> dict[str, dict]:
    entry = self.sessions_by_server.get(server_key)
    return entry.get("sessions", {}) if entry else {}

Combined with finding #1's TypedDict, this also removes ["lock"] / ["pins"] string-key access from the concurrency path.

5. Pre-existing `except Exception: # noqa: BLE001`

File: src/lfx/src/lfx/base/mcp/util.py — inside _cleanup_session_by_id (the generic-catch block the PR preserves).

This is not introduced by this PR — it already existed — but REVIEWER_RULE.md §Error Handling asks: "No generic exceptions (Exception, Error, object)".

Recommended action: since you're already touching this function (the whole purpose of the PR), narrow the except to the actual exceptions teardown can raise (asyncio.CancelledError, ConnectionError, OSError, the MCP-specific ones you already enumerate in _validate_session_connectivity), plus Exception as a last-resort log-and-continue. At minimum, replace the bare # noqa: BLE001 with a short comment saying why the broad catch is load-bearing.
If out of scope for this PR: fine — add a follow-up ticket.

6. `_release_server_lock_if_idle` reclamation has a subtle ordering contract not asserted

File: src/lfx/src/lfx/base/mcp/util.py:801-808

The reclamation uses three conditions:

if entry["pins"] <= 0 and not entry["lock"].locked() and server_key not in self.sessions_by_server:

Why <= not == 0? If pins ever goes negative, the branch still fires — but a negative pin count would indicate a missing acquire / double-release bug that should fail loudly.

Fix: Keep the <= 0 for safety but add an assert entry["pins"] >= 0 with a log.warn path when it trips, so a regression shows up in telemetry instead of being silently swept.

7. `_next_session_id` is not thread/task-safe on its own

File: src/lfx/src/lfx/base/mcp/util.py:810-814

def _next_session_id(self, server_key: str) -> str:
    current = self._session_id_counters.get(server_key, 0)
    self._session_id_counters[server_key] = current + 1
    return f"{server_key}_{current}"

This is correct only because every caller of _next_session_id is already inside async with self._server_lock(server_key) (verified: the single call site is get_session at util.py:979 after the async with self._server_lock(...) block). That invariant is load-bearing and invisible.

Fix: add a one-line docstring note: "Caller must hold self._server_lock(server_key). Otherwise this increment races." This protects the next maintainer.

🟢 NICE TO HAVE — Polish

8. Type the async context manager return

File: src/lfx/src/lfx/base/mcp/util.py:766-767

@contextlib.asynccontextmanager
async def _server_lock(self, server_key: str):

Annotate the yield for clarity and IDE support:

from collections.abc import AsyncIterator

@contextlib.asynccontextmanager
async def _server_lock(self, server_key: str) -> AsyncIterator[None]:

9. Test naming convention

Tests follow test_<what>_<expected>. Most are descriptive, but test_concurrent_cleanup_same_session_is_idempotent could be test_cleanup_session_by_id_is_idempotent_when_called_concurrently to match should_[expected]_when_[condition] style from REVIEWER_RULE.md §Test Quality. Low priority — the intent is unambiguous either way.

10. Consider a centralized constants symbol

"streamable_http", "sse", "stdio" are string literals peppered across both files. A small class TransportType(StrEnum): ... would remove magic strings (DEVELOPMENT_RULE.md §General Rules – avoid magic strings). Probably out of scope for a bug-fix PR, but worth a follow-up.

✅ Compliant Aspects Worth Highlighting

Root-cause discipline — each of the three commits opens with a clear statement of what the old code assumed, why that assumption was wrong, and what the blast radius was. Matches DEVELOPMENT_RULE.md §Security Mindset verbatim.
Defensive by default — sessions.pop(session_id, None) replaces the racy del sessions[session_id]; _context_to_session.get() == ... then .pop() CAS replaces the unconditional pop.
Reclamation is conservative — pin counting was added in response to review feedback to close a subtle reclaim-vs-acquire race. The pins == 0 AND not locked AND server empty gate is the right conjunction.
Tests reproduce the races — not just assertion-of-happy-path. test_cleanup_idle_vs_get_session_are_serialized uses an asyncio.Event rendezvous to prove the lock is load-bearing, not incidental.
Idempotent cleanup — test_concurrent_cleanup_same_session_is_idempotent fires 10 concurrent cleanups and asserts cancel.assert_called_once(). Exactly the right shape.
Cross-server CAS — the third commit catches a race (ctx_move reconnecting to a different server while the old disconnect is in flight) that most reviewers would miss. The reasoning in the commit message is first-rate.
Zero PII / zero secrets / zero eval.
Comments explain WHY — every new block has a paragraph in the docstring describing the incident it prevents, not the mechanics of the code.

🧪 Security Mindset — The Five Questions (per `REVIEWER_RULE.md` and `DEVELOPMENT_RULE.md §Security Mindset`)

#	Question	Answer for this PR
1	What is this code trusting without verifying?	The monotonic counter is only safe while held under `_server_lock` — invariant is documented in the docstring but not enforced; see finding #7.
2	Authoritative source for behavior?	Python `asyncio` primitives (`Lock`, `Event`, `gather`) — idiomatic and correct. ✅
3	What happens in every failure path?	`_cleanup_session_by_id` preserves the pre-existing broad `except Exception` for teardown safety (finding #5). Lock-acquire failure does not hide underlying exceptions. ✅
4	Who controls each value, and can they lie?	All inputs (`server_key`, `session_id`, `context_id`) are server-generated. No external control surface. ✅
5	Blast radius if wrong?	Worst case: session leak or DoS via connection exhaustion on a long-lived process. The PR specifically closes the "leaks forever" case via pin-counted reclamation. Reduced, not amplified. ✅

📋 Final Pre-Merge Checklist

🔴 CRITICAL                                   PASS
🟠 IMPORTANT                                  2 items — resolve before merge
   [ ] 1. Replace dict[str, Any] with TypedDict for _ServerLockEntry
   [ ] 2. Add TODO(#<follow-up-ticket>) marker flagging util.py file-size debt
🟡 RECOMMENDED                                5 items — strongly encouraged
   [ ] 3. Replace asyncio.sleep-based test rendezvous with asyncio.Event
   [ ] 4. Encapsulate internal dict-shape access (Law of Demeter)
   [ ] 5. Narrow (or justify) the broad `except Exception` in _cleanup_session_by_id
   [ ] 6. Assert non-negative pin count in _release_server_lock_if_idle
   [ ] 7. Document "caller must hold server lock" on _next_session_id
🟢 NICE TO HAVE                               3 items — follow-up ticket OK

Testing
   [x] Regression tests exist for all three race conditions (reuse, cleanup, id collision)
   [x] Extra regression tests for follow-on findings (idle-vs-get, cross-server CAS, reclamation)
   [x] Tests are adversarial, not happy-path only
   [ ] Replace timing-based sleeps with Event rendezvous (finding #3)
   [ ] Show pytest coverage output for the touched functions (≥ 75%, target 80%)
   [x] `ruff check` / `ruff format` confirmed passing in PR description

IMPORTANT - Replace `dict[str, Any]` with `_ServerLockEntry` TypedDict so the lock/pins shape is visible to static analysis (finding #1). - Add TODO marker above `MCPSessionManager` flagging the module's file-size debt as follow-up work (finding #2). RECOMMENDED - Replace timing-based `asyncio.sleep(0.02)` / `asyncio.sleep(0.05)` in concurrency regression tests with deterministic Event-based rendezvous and pin-count polling (finding #3). - Introduce `_sessions_for(server_key)` helper and use it in `_cleanup_idle_sessions`, `_cleanup_session_by_id`, and `cleanup_all` to stop reaching through `sessions_by_server[server_key]["sessions"]` at every call site (finding #4). - Document why `_cleanup_session_by_id` keeps a broad `except Exception` (transport-layer teardown raises many different exception hierarchies — leak-on-cleanup is worse than a swallowed error) (finding #5). - Log a warning when pin count goes negative in `_release_server_lock_if_idle` so a missing acquire / double release surfaces in telemetry instead of being silently swept (finding #6). - Document the "caller must hold `_server_lock(server_key)`" invariant on `_next_session_id` (finding #7). NICE TO HAVE - Annotate the `_server_lock` async context manager yield type as `AsyncIterator[None]` for IDE support (finding #8). All 175 tests in `test_mcp_util.py` still pass; ruff clean.

# Conflicts: # src/lfx/src/lfx/base/mcp/util.py

Cristhianzl

lgtm

…ns (#12761) * fix: serialize concurrent MCP session access to prevent race conditions Two MCPTools components pointing at the same SSE URL share a single MCPSessionManager via the component cache. Under concurrent flow execution (10+ runs against the same server), ~40-50% of runs failed with one of: Error updating tool list: 'streamable_http_<hash>_0' Timeout updating tool list: ... Error updating tool list: <connection error> Three races caused this: 1. get_session() was not serialized per-server. Concurrent callers iterating self.sessions_by_server[server_key]["sessions"] could both pass the health check, then both invoke _cleanup_session_by_id(). 2. _cleanup_session_by_id() used del sessions[session_id] in a finally block. Two callers that both passed the `if session_id not in sessions` guard would race on the delete — the loser raised KeyError: 'streamable_http_<hash>_0', matching the reported symptom. 3. Session ids were generated from len(sessions), so removing and re-adding sessions could silently produce colliding ids. Fixes: - Per-server asyncio.Lock (guarded by a module-level lock for creation) serializes session reuse/creation/cleanup. - _cleanup_session_by_id() now pops the session entry up front; only the winning caller runs teardown, the rest no-op. - Session ids come from a monotonic per-server counter. Regression tests cover all three races: 10 concurrent get_session calls must share a single created session; 10 concurrent cleanups must not raise; session ids must not recycle "_0" after cleanup. Fixes #9860 * fix: extend per-server lock to cleanup paths and reclaim per-key maps Addresses review findings on the previous commit: HIGH: cleanup paths bypassed the per-server lock. Both `_cleanup_session(context_id)` (invoked from client disconnect) and `_cleanup_idle_sessions()` (invoked from the periodic background task) still mutated `sessions_by_server` without holding the lock. A concurrent `get_session()` could finish validating a session while idle-cleanup popped and cancelled its task; `get_session()` then returned a dead session and registered a dangling refcount entry. MEDIUM: per-server maps (`_server_locks`, `_session_id_counters`) grew unboundedly. The previous patch only reclaimed server entries from `sessions_by_server`; rotating auth/session headers (which change `server_key` via `_get_server_key`) leaked entries in the other two maps forever in long-lived processes. Changes: - Replace `_get_server_lock()` with `_server_lock()`, an async context manager that pin-counts the entry so reclamation can't race a task that holds or is about to acquire the lock. - Wrap the mutating regions of `_cleanup_session()` and `_cleanup_idle_sessions()` in `_server_lock(server_key)`. - On lock release, reclaim both `_server_locks[server_key]` and `_session_id_counters[server_key]` once pins drop to 0, the lock is unheld, and the server has no remaining sessions. - `cleanup_all()` also clears the two maps under `_locks_guard`. New regression tests: - `test_cleanup_idle_vs_get_session_are_serialized` — an in-flight `get_session` blocks a concurrent idle-cleanup pass; the session is returned live, not torn down underneath the caller. - `test_concurrent_cleanup_session_and_get_session_safe` — refcount transitions stay consistent across concurrent connect/disconnect. - `test_server_lock_and_counter_reclaimed_when_unused` — after all sessions for a `server_key` are removed, both `_server_locks` and `_session_id_counters` entries go away. * fix: CAS context mapping pop to preserve cross-server handoffs Addresses the review finding on cross-server reuse of `context_id`: `_cleanup_session(context_id)` runs under the per-server lock of the server the mapping pointed at when cleanup started. That lock does NOT serialize a concurrent `get_session(context_id, different_server)` which runs under a *different* per-server lock and atomically re-points `_context_to_session[context_id]` at the new session. The old cleanup path then ran `self._context_to_session.pop(context_id, None)` unconditionally, wiping out the fresh mapping. The new session on the other server was left with refcount 1 and no context mapping, so the next disconnect was a no-op and the session leaked indefinitely. Fix: CAS the pop. After the refcount decrement / teardown, only drop the mapping if `_context_to_session[context_id]` still points at `(server_key, session_id)` — the pair we just cleaned up. The `get()` and `pop()` run synchronously (no `await` between them) so asyncio cannot interleave another coroutine between the check and the mutation. New regression test `test_cleanup_does_not_wipe_cross_server_handoff` slows down `_cleanup_session_by_id` for server A while a concurrent `get_session(ctx, serverB)` re-points the context at server B. Without the CAS, the assertion that `_context_to_session[ctx] == (server_B, ...)` fails. With the CAS, the fresh B mapping survives and its refcount is intact. * refactor(mcp): address review findings on concurrent-access fix IMPORTANT - Replace `dict[str, Any]` with `_ServerLockEntry` TypedDict so the lock/pins shape is visible to static analysis (finding #1). - Add TODO marker above `MCPSessionManager` flagging the module's file-size debt as follow-up work (finding #2). RECOMMENDED - Replace timing-based `asyncio.sleep(0.02)` / `asyncio.sleep(0.05)` in concurrency regression tests with deterministic Event-based rendezvous and pin-count polling (finding #3). - Introduce `_sessions_for(server_key)` helper and use it in `_cleanup_idle_sessions`, `_cleanup_session_by_id`, and `cleanup_all` to stop reaching through `sessions_by_server[server_key]["sessions"]` at every call site (finding #4). - Document why `_cleanup_session_by_id` keeps a broad `except Exception` (transport-layer teardown raises many different exception hierarchies — leak-on-cleanup is worse than a swallowed error) (finding #5). - Log a warning when pin count goes negative in `_release_server_lock_if_idle` so a missing acquire / double release surfaces in telemetry instead of being silently swept (finding #6). - Document the "caller must hold `_server_lock(server_key)`" invariant on `_next_session_id` (finding #7). NICE TO HAVE - Annotate the `_server_lock` async context manager yield type as `AsyncIterator[None]` for IDE support (finding #8). All 175 tests in `test_mcp_util.py` still pass; ruff clean.

github-actions Bot added the bug Something isn't working label Apr 17, 2026

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Apr 17, 2026

Cristhianzl requested changes Apr 20, 2026

View reviewed changes

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Apr 20, 2026

erichare requested a review from Cristhianzl April 20, 2026 17:04

Merge branch 'release-1.10.0' into fix/mcp-session-concurrent-access

69d8d34

# Conflicts: # src/lfx/src/lfx/base/mcp/util.py

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Apr 20, 2026

Cristhianzl approved these changes Apr 23, 2026

View reviewed changes

github-actions Bot added the lgtm This PR has been approved by a maintainer label Apr 23, 2026

erichare added this pull request to the merge queue Apr 23, 2026

Merged via the queue into release-1.10.0 with commit 0217f70 Apr 23, 2026
190 of 194 checks passed

erichare deleted the fix/mcp-session-concurrent-access branch April 23, 2026 18:48

lice-reis mentioned this pull request May 5, 2026

Inconsistent behaviors / Flow failures when using two MCP tool connections with same tool but different agents #9860

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: serialize concurrent MCP session access to prevent race conditions#12761

fix: serialize concurrent MCP session access to prevent race conditions#12761
erichare merged 5 commits into
release-1.10.0from
fix/mcp-session-concurrent-access

erichare commented Apr 17, 2026

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Review skipped

Uh oh!

codecov Bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

erichare commented Apr 17, 2026

Uh oh!

erichare commented Apr 17, 2026

Uh oh!

Cristhianzl left a comment

Uh oh!

Cristhianzl left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erichare commented Apr 17, 2026

Summary

Root Cause

Fix

Test plan

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codecov Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Frontend Unit Test Coverage Report

Coverage Summary

Unit Test Results

Uh oh!

erichare commented Apr 17, 2026

HIGH — cleanup paths now honor the per-server lock

MEDIUM — _server_locks and _session_id_counters are now reclaimed

Uh oh!

erichare commented Apr 17, 2026

Uh oh!

Cristhianzl left a comment

Choose a reason for hiding this comment

🧭 TL;DR Verdict

🔴 CRITICAL — Security & PII Blockers

🟠 IMPORTANT — Must Fix Before Merge

1. dict[str, Any] typing on _server_locks

2. Legacy file-size violation is being extended

🟡 RECOMMENDED — Should Fix

3. Time-based sleeps make the regression tests flaky-prone

4. Law-of-Demeter leaks of internal dict shape

5. Pre-existing except Exception: # noqa: BLE001

6. _release_server_lock_if_idle reclamation has a subtle ordering contract not asserted

7. _next_session_id is not thread/task-safe on its own

🟢 NICE TO HAVE — Polish

8. Type the async context manager return

9. Test naming convention

10. Consider a centralized constants symbol

✅ Compliant Aspects Worth Highlighting

🧪 Security Mindset — The Five Questions (per REVIEWER_RULE.md and DEVELOPMENT_RULE.md §Security Mindset)

📋 Final Pre-Merge Checklist

Uh oh!

Cristhianzl left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

codecov Bot commented Apr 17, 2026 •

edited

Loading

github-actions Bot commented Apr 17, 2026 •

edited

Loading

MEDIUM — `_server_locks` and `_session_id_counters` are now reclaimed

1. `dict[str, Any]` typing on `_server_locks`

5. Pre-existing `except Exception: # noqa: BLE001`

6. `_release_server_lock_if_idle` reclamation has a subtle ordering contract not asserted

7. `_next_session_id` is not thread/task-safe on its own

🧪 Security Mindset — The Five Questions (per `REVIEWER_RULE.md` and `DEVELOPMENT_RULE.md §Security Mindset`)