[Detail Bug] Inworld TTS: Cancelling context acquisition leaks pool reservations and connections, exhausting capacity #5426

@detail-app

Description

Summary

  • Context: The Inworld TTS plugin uses a custom connection pool (_ConnectionPool) that manages multiple _InworldConnection instances. Each connection tracks pending acquisitions via _pending_acquisitions to reserve capacity before context creation completes.
  • Bug: _ConnectionPool.acquire_context() catches Exception but not BaseException, so asyncio.CancelledError bypasses cleanup. This leaks the _pending_acquisitions counter AND, for newly created connections, the connection object itself.
  • Actual vs. expected: When cancellation occurs during context acquisition, cleanup code is never executed. _pending_acquisitions is never decremented, and newly created connections are never removed from the pool or closed. All cleanup should happen even during cancellation.
  • Impact: For existing connections: phantom reservations accumulate, eventually causing capacity exhaustion. For newly created connections: the connection is leaked (never closed, remains in pool with leaked counter).

Impact Details

Primary Impact: Reservation Leak Causes Capacity Exhaustion

When cancellation occurs during pool.acquire_context():

  1. _pending_acquisitions leaks - Each leaked reservation permanently reduces available capacity
  2. has_capacity returns False prematurely - At line 704, the pool skips connections with phantom reservations
  3. is_idle returns False - Connection appears busy even with no active contexts

The pool uses has_capacity at line 704 to route requests:

# Line 702-706: Pool routing logic
for existing in self._connections:
    if not existing._closed and existing.has_capacity:  # <-- Uses has_capacity
        conn = existing
        break

With leaked reservations:

  • has_capacity = (context_count + _pending_acquisitions) < MAX_CONTEXTS
  • If _pending_acquisitions = 5 (after 5 cancellations): has_capacity = (0 + 5) < 5 = False

After 5 cancellations on a connection, that connection becomes permanently unusable despite having no actual contexts.
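The accumulation can be reproduced with a minimal, self-contained sketch. `LeakyPool` and its fields are illustrative stand-ins for the plugin's classes, not the actual code; only the reservation counter and the faulty `except Exception:` are modeled:

```python
import asyncio


class LeakyPool:
    """Illustrative stand-in for the plugin's connection, not the real class."""

    MAX_CONTEXTS = 5

    def __init__(self) -> None:
        self.context_count = 0
        self.pending_acquisitions = 0

    @property
    def has_capacity(self) -> bool:
        return (self.context_count + self.pending_acquisitions) < self.MAX_CONTEXTS

    async def acquire(self) -> None:
        self.pending_acquisitions += 1  # reserve capacity up front
        try:
            await asyncio.sleep(10)  # stands in for the network wait in connect()
        except Exception:  # BUG: CancelledError derives from BaseException, so it slips past
            self.pending_acquisitions -= 1
            raise


async def demo() -> LeakyPool:
    pool = LeakyPool()
    for _ in range(5):
        task = asyncio.create_task(pool.acquire())
        await asyncio.sleep(0)  # let the task reach its await point
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    return pool


pool = asyncio.run(demo())
print(pool.pending_acquisitions, pool.has_capacity)  # 5 False
```

Five cancellations leave `pending_acquisitions` at 5 with zero real contexts, and `has_capacity` flips to `False` permanently.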

Secondary Impact: Idle Cleanup Is BLOCKED (Not Enabled)

# Line 777-782: Idle connection cleanup
if (
    conn.is_idle  # <-- FALSE when _pending_acquisitions > 0
    and now - conn.last_activity > self._idle_timeout
    and len(self._connections) - len(connections_to_close) > 1
):
    connections_to_close.append(conn)

A connection with leaked reservations has is_idle = False (line 239: return self.context_count == 0 and self._pending_acquisitions == 0). The leaked reservations PREVENT cleanup, not enable it.

Tertiary Impact: Connection Leak for created_new=True

When a NEW connection is created and then cancelled during acquire_context():

  1. Connection is added to pool at line 718
  2. reserve_capacity() called at line 726
  3. CancelledError during await conn.acquire_context() at line 731
  4. Connection cleanup (lines 736-740) is SKIPPED because except Exception: doesn't catch CancelledError
  5. Connection remains in pool with leaked _pending_acquisitions
  6. Connection is never closed (WebSocket connection leaked)

Code with bug

# In _ConnectionPool.acquire_context (lines 729-741)
            if conn:
                try:
                    ctx_id, waiter = await conn.acquire_context(emitter, opts, remaining_timeout)
                except Exception:  # <-- BUG: CancelledError (a BaseException) is NOT caught
                    # Release reservation since we didn't get a context
                    conn.release_reservation()
                    # Remove failed new connection from pool
                    if created_new:
                        async with self._pool_lock:
                            if conn in self._connections:
                                self._connections.remove(conn)
                        await conn.aclose()
                    raise

Evidence

Evidence 1: Caller's handler at line 1256 CANNOT help (CRITICAL)

Reviewer Claim: "The caller already handles CancelledError at line 1256"

FACT: The caller's handler is AFTER the pool call. When cancellation occurs INSIDE pool.acquire_context(), the handler is never reached.

The code flow is:

# Line 1221-1258 in SynthesizeStream._run()
pool = await self._tts._get_pool()
context_id, waiter, connection = await pool.acquire_context(...)  # Line 1222 - CANCELLATION CAN OCCUR HERE

# ... code ...

try:
    await asyncio.wait_for(waiter, timeout=self._conn_options.timeout + 60)  # Line 1251
except asyncio.TimeoutError:
    connection.close_context(context_id)
    raise APITimeoutError() from None
except asyncio.CancelledError:  # Line 1256 - HANDLER IS AFTER pool.acquire_context()
    connection.close_context(context_id)
    raise

When cancellation occurs INSIDE pool.acquire_context() (at line 1222), the try block at line 1251 hasn't been entered yet, so the handler at line 1256 NEVER EXECUTES.

Test at /home/user/agents/test_caller_handler_prove.py proves this definitively.

The caller's handler can only help if cancellation occurs AFTER pool.acquire_context() returns, not during.
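The control flow can be sketched in a few lines (all names here are hypothetical stand-ins, not the plugin's code): when cancellation lands inside the awaited call, a handler in a try block that comes later is simply never reached:

```python
import asyncio


async def acquire_context() -> str:
    await asyncio.sleep(10)  # cancellation lands HERE, before the caller's try block
    return "ctx"


handler_ran = False


async def caller() -> None:
    global handler_ran
    ctx = await acquire_context()  # CancelledError raised here
    try:
        await asyncio.sleep(10)  # stands in for awaiting the waiter
    except asyncio.CancelledError:  # analogous to the handler at line 1256
        handler_ran = True
        raise


async def demo() -> None:
    task = asyncio.create_task(caller())
    await asyncio.sleep(0)  # let caller() reach the await inside acquire_context()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(demo())
print(handler_ran)  # False: the handler after the call never executed
```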

Evidence 2: Integration test using ACTUAL _ConnectionPool and _InworldConnection classes

Test at /home/user/agents/test_real_pool_integration.py:

  • Uses real _ConnectionPool and _InworldConnection classes (not mocks)
  • Demonstrates leak when cancelled DURING connect() at line 286

Evidence 3: Capacity exhaustion demonstrated

Same test shows accumulation:

MAX_CONTEXTS = 5
Initial: _pending_acquisitions=0, has_capacity=True
  Cancellation #1: _pending_acquisitions=1, has_capacity=True
  Cancellation #2: _pending_acquisitions=2, has_capacity=True
  Cancellation #3: _pending_acquisitions=3, has_capacity=True
  Cancellation #4: _pending_acquisitions=4, has_capacity=True
  Cancellation #5: _pending_acquisitions=5, has_capacity=False

After 5 cancellations:
  _pending_acquisitions: 5
  context_count: 0
  has_capacity: False
  is_idle: False

Evidence 4: Leaked reservations BLOCK cleanup, not enable it

# Line 239: is_idle definition
@property
def is_idle(self) -> bool:
    return self.context_count == 0 and self._pending_acquisitions == 0

# Line 779: Cleanup condition
if conn.is_idle  # <-- FALSE when _pending_acquisitions > 0

A connection with leaked reservations has is_idle = False, so it will NEVER be cleaned up by the idle connection pruner.

Evidence 5: Cancellation timing analysis of REAL code

The cancellation window in the REAL _InworldConnection.acquire_context() (lines 274-337):

# Line 286: await self.connect()              <- CANCELLATION POINT #1 (network I/O)
# Line 300: async with self._acquire_lock:    <- CANCELLATION POINT #2 (waiting for lock)
# Line 317: self.release_reservation()       <- BUG WINDOW ENDS HERE
# Line 319: await self._outbound_queue.put() <- SAFE (reservation already released)

Evidence 6: Cancellation during connect() IS realistic in production

Network operations are the MOST COMMON cancellation points in production:

  1. Agent session termination - User closes browser/app while connection is being established
  2. Timeout race conditions - asyncio.wait_for() timeout expires during connect()
  3. Graceful shutdown - Server receives SIGTERM while connections are being created
  4. Health check failures - Upstream monitoring cancels unhealthy connection attempts

Evidence 7: Framework's ConnectionPool uses except BaseException:

From livekit/agents/utils/connection_pool.py:88-92:

async def connection(self, *, timeout: float) -> AsyncGenerator[T, None]:
    conn = await self.get(timeout=timeout)
    try:
        yield conn
    except BaseException:  # <-- CORRECT: catches CancelledError
        self.remove(conn)
        raise
    else:
        self.put(conn)

The Cartesia plugin uses this framework pool (line 168 in cartesia/tts.py). Inworld's custom pool should follow the same pattern.

Evidence 8: Plugin already catches CancelledError elsewhere

Lines 620, 804, 1256 correctly catch asyncio.CancelledError. Line 732 is clearly an oversight.

Evidence 9: except Exception: vs except BaseException: is a known Python pitfall

Since Python 3.8, asyncio.CancelledError has inherited from BaseException rather than Exception. This is documented behavior that all async code must handle correctly.
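The pitfall is easy to verify directly in the interpreter:

```python
import asyncio

# Since Python 3.8, asyncio.CancelledError derives from BaseException directly,
# so a bare `except Exception:` clause will not catch it.
print(issubclass(asyncio.CancelledError, Exception))      # False
print(issubclass(asyncio.CancelledError, BaseException))  # True
```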

Evidence 10: _handle_connection_error does NOT help with leaked reservations

Reviewer Claim: "When ANY error occurs on a connection, _handle_connection_error is called and sets _closed = True"

FACT: _handle_connection_error is ONLY called when the connection actually encounters an error. A connection with leaked reservations but no active contexts:

  1. Still has a valid WebSocket connection
  2. Still has _send_task and _recv_task running normally
  3. Will NEVER call _handle_connection_error because nothing is wrong with the connection itself

The _handle_connection_error is called at lines 430 and 580 in the send/recv loops - but ONLY when those loops encounter errors. A connection with leaked reservations doesn't encounter errors; it just reports has_capacity=False and gets skipped by the pool router.

The connection appears "healthy" but unusable.

Evidence 11: Pool elasticity is NOT a fix - it's a workaround that compounds the problem

Reviewer Claim: "The pool creates new connections when existing ones appear at capacity"

FACT: This is a WORKAROUND, not a fix. Each new connection can also accumulate leaked reservations:

  1. Connection A: 5 cancellations → has_capacity=False
  2. Pool creates Connection B
  3. Connection B: 5 cancellations → has_capacity=False
  4. Pool creates Connection C
  5. ... continues until max_connections is reached

With max_connections = 20 and MAX_CONTEXTS = 5, after 100 cancellations distributed across connections, the entire pool is exhausted.

The pool's "elasticity" just delays the inevitable and consumes more resources.

Evidence 12: The cancellation window is reachable in real scenarios

Reviewer Claim: "The cancellation window is extremely narrow - during first WebSocket handshake on a fresh connection"

FACT: The window is NOT just during initial connection:

  1. Line 286 (await self.connect()) - Protected by _connect_lock, but the FIRST caller to each new connection WILL wait here. This is NOT a no-op for new connections.

  2. Line 300 (async with self._acquire_lock) - Multiple concurrent requests to the same connection wait here. If one request is inside the lock and a cancellation wave arrives, other waiters are cancelled WHILE WAITING for the lock.

  3. Line 329 (await asyncio.wait_for(self._context_available.wait(), timeout=remaining)) - If connection is at capacity, callers wait here. Cancellation during this wait also leaks the reservation.

The cancellation window exists EVERY TIME a new context is acquired, not just during initial connection.

Evidence 13: Framework-level cleanup via aclose() doesn't help for in-flight cancellations

Reviewer Claim: "When an AgentSession or TTS instance is closed, aclose() is called, which cancels all background tasks and closes connections"

FACT: aclose() is NOT called when a single request is cancelled. It's called when the ENTIRE TTS instance or session is shut down.

The bug occurs during normal operation when individual requests are cancelled (e.g., user interrupts, timeout expires). The TTS instance remains alive and continues using the pool with leaked reservations.

Evidence 14: The complete call stack shows NO ancestor catches CancelledError for the pool path

From SynthesizeStream._run() to pool.acquire_context():

SynthesizeStream._main_task() [tts.py:464]
  └── for retry loop [line 473]
        └── try: [line 475]
              └── await self._run(output_emitter) [line 479]
                    └── pool.acquire_context() [inworld/tts.py:1222]
                          └── await conn.acquire_context() [line 731]
                                └── await self.connect() [line 286] <- CANCELLATION POINT

The framework's _main_task at line 480 catches except Exception, NOT BaseException:

try:
    await self._run(output_emitter)
except Exception as e:  # <-- Does NOT catch CancelledError
    telemetry_utils.record_exception(attempt_span, e)
    raise

And the retry loop at line 500 catches except APIError, not CancelledError.

NO ancestor in the call stack catches CancelledError for this path. The CancelledError propagates all the way up and terminates the task without cleanup.

Why has this bug gone undetected?

  1. Cancellation is rare in tests - unit tests complete normally, and cancellation edge cases must be designed for explicitly.

  2. Impact is gradual - Each cancellation leaks one reservation. With MAX_CONTEXTS = 5 and max_connections = 20, the pool has 100 total capacity. The issue compounds slowly.

  3. Pool creates new connections as fallback - When existing connections appear at capacity, new connections are created, masking the problem until max_connections is hit.

  4. Symptom looks like timeout - Users see "Timed out waiting for available connection capacity" (line 760) with no obvious root cause.

  5. Plugin already catches CancelledError elsewhere - Lines 620, 804, and 1256 correctly catch CancelledError. This is an oversight, not a misunderstanding.

  6. except Exception: looks correct - Code review often misses that CancelledError is a BaseException.

Recommended fix

Change except Exception: to except BaseException: at line 732:

            if conn:
                try:
                    ctx_id, waiter = await conn.acquire_context(emitter, opts, remaining_timeout)
                except BaseException:  # <-- FIX: catches CancelledError
                    conn.release_reservation()
                    if created_new:
                        async with self._pool_lock:
                            if conn in self._connections:
                                self._connections.remove(conn)
                        await conn.aclose()
                    raise

This matches the framework's ConnectionPool pattern and the plugin's own pattern at lines 620, 804, 1256.
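A minimal regression-style sketch of the fixed pattern (a simplified stand-in class modeling only the reservation counter, not the plugin's code):

```python
import asyncio


class FixedPool:
    """Simplified stand-in; models only the reservation counter."""

    def __init__(self) -> None:
        self.pending_acquisitions = 0

    async def acquire(self) -> None:
        self.pending_acquisitions += 1  # reserve capacity
        try:
            await asyncio.sleep(10)  # stands in for conn.acquire_context()
        except BaseException:  # FIX: also catches CancelledError
            self.pending_acquisitions -= 1  # release the reservation on any failure path
            raise


async def demo() -> FixedPool:
    pool = FixedPool()
    task = asyncio.create_task(pool.acquire())
    await asyncio.sleep(0)  # let the task reach its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pool


pool = asyncio.run(demo())
print(pool.pending_acquisitions)  # 0: reservation released despite cancellation
```

Note that `try/finally` would be wrong here: on success the reservation is converted into an active context and must not be released, so the exception-only handler (re-raising after cleanup) is the right shape.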

Response to Round 7 Reviewer Objections

Objection 1: "Self-healing via _handle_connection_error"

Response: _handle_connection_error is ONLY called when the connection encounters an actual error (WebSocket failure, network error). A connection with leaked reservations:

  • Has a valid, healthy WebSocket
  • Has _send_task and _recv_task running normally
  • Will NEVER trigger _handle_connection_error because nothing is wrong with it

The connection is "healthy but unusable" - it reports has_capacity=False and gets skipped by the pool router, but never gets cleaned up because it never errors.

Objection 2: "Pool mitigation through connection creation"

Response: This is a workaround that COMPOUNDS the problem. Each new connection can also accumulate leaked reservations. After enough cancellations, ALL connections in the pool are affected and max_connections is reached.

The pool's elasticity doesn't fix the leak - it just delays exhaustion and wastes resources (more WebSocket connections, more memory).

Objection 3: "Cancellation window is extremely narrow"

Response: The window exists EVERY TIME a context is acquired:

  • Line 286: await self.connect() - First caller to each connection
  • Line 300: async with self._acquire_lock - Waiters for lock
  • Line 329: await asyncio.wait_for(...) - Waiters for capacity

The window is NOT just "during initial WebSocket handshake." Any network wait or lock acquisition is a cancellation point.

Objection 4: "No actual production evidence"

Response: This is a code inspection task. The bug exists regardless of whether users have reported symptoms. The "Timed out waiting for available connection capacity" error (line 760) is the symptom users would see, but they wouldn't know the root cause.

Production evidence would require:

  • Instrumentation to track _pending_acquisitions
  • Logging cancellations during connection acquisition
  • Monitoring capacity degradation over time

The code defect is clear from inspection.

Objection 5: "Framework-level cleanup via aclose()"

Response: aclose() is called when the ENTIRE TTS instance is shut down, NOT when individual requests are cancelled. Normal cancellation (user interrupt, timeout) does NOT trigger aclose().

The bug occurs during normal operation with individual request cancellations.

Objection 6: "Zero other plugins use except BaseException:"

Response: The framework's ConnectionPool at livekit/agents/utils/connection_pool.py:90 uses except BaseException:. This is the pattern that Cartesia and other plugins use via utils.ConnectionPool.

Inworld has a CUSTOM pool implementation that should follow the same pattern. The framework's pattern IS the correct precedent.

Objection 7: "Other CancelledError catches are at different abstraction levels"

Response: IRRELEVANT. The presence of correct CancelledError handling at lines 620, 804, and 1256 proves the developers know the pattern. Line 732 is an oversight.

History

This bug was introduced in commit dcc9c2f (@cshape, 2026-01-21, PR #4533). The commit added a new _ConnectionPool class with connection pooling infrastructure for high-concurrency TTS scenarios. The developer used except Exception: at line 732 to handle failures during conn.acquire_context(), but this doesn't catch asyncio.CancelledError in Python 3.8+ (where CancelledError inherits from BaseException, not Exception). The bug slipped in because the developer correctly handled CancelledError in other parts of the same commit (lines 620, 804), but missed this case in the pool's exception handler.

Metadata


    Labels

    bug (Something isn't working), detail
