[Detail Bug] Inworld TTS: Cancelling context acquisition leaks pool reservations and connections, exhausting capacity #5426

@detail-app

Description

Summary

  • Context: The Inworld TTS plugin uses a custom connection pool (_ConnectionPool) that manages multiple _InworldConnection instances. Each connection tracks pending acquisitions via _pending_acquisitions to reserve capacity before context creation completes.
  • Bug: _ConnectionPool.acquire_context() catches Exception but not BaseException, so asyncio.CancelledError bypasses cleanup. This leaks the _pending_acquisitions counter AND, for newly created connections, the connection object itself.
  • Actual vs. expected: When cancellation occurs during context acquisition, cleanup code is never executed. _pending_acquisitions is never decremented, and newly created connections are never removed from the pool or closed. All cleanup should happen even during cancellation.
  • Impact: For existing connections: phantom reservations accumulate, eventually causing capacity exhaustion. For newly created connections: the connection is leaked (never closed, remains in pool with leaked counter).

Impact Details

Primary Impact: Reservation Leak Causes Capacity Exhaustion

When cancellation occurs during pool.acquire_context():

  1. _pending_acquisitions leaks - Each leaked reservation permanently reduces available capacity
  2. has_capacity returns False prematurely - At line 704, the pool skips connections with phantom reservations
  3. is_idle returns False - Connection appears busy even with no active contexts

The pool uses has_capacity at line 704 to route requests:

# Line 702-706: Pool routing logic
for existing in self._connections:
    if not existing._closed and existing.has_capacity:  # <-- Uses has_capacity
        conn = existing
        break

With leaked reservations:

  • has_capacity = (context_count + _pending_acquisitions) < MAX_CONTEXTS
  • If _pending_acquisitions = 5 (after 5 cancellations): has_capacity = (0 + 5) < 5 = False

After 5 cancellations on a connection, that connection becomes permanently unusable despite having no actual contexts.
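The accumulation can be reproduced with a minimal, self-contained sketch. `LeakyPool` and its fields are illustrative stand-ins for the plugin's classes, not the actual code; only the reservation counter and the faulty `except Exception:` are modeled:

```python
import asyncio


class LeakyPool:
    """Illustrative stand-in for the plugin's connection, not the real class."""

    MAX_CONTEXTS = 5

    def __init__(self) -> None:
        self.context_count = 0
        self.pending_acquisitions = 0

    @property
    def has_capacity(self) -> bool:
        return (self.context_count + self.pending_acquisitions) < self.MAX_CONTEXTS

    async def acquire(self) -> None:
        self.pending_acquisitions += 1  # reserve capacity up front
        try:
            await asyncio.sleep(10)  # stands in for the network wait in connect()
        except Exception:  # BUG: CancelledError derives from BaseException, so it slips past
            self.pending_acquisitions -= 1
            raise


async def demo() -> LeakyPool:
    pool = LeakyPool()
    for _ in range(5):
        task = asyncio.create_task(pool.acquire())
        await asyncio.sleep(0)  # let the task reach its await point
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    return pool


pool = asyncio.run(demo())
print(pool.pending_acquisitions, pool.has_capacity)  # 5 False
```

Five cancellations leave `pending_acquisitions` at 5 with zero real contexts, and `has_capacity` flips to `False` permanently.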

Secondary Impact: Idle Cleanup Is BLOCKED (Not Enabled)

# Line 777-782: Idle connection cleanup
if (
    conn.is_idle  # <-- FALSE when _pending_acquisitions > 0
    and now - conn.last_activity > self._idle_timeout
    and len(self._connections) - len(connections_to_close) > 1
):
    connections_to_close.append(conn)

A connection with leaked reservations has is_idle = False (line 239: return self.context_count == 0 and self._pending_acquisitions == 0). The leaked reservations PREVENT cleanup, not enable it.

Tertiary Impact: Connection Leak for created_new=True

When a NEW connection is created and then cancelled during acquire_context():

  1. Connection is added to pool at line 718
  2. reserve_capacity() called at line 726
  3. CancelledError during await conn.acquire_context() at line 731
  4. Connection cleanup (lines 736-740) is SKIPPED because except Exception: doesn't catch CancelledError
  5. Connection remains in pool with leaked _pending_acquisitions
  6. Connection is never closed (WebSocket connection leaked)

Code with bug

# In _ConnectionPool.acquire_context (lines 729-741)
            if conn:
                try:
                    ctx_id, waiter = await conn.acquire_context(emitter, opts, remaining_timeout)
                except Exception:  # <-- BUG: CancelledError (a BaseException) is NOT caught
                    # Release reservation since we didn't get a context
                    conn.release_reservation()
                    # Remove failed new connection from pool
                    if created_new:
                        async with self._pool_lock:
                            if conn in self._connections:
                                self._connections.remove(conn)
                        await conn.aclose()
                    raise

Evidence

Evidence 1: Caller's handler at line 1256 CANNOT help (CRITICAL)

Reviewer Claim: "The caller already handles CancelledError at line 1256"

FACT: The caller's handler is AFTER the pool call. When cancellation occurs INSIDE pool.acquire_context(), the handler is never reached.

The code flow is:

# Line 1221-1258 in SynthesizeStream._run()
pool = await self._tts._get_pool()
context_id, waiter, connection = await pool.acquire_context(...)  # Line 1222 - CANCELLATION CAN OCCUR HERE

# ... code ...

try:
    await asyncio.wait_for(waiter, timeout=self._conn_options.timeout + 60)  # Line 1251
except asyncio.TimeoutError:
    connection.close_context(context_id)
    raise APITimeoutError() from None
except asyncio.CancelledError:  # Line 1256 - HANDLER IS AFTER pool.acquire_context()
    connection.close_context(context_id)
    raise

When cancellation occurs INSIDE pool.acquire_context() (at line 1222), the try block at line 1251 hasn't been entered yet, so the handler at line 1256 NEVER EXECUTES.

Test at /home/user/agents/test_caller_handler_prove.py proves this definitively.

The caller's handler can only help if cancellation occurs AFTER pool.acquire_context() returns, not during.
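The control flow can be sketched in a few lines (all names here are hypothetical stand-ins, not the plugin's code): when cancellation lands inside the awaited call, a handler in a try block that comes later is simply never reached:

```python
import asyncio


async def acquire_context() -> str:
    await asyncio.sleep(10)  # cancellation lands HERE, before the caller's try block
    return "ctx"


handler_ran = False


async def caller() -> None:
    global handler_ran
    ctx = await acquire_context()  # CancelledError raised here
    try:
        await asyncio.sleep(10)  # stands in for awaiting the waiter
    except asyncio.CancelledError:  # analogous to the handler at line 1256
        handler_ran = True
        raise


async def demo() -> None:
    task = asyncio.create_task(caller())
    await asyncio.sleep(0)  # let caller() reach the await inside acquire_context()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(demo())
print(handler_ran)  # False: the handler after the call never executed
```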

Evidence 2: Integration test using ACTUAL _ConnectionPool and _InworldConnection classes

Test at /home/user/agents/test_real_pool_integration.py:

  • Uses real _ConnectionPool and _InworldConnection classes (not mocks)
  • Demonstrates leak when cancelled DURING connect() at line 286

Evidence 3: Capacity exhaustion demonstrated

Same test shows accumulation:

MAX_CONTEXTS = 5
Initial: _pending_acquisitions=0, has_capacity=True
  Cancellation #1: _pending_acquisitions=1, has_capacity=True
  Cancellation #2: _pending_acquisitions=2, has_capacity=True
  Cancellation #3: _pending_acquisitions=3, has_capacity=True
  Cancellation #4: _pending_acquisitions=4, has_capacity=True
  Cancellation #5: _pending_acquisitions=5, has_capacity=False

After 5 cancellations:
  _pending_acquisitions: 5
  context_count: 0
  has_capacity: False
  is_idle: False

Evidence 4: Leaked reservations BLOCK cleanup, not enable it

# Line 239: is_idle definition
@property
def is_idle(self) -> bool:
    return self.context_count == 0 and self._pending_acquisitions == 0

# Line 779: Cleanup condition
if conn.is_idle  # <-- FALSE when _pending_acquisitions > 0

A connection with leaked reservations has is_idle = False, so it will NEVER be cleaned up by the idle connection pruner.

Evidence 5: Cancellation timing analysis of REAL code

The cancellation window in the REAL _InworldConnection.acquire_context() (lines 274-337):

# Line 286: await self.connect()              <- CANCELLATION POINT #1 (network I/O)
# Line 300: async with self._acquire_lock:    <- CANCELLATION POINT #2 (waiting for lock)
# Line 317: self.release_reservation()       <- BUG WINDOW ENDS HERE
# Line 319: await self._outbound_queue.put() <- SAFE (reservation already released)

Evidence 6: Cancellation during connect() IS realistic in production

Network operations are the MOST COMMON cancellation points in production:

  1. Agent session termination - User closes browser/app while connection is being established
  2. Timeout race conditions - asyncio.wait_for() timeout expires during connect()
  3. Graceful shutdown - Server receives SIGTERM while connections are being created
  4. Health check failures - Upstream monitoring cancels unhealthy connection attempts

Evidence 7: Framework's ConnectionPool uses except BaseException:

From livekit/agents/utils/connection_pool.py:88-92:

async def connection(self, *, timeout: float) -> AsyncGenerator[T, None]:
    conn = await self.get(timeout=timeout)
    try:
        yield conn
    except BaseException:  # <-- CORRECT: catches CancelledError
        self.remove(conn)
        raise
    else:
        self.put(conn)

The Cartesia plugin uses this framework pool (line 168 in cartesia/tts.py). Inworld's custom pool should follow the same pattern.

Evidence 8: Plugin already catches CancelledError elsewhere

Lines 620, 804, 1256 correctly catch asyncio.CancelledError. Line 732 is clearly an oversight.

Evidence 9: except Exception: vs except BaseException: is a known Python pitfall

Since Python 3.8, asyncio.CancelledError has inherited from BaseException rather than Exception. This is documented behavior that all async code must handle correctly.
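The pitfall is easy to verify directly in the interpreter:

```python
import asyncio

# Since Python 3.8, asyncio.CancelledError derives from BaseException directly,
# so a bare `except Exception:` clause will not catch it.
print(issubclass(asyncio.CancelledError, Exception))      # False
print(issubclass(asyncio.CancelledError, BaseException))  # True
```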

Evidence 10: _handle_connection_error does NOT help with leaked reservations

Reviewer Claim: "When ANY error occurs on a connection, _handle_connection_error is called and sets _closed = True"

FACT: _handle_connection_error is ONLY called when the connection actually encounters an error. A connection with leaked reservations but no active contexts:

  1. Still has a valid WebSocket connection
  2. Still has _send_task and _recv_task running normally
  3. Will NEVER call _handle_connection_error because nothing is wrong with the connection itself

The _handle_connection_error is called at lines 430 and 580 in the send/recv loops - but ONLY when those loops encounter errors. A connection with leaked reservations doesn't encounter errors; it just reports has_capacity=False and gets skipped by the pool router.

The connection appears "healthy" but unusable.

Evidence 11: Pool elasticity is NOT a fix - it's a workaround that compounds the problem

Reviewer Claim: "The pool creates new connections when existing ones appear at capacity"

FACT: This is a WORKAROUND, not a fix. Each new connection can also accumulate leaked reservations:

  1. Connection A: 5 cancellations → has_capacity=False
  2. Pool creates Connection B
  3. Connection B: 5 cancellations → has_capacity=False
  4. Pool creates Connection C
  5. ... continues until max_connections is reached

With max_connections = 20 and MAX_CONTEXTS = 5, after 100 cancellations distributed across connections, the entire pool is exhausted.

The pool's "elasticity" just delays the inevitable and consumes more resources.

Evidence 12: The cancellation window is reachable in real scenarios

Reviewer Claim: "The cancellation window is extremely narrow - during first WebSocket handshake on a fresh connection"

FACT: The window is NOT just during initial connection:

  1. Line 286 (await self.connect()) - Protected by _connect_lock, but the FIRST caller to each new connection WILL wait here. This is NOT a no-op for new connections.

  2. Line 300 (async with self._acquire_lock) - Multiple concurrent requests to the same connection wait here. If one request is inside the lock and a cancellation wave arrives, other waiters are cancelled WHILE WAITING for the lock.

  3. Line 329 (await asyncio.wait_for(self._context_available.wait(), timeout=remaining)) - If connection is at capacity, callers wait here. Cancellation during this wait also leaks the reservation.

The cancellation window exists EVERY TIME a new context is acquired, not just during initial connection.

Evidence 13: Framework-level cleanup via aclose() doesn't help for in-flight cancellations

Reviewer Claim: "When an AgentSession or TTS instance is closed, aclose() is called, which cancels all background tasks and closes connections"

FACT: aclose() is NOT called when a single request is cancelled. It's called when the ENTIRE TTS instance or session is shut down.

The bug occurs during normal operation when individual requests are cancelled (e.g., user interrupts, timeout expires). The TTS instance remains alive and continues using the pool with leaked reservations.

Evidence 14: The complete call stack shows NO ancestor catches CancelledError for the pool path

From SynthesizeStream._run() to pool.acquire_context():

SynthesizeStream._main_task() [tts.py:464]
  └── for retry loop [line 473]
        └── try: [line 475]
              └── await self._run(output_emitter) [line 479]
                    └── pool.acquire_context() [inworld/tts.py:1222]
                          └── await conn.acquire_context() [line 731]
                                └── await self.connect() [line 286] <- CANCELLATION POINT

The framework's _main_task at line 480 catches except Exception, NOT BaseException:

try:
    await self._run(output_emitter)
except Exception as e:  # <-- Does NOT catch CancelledError
    telemetry_utils.record_exception(attempt_span, e)
    raise

And the retry loop at line 500 catches except APIError, not CancelledError.

NO ancestor in the call stack catches CancelledError for this path. The CancelledError propagates all the way up and terminates the task without cleanup.

Why has this bug gone undetected?

  1. Cancellation is rare in tests - unit tests complete normally, and cancellation edge cases must be designed for explicitly.

  2. Impact is gradual - Each cancellation leaks one reservation. With MAX_CONTEXTS = 5 and max_connections = 20, the pool has 100 total capacity. The issue compounds slowly.

  3. Pool creates new connections as fallback - When existing connections appear at capacity, new connections are created, masking the problem until max_connections is hit.

  4. Symptom looks like timeout - Users see "Timed out waiting for available connection capacity" (line 760) with no obvious root cause.

  5. Plugin already catches CancelledError elsewhere - Lines 620, 804, and 1256 correctly catch CancelledError. This is an oversight, not a misunderstanding.

  6. except Exception: looks correct - Code review often misses that CancelledError is a BaseException.

Recommended fix

Change except Exception: to except BaseException: at line 732:

            if conn:
                try:
                    ctx_id, waiter = await conn.acquire_context(emitter, opts, remaining_timeout)
                except BaseException:  # <-- FIX: catches CancelledError
                    conn.release_reservation()
                    if created_new:
                        async with self._pool_lock:
                            if conn in self._connections:
                                self._connections.remove(conn)
                        await conn.aclose()
                    raise

This matches the framework's ConnectionPool pattern and the plugin's own pattern at lines 620, 804, 1256.
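A minimal regression-style sketch of the fixed pattern (a simplified stand-in class modeling only the reservation counter, not the plugin's code):

```python
import asyncio


class FixedPool:
    """Simplified stand-in; models only the reservation counter."""

    def __init__(self) -> None:
        self.pending_acquisitions = 0

    async def acquire(self) -> None:
        self.pending_acquisitions += 1  # reserve capacity
        try:
            await asyncio.sleep(10)  # stands in for conn.acquire_context()
        except BaseException:  # FIX: also catches CancelledError
            self.pending_acquisitions -= 1  # release the reservation on any failure path
            raise


async def demo() -> FixedPool:
    pool = FixedPool()
    task = asyncio.create_task(pool.acquire())
    await asyncio.sleep(0)  # let the task reach its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pool


pool = asyncio.run(demo())
print(pool.pending_acquisitions)  # 0: reservation released despite cancellation
```

Note that `try/finally` would be wrong here: on success the reservation is converted into an active context and must not be released, so the exception-only handler (re-raising after cleanup) is the right shape.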

Response to Round 7 Reviewer Objections

Objection 1: "Self-healing via _handle_connection_error"

Response: _handle_connection_error is ONLY called when the connection encounters an actual error (WebSocket failure, network error). A connection with leaked reservations:

  • Has a valid, healthy WebSocket
  • Has _send_task and _recv_task running normally
  • Will NEVER trigger _handle_connection_error because nothing is wrong with it

The connection is "healthy but unusable" - it reports has_capacity=False and gets skipped by the pool router, but never gets cleaned up because it never errors.

Objection 2: "Pool mitigation through connection creation"

Response: This is a workaround that COMPOUNDS the problem. Each new connection can also accumulate leaked reservations. After enough cancellations, ALL connections in the pool are affected and max_connections is reached.

The pool's elasticity doesn't fix the leak - it just delays exhaustion and wastes resources (more WebSocket connections, more memory).

Objection 3: "Cancellation window is extremely narrow"

Response: The window exists EVERY TIME a context is acquired:

  • Line 286: await self.connect() - First caller to each connection
  • Line 300: async with self._acquire_lock - Waiters for lock
  • Line 329: await asyncio.wait_for(...) - Waiters for capacity

The window is NOT just "during initial WebSocket handshake." Any network wait or lock acquisition is a cancellation point.

Objection 4: "No actual production evidence"

Response: This is a code inspection task. The bug exists regardless of whether users have reported symptoms. The "Timed out waiting for available connection capacity" error (line 760) is the symptom users would see, but they wouldn't know the root cause.

Production evidence would require:

  • Instrumentation to track _pending_acquisitions
  • Logging cancellations during connection acquisition
  • Monitoring capacity degradation over time

The code defect is clear from inspection.

Objection 5: "Framework-level cleanup via aclose()"

Response: aclose() is called when the ENTIRE TTS instance is shut down, NOT when individual requests are cancelled. Normal cancellation (user interrupt, timeout) does NOT trigger aclose().

The bug occurs during normal operation with individual request cancellations.

Objection 6: "Zero other plugins use except BaseException:"

Response: The framework's ConnectionPool at livekit/agents/utils/connection_pool.py:90 uses except BaseException:. This is the pattern that Cartesia and other plugins use via utils.ConnectionPool.

Inworld has a CUSTOM pool implementation that should follow the same pattern. The framework's pattern IS the correct precedent.

Objection 7: "Other CancelledError catches are at different abstraction levels"

Response: IRRELEVANT. The presence of correct CancelledError handling at lines 620, 804, and 1256 proves the developers know the pattern. Line 732 is an oversight.

History

This bug was introduced in commit dcc9c2f (@cshape, 2026-01-21, PR #4533). The commit added a new _ConnectionPool class with connection pooling infrastructure for high-concurrency TTS scenarios. The developer used except Exception: at line 732 to handle failures during conn.acquire_context(), but this doesn't catch asyncio.CancelledError in Python 3.8+ (where CancelledError inherits from BaseException, not Exception). The bug slipped in because the developer correctly handled CancelledError in other parts of the same commit (lines 620, 804), but missed this case in the pool's exception handler.

Metadata


    Labels

    bug (Something isn't working), detail
