Fix terminal stability — WebSocket reconnection, resize debounce, control frame filtering #1195

@jrf0110

Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

The xterm.js terminal in the Gastown dashboard has three recurring rendering/stability issues that degrade the experience significantly:

  1. UI becomes completely unresponsive — blinking cursor, no keyboard input accepted, requires a full page refresh
  2. "Blacked out" artifacts — certain row/col coordinates render as blank/corrupt cells, making text unreadable
  3. Raw JSON overlay in chat box: {"cursor":<number>} text appears overlaid in the terminal, interfering with the TUI's own rendering

Issues 2 and 3 can be temporarily fixed by running a command that refreshes the TUI (e.g., /status). Issue 1 requires a page refresh.

Root Causes

Issue 1: Unresponsive terminal — WebSocket disconnects with no reconnection

Files: useXtermPty.ts:187-191, TerminalBar.tsx:779-783

When the PTY WebSocket closes (container restart, network blip, container sleep after 30min idle), the handler sets connected=false but never attempts to reconnect. The xterm instance stays mounted with cursorBlink: true (hence the blinking cursor), but term.onData silently drops keystrokes because ws.readyState !== WebSocket.OPEN.

The alarm status WebSocket (useAlarmStatusWs at TerminalBar.tsx:340) does have 3-second reconnection logic. The PTY WebSockets do not.

Contributing factors:

  • TownContainerDO.sleepAfter = '30m' — container sleeping kills all WebSocket connections
  • If the Mayor agent restarts (new session), the PTY session ID becomes stale but the terminal component doesn't detect this unless the mayorAgentId changes
  • No visual indicator that the WebSocket is disconnected — the user just sees a dead terminal

Issue 2: Blacked out artifacts — PTY/xterm dimension mismatch during resize

Files: useXtermPty.ts:117-126, TerminalBar.tsx:796

ResizeObserver calls fitAddon.fit() with no debounce. During CSS transitions (sidebar expand/collapse, terminal bar resize), this fires many rapid resize events. Each triggers:

  1. fitAddon.fit(): synchronously resizes xterm's viewport (instant)
  2. resizePtySession tRPC mutation: async, browser → tRPC → Worker → TownContainerDO → Container → SDK PUT /pty/:id → process.resize(cols, rows) (network latency)

Between steps 1 and 2, the TUI renders to the old PTY dimensions while xterm expects the new dimensions. Cells that the TUI didn't repaint appear as "blacked out" — the xterm viewport grew but the TUI hasn't redrawn those cells yet.

The setTimeout(() => fit(), 50) in TerminalBar.tsx:822-823 partially mitigates this for sidebar changes but not for ResizeObserver events.

Issue 3: {"cursor":N} JSON overlay — missing control frame filter

Files: useXtermPty.ts:170-184, TerminalBar.tsx:763-776

The Kilo SDK's PTY module sends cursor metadata as a binary WebSocket frame with a 0x00 prefix byte (pty/index.ts:27-34):

// SDK sends: [0x00, ...JSON.stringify({cursor: N})]

The native Kilo desktop app correctly filters these (packages/app/src/components/terminal.tsx:472-486):

if (bytes[0] !== 0) return; // Not a control frame — skip
const json = decoder.decode(bytes.subarray(1)); // Parse metadata

The Gastown browser code does not check for the 0x00 prefix. It writes all binary data directly to xterm:

if (e.data instanceof ArrayBuffer) {
  term.write(new Uint8Array(e.data)); // ← No 0x00 check!
}

The NUL byte is ignored by xterm.js, but {"cursor":123} renders as visible text in the terminal viewport, overlapping with the TUI's own output.

A secondary path exists for when the proxy chain converts the binary frame to a string. The filter at line 175 (e.data.startsWith('{') + JSON.parse) catches this case but is fragile — leading whitespace or a preserved NUL byte would bypass it.

Fixes

Fix 1: WebSocket reconnection with exponential backoff

Add reconnection logic to PTY WebSockets, building on the reconnect-on-close pattern already used by useAlarmStatusWs (which retries after 3 seconds):

  • On ws.onclose: if not intentionally closed, attempt reconnection after 1s → 2s → 4s → 8s (capped)
  • Before reconnecting, check if the PTY session still exists (GET /agents/:id/pty/:ptyId/status)
  • If the PTY session is gone (container restarted), create a new PTY session and reconnect
  • Show a visual "Reconnecting..." indicator in the terminal bar
  • If reconnection fails after N attempts, show "Connection lost — click to reconnect" with a manual retry button

Apply to both useXtermPty.ts and the duplicated logic in TerminalBar.tsx (or deduplicate — see Fix 5).
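
The retry schedule in the list above can be sketched as a small, transport-agnostic helper. This is only an illustration under stated assumptions: the names (backoffDelayMs, createReconnector, ReconnectDeps) and the maxAttempts value are hypothetical and do not come from the Gastown code, and the PTY session status check is assumed to live inside the injected connect callback.

```typescript
// All names here are hypothetical; none are taken from the Gastown codebase.
type ReconnectOptions = {
  baseDelayMs: number; // first retry delay (1s in the list above)
  maxDelayMs: number;  // cap on the delay (8s in the list above)
  maxAttempts: number; // "N" in the list above; the concrete value is an assumption
};

const DEFAULT_OPTIONS: ReconnectOptions = {
  baseDelayMs: 1000,
  maxDelayMs: 8000,
  maxAttempts: 6,
};

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, 8s, ...
function backoffDelayMs(attempt: number, opts: ReconnectOptions = DEFAULT_OPTIONS): number {
  return Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs);
}

type ReconnectDeps = {
  connect: () => void;                                  // opens a fresh WebSocket (and checks the PTY session first)
  schedule: (fn: () => void, delayMs: number) => void;  // setTimeout in the browser
  onGiveUp: () => void;                                 // switch the UI to the manual retry state
};

function createReconnector(deps: ReconnectDeps, opts: ReconnectOptions = DEFAULT_OPTIONS) {
  let attempt = 0;
  let closedIntentionally = false;
  return {
    // Call from ws.onopen: a successful connection resets the schedule.
    handleOpen(): void {
      attempt = 0;
    },
    // Call from ws.onclose: schedule the next retry unless we closed on purpose.
    handleClose(): void {
      if (closedIntentionally) return;
      if (attempt >= opts.maxAttempts) {
        deps.onGiveUp();
        return;
      }
      const delay = backoffDelayMs(attempt, opts);
      attempt += 1;
      deps.schedule(() => {
        if (!closedIntentionally) deps.connect();
      }, delay);
    },
    // Call on unmount so a pending retry does not reopen the socket.
    dispose(): void {
      closedIntentionally = true;
    },
  };
}
```

Injecting connect and schedule keeps the backoff logic testable without a real socket, and would let both terminal panes share it.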

Fix 2: Debounce resize events

Wrap the ResizeObserver callback and fitAddon.fit() calls with a debounce (e.g., 150ms). Only send the resizePtySession tRPC mutation after the debounce settles. This prevents resize storms during CSS transitions and ensures the PTY gets a single, final resize rather than dozens of intermediate ones.

Additionally, after sending the resize mutation, wait for it to complete before allowing the next resize. This ensures the PTY dimensions and xterm dimensions stay in sync.
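
One possible shape for the debounce plus in-flight serialization described above is sketched here. The names (createResizeScheduler, ResizeDeps) are hypothetical, and fit and sendResize are assumed to be thin wrappers around fitAddon.fit() and the resizePtySession mutation; the timers are injected so the logic does not depend on the browser.

```typescript
// Hypothetical helper; not taken from useXtermPty.ts or TerminalBar.tsx.
type Size = { cols: number; rows: number };

type ResizeDeps = {
  fit: () => Size;                                   // assumed to call fitAddon.fit() and return the new cols/rows
  sendResize: (size: Size) => Promise<void>;         // assumed to wrap the resizePtySession mutation
  setTimer: (fn: () => void, ms: number) => unknown; // setTimeout in the browser
  clearTimer: (handle: unknown) => void;             // clearTimeout in the browser
};

function createResizeScheduler(deps: ResizeDeps, delayMs = 150) {
  let timer: unknown = null;
  let inFlight = false;
  let dirty = false;

  async function flush(): Promise<void> {
    // If a resize is already travelling to the PTY, remember to run once more afterwards.
    if (inFlight) {
      dirty = true;
      return;
    }
    inFlight = true;
    try {
      do {
        dirty = false;
        // Fit and send together so xterm and the PTY receive the same size.
        await deps.sendResize(deps.fit());
      } while (dirty);
    } catch {
      // A failed mutation is left for the next resize event to correct.
    } finally {
      inFlight = false;
    }
  }

  return {
    // Call from the ResizeObserver callback; only the last call in a burst triggers a fit.
    requestResize(): void {
      if (timer !== null) deps.clearTimer(timer);
      timer = deps.setTimer(() => {
        timer = null;
        void flush();
      }, delayMs);
    },
    // Call on unmount to drop a pending fit.
    cancel(): void {
      if (timer !== null) {
        deps.clearTimer(timer);
        timer = null;
      }
    },
  };
}
```

Deferring fit() until the debounce settles means xterm keeps its old dimensions during a CSS transition, which is the trade-off this sketch makes to keep both sides in step.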

Fix 3: Filter 0x00 control frames in the WebSocket message handler

Before writing binary data to xterm, check for the control frame prefix:

if (e.data instanceof ArrayBuffer) {
  const bytes = new Uint8Array(e.data);
  if (bytes.length > 0 && bytes[0] === 0) {
    // Control frame — parse metadata, don't write to terminal
    return;
  }
  term.write(bytes);
}

This matches the native Kilo app's implementation at packages/app/src/components/terminal.tsx:472-486.
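
To cover the secondary string path as well, the check could be pulled into one classifier that both terminal panes call before writing to xterm. The sketch below is an assumption-laden illustration: classifyPtyFrame and its helpers are hypothetical names, and treating only objects with a numeric cursor field as control data on the string path is a design choice of this sketch, not something the SDK documents.

```typescript
// Hypothetical helper; names are illustrative and not from the Gastown or Kilo code.
type PtyFrame =
  | { kind: "output"; data: Uint8Array | string } // safe to pass to term.write
  | { kind: "control"; meta: unknown };           // metadata such as {cursor: N}

const controlDecoder = new TextDecoder();

function tryParseJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

function isCursorMeta(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { cursor?: unknown }).cursor === "number"
  );
}

function classifyPtyFrame(data: ArrayBufferLike | string): PtyFrame {
  if (typeof data !== "string") {
    const bytes = new Uint8Array(data);
    // Binary path: a leading 0x00 byte marks a control frame.
    if (bytes.length > 0 && bytes[0] === 0) {
      return { kind: "control", meta: tryParseJson(controlDecoder.decode(bytes.subarray(1))) };
    }
    return { kind: "output", data: bytes };
  }
  // String path: the proxy chain may have converted the frame to text.
  // Tolerate a preserved NUL prefix and leading whitespace.
  const hadNul = data.startsWith("\u0000");
  const candidate = (hadNul ? data.slice(1) : data).trimStart();
  const parsed = tryParseJson(candidate);
  if (hadNul || isCursorMeta(parsed)) {
    return { kind: "control", meta: parsed };
  }
  return { kind: "output", data };
}
```

In the message handler, only frames with kind "output" would be passed to term.write; control frames can be ignored or used to track the cursor.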

Fix 4: Visual connection status indicator

Add a small status badge to the terminal bar showing WebSocket state:

  • Connected (green dot — default, unobtrusive)
  • Reconnecting (yellow dot + "Reconnecting...")
  • Disconnected (red dot + "Connection lost — click to reconnect")

This gives users immediate feedback instead of a mysteriously dead terminal.
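
The three badge states can be driven by a small reducer over WebSocket lifecycle events. This is a sketch only: the event names and the nextStatus function are hypothetical, and the rule that a close after giving up stays "disconnected" is an assumption about the intended behaviour.

```typescript
// Hypothetical state model for the badge; not taken from TerminalBar.tsx.
type ConnectionStatus = "connected" | "reconnecting" | "disconnected";
type ConnectionEvent = "open" | "close" | "retriesExhausted" | "manualRetry";

function nextStatus(current: ConnectionStatus, event: ConnectionEvent): ConnectionStatus {
  switch (event) {
    case "open":
      return "connected";
    case "close":
      // After giving up, a stray close event must not restart the "Reconnecting..." state.
      return current === "disconnected" ? "disconnected" : "reconnecting";
    case "retriesExhausted":
      return "disconnected";
    case "manualRetry":
      return "reconnecting";
  }
}

// Dot colours as listed above.
const DOT_COLOR: Record<ConnectionStatus, "green" | "yellow" | "red"> = {
  connected: "green",
  reconnecting: "yellow",
  disconnected: "red",
};
```

Keeping the transitions in a pure function means the same status logic can back the badge in every terminal pane.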

Fix 5: Deduplicate terminal setup code

MayorTerminalPane in TerminalBar.tsx:614-836 duplicates the entire xterm/WebSocket/resize setup from useXtermPty.ts. This means every fix needs to be applied in two places. Extract the shared logic into useXtermPty (or a new shared hook) and use it from both MayorTerminalPane and AgentTerminalPane. This is not just cleanup — the duplicated code paths can diverge, and any fix applied to one but not the other creates a new inconsistency.

Acceptance Criteria

  • PTY WebSockets reconnect automatically on disconnect with exponential backoff
  • New PTY session created on reconnect if the previous one is gone (container restart)
  • Resize events debounced — no resize storms during CSS transitions
  • 0x00 control frame prefix checked on all binary WebSocket messages — {"cursor":N} never written to xterm
  • Visual connection status indicator in terminal bar (connected/reconnecting/disconnected)
  • Terminal setup code deduplicated between MayorTerminalPane, AgentTerminalPane, and useXtermPty
  • Container sleep/wake doesn't permanently break the terminal (reconnection recovers)

Notes

  • The {"cursor":N} fix (Fix 3) is a few-line change with a clear reference implementation in the native Kilo app
  • The WebSocket reconnection (Fix 1) is the highest-impact fix — it addresses the most annoying symptom (permanent freeze requiring page refresh)
  • The dimension mismatch behind Issue 2 (addressed by Fix 2) explains why running /status fixes the rendering: the TUI modal redraws the full viewport, overwriting the stale cells
  • The code deduplication (Fix 5) should be done first to avoid applying fixes in two places

Metadata

Assignees

No one assigned

Labels

kilo-duplicate, kilo-triaged (both auto-generated by Kilo)
