bug: isPaneRunning swallows all errors — can cascade-fail entire plan #12

@nigel-dev

Description

isPaneRunning() in tmux.ts catches all errors and returns false, making it impossible to distinguish between "pane is not running" and "tmux command failed." A transient tmux issue (e.g., server restart, socket error) will cause every job to be marked as dead simultaneously.

Steps to Reproduce

  1. Start a plan with multiple running jobs.
  2. Cause a transient tmux failure (e.g., temporarily kill the tmux server, network issue on remote tmux).
  3. The monitor polls isPaneRunning() for each job — all return false.
  4. All jobs are marked failed simultaneously.

Expected Behavior

  • If tmux itself is unreachable, the error should be surfaced (or retried) rather than interpreted as "all panes are dead."
  • Only return false when tmux successfully confirms the pane doesn't exist.
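One way to express that contract is a three-way result instead of a boolean. This is a minimal sketch, not the project's code: `PaneCheck`, `interpretPaneCheck`, and the stderr pattern are hypothetical, and it assumes the underlying tmux command exits with 0 when the target pane exists. The `can't find ...` wording is what tmux typically prints for a missing target, but it may differ between tmux versions.

```typescript
// Hypothetical three-way result: "unknown" means tmux could not answer.
type PaneCheck = "alive" | "dead" | "unknown";

// Assumed stderr wording for a missing target; verify against your tmux version.
const MISSING_TARGET = /can't find (pane|window|session)/i;

// Pure helper: maps a finished tmux invocation to a PaneCheck.
// exitCode is null when the process could not be spawned at all.
function interpretPaneCheck(exitCode: number | null, stderr: string): PaneCheck {
  if (exitCode === 0) return "alive";
  if (exitCode !== null && MISSING_TARGET.test(stderr)) return "dead";
  // Server down, socket error, permission error, spawn failure, and so on.
  return "unknown";
}
```

Callers would treat only "dead" as grounds for failing a job; "unknown" leaves the job state untouched until tmux can be queried again.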

Actual Behavior

// src/lib/tmux.ts:314-316
} catch {
  return false;
}

Any error (including tmux server not found, permission errors, socket errors) silently returns false.

Cascade Impact

In resumePlan() at src/lib/orchestrator.ts:851-863:

for (const runningJob of runningJobs) {
  const paneAlive = await isPaneRunning(runningJob.tmuxTarget);
  if (!paneAlive) {
    // Marks job as failed, then fails entire plan
  }
}

A single tmux hiccup → all jobs failed → entire plan failed.
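A sketch of how the loop could avoid the cascade, assuming isPaneRunning() is changed to throw when tmux itself fails. That behavior, the `Job` shape, and the `findDeadJobs` name are all hypothetical. The first tmux-level error aborts the whole pass instead of being counted as a dead pane, so no job is failed on incomplete information.

```typescript
// Hypothetical minimal job shape; the real type lives in the orchestrator.
interface Job {
  id: string;
  tmuxTarget: string;
}

// Returns the jobs whose panes tmux confirmed dead.
// If checkPane throws (tmux unreachable), the error propagates and
// no job is reported dead from this pass.
async function findDeadJobs(
  runningJobs: Job[],
  checkPane: (target: string) => Promise<boolean>,
): Promise<Job[]> {
  const dead: Job[] = [];
  for (const job of runningJobs) {
    const paneAlive = await checkPane(job.tmuxTarget); // may throw
    if (!paneAlive) dead.push(job);
  }
  return dead;
}
```

The caller would mark jobs failed only after findDeadJobs() resolves; on rejection it can log the error and try again on the next poll.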

Proposed Fix

  1. Distinguish error types in the catch block:
    • tmux returns non-zero + "no pane" message → return false (pane genuinely dead)
    • tmux returns non-zero + other error → throw or return an error result
    • tmux command fails to execute → throw
  2. Consider adding retry logic (1-2 retries with backoff) before declaring a pane dead.
  3. Add an isTmuxHealthy() pre-check before bulk-checking panes.
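Items 2 and 3 could look roughly like this. It is a sketch under assumptions: `checkWithRetry` and the injected `runTmux` runner are hypothetical names, the retry counts and delays are placeholders, and `list-sessions` is used only as a cheap probe of whether the tmux server answers. It also assumes the pane check throws on tmux-level failures, so only thrown errors are retried and a confirmed true or false is trusted immediately.

```typescript
// Hypothetical helper: retries `check` while it throws, with linear backoff.
// A resolved value (true or false) is returned as-is; only thrown errors,
// which signal that tmux itself failed, trigger another attempt.
async function checkWithRetry(
  check: () => Promise<boolean>,
  retries = 2,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await check();
    } catch (err) {
      if (attempt >= retries) throw err; // give up and surface the tmux error
      await sleep(baseDelayMs * (attempt + 1));
    }
  }
}

// Hypothetical pre-check. `runTmux` is an injected runner that resolves with
// the exit code of a tmux invocation and rejects if tmux cannot be spawned.
async function isTmuxHealthy(
  runTmux: (args: string[]) => Promise<number>,
): Promise<boolean> {
  try {
    return (await runTmux(["list-sessions"])) === 0;
  } catch {
    return false;
  }
}
```

With this shape, a bulk pass would call isTmuxHealthy() once, skip the pass entirely if it returns false, and otherwise wrap each per-pane check in checkWithRetry().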

Files Involved

  • src/lib/tmux.ts:314-316 — the catch-all error handler
  • src/lib/orchestrator.ts:851-863 — resumePlan() marks all dead panes as failed
  • src/lib/monitor.ts — polling loop that calls isPaneRunning()

Additional Context

Identified in the master audit report (Section 7: Robustness). The audit flags this as: "isPaneRunning returns false for ALL errors (could mark all jobs dead)."

Metadata

Assignees

No one assigned

Labels

P1: high, bug
