Closed
Labels
- P1: high (Important fix or feature, next up after critical)
- bug (Something isn't working)
Description
isPaneRunning() in tmux.ts catches all errors and returns false, making it impossible to distinguish between "pane is not running" and "tmux command failed." A transient tmux issue (e.g., server restart, socket error) will cause every job to be marked as dead simultaneously.
Steps to Reproduce
- Start a plan with multiple running jobs.
- Cause a transient tmux failure (e.g., temporarily kill the tmux server, network issue on remote tmux).
- The monitor polls isPaneRunning() for each job; all return false.
- All jobs are marked failed simultaneously.
Expected Behavior
- If tmux itself is unreachable, the error should be surfaced (or retried) rather than interpreted as "all panes are dead."
- Only return false when tmux successfully confirms the pane doesn't exist.
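The "surfaced (or retried)" behavior could look roughly like the sketch below. It assumes the pane check is changed to throw when tmux is unreachable and to return true/false only when tmux actually answered. checkWithRetry, retries, and baseDelayMs are hypothetical names, not existing code in tmux.ts.

```typescript
// Hypothetical helper: retries a pane check that THROWS when tmux is unreachable.
// A returned true/false is treated as a definitive answer and is never retried.
async function checkWithRetry(
  check: () => Promise<boolean>,
  retries = 2,
  baseDelayMs = 200,
): Promise<boolean> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await check();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        const delayMs = baseDelayMs * Math.pow(2, attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the error instead of reporting "pane dead".
  throw lastError;
}
```

A caller would wrap the real check, for example `checkWithRetry(() => isPaneRunning(job.tmuxTarget))`, and treat a thrown error as "tmux unavailable" rather than as a dead pane.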
Actual Behavior
// src/lib/tmux.ts:314-316
} catch {
  return false;
}

Any error (including tmux server not found, permission errors, socket errors) silently returns false.
Cascade Impact
In resumePlan() at src/lib/orchestrator.ts:851-863:
for (const runningJob of runningJobs) {
  const paneAlive = await isPaneRunning(runningJob.tmuxTarget);
  if (!paneAlive) {
    // Marks job as failed, then fails entire plan
  }
}

A single tmux hiccup → all jobs failed → entire plan failed.
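One way to keep a single tmux hiccup from failing the whole plan is to treat a thrown check as "inconclusive" and skip the pass. The sketch below is illustrative only. It assumes isPaneRunning() throws when tmux is unreachable, and RunningJob, ResumeScan, and scanRunningJobs are hypothetical names, not code from orchestrator.ts.

```typescript
// Hypothetical shapes; only isPaneRunning and tmuxTarget appear in the issue.
interface RunningJob {
  id: string;
  tmuxTarget: string;
}

// Assumed contract: resolves true/false when tmux answered, rejects when tmux failed.
type PaneCheck = (target: string) => Promise<boolean>;

interface ResumeScan {
  dead: string[];        // job ids whose panes tmux confirmed are gone
  inconclusive: boolean; // true when tmux itself failed during the scan
}

async function scanRunningJobs(
  jobs: RunningJob[],
  isPaneRunning: PaneCheck,
): Promise<ResumeScan> {
  const dead: string[] = [];
  for (const job of jobs) {
    try {
      if (!(await isPaneRunning(job.tmuxTarget))) {
        dead.push(job.id);
      }
    } catch {
      // tmux failed: discard partial results so no job is marked failed on this pass.
      return { dead: [], inconclusive: true };
    }
  }
  return { dead, inconclusive: false };
}
```

With this shape, the caller marks jobs as failed only when `inconclusive` is false, and otherwise retries the scan later.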
Proposed Fix
- Distinguish error types in the catch block:
  - tmux returns non-zero + "no pane" message → return false (pane genuinely dead)
  - tmux returns non-zero + other error → throw or return an error result
  - tmux command fails to execute → throw
- Consider adding retry logic (1-2 retries with backoff) before declaring a pane dead.
- Add an isTmuxHealthy() pre-check before bulk-checking panes.
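The error classification in the first item could be sketched as follows. This is a minimal illustration, not the actual tmux.ts code. It assumes the caller captures tmux's exit code and stderr, and that tmux reports a missing target with a "can't find pane/window/session" message, which should be verified against the tmux version in use. TmuxFailure, TmuxUnavailableError, and paneStatusFromFailure are hypothetical names.

```typescript
// Hypothetical description of a failed tmux invocation.
interface TmuxFailure {
  exitCode: number | null; // null when tmux could not be executed at all
  stderr: string;
}

// Hypothetical error type for "tmux itself is the problem".
class TmuxUnavailableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TmuxUnavailableError";
  }
}

// ASSUMPTION: stderr wording tmux uses when the target does not exist.
const MISSING_TARGET = /can't find (pane|window|session)/i;

// Returns false only when tmux confirmed the pane is gone; throws otherwise.
function paneStatusFromFailure(failure: TmuxFailure): false {
  if (failure.exitCode === null) {
    // The tmux command failed to execute (missing binary, permissions, ...).
    throw new TmuxUnavailableError("tmux could not be executed");
  }
  if (MISSING_TARGET.test(failure.stderr)) {
    // Non-zero exit plus a "missing target" message: pane genuinely dead.
    return false;
  }
  // Non-zero exit with any other message (server down, socket error, ...).
  throw new TmuxUnavailableError(
    "tmux exited with code " + failure.exitCode + ": " + failure.stderr.trim(),
  );
}
```

The catch block in isPaneRunning() would then call a function like this instead of returning false unconditionally, so that only a confirmed missing pane maps to false.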
Files Involved
- src/lib/tmux.ts:314-316: the catch-all error handler
- src/lib/orchestrator.ts:851-863: resumePlan() marks all dead panes as failed
- src/lib/monitor.ts: polling loop that calls isPaneRunning()
Additional Context
Identified in the master audit report (Section 7: Robustness). The audit flags this as: "isPaneRunning returns false for ALL errors (could mark all jobs dead)."