fix: session stability hardening by PureWeen · Pull Request #373 · PureWeen/PolyPilot

PureWeen · 2026-03-13T21:35:30Z

Problem

Sessions die instantly after reconnect due to compounding bugs:

Corrupted session files kill sessions permanently - When events.jsonl contains tool output with words like "ephemeral", the CLI JSON parser chokes on session.resume. No catch existed, so the session dies.
Orphaned event handlers race with new state - Stale SessionIdleEvent callbacks from the old session clear IsProcessing on the shared Info object, killing the new session. Each reconnect stacks another handler.
Watchdog crashes leave sessions permanently stuck - If watchdog cleanup throws, IsProcessing was never cleared.
posix_spawn failures not detected - Shell failures did not trigger auto-recovery.
relaunch.sh kills copilot server - Grep pattern matched the bundled copilot binary too.

Fixes

Corrupted session file recovery: Catch "corrupted" in resume, create fresh session
IsOrphaned flag: HandleSessionEvent and CompleteResponse skip orphaned states
Generation invalidation: Old state ProcessingGeneration set to long.MaxValue
Watchdog crash safety: Outer catch clears IsProcessing (INV-1 compliant)
Shell failure detection: posix_spawn treated like permission denial for recovery
relaunch.sh: End-of-line anchor to only match app binary

Testing

2549 tests pass, 5 pre-existing failures (font scaling + DevTunnel - unrelated)

Extends the existing permission-denial recovery mechanism to also detect shell environment failures (posix_spawn failed). When 3 out of 5 tool calls fail with posix_spawn errors, the session is automatically disposed and resumed — the same sliding-window + TryRecoverPermissionAsync flow. This fixes sessions that become permanently broken when the CLI's internal process spawning fails, which previously required manual session recreation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two paths could leave IsProcessing=true forever with no recovery: 1. Watchdog loop exception (catch block): The catch(Exception) block logged the error but did NOT clear IsProcessing or companion state. Any unexpected exception left the session stuck at "Sending..." forever. 2. Timeout callback failure: The Case C timeout callback called FlushCurrentResponse() BEFORE setting IsProcessing=false. If the flush threw, the exception was silently caught by InvokeOnUI's try-catch, and the watchdog had already exited (break) — no further cleanup. Fix: - Wrap FlushCurrentResponse in the timeout callback with try-catch so IsProcessing is always cleared even if the flush fails - Add INV-1 compliant crash recovery to the catch(Exception) block: clears IsProcessing + all 9 companion fields, completes TCS, fires OnSessionComplete/OnError/OnStateChanged Added 4 regression tests verifying the crash recovery path and structural invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…d session files Three fixes for instant session death after reconnect: 1. Handle 'Session file is corrupted' in resume catch chain — when events.jsonl is corrupted (e.g., tool output text parsed as JSON literal), create a fresh session instead of letting the reconnect fail entirely. 2. Add IsOrphaned flag to SessionState — set true on the old state before creating the replacement. HandleSessionEvent and CompleteResponse skip all processing for orphaned states, preventing stale callbacks from clearing IsProcessing on the shared Info object. 3. Invalidate old state's ProcessingGeneration (set to long.MaxValue) before creating the new state. Any queued Invoke callbacks from the old state fail the generation check in CompleteResponse, providing defense-in-depth alongside IsOrphaned. Root cause: on reconnect, old and new SessionState share the same Info object. Stale SessionIdleEvent callbacks from the orphaned old CopilotSession would pass the generation check (old gen matched old state) and clear Info.IsProcessing, killing the new state's processing. Each reconnect also registered a new event handler without unregistering the old one, causing N× event multiplication. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use end-of-line anchor (grep 'PolyPilot$') so the PID capture only matches the app binary (comm ends with 'PolyPilot'), not the copilot headless server bundled inside PolyPilot.app/Contents/MonoBundle/copilot whose full path also contains 'PolyPilot'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…resume - Add IsOrphaned checks to watchdog loop, tool health timer, TurnEnd fallback, TriggerToolHealthRecovery, and crash recovery callbacks - Add missing INV-1 fields (ToolHealthStaleChecks, EventCountThisTurn, TurnEndReceivedAtTicks) to watchdog timeout and crash recovery paths - Fix sibling re-resume: create new SessionState + orphan old one to prevent duplicate event handler dispatch (critical race condition) - Register handler BEFORE publishing to dictionary (no-handler window) - Guard corrupted/expired catch blocks so cleanup runs if CreateSessionAsync throws (cancel watchers + orphan state) - Reorder IsOrphaned=true before Cancel* calls (catch queued callbacks) - Complete TCS with TrySetCanceled() in orphaned CompleteResponse guard so orchestrator workers don't hang forever Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- SystemTextFontScalingTests: --type-subhead was renamed to --type-callout - FontSizingEnforcementTests: allowlist SessionListItem worker-child 0.85em - DevTunnelServiceTests: handle devtunnel CLI being installed (skip gracefully) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- N1: Snapshot Organization.Sessions/Groups to local lists before Task.Run to avoid InvalidOperationException from concurrent access - N2: Set IsOrphaned=true in failed sibling catch so dead sessions don't become zombies with stale event handlers still firing - N4: Broaden corrupted-session catch to also match 'session file' errors (matches Persistence.cs recovery pattern) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- SessionErrorEvent: add 3 missing INV-1 fields + IsOrphaned guard - RecoveryFailure: add 3 missing INV-1 fields - PermissionRecover cleanup: add 3 missing INV-1 fields - TriggerToolHealthRecovery: add ProcessingGeneration guard - Sibling re-resume: cancel old TCS, reset all tool counters on new state to match primary reconnect path - Inner createEx catches: reorder IsOrphaned=true before Cancel* calls Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add TrySetCanceled() to failed sibling catch to unblock orchestrator workers - Use TryUpdate instead of index assignment to prevent TOCTOU on rapid reconnects - Skip actively-processing siblings to avoid mid-turn abort Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-13T23:04:31Z

PR Review: fix/session-stability-hardening

Multi-model review (4/5 models completed; consensus = flagged by 2+ models)

CI Status

⚠️ No CI checks reported on this branch. Author reports 2549 tests pass locally with 5 pre-existing failures (font scaling + DevTunnel — unrelated). Tests should be verified to pass on a clean CI run.

Findings (Consensus, Ranked by Severity)

🔴 CRITICAL — Sibling Re-Resume TOCTOU Race

File: PolyPilot/Services/CopilotService.cs ~line 2646
Flagged by: 4/4 models

if (otherState.Info.IsProcessing) continue;  // check on background thread
// ... await ResumeSessionAsync(...)         // async gap — sibling can start processing here
otherState.ResponseCompletion?.TrySetCanceled();  // may cancel a live TCS

IsProcessing is a plain bool (not volatile) on AgentSessionInfo, written on the UI thread. The check runs on a Task.Run background thread. If a concurrent SendPromptAsync starts between the check and TrySetCanceled(), the active ResponseCompletion TCS is forcibly canceled — throwing TaskCanceledException to any orchestrator worker awaiting it and silently dropping the mid-turn response. Multi-agent orchestration makes this race more likely (workers dispatch sends concurrently).

Suggested fix: After ResumeSessionAsync returns, re-check otherState.Info.IsProcessing before proceeding; if it's now true, skip orphaning and mark siblingState.IsOrphaned = true immediately:

var resumed = await newClient.ResumeSessionAsync(...);
if (otherState.Info.IsProcessing)  // re-check after await
{
    siblingState.IsOrphaned = true;  // discard the just-resumed session
    continue;
}

🟡 MODERATE — Sibling ResumeSessionConfig Missing MCP Servers

File: PolyPilot/Services/CopilotService.cs ~line 2655
Flagged by: 3/4 models

The sibling re-resume only registers ShowImageTool:

var cfg = new ResumeSessionConfig
{
    Tools = new List<AIFunction> { ShowImageTool.CreateFunction() },
    OnPermissionRequest = AutoApprovePermissions,
};

The primary reconnect and corrupted-session path both use BuildFreshSessionConfig which includes MCP servers, skill directories, and system messages. Any sibling session that had MCP tools configured will silently lose them after re-resume. Tool calls will fail with confusing errors until the user manually reconnects. Should use BuildFreshSessionConfig(otherState) (or equivalent) for sibling configs.

🟡 MODERATE — Watchdog Crash Recovery Loses Un-Flushed Content

File: PolyPilot/Services/CopilotService.Events.cs ~line 2194
Flagged by: 3/4 models

var crashResponse = state.FlushedResponse.ToString();  // only flushed content
state.FlushedResponse.Clear();
state.CurrentResponse.Clear();  // ← streaming content silently discarded
state.ResponseCompletion?.TrySetResult(crashResponse);  // missing CurrentResponse

CurrentResponse holds streaming deltas received since the last FlushCurrentResponse call. The crash path clears it without concatenating it into crashResponse. This is inconsistent with TriggerToolHealthRecovery which correctly combines both. The fix is:

var crashResponse = state.FlushedResponse.ToString() + state.CurrentResponse.ToString();

🟡 MODERATE — Overly Broad Corrupted-Session Exception Filter

File: PolyPilot/Services/CopilotService.cs ~line 2772
Flagged by: 3/4 models

catch (Exception resumeEx) when (
    resumeEx.Message.Contains("corrupted", StringComparison.OrdinalIgnoreCase) ||
    resumeEx.Message.Contains("session file", StringComparison.OrdinalIgnoreCase))

"session file" is a very broad substring. Transient errors containing this phrase (e.g., "Cannot read session file: permission denied", "session file locked") will be misidentified as corruption and trigger destructive fresh-session creation — losing the user's entire session history when a retry would have worked. A more specific pattern (e.g., "session file is corrupted", "invalid json", "line N") would reduce false positives.

🟢 MINOR — Indentation Anomaly in TriggerToolHealthRecovery

File: PolyPilot/Services/CopilotService.Events.cs ~line 1679
Flagged by: 3/4 models

The InvokeOnUI(() => call has ~24 extra leading spaces compared to surrounding code. C# semantics are unaffected (indentation-insensitive), but it's real source corruption (not a diff artifact) that breaks code navigation and will produce confusing merge conflicts.

Test Coverage Assessment

New code with tests: ✅ IsShellEnvironmentFailure (PermissionDenialDetectionTests.cs), ✅ watchdog crash recovery source-inspection (ProcessingWatchdogTests.cs)

Missing test coverage for new code paths:

❌ No test for sibling re-resume race condition — the TOCTOU scenario (sibling starts processing while re-resume is in flight) has no test case
❌ No test that sibling retains MCP tool registrations after re-resume — would catch the missing BuildFreshSessionConfig issue
❌ No test for corrupted session recovery end-to-end — the when filter behavior is source-inspected but not exercised with a real exception message containing "session file" without "corrupted"
❌ No test for CurrentResponse content in watchdog crash response — the existing WatchdogCrashRecovery_ClearsCompanionFields test doesn't verify the response payload includes streaming content

Recommended Action

⚠️ Request Changes

The sibling re-resume logic (the key new feature) has a real TOCTOU race that can cancel in-flight orchestrator workers. The missing MCP servers will silently break sessions that rely on plugins after any reconnect. These are both straightforward to fix. The remaining items are lower risk but also straightforward.

Specific asks:

Re-check IsProcessing after ResumeSessionAsync returns to close the TOCTOU window
Use BuildFreshSessionConfig (or equivalent) for sibling ResumeSessionConfig
Append CurrentResponse to crashResponse in watchdog crash handler
Tighten the "session file" exception filter to a more specific pattern

…r tightening - Re-check IsProcessing after ResumeSessionAsync to close TOCTOU window where a concurrent SendPromptAsync could start between check and orphan - Include MCP servers and skill directories in sibling ResumeSessionConfig so plugin tools survive reconnect - Append CurrentResponse to crashResponse in watchdog crash recovery (was silently discarding streaming content) - Tighten corrupted-session filter: 'session file is' + 'Invalid literal value' instead of overly broad 'session file' substring Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-15T03:18:14Z

Re-Review: fix/session-stability-hardening

Multi-model re-review (5/5 models) — updated after 5 additional commits

Previous Findings Status

#	Finding	Status
1	🔴 TOCTOU race in sibling re-resume	✅ FIXED (post-await re-check added)
2	🟡 Missing MCP servers in sibling config	✅ FIXED (`LoadMcpServers` + `LoadSkillDirectories` added)
3	🟡 Watchdog crash loses `CurrentResponse`	✅ FIXED (`crashResponse = FlushedResponse + CurrentResponse`)
4	🟡 Broad `"session file"` exception filter	✅ FIXED (tightened to `"session file is"` + `"Invalid literal value"`)
5	🟢 Indentation corruption in `TriggerToolHealthRecovery`	❌ STILL PRESENT

New Finding (Consensus: 3/5 models)

🟡 MODERATE — Leaked in sibling re-resume abandon paths

File: (sibling loop, ~lines 2672 and 2703)

Two paths in the sibling re-resume loop abandon a live without disposing it:

Path 1 — post-await TOCTOU guard:

var resumed = await newClient.ResumeSessionAsync(otherState.Info.SessionId, cfg, CancellationToken.None);
if (otherState.Info.IsProcessing)
{
    Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
    continue;  // ← 'resumed' is never disposed
}

Path 2 — failure (rapid back-to-back reconnects):

{
    siblingState.IsOrphaned = true;
    continue;  // ← 'resumed' is never disposed
}

In both cases, has already created a server-side session with an open transport (stdio pipe or TCP connection to the copilot CLI process). Without , these handles accumulate on rapid reconnects or in multi-session workspaces. Under flapping network conditions with multiple sessions, this will exhaust file descriptors or produce stale server-side sessions.

Fix:

// Path 1
if (otherState.Info.IsProcessing)
{
    Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
    try { await resumed.DisposeAsync(); } catch { }
    continue;
}

// Path 2
{
    siblingState.IsOrphaned = true;
    try { await resumed.DisposeAsync(); } catch { }
    continue;
}

Remaining Minor Issue

** indentation** () — the call still has 24 extra leading spaces in the updated diff. Functionally harmless in C# but creates misleading visual structure.

Recommended Action

⚠️ Request changes (one remaining item)

All four major findings from the initial review are fixed. One new 🟡 MODERATE issue: the two CopilotSession leak paths need . This is a straightforward two-line fix. Once addressed the PR is ready to merge.

- DisposeAsync resumed CopilotSession in TOCTOU guard (sibling started processing during re-resume) to prevent leaked server-side handles - DisposeAsync resumed CopilotSession in TryUpdate failure (rapid back-to-back reconnects) to prevent file descriptor exhaustion - Fix 24-space indentation anomaly in TriggerToolHealthRecovery Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The debug log claimed the old session was disposed but DisposeAsync was never called. Removing the claim rather than adding disposal, because CopilotSession.DisposeAsync may send a close command to the server that would break the resumed session (same session ID). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-15T03:58:02Z

Round 7 Review — Parallel 3-Agent Deep Dive

Agents: Agent-8 (sibling re-resume), Agent-9 (Events.cs INV-1), Agent-10 (reconnect paths)

Triage Summary

#	Agent	Finding	Verdict	Reasoning
1	Agent-9	CompleteResponse missing FlushCurrentResponse	False positive	CompleteResponse does the flush work inline (lines 968-1002): reads CurrentResponse→adds to history→builds fullResponse→clears. It IS the flush.
2	Agent-9	TriggerToolHealthRecovery missing FlushCurrentResponse	False positive	Also reads both buffers inline and clears them (lines 1699-1708). Documented as "by design" in processing-state-safety skill.
3	Agent-8	Missing ResponseCompletion init on sibling state	False positive	Siblings are idle (IsProcessing=true siblings are skipped). TCS is created by SendPromptAsync when next prompt arrives. Not needed for idle sessions.
4	Agent-8	Missing ProcessingGeneration increment on sibling state	False positive	Old state has long.MaxValue (blocks stale callbacks). New state at 0 is correct — identical to any fresh session. Primary path increments because it's mid-turn.
5	Agent-8	Missing CurrentResponse/FlushedResponse/PendingReasoningMessages clear	False positive	New `SessionState` object has fresh `StringBuilder` instances — already empty by construction.
6	Agent-10	Missing DisposeAsync of old session in primary reconnect	Won't fix (risky)	`CopilotSession.DisposeAsync` may send a close command to the server, breaking the resumed session (same session ID).
7	Agent-10	Log says "old session disposed" but no disposal happens	✅ Fixed	Removed misleading text from log message (commit `f2c337e`).

Fix Applied

f2c337e — Corrected misleading "old session disposed" debug log message. The old CopilotSession is intentionally NOT disposed because DisposeAsync could send a server-side close that breaks the resumed session sharing the same session ID.

Status

2554/2554 tests passing
12 commits on the PR (all incremental, preserving full review history)
7 rounds of review (4 external multi-model consensus + 3 from this session)
All critical and moderate findings across all rounds addressed

The PR is ready for merge. 🚀

…bling loop Primary reconnect path (SendPromptAsync connection error recovery) was missing McpServers and SkillDirectories in the ResumeSessionConfig, causing MCP tool access to be lost after a connection drop. The sibling path was already fixed in Round 5 — this aligns the primary path. Also threads the outer cancellationToken into the sibling re-resume Task.Run loop for clean shutdown behavior instead of fire-and-forget with CancellationToken.None. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- multi-agent-orchestration SKILL.md: 4→5 phase lifecycle, new IDLE-DEFER section, fix INV-O3 ordering, new INV-O15, mark premature idle bug as FIXED - processing-state-safety SKILL.md: 10→16 paths table with tags, new INV-18 for BackgroundTasks, IDLE-DEFER in stuck session table, note EVT-REARM is now secondary defense, add PRs #373/#375/#399 to regression history - copilot-instructions.md: update SDK Event Flow step 9 for BackgroundTasks check, add [IDLE-DEFER] diagnostic tag, fix stale path count (8→15+), add BackgroundTasksIdleTests to test list Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

## Problem Sessions die instantly after reconnect due to compounding bugs: 1. **Corrupted session files kill sessions permanently** - When events.jsonl contains tool output with words like "ephemeral", the CLI JSON parser chokes on session.resume. No catch existed, so the session dies. 2. **Orphaned event handlers race with new state** - Stale SessionIdleEvent callbacks from the old session clear IsProcessing on the shared Info object, killing the new session. Each reconnect stacks another handler. 3. **Watchdog crashes leave sessions permanently stuck** - If watchdog cleanup throws, IsProcessing was never cleared. 4. **posix_spawn failures not detected** - Shell failures did not trigger auto-recovery. 5. **relaunch.sh kills copilot server** - Grep pattern matched the bundled copilot binary too. ## Fixes - **Corrupted session file recovery**: Catch "corrupted" in resume, create fresh session - **IsOrphaned flag**: HandleSessionEvent and CompleteResponse skip orphaned states - **Generation invalidation**: Old state ProcessingGeneration set to long.MaxValue - **Watchdog crash safety**: Outer catch clears IsProcessing (INV-1 compliant) - **Shell failure detection**: posix_spawn treated like permission denial for recovery - **relaunch.sh**: End-of-line anchor to only match app binary ## Testing 2549 tests pass, 5 pre-existing failures (font scaling + DevTunnel - unrelated) --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen and others added 9 commits March 13, 2026 16:30

PureWeen and others added 2 commits March 14, 2026 22:46

PureWeen merged commit 37abe13 into main Mar 15, 2026

PureWeen deleted the fix/session-stability-hardening branch March 15, 2026 14:28

This was referenced Mar 15, 2026

fix: prevent sessions from being permanently stuck when watchdog crashes #372

Closed

Bug: IsProcessing siblings skipped during client recreation #376

Closed

Copilot AI mentioned this pull request Mar 16, 2026

Fix: force-complete IsProcessing siblings during client recreation instead of skipping #384

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: session stability hardening#373

fix: session stability hardening#373
PureWeen merged 13 commits intomainfrom
fix/session-stability-hardening

PureWeen commented Mar 13, 2026

Uh oh!

PureWeen commented Mar 13, 2026

Uh oh!

PureWeen commented Mar 15, 2026

Uh oh!

PureWeen commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 13, 2026

Problem

Fixes

Testing

Uh oh!

PureWeen commented Mar 13, 2026

PR Review: fix/session-stability-hardening

CI Status

Findings (Consensus, Ranked by Severity)

🔴 CRITICAL — Sibling Re-Resume TOCTOU Race

🟡 MODERATE — Sibling ResumeSessionConfig Missing MCP Servers

🟡 MODERATE — Watchdog Crash Recovery Loses Un-Flushed Content

🟡 MODERATE — Overly Broad Corrupted-Session Exception Filter

🟢 MINOR — Indentation Anomaly in TriggerToolHealthRecovery

Test Coverage Assessment

Recommended Action

Uh oh!

PureWeen commented Mar 15, 2026

Re-Review: fix/session-stability-hardening

Previous Findings Status

New Finding (Consensus: 3/5 models)

🟡 MODERATE — Leaked in sibling re-resume abandon paths

Remaining Minor Issue

Recommended Action

Uh oh!

PureWeen commented Mar 15, 2026

Round 7 Review — Parallel 3-Agent Deep Dive

Triage Summary

Fix Applied

Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant