Extends the existing permission-denial recovery mechanism to also detect shell environment failures (posix_spawn failed). When 3 of the last 5 tool calls fail with posix_spawn errors, the session is automatically disposed and resumed, using the same sliding-window + TryRecoverPermissionAsync flow. This fixes sessions that become permanently broken when the CLI's internal process spawning fails, which previously required manual session recreation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
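A minimal sketch of the 3-of-5 sliding-window check described in this commit. The class name `ToolFailureWindow`, its methods, and the substring match on the error text are illustrative assumptions, not the project's actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the outcome of the last N tool calls and
// reports when at least `threshold` of them were shell-spawn failures.
public sealed class ToolFailureWindow
{
    private readonly Queue<bool> _recent = new();
    private readonly int _windowSize;
    private readonly int _threshold;

    public ToolFailureWindow(int windowSize = 5, int threshold = 3)
    {
        _windowSize = windowSize;
        _threshold = threshold;
    }

    // Assumed detection rule: a case-insensitive substring match on the error text.
    public static bool IsShellFailure(string? error) =>
        error is not null &&
        error.Contains("posix_spawn failed", StringComparison.OrdinalIgnoreCase);

    // Records one tool result; returns true when recovery should be triggered.
    public bool Record(string? error)
    {
        _recent.Enqueue(IsShellFailure(error));
        if (_recent.Count > _windowSize) _recent.Dequeue(); // drop the oldest entry

        int failures = 0;
        foreach (bool failed in _recent)
        {
            if (failed) failures++;
        }
        return failures >= _threshold;
    }

    // Call after a successful recovery so old failures do not re-trigger it.
    public void Reset() => _recent.Clear();
}
```

A caller would invoke `Record` once per completed tool call and start the dispose-and-resume flow when it returns true.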
Two paths could leave IsProcessing=true forever with no recovery:

1. Watchdog loop exception (catch block): The catch(Exception) block logged the error but did NOT clear IsProcessing or companion state. Any unexpected exception left the session stuck at "Sending..." forever.
2. Timeout callback failure: The Case C timeout callback called FlushCurrentResponse() BEFORE setting IsProcessing=false. If the flush threw, the exception was silently caught by InvokeOnUI's try-catch, and the watchdog had already exited (break), so no further cleanup ran.

Fix:
- Wrap FlushCurrentResponse in the timeout callback with try-catch so IsProcessing is always cleared even if the flush fails
- Add INV-1 compliant crash recovery to the catch(Exception) block: clears IsProcessing + all 9 companion fields, completes TCS, fires OnSessionComplete/OnError/OnStateChanged

Added 4 regression tests verifying the crash recovery path and structural invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
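The ordering fix can be sketched as follows, assuming a heavily simplified session type. `SketchSession`, `WatchdogSketch`, and the `flush` delegate are hypothetical stand-ins; the real code clears more companion fields and fires several callbacks.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical, trimmed-down stand-in for the real session state.
public sealed class SketchSession
{
    public bool IsProcessing { get; set; } = true;

    public TaskCompletionSource<string> ResponseCompletion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}

public static class WatchdogSketch
{
    // `flush` stands in for FlushCurrentResponse and is allowed to throw.
    public static void OnTimeout(SketchSession session, Action flush)
    {
        try
        {
            flush();
        }
        catch (Exception ex)
        {
            // Log and continue: a failed flush must not block cleanup.
            Console.Error.WriteLine($"[WATCHDOG] flush failed: {ex.Message}");
        }
        finally
        {
            // Runs whether or not the flush threw, so the session cannot
            // stay stuck in the processing state.
            session.IsProcessing = false;
            session.ResponseCompletion.TrySetResult(string.Empty);
        }
    }
}
```

Putting the state reset in `finally` rather than after the `try` block keeps the guarantee even if the `catch` handler itself throws.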
…d session files

Three fixes for instant session death after reconnect:

1. Handle 'Session file is corrupted' in resume catch chain: when events.jsonl is corrupted (e.g., tool output text parsed as JSON literal), create a fresh session instead of letting the reconnect fail entirely.
2. Add IsOrphaned flag to SessionState, set to true on the old state before creating the replacement. HandleSessionEvent and CompleteResponse skip all processing for orphaned states, preventing stale callbacks from clearing IsProcessing on the shared Info object.
3. Invalidate old state's ProcessingGeneration (set to long.MaxValue) before creating the new state. Any queued Invoke callbacks from the old state fail the generation check in CompleteResponse, providing defense-in-depth alongside IsOrphaned.

Root cause: on reconnect, old and new SessionState share the same Info object. Stale SessionIdleEvent callbacks from the orphaned old CopilotSession would pass the generation check (old gen matched old state) and clear Info.IsProcessing, killing the new state's processing. Each reconnect also registered a new event handler without unregistering the old one, causing N× event multiplication.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
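A rough sketch of how the orphan flag and generation invalidation combine to reject stale callbacks. `SketchInfo`, `SketchState`, `BeginTurn`, and `Orphan` are hypothetical simplifications; the real SessionState carries far more fields, and its method shapes may differ.

```csharp
using System.Threading;

// Hypothetical shape of the Info object shared by old and new state.
public sealed class SketchInfo
{
    public bool IsProcessing { get; set; }
}

public sealed class SketchState
{
    private long _processingGeneration;
    private volatile bool _isOrphaned;

    public SketchState(SketchInfo info) => Info = info;

    public SketchInfo Info { get; }
    public bool IsOrphaned => _isOrphaned;
    public long ProcessingGeneration => Interlocked.Read(ref _processingGeneration);

    // Starts a turn and returns the generation that callbacks must present.
    public long BeginTurn()
    {
        Info.IsProcessing = true;
        return Interlocked.Increment(ref _processingGeneration);
    }

    // Called on the old state before its replacement is created.
    public void Orphan()
    {
        _isOrphaned = true;
        Interlocked.Exchange(ref _processingGeneration, long.MaxValue);
    }

    // Returns true only if this callback was allowed to clear IsProcessing.
    public bool CompleteResponse(long expectedGeneration)
    {
        if (IsOrphaned) return false;                                  // first guard
        if (ProcessingGeneration != expectedGeneration) return false;  // second guard
        Info.IsProcessing = false;
        return true;
    }
}
```

Either guard alone would stop the stale callback in this sketch; keeping both matches the defense-in-depth intent stated in the commit.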
Use end-of-line anchor (grep 'PolyPilot$') so the PID capture only matches the app binary (comm ends with 'PolyPilot'), not the copilot headless server bundled inside PolyPilot.app/Contents/MonoBundle/copilot whose full path also contains 'PolyPilot'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…resume

- Add IsOrphaned checks to watchdog loop, tool health timer, TurnEnd fallback, TriggerToolHealthRecovery, and crash recovery callbacks
- Add missing INV-1 fields (ToolHealthStaleChecks, EventCountThisTurn, TurnEndReceivedAtTicks) to watchdog timeout and crash recovery paths
- Fix sibling re-resume: create new SessionState + orphan old one to prevent duplicate event handler dispatch (critical race condition)
- Register handler BEFORE publishing to dictionary (no-handler window)
- Guard corrupted/expired catch blocks so cleanup runs if CreateSessionAsync throws (cancel watchers + orphan state)
- Reorder IsOrphaned=true before Cancel* calls (catch queued callbacks)
- Complete TCS with TrySetCanceled() in orphaned CompleteResponse guard so orchestrator workers don't hang forever

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SystemTextFontScalingTests: --type-subhead was renamed to --type-callout
- FontSizingEnforcementTests: allowlist SessionListItem worker-child 0.85em
- DevTunnelServiceTests: handle devtunnel CLI being installed (skip gracefully)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- N1: Snapshot Organization.Sessions/Groups to local lists before Task.Run to avoid InvalidOperationException from concurrent access
- N2: Set IsOrphaned=true in failed sibling catch so dead sessions don't become zombies with stale event handlers still firing
- N4: Broaden corrupted-session catch to also match 'session file' errors (matches Persistence.cs recovery pattern)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
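The N1 snapshot pattern, sketched with a plain `List<string>` standing in for Organization.Sessions. `SnapshotSketch` and `CountNameCharsInBackground` are hypothetical names, and the background work is a placeholder.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SnapshotSketch
{
    // `liveSessions` stands in for a list that the calling thread keeps mutating.
    public static Task<int> CountNameCharsInBackground(List<string> liveSessions)
    {
        // Copy on the calling thread, before the background work starts.
        // Enumerating `liveSessions` directly inside Task.Run could throw
        // InvalidOperationException if the list changes mid-enumeration.
        List<string> snapshot = liveSessions.ToList();

        return Task.Run(() =>
        {
            int total = 0;
            foreach (string name in snapshot)
            {
                total += name.Length;
            }
            return total;
        });
    }
}
```

The copy is shallow, so it only protects the enumeration itself; the elements are still shared with the caller.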
- SessionErrorEvent: add 3 missing INV-1 fields + IsOrphaned guard
- RecoveryFailure: add 3 missing INV-1 fields
- PermissionRecover cleanup: add 3 missing INV-1 fields
- TriggerToolHealthRecovery: add ProcessingGeneration guard
- Sibling re-resume: cancel old TCS, reset all tool counters on new state to match primary reconnect path
- Inner createEx catches: reorder IsOrphaned=true before Cancel* calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add TrySetCanceled() to failed sibling catch to unblock orchestrator workers
- Use TryUpdate instead of index assignment to prevent TOCTOU on rapid reconnects
- Skip actively-processing siblings to avoid mid-turn abort

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
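A small sketch of why `ConcurrentDictionary.TryUpdate` closes the check-then-assign gap: the swap succeeds only if the entry still holds the value that was read earlier. `SwapSketch` and `TrySwap` are hypothetical; it is an assumption here that the session map is a `ConcurrentDictionary`, which the use of TryUpdate suggests.

```csharp
using System.Collections.Concurrent;

public static class SwapSketch
{
    // Replaces the entry for `key` only if it still equals `expectedOld`.
    // With index assignment (sessions[key] = replacement), a concurrent
    // reconnect that already swapped the entry would be silently overwritten.
    public static bool TrySwap<T>(
        ConcurrentDictionary<string, T> sessions,
        string key,
        T expectedOld,
        T replacement) where T : class
    {
        // Signature: TryUpdate(key, newValue, comparisonValue).
        return sessions.TryUpdate(key, replacement, expectedOld);
    }
}
```

When `TrySwap` returns false, another reconnect won the race, so the caller should discard its own replacement state rather than publish it.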
PR Review: fix/session-stability-hardening
CI Status

Findings (Consensus, Ranked by Severity)

🔴 CRITICAL: Sibling Re-Resume TOCTOU Race

File:

    if (otherState.Info.IsProcessing) continue;         // check on background thread
    // ... await ResumeSessionAsync(...)                // async gap: sibling can start processing here
    otherState.ResponseCompletion?.TrySetCanceled();    // may cancel a live TCS

Suggested fix: re-check IsProcessing after the await:

    var resumed = await newClient.ResumeSessionAsync(...);
    if (otherState.Info.IsProcessing)    // re-check after await
    {
        siblingState.IsOrphaned = true;  // discard the just-resumed session
        continue;
    }

🟡 MODERATE: Sibling ResumeSessionConfig Missing MCP Servers

File:

The sibling re-resume config only registers the following:

    var cfg = new ResumeSessionConfig
    {
        Tools = new List<AIFunction> { ShowImageTool.CreateFunction() },
        OnPermissionRequest = AutoApprovePermissions,
    };

The primary reconnect and corrupted-session path both use

🟡 MODERATE: Watchdog Crash Recovery Loses Un-Flushed Content

File:

    var crashResponse = state.FlushedResponse.ToString();    // only flushed content
    state.FlushedResponse.Clear();
    state.CurrentResponse.Clear();                           // ← streaming content silently discarded
    state.ResponseCompletion?.TrySetResult(crashResponse);   // missing CurrentResponse

Fix:

    var crashResponse = state.FlushedResponse.ToString() + state.CurrentResponse.ToString();

🟡 MODERATE: Overly Broad Corrupted-Session Exception Filter

File:

    catch (Exception resumeEx) when (
        resumeEx.Message.Contains("corrupted", StringComparison.OrdinalIgnoreCase) ||
        resumeEx.Message.Contains("session file", StringComparison.OrdinalIgnoreCase))

🟢 MINOR: Indentation Anomaly in TriggerToolHealthRecovery

File:

Test Coverage Assessment

New code with tests: ✅

Missing test coverage for new code paths:

Recommended Action

The sibling re-resume logic (the key new feature) has a real TOCTOU race that can cancel in-flight orchestrator workers. The missing MCP servers will silently break sessions that rely on plugins after any reconnect. These are both straightforward to fix. The remaining items are lower risk but also straightforward.

Specific asks:
…r tightening

- Re-check IsProcessing after ResumeSessionAsync to close TOCTOU window where a concurrent SendPromptAsync could start between check and orphan
- Include MCP servers and skill directories in sibling ResumeSessionConfig so plugin tools survive reconnect
- Append CurrentResponse to crashResponse in watchdog crash recovery (was silently discarding streaming content)
- Tighten corrupted-session filter: 'session file is' + 'Invalid literal value' instead of overly broad 'session file' substring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re-Review: fix/session-stability-hardening
Previous Findings Status
New Finding (Consensus: 3/5 models)

🟡 MODERATE: Leaked session in sibling re-resume abandon paths

File: (sibling loop, ~lines 2672 and 2703)

Two paths in the sibling re-resume loop abandon a live session without disposing it:

Path 1: post-await TOCTOU guard

    var resumed = await newClient.ResumeSessionAsync(otherState.Info.SessionId, cfg, CancellationToken.None);
    if (otherState.Info.IsProcessing)
    {
        Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
        continue; // ← 'resumed' is never disposed
    }

Path 2: failure (rapid back-to-back reconnects)

    {
        siblingState.IsOrphaned = true;
        continue; // ← 'resumed' is never disposed
    }

In both cases, the resume call has already created a server-side session with an open transport (stdio pipe or TCP connection to the copilot CLI process). Without disposal, these handles accumulate on rapid reconnects or in multi-session workspaces. Under flapping network conditions with multiple sessions, this will exhaust file descriptors or produce stale server-side sessions.

Fix:

    // Path 1
    if (otherState.Info.IsProcessing)
    {
        Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
        try { await resumed.DisposeAsync(); } catch { }
        continue;
    }
    // Path 2
    {
        siblingState.IsOrphaned = true;
        try { await resumed.DisposeAsync(); } catch { }
        continue;
    }

Remaining Minor Issue

Indentation: the call still has 24 extra leading spaces in the updated diff. Functionally harmless in C# but creates misleading visual structure.

Recommended Action

All four major findings from the initial review are fixed. One new 🟡 MODERATE issue: the two CopilotSession leak paths need disposal. This is a straightforward two-line fix. Once addressed, the PR is ready to merge.
- DisposeAsync resumed CopilotSession in TOCTOU guard (sibling started processing during re-resume) to prevent leaked server-side handles
- DisposeAsync resumed CopilotSession in TryUpdate failure (rapid back-to-back reconnects) to prevent file descriptor exhaustion
- Fix 24-space indentation anomaly in TriggerToolHealthRecovery

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The debug log claimed the old session was disposed but DisposeAsync was never called. Removing the claim rather than adding disposal, because CopilotSession.DisposeAsync may send a close command to the server that would break the resumed session (same session ID).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 7 Review: Parallel 3-Agent Deep Dive

Agents: Agent-8 (sibling re-resume), Agent-9 (Events.cs INV-1), Agent-10 (reconnect paths)

Triage Summary

Fix Applied

Status

The PR is ready for merge. 🚀
…bling loop

Primary reconnect path (SendPromptAsync connection error recovery) was missing McpServers and SkillDirectories in the ResumeSessionConfig, causing MCP tool access to be lost after a connection drop. The sibling path was already fixed in Round 5; this aligns the primary path.

Also threads the outer cancellationToken into the sibling re-resume Task.Run loop for clean shutdown behavior instead of fire-and-forget with CancellationToken.None.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- multi-agent-orchestration SKILL.md: 4→5 phase lifecycle, new IDLE-DEFER section, fix INV-O3 ordering, new INV-O15, mark premature idle bug as FIXED
- processing-state-safety SKILL.md: 10→16 paths table with tags, new INV-18 for BackgroundTasks, IDLE-DEFER in stuck session table, note EVT-REARM is now secondary defense, add PRs #373/#375/#399 to regression history
- copilot-instructions.md: update SDK Event Flow step 9 for BackgroundTasks check, add [IDLE-DEFER] diagnostic tag, fix stale path count (8→15+), add BackgroundTasksIdleTests to test list

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem

Sessions die instantly after reconnect due to compounding bugs:

1. **Corrupted session files kill sessions permanently** - When events.jsonl contains tool output with words like "ephemeral", the CLI JSON parser chokes on session.resume. No catch existed, so the session dies.
2. **Orphaned event handlers race with new state** - Stale SessionIdleEvent callbacks from the old session clear IsProcessing on the shared Info object, killing the new session. Each reconnect stacks another handler.
3. **Watchdog crashes leave sessions permanently stuck** - If watchdog cleanup throws, IsProcessing was never cleared.
4. **posix_spawn failures not detected** - Shell failures did not trigger auto-recovery.
5. **relaunch.sh kills copilot server** - Grep pattern matched the bundled copilot binary too.

## Fixes

- **Corrupted session file recovery**: Catch "corrupted" in resume, create fresh session
- **IsOrphaned flag**: HandleSessionEvent and CompleteResponse skip orphaned states
- **Generation invalidation**: Old state ProcessingGeneration set to long.MaxValue
- **Watchdog crash safety**: Outer catch clears IsProcessing (INV-1 compliant)
- **Shell failure detection**: posix_spawn treated like permission denial for recovery
- **relaunch.sh**: End-of-line anchor to only match app binary

## Testing

2549 tests pass, 5 pre-existing failures (font scaling + DevTunnel - unrelated)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>