Skip to content

fix: session stability hardening#373

Merged
PureWeen merged 13 commits intomainfrom
fix/session-stability-hardening
Mar 15, 2026
Merged

fix: session stability hardening#373
PureWeen merged 13 commits intomainfrom
fix/session-stability-hardening

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Sessions die instantly after reconnect due to compounding bugs:

  1. Corrupted session files kill sessions permanently - When events.jsonl contains tool output with words like "ephemeral", the CLI JSON parser chokes on session.resume. No catch existed, so the session dies.
  2. Orphaned event handlers race with new state - Stale SessionIdleEvent callbacks from the old session clear IsProcessing on the shared Info object, killing the new session. Each reconnect stacks another handler.
  3. Watchdog crashes leave sessions permanently stuck - If watchdog cleanup throws, IsProcessing was never cleared.
  4. posix_spawn failures not detected - Shell failures did not trigger auto-recovery.
  5. relaunch.sh kills copilot server - Grep pattern matched the bundled copilot binary too.

Fixes

  • Corrupted session file recovery: Catch "corrupted" in resume, create fresh session
  • IsOrphaned flag: HandleSessionEvent and CompleteResponse skip orphaned states
  • Generation invalidation: Old state ProcessingGeneration set to long.MaxValue
  • Watchdog crash safety: Outer catch clears IsProcessing (INV-1 compliant)
  • Shell failure detection: posix_spawn treated like permission denial for recovery
  • relaunch.sh: End-of-line anchor to only match app binary

Testing

2549 tests pass, 5 pre-existing failures (font scaling + DevTunnel - unrelated)

PureWeen and others added 9 commits March 13, 2026 16:30
Extends the existing permission-denial recovery mechanism to also detect
shell environment failures (posix_spawn failed). When 3 out of 5 tool
calls fail with posix_spawn errors, the session is automatically disposed
and resumed — the same sliding-window + TryRecoverPermissionAsync flow.

This fixes sessions that become permanently broken when the CLI's
internal process spawning fails, which previously required manual
session recreation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two paths could leave IsProcessing=true forever with no recovery:

1. Watchdog loop exception (catch block): The catch(Exception) block logged
   the error but did NOT clear IsProcessing or companion state. Any
   unexpected exception left the session stuck at "Sending..." forever.

2. Timeout callback failure: The Case C timeout callback called
   FlushCurrentResponse() BEFORE setting IsProcessing=false. If the flush
   threw, the exception was silently caught by InvokeOnUI's try-catch,
   and the watchdog had already exited (break) — no further cleanup.

Fix:
- Wrap FlushCurrentResponse in the timeout callback with try-catch so
  IsProcessing is always cleared even if the flush fails
- Add INV-1 compliant crash recovery to the catch(Exception) block:
  clears IsProcessing + all 9 companion fields, completes TCS,
  fires OnSessionComplete/OnError/OnStateChanged

Added 4 regression tests verifying the crash recovery path and
structural invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d session files

Three fixes for instant session death after reconnect:

1. Handle 'Session file is corrupted' in resume catch chain — when events.jsonl
   is corrupted (e.g., tool output text parsed as JSON literal), create a fresh
   session instead of letting the reconnect fail entirely.

2. Add IsOrphaned flag to SessionState — set true on the old state before creating
   the replacement. HandleSessionEvent and CompleteResponse skip all processing for
   orphaned states, preventing stale callbacks from clearing IsProcessing on the
   shared Info object.

3. Invalidate old state's ProcessingGeneration (set to long.MaxValue) before creating
   the new state. Any queued Invoke callbacks from the old state fail the generation
   check in CompleteResponse, providing defense-in-depth alongside IsOrphaned.

Root cause: on reconnect, old and new SessionState share the same Info object.
Stale SessionIdleEvent callbacks from the orphaned old CopilotSession would pass
the generation check (old gen matched old state) and clear Info.IsProcessing,
killing the new state's processing. Each reconnect also registered a new event
handler without unregistering the old one, causing N× event multiplication.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use end-of-line anchor (grep 'PolyPilot$') so the PID capture only matches
the app binary (comm ends with 'PolyPilot'), not the copilot headless server
bundled inside PolyPilot.app/Contents/MonoBundle/copilot whose full path
also contains 'PolyPilot'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…resume

- Add IsOrphaned checks to watchdog loop, tool health timer, TurnEnd
  fallback, TriggerToolHealthRecovery, and crash recovery callbacks
- Add missing INV-1 fields (ToolHealthStaleChecks, EventCountThisTurn,
  TurnEndReceivedAtTicks) to watchdog timeout and crash recovery paths
- Fix sibling re-resume: create new SessionState + orphan old one to
  prevent duplicate event handler dispatch (critical race condition)
- Register handler BEFORE publishing to dictionary (no-handler window)
- Guard corrupted/expired catch blocks so cleanup runs if
  CreateSessionAsync throws (cancel watchers + orphan state)
- Reorder IsOrphaned=true before Cancel* calls (catch queued callbacks)
- Complete TCS with TrySetCanceled() in orphaned CompleteResponse guard
  so orchestrator workers don't hang forever

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SystemTextFontScalingTests: --type-subhead was renamed to --type-callout
- FontSizingEnforcementTests: allowlist SessionListItem worker-child 0.85em
- DevTunnelServiceTests: handle devtunnel CLI being installed (skip gracefully)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- N1: Snapshot Organization.Sessions/Groups to local lists before
  Task.Run to avoid InvalidOperationException from concurrent access
- N2: Set IsOrphaned=true in failed sibling catch so dead sessions
  don't become zombies with stale event handlers still firing
- N4: Broaden corrupted-session catch to also match 'session file'
  errors (matches Persistence.cs recovery pattern)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SessionErrorEvent: add 3 missing INV-1 fields + IsOrphaned guard
- RecoveryFailure: add 3 missing INV-1 fields
- PermissionRecover cleanup: add 3 missing INV-1 fields
- TriggerToolHealthRecovery: add ProcessingGeneration guard
- Sibling re-resume: cancel old TCS, reset all tool counters on new
  state to match primary reconnect path
- Inner createEx catches: reorder IsOrphaned=true before Cancel* calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add TrySetCanceled() to failed sibling catch to unblock orchestrator workers
- Use TryUpdate instead of index assignment to prevent TOCTOU on rapid reconnects
- Skip actively-processing siblings to avoid mid-turn abort

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR Review: fix/session-stability-hardening

Multi-model review (4/5 models completed; consensus = flagged by 2+ models)

CI Status

⚠️ No CI checks reported on this branch. Author reports 2549 tests pass locally with 5 pre-existing failures (font scaling + DevTunnel — unrelated). Tests should be verified to pass on a clean CI run.


Findings (Consensus, Ranked by Severity)

🔴 CRITICAL — Sibling Re-Resume TOCTOU Race

File: PolyPilot/Services/CopilotService.cs ~line 2646
Flagged by: 4/4 models

if (otherState.Info.IsProcessing) continue;  // check on background thread
// ... await ResumeSessionAsync(...)         // async gap — sibling can start processing here
otherState.ResponseCompletion?.TrySetCanceled();  // may cancel a live TCS

IsProcessing is a plain bool (not volatile) on AgentSessionInfo, written on the UI thread. The check runs on a Task.Run background thread. If a concurrent SendPromptAsync starts between the check and TrySetCanceled(), the active ResponseCompletion TCS is forcibly canceled — throwing TaskCanceledException to any orchestrator worker awaiting it and silently dropping the mid-turn response. Multi-agent orchestration makes this race more likely (workers dispatch sends concurrently).

Suggested fix: After ResumeSessionAsync returns, re-check otherState.Info.IsProcessing before proceeding; if it's now true, skip orphaning and mark siblingState.IsOrphaned = true immediately:

var resumed = await newClient.ResumeSessionAsync(...);
if (otherState.Info.IsProcessing)  // re-check after await
{
    siblingState.IsOrphaned = true;  // discard the just-resumed session
    continue;
}

🟡 MODERATE — Sibling ResumeSessionConfig Missing MCP Servers

File: PolyPilot/Services/CopilotService.cs ~line 2655
Flagged by: 3/4 models

The sibling re-resume only registers ShowImageTool:

var cfg = new ResumeSessionConfig
{
    Tools = new List<AIFunction> { ShowImageTool.CreateFunction() },
    OnPermissionRequest = AutoApprovePermissions,
};

The primary reconnect and corrupted-session path both use BuildFreshSessionConfig which includes MCP servers, skill directories, and system messages. Any sibling session that had MCP tools configured will silently lose them after re-resume. Tool calls will fail with confusing errors until the user manually reconnects. Should use BuildFreshSessionConfig(otherState) (or equivalent) for sibling configs.


🟡 MODERATE — Watchdog Crash Recovery Loses Un-Flushed Content

File: PolyPilot/Services/CopilotService.Events.cs ~line 2194
Flagged by: 3/4 models

var crashResponse = state.FlushedResponse.ToString();  // only flushed content
state.FlushedResponse.Clear();
state.CurrentResponse.Clear();  // ← streaming content silently discarded
state.ResponseCompletion?.TrySetResult(crashResponse);  // missing CurrentResponse

CurrentResponse holds streaming deltas received since the last FlushCurrentResponse call. The crash path clears it without concatenating it into crashResponse. This is inconsistent with TriggerToolHealthRecovery which correctly combines both. The fix is:

var crashResponse = state.FlushedResponse.ToString() + state.CurrentResponse.ToString();

🟡 MODERATE — Overly Broad Corrupted-Session Exception Filter

File: PolyPilot/Services/CopilotService.cs ~line 2772
Flagged by: 3/4 models

catch (Exception resumeEx) when (
    resumeEx.Message.Contains("corrupted", StringComparison.OrdinalIgnoreCase) ||
    resumeEx.Message.Contains("session file", StringComparison.OrdinalIgnoreCase))

"session file" is a very broad substring. Transient errors containing this phrase (e.g., "Cannot read session file: permission denied", "session file locked") will be misidentified as corruption and trigger destructive fresh-session creation — losing the user's entire session history when a retry would have worked. A more specific pattern (e.g., "session file is corrupted", "invalid json", "line N") would reduce false positives.


🟢 MINOR — Indentation Anomaly in TriggerToolHealthRecovery

File: PolyPilot/Services/CopilotService.Events.cs ~line 1679
Flagged by: 3/4 models

The InvokeOnUI(() => call has ~24 extra leading spaces compared to surrounding code. C# semantics are unaffected (indentation-insensitive), but it's real source corruption (not a diff artifact) that breaks code navigation and will produce confusing merge conflicts.


Test Coverage Assessment

New code with tests:IsShellEnvironmentFailure (PermissionDenialDetectionTests.cs), ✅ watchdog crash recovery source-inspection (ProcessingWatchdogTests.cs)

Missing test coverage for new code paths:

  • No test for sibling re-resume race condition — the TOCTOU scenario (sibling starts processing while re-resume is in flight) has no test case
  • No test that sibling retains MCP tool registrations after re-resume — would catch the missing BuildFreshSessionConfig issue
  • No test for corrupted session recovery end-to-end — the when filter behavior is source-inspected but not exercised with a real exception message containing "session file" without "corrupted"
  • No test for CurrentResponse content in watchdog crash response — the existing WatchdogCrashRecovery_ClearsCompanionFields test doesn't verify the response payload includes streaming content

Recommended Action

⚠️ Request Changes

The sibling re-resume logic (the key new feature) has a real TOCTOU race that can cancel in-flight orchestrator workers. The missing MCP servers will silently break sessions that rely on plugins after any reconnect. These are both straightforward to fix. The remaining items are lower risk but also straightforward.

Specific asks:

  1. Re-check IsProcessing after ResumeSessionAsync returns to close the TOCTOU window
  2. Use BuildFreshSessionConfig (or equivalent) for sibling ResumeSessionConfig
  3. Append CurrentResponse to crashResponse in watchdog crash handler
  4. Tighten the "session file" exception filter to a more specific pattern

…r tightening

- Re-check IsProcessing after ResumeSessionAsync to close TOCTOU window
  where a concurrent SendPromptAsync could start between check and orphan
- Include MCP servers and skill directories in sibling ResumeSessionConfig
  so plugin tools survive reconnect
- Append CurrentResponse to crashResponse in watchdog crash recovery
  (was silently discarding streaming content)
- Tighten corrupted-session filter: 'session file is' + 'Invalid literal value'
  instead of overly broad 'session file' substring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Re-Review: fix/session-stability-hardening

Multi-model re-review (5/5 models) — updated after 5 additional commits

Previous Findings Status

# Finding Status
1 🔴 TOCTOU race in sibling re-resume ✅ FIXED (post-await re-check added)
2 🟡 Missing MCP servers in sibling config ✅ FIXED (LoadMcpServers + LoadSkillDirectories added)
3 🟡 Watchdog crash loses CurrentResponse ✅ FIXED (crashResponse = FlushedResponse + CurrentResponse)
4 🟡 Broad "session file" exception filter ✅ FIXED (tightened to "session file is" + "Invalid literal value")
5 🟢 Indentation corruption in TriggerToolHealthRecovery ❌ STILL PRESENT

New Finding (Consensus: 3/5 models)

🟡 MODERATE — Leaked in sibling re-resume abandon paths

File: (sibling loop, ~lines 2672 and 2703)

Two paths in the sibling re-resume loop abandon a live without disposing it:

Path 1 — post-await TOCTOU guard:

var resumed = await newClient.ResumeSessionAsync(otherState.Info.SessionId, cfg, CancellationToken.None);
if (otherState.Info.IsProcessing)
{
    Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
    continue;  // ← 'resumed' is never disposed
}

Path 2 — failure (rapid back-to-back reconnects):

{
    siblingState.IsOrphaned = true;
    continue;  // ← 'resumed' is never disposed
}

In both cases, has already created a server-side session with an open transport (stdio pipe or TCP connection to the copilot CLI process). Without , these handles accumulate on rapid reconnects or in multi-session workspaces. Under flapping network conditions with multiple sessions, this will exhaust file descriptors or produce stale server-side sessions.

Fix:

// Path 1
if (otherState.Info.IsProcessing)
{
    Debug($"[RECONNECT] Sibling started processing during re-resume — skipping");
    try { await resumed.DisposeAsync(); } catch { }
    continue;
}

// Path 2
{
    siblingState.IsOrphaned = true;
    try { await resumed.DisposeAsync(); } catch { }
    continue;
}

Remaining Minor Issue

** indentation** () — the call still has 24 extra leading spaces in the updated diff. Functionally harmless in C# but creates misleading visual structure.


Recommended Action

⚠️ Request changes (one remaining item)

All four major findings from the initial review are fixed. One new 🟡 MODERATE issue: the two CopilotSession leak paths need . This is a straightforward two-line fix. Once addressed the PR is ready to merge.

PureWeen and others added 2 commits March 14, 2026 22:46
- DisposeAsync resumed CopilotSession in TOCTOU guard (sibling started
  processing during re-resume) to prevent leaked server-side handles
- DisposeAsync resumed CopilotSession in TryUpdate failure (rapid
  back-to-back reconnects) to prevent file descriptor exhaustion
- Fix 24-space indentation anomaly in TriggerToolHealthRecovery

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The debug log claimed the old session was disposed but DisposeAsync
was never called. Removing the claim rather than adding disposal,
because CopilotSession.DisposeAsync may send a close command to the
server that would break the resumed session (same session ID).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Round 7 Review — Parallel 3-Agent Deep Dive

Agents: Agent-8 (sibling re-resume), Agent-9 (Events.cs INV-1), Agent-10 (reconnect paths)

Triage Summary

# Agent Finding Verdict Reasoning
1 Agent-9 CompleteResponse missing FlushCurrentResponse False positive CompleteResponse does the flush work inline (lines 968-1002): reads CurrentResponse→adds to history→builds fullResponse→clears. It IS the flush.
2 Agent-9 TriggerToolHealthRecovery missing FlushCurrentResponse False positive Also reads both buffers inline and clears them (lines 1699-1708). Documented as "by design" in processing-state-safety skill.
3 Agent-8 Missing ResponseCompletion init on sibling state False positive Siblings are idle (IsProcessing=true siblings are skipped). TCS is created by SendPromptAsync when next prompt arrives. Not needed for idle sessions.
4 Agent-8 Missing ProcessingGeneration increment on sibling state False positive Old state has long.MaxValue (blocks stale callbacks). New state at 0 is correct — identical to any fresh session. Primary path increments because it's mid-turn.
5 Agent-8 Missing CurrentResponse/FlushedResponse/PendingReasoningMessages clear False positive New SessionState object has fresh StringBuilder instances — already empty by construction.
6 Agent-10 Missing DisposeAsync of old session in primary reconnect Won't fix (risky) CopilotSession.DisposeAsync may send a close command to the server, breaking the resumed session (same session ID).
7 Agent-10 Log says "old session disposed" but no disposal happens Fixed Removed misleading text from log message (commit f2c337e).

Fix Applied

  • f2c337e — Corrected misleading "old session disposed" debug log message. The old CopilotSession is intentionally NOT disposed because DisposeAsync could send a server-side close that breaks the resumed session sharing the same session ID.

Status

  • 2554/2554 tests passing
  • 12 commits on the PR (all incremental, preserving full review history)
  • 7 rounds of review (4 external multi-model consensus + 3 from this session)
  • All critical and moderate findings across all rounds addressed

The PR is ready for merge. 🚀

…bling loop

Primary reconnect path (SendPromptAsync connection error recovery) was
missing McpServers and SkillDirectories in the ResumeSessionConfig,
causing MCP tool access to be lost after a connection drop. The sibling
path was already fixed in Round 5 — this aligns the primary path.

Also threads the outer cancellationToken into the sibling re-resume
Task.Run loop for clean shutdown behavior instead of fire-and-forget
with CancellationToken.None.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 37abe13 into main Mar 15, 2026
@PureWeen PureWeen deleted the fix/session-stability-hardening branch March 15, 2026 14:28
PureWeen added a commit that referenced this pull request Mar 17, 2026
- multi-agent-orchestration SKILL.md: 4→5 phase lifecycle, new
  IDLE-DEFER section, fix INV-O3 ordering, new INV-O15, mark
  premature idle bug as FIXED
- processing-state-safety SKILL.md: 10→16 paths table with tags,
  new INV-18 for BackgroundTasks, IDLE-DEFER in stuck session table,
  note EVT-REARM is now secondary defense, add PRs #373/#375/#399
  to regression history
- copilot-instructions.md: update SDK Event Flow step 9 for
  BackgroundTasks check, add [IDLE-DEFER] diagnostic tag, fix
  stale path count (8→15+), add BackgroundTasksIdleTests to test list

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
## Problem

Sessions die instantly after reconnect due to compounding bugs:

1. **Corrupted session files kill sessions permanently** - When
events.jsonl contains tool output with words like "ephemeral", the CLI
JSON parser chokes on session.resume. No catch existed, so the session
dies.
2. **Orphaned event handlers race with new state** - Stale
SessionIdleEvent callbacks from the old session clear IsProcessing on
the shared Info object, killing the new session. Each reconnect stacks
another handler.
3. **Watchdog crashes leave sessions permanently stuck** - If watchdog
cleanup throws, IsProcessing was never cleared.
4. **posix_spawn failures not detected** - Shell failures did not
trigger auto-recovery.
5. **relaunch.sh kills copilot server** - Grep pattern matched the
bundled copilot binary too.

## Fixes

- **Corrupted session file recovery**: Catch "corrupted" in resume,
create fresh session
- **IsOrphaned flag**: HandleSessionEvent and CompleteResponse skip
orphaned states
- **Generation invalidation**: Old state ProcessingGeneration set to
long.MaxValue
- **Watchdog crash safety**: Outer catch clears IsProcessing (INV-1
compliant)
- **Shell failure detection**: posix_spawn treated like permission
denial for recovery
- **relaunch.sh**: End-of-line anchor to only match app binary

## Testing
2549 tests pass, 5 pre-existing failures (font scaling + DevTunnel -
unrelated)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant