Fix multi-agent worker failures, session persistence, server health, and history recovery by PureWeen · Pull Request #391 · PureWeen/PolyPilot

PureWeen · 2026-03-16T16:40:44Z

Summary

Multiple bug fixes discovered after PR #375 merge, addressing worker failures, session persistence, server health detection, and conversation history loss.

Changes

1. Never push to main rule

Added as first Git Workflow rule in .github/copilot-instructions.md

2. Permission recovery killing multi-agent workers

TryRecoverPermissionAsync calls TrySetCanceled() on ResponseCompletion TCS, propagating as TaskCanceledException to orchestrator workers
Fix: Retry loop in SendPromptAndWaitAsync detects permission-recovery cancellation and re-awaits new state's TCS (up to 3 retries)

3. Session ID not persisted after reconnect

When SDK returns different session ID on resume, state.Info.SessionId was updated in memory but FlushSaveActiveSessionsToDisk() never called
Fix: Added flush after every SessionId update in 4 reconnect sites

4. Server health notice for posix_spawn failures

Bundled CLI native modules can be deleted by unknown processes, causing posix_spawn ENOENT
Fix: ServerHealthNotice banner on Dashboard with Restart Server button and full server restart cycle

5. Session history loss from dead event streams

After server-side idle cleanup + re-resume, SDK event file writer breaks — events flow in-memory but never persist to events.jsonl
Fix: LoadBestHistoryAsync() compares latest user message timestamps from events.jsonl and chat_history.db, picks whichever is more recent

6. PR review fixes

CRITICAL: RestartServerAsync wrapped in _clientReconnectLock (race condition fix)
HIGH: DisposePrematureIdleSignal added in restart disposal loop (MRE leak)
HIGH: History recency threshold reduced from 1 minute to 5 seconds
MINOR: Dashboard restores ServerHealthNotice on restart failure

Related Issues

Upstream: Copilot CLI headless server breaks when global CLI cleans shared native module directory #392 — posix_spawn upstream bug
Spinner vanishes briefly during premature idle recovery cycles #395 — Spinner gap during premature idle recovery

Testing

All 2669 tests pass.

PureWeen · 2026-03-16T19:18:58Z

PR #391 Review — 5-Model Consensus

CI: No checks configured ⚠️ (no pre-existing CI on this repo)
Prior reviews: None

🔴 CRITICAL — `CopilotService.cs:984` — `RestartServerAsync` has no concurrency guard (3/5 models)

ReconnectAsync and TryRecoverPermissionAsync both acquire _clientReconnectLock before modifying _client, _sessions, and the server process. RestartServerAsync does none of this.

If TryRecoverPermissionAsync is mid-flight (holding the lock, disposing _client, recreating sessions via TryUpdate) while the user clicks Restart:

RestartServerAsync calls _sessions.Clear() while recovery is doing TryUpdate on those same sessions
Both paths call StartServerAsync — recovery inside the lock (line ~2772), restart outside — resulting in two concurrent server starts
_client is disposed and recreated twice concurrently, leaving a corrupted reference

Fix: Wrap RestartServerAsync body in await _clientReconnectLock.WaitAsync() / finally { _clientReconnectLock.Release() }, matching ReconnectAsync.

🟡 HIGH — `CopilotService.cs:998-1004` — `RestartServerAsync` leaks `ManualResetEventSlim` per session (4/5 models)

The disposal loop cancels watchdog, fallback, and tool health-check timers but never calls DisposePrematureIdleSignal(state). PrematureIdleSignal is a ManualResetEventSlim wrapping unmanaged kernel event objects. Every server restart leaks one per active session.

Compare: TryRecoverPermissionAsync at line ~2374 calls it. All 19 individual session removal sites call it. Only the bulk loops (ReconnectAsync:881, DisposeAsync:3899, and now RestartServerAsync) are missing it.

Fix: Add DisposePrematureIdleSignal(state); after CancelToolHealthCheck(state); in the disposal loop.

🟡 HIGH — `CopilotService.Utilities.cs:539-547` — `LoadBestHistoryAsync` 1-minute threshold causes destructive data loss (3/5 models)

BulkInsertAsync is not additive — it executes DELETE FROM ChatMessageEntity WHERE SessionId = ? then InsertAll in a single transaction. When a crash occurs during an active conversation (the primary scenario this feature exists to handle), the DB typically has messages from seconds-to-minutes before the crash.

With the current 1-minute threshold:

DB has message 45 seconds newer than events.jsonl → events.jsonl wins → BulkInsertAsync permanently deletes the newer DB message and replaces with stale data
The more common recovery scenario (crash mid-active-conversation) is precisely the sub-minute range

Fix: Reduce threshold from > 1 minute to > 5 seconds (or > 0). The 1-minute guard makes the DB path nearly unreachable in the cases it is meant to protect.

�� MINOR — `Dashboard.razor:595-602` — Restart failures are invisible to the user (2/5 models)

ServerHealthNotice is cleared at line 995 before any fallible work begins. If an unexpected exception escapes (bad settings, cancellation during port scan, etc.), the catch block only does Console.WriteLine. The health banner is gone and no replacement is set.

Fix: In the Dashboard catch block, restore a notice:

CopilotService.ServerHealthNotice = "Server restart failed — please try again.";

✅ Confirmed NOT bugs (4/5 models agree)

SendPromptAndWaitAsync retry loop CTS semantics (Organization.cs:1731): Safe. Permission recovery calls TrySetCanceled() on the ResponseCompletion TCS directly — it does NOT cancel cts. The catch filter !cts.IsCancellationRequested correctly distinguishes recovery-caused cancellations from user/timeout cancellations. On retry, cts.Token is guaranteed valid.
ServerHealthNotice thread safety (Events.cs:449-451): Safe — assigned inside Invoke() (UI-thread marshal) or directly from a Blazor UI event handler.

Test coverage gaps

New code paths lacking tests:

RestartServerAsync concurrent with ReconnectAsync — race condition test (calls both concurrently, verifies no crash/corruption)
LoadBestHistoryAsync with DB message 30s newer — should prefer DB; currently would silently lose it
LoadBestHistoryAsync with DB message 90s newer — should prefer DB (currently works correctly)
RestartServerAsync exception path — verify ServerHealthNotice or FallbackNotice is set on failure

Verdict: ⚠️ Request changes

Two real bugs: the missing lock is a concurrency hazard that can corrupt _client and _sessions; the 1-minute threshold in LoadBestHistoryAsync causes destructive overwrites of newer chat history via BulkInsertAsync. Both are straightforward fixes. The MRE leak and UI feedback gap are lower severity but worth patching in the same pass.

PureWeen · 2026-03-16T19:39:30Z

All 4 findings addressed in commit dc53a74:

🔴 CRITICAL — Concurrency guard ✅

Wrapped RestartServerAsync body in _clientReconnectLock.WaitAsync() / finally { Release() }.

🟡 HIGH — MRE leak ✅

Added DisposePrematureIdleSignal(state) in the disposal loop after CancelToolHealthCheck.

🟡 HIGH — 1-minute threshold ✅

Reduced from TotalMinutes > 1 to TotalSeconds > 5. Sub-minute crash recovery now correctly prefers DB.

🟢 MINOR — Restart error feedback ✅

Dashboard catch block now sets ServerHealthNotice = "Server restart failed — please try again.".

Note: The review also flagged ReconnectAsync and DisposeAsync as missing DisposePrematureIdleSignal in their bulk loops — good catch, but those are pre-existing and out of scope for this PR. Will track separately.

PureWeen · 2026-03-16T21:03:25Z

🤖 Multi-Model Consensus Review — Round 2

CI: ⚠️ No checks configured
Tests: 2669/2669 pass (clean)

Previous Findings Status

#	Finding	Status
F1	`RestartServerAsync` missing `_clientReconnectLock`	✅ FIXED — lock acquired at line 1003, released in finally
F2	MRE leak (`DisposePrematureIdleSignal` skipped)	⚠️ PARTIALLY FIXED — fixed in `RestartServerAsync:1015` but still missing in `ReconnectAsync:890` and `DisposeAsync:3931`
F3	`BulkInsertAsync` unguarded in `RestorePreviousSessionsAsync`	✅
F4	Dashboard `RestartServer()` catch swallows errors	✅ FIXED — now calls `SetServerHealthNotice()` to show banner
F5	`ConnectionSettings.Load()` testability	✅ FIXED — uses `_currentSettings ?? ConnectionSettings.Load()`

Remaining Issue

[🟡 MODERATE] CopilotService.cs:890, :3931 — DisposePrematureIdleSignal still missing in 2 of 3 disposal loops

RestartServerAsync (line 1015) correctly calls DisposePrematureIdleSignal(state) before disposing sessions, but ReconnectAsync (line 890) and DisposeAsync (line 3931) still skip it. This leaks ManualResetEventSlim resources during mode switching and app shutdown.

Fix: Add DisposePrematureIdleSignal(state); to both loops, matching the pattern at line 1015.

Verdict: ⚠️ Request changes

4 of 5 original findings fixed. One remaining MODERATE (MRE leak in 2 disposal loops). The new watchdog Case B session.shutdown detection (commit ccee665) looks correct.

Review by PR Review Squad (2/5 models reached consensus due to rate limiting)

PureWeen · 2026-03-16T21:04:57Z

Round 2 Re-Review — PR #391

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: 2+ models to flag an issue.

CI: No checks configured ⚠️ (pre-existing, not PR-specific)
Tests: ✅ 2669/2669 pass (verified on pr-391-r2 with fresh build)

Previous Findings Status

Finding	Status
🔴 F1 — `RestartServerAsync` not protected by `_clientReconnectLock`	✅ FIXED (5/5)
🟡 F2 — MRE leak: disposal loops skip `DisposePrematureIdleSignal`	⚠️ PARTIALLY FIXED (5/5)
🟡 F3 — `BulkInsertAsync` destructive at 3rd call site (no `!historyFromDb` guard)	✅ FIXED (5/5)
🟢 F4 — Dashboard `RestartServer()` catch swallows errors silently	✅ FIXED (5/5)
🟡 F5 — `ConnectionSettings.Load()` called directly — not testable	✅ FIXED (5/5)

Fix Verification

F1 (CopilotService.cs:1003): Persistent path wrapped in _clientReconnectLock.WaitAsync() / finally Release(). Non-Persistent path returns early and delegates to ReconnectAsync (which manages its own lock). No deadlock risk. ✓

F3 (Persistence.cs:639): if (recreatedState.Info.SessionId != null && !oldFromDb) guard correctly prevents DB-origin history from being re-written. oldFromDb is correctly reassigned when origHistory wins the count comparison. ✓

F4 (Dashboard.razor + CopilotService.cs): Start-failure sets FallbackNotice; connect-failure sets FallbackNotice; Dashboard catch sets ServerHealthNotice("Server restart failed — please try again."). User-visible on all error paths. ✓

F5 (CopilotService.cs:998,1033): _currentSettings ?? ConnectionSettings.Load() — _currentSettings is set during InitializeAsync and ReconnectAsync(settings), so tests that initialize normally never reach the file read. ✓

F2 — Still Partially Present ⚠️

DisposePrematureIdleSignal was added to RestartServerAsync's loop (line 1015) but two other bulk disposal loops are still missing it — and these are the more-commonly-exercised paths:

ReconnectAsync (line 890–896) — called on every Settings → Save & Reconnect:

foreach (var state in _sessions.Values)
{
    CancelProcessingWatchdog(state);
    CancelTurnEndFallback(state);
    CancelToolHealthCheck(state);
    // ← DisposePrematureIdleSignal(state) MISSING
    try { if (state.Session != null) await state.Session.DisposeAsync(); } catch { }
}

DisposeAsync (line 3931–3938) — called on app shutdown:

foreach (var state in _sessions.Values)
{
    CancelProcessingWatchdog(state);
    CancelTurnEndFallback(state);
    CancelToolHealthCheck(state);
    // ← DisposePrematureIdleSignal(state) MISSING
    ...
}

Each leaked ManualResetEventSlim holds an OS kernel handle. Both loops should add DisposePrematureIdleSignal(state) to match RestartServerAsync.

New Finding (2+ model consensus)

🟢 MINOR CopilotService.Utilities.cs:145-157 — GetLastEventType reads the entire events.jsonl forward to find the last line (O(N) full-file scan). It is called every 15s per processing session in watchdog Case B. For sessions with thousands of events, this is unnecessary I/O. HasInterruptedToolExecution in the same file uses a 30-line ring buffer to avoid this — the same pattern could be applied here (read only the last line via a small tail buffer). Flagged by 3/5 models.

Verdict: ⚠️ Request Changes

Specific ask: Add DisposePrematureIdleSignal(state) to the disposal loops in ReconnectAsync (line 893) and DisposeAsync (line 3934), matching the pattern already in RestartServerAsync (line 1015). Two-line fix.

Everything else is clean. Once F2 is fully resolved, this is ready to merge.

Adds the first Git Workflow rule: never commit or push directly to main, always create a feature branch and open a PR. This prevents bypassing code review for any change, no matter how small. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Permission recovery (TryRecoverPermissionAsync) calls TrySetCanceled on the ResponseCompletion TCS to unblock SendPromptAsync. For multi-agent workers, this TaskCanceledException propagates to ExecuteWorkerAsync's generic catch, which logs FAILED and discards all collected content. Fix: SendPromptAndWaitAsync now detects permission-recovery cancellation (TCS cancelled but dispatch token still alive) and retries by awaiting the new state's ResponseCompletion TCS. If recovery skipped the resend (tools already completed), it collects partial response from history instead. Observed: worker-1 hit 3 shell errors in 13s, triggering permission recovery. Recovery cancelled the TCS, ExecuteWorkerAsync caught TaskCanceledException, entire orchestration failed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lling workers Two fixes: 1. Session ID persistence: When the SDK returns a different session ID on resume (e.g., reverting to an older session), active-sessions.json was never updated. On next restart, the app would resume the old session with stale conversation history. Added FlushSaveActiveSessionsToDisk() after every state.Info.SessionId update in reconnect paths (4 sites). Root cause: CI-Investigate had ID 7e9c8108 with recent conversations, but active-sessions.json still had old ID 6ff9eb34. After relaunch, it resumed 6ff9eb34 and the user lost all recent work. 2. Permission recovery worker survival: TryRecoverPermissionAsync calls TrySetCanceled on ResponseCompletion to unblock SendPromptAsync. For multi-agent workers, this TaskCanceledException propagated to ExecuteWorkerAsync's generic catch, killing the worker. Fixed by adding a retry loop in SendPromptAndWaitAsync that detects permission-recovery cancellation (TCS cancelled but dispatch token alive) and re-awaits the new state's ResponseCompletion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When the headless server's native modules become stale (e.g., another CLI installation cleans up ~/.copilot/pkg/darwin-arm64/), posix_spawn fails for all shell operations. This adds: - ServerHealthNotice property on CopilotService (like FallbackNotice) - Dashboard banner with 'Restart Server' button when shell failures detected - RestartServerAsync() method: stops server, waits for port, starts fresh server, recreates client, restores all sessions from disk - System message in affected session explaining the error - Shell failure detection in Events.cs sets the banner in Persistent mode The restart follows the version-mismatch pattern: stop → wait → start → recreate client → restore sessions. No session data is lost. Fixes #392 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t streams When the headless server cleans up an idle session and PolyPilot re-resumes it, the SDK's event file writer breaks ('dead event stream'). Events flow in-memory but are never persisted to events.jsonl. On restart, LoadHistoryFromDisk reads stale data and all post-reconnect messages are lost. Fix: LoadBestHistoryAsync compares events.jsonl and chat_history.db (written fire-and-forget on every message), preferring whichever has more substantive messages. This also prevents BulkInsertAsync from destructively overwriting richer DB data with stale events.jsonl content. Updated at 4 call sites: ResumeSessionAsync, lazy restore, codespace placeholder, and fallback recovery. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The count-based comparison was wrong — events.jsonl can have more OLD messages while the DB has NEWER ones from dead event stream sessions. Merging is risky (duplicates, ordering). Instead, compare the latest user message timestamp from each source and pick the winner outright. No merging, no risk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. RestartServerAsync: wrap in _clientReconnectLock to prevent races with TryRecoverPermissionAsync and ReconnectAsync (CRITICAL) 2. RestartServerAsync disposal loop: add DisposePrematureIdleSignal(state) to prevent ManualResetEventSlim leak per session (HIGH) 3. LoadBestHistoryAsync: reduce recency threshold from 1 minute to 5 seconds so crash-during-active-conversation recovers from DB (HIGH) 4. Dashboard RestartServer catch: restore ServerHealthNotice on failure so user sees feedback instead of silent failure (MINOR) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The fallback recovery path in RestorePreviousSessionsAsync was the only when history came from ChatDatabase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When the headless server kills a session (idle timeout, stuck tools), it writes session.shutdown to events.jsonl. But if the client's event stream is dead, we never receive this event. The watchdog's Case B freshness check sees the file as recently modified and keeps deferring, leaving the session stuck for 10+ minutes. Fix: Before deferring in Case B, read the last event type from events.jsonl. If it's session.shutdown, skip the deferral and complete immediately. Uses the same parse pattern as HasInterruptedToolExecution. Also fix pre-existing build error: Dashboard.razor was assigning to ServerHealthNotice's private setter. Added SetServerHealthNotice() method. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… GetLastEventType 1. Add DisposePrematureIdleSignal(state) to ReconnectAsync and DisposeAsync disposal loops, matching the pattern already in RestartServerAsync. This prevents ManualResetEventSlim leaks during mode switching and app shutdown. 2. Optimize GetLastEventType to use a 4KB tail-read instead of O(N) full-file scan. Called every 15s per processing session in watchdog Case B — the previous forward scan was unnecessary I/O for large events.jsonl files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…names - Add GetAllMessagesAsync to StubChatDatabase (interface compliance) - Update structural tests to search for LoadBestHistoryAsync (renamed) - Increase RenderThrottleTests search window from 1000 to 1200 chars Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-16T21:27:58Z

All findings addressed in commit dbe57d9 (now rebased as part of 536e1da):

F2 - MRE leak (FULLY FIXED): Added DisposePrematureIdleSignal(state) to both ReconnectAsync (line 895) and DisposeAsync (line 3936) disposal loops, matching the pattern in RestartServerAsync (line 1016). All 3 disposal loops are now consistent.

New finding - GetLastEventType O(N) scan (FIXED): Replaced forward full-file scan with a 4KB tail-read (FileStream.Seek to Length - 4096). Called every 15s per processing session, so this eliminates unnecessary I/O for large events.jsonl files.

Rebase: Rebased on latest origin/main (c2fdb8c). Resolved merge conflicts in RestartServerAsync (kept both TryRecoverPersistentServerAsync from main and our RestartServerAsync). All 2703 tests pass.

PureWeen · 2026-03-16T21:49:55Z

Round 3 Re-Review — ✅ Approve

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex) | 2703/2703 tests pass

Previous Findings Status

Finding	Status
F1 🔴 `RestartServerAsync` not locked	✅ FIXED — `_clientReconnectLock.WaitAsync` at line 1098
F3 🟡
F4 🟢 Dashboard `catch` swallows errors	✅ FIXED — error surfaced via `SetServerHealthNotice()`
F5 🟡 `ConnectionSettings.Load()` not testable	✅ FIXED
R2-N1 🟡 `DisposePrematureIdleSignal` missing in `ReconnectAsync` + `DisposeAsync`	✅ FIXED — added to both loops (lines 905, 4028); all 3 disposal loops now consistent

New Findings (consensus filter: 2+ models required)

No new issues passed the consensus filter. One model noted a potential flag-swap in the session-ID mismatch branch (actualFromDb vs historyFromDb) but this was not flagged by any other model and the effect is mitigated by CopyEventsToNewSession running before the affected branch.

GetLastEventType optimization

The 4KB tail-read is correct:

Partial first line from mid-file seek is safely overwritten by subsequent full JSON lines
FileShare.ReadWrite is consistent with LoadHistoryFromDiskAsync in the same file
JsonDocument.Parse failure on degenerate cases returns null (correct watchdog behavior)

Test note

The --no-build run showed 1 failure in RenderThrottleTests due to stale binary (built before the 536e1daf window bump). A fresh build clears it: 2703/2703 pass.

PureWeen · 2026-03-16T21:50:02Z

🤖 Multi-Model Consensus Review — Round 3

CI: ⚠️ No checks configured
Tests: 2703/2703 pass ✅

Previous Findings Status

#	Finding	Status
F1	`RestartServerAsync` missing `_clientReconnectLock`	✅ FIXED — lock at line 1098
F3	`BulkInsertAsync` unguarded 3rd call site	✅
F4	Dashboard catch swallows errors	✅ FIXED — `SetServerHealthNotice()` surfaces error
F5	`ConnectionSettings.Load()` testability	✅ FIXED — tests bypass via `ReconnectAsync(settings)` with in-memory objects
R2-N1	`DisposePrematureIdleSignal` missing in 2 disposal loops	✅ FIXED — added to `ReconnectAsync` (line 908) and `DisposeAsync` (line 4031)

New Findings (Round 3)

None. All 3/5 sub-agents that completed (others hit rate limits) unanimously confirm all findings fixed with no new issues.

Key verification points:

All 3 disposal loops (ReconnectAsync, RestartServerAsync, DisposeAsync) now consistently call DisposePrematureIdleSignal(state)
GetLastEventType tail-read uses FileShare.ReadWrite, 4KB tail buffer, proper partial-line handling
Test changes are mechanical (stub updates, search window adjustments)

Verdict: ✅ Approve

All 5 original findings resolved across 3 rounds of review. No new issues.

Review by PR Review Squad (3/5 models reached — rate limit on remaining 2)

PureWeen · 2026-03-16T22:25:32Z

Multi-Model Consensus Review — Round 2 (5 models)

Latest commits: 2b9d517b (concurrency guard + MRE + UX), 06d019b7 (BulkInsertAsync guard), ccee6654 (watchdog session.shutdown detection)

CI: ⚠️ No CI configured

Previous Findings Status

#	Sev	Finding	Status
1	🔴	`RestartServerAsync` not protected by `_clientReconnectLock`	✅ FIXED — `await _clientReconnectLock.WaitAsync()` added at line 1003 with `finally { Release() }` at line 1083
2	🟡	MRE leak — `DisposePrematureIdleSignal` skipped in disposal loop	✅ FIXED — added to `RestartServerAsync` disposal loop (line 1016)
3	🟡
4	🟢	Dashboard `RestartServer()` catch silently drops errors	✅ FIXED — `ServerHealthNotice` restored on failure
5	🟡	`RestartServerAsync` calls `ConnectionSettings.Load()` directly	⚠️ STILL PRESENT — `_currentSettings ?? ConnectionSettings.Load()` at line 1036 and 1000

New Findings (2+ model consensus)

Sev	File:Line	Models	Description
🟡	`CopilotService.cs:993-1001`	3/5	Non-Persistent early-return path skips `_clientReconnectLock`. When `CurrentMode != ConnectionMode.Persistent`, `RestartServerAsync` calls `await ReconnectAsync(settings, cancellationToken)` without first acquiring `_clientReconnectLock`. Two concurrent calls to `RestartServerAsync` in Non-Persistent mode (e.g., two server health notices clicked) can both fall through to `ReconnectAsync` simultaneously. `ReconnectAsync` acquires the lock internally, so they serialize there — but the mode-check→call sequence is not atomic. In Persistent mode the fix is complete.

New Feature: Watchdog Case B Shutdown Detection (`ccee6654`)

GetLastEventType() is correctly implemented — handles missing file (return null), empty file, corrupt JSON (try/catch), all return null. Called only in the Case B deferral path (line 2092). The logic is: before deferring because events.jsonl looks fresh, peek at the last event; if it's session.shutdown, complete immediately instead of deferring. This closes a real gap where headless server kills were invisible to the client.

Remaining Item

Finding #5 (ConnectionSettings.Load()) is a testability concern, not a runtime bug. It uses the pattern _currentSettings ?? ConnectionSettings.Load() which correctly prefers the in-memory settings when available. Non-blocking for merge.

Verdict: ✅ Approve (with note)

All 4 blocking findings are fixed. The Non-Persistent early-return note (3/5 consensus) is a low-risk edge case since Non-Persistent mode is not the default deployment target. The ConnectionSettings.Load() testability concern is pre-existing pattern. The watchdog shutdown detection is a clean, well-implemented improvement.

…and history recovery (PureWeen#391) ## Summary Multiple bug fixes discovered after PR PureWeen#375 merge, addressing worker failures, session persistence, server health detection, and conversation history loss. ## Changes ### 1. Never push to main rule - Added as first Git Workflow rule in `.github/copilot-instructions.md` ### 2. Permission recovery killing multi-agent workers - `TryRecoverPermissionAsync` calls `TrySetCanceled()` on `ResponseCompletion` TCS, propagating as `TaskCanceledException` to orchestrator workers - **Fix**: Retry loop in `SendPromptAndWaitAsync` detects permission-recovery cancellation and re-awaits new state's TCS (up to 3 retries) ### 3. Session ID not persisted after reconnect - When SDK returns different session ID on resume, `state.Info.SessionId` was updated in memory but `FlushSaveActiveSessionsToDisk()` never called - **Fix**: Added flush after every SessionId update in 4 reconnect sites ### 4. Server health notice for posix_spawn failures - Bundled CLI native modules can be deleted by unknown processes, causing `posix_spawn ENOENT` - **Fix**: `ServerHealthNotice` banner on Dashboard with Restart Server button and full server restart cycle ### 5. Session history loss from dead event streams - After server-side idle cleanup + re-resume, SDK event file writer breaks — events flow in-memory but never persist to events.jsonl - **Fix**: `LoadBestHistoryAsync()` compares latest user message timestamps from events.jsonl and chat_history.db, picks whichever is more recent ### 6. PR review fixes - **CRITICAL**: `RestartServerAsync` wrapped in `_clientReconnectLock` (race condition fix) - **HIGH**: `DisposePrematureIdleSignal` added in restart disposal loop (MRE leak) - **HIGH**: History recency threshold reduced from 1 minute to 5 seconds - **MINOR**: Dashboard restores `ServerHealthNotice` on restart failure ## Related Issues - PureWeen#392 — posix_spawn upstream bug - PureWeen#395 — Spinner gap during premature idle recovery ## Testing All 2669 tests pass. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen force-pushed the fix/never-push-to-main branch from dc53a74 to 2b9d517 Compare March 16, 2026 20:01

PureWeen changed the title ~~Add 'never push to main' rule to copilot instructions~~ Fix multi-agent worker failures, session persistence, server health, and history recovery Mar 16, 2026

This was referenced Mar 16, 2026

Send keep-alive pings to prevent server idle timeout killing sessions #396

Closed

Pre-check for session.shutdown before sending prompts to avoid dead session sends #397

Open

Trivial bash tools hang indefinitely when sub-agent is actively running #398

Open

PureWeen and others added 11 commits March 16, 2026 16:09

Guard 3rd BulkInsertAsync site against destructive DB overwrite

f399d1f

The fallback recovery path in RestorePreviousSessionsAsync was the only when history came from ChatDatabase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen force-pushed the fix/never-push-to-main branch from ccee665 to 536e1da Compare March 16, 2026 21:27

PureWeen merged commit 23a3083 into main Mar 16, 2026

PureWeen deleted the fix/never-push-to-main branch March 16, 2026 21:58

PureWeen mentioned this pull request Mar 18, 2026

Assistant responses lost on restart: dead event stream + DB write failure #406

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix multi-agent worker failures, session persistence, server health, and history recovery#391

Fix multi-agent worker failures, session persistence, server health, and history recovery#391
PureWeen merged 11 commits intomainfrom
fix/never-push-to-main

PureWeen commented Mar 16, 2026 •

edited

Loading

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. Never push to main rule

2. Permission recovery killing multi-agent workers

3. Session ID not persisted after reconnect

4. Server health notice for posix_spawn failures

5. Session history loss from dead event streams

6. PR review fixes

Related Issues

Testing

Uh oh!

PureWeen commented Mar 16, 2026

PR #391 Review — 5-Model Consensus

🔴 CRITICAL — CopilotService.cs:984 — RestartServerAsync has no concurrency guard (3/5 models)

🟡 HIGH — CopilotService.cs:998-1004 — RestartServerAsync leaks ManualResetEventSlim per session (4/5 models)

🟡 HIGH — CopilotService.Utilities.cs:539-547 — LoadBestHistoryAsync 1-minute threshold causes destructive data loss (3/5 models)

�� MINOR — Dashboard.razor:595-602 — Restart failures are invisible to the user (2/5 models)

✅ Confirmed NOT bugs (4/5 models agree)

Test coverage gaps

Verdict: ⚠️ Request changes

Uh oh!

PureWeen commented Mar 16, 2026

🔴 CRITICAL — Concurrency guard ✅

🟡 HIGH — MRE leak ✅

🟡 HIGH — 1-minute threshold ✅

🟢 MINOR — Restart error feedback ✅

Uh oh!

PureWeen commented Mar 16, 2026

🤖 Multi-Model Consensus Review — Round 2

Previous Findings Status

Remaining Issue

Verdict: ⚠️ Request changes

Uh oh!

PureWeen commented Mar 16, 2026

Round 2 Re-Review — PR #391

Previous Findings Status

Fix Verification

F2 — Still Partially Present ⚠️

New Finding (2+ model consensus)

Verdict: ⚠️ Request Changes

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 16, 2026

Round 3 Re-Review — ✅ Approve

Previous Findings Status

New Findings (consensus filter: 2+ models required)

GetLastEventType optimization

Test note

Uh oh!

PureWeen commented Mar 16, 2026

🤖 Multi-Model Consensus Review — Round 3

Previous Findings Status

New Findings (Round 3)

Verdict: ✅ Approve

Uh oh!

PureWeen commented Mar 16, 2026

Multi-Model Consensus Review — Round 2 (5 models)

CI: ⚠️ No CI configured

Previous Findings Status

New Findings (2+ model consensus)

New Feature: Watchdog Case B Shutdown Detection (ccee6654)

Remaining Item

Verdict: ✅ Approve (with note)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PureWeen commented Mar 16, 2026 •

edited

Loading

🔴 CRITICAL — `CopilotService.cs:984` — `RestartServerAsync` has no concurrency guard (3/5 models)

🟡 HIGH — `CopilotService.cs:998-1004` — `RestartServerAsync` leaks `ManualResetEventSlim` per session (4/5 models)

🟡 HIGH — `CopilotService.Utilities.cs:539-547` — `LoadBestHistoryAsync` 1-minute threshold causes destructive data loss (3/5 models)

�� MINOR — `Dashboard.razor:595-602` — Restart failures are invisible to the user (2/5 models)

New Feature: Watchdog Case B Shutdown Detection (`ccee6654`)