Fix multi-agent worker failures, session persistence, server health, and history recovery #391
Conversation
**PR #391 Review — 5-Model Consensus**

CI: No checks configured

- 🔴 CRITICAL — RestartServerAsync not protected by _clientReconnectLock
- 🟡 HIGH — MRE leak: DisposePrematureIdleSignal skipped in the disposal loop
- 🟡 HIGH — History recency threshold too coarse (1 minute)
- 🟢 MINOR — Dashboard RestartServer() catch silently drops errors

All 4 findings addressed in commit:

- 🔴 CRITICAL — Concurrency guard ✅ Wrapped RestartServerAsync in _clientReconnectLock
- 🟡 HIGH — MRE leak ✅ Added DisposePrematureIdleSignal(state) to the restart disposal loop
- 🟡 HIGH — 1-minute threshold ✅ Reduced from 1 minute to 5 seconds
- 🟢 MINOR — Restart error feedback ✅ Dashboard catch block now sets ServerHealthNotice on failure

Note: The review also flagged the direct ConnectionSettings.Load() call in RestartServerAsync (a testability concern).
Force-pushed from dc53a74 to 2b9d517
**🤖 Multi-Model Consensus Review — Round 2**

CI: No checks configured

Previous Findings Status: F1, F3, F4, F5 verified fixed; F2 (MRE leak) still partially present.

Remaining Issue: DisposePrematureIdleSignal(state) is still missing from the ReconnectAsync and DisposeAsync disposal loops.

Fix: Add DisposePrematureIdleSignal(state) to those loops, matching the pattern already used in RestartServerAsync.
**Round 2 Re-Review — PR #391**

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: 2+ models required to flag an issue. CI: No checks configured

Fix Verification: F1 (_clientReconnectLock guard), F3 (recency threshold), F4 (Dashboard error feedback), and F5 (ConnectionSettings.Load()) verified fixed.

F2 — Still Partially Present: the MRE disposal fix covers the RestartServerAsync loop, but DisposePrematureIdleSignal(state) is still missing from the ReconnectAsync and DisposeAsync disposal loops.
Adds the first Git Workflow rule: never commit or push directly to main; always create a feature branch and open a PR. This prevents bypassing code review for any change, no matter how small.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Permission recovery (TryRecoverPermissionAsync) calls TrySetCanceled on the ResponseCompletion TCS to unblock SendPromptAsync. For multi-agent workers, this TaskCanceledException propagates to ExecuteWorkerAsync's generic catch, which logs FAILED and discards all collected content.

Fix: SendPromptAndWaitAsync now detects permission-recovery cancellation (TCS cancelled but dispatch token still alive) and retries by awaiting the new state's ResponseCompletion TCS. If recovery skipped the resend (tools already completed), it collects the partial response from history instead.

Observed: worker-1 hit 3 shell errors in 13s, triggering permission recovery. Recovery cancelled the TCS, ExecuteWorkerAsync caught TaskCanceledException, and the entire orchestration failed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
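For context, a minimal sketch of the retry pattern this commit describes, assuming illustrative names (`SessionState`, `GetCurrentState`) around the real `SendPromptAndWaitAsync` / `ResponseCompletion` identifiers; this is not the actual PolyPilot implementation:

```csharp
// Sketch: when the ResponseCompletion TCS is cancelled by permission
// recovery (dispatch token still alive), re-await the replacement TCS
// instead of letting TaskCanceledException kill the worker.
async Task<string> SendPromptAndWaitAsync(
    SessionState state, CancellationToken dispatchToken, int maxRetries = 3)
{
    for (var attempt = 0; ; attempt++)
    {
        try
        {
            return await state.ResponseCompletion.Task; // normal completion
        }
        catch (TaskCanceledException) when (
            !dispatchToken.IsCancellationRequested && attempt < maxRetries)
        {
            // Permission recovery cancelled the TCS and swapped in a new
            // session state; pick it up and wait again. (The real fix also
            // falls back to collecting a partial response from history.)
            state = GetCurrentState(state.Info.SessionId);
        }
    }
}
```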
…lling workers

Two fixes:

1. Session ID persistence: When the SDK returns a different session ID on resume (e.g., reverting to an older session), active-sessions.json was never updated. On the next restart, the app would resume the old session with stale conversation history. Added FlushSaveActiveSessionsToDisk() after every state.Info.SessionId update in the reconnect paths (4 sites).

   Root cause: CI-Investigate had ID 7e9c8108 with recent conversations, but active-sessions.json still had the old ID 6ff9eb34. After relaunch, it resumed 6ff9eb34 and the user lost all recent work.

2. Permission recovery worker survival: TryRecoverPermissionAsync calls TrySetCanceled on ResponseCompletion to unblock SendPromptAsync. For multi-agent workers, this TaskCanceledException propagated to ExecuteWorkerAsync's generic catch, killing the worker. Fixed by adding a retry loop in SendPromptAndWaitAsync that detects permission-recovery cancellation (TCS cancelled but dispatch token alive) and re-awaits the new state's ResponseCompletion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
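A sketch of the persistence fix in item 1; `FlushSaveActiveSessionsToDisk` and `state.Info.SessionId` are named in the commit, while the surrounding resume call is an assumption:

```csharp
// Sketch of one of the four reconnect sites: if the SDK hands back a
// different session ID on resume, persist it immediately so the next
// launch does not resume the stale ID from active-sessions.json.
var resumed = await client.ResumeSessionAsync(state.Info.SessionId, ct);
if (resumed.SessionId != state.Info.SessionId)
{
    state.Info.SessionId = resumed.SessionId;
    FlushSaveActiveSessionsToDisk(); // previously missing: the in-memory
                                     // update never reached disk
}
```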
When the headless server's native modules become stale (e.g., another CLI installation cleans up ~/.copilot/pkg/darwin-arm64/), posix_spawn fails for all shell operations. This adds:

- ServerHealthNotice property on CopilotService (like FallbackNotice)
- Dashboard banner with a 'Restart Server' button when shell failures are detected
- RestartServerAsync() method: stops the server, waits for the port, starts a fresh server, recreates the client, restores all sessions from disk
- System message in the affected session explaining the error
- Shell failure detection in Events.cs that sets the banner in Persistent mode

The restart follows the version-mismatch pattern: stop → wait → start → recreate client → restore sessions. No session data is lost.

Fixes #392

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
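A sketch of the restart cycle this commit describes (stop → wait → start → recreate client → restore sessions); apart from `RestartServerAsync` and `_clientReconnectLock`, the helper names are assumptions:

```csharp
// Sketch: full server restart following the version-mismatch pattern.
// The _clientReconnectLock guard was added later, during review.
public async Task RestartServerAsync(CancellationToken ct)
{
    await _clientReconnectLock.WaitAsync(ct);
    try
    {
        await StopServerAsync(ct);              // stop the unhealthy server
        await WaitForPortReleaseAsync(ct);      // wait for the port to free
        await StartServerAsync(ct);             // launch a fresh process
        _client = await CreateClientAsync(ct);  // recreate the SDK client
        await RestoreSessionsFromDiskAsync(ct); // resume every saved session
    }
    finally
    {
        _clientReconnectLock.Release();
    }
}
```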
…t streams
When the headless server cleans up an idle session and PolyPilot re-resumes it,
the SDK's event file writer breaks ('dead event stream'). Events flow in-memory
but are never persisted to events.jsonl. On restart, LoadHistoryFromDisk reads
stale data and all post-reconnect messages are lost.
Fix: LoadBestHistoryAsync compares events.jsonl and chat_history.db (written
fire-and-forget on every message), preferring whichever has more substantive
messages. This also prevents BulkInsertAsync from destructively overwriting
richer DB data with stale events.jsonl content.
Updated at 4 call sites: ResumeSessionAsync, lazy restore, codespace
placeholder, and fallback recovery.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The count-based comparison was wrong — events.jsonl can have more OLD messages while the DB has NEWER ones from dead event stream sessions. Merging is risky (duplicates, ordering). Instead, compare the latest user message timestamp from each source and pick the winner outright. No merging, no risk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
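A sketch of the timestamp comparison this commit settles on; `LoadBestHistoryAsync` is the real method name, while the reader helpers and the `ChatMessage` shape are assumptions (requires `System.Linq`):

```csharp
// Sketch: prefer whichever source holds the more recent USER message,
// not whichever holds more messages. No merging, so no duplicate or
// ordering risk between the two sources.
async Task<List<ChatMessage>> LoadBestHistoryAsync(string sessionId)
{
    var fromEvents = await ReadEventsJsonlAsync(sessionId);  // may be stale
    var fromDb     = await ReadChatDatabaseAsync(sessionId); // fire-and-forget writes

    static DateTimeOffset LatestUser(List<ChatMessage> msgs) =>
        msgs.Where(m => m.Role == "user")
            .Select(m => (DateTimeOffset?)m.Timestamp)
            .Max() ?? DateTimeOffset.MinValue;

    return LatestUser(fromDb) > LatestUser(fromEvents) ? fromDb : fromEvents;
}
```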
1. RestartServerAsync: wrap in _clientReconnectLock to prevent races with TryRecoverPermissionAsync and ReconnectAsync (CRITICAL)
2. RestartServerAsync disposal loop: add DisposePrematureIdleSignal(state) to prevent a ManualResetEventSlim leak per session (HIGH)
3. LoadBestHistoryAsync: reduce the recency threshold from 1 minute to 5 seconds so a crash during an active conversation recovers from the DB (HIGH)
4. Dashboard RestartServer catch: restore ServerHealthNotice on failure so the user sees feedback instead of a silent failure (MINOR)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
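A sketch of the leak fix in item 2; `DisposePrematureIdleSignal` is from the commit, everything else is illustrative:

```csharp
// Sketch: dispose the per-session ManualResetEventSlim before dropping
// the session state; otherwise one wait handle leaks per session on
// every server restart.
foreach (var state in _sessions.Values)
{
    DisposePrematureIdleSignal(state);          // frees the MRE
    state.ResponseCompletion?.TrySetCanceled(); // unblock any waiter
    // ...remaining per-session teardown...
}
```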
The fallback recovery path in RestorePreviousSessionsAsync was the only … when history came from ChatDatabase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the headless server kills a session (idle timeout, stuck tools), it writes session.shutdown to events.jsonl. But if the client's event stream is dead, we never receive this event. The watchdog's Case B freshness check sees the file as recently modified and keeps deferring, leaving the session stuck for 10+ minutes.

Fix: Before deferring in Case B, read the last event type from events.jsonl. If it's session.shutdown, skip the deferral and complete immediately. Uses the same parse pattern as HasInterruptedToolExecution.

Also fix a pre-existing build error: Dashboard.razor was assigning to ServerHealthNotice's private setter. Added a SetServerHealthNotice() method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
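A sketch of the Case B decision described above; `GetLastEventType` and the `session.shutdown` event name come from the commit, while the surrounding watchdog helpers are assumptions:

```csharp
// Sketch: before deferring because events.jsonl looks fresh, peek at
// the final event. A trailing session.shutdown means the server already
// killed the session and the dead event stream hid it from us.
if (GetLastEventType(eventsJsonlPath) == "session.shutdown")
{
    CompleteSessionImmediately(state); // do not wait 10+ minutes
}
else
{
    DeferWatchdogCheck(state);         // genuinely fresh; re-check in 15s
}
```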
… GetLastEventType

1. Add DisposePrematureIdleSignal(state) to the ReconnectAsync and DisposeAsync disposal loops, matching the pattern already in RestartServerAsync. This prevents ManualResetEventSlim leaks during mode switching and app shutdown.

2. Optimize GetLastEventType to use a 4KB tail-read instead of an O(N) full-file scan. Called every 15s per processing session in watchdog Case B — the previous forward scan was unnecessary I/O for large events.jsonl files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
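A sketch of the 4KB tail-read in item 2: seek near EOF, split the tail into lines, and parse the `type` of the last complete line. The null-on-missing/empty/corrupt behavior matches what the review verified; the implementation details are otherwise assumed (requires `System.IO`, `System.Text.Json`, and .NET 7+ for `ReadExactly`):

```csharp
// Sketch: read at most the last 4KB of events.jsonl instead of scanning
// the whole file. A window that cuts mid-line only corrupts the first
// entry in the buffer, never the final complete line we parse.
static string? GetLastEventType(string path)
{
    try
    {
        if (!File.Exists(path)) return null;     // missing file
        using var fs = File.OpenRead(path);
        var tail = Math.Min(fs.Length, 4096);
        if (tail == 0) return null;              // empty file
        fs.Seek(-tail, SeekOrigin.End);
        var buf = new byte[tail];
        fs.ReadExactly(buf);
        var lines = System.Text.Encoding.UTF8.GetString(buf)
            .Split('\n', StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0) return null;
        using var doc = System.Text.Json.JsonDocument.Parse(lines[^1]);
        return doc.RootElement.GetProperty("type").GetString();
    }
    catch { return null; }                       // corrupt JSON, IO races
}
```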
…names

- Add GetAllMessagesAsync to StubChatDatabase (interface compliance)
- Update structural tests to search for LoadBestHistoryAsync (renamed)
- Increase the RenderThrottleTests search window from 1000 to 1200 chars

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from ccee665 to 536e1da
All findings addressed in commit dbe57d9 (now rebased as part of 536e1da):

- F2 - MRE leak (FULLY FIXED): Added DisposePrematureIdleSignal(state) to the ReconnectAsync and DisposeAsync disposal loops, matching the pattern in RestartServerAsync.
- New finding - GetLastEventType O(N) scan (FIXED): Replaced the forward full-file scan with a 4KB tail-read.
- Rebase: Rebased on latest origin/main (c2fdb8c). Resolved merge conflicts.
**Round 3 Re-Review — ✅ Approve**

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex) | 2703/2703 tests pass

Previous Findings Status: all resolved.

New Findings (consensus filter: 2+ models required): No new issues passed the consensus filter. One model noted a potential flag-swap in the session-ID mismatch branch, which did not reach consensus.

GetLastEventType optimization: The 4KB tail-read is correct: missing file, empty file, and corrupt JSON all return null.

Test note: test updates accompany the changes (StubChatDatabase stub method, LoadBestHistoryAsync rename, RenderThrottleTests window).
**🤖 Multi-Model Consensus Review — Round 3**

CI: No checks configured

Previous Findings Status: all fixed.

New Findings (Round 3): None. All 3/5 sub-agents that completed (the others hit rate limits) unanimously confirm all findings fixed, with no new issues.

Verdict: ✅ Approve. All 5 original findings resolved across 3 rounds of review. No new issues.

Review by PR Review Squad (3/5 models reached — rate limit on remaining 2)
**Multi-Model Consensus Review — Round 2 (5 models)**

CI: No checks configured

Previous Findings Status
| # | Sev | Finding | Status |
|---|-----|---------|--------|
| 1 | 🔴 | RestartServerAsync not protected by _clientReconnectLock | ✅ FIXED — await _clientReconnectLock.WaitAsync() added at line 1003 with finally { Release() } at line 1083 |
| 2 | 🟡 | MRE leak — DisposePrematureIdleSignal skipped in disposal loop | ✅ FIXED — added to RestartServerAsync disposal loop (line 1016) |
| 3 | 🟡 | History recency threshold too coarse (1 minute) | ✅ FIXED — reduced to 5 seconds |
| 4 | 🟢 | Dashboard RestartServer() catch silently drops errors | ✅ FIXED — ServerHealthNotice restored on failure |
| 5 | 🟡 | RestartServerAsync calls ConnectionSettings.Load() directly | ✅ FIXED — _currentSettings ?? ConnectionSettings.Load() at lines 1000 and 1036 (testability note remains; see Remaining Item) |
New Findings (2+ model consensus)

| Sev | File:Line | Models | Description |
|-----|-----------|--------|-------------|
| 🟡 | CopilotService.cs:993-1001 | 3/5 | Non-Persistent early-return path skips _clientReconnectLock. When CurrentMode != ConnectionMode.Persistent, RestartServerAsync calls await ReconnectAsync(settings, cancellationToken) without first acquiring _clientReconnectLock. Two concurrent calls to RestartServerAsync in Non-Persistent mode (e.g., two server health notices clicked) can both fall through to ReconnectAsync simultaneously. ReconnectAsync acquires the lock internally, so they serialize there — but the mode-check→call sequence is not atomic. In Persistent mode the fix is complete. See the sketch below. |
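A minimal sketch of the sequencing gap from the table above; the method shapes are assumed from the finding, not taken from the actual source:

```csharp
// Sketch: the Non-Persistent early return runs outside the lock, so the
// mode check and the ReconnectAsync call are not one atomic step.
public async Task RestartServerAsync(ConnectionSettings? settings, CancellationToken ct)
{
    if (CurrentMode != ConnectionMode.Persistent)
    {
        // Two concurrent callers can both reach this line; they only
        // serialize later, inside ReconnectAsync's internal lock.
        await ReconnectAsync(settings ?? ConnectionSettings.Load(), ct);
        return;
    }

    await _clientReconnectLock.WaitAsync(ct); // Persistent path is guarded
    try { /* stop → wait → start → recreate → restore */ }
    finally { _clientReconnectLock.Release(); }
}
```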
New Feature: Watchdog Case B Shutdown Detection (ccee6654)
GetLastEventType() is correctly implemented — it handles a missing file (returns null), an empty file, and corrupt JSON (try/catch); all paths return null. It is called only in the Case B deferral path (line 2092). The logic: before deferring because events.jsonl looks fresh, peek at the last event; if it's session.shutdown, complete immediately instead of deferring. This closes a real gap where headless server kills were invisible to the client.
Remaining Item
Finding #5 (ConnectionSettings.Load()) is a testability concern, not a runtime bug. It uses the pattern _currentSettings ?? ConnectionSettings.Load(), which correctly prefers the in-memory settings when available. Non-blocking for merge.
Verdict: ✅ Approve (with note)
All 4 blocking findings are fixed. The Non-Persistent early-return note (3/5 consensus) is a low-risk edge case since Non-Persistent mode is not the default deployment target. The ConnectionSettings.Load() testability concern is a pre-existing pattern. The watchdog shutdown detection is a clean, well-implemented improvement.
…and history recovery (PureWeen#391)

## Summary

Multiple bug fixes discovered after the PR PureWeen#375 merge, addressing worker failures, session persistence, server health detection, and conversation history loss.

## Changes

### 1. Never push to main rule

- Added as the first Git Workflow rule in `.github/copilot-instructions.md`

### 2. Permission recovery killing multi-agent workers

- `TryRecoverPermissionAsync` calls `TrySetCanceled()` on the `ResponseCompletion` TCS, propagating as `TaskCanceledException` to orchestrator workers
- **Fix**: Retry loop in `SendPromptAndWaitAsync` detects permission-recovery cancellation and re-awaits the new state's TCS (up to 3 retries)

### 3. Session ID not persisted after reconnect

- When the SDK returns a different session ID on resume, `state.Info.SessionId` was updated in memory but `FlushSaveActiveSessionsToDisk()` was never called
- **Fix**: Added a flush after every SessionId update in 4 reconnect sites

### 4. Server health notice for posix_spawn failures

- Bundled CLI native modules can be deleted by unknown processes, causing `posix_spawn ENOENT`
- **Fix**: `ServerHealthNotice` banner on the Dashboard with a Restart Server button and a full server restart cycle

### 5. Session history loss from dead event streams

- After server-side idle cleanup + re-resume, the SDK event file writer breaks — events flow in-memory but never persist to events.jsonl
- **Fix**: `LoadBestHistoryAsync()` compares the latest user message timestamps from events.jsonl and chat_history.db and picks whichever is more recent

### 6. PR review fixes

- **CRITICAL**: `RestartServerAsync` wrapped in `_clientReconnectLock` (race condition fix)
- **HIGH**: `DisposePrematureIdleSignal` added in the restart disposal loop (MRE leak)
- **HIGH**: History recency threshold reduced from 1 minute to 5 seconds
- **MINOR**: Dashboard restores `ServerHealthNotice` on restart failure

## Related Issues

- PureWeen#392 — posix_spawn upstream bug
- PureWeen#395 — Spinner gap during premature idle recovery

## Testing

All 2669 tests pass.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>