
Fix multi-agent worker failures, session persistence, server health, and history recovery #391

Merged
PureWeen merged 11 commits into main from
fix/never-push-to-main
Mar 16, 2026


@PureWeen
Owner

@PureWeen PureWeen commented Mar 16, 2026

Summary

Multiple bug fixes discovered after PR #375 merge, addressing worker failures, session persistence, server health detection, and conversation history loss.

Changes

1. Never push to main rule

  • Added as first Git Workflow rule in .github/copilot-instructions.md

2. Permission recovery killing multi-agent workers

  • TryRecoverPermissionAsync calls TrySetCanceled() on ResponseCompletion TCS, propagating as TaskCanceledException to orchestrator workers
  • Fix: Retry loop in SendPromptAndWaitAsync detects permission-recovery cancellation and re-awaits new state's TCS (up to 3 retries)
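
The retry described in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PolyPilot implementation: `SessionState`, `RecoveryRetry`, `getCurrentState`, and `AwaitWithRecoveryRetryAsync` are hypothetical names. Only the idea comes from the description above: a cancelled TCS while the caller's own token is still alive is treated as permission recovery, and the wait is repeated against the replacement state's TCS up to 3 times.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the per-session state; not the real PolyPilot type.
public sealed class SessionState
{
    public TaskCompletionSource<string> ResponseCompletion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}

public static class RecoveryRetry
{
    public const int MaxRecoveryRetries = 3;

    // getCurrentState is assumed to return whichever state object is live right now;
    // after a permission recovery it should return the replacement state.
    public static async Task<string> AwaitWithRecoveryRetryAsync(
        Func<SessionState> getCurrentState, CancellationToken dispatchToken)
    {
        for (var attempt = 0; ; attempt++)
        {
            var state = getCurrentState();
            try
            {
                // WaitAsync keeps real user/timeout cancellation able to interrupt the wait.
                return await state.ResponseCompletion.Task.WaitAsync(dispatchToken);
            }
            catch (OperationCanceledException)
                when (!dispatchToken.IsCancellationRequested && attempt < MaxRecoveryRetries)
            {
                // The TCS was cancelled but our own token is alive: assume recovery
                // replaced the state, and loop to await the new state's TCS.
            }
        }
    }
}
```

The exception filter is what separates the two cases: if the dispatch token itself was cancelled, or the retry budget is spent, the exception propagates to the caller unchanged.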

3. Session ID not persisted after reconnect

  • When SDK returns different session ID on resume, state.Info.SessionId was updated in memory but FlushSaveActiveSessionsToDisk() never called
  • Fix: Added flush after every SessionId update in 4 reconnect sites

4. Server health notice for posix_spawn failures

  • Bundled CLI native modules can be deleted by unknown processes, causing posix_spawn ENOENT
  • Fix: ServerHealthNotice banner on Dashboard with Restart Server button and full server restart cycle

5. Session history loss from dead event streams

  • After server-side idle cleanup + re-resume, SDK event file writer breaks — events flow in-memory but never persist to events.jsonl
  • Fix: LoadBestHistoryAsync() compares latest user message timestamps from events.jsonl and chat_history.db, picks whichever is more recent

6. PR review fixes

  • CRITICAL: RestartServerAsync wrapped in _clientReconnectLock (race condition fix)
  • HIGH: DisposePrematureIdleSignal added in restart disposal loop (MRE leak)
  • HIGH: History recency threshold reduced from 1 minute to 5 seconds
  • MINOR: Dashboard restores ServerHealthNotice on restart failure

Related Issues

  • #392: posix_spawn upstream bug
  • #395: Spinner gap during premature idle recovery

Testing

All 2669 tests pass.

@PureWeen
Owner Author

PR #391 Review — 5-Model Consensus

CI: No checks configured ⚠️ (no pre-existing CI on this repo)
Prior reviews: None


🔴 CRITICAL — CopilotService.cs:984 — RestartServerAsync has no concurrency guard (3/5 models)

ReconnectAsync and TryRecoverPermissionAsync both acquire _clientReconnectLock before modifying _client, _sessions, and the server process. RestartServerAsync does none of this.

If TryRecoverPermissionAsync is mid-flight (holding the lock, disposing _client, recreating sessions via TryUpdate) while the user clicks Restart:

  • RestartServerAsync calls _sessions.Clear() while recovery is doing TryUpdate on those same sessions
  • Both paths call StartServerAsync — recovery inside the lock (line ~2772), restart outside — resulting in two concurrent server starts
  • _client is disposed and recreated twice concurrently, leaving a corrupted reference

Fix: Wrap RestartServerAsync body in await _clientReconnectLock.WaitAsync() / finally { _clientReconnectLock.Release() }, matching ReconnectAsync.
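
A minimal sketch of the guard being requested, assuming `_clientReconnectLock` is a `SemaphoreSlim` with a single permit (the `WaitAsync`/`Release` calls suggest this, but the PR does not show the declaration). `ServerController` and the `restartBody` delegate are hypothetical placeholders for the real stop/start/recreate sequence.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical container class; only the locking shape is the point here.
public sealed class ServerController
{
    // One permit: restart, reconnect, and recovery take turns mutating shared state.
    private readonly SemaphoreSlim _clientReconnectLock = new(1, 1);

    public async Task RestartServerAsync(Func<Task> restartBody, CancellationToken ct = default)
    {
        // Acquire outside the try so Release only runs if the wait succeeded.
        await _clientReconnectLock.WaitAsync(ct);
        try
        {
            await restartBody();
        }
        finally
        {
            _clientReconnectLock.Release();
        }
    }
}
```

Because `SemaphoreSlim` is not reentrant, code running inside the guarded body must not try to take the same lock again, which is why any path that delegates to another lock-taking method has to do so before acquiring it.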


🟡 HIGH — CopilotService.cs:998-1004 — RestartServerAsync leaks ManualResetEventSlim per session (4/5 models)

The disposal loop cancels watchdog, fallback, and tool health-check timers but never calls DisposePrematureIdleSignal(state). PrematureIdleSignal is a ManualResetEventSlim wrapping unmanaged kernel event objects. Every server restart leaks one per active session.

Compare: TryRecoverPermissionAsync at line ~2374 calls it. All 19 individual session removal sites call it. Only the bulk loops (ReconnectAsync:881, DisposeAsync:3899, and now RestartServerAsync) are missing it.

Fix: Add DisposePrematureIdleSignal(state); after CancelToolHealthCheck(state); in the disposal loop.


🟡 HIGH — CopilotService.Utilities.cs:539-547 — LoadBestHistoryAsync 1-minute threshold causes destructive data loss (3/5 models)

BulkInsertAsync is not additive — it executes DELETE FROM ChatMessageEntity WHERE SessionId = ? then InsertAll in a single transaction. When a crash occurs during an active conversation (the primary scenario this feature exists to handle), the DB typically has messages from seconds-to-minutes before the crash.

With the current 1-minute threshold:

  • DB has message 45 seconds newer than events.jsonl → events.jsonl wins → BulkInsertAsync permanently deletes the newer DB message and replaces with stale data
  • The more common recovery scenario (crash mid-active-conversation) is precisely the sub-minute range

Fix: Reduce threshold from > 1 minute to > 5 seconds (or > 0). The 1-minute guard makes the DB path nearly unreachable in the cases it is meant to protect.
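
The effect of the threshold can be shown with a small sketch. It assumes the selection reduces to "the DB wins only when its latest user message is newer than the one in events.jsonl by more than the threshold"; `HistorySource`, `HistoryPicker`, and the null handling are hypothetical and not taken from the PR.

```csharp
using System;

public enum HistorySource { EventsFile, Database }

public static class HistoryPicker
{
    // eventsLatest / dbLatest: timestamp of the latest user message in each source,
    // or null when that source has none (null handling is an assumption).
    public static HistorySource Pick(
        DateTimeOffset? eventsLatest, DateTimeOffset? dbLatest, TimeSpan threshold)
    {
        if (dbLatest is null) return HistorySource.EventsFile;
        if (eventsLatest is null) return HistorySource.Database;

        // The DB is preferred only when it is ahead by more than the threshold.
        return dbLatest.Value - eventsLatest.Value > threshold
            ? HistorySource.Database
            : HistorySource.EventsFile;
    }
}
```

With a DB message 45 seconds newer than events.jsonl, `TimeSpan.FromMinutes(1)` selects `EventsFile` (the stale source), while `TimeSpan.FromSeconds(5)` selects `Database`.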


🟢 MINOR — Dashboard.razor:595-602 — Restart failures are invisible to the user (2/5 models)

ServerHealthNotice is cleared at line 995 before any fallible work begins. If an unexpected exception escapes (bad settings, cancellation during port scan, etc.), the catch block only does Console.WriteLine. The health banner is gone and no replacement is set.

Fix: In the Dashboard catch block, restore a notice:

CopilotService.ServerHealthNotice = "Server restart failed — please try again.";

✅ Confirmed NOT bugs (4/5 models agree)

  • SendPromptAndWaitAsync retry loop CTS semantics (Organization.cs:1731): Safe. Permission recovery calls TrySetCanceled() on the ResponseCompletion TCS directly — it does NOT cancel cts. The catch filter !cts.IsCancellationRequested correctly distinguishes recovery-caused cancellations from user/timeout cancellations. On retry, cts.Token is guaranteed valid.
  • ServerHealthNotice thread safety (Events.cs:449-451): Safe — assigned inside Invoke() (UI-thread marshal) or directly from a Blazor UI event handler.

Test coverage gaps

New code paths lacking tests:

  1. RestartServerAsync concurrent with ReconnectAsync — race condition test (calls both concurrently, verifies no crash/corruption)
  2. LoadBestHistoryAsync with DB message 30s newer — should prefer DB; currently would silently lose it
  3. LoadBestHistoryAsync with DB message 90s newer — should prefer DB (currently works correctly)
  4. RestartServerAsync exception path — verify ServerHealthNotice or FallbackNotice is set on failure

Verdict: ⚠️ Request changes

Two real bugs: the missing lock is a concurrency hazard that can corrupt _client and _sessions; the 1-minute threshold in LoadBestHistoryAsync causes destructive overwrites of newer chat history via BulkInsertAsync. Both are straightforward fixes. The MRE leak and UI feedback gap are lower severity but worth patching in the same pass.

@PureWeen
Owner Author

All 4 findings addressed in commit dc53a74:

🔴 CRITICAL — Concurrency guard ✅

Wrapped RestartServerAsync body in _clientReconnectLock.WaitAsync() / finally { Release() }.

🟡 HIGH — MRE leak ✅

Added DisposePrematureIdleSignal(state) in the disposal loop after CancelToolHealthCheck.

🟡 HIGH — 1-minute threshold ✅

Reduced from TotalMinutes > 1 to TotalSeconds > 5. Sub-minute crash recovery now correctly prefers DB.

🟢 MINOR — Restart error feedback ✅

Dashboard catch block now sets ServerHealthNotice = "Server restart failed — please try again.".

Note: The review also flagged ReconnectAsync and DisposeAsync as missing DisposePrematureIdleSignal in their bulk loops — good catch, but those are pre-existing and out of scope for this PR. Will track separately.

@PureWeen PureWeen force-pushed the fix/never-push-to-main branch from dc53a74 to 2b9d517 March 16, 2026 20:01
@PureWeen PureWeen changed the title Add 'never push to main' rule to copilot instructions Fix multi-agent worker failures, session persistence, server health, and history recovery Mar 16, 2026
@PureWeen
Owner Author

🤖 Multi-Model Consensus Review — Round 2

CI: ⚠️ No checks configured
Tests: 2669/2669 pass (clean)

Previous Findings Status

| # | Finding | Status |
| --- | --- | --- |
| F1 | RestartServerAsync missing _clientReconnectLock | FIXED — lock acquired at line 1003, released in finally |
| F2 | MRE leak (DisposePrematureIdleSignal skipped) | ⚠️ PARTIALLY FIXED — fixed in RestartServerAsync:1015 but still missing in ReconnectAsync:890 and DisposeAsync:3931 |
| F3 | BulkInsertAsync unguarded in RestorePreviousSessionsAsync | FIXED |
| F4 | Dashboard RestartServer() catch swallows errors | FIXED — now calls SetServerHealthNotice() to show banner |
| F5 | ConnectionSettings.Load() testability | FIXED — uses _currentSettings ?? ConnectionSettings.Load() |

Remaining Issue

[🟡 MODERATE] CopilotService.cs:890, :3931 — DisposePrematureIdleSignal still missing in 2 of 3 disposal loops

RestartServerAsync (line 1015) correctly calls DisposePrematureIdleSignal(state) before disposing sessions, but ReconnectAsync (line 890) and DisposeAsync (line 3931) still skip it. This leaks ManualResetEventSlim resources during mode switching and app shutdown.

Fix: Add DisposePrematureIdleSignal(state); to both loops, matching the pattern at line 1015.

Verdict: ⚠️ Request changes

4 of 5 original findings fixed. One remaining MODERATE (MRE leak in 2 disposal loops). The new watchdog Case B session.shutdown detection (commit ccee665) looks correct.

Review by PR Review Squad (2/5 models reached consensus due to rate limiting)

@PureWeen
Owner Author

Round 2 Re-Review — PR #391

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: 2+ models to flag an issue.

CI: No checks configured ⚠️ (pre-existing, not PR-specific)
Tests: ✅ 2669/2669 pass (verified on pr-391-r2 with fresh build)


Previous Findings Status

| Finding | Status |
| --- | --- |
| 🔴 F1 — RestartServerAsync not protected by _clientReconnectLock | FIXED (5/5) |
| 🟡 F2 — MRE leak: disposal loops skip DisposePrematureIdleSignal | ⚠️ PARTIALLY FIXED (5/5) |
| 🟡 F3 — BulkInsertAsync destructive at 3rd call site (no !historyFromDb guard) | FIXED (5/5) |
| 🟢 F4 — Dashboard RestartServer() catch swallows errors silently | FIXED (5/5) |
| 🟡 F5 — ConnectionSettings.Load() called directly — not testable | FIXED (5/5) |

Fix Verification

F1 (CopilotService.cs:1003): Persistent path wrapped in _clientReconnectLock.WaitAsync() / finally Release(). Non-Persistent path returns early and delegates to ReconnectAsync (which manages its own lock). No deadlock risk. ✓

F3 (Persistence.cs:639): if (recreatedState.Info.SessionId != null && !oldFromDb) guard correctly prevents DB-origin history from being re-written. oldFromDb is correctly reassigned when origHistory wins the count comparison. ✓

F4 (Dashboard.razor + CopilotService.cs): Start-failure sets FallbackNotice; connect-failure sets FallbackNotice; Dashboard catch sets ServerHealthNotice("Server restart failed — please try again."). User-visible on all error paths. ✓

F5 (CopilotService.cs:998,1033): _currentSettings ?? ConnectionSettings.Load() — _currentSettings is set during InitializeAsync and ReconnectAsync(settings), so tests that initialize normally never reach the file read. ✓


F2 — Still Partially Present ⚠️

DisposePrematureIdleSignal was added to RestartServerAsync's loop (line 1015) but two other bulk disposal loops are still missing it — and these are the more-commonly-exercised paths:

ReconnectAsync (line 890–896) — called on every Settings → Save & Reconnect:

foreach (var state in _sessions.Values)
{
    CancelProcessingWatchdog(state);
    CancelTurnEndFallback(state);
    CancelToolHealthCheck(state);
    // ← DisposePrematureIdleSignal(state) MISSING
    try { if (state.Session != null) await state.Session.DisposeAsync(); } catch { }
}

DisposeAsync (line 3931–3938) — called on app shutdown:

foreach (var state in _sessions.Values)
{
    CancelProcessingWatchdog(state);
    CancelTurnEndFallback(state);
    CancelToolHealthCheck(state);
    // ← DisposePrematureIdleSignal(state) MISSING
    ...
}

Each leaked ManualResetEventSlim holds an OS kernel handle. Both loops should add DisposePrematureIdleSignal(state) to match RestartServerAsync.


New Finding (2+ model consensus)

🟢 MINOR CopilotService.Utilities.cs:145-157 — GetLastEventType reads the entire events.jsonl forward to find the last line (O(N) full-file scan). It is called every 15s per processing session in watchdog Case B. For sessions with thousands of events, this is unnecessary I/O. HasInterruptedToolExecution in the same file uses a 30-line ring buffer to avoid this — the same pattern could be applied here (read only the last line via a small tail buffer). Flagged by 3/5 models.


Verdict: ⚠️ Request Changes

Specific ask: Add DisposePrematureIdleSignal(state) to the disposal loops in ReconnectAsync (line 893) and DisposeAsync (line 3934), matching the pattern already in RestartServerAsync (line 1015). Two-line fix.

Everything else is clean. Once F2 is fully resolved, this is ready to merge.

PureWeen and others added 11 commits March 16, 2026 16:09
Adds the first Git Workflow rule: never commit or push directly to main,
always create a feature branch and open a PR. This prevents bypassing
code review for any change, no matter how small.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Permission recovery (TryRecoverPermissionAsync) calls TrySetCanceled on the
ResponseCompletion TCS to unblock SendPromptAsync. For multi-agent workers,
this TaskCanceledException propagates to ExecuteWorkerAsync's generic catch,
which logs FAILED and discards all collected content.

Fix: SendPromptAndWaitAsync now detects permission-recovery cancellation
(TCS cancelled but dispatch token still alive) and retries by awaiting the
new state's ResponseCompletion TCS. If recovery skipped the resend (tools
already completed), it collects partial response from history instead.

Observed: worker-1 hit 3 shell errors in 13s, triggering permission recovery.
Recovery cancelled the TCS, ExecuteWorkerAsync caught TaskCanceledException,
entire orchestration failed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lling workers

Two fixes:

1. Session ID persistence: When the SDK returns a different session ID on
   resume (e.g., reverting to an older session), active-sessions.json was
   never updated. On next restart, the app would resume the old session
   with stale conversation history. Added FlushSaveActiveSessionsToDisk()
   after every state.Info.SessionId update in reconnect paths (4 sites).

   Root cause: CI-Investigate had ID 7e9c8108 with recent conversations,
   but active-sessions.json still had old ID 6ff9eb34. After relaunch,
   it resumed 6ff9eb34 and the user lost all recent work.

2. Permission recovery worker survival: TryRecoverPermissionAsync calls
   TrySetCanceled on ResponseCompletion to unblock SendPromptAsync. For
   multi-agent workers, this TaskCanceledException propagated to
   ExecuteWorkerAsync's generic catch, killing the worker. Fixed by adding
   a retry loop in SendPromptAndWaitAsync that detects permission-recovery
   cancellation (TCS cancelled but dispatch token alive) and re-awaits the
   new state's ResponseCompletion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the headless server's native modules become stale (e.g., another CLI
installation cleans up ~/.copilot/pkg/darwin-arm64/), posix_spawn fails
for all shell operations. This adds:

- ServerHealthNotice property on CopilotService (like FallbackNotice)
- Dashboard banner with 'Restart Server' button when shell failures detected
- RestartServerAsync() method: stops server, waits for port, starts fresh
  server, recreates client, restores all sessions from disk
- System message in affected session explaining the error
- Shell failure detection in Events.cs sets the banner in Persistent mode

The restart follows the version-mismatch pattern: stop → wait → start →
recreate client → restore sessions. No session data is lost.

Fixes #392

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t streams

When the headless server cleans up an idle session and PolyPilot re-resumes it,
the SDK's event file writer breaks ('dead event stream'). Events flow in-memory
but are never persisted to events.jsonl. On restart, LoadHistoryFromDisk reads
stale data and all post-reconnect messages are lost.

Fix: LoadBestHistoryAsync compares events.jsonl and chat_history.db (written
fire-and-forget on every message), preferring whichever has more substantive
messages. This also prevents BulkInsertAsync from destructively overwriting
richer DB data with stale events.jsonl content.

Updated at 4 call sites: ResumeSessionAsync, lazy restore, codespace
placeholder, and fallback recovery.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The count-based comparison was wrong — events.jsonl can have more OLD messages
while the DB has NEWER ones from dead event stream sessions. Merging is risky
(duplicates, ordering). Instead, compare the latest user message timestamp from
each source and pick the winner outright. No merging, no risk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. RestartServerAsync: wrap in _clientReconnectLock to prevent races with
   TryRecoverPermissionAsync and ReconnectAsync (CRITICAL)
2. RestartServerAsync disposal loop: add DisposePrematureIdleSignal(state)
   to prevent ManualResetEventSlim leak per session (HIGH)
3. LoadBestHistoryAsync: reduce recency threshold from 1 minute to 5 seconds
   so crash-during-active-conversation recovers from DB (HIGH)
4. Dashboard RestartServer catch: restore ServerHealthNotice on failure so
   user sees feedback instead of silent failure (MINOR)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fallback recovery path in RestorePreviousSessionsAsync was the only
BulkInsertAsync call site that still ran when history came from ChatDatabase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the headless server kills a session (idle timeout, stuck tools),
it writes session.shutdown to events.jsonl. But if the client's event
stream is dead, we never receive this event. The watchdog's Case B
freshness check sees the file as recently modified and keeps deferring,
leaving the session stuck for 10+ minutes.

Fix: Before deferring in Case B, read the last event type from
events.jsonl. If it's session.shutdown, skip the deferral and complete
immediately. Uses the same parse pattern as HasInterruptedToolExecution.

Also fix pre-existing build error: Dashboard.razor was assigning to
ServerHealthNotice's private setter. Added SetServerHealthNotice() method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… GetLastEventType

1. Add DisposePrematureIdleSignal(state) to ReconnectAsync and DisposeAsync
   disposal loops, matching the pattern already in RestartServerAsync. This
   prevents ManualResetEventSlim leaks during mode switching and app shutdown.

2. Optimize GetLastEventType to use a 4KB tail-read instead of O(N) full-file
   scan. Called every 15s per processing session in watchdog Case B — the
   previous forward scan was unnecessary I/O for large events.jsonl files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…names

- Add GetAllMessagesAsync to StubChatDatabase (interface compliance)
- Update structural tests to search for LoadBestHistoryAsync (renamed)
- Increase RenderThrottleTests search window from 1000 to 1200 chars

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/never-push-to-main branch from ccee665 to 536e1da March 16, 2026 21:27
@PureWeen
Owner Author

All findings addressed in commit dbe57d9 (now rebased as part of 536e1da):

F2 - MRE leak (FULLY FIXED): Added DisposePrematureIdleSignal(state) to both ReconnectAsync (line 895) and DisposeAsync (line 3936) disposal loops, matching the pattern in RestartServerAsync (line 1016). All 3 disposal loops are now consistent.

New finding - GetLastEventType O(N) scan (FIXED): Replaced forward full-file scan with a 4KB tail-read (FileStream.Seek to Length - 4096). Called every 15s per processing session, so this eliminates unnecessary I/O for large events.jsonl files.

Rebase: Rebased on latest origin/main (c2fdb8c). Resolved merge conflicts in RestartServerAsync (kept both TryRecoverPersistentServerAsync from main and our RestartServerAsync). All 2703 tests pass.

@PureWeen
Owner Author

Round 3 Re-Review — ✅ Approve

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex) | 2703/2703 tests pass


Previous Findings Status

| Finding | Status |
| --- | --- |
| F1 🔴 RestartServerAsync not locked | ✅ FIXED — _clientReconnectLock.WaitAsync at line 1098 |
| F3 🟡 BulkInsertAsync unguarded at 3rd call site | ✅ FIXED |
| F4 🟢 Dashboard catch swallows errors | ✅ FIXED — error surfaced via SetServerHealthNotice() |
| F5 🟡 ConnectionSettings.Load() not testable | ✅ FIXED |
| R2-N1 🟡 DisposePrematureIdleSignal missing in ReconnectAsync + DisposeAsync | ✅ FIXED — added to both loops (lines 905, 4028); all 3 disposal loops now consistent |

New Findings (consensus filter: 2+ models required)

No new issues passed the consensus filter. One model noted a potential flag-swap in the session-ID mismatch branch (actualFromDb vs historyFromDb) but this was not flagged by any other model and the effect is mitigated by CopyEventsToNewSession running before the affected branch.

GetLastEventType optimization

The 4KB tail-read is correct:

  • Partial first line from mid-file seek is safely overwritten by subsequent full JSON lines
  • FileShare.ReadWrite is consistent with LoadHistoryFromDiskAsync in the same file
  • JsonDocument.Parse failure on degenerate cases returns null (correct watchdog behavior)
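
The tail-read described above might look roughly like the sketch below. It is an illustration under assumptions rather than the code from the PR: the `EventTail` class is hypothetical, and the JSON property name `type` is a guess at how an event such as session.shutdown is labelled in events.jsonl. The 4KB window, `FileShare.ReadWrite`, the partial-line handling, and the null-on-failure behaviour follow the points listed.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;
using System.Text.Json;

public static class EventTail
{
    private const int TailBytes = 4096;

    // Returns the "type" of the last JSON line in the file, or null when the file
    // is missing, empty, unreadable, or the last line does not parse.
    public static string? GetLastEventType(string path)
    {
        try
        {
            if (!File.Exists(path)) return null;

            // ReadWrite sharing so a writer appending to the file is not blocked.
            using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            if (stream.Length > TailBytes)
                stream.Seek(-TailBytes, SeekOrigin.End);

            using var reader = new StreamReader(stream, Encoding.UTF8);
            string? lastLine = null;
            string? line;
            while ((line = reader.ReadLine()) != null)
            {
                // A partial first line from the mid-file seek is overwritten
                // by any complete line that follows it.
                if (!string.IsNullOrWhiteSpace(line)) lastLine = line;
            }
            if (lastLine is null) return null;

            using var doc = JsonDocument.Parse(lastLine);
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return null;
            return root.TryGetProperty("type", out var type) && type.ValueKind == JsonValueKind.String
                ? type.GetString()
                : null;
        }
        catch (Exception ex) when (ex is IOException or JsonException or UnauthorizedAccessException)
        {
            return null;
        }
    }
}
```

One limitation of this shape: if the final event is itself longer than the 4KB window, only a fragment of it is read, parsing fails, and the method returns null.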

Test note

The --no-build run showed 1 failure in RenderThrottleTests due to stale binary (built before the 536e1daf window bump). A fresh build clears it: 2703/2703 pass.

@PureWeen
Owner Author

🤖 Multi-Model Consensus Review — Round 3

CI: ⚠️ No checks configured
Tests: 2703/2703 pass ✅

Previous Findings Status

| # | Finding | Status |
| --- | --- | --- |
| F1 | RestartServerAsync missing _clientReconnectLock | FIXED — lock at line 1098 |
| F3 | BulkInsertAsync unguarded 3rd call site | FIXED |
| F4 | Dashboard catch swallows errors | FIXED — SetServerHealthNotice() surfaces error |
| F5 | ConnectionSettings.Load() testability | FIXED — tests bypass via ReconnectAsync(settings) with in-memory objects |
| R2-N1 | DisposePrematureIdleSignal missing in 2 disposal loops | FIXED — added to ReconnectAsync (line 908) and DisposeAsync (line 4031) |

New Findings (Round 3)

None. All 3/5 sub-agents that completed (others hit rate limits) unanimously confirm all findings fixed with no new issues.

Key verification points:

  • All 3 disposal loops (ReconnectAsync, RestartServerAsync, DisposeAsync) now consistently call DisposePrematureIdleSignal(state)
  • GetLastEventType tail-read uses FileShare.ReadWrite, 4KB tail buffer, proper partial-line handling
  • Test changes are mechanical (stub updates, search window adjustments)

Verdict: ✅ Approve

All 5 original findings resolved across 3 rounds of review. No new issues.

Review by PR Review Squad (3/5 models reached — rate limit on remaining 2)

@PureWeen PureWeen merged commit 23a3083 into main Mar 16, 2026
@PureWeen PureWeen deleted the fix/never-push-to-main branch March 16, 2026 21:58
@PureWeen
Owner Author

Multi-Model Consensus Review — Round 2 (5 models)

Latest commits: 2b9d517b (concurrency guard + MRE + UX), 06d019b7 (BulkInsertAsync guard), ccee6654 (watchdog session.shutdown detection)

CI: ⚠️ No CI configured


Previous Findings Status

| # | Sev | Finding | Status |
| --- | --- | --- | --- |
| 1 | 🔴 | RestartServerAsync not protected by _clientReconnectLock | FIXED — await _clientReconnectLock.WaitAsync() added at line 1003 with finally { Release() } at line 1083 |
| 2 | 🟡 | MRE leak — DisposePrematureIdleSignal skipped in disposal loop | FIXED — added to RestartServerAsync disposal loop (line 1016) |
| 3 | 🟡 | BulkInsertAsync unguarded at 3rd call site | FIXED |
| 4 | 🟢 | Dashboard RestartServer() catch silently drops errors | FIXED — ServerHealthNotice restored on failure |
| 5 | 🟡 | RestartServerAsync calls ConnectionSettings.Load() directly | ⚠️ STILL PRESENT — _currentSettings ?? ConnectionSettings.Load() at line 1036 and 1000 |

New Findings (2+ model consensus)

| Sev | File:Line | Models | Description |
| --- | --- | --- | --- |
| 🟡 | CopilotService.cs:993-1001 | 3/5 | Non-Persistent early-return path skips _clientReconnectLock. When CurrentMode != ConnectionMode.Persistent, RestartServerAsync calls await ReconnectAsync(settings, cancellationToken) without first acquiring _clientReconnectLock. Two concurrent calls to RestartServerAsync in Non-Persistent mode (e.g., two server health notices clicked) can both fall through to ReconnectAsync simultaneously. ReconnectAsync acquires the lock internally, so they serialize there — but the mode-check→call sequence is not atomic. In Persistent mode the fix is complete. |

New Feature: Watchdog Case B Shutdown Detection (ccee6654)

GetLastEventType() is correctly implemented — handles missing file (return null), empty file, corrupt JSON (try/catch), all return null. Called only in the Case B deferral path (line 2092). The logic is: before deferring because events.jsonl looks fresh, peek at the last event; if it's session.shutdown, complete immediately instead of deferring. This closes a real gap where headless server kills were invisible to the client.


Remaining Item

Finding #5 (ConnectionSettings.Load()) is a testability concern, not a runtime bug. It uses the pattern _currentSettings ?? ConnectionSettings.Load() which correctly prefers the in-memory settings when available. Non-blocking for merge.


Verdict: ✅ Approve (with note)

All 4 blocking findings are fixed. The Non-Persistent early-return note (3/5 consensus) is a low-risk edge case since Non-Persistent mode is not the default deployment target. The ConnectionSettings.Load() testability concern is pre-existing pattern. The watchdog shutdown detection is a clean, well-implemented improvement.

arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…and history recovery (PureWeen#391)

## Summary

Multiple bug fixes discovered after PR PureWeen#375 merge, addressing worker
failures, session persistence, server health detection, and conversation
history loss.

## Changes

### 1. Never push to main rule
- Added as first Git Workflow rule in `.github/copilot-instructions.md`

### 2. Permission recovery killing multi-agent workers
- `TryRecoverPermissionAsync` calls `TrySetCanceled()` on
`ResponseCompletion` TCS, propagating as `TaskCanceledException` to
orchestrator workers
- **Fix**: Retry loop in `SendPromptAndWaitAsync` detects
permission-recovery cancellation and re-awaits new state's TCS (up to 3
retries)

### 3. Session ID not persisted after reconnect
- When SDK returns different session ID on resume,
`state.Info.SessionId` was updated in memory but
`FlushSaveActiveSessionsToDisk()` never called
- **Fix**: Added flush after every SessionId update in 4 reconnect sites

### 4. Server health notice for posix_spawn failures
- Bundled CLI native modules can be deleted by unknown processes,
causing `posix_spawn ENOENT`
- **Fix**: `ServerHealthNotice` banner on Dashboard with Restart Server
button and full server restart cycle

### 5. Session history loss from dead event streams
- After server-side idle cleanup + re-resume, SDK event file writer
breaks — events flow in-memory but never persist to events.jsonl
- **Fix**: `LoadBestHistoryAsync()` compares latest user message
timestamps from events.jsonl and chat_history.db, picks whichever is
more recent

### 6. PR review fixes
- **CRITICAL**: `RestartServerAsync` wrapped in `_clientReconnectLock`
(race condition fix)
- **HIGH**: `DisposePrematureIdleSignal` added in restart disposal loop
(MRE leak)
- **HIGH**: History recency threshold reduced from 1 minute to 5 seconds
- **MINOR**: Dashboard restores `ServerHealthNotice` on restart failure

## Related Issues
- PureWeen#392 — posix_spawn upstream bug
- PureWeen#395 — Spinner gap during premature idle recovery

## Testing
All 2669 tests pass.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>