Strengthen multi-agent skills, Fix flow, and stability tests#375
Multi-Model Consensus Review -- Round 2 (5-model × 5-agent)
CI Status:
- Re-arm IsProcessing on TurnStartEvent after premature session.idle (SDK sends idle then continues 15+ tool rounds; ghost events lost)
- Fix INV-15: revival path uses TryUpdate instead of index assignment
- Fix thread safety: Organization.Sessions/Groups use snapshot methods
- Fix hardcoded UserProfile path → CopilotService.BaseDir
- Fix vacuous/tautological test assertions (PR review CRITICALs #1, #2)
- Fix test method scoping to use Task<string> SendPromptAsync( signature
- Widen handler proximity check to 60 lines, skip CopilotService.cs (retry patterns share single handler after catch blocks)

All 2616 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
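The INV-15 item above swaps index assignment for `TryUpdate`. A minimal sketch of why that matters, assuming the session map is a `ConcurrentDictionary`; `SessionState` and `RevivalSketch` are hypothetical stand-ins, not the project's types.

```csharp
using System.Collections.Concurrent;

// Hypothetical stand-in for the project's session state type.
public sealed class SessionState
{
    public string Id { get; init; } = "";
}

public static class RevivalSketch
{
    // Replaces the entry only if it still holds the state observed as dead.
    // Returns false when another thread has already stored a different state.
    public static bool TryRevive(
        ConcurrentDictionary<string, SessionState> sessions,
        string name,
        SessionState deadState,
        SessionState freshState)
    {
        // sessions[name] = freshState would overwrite unconditionally and
        // could discard a state that another thread just published.
        return sessions.TryUpdate(name, freshState, deadState);
    }
}
```

With reference-type values, the comparison uses default equality, so the swap succeeds only while the dictionary still holds the exact dead instance.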
Adds 8 tests covering the PR #375 bug where steering a busy orchestrator canceled the in-flight orchestration TCS:
- EnqueueMessage_QueuesDrainAfterCompletion: verify queue works
- EnqueueMessage_MultipleMessages_QueuedInOrder: FIFO ordering
- DashboardDispatch_OrchestratorCheckBeforeSteer: structural ordering
- DashboardDispatch_OrchestratorUsesEnqueueNotSteer: correct dispatch
- DashboardDispatch_OrchestratorQueueHasDiagnosticLog: QUEUED_ORCH_BUSY tag
- DashboardDispatch_NonOrchestratorStillSteered: steer path preserved
- GetOrchestratorGroupId_WorkerInActiveGroup_ReturnsNull: workers steerable
- LongRunningOrchestrator_UserFollowup_MustQueue: 15min scenario

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Multi-agent orchestration SKILL.md: Added PR 373 reconnect safety invariants (INV-O9 through INV-O13), session death/deadlock bug documentation, Fix with Copilot multi-agent awareness section, OrchestratorReflect monitoring guide, enhanced end-to-end checklist, and comprehensive test coverage gap analysis
- Processing-state-safety SKILL.md: Added INV-14 through INV-17 covering IsOrphaned guards, TryUpdate concurrency, handler-before-publish ordering, and MCP server reload on reconnect
- SessionSidebar.razor Fix flow: GetBugReportDebugInfo now includes multi-agent context (group name, mode, role, members, event diagnostics, PendingOrchestration state) via AppendMultiAgentDebugInfo. BuildCopilotPrompt adds multi-agent testing requirements when the session is in a multi-agent group
- SessionStabilityTests.cs: 20 new tests covering IsOrphaned guards (5 tests), ForceCompleteProcessingAsync INV-1 compliance (3 tests), synthesis prompt mixed success/failure (2 tests), sibling re-resume safety (4 tests), MCP reload (2 tests), diagnostic log completeness, watchdog companion fields, and Fix prompt multi-agent enhancement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ety tests

Root cause: ExecuteWorkerAsync revival path created a fresh CopilotSession without registering an event handler (.On(evt => HandleSessionEvent(...))). The CLI processed prompts and wrote 189+ events to events.jsonl, but HandleSessionEvent never fired, leaving the session permanently stuck.

Fix (Organization.cs):
- Set deadState.IsOrphaned = true before dispose (prevents stale callbacks)
- Copy IsMultiAgentSession from dead state to fresh state
- Register event handler on fresh session before storing in _sessions

Safety tests (25 tests in LongRunningSessionSafetyTests.cs):
- Timeout constants accommodate 20+ min workers and 30 min freshness
- Revival path verified: event handler, IsOrphaned, IsMultiAgentSession
- All session creation paths across 3 files register event handlers
- Guard against mtime staleness detection reintroduction
- Theory tests for various worker durations (3-45 min) and thinking pauses
- Abort clears all processing state for long-running sessions

Skill update (SKILL.md):
- Long-Running Session Safety section with cardinal rule and safe/unsafe table
- INV-O14: Never add mtime staleness detection to watchdog
- Bug pattern: Dead event stream after worker revival

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a message is sent to an orchestrator that's already processing, the steer path was canceling the in-flight orchestration TCS, aborting worker dispatch (TaskCanceledException). Now orchestrator sessions queue the message instead, which gets drained after the current orchestration completes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
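The dispatch rule described above (orchestrator check first, then steer) can be sketched as follows. All type and member names here are hypothetical illustrations, not the project's dashboard code.

```csharp
using System.Collections.Generic;

public enum DispatchAction { Send, Steer, Queue }

// Hypothetical, simplified view of a session for illustration only.
public sealed class SessionSketch
{
    public bool IsProcessing { get; set; }
    public bool IsOrchestrator { get; set; }
    public Queue<string> Pending { get; } = new();
}

public static class DispatchSketch
{
    public static DispatchAction Dispatch(SessionSketch session, string message)
    {
        if (!session.IsProcessing)
            return DispatchAction.Send; // idle session: send normally

        // The orchestrator check must run before the steer path, because
        // steering would cancel the in-flight orchestration.
        if (session.IsOrchestrator)
        {
            session.Pending.Enqueue(message); // drained after the current run completes
            return DispatchAction.Queue;
        }

        return DispatchAction.Steer; // non-orchestrator sessions keep the steer path
    }
}
```

Queued messages keep FIFO order because `Queue<string>` dequeues in insertion order.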
…r feedback
Fixes 3 moderate issues from PR Review Squad consensus:
1. SessionSidebar crash.log uses CopilotService.BaseDir instead of
hardcoded UserProfile path (consistent with other file paths)
2. Replace File.ReadAllLines with ReadLastLines helper that streams
without loading entire file — fixes both crash.log (10 lines)
and event-diagnostics.log (500 lines tail) paths
3. Add user-visible system message when orchestrator queues a message
('📋 Orchestrator is busy...') so users know their message was received
Also updates skill documentation:
- multi-agent-orchestration SKILL.md: steering-orchestrator conflict bug,
premature session.idle truncation bug
- processing-state-safety SKILL.md: path #10 TurnStart re-arm
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
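Item 2 above replaces `File.ReadAllLines` with a streaming tail helper. The PR's actual `ReadLastLines` is not shown in this conversation, so the following is only a sketch of one way to do it: read line by line and keep a bounded window. The `FileShare.ReadWrite` choice is an assumption so that a log still being written can be read.

```csharp
using System.Collections.Generic;
using System.IO;

public static class TailSketch
{
    // Returns the last `count` lines of the file without materializing every line.
    public static List<string> ReadLastLines(string path, int count)
    {
        if (count <= 0 || !File.Exists(path))
            return new List<string>();

        var window = new Queue<string>(count);
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        using var reader = new StreamReader(stream);

        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (window.Count == count)
                window.Dequeue(); // drop the oldest line once the window is full
            window.Enqueue(line);
        }

        return new List<string>(window);
    }
}
```

Memory stays proportional to `count` rather than to the file size, which is the property the review asked for.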
SDK bug #299: session.idle fires prematurely mid-turn, causing orchestrator to collect truncated worker content. Fix adds post-collection recovery:
- WasPrematurelyIdled volatile flag set in EVT-REARM, cleared in SendPromptAsync
- RecoverFromPrematureIdleIfNeededAsync: 5s detection poll → OnSessionComplete subscription → 120s wait → History re-collect → LoadHistoryFromDisk fallback
- Only applies to IsMultiAgentSession workers (zero overhead for single sessions)
- Fix mutation-before-commit: SessionId assigned after TryUpdate succeeds
- Fix silent null guard: Assert.NotNull in LongRunningSessionSafetyTests
- Fix File.ReadAllText in debug panel: use streaming FileStream (review finding)
- 11 new structural regression tests for premature idle recovery

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ection, fix exception type and dispatchTime filter
- Replace volatile bool WasPrematurelyIdled + busy-poll loop with ManualResetEventSlim PrematureIdleSignal: exits immediately when EVT-REARM fires (zero latency for normal completions); times out after 5s only when no premature idle occurs
- Fix catch(InvalidOperationException) on List<T>.ToArray(): should be catch(Exception) since List<T>.ToArray() does not throw InvalidOperationException on concurrent modification
- Add && m.Timestamp >= dispatchTime filter to disk fallback, matching in-memory History path
- Rename LoadHistoryFromDisk to LoadHistoryFromDiskAsync with FileShare.ReadWrite and async ReadLineAsync; keep sync wrapper for backward compat
- Update structural tests to reflect ManualResetEventSlim API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
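The first bullet above replaces a flag-plus-polling loop with a signal. A minimal sketch of that pattern, using the BCL `ManualResetEventSlim`; the helper name `WaitForRearm` is hypothetical.

```csharp
using System;
using System.Threading;

public static class PrematureIdleSignalSketch
{
    // Blocks until the signal is set or the timeout elapses.
    // Returns true if the signal fired, false on timeout.
    public static bool WaitForRearm(ManualResetEventSlim signal, TimeSpan timeout)
    {
        // Unlike a sleep-and-check loop, Wait returns as soon as Set() is called,
        // so the caller pays the full timeout only when nothing fires.
        return signal.Wait(timeout);
    }
}
```

The producer side would call `signal.Set()` where the flag was previously assigned, and `signal.Reset()` where it was cleared.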
The 5s WasPrematurelyIdled polling window was fundamentally insufficient: EVT-REARM can take 30-60s to fire after premature session.idle. Worker-3 in live testing was truncated (363 chars) because recovery never triggered.

Changes:
- Add IsEventsFileActive() helper: checks events.jsonl mtime freshness
- Detection now uses two parallel signals: WasPrematurelyIdled flag OR events.jsonl freshness (<15s age = worker still active)
- Recovery loops through repeated premature idle cycles (observed: 4x in a row for worker-3) instead of waiting for a single OnSessionComplete
- After each completion round, checks events.jsonl staleness to decide if worker is truly done vs hitting premature idle again
- Add PrematureIdleEventsFileFreshnessSeconds constant (15s)
- 4 new structural regression tests for freshness check and loop pattern
- Update search windows from 5000 to 8000 chars for expanded method body

All 2649 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
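The freshness check described above can be sketched as follows. The 15-second threshold comes from the commit message; the signature, including the injected `utcNow` parameter (added here to make the check testable), is an assumption and not the project's `IsEventsFileActive()`.

```csharp
using System;
using System.IO;

public static class EventsFileSketch
{
    // Mirrors the 15s constant named in the commit message.
    public const int FreshnessSeconds = 15;

    // True when the events file was written within the freshness window,
    // which is treated as "worker still active".
    public static bool IsEventsFileActive(string eventsPath, DateTime utcNow)
    {
        if (!File.Exists(eventsPath))
            return false;

        var age = utcNow - File.GetLastWriteTimeUtc(eventsPath);
        return age < TimeSpan.FromSeconds(FreshnessSeconds);
    }
}
```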
Sessions are now loaded as lightweight placeholders at startup (history from events.jsonl only, no SDK connection). The actual SDK ResumeSessionAsync call happens lazily when the user sends a message to a session (EnsureSessionConnectedAsync).

Previously, RestorePreviousSessionsAsync called SDK ResumeSessionAsync for each of 41 sessions sequentially. Each call connects to the persistent Copilot server. With many sessions, this blocked the UI thread for minutes, showing only the BlazorWebView background (#0F0F22) which appeared as a 'blue screen'.

Changes:
- RestorePreviousSessionsAsync now creates placeholder SessionState objects with Session=null and history loaded from disk
- Background restore wrapper (RestoreSessionsInBackgroundAsync) runs post-restore tasks off the UI thread
- EnsureSessionConnectedAsync lazily connects to SDK on first message
- SendPromptAsync checks for null Session and lazy-resumes before send
- IsCodespaceSession helper distinguishes codespace vs lazy placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RestoreSessionsInBackgroundAsync was started as fire-and-forget on the UI thread. Its async continuations captured the UIKitSynchronizationContext, so they ran on the UI thread. Inside the restore loop, LoadHistoryFromDisk calls .GetAwaiter().GetResult() which blocks the UI thread. The async file I/O inside LoadHistoryFromDiskAsync then needs to post its continuation to the blocked UI thread → classic SyncContext deadlock.

Fix: Wrap the fire-and-forget in Task.Run() so the entire restore runs on the ThreadPool where there is no SyncContext to capture. Also add ConfigureAwait(false) to async I/O calls as belt-and-suspenders.

Verified: 41 sessions restore in ~9s on ThreadPool thread, UI renders immediately, all 2649 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
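The two halves of that fix (`Task.Run` to leave the UI context, `ConfigureAwait(false)` on the I/O awaits) can be sketched as follows. `CountLinesAsync` and `StartRestore` are hypothetical stand-ins for the restore loop, not the project's methods.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class BackgroundRestoreSketch
{
    // Async file read whose continuations do not need the caller's context.
    public static async Task<int> CountLinesAsync(string path)
    {
        using var reader = new StreamReader(path);
        var count = 0;
        // ConfigureAwait(false): resume on any ThreadPool thread instead of
        // posting back to a (possibly blocked) synchronization context.
        while (await reader.ReadLineAsync().ConfigureAwait(false) != null)
            count++;
        return count;
    }

    // Task.Run starts the work on the ThreadPool, where no synchronization
    // context is captured, even if this method is called from the UI thread.
    public static Task<int> StartRestore(string path)
    {
        return Task.Run(() => CountLinesAsync(path));
    }
}
```

Either measure alone would avoid this particular deadlock; using both guards against a future caller that blocks on the result from a context-bound thread.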
1. Add missing 'if (!completed)' guard in premature idle recovery loop (Organization.cs:1776); without it, the recovery always broke on the first iteration, making the multi-round logic dead code. (Consensus: Opus + Sonnet + GPT-5.1)
2. Marshal ReconcileOrganization to UI thread via InvokeOnUI in RestoreSessionsInBackgroundAsync: Organization.Sessions is a plain List<T> that's not thread-safe, and restore now runs on ThreadPool. (Consensus: Opus + Sonnet + Gemini)
3. Use SnapshotSessionMetas() in IsCodespaceSession and EnsureSessionConnectedAsync instead of direct Organization.Sessions access; these can run from background threads. (Consensus: Sonnet + Gemini)
4. Move lazy resume (EnsureSessionConnectedAsync) inside SendingFlag atomic guard to prevent double-resume race on rapid sends. (Opus)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
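Item 4 relies on an atomic "sending" guard. The project's SendingFlag implementation is not shown here, so this is a sketch of one common way to build such a guard with `Interlocked.CompareExchange`; the class and method names are hypothetical.

```csharp
using System.Threading;

public sealed class SendGuardSketch
{
    private int _sending; // 0 = idle, 1 = a send is in progress

    // Atomically claims the guard. Only one caller can win until EndSend runs,
    // so work placed after a successful claim (such as a lazy resume)
    // cannot be started twice by rapid sends.
    public bool TryBeginSend()
    {
        return Interlocked.CompareExchange(ref _sending, 1, 0) == 0;
    }

    // Releases the guard so the next send can proceed.
    public void EndSend()
    {
        Interlocked.Exchange(ref _sending, 0);
    }
}
```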
PR #375 Review - Round 4 (post-fix

| Finding | Status |
|---|---|
| N1 🟡 mutation-before-commit (Org.cs:1517) | ✅ FIXED |
| N2 🟡 silent null guard (LongRunningSessionSafetyTests.cs) | ✅ FIXED |
| N3 🟡 5s poll tax in RecoverFromPrematureIdleIfNeeded | ✅ FIXED: fast-path short-circuit added; checks PrematureIdleSignal.IsSet then IsEventsFileActive() before entering the 500ms polling loop. Most workers now exit instantly. |
| N4 🟡 disk fallback missing dispatchTime filter | |
| Minor 🟢 synchronous ReadToEnd() in SessionSidebar.razor:2459 | |
Remaining Issue

N4 🟡 MODERATE: Disk fallback LoadHistoryFromDiskAsync lacks dispatchTime filter (2 sites)
- `CopilotService.Organization.cs:1627-1629`: dead-event-stream fallback picks `LastOrDefault(m.Role == "assistant")` without checking `m.Timestamp >= dispatchTime`, so it could return stale content from a prior orchestration round
- `CopilotService.Organization.cs:1832-1835`: recovery-path disk fallback has the same issue
- The in-memory paths at lines 1581 and 1596 correctly filter by `dispatchTime`; the disk paths should match

Fix: Add `&& m.Timestamp >= dispatchTime` (or equivalent) to both `LastOrDefault` queries, consistent with the in-memory fallback paths.
Verdict: ⚠️ Request Changes
One moderate finding remains (N4). The 5s poll tax (N3) is fully addressed by the fast-path pattern. Recommend adding the dispatchTime filter to the disk fallback paths for consistency with the in-memory paths, then this is ready to merge.
Add m.Timestamp >= dispatchTime to both LoadHistoryFromDiskAsync fallback queries, matching the in-memory paths. Without this filter, stale assistant messages from prior orchestration rounds could be picked up as worker results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
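A sketch of the filtered query described above. `MessageSketch` is a hypothetical stand-in for the project's chat message type; only the predicate shape is taken from the review.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the project's chat message type.
public sealed record MessageSketch(string Role, string Content, DateTime Timestamp);

public static class FallbackSketch
{
    // Returns the latest assistant message written at or after dispatchTime,
    // or null when only older (stale) messages exist.
    public static string? LatestAssistantSince(IEnumerable<MessageSketch> history, DateTime dispatchTime)
    {
        return history
            .LastOrDefault(m => m.Role == "assistant" && m.Timestamp >= dispatchTime)
            ?.Content;
    }
}
```

Without the timestamp condition, the first case in a usage like "old assistant reply, no new reply yet" would return the old reply as if it were the worker's result.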
Both rename tests sent commands immediately after ConnectClientAsync without waiting for the initial session list to arrive. Under load, the rename message could race with the WebSocket handshake/initial state push, causing the server-side rename to fail silently. Fix: wait for client.Sessions to contain the pre-existing session before sending rename commands, matching the pattern used by other bridge integration tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #375 Review - Round 5 (5-model consensus)
Tests: ✅ 2667 passed, 0 failed | CI:

Previous Findings

New Findings

Detail on F2 (still blocking)

Fix: Wrap the …

Detail on N5 (new critical)

    // Line 1482: WRONG, UTC
    var dispatchTime = DateTime.UtcNow;
    // ChatMessage.cs lines 20,68-95: LOCAL
    new ChatMessage() : this("assistant", "", DateTime.Now) { }

Every …

Verdict:
…ilter

N5 (CRITICAL): dispatchTime used DateTime.UtcNow but ChatMessage.Timestamp uses DateTime.Now. On non-UTC systems, all valid post-dispatch messages were filtered out. Changed to DateTime.Now to match ChatMessage convention.

F2 (MODERATE): Recovery timeout OCE from recoveryCts propagated past the catch filter (which only catches outer cancellation), discarding accumulated bestResponse. Now catches inner OCE and breaks cleanly from the loop.

N6 (MODERATE): Resume disk fallback in ResumeOrchestrationIfPendingAsync was missing the dispatchTimeLocal filter, same class of bug as N4. Added the filter to match the in-memory path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
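The N5 bug comes from the fact that `DateTime` comparison operators compare ticks only and ignore `Kind`. The sketch below shows the raw comparison next to a normalized one. Note that normalizing to UTC is an alternative shown for illustration; the commit itself fixed the bug by using `DateTime.Now` on both sides.

```csharp
using System;

public static class TimestampSketch
{
    // Compares ticks only. A Local timestamp and a Utc timestamp are treated
    // as if they were on the same clock, which is wrong on non-UTC machines.
    public static bool RawIsAfter(DateTime messageTimestamp, DateTime dispatchTime)
    {
        return messageTimestamp >= dispatchTime;
    }

    // Converts both sides to UTC first, so mixed kinds compare as instants.
    public static bool NormalizedIsAfter(DateTime messageTimestamp, DateTime dispatchTime)
    {
        return messageTimestamp.ToUniversalTime() >= dispatchTime.ToUniversalTime();
    }
}
```

On a machine whose local time is behind UTC, `RawIsAfter(DateTime.Now, DateTime.UtcNow)` is false for hours after dispatch, which matches the "all valid post-dispatch messages were filtered out" symptom.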
PR #375 Review - Round 6 (5-model consensus)
Tests:

Round 5 Findings Status

New Findings (Round 6)

Detail on N7 (new, moderate)

Fix: …

Detail on N8 (minor)

    // Line 1756 (inside try block): bestResponse not in scope at catch below
    string? bestResponse = initialResponse;
    // Line 1861: user abort
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        return initialResponse; // ← should be bestResponse ?? initialResponse
    }

Fix: hoist …

Verdict:
…CE scope

N7 (MODERATE): Lazy-resume fallback used _client.CreateSessionAsync instead of GetClientForGroup(groupId), which would route to the wrong server for codespace-backed sessions. Changed to GetClientForGroup(groupId) to match the resume path 9 lines above.

N8 (LOW): Outer OCE catch on user abort returned initialResponse while bestResponse (potentially longer) was trapped inside the try scope. Hoisted bestResponse above the try block so the abort path preserves any content collected during recovery rounds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
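The N8 fix is a scoping change. A reduced sketch of the pattern, assuming a recovery loop that collects progressively longer responses; `RecoverAsync` and `collectRound` are hypothetical names, not the project's method.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RecoverySketch
{
    public static async Task<string> RecoverAsync(
        string initialResponse,
        Func<CancellationToken, Task<string?>> collectRound,
        int maxRounds,
        CancellationToken cancellationToken)
    {
        // Declared above the try so the catch below can still see it.
        string bestResponse = initialResponse;
        try
        {
            for (var round = 0; round < maxRounds; round++)
            {
                cancellationToken.ThrowIfCancellationRequested();
                var candidate = await collectRound(cancellationToken).ConfigureAwait(false);
                if (candidate != null && candidate.Length > bestResponse.Length)
                    bestResponse = candidate;
            }
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // User abort: fall through and return whatever was collected so far.
        }
        return bestResponse;
    }
}
```

Had `bestResponse` been declared inside the `try`, the abort path could only return `initialResponse`, which is the behavior the review flagged.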
PR #375 - Round 7: Architectural & SDK Workaround Assessment
Tests: ✅ 2667/2667 pass | Round 6 findings: N7 ✅ FIXED, N8 ✅ FIXED

SDK Workaround Classification

Summary: 1 proper, 4 workarounds, 2 hacks

The multi-agent orchestration layer is functionally robust: it has survived 7 rounds of multi-model review and handles real failure modes (premature idle, dead event streams, app restarts mid-turn). But it sits on top of two fragile foundations:

Recommended Upstream SDK Issues (Priority Order)

Is Multi-Agent Bullet-Proof?

For production use: Yes, with caveats.

✅ The recovery mechanisms work: premature idle, dead streams, app restarts, and user aborts are all handled with multi-tier fallbacks.

Verdict: ✅ Approve

All blocking findings from Rounds 1-6 are fixed. The code is as robust as it can be given current SDK limitations. The two hacks should be tracked as tech debt with upstream issue references.
PR #375 Round 7 - Architecture & SDK Workaround Assessment
Tests: ✅ 2667 passed, 0 failed | Round 6 fixes: ✅ N7 (wrong client) and ✅ N8 (OCE scope) both verified fixed.

SDK Workaround Audit

1. Premature Idle Detection (

| Classification | Count | Mechanisms |
|---|---|---|
| 🟢 Proper | 1 | Lazy session resume |
| 🟡 Workaround | 4 | Premature idle detection, dead event stream, recovery loop, watchdog |
| 🔴 Fragile | 2 | events.jsonl scraping, DateTime.Now inconsistency |
Prioritized Upstream Issues to File
- [P0] `session.idle` reliability: Root cause of premature idle recovery, watchdog, and TurnEnd fallback. Without this, all those workarounds are permanently necessary.
- [P1] `GetSessionHistory()` API: Eliminates fragile `events.jsonl` scraping across 5 callers.
- [P1] Event replay after session revival: After `DisposeAsync()` + `CreateSessionAsync()`, buffered events should be replayable. Eliminates dead event stream recovery.
- [P2] `IsSessionActive` / heartbeat API: Replaces `events.jsonl` mtime polling in watchdog Case A.
- [P3] Standardize UTC timestamps: Internal PolyPilot tech debt, not strictly an SDK issue.
Is Multi-Agent Bullet-Proof?
Honest answer: Production-ready but not bullet-proof.
The architecture is defensive with multiple recovery layers for each known failure. What makes it fragile is a single point of dependency on events.jsonl — if that file moves or changes format in a CLI update, the fallback chain silently returns empty worker results. There is no canary test or alarm for this regression path.
Recommended hardening before calling it bullet-proof:
- File P0+P1 upstream SDK issues and get commitments
- Add a canary test that validates `LoadHistoryFromDiskAsync` against a real `events.jsonl` fixture
- Add observability: when `events.jsonl` fallback fires, write a warning to crash.log/telemetry
PR verdict: ✅ Approve — All round findings fixed (N5, F2, N6, N7, N8). Remaining items (F4 MRE disposal, F5 sync ReadToEnd) are non-blocking. The workarounds are as good as they can be absent SDK changes.
…terrupted sessions

Fix 1: ManualResetEventSlim resource leak
- Add DisposePrematureIdleSignal helper called in all session removal/replacement paths: CreateSessionAsync error paths, SyncRemoteSessions, codespace reconnect, connection-error reconnect, worker revival, and provider cleanup.
- Guard PrematureIdleSignal reads with ObjectDisposedException catch.

Fix 2: Eager resume for mid-turn sessions
- Sessions with LastPrompt set (interrupted mid-turn) are now eagerly resumed via EnsureSessionConnectedAsync after all placeholders load, in a background Task.Run.
- Add per-session SemaphoreSlim in _sessionConnectLocks to prevent concurrent resume races between eager restore and first user SendPromptAsync.
- 2 new structural tests verify the eager resume wiring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
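The per-session lock in Fix 2 can be sketched as a dictionary of semaphores keyed by session name. This is an illustration under assumed names (`ConnectGateSketch`, `EnsureConnectedAsync`, and the `connect` delegate are hypothetical); the project's `_sessionConnectLocks` usage is not shown in this conversation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class ConnectGateSketch
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _locks = new();
    private readonly ConcurrentDictionary<string, bool> _connected = new();

    // Runs `connect` at most once per session, even when an eager restore and
    // a user send race each other. Returns true only for the caller that connected.
    public async Task<bool> EnsureConnectedAsync(string sessionName, Func<Task> connect)
    {
        var gate = _locks.GetOrAdd(sessionName, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_connected.ContainsKey(sessionName))
                return false; // another caller already connected this session

            await connect().ConfigureAwait(false);
            _connected[sessionName] = true;
            return true;
        }
        finally
        {
            gate.Release();
        }
    }
}
```

Because the lock is per session, a slow resume of one session does not delay sends to the others.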
PR #375 Round 8 Re-Review
Tests: CI:

Previous Findings Status

New Findings (5-model consensus)

🟢 MINOR: …

    if (_modelSwitchLocks.TryRemove(name, out var sem)) sem.Dispose();

No equivalent exists for …

Fix: Add …

🟢 MINOR: Both …

Fix: Add …

Confirmed Non-Issues

Verdict

✅ Approve: All round findings except F5 (pre-existing, non-blocking) are resolved. The two new findings are minor hygiene gaps with negligible runtime impact. The …

Co-reviewed by: Copilot 223556219+Copilot@users.noreply.github.com
…and history recovery (#391)

## Summary
Multiple bug fixes discovered after PR #375 merge, addressing worker failures, session persistence, server health detection, and conversation history loss.

## Changes

### 1. Never push to main rule
- Added as first Git Workflow rule in `.github/copilot-instructions.md`

### 2. Permission recovery killing multi-agent workers
- `TryRecoverPermissionAsync` calls `TrySetCanceled()` on `ResponseCompletion` TCS, propagating as `TaskCanceledException` to orchestrator workers
- **Fix**: Retry loop in `SendPromptAndWaitAsync` detects permission-recovery cancellation and re-awaits new state's TCS (up to 3 retries)

### 3. Session ID not persisted after reconnect
- When SDK returns different session ID on resume, `state.Info.SessionId` was updated in memory but `FlushSaveActiveSessionsToDisk()` never called
- **Fix**: Added flush after every SessionId update in 4 reconnect sites

### 4. Server health notice for posix_spawn failures
- Bundled CLI native modules can be deleted by unknown processes, causing `posix_spawn ENOENT`
- **Fix**: `ServerHealthNotice` banner on Dashboard with Restart Server button and full server restart cycle

### 5. Session history loss from dead event streams
- After server-side idle cleanup + re-resume, SDK event file writer breaks: events flow in-memory but never persist to events.jsonl
- **Fix**: `LoadBestHistoryAsync()` compares latest user message timestamps from events.jsonl and chat_history.db, picks whichever is more recent

### 6. PR review fixes
- **CRITICAL**: `RestartServerAsync` wrapped in `_clientReconnectLock` (race condition fix)
- **HIGH**: `DisposePrematureIdleSignal` added in restart disposal loop (MRE leak)
- **HIGH**: History recency threshold reduced from 1 minute to 5 seconds
- **MINOR**: Dashboard restores `ServerHealthNotice` on restart failure

## Related Issues
- #392: posix_spawn upstream bug
- #395: Spinner gap during premature idle recovery

## Testing
All 2669 tests pass.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
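The retry loop from change 2 can be sketched as follows. The delegates `getCurrentCompletion` and `wasPermissionRecovery` are hypothetical placeholders for however the real `SendPromptAndWaitAsync` obtains the current state's TCS and detects a permission-recovery cancellation; only the "re-await up to 3 times" shape is taken from the description.

```csharp
using System;
using System.Threading.Tasks;

public static class RetrySketch
{
    public static async Task<string> AwaitWithRecoveryAsync(
        Func<Task<string>> getCurrentCompletion,
        Func<bool> wasPermissionRecovery,
        int maxRetries = 3)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                // Always fetch the task again: after recovery, the session state
                // (and therefore its completion source) has been replaced.
                return await getCurrentCompletion().ConfigureAwait(false);
            }
            catch (TaskCanceledException) when (attempt < maxRetries && wasPermissionRecovery())
            {
                // Cancellation came from permission recovery, not from the user:
                // loop and await the new state's completion instead of failing.
            }
        }
    }
}
```

Any other cancellation, or a fourth consecutive recovery, falls outside the exception filter and propagates to the caller as before.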
- multi-agent-orchestration SKILL.md: 4→5 phase lifecycle, new IDLE-DEFER section, fix INV-O3 ordering, new INV-O15, mark premature idle bug as FIXED
- processing-state-safety SKILL.md: 10→16 paths table with tags, new INV-18 for BackgroundTasks, IDLE-DEFER in stuck session table, note EVT-REARM is now secondary defense, add PRs #373/#375/#399 to regression history
- copilot-instructions.md: update SDK Event Flow step 9 for BackgroundTasks check, add [IDLE-DEFER] diagnostic tag, fix stale path count (8→15+), add BackgroundTasksIdleTests to test list

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>