Conversation
… on worker revival

## Summary

Three concrete bug fixes found through exhaustive live stress testing, plus integration scenario files (38 single-session + 35 multi-agent). All 3304 unit tests pass. Stress-tested with 2x app-relaunch-mid-orchestration (both times: 2/2 worker results collected, synthesis completed, 0 phantom sessions).

## Fix 1: Atomic ClearProcessingState()

Added ClearProcessingState() in CopilotService.cs that atomically clears all 12 companion fields in one call. Applied to:

- CompleteResponse (CopilotService.Events.cs)
- AbortSessionAsync
- SendPromptAsync error paths

Previously each site cleared a different subset of fields — the #1 source of stuck-session regressions (13 PRs of fix/regression cycles).

## Fix 2: Phantom (previous) sessions on worker revival (CopilotService.Organization.cs)

When a worker's SDK session entered a zombie state, ExecuteWorkerAsync created a fresh session with a new ID but left the old ID in active-sessions.json without marking it closed. MergeSessionEntries then saw both IDs for the same worker name and created a (previous) entry. Fixed by: capturing deadSessionId BEFORE updating SessionId, adding it to _closedSessionIds, and setting RecoveredFromSessionId = deadSessionId so the merge knows to drop the persisted entry.

## Fix 3: Same-group revival merge logic (CopilotService.Persistence.cs)

CanSafelyDropSupersededGroupMoveEntry required GroupId to DIFFER as a safety check. But RecoveredFromSessionId is an explicit marker — if set, the new session definitively claims to have replaced the old one. Removed the GroupId constraint: now drops whenever RecoveredFromSessionId matches, regardless of GroupId.

## New Tests

- SessionPersistenceTests: same-group revival with/without RecoveredFromSessionId
- MultiAgentRegressionTests: 3 structural invariant tests for revival correctness
- Scenarios/single-session-scenarios.json: 38 integration test scenarios
- Scenarios/multi-agent-reliability-scenarios.json: 35 integration test scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
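A minimal sketch of the Fix 1 idea, not the actual PolyPilot code: one helper owns every IsProcessing companion field so no call site can clear a partial subset. The PR does not show the helper's signature or the full field list, so the `SessionState` type, field names, and the `accumulateApiTime` behavior below are illustrative assumptions pieced together from the descriptions in this thread.

```csharp
using System;

// Illustrative session state; the real CopilotService tracks these per session
// (field names assumed from the PR description, not copied from the code).
public sealed class SessionState
{
    public bool IsProcessing { get; set; }
    public DateTime? ProcessingStartedUtc { get; set; }
    public int ActiveToolCallCount { get; set; }
    public bool HasUsedToolsThisTurn { get; set; }
    public int EventCountThisTurn { get; set; }
    public bool AllowTurnStartRearm { get; set; }
    public double TotalApiTimeSeconds { get; set; }
    public int PremiumRequestsUsed { get; set; }
}

public partial class CopilotService
{
    // One helper owns every IsProcessing companion field, so no call site can
    // clear a partial subset (the anti-pattern behind the stuck-session bugs).
    private void ClearProcessingState(SessionState session, bool accumulateApiTime = true)
    {
        if (accumulateApiTime && session.ProcessingStartedUtc is DateTime started)
        {
            // Only charge the turn when a real API request was consumed.
            session.TotalApiTimeSeconds += (DateTime.UtcNow - started).TotalSeconds;
            session.PremiumRequestsUsed++;
        }

        session.IsProcessing = false;
        session.ProcessingStartedUtc = null;
        session.ActiveToolCallCount = 0;
        session.HasUsedToolsThisTurn = false;
        session.EventCountThisTurn = 0;
        // ...the remaining companion fields are cleared here, always together.
    }
}
```

Every completion, abort, and error path then calls this one method instead of hand-clearing its own subset of fields.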
When the app restarts mid-orchestration, EnsureSessionConnectedAsync writes session.resume to events.jsonl. This advances the file mtime during the 2000ms grace period, triggering a false 'premature idle detected' even though the worker actually completed normally.

The fix: after detecting that the mtime changed during the grace period (and in the polling loop), read the last event type from events.jsonl. If it is session.resume or session.shutdown, the mtime change was caused by a reconnect/terminal write — not by ongoing processing. Skip recovery.

Also adds a baseline for the polling loop: when a session.resume write is detected, update stableMtime to the current value and keep polling (don't trigger recovery on the reconnect write itself).

This eliminates the ~14s DISPATCH-RECOVER latency that appeared on every normal worker completion after a relaunch, reducing worker overhead from ~28s to ~2s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
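A sketch of the "read the last event type" check described in this commit, under assumptions: the events.jsonl lines are JSON objects with a `type` property, and the helper name `IsReconnectOrTerminalWrite` is invented here for illustration.

```csharp
using System.IO;
using System.Text.Json;

// Classify the most recent events.jsonl entry so a session.resume or
// session.shutdown write is not mistaken for ongoing processing.
internal static class EventLogInspector
{
    public static bool IsReconnectOrTerminalWrite(string eventsJsonlPath)
    {
        // Find the last non-empty line of the JSONL log.
        string? lastLine = null;
        foreach (var line in File.ReadLines(eventsJsonlPath))
        {
            if (!string.IsNullOrWhiteSpace(line))
                lastLine = line;
        }
        if (lastLine is null)
            return false;

        // A reconnect or terminal write should not trigger recovery.
        using var doc = JsonDocument.Parse(lastLine);
        return doc.RootElement.TryGetProperty("type", out var type)
            && type.GetString() is "session.resume" or "session.shutdown";
    }
}
```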
The 10-second secondary polling loop was running after every worker completion — even when no premature idle was detected — adding 12s of unnecessary latency to every multi-agent task.

The two-phase mtime check (settle 500ms + observe 1500ms = 2000ms total) already handles all realistic premature-idle cases:

- Premature idle + new events within 2s: detected via endMtime > stableMtime
- PrematureIdleSignal: checked fast-path before entering the grace period
- session.resume writes: explicitly skipped (reconnect, not premature idle)

After the 2000ms check returns not-detected, return immediately. Worker completion now registers in ~2s instead of 12s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
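A sketch of the two-phase grace period this commit describes, with illustrative names (it reuses the hypothetical `EventLogInspector` helper sketched above): settle 500ms so the OS flushes the final write's mtime, snapshot the baseline, observe 1500ms, and return immediately when nothing new landed, with no secondary polling loop afterwards.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class PrematureIdleDetector
{
    public static async Task<bool> DetectAsync(string eventsJsonlPath, CancellationToken ct)
    {
        await Task.Delay(500, ct);                                    // phase 1: let the final write settle
        var stableMtime = File.GetLastWriteTimeUtc(eventsJsonlPath);  // post-flush baseline

        await Task.Delay(1500, ct);                                   // phase 2: observe
        var endMtime = File.GetLastWriteTimeUtc(eventsJsonlPath);

        if (endMtime <= stableMtime)
            return false;   // normal completion: no polling loop, worker registers in ~2s

        // Something wrote after the baseline; reconnect/terminal writes
        // (session.resume, session.shutdown) do not count as premature idle.
        return !EventLogInspector.IsReconnectOrTerminalWrite(eventsJsonlPath);
    }
}
```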
🔍 Multi-Model Code Review — PR #531 (Re-Review v5 — Final)

PR: fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives

All Previous Findings — Status
New Findings: None

Three independent reviewers examined the full 2302-line diff. Two single-reviewer findings were raised and tested adversarially:
✅ Verified Correct
🏁 Recommendation: ✅ Approve

All 4 findings across 5 review rounds are resolved. No new consensus findings. The PR is a significant reliability improvement —
…counter, PremiumRequestsUsed on errors

- ClearProcessingState: remove AllowTurnStartRearm=true and _consecutiveWatchdogTimeouts reset (both are success-only operations; putting them in the helper created race windows on error/abort paths)
- CompleteResponse: explicitly set AllowTurnStartRearm=true and reset _consecutiveWatchdogTimeouts after ClearProcessingState (success path only)
- Reconnect/send failure paths: use accumulateApiTime=false (no real API call was consumed)
- Add 3 structural regression tests in ChatExperienceSafetyTests.cs to enforce these invariants

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
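A sketch of the ordering this commit describes, reusing the illustrative `SessionState`/`ClearProcessingState` sketch from earlier (member names are assumptions): the atomic helper clears turn state, and the success-only operations run afterwards in CompleteResponse so error and abort paths never re-arm turn start or reset cross-turn health counters.

```csharp
public partial class CopilotService
{
    private int _consecutiveWatchdogTimeouts;

    private void CompleteResponse(SessionState session)
    {
        ClearProcessingState(session);      // atomic clear; counts the consumed request

        // Success path only: a cleanly completed turn proves the session is healthy.
        session.AllowTurnStartRearm = true;
        _consecutiveWatchdogTimeouts = 0;
    }
}
```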
AbortSessionAsync now manually accumulates TotalApiTimeSeconds (request was consumed) then calls ClearProcessingState(accumulateApiTime: false) to avoid incrementing PremiumRequestsUsed on user-initiated aborts. Adds structural regression test to enforce this invariant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
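A sketch of the abort path just described, again reusing the earlier illustrative sketch: the consumed request's API time is accumulated by hand, then the atomic clear runs with accumulateApiTime: false so PremiumRequestsUsed is not incremented on a user-initiated abort. The method name matches the PR; its body and signature here are assumptions.

```csharp
using System;
using System.Threading.Tasks;

public partial class CopilotService
{
    private Task AbortSessionAsync(SessionState session)
    {
        // The request was consumed, so its wall-clock API time still counts...
        if (session.ProcessingStartedUtc is DateTime started)
            session.TotalApiTimeSeconds += (DateTime.UtcNow - started).TotalSeconds;

        // ...but the abort must not be billed as a premium request.
        ClearProcessingState(session, accumulateApiTime: false);
        return Task.CompletedTask;   // the real method also tears down the SDK turn
    }
}
```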
ClearProcessingState now clears all 19 companion fields atomically:

- Added: ActiveToolCallCount, HasUsedToolsThisTurn, SuccessfulToolCountThisTurn, ToolHealthStaleChecks, EventCountThisTurn, TurnEndReceivedAtTicks, IsReconnectedSend, CancelTurnEndFallback, CancelToolHealthCheck, ClearDeferredIdleTracking
- Removed manual field clears from: watchdog timeout (22 lines), watchdog crash (19 lines), AbortSessionAsync (8 lines), reconnect-failure (6 lines), send-failure (6 lines), CompleteResponse (10 lines)

This eliminates the exact anti-pattern that caused the 13-PR regression cycle: scattered IsProcessing companion field clears that could miss a field.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConsecutiveStuckCount is a cross-turn health accumulator (like _consecutiveWatchdogTimeouts). Resetting it in ClearProcessingState meant each watchdog timeout incremented a freshly reset counter from 0 to 1 (timeout #1: 0→1, timeout #2: 0→1 again), so the >=3 threshold for queue-clear and history-growth prevention was unreachable. Move the reset to CompleteResponse only (the success path proves the session is healthy). Add a structural test to enforce this invariant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Last remaining manual field-clearing site converted to use ClearProcessingState. This was the code path that fires 'Tool execution stuck' messages — it had the same anti-pattern of manually clearing 15+ fields instead of calling the atomic helper. Also updated RenderThrottleTests structural test to check for ClearProcessingState instead of literal IsProcessing=false. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
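An illustrative structural test in the spirit of the checks mentioned in these commits (ChatExperienceSafetyTests, RenderThrottleTests). xUnit is assumed, the source path and the expected helper signature are hypothetical, and the real suite scans the partial-class files of CopilotService rather than a single file.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using Xunit;

public class ClearProcessingStateStructuralTests
{
    [Fact]
    public void OnlyTheAtomicHelperSetsIsProcessingFalse()
    {
        var source = File.ReadAllText("../../../CopilotService.cs");   // path is illustrative

        // Exactly one literal clear is allowed: the one inside ClearProcessingState().
        var manualClears = Regex.Matches(source, @"IsProcessing\s*=\s*false").Count;
        Assert.Equal(1, manualClears);

        // And the atomic helper itself must exist (signature assumed).
        Assert.Contains("private void ClearProcessingState(", source);
    }
}
```

Source-scanning tests like this are blunt, but they fail the moment someone reintroduces a scattered manual clear, which is exactly the regression pattern this PR is closing off.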
Summary
End-to-end reliability audit and fix for the three chronic PolyPilot multi-agent bugs.
Bug 1: Companion-field desync on session completion
Root cause: 51 sites that set IsProcessing=false each had to manually clear 12 companion fields. Any missed clear caused stuck spinners or incorrect state.

Fix: ClearProcessingState() — a single atomic method that clears all 12 fields. Applied to CompleteResponse, AbortSessionAsync, and all SendPromptAsync error paths.

Bug 2: Phantom (previous) sessions after worker revival

Root cause: When a zombie worker was revived, the old session ID stayed in active-sessions.json. MergeSessionEntries saw both IDs for the same display name and kept both, creating a (previous) ghost.

Fix (see the sketch after this list):

- ExecuteWorkerAsync captures deadSessionId, adds it to _closedSessionIds, sets RecoveredFromSessionId on the new session
- CanSafelyDropSupersededGroupMoveEntry drops the superseded entry whenever RecoveredFromSessionId matches (removed unnecessary GroupId constraint)
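A minimal sketch of the revival bookkeeping, with illustrative stand-ins (`WorkerSession`, `ReviveWorkerAsync`, and the session-factory delegate are invented here); the real change lives in ExecuteWorkerAsync in CopilotService.Organization.cs. The point is the ordering: the dead ID is captured before SessionId is overwritten, so the merge can drop the persisted entry instead of surfacing a (previous) ghost.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class WorkerSession
{
    public string Name { get; set; } = "";
    public string SessionId { get; set; } = "";
    public string? RecoveredFromSessionId { get; set; }
}

public partial class CopilotService
{
    private readonly HashSet<string> _closedSessionIds = new();

    private async Task ReviveWorkerAsync(WorkerSession worker, Func<string, Task<string>> createSdkSession)
    {
        var deadSessionId = worker.SessionId;            // capture BEFORE reassigning

        worker.SessionId = await createSdkSession(worker.Name);
        worker.RecoveredFromSessionId = deadSessionId;   // explicit supersede marker

        // MergeSessionEntries treats this ID as closed and, with the marker set,
        // drops the persisted entry for the same worker name.
        _closedSessionIds.Add(deadSessionId);
    }
}
```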
Bug 3: DISPATCH-RECOVER fires on every normal worker completion (12s latency)

Root cause 1 (OS mtime-flush): The old code captured mtimeBefore immediately after CompleteResponse, before APFS had flushed the completing assistant.turn_end write's mtime. Within 2s, the OS flushed it — endMtime > mtimeBefore triggered DISPATCH-RECOVER even on clean completions.

Fix: Two-phase check: 500ms settle → snapshot stableMtime (post-flush baseline) → 1500ms observe → only endMtime > stableMtime (new writes after settle) triggers detection.

Root cause 2 (polling loop): A 10-second secondary polling loop ran after every worker completion — even when no premature idle was detected. This added ~10s of unnecessary latency to every multi-agent task.

Fix: Removed the polling loop. The two-phase grace period (2000ms total) handles all realistic premature-idle cases. Normal completion now registers in ~2s instead of 12s.

Root cause 3 (reconnect write): session.resume events written by EnsureSessionConnectedAsync during app reconnect advanced the mtime, triggering false detection. Fixed by explicitly skipping session.resume and session.shutdown in all mtime checks.

Live Test Results
- No phantom (previous) sessions
- Stress-tested: mid-execution app kill, workers complete and PendingOrchestration resumes correctly (2/2 results collected post-restart).
Tests
3305/3305 pass. New tests:
- SessionPersistenceTests — same-group revival with RecoveredFromSessionId
- MultiAgentRegressionTests — two-phase mtime invariants, detectStart polling-loop-removed structural check