fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives by PureWeen · Pull Request #531 · PureWeen/PolyPilot

PureWeen · 2026-04-07T11:58:43Z

Summary

End-to-end reliability audit and fix for the three chronic PolyPilot multi-agent bugs.

Bug 1: Companion-field desync on session completion

Root cause: 51 sites that set IsProcessing=false each had to manually clear 12 companion fields. Any missed clear caused stuck spinners or incorrect state.

Fix: ClearProcessingState() — single atomic method that clears all 12 fields. Applied to CompleteResponse, AbortSessionAsync, and all SendPromptAsync error paths.

Bug 2: Phantom `(previous)` sessions after worker revival

Root cause: When a zombie worker was revived, the old session ID stayed in active-sessions.json. MergeSessionEntries saw both IDs for the same display name and kept both, creating a (previous) ghost.

Fix:

Revival block captures deadSessionId, adds to _closedSessionIds, sets RecoveredFromSessionId on the new session
CanSafelyDropSupersededGroupMoveEntry drops the superseded entry whenever RecoveredFromSessionId matches (removed unnecessary GroupId constraint)

Bug 3: DISPATCH-RECOVER fires on every normal worker completion (12s latency)

Root cause 1 (OS mtime-flush): The old code captured mtimeBefore immediately after CompleteResponse, before APFS had flushed the completing assistant.turn_end write's mtime. Within 2s, the OS flushed it — endMtime > mtimeBefore triggered DISPATCH-RECOVER even on clean completions.

Fix: Two-phase check: 500ms settle → snapshot stableMtime (post-flush baseline) → 1500ms observe → only endMtime > stableMtime (new writes after settle) triggers detection.

Root cause 2 (polling loop): A 10-second secondary polling loop ran after every worker completion — even when no premature idle was detected. This added ~10s of unnecessary latency to every multi-agent task.

Fix: Removed the polling loop. The two-phase grace period (2000ms total) handles all realistic premature-idle cases. Normal completion now registers in ~2s instead of 12s.

Root cause 3 (reconnect write): session.resume events written by EnsureSessionConnectedAsync during app reconnect advanced the mtime, triggering false detection. Fixed by explicitly skipping session.resume and session.shutdown in all mtime checks.

Live Test Results

Metric	Before	After
Worker completion delay	~12s	~2s
DISPATCH-RECOVER false positives	7–9 per run	0
Phantom `(previous)` sessions	On every worker revival	0

Stress-tested: mid-execution app kill, workers complete and PendingOrchestration resumes correctly (2/2 results collected post-restart).

Tests

3305/3305 pass. New tests:

SessionPersistenceTests — same-group revival with RecoveredFromSessionId
MultiAgentRegressionTests — two-phase mtime invariants, detectStart polling-loop-removed structural check
38 single-session + 35 multi-agent integration scenario JSON files

… on worker revival ## Summary Three concrete bug fixes found through exhaustive live stress testing, plus integration scenario files (38 single-session + 35 multi-agent). All 3304 unit tests pass. Stress-tested with 2x app-relaunch-mid-orchestration (both times: 2/2 worker results collected, synthesis completed, 0 phantom sessions). ## Fix 1: Atomic ClearProcessingState() Added ClearProcessingState() in CopilotService.cs that atomically clears all 12 companion fields in one call. Applied to: - CompleteResponse (CopilotService.Events.cs) - AbortSessionAsync - SendPromptAsync error paths Previously each site cleared a different subset of fields — the #1 source of stuck-session regressions (13 PRs of fix/regression cycles). ## Fix 2: Phantom (previous) sessions on worker revival (CopilotService.Organization.cs) When a worker's SDK session entered a zombie state, ExecuteWorkerAsync created a fresh session with a new ID but left the old ID in active-sessions.json without marking it closed. MergeSessionEntries then saw both IDs for the same worker name and created a (previous) entry. Fixed by: capturing deadSessionId BEFORE updating SessionId, adding it to _closedSessionIds, and setting RecoveredFromSessionId = deadSessionId so the merge knows to drop the persisted entry. ## Fix 3: Same-group revival merge logic (CopilotService.Persistence.cs) CanSafelyDropSupersededGroupMoveEntry required GroupId to DIFFER as a safety check. But RecoveredFromSessionId is an explicit marker — if set, the new session definitively claims to have replaced the old one. Removed the GroupId constraint: now drops whenever RecoveredFromSessionId matches, regardless of GroupId. ## New Tests - SessionPersistenceTests: same-group revival with/without RecoveredFromSessionId - MultiAgentRegressionTests: 3 structural invariant tests for revival correctness - Scenarios/single-session-scenarios.json: 38 integration test scenarios - Scenarios/multi-agent-reliability-scenarios.json: 35 integration test scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When the app restarts mid-orchestration, EnsureSessionConnectedAsync writes session.resume to events.jsonl. This advances the file mtime during the 2000ms grace period, triggering a false 'premature idle detected' even though the worker actually completed normally. The fix: after detecting mtime changed during the grace period (and in the polling loop), read the last event type from events.jsonl. If it's session.resume or session.shutdown, the mtime change was caused by a reconnect/terminal write — not by ongoing processing. Skip recovery. Also adds a baseline for the polling loop: when a session.resume write is detected, update stableMtime to the current value and keep polling (don't trigger recovery on the reconnect write itself). This eliminates the ~14s DISPATCH-RECOVER latency that appeared on every normal worker completion after a relaunch, reducing worker overhead from ~28s to ~2s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The 10-second secondary polling loop was running after every worker completion — even when no premature idle was detected — adding 12s of unnecessary latency to every multi-agent task. The two-phase mtime check (settle 500ms + observe 1500ms = 2000ms total) already handles all realistic premature-idle cases: - Premature idle + new events within 2s: detected via endMtime > stableMtime - PrematureIdleSignal: checked fast-path before entering the grace period - session.resume writes: explicitly skipped (reconnect, not premature idle) After the 2000ms check returns not-detected, return immediately. Worker completion now registers in ~2s instead of 12s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-04-07T12:22:57Z

🔍 Multi-Model Code Review — PR #531 (Re-Review v5 — Final)

PR: fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives
Branch: fix/session-reliability-audit → main
Changes: +1,627 / -275 across 14 files (4 production, 10 test), 8 commits
CI Status: ⚠️ No CI checks reported on this branch

All Previous Findings — Status

#	Finding	Review Round	Status
1	🔴 `AllowTurnStartRearm` race in abort/error paths	v1	✅ FIXED
2	🟡 `_consecutiveWatchdogTimeouts` reset on non-success paths	v1	✅ FIXED
3	🟡 `PremiumRequestsUsed` overcounted on abort	v1→v3	✅ FIXED
4	🔴 `ConsecutiveStuckCount` accumulation broken	v4	✅ FIXED

New Findings: None

Three independent reviewers examined the full 2302-line diff. Two single-reviewer findings were raised and tested adversarially:

One reviewer flagged a potential CancelTurnEndFallback timing issue in TriggerToolHealthRecovery. Adversarial verification confirmed the cancellation is still synchronous (line 2340, before InvokeOnUI), plus CompleteResponse uses a generation guard. Discarded (1/3).
One reviewer re-raised the EVT-REARM signal concern from v3. Previously debunked: the PrematureIdleDetectionWindowMs polling loop with WaitForPrematureIdleSignal still exists in the actual source. Discarded (previously 0/2).

✅ Verified Correct

ClearProcessingState — correctly clears 20+ per-turn fields; correctly excludes AllowTurnStartRearm, _consecutiveWatchdogTimeouts, ConsecutiveStuckCount (cross-turn accumulators)
CompleteResponse — sets AllowTurnStartRearm=true, resets _consecutiveWatchdogTimeouts, resets ConsecutiveStuckCount=0 all AFTER ClearProcessingState. Correct ordering.
AbortSessionAsync — manual API time accumulation before ClearProcessingState(accumulateApiTime: false). WasUserAborted and AllowTurnStartRearm=false set after. No double-clear.
Watchdog kill/crash — response captured BEFORE ClearProcessingState. API time accumulated manually. AllowTurnStartRearm=false and ConsecutiveStuckCount++ set after. Correct.
TriggerToolHealthRecovery — response captured before ClearProcessingState. Cancels fallback/watchdog synchronously. AllowTurnStartRearm=false after. Correct.
Error paths — accumulateApiTime: false. AllowTurnStartRearm stays false from SendPromptAsync entry. Correct.
Two-phase mtime check — 500ms settle + 1500ms observe. session.resume/session.shutdown filtered. Polling loop preserved for genuine premature idle. Correct.
Phantom (previous) fix — RecoveredFromSessionId + _closedSessionIds + simplified CanSafelyDropSupersededGroupMoveEntry. Correct.
5 structural guard tests — ClearProcessingState_DoesNotSetAllowTurnStartRearm, ClearProcessingState_DoesNotResetConsecutiveWatchdogTimeouts, ClearProcessingState_DoesNotResetConsecutiveStuckCount, CompleteResponse_SetsAllowTurnStartRearmAndResetsWatchdogCounter, AbortSessionAsync_DoesNotIncrementPremiumRequestsViaAccumulateApiTime
Thread safety — CancelTurnEndFallback/CancelToolHealthCheck use Interlocked.Exchange(ref field, null) before Cancel/Dispose. Concurrent calls are safe no-ops.

🏁 Recommendation: ✅ Approve

All 4 findings across 5 review rounds are resolved. No new consensus findings. The PR is a significant reliability improvement — ClearProcessingState eliminates the #1 bug category (companion-field desync), the two-phase mtime check eliminates DISPATCH-RECOVER false positives, and the phantom session fix prevents (previous) entries. Well-tested with structural guard tests protecting all critical invariants.

…counter, PremiumRequestsUsed on errors - ClearProcessingState: remove AllowTurnStartRearm=true and _consecutiveWatchdogTimeouts reset (both are success-only operations; putting them in the helper created race windows on error/abort paths) - CompleteResponse: explicitly set AllowTurnStartRearm=true and reset _consecutiveWatchdogTimeouts after ClearProcessingState (success path only) - Reconnect/send failure paths: use accumulateApiTime=false (no real API call was consumed) - Add 3 structural regression tests in ChatExperienceSafetyTests.cs to enforce these invariants Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AbortSessionAsync now manually accumulates TotalApiTimeSeconds (request was consumed) then calls ClearProcessingState(accumulateApiTime: false) to avoid incrementing PremiumRequestsUsed on user-initiated aborts. Adds structural regression test to enforce this invariant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ClearProcessingState now clears all 19 companion fields atomically: - Added: ActiveToolCallCount, HasUsedToolsThisTurn, SuccessfulToolCountThisTurn, ToolHealthStaleChecks, EventCountThisTurn, TurnEndReceivedAtTicks, IsReconnectedSend, CancelTurnEndFallback, CancelToolHealthCheck, ClearDeferredIdleTracking - Removed manual field clears from: watchdog timeout (22 lines), watchdog crash (19 lines), AbortSessionAsync (8 lines), reconnect-failure (6 lines), send-failure (6 lines), CompleteResponse (10 lines) This eliminates the exact anti-pattern that caused the 13-PR regression cycle: scattered IsProcessing companion field clears that could miss a field. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ConsecutiveStuckCount is a cross-turn health accumulator (like _consecutiveWatchdogTimeouts). Resetting it in ClearProcessingState meant watchdog timeout #1 set count=0→1, timeout #2 set count=0→1, so the >=3 threshold for queue-clear and history-growth prevention was unreachable. Move reset to CompleteResponse only (success path proves session is healthy). Add structural test to enforce this invariant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Last remaining manual field-clearing site converted to use ClearProcessingState. This was the code path that fires 'Tool execution stuck' messages — it had the same anti-pattern of manually clearing 15+ fields instead of calling the atomic helper. Also updated RenderThrottleTests structural test to check for ClearProcessingState instead of literal IsProcessing=false. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen and others added 3 commits April 7, 2026 01:07

PureWeen and others added 5 commits April 7, 2026 07:40

PureWeen merged commit 7fbb11a into main Apr 7, 2026

PureWeen deleted the fix/session-reliability-audit branch April 7, 2026 15:52

PureWeen mentioned this pull request Apr 7, 2026

fix: phantom sessions on reconnect + ForceComplete consolidation + health monitoring skill #540

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives#531

fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives#531
PureWeen merged 8 commits intomainfrom
fix/session-reliability-audit

PureWeen commented Apr 7, 2026

Uh oh!

PureWeen commented Apr 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Apr 7, 2026

Summary

Bug 1: Companion-field desync on session completion

Bug 2: Phantom (previous) sessions after worker revival

Bug 3: DISPATCH-RECOVER fires on every normal worker completion (12s latency)

Live Test Results

Tests

Uh oh!

PureWeen commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Multi-Model Code Review — PR #531 (Re-Review v5 — Final)

All Previous Findings — Status

New Findings: None

✅ Verified Correct

🏁 Recommendation: ✅ Approve

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 2: Phantom `(previous)` sessions after worker revival

PureWeen commented Apr 7, 2026 •

edited

Loading