Skip to content

fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives#531

Merged
PureWeen merged 8 commits intomainfrom
fix/session-reliability-audit
Apr 7, 2026
Merged

fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives#531
PureWeen merged 8 commits intomainfrom
fix/session-reliability-audit

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Apr 7, 2026

Summary

End-to-end reliability audit and fix for the three chronic PolyPilot multi-agent bugs.

Bug 1: Companion-field desync on session completion

Root cause: 51 sites that set IsProcessing=false each had to manually clear 12 companion fields. Any missed clear caused stuck spinners or incorrect state.

Fix: ClearProcessingState() — single atomic method that clears all 12 fields. Applied to CompleteResponse, AbortSessionAsync, and all SendPromptAsync error paths.

Bug 2: Phantom (previous) sessions after worker revival

Root cause: When a zombie worker was revived, the old session ID stayed in active-sessions.json. MergeSessionEntries saw both IDs for the same display name and kept both, creating a (previous) ghost.

Fix:

  • Revival block captures deadSessionId, adds to _closedSessionIds, sets RecoveredFromSessionId on the new session
  • CanSafelyDropSupersededGroupMoveEntry drops the superseded entry whenever RecoveredFromSessionId matches (removed unnecessary GroupId constraint)

Bug 3: DISPATCH-RECOVER fires on every normal worker completion (12s latency)

Root cause 1 (OS mtime-flush): The old code captured mtimeBefore immediately after CompleteResponse, before APFS had flushed the completing assistant.turn_end write's mtime. Within 2s, the OS flushed it — endMtime > mtimeBefore triggered DISPATCH-RECOVER even on clean completions.

Fix: Two-phase check: 500ms settle → snapshot stableMtime (post-flush baseline) → 1500ms observe → only endMtime > stableMtime (new writes after settle) triggers detection.

Root cause 2 (polling loop): A 10-second secondary polling loop ran after every worker completion — even when no premature idle was detected. This added ~10s of unnecessary latency to every multi-agent task.

Fix: Removed the polling loop. The two-phase grace period (2000ms total) handles all realistic premature-idle cases. Normal completion now registers in ~2s instead of 12s.

Root cause 3 (reconnect write): session.resume events written by EnsureSessionConnectedAsync during app reconnect advanced the mtime, triggering false detection. Fixed by explicitly skipping session.resume and session.shutdown in all mtime checks.

Live Test Results

Metric Before After
Worker completion delay ~12s ~2s
DISPATCH-RECOVER false positives 7–9 per run 0
Phantom (previous) sessions On every worker revival 0

Stress-tested: mid-execution app kill, workers complete and PendingOrchestration resumes correctly (2/2 results collected post-restart).

Tests

3305/3305 pass. New tests:

  • SessionPersistenceTests — same-group revival with RecoveredFromSessionId
  • MultiAgentRegressionTests — two-phase mtime invariants, detectStart polling-loop-removed structural check
  • 38 single-session + 35 multi-agent integration scenario JSON files

PureWeen and others added 3 commits April 7, 2026 01:07
… on worker revival

## Summary

Three concrete bug fixes found through exhaustive live stress testing,
plus integration scenario files (38 single-session + 35 multi-agent).

All 3304 unit tests pass. Stress-tested with 2x app-relaunch-mid-orchestration
(both times: 2/2 worker results collected, synthesis completed, 0 phantom sessions).

## Fix 1: Atomic ClearProcessingState()

Added ClearProcessingState() in CopilotService.cs that atomically clears all
12 companion fields in one call. Applied to:
- CompleteResponse (CopilotService.Events.cs)
- AbortSessionAsync
- SendPromptAsync error paths

Previously each site cleared a different subset of fields — the #1 source of
stuck-session regressions (13 PRs of fix/regression cycles).

## Fix 2: Phantom (previous) sessions on worker revival (CopilotService.Organization.cs)

When a worker's SDK session entered a zombie state, ExecuteWorkerAsync created
a fresh session with a new ID but left the old ID in active-sessions.json without
marking it closed. MergeSessionEntries then saw both IDs for the same worker name
and created a (previous) entry.

Fixed by: capturing deadSessionId BEFORE updating SessionId, adding it to
_closedSessionIds, and setting RecoveredFromSessionId = deadSessionId so the
merge knows to drop the persisted entry.

## Fix 3: Same-group revival merge logic (CopilotService.Persistence.cs)

CanSafelyDropSupersededGroupMoveEntry required GroupId to DIFFER as a safety
check. But RecoveredFromSessionId is an explicit marker — if set, the new session
definitively claims to have replaced the old one. Removed the GroupId constraint:
now drops whenever RecoveredFromSessionId matches, regardless of GroupId.

## New Tests

- SessionPersistenceTests: same-group revival with/without RecoveredFromSessionId
- MultiAgentRegressionTests: 3 structural invariant tests for revival correctness
- Scenarios/single-session-scenarios.json: 38 integration test scenarios
- Scenarios/multi-agent-reliability-scenarios.json: 35 integration test scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the app restarts mid-orchestration, EnsureSessionConnectedAsync writes
session.resume to events.jsonl. This advances the file mtime during the
2000ms grace period, triggering a false 'premature idle detected' even
though the worker actually completed normally.

The fix: after detecting mtime changed during the grace period (and in
the polling loop), read the last event type from events.jsonl. If it's
session.resume or session.shutdown, the mtime change was caused by a
reconnect/terminal write — not by ongoing processing. Skip recovery.

Also adds a baseline for the polling loop: when a session.resume write
is detected, update stableMtime to the current value and keep polling
(don't trigger recovery on the reconnect write itself).

This eliminates the ~14s DISPATCH-RECOVER latency that appeared on every
normal worker completion after a relaunch, reducing worker overhead from
~28s to ~2s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 10-second secondary polling loop was running after every worker
completion — even when no premature idle was detected — adding 12s of
unnecessary latency to every multi-agent task.

The two-phase mtime check (settle 500ms + observe 1500ms = 2000ms total)
already handles all realistic premature-idle cases:
- Premature idle + new events within 2s: detected via endMtime > stableMtime
- PrematureIdleSignal: checked fast-path before entering the grace period
- session.resume writes: explicitly skipped (reconnect, not premature idle)

After the 2000ms check returns not-detected, return immediately.
Worker completion now registers in ~2s instead of 12s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 7, 2026

🔍 Multi-Model Code Review — PR #531 (Re-Review v5 — Final)

PR: fix: session reliability audit — eliminate stuck sessions, phantom (previous) entries, and DISPATCH-RECOVER false positives
Branch: fix/session-reliability-auditmain
Changes: +1,627 / -275 across 14 files (4 production, 10 test), 8 commits
CI Status: ⚠️ No CI checks reported on this branch


All Previous Findings — Status

# Finding Review Round Status
1 🔴 AllowTurnStartRearm race in abort/error paths v1 FIXED
2 🟡 _consecutiveWatchdogTimeouts reset on non-success paths v1 FIXED
3 🟡 PremiumRequestsUsed overcounted on abort v1→v3 FIXED
4 🔴 ConsecutiveStuckCount accumulation broken v4 FIXED

New Findings: None

Three independent reviewers examined the full 2302-line diff. Two single-reviewer findings were raised and tested adversarially:

  • One reviewer flagged a potential CancelTurnEndFallback timing issue in TriggerToolHealthRecovery. Adversarial verification confirmed the cancellation is still synchronous (line 2340, before InvokeOnUI), plus CompleteResponse uses a generation guard. Discarded (1/3).

  • One reviewer re-raised the EVT-REARM signal concern from v3. Previously debunked: the PrematureIdleDetectionWindowMs polling loop with WaitForPrematureIdleSignal still exists in the actual source. Discarded (previously 0/2).


✅ Verified Correct

  • ClearProcessingState — correctly clears 20+ per-turn fields; correctly excludes AllowTurnStartRearm, _consecutiveWatchdogTimeouts, ConsecutiveStuckCount (cross-turn accumulators)
  • CompleteResponse — sets AllowTurnStartRearm=true, resets _consecutiveWatchdogTimeouts, resets ConsecutiveStuckCount=0 all AFTER ClearProcessingState. Correct ordering.
  • AbortSessionAsync — manual API time accumulation before ClearProcessingState(accumulateApiTime: false). WasUserAborted and AllowTurnStartRearm=false set after. No double-clear.
  • Watchdog kill/crash — response captured BEFORE ClearProcessingState. API time accumulated manually. AllowTurnStartRearm=false and ConsecutiveStuckCount++ set after. Correct.
  • TriggerToolHealthRecovery — response captured before ClearProcessingState. Cancels fallback/watchdog synchronously. AllowTurnStartRearm=false after. Correct.
  • Error pathsaccumulateApiTime: false. AllowTurnStartRearm stays false from SendPromptAsync entry. Correct.
  • Two-phase mtime check — 500ms settle + 1500ms observe. session.resume/session.shutdown filtered. Polling loop preserved for genuine premature idle. Correct.
  • Phantom (previous) fixRecoveredFromSessionId + _closedSessionIds + simplified CanSafelyDropSupersededGroupMoveEntry. Correct.
  • 5 structural guard testsClearProcessingState_DoesNotSetAllowTurnStartRearm, ClearProcessingState_DoesNotResetConsecutiveWatchdogTimeouts, ClearProcessingState_DoesNotResetConsecutiveStuckCount, CompleteResponse_SetsAllowTurnStartRearmAndResetsWatchdogCounter, AbortSessionAsync_DoesNotIncrementPremiumRequestsViaAccumulateApiTime
  • Thread safetyCancelTurnEndFallback/CancelToolHealthCheck use Interlocked.Exchange(ref field, null) before Cancel/Dispose. Concurrent calls are safe no-ops.

🏁 Recommendation: ✅ Approve

All 4 findings across 5 review rounds are resolved. No new consensus findings. The PR is a significant reliability improvement — ClearProcessingState eliminates the #1 bug category (companion-field desync), the two-phase mtime check eliminates DISPATCH-RECOVER false positives, and the phantom session fix prevents (previous) entries. Well-tested with structural guard tests protecting all critical invariants.

PureWeen and others added 5 commits April 7, 2026 07:40
…counter, PremiumRequestsUsed on errors

- ClearProcessingState: remove AllowTurnStartRearm=true and _consecutiveWatchdogTimeouts reset
  (both are success-only operations; putting them in the helper created race windows on error/abort paths)
- CompleteResponse: explicitly set AllowTurnStartRearm=true and reset _consecutiveWatchdogTimeouts
  after ClearProcessingState (success path only)
- Reconnect/send failure paths: use accumulateApiTime=false (no real API call was consumed)
- Add 3 structural regression tests in ChatExperienceSafetyTests.cs to enforce these invariants

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AbortSessionAsync now manually accumulates TotalApiTimeSeconds (request was
consumed) then calls ClearProcessingState(accumulateApiTime: false) to avoid
incrementing PremiumRequestsUsed on user-initiated aborts.

Adds structural regression test to enforce this invariant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ClearProcessingState now clears all 19 companion fields atomically:
- Added: ActiveToolCallCount, HasUsedToolsThisTurn, SuccessfulToolCountThisTurn,
  ToolHealthStaleChecks, EventCountThisTurn, TurnEndReceivedAtTicks,
  IsReconnectedSend, CancelTurnEndFallback, CancelToolHealthCheck,
  ClearDeferredIdleTracking
- Removed manual field clears from: watchdog timeout (22 lines), watchdog crash
  (19 lines), AbortSessionAsync (8 lines), reconnect-failure (6 lines),
  send-failure (6 lines), CompleteResponse (10 lines)

This eliminates the exact anti-pattern that caused the 13-PR regression cycle:
scattered IsProcessing companion field clears that could miss a field.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConsecutiveStuckCount is a cross-turn health accumulator (like
_consecutiveWatchdogTimeouts). Resetting it in ClearProcessingState meant
watchdog timeout #1 set count=0→1, timeout #2 set count=0→1, so the >=3
threshold for queue-clear and history-growth prevention was unreachable.

Move reset to CompleteResponse only (success path proves session is healthy).
Add structural test to enforce this invariant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Last remaining manual field-clearing site converted to use
ClearProcessingState. This was the code path that fires 'Tool execution
stuck' messages — it had the same anti-pattern of manually clearing 15+
fields instead of calling the atomic helper.

Also updated RenderThrottleTests structural test to check for
ClearProcessingState instead of literal IsProcessing=false.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant