Skip to content

fix: prevent sessions from being permanently stuck when watchdog crashes#372

Closed
PureWeen wants to merge 0 commit intomainfrom
fix/session-copilotimprovements-can-you-tell-20260313-1924
Closed

fix: prevent sessions from being permanently stuck when watchdog crashes#372
PureWeen wants to merge 0 commit intomainfrom
fix/session-copilotimprovements-can-you-tell-20260313-1924

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Sessions ("CopilotImprovements", "Evalutation") are stuck showing "Sending..." forever with IsProcessing=true.
The watchdog detects inactivity and attempts cleanup, but if any exception occurs either in the watchdog loop or in the timeout callback posted to the UI thread, IsProcessing is never cleared.

Root Cause — Two paths that fail to clear IsProcessing

Path 1: Watchdog loop exception

The catch (Exception ex) block logged the error but did NOT clear IsProcessing or companion state. Any unexpected exception (NRE, state corruption, etc.) left the session permanently stuck.

Path 2: Timeout callback exception

The Case C timeout callback called FlushCurrentResponse() BEFORE setting IsProcessing=false. If the flush threw, the exception was silently caught by InvokeOnUI's try-catch, and the watchdog had already exited (break) — no further cleanup possible.

Fix

  • Protect FlushCurrentResponse in the timeout callback with try-catch so IsProcessing is always cleared even if the flush fails
  • Add INV-1 compliant crash recovery to the catch(Exception) block: clears IsProcessing + all 9 companion fields, completes TCS, fires OnSessionComplete/OnError/OnStateChanged

Tests

Added 4 regression tests:

  • WatchdogCatchBlock_ClearsIsProcessing_InSource — verifies crash recovery clears INV-1 fields
  • WatchdogCatchBlock_CompletesResponseCompletion_InSource — verifies TCS/notifications
  • WatchdogKillCallback_ProtectsFlushCurrentResponse_InSource — verifies try-catch wrapping
  • WatchdogCrashRecovery_ClearsCompanionFields — behavioral test for state cleanup

All 2545 tests pass (4 pre-existing font/CSS failures unrelated to this change).

@PureWeen
Copy link
Copy Markdown
Owner Author

Multi-Model Review — PR #372

5 models reviewed (2× claude-opus-4.6, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Findings below pass the consensus filter (2+ models).

CI: ⚠️ No checks reported | Prior reviews: None


🟡 MODERATE — Missing CancelTurnEndFallback(state) in crash recovery

4/5 models | CopilotService.Events.cs ~line 2107

The normal kill path calls CancelTurnEndFallback(state) before clearing IsProcessing. The new crash recovery InvokeOnUI callback omits it. A TurnEnd→Idle fallback timer in-flight when the watchdog crashes can fire after crash recovery clears IsProcessing, invoking CompleteResponse on an already-completed session. Fix: add CancelTurnEndFallback(state); at the top of the InvokeOnUI lambda, matching the normal kill path.


🟡 MODERATE — Missing MessageQueue.Clear() for crash loop prevention

2/5 models | CopilotService.Events.cs ~line 2130

The normal kill path has a ConsecutiveStuckCount >= 3 branch that clears the message queue to break auto-redispatch into a repeatedly-stuck session. The crash recovery increments ConsecutiveStuckCount but skips this guard. A session that crashes the watchdog 3+ times will keep auto-redispatching into the same crash. Fix: add the >= 3 queue-clear guard immediately after ConsecutiveStuckCount++, mirroring the normal kill path.


🟡 MODERATE — Truncated response delivered to callers when FlushCurrentResponse throws

2/5 models | CopilotService.Events.cs ~line 2125

If FlushCurrentResponse throws (swallowed), crashResponse = state.FlushedResponse.ToString() captures only already-flushed content; CurrentResponse.Clear() then discards the unmerged partial text. Orchestrators waiting on ResponseCompletion receive a truncated/empty string indistinguishable from a genuine empty response. Consider using null or a sentinel to let callers distinguish crash recovery from normal completion.


🟢 MINOR — TotalApiTimeSeconds not accumulated before clearing ProcessingStartedAt

3/5 models | CopilotService.Events.cs ~line 2128

The normal kill path does TotalApiTimeSeconds += (UtcNow - ProcessingStartedAt) before nulling. Crash recovery skips this, silently dropping API time metrics for crashed turns.


🟡 MODERATE — Behavioral test misses SessionState INV-1 fields

2/5 models | ProcessingWatchdogTests.cs ~line 2760

WatchdogCrashRecovery_ClearsCompanionFields only exercises AgentSessionInfo properties. It never verifies SendingFlag, ActiveToolCallCount, or SuccessfulToolCountThisTurn (the Interlocked fields on SessionState), which are most critical for stuck-session races. The source-scan tests check string presence but not runtime reset behavior.


Summary

Finding Severity Consensus
Missing CancelTurnEndFallback in crash recovery 🟡 MODERATE 4/5
Missing message queue clear (ConsecutiveStuckCount ≥ 3) 🟡 MODERATE 2/5
Truncated crashResponse to TCS callers 🟡 MODERATE 2/5
TotalApiTimeSeconds not accumulated 🟢 MINOR 3/5
Behavioral test gaps for SessionState INV-1 fields 🟡 MODERATE 2/5

⚠️ Request Changes

The core fix is correct and well-structured. Specific ask: (1) Add CancelTurnEndFallback(state); at the top of the crash recovery InvokeOnUI lambda. (2) Add the ConsecutiveStuckCount >= 3 queue-clear guard after ConsecutiveStuckCount++.

@PureWeen PureWeen closed this Mar 15, 2026
@PureWeen PureWeen force-pushed the fix/session-copilotimprovements-can-you-tell-20260313-1924 branch from fbba29e to 37abe13 Compare March 15, 2026 14:30
@PureWeen
Copy link
Copy Markdown
Owner Author

Superseded by #373 which cherry-picked this watchdog crash safety net commit and then added IsOrphaned guards, complete INV-1 field coverage, CurrentResponse in crash recovery, and 7 additional rounds of hardening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant