fix: prevent sessions from being permanently stuck when watchdog crashes #372
Multi-Model Review: PR #372
5 models reviewed (2× claude-opus-4.6, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Findings below pass the consensus filter (2+ models). CI: 🟡 MODERATE — Missing

| Finding | Severity | Consensus |
|---|---|---|
| Missing CancelTurnEndFallback in crash recovery | 🟡 MODERATE | 4/5 |
| Missing message queue clear (ConsecutiveStuckCount ≥ 3) | 🟡 MODERATE | 2/5 |
| Truncated crashResponse to TCS callers | 🟡 MODERATE | 2/5 |
| TotalApiTimeSeconds not accumulated | 🟢 MINOR | 3/5 |
| Behavioral test gaps for SessionState INV-1 fields | 🟡 MODERATE | 2/5 |
⚠️ Request Changes
The core fix is correct and well-structured. Specific ask: (1) Add CancelTurnEndFallback(state); at the top of the crash recovery InvokeOnUI lambda. (2) Add the ConsecutiveStuckCount >= 3 queue-clear guard after ConsecutiveStuckCount++.
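The two requested changes could look roughly like the sketch below. This is a minimal illustration, not the PR's code: SessionState, MessageQueue, TurnEndFallbackPending, and the bodies of CancelTurnEndFallback and InvokeOnUI are hypothetical stand-ins, and only the ordering of the calls is taken from the review.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real session state; only the members
// mentioned in the review are modelled.
class SessionState
{
    public bool IsProcessing = true;
    public int ConsecutiveStuckCount;
    public bool TurnEndFallbackPending = true;              // assumed flag
    public Queue<string> MessageQueue = new Queue<string>(); // assumed queue type
}

static class CrashRecoverySketch
{
    // Assumption: the real method cancels a pending turn-end fallback.
    static void CancelTurnEndFallback(SessionState state) =>
        state.TurnEndFallbackPending = false;

    // Assumption: the real method marshals to the UI thread; here it runs inline.
    static void InvokeOnUI(Action action) => action();

    public static void RecoverFromWatchdogCrash(SessionState state)
    {
        InvokeOnUI(() =>
        {
            // Ask (1): cancel the fallback first, at the top of the lambda.
            CancelTurnEndFallback(state);

            state.IsProcessing = false;
            state.ConsecutiveStuckCount++;

            // Ask (2): after the increment, drop queued messages once the
            // session has been stuck three times in a row.
            if (state.ConsecutiveStuckCount >= 3)
                state.MessageQueue.Clear();
        });
    }

    static void Main()
    {
        var state = new SessionState { ConsecutiveStuckCount = 2 };
        state.MessageQueue.Enqueue("queued prompt");
        RecoverFromWatchdogCrash(state);
        Console.WriteLine($"{state.IsProcessing} {state.ConsecutiveStuckCount} {state.MessageQueue.Count}");
        // prints "False 3 0"
    }
}
```

Placing the cancellation first means a fallback cannot fire against state that the rest of the lambda is in the middle of resetting.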
Force-pushed from fbba29e to 37abe13.
Superseded by #373 which cherry-picked this watchdog crash safety net commit and then added IsOrphaned guards, complete INV-1 field coverage, CurrentResponse in crash recovery, and 7 additional rounds of hardening.
Problem
Sessions ("CopilotImprovements", "Evalutation") are stuck showing "Sending..." forever with IsProcessing=true. The watchdog detects inactivity and attempts cleanup, but if any exception occurs either in the watchdog loop or in the timeout callback posted to the UI thread, IsProcessing is never cleared.
Root Cause: Two paths that fail to clear IsProcessing
Path 1: Watchdog loop exception
The catch (Exception ex) block logged the error but did NOT clear IsProcessing or companion state. Any unexpected exception (NRE, state corruption, etc.) left the session permanently stuck.
Path 2: Timeout callback exception
The Case C timeout callback called FlushCurrentResponse() BEFORE setting IsProcessing=false. If the flush threw, the exception was silently caught by InvokeOnUI's try-catch, and the watchdog had already exited (break), so no further cleanup was possible.
Fix
- Timeout callback: FlushCurrentResponse() is wrapped in try-catch, so IsProcessing is always cleared even if the flush fails
- Watchdog catch (Exception) block: clears IsProcessing + all 9 companion fields, completes TCS, fires OnSessionComplete/OnError/OnStateChanged
Tests
Added 4 regression tests:
- WatchdogCatchBlock_ClearsIsProcessing_InSource: verifies crash recovery clears INV-1 fields
- WatchdogCatchBlock_CompletesResponseCompletion_InSource: verifies TCS/notifications
- WatchdogKillCallback_ProtectsFlushCurrentResponse_InSource: verifies try-catch wrapping
- WatchdogCrashRecovery_ClearsCompanionFields: behavioral test for state cleanup
All 2545 tests pass (4 pre-existing font/CSS failures unrelated to this change).
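For readers who want the shape of the two fixes in one place, here is a minimal sketch. It is not the PR's code: SessionState, ResponseCompletion's result type, the placeholder result strings, and the RunWatchdogAsync/OnWatchdogTimeout names are assumptions made for illustration, the 9 companion fields are omitted, and OnStateChanged stands in for all three notifications.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the real session state.
class SessionState
{
    public bool IsProcessing = true;
    public TaskCompletionSource<string> ResponseCompletion =
        new TaskCompletionSource<string>();
    public Action OnStateChanged = () => { };
}

static class WatchdogSketch
{
    // Stand-in flush that always throws, to simulate the Path 2 failure.
    static void FlushCurrentResponse(SessionState state) =>
        throw new InvalidOperationException("flush failed");

    // Path 2: the flush is wrapped, so its failure cannot skip the reset.
    public static void OnWatchdogTimeout(SessionState state)
    {
        try
        {
            FlushCurrentResponse(state);
        }
        catch (Exception ex)
        {
            Console.WriteLine($"flush failed during timeout cleanup: {ex.Message}");
        }
        state.IsProcessing = false;
        state.ResponseCompletion.TrySetResult("timed out"); // placeholder result
        state.OnStateChanged();
    }

    // Path 1: the loop's catch block now performs recovery instead of only logging.
    public static async Task RunWatchdogAsync(SessionState state, Func<Task> checkOnce)
    {
        try
        {
            while (state.IsProcessing)
                await checkOnce();
        }
        catch (Exception ex)
        {
            Console.WriteLine($"watchdog crashed: {ex.Message}");
            state.IsProcessing = false;
            state.ResponseCompletion.TrySetResult("watchdog crashed"); // placeholder result
            state.OnStateChanged();
        }
    }

    static async Task Main()
    {
        var timedOut = new SessionState();
        OnWatchdogTimeout(timedOut);

        var crashed = new SessionState();
        await RunWatchdogAsync(crashed,
            () => Task.FromException(new InvalidOperationException("simulated")));

        Console.WriteLine($"{timedOut.IsProcessing} {crashed.IsProcessing}");
        // last line printed: "False False"
    }
}
```

TrySetResult is used rather than SetResult so that recovery does not itself throw if the TCS was already completed by another path.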