fix: rescue stuck chat sessions + 41 behavioral safety tests#332
fix: rescue stuck chat sessions + 41 behavioral safety tests#332
Conversation
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…unning tools from lost SDK events Root causes of two chat reliability bugs: 1. 'Messages stop streaming / user has to prod' — AssistantTurnEndEvent fallback suppressed CompleteResponse permanently when HasUsedToolsThisTurn=true (preventing premature completion between tool rounds). But if SessionIdleEvent was lost (SDK bug #299), nothing would ever complete the session. Fixed: wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs) then fire CompleteResponse anyway if no new TurnStart arrives. Total wait = 4s + 30s = 34s for tool sessions. 2. 'Long-running processes killed at ~10 min' — watchdog uses SDK event inactivity as proxy for stuck, but silent tools (builds, test runs) produce zero events during execution. Fixed: check ActiveToolCallCount at timeout: - ActiveToolCallCount > 0 (tool running): probe server TCP liveness; reset timer if alive, kill if server is dead - ActiveToolCallCount = 0 AND tools were used (SessionIdleEvent lost): complete cleanly via CompleteResponse — no error message - Max time exceeded or demo/remote: kill with error (original behavior) The key insight: ActiveToolCallCount=0 means the tool finished; the session just needs the terminal SessionIdleEvent that was dropped. Completing cleanly is correct here. Tests added: - TurnEndFallbackTests: TurnEndIdleToolFallbackAdditionalMs reasonable, total < watchdog, cancelled-by-TurnStart, eventual fire - ProcessingWatchdogTests: ActiveTool+ServerAlive resets timer, NoActiveTool+ToolsUsed completes cleanly, MaxTime always kills Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses two remaining items from stuck-session analysis: 1. ExternalToolRequestedEvent in SdkEventMatrix (Rec 3): This event was arriving as 'Unhandled', causing log spam and potentially updating LastEventAtTicks on sessions that shouldn't be kept alive. Now explicitly classified as TimelineOnly so it's handled silently. 2. Periodic mid-watchdog flush (Rec 2): If content has accumulated in CurrentResponse during a watchdog check interval, flush it to History immediately. This ensures partial responses are visible even while IsProcessing=true — covering cases where streaming stalled and TurnEnd->Idle fallback hasn't fired yet. Flush occurs every WatchdogCheckIntervalSeconds (15s) as long as there's pending content. Combined with the previous commit (30s extended TurnEnd fallback + server liveness check), all 4 recommendations from the stuck-session analysis are now addressed: Rec 1: TurnEnd fallback works for tool sessions (30s extended delay) ✓ Rec 2: Periodic watchdog flush of CurrentResponse to History ✓ Rec 3: ExternalToolRequestedEvent in EventMatrix (no more log spam) ✓ Rec 4: Case B clean CompleteResponse when ActiveToolCallCount=0+tools used ✓ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aebfe6f to
2785239
Compare
🔍 PR Review Squad — Round 1 Review (5-model consensus)Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex Production Code Analysis1. 🟡 MODERATE — Case A effectively disables the 10-min tool timeout (3/5 consensus)
The session IS still bounded by Suggestion: Check 2. 🟡 MODERATE — Periodic flush fires too eagerly during normal tool execution (3/5 consensus)
Suggestion: Gate on a higher threshold (e.g., 3. 🟡 MODERATE — 34s TurnEnd fallback may be too short for complex tool gaps (3/5 consensus)
Mitigating factor: The fallback token is cancelled by TurnStart events, so this only fires if NO new TurnStart arrives in 34s. Suggestion: Consider 60s for the additional delay. 4. 🟡 MODERATE — Source-code string scanning tests are brittle (4/5 consensus)
Suggestion: Convert to behavioral tests using the existing reflection infrastructure. 5. 🟡 MODERATE — Duplicate
|
…st robustness - [HIGH] Add ProcessingGeneration guard to periodic watchdog flush: capture flushGen before InvokeOnUI and validate inside lambda, matching the same pattern used by Case B and Case C. Prevents stale watchdog tick from flushing new-turn content into old-turn history after abort+resend. - [MEDIUM] Switch TurnEnd fallback Task.Run closures from local Invoke() to class-level InvokeOnUI() for clarity. The local Invoke captures _syncContext via closure correctly, but InvokeOnUI makes the UI-thread dispatch intent explicit when reading cross-thread code. - [MEDIUM] Fix flaky timing in ToolFallback_CancelledByTurnStart_DoesNotFire: replace production 4000ms constant with 50ms accelerated delay, matching all other tests in the file. - [MEDIUM] Make watchdog source-parsing tests use dynamic method bounds: find end of RunProcessingWatchdogAsync by locating the next class member (_permissionRecoveryAttempts) instead of fixed 3500/8000 char substrings. - [MEDIUM] Add break; control-flow assertion to NoActiveTool test: verifies Case B exits via break before reaching Case C error path, catching accidental fallthrough regressions. - [MEDIUM] Add WatchdogPeriodicFlush_HasGenerationGuard test: verifies the generation guard is present in the flush block via source assertion. - [LOW] Add TODO comment near reflection helpers in ChatExperienceSafetyTests suggesting InternalsVisibleTo as a compile-time safety improvement. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Bump header: 9 invariants → 13 invariants, 8 PRs → 9 PRs, 8 paths → 9 paths - Add 9th IsProcessing path: watchdog Case B (clean CompleteResponse on lost terminal event, no error message) — distinguishable from the kill path - Add INV-10: TurnEnd fallback must not be permanently suppressed by HasUsedToolsThisTurn; use ActiveToolCallCount guard + 30s extended delay - Add INV-11: Watchdog 3-way branch (Case A: liveness probe, Case B: clean complete, Case C: error kill) to distinguish active tools from lost events - Add INV-12: All background→UI dispatches must capture+validate ProcessingGeneration before/inside InvokeOnUI lambda - Add INV-13: Use class-level InvokeOnUI() not local Invoke() in Task.Run - Update Top 5 Recurring Mistakes: replace 'ActiveToolCallCount as sole signal' with the more impactful 'Suppressing TurnEnd fallback entirely' - Update regression chain: add #332 to the sequence - Add full PR #332 entry to regression-history.md with root causes, 3-way branch details, periodic flush, and session recovery improvement Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Bump header: 9 invariants → 13 invariants, 8 PRs → 9 PRs, 8 paths → 9 paths - Add 9th IsProcessing path: watchdog Case B (clean CompleteResponse on lost terminal event, no error message) — distinguishable from the kill path - Add INV-10: TurnEnd fallback must not be permanently suppressed by HasUsedToolsThisTurn; use ActiveToolCallCount guard + 30s extended delay - Add INV-11: Watchdog 3-way branch (Case A: liveness probe, Case B: clean complete, Case C: error kill) to distinguish active tools from lost events - Add INV-12: All background→UI dispatches must capture+validate ProcessingGeneration before/inside InvokeOnUI lambda - Add INV-13: Use class-level InvokeOnUI() not local Invoke() in Task.Run - Update Top 5 Recurring Mistakes: replace 'ActiveToolCallCount as sole signal' with the more impactful 'Suppressing TurnEnd fallback entirely' - Update regression chain: add #332 to the sequence - Add full PR #332 entry to regression-history.md with root causes, 3-way branch details, periodic flush, and session recovery improvement Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a91a605 to
978b752
Compare
… (PR #332 followup) Root cause analysis of the stuck orchestrator session (gen=28, 00:37:19-00:39:40): 1. AssistantTurnEndEvent arrived → 4s TurnEnd fallback started 2. AssistantTurnStartEvent arrived within 4s → CancelTurnEndFallback called SILENTLY (TurnStartEvent was not logged in [EVT], making it invisible in diagnostics) 3. SDK dropped terminal events for the follow-on sub-turn 4. isMultiAgentSession=true forced 600s watchdog timeout even with no tools 5. Session stuck until user clicked Stop (~2.5 min later) Fix 1 — Diagnostic: Add AssistantTurnStartEvent to [EVT] log filter The [EVT] filter previously only logged SessionIdleEvent, AssistantTurnEndEvent, and SessionErrorEvent. TurnStart events were invisible, making it impossible to see that a TurnStart cancelled the TurnEnd fallback. Now all critical lifecycle events are logged. Fix 2 — Behavioral: Remove isMultiAgentSession from useToolTimeout isMultiAgentSession alone no longer extends the watchdog to 600s. Workers doing text-heavy PR reviews stream delta events continuously, keeping elapsed small regardless of timeout. The 600s guard was protecting against a scenario that couldn't occur during active streaming. Pure reasoning/text sub-turns now use the 120s inactivity timeout. Fix 3 — Case B expansion: multi-agent + server alive → clean complete Case B (clean complete, no error message) now fires for multi-agent sessions when isMultiAgentSession=true and server is alive, not just when hasUsedTools=true. This prevents the error kill message when a session completed successfully but SDK dropped terminal events. Tests (3 new, 4 updated): - WatchdogTimeoutSelection_MultiAgent_NoTools_UsesInactivityTimeout (renamed+updated) - WatchdogDecision_MultiAgent_NoTools_CaseBIncludesMultiAgent_InSource (new source assert) - MultiAgent_NoTools_UseInactivityTimeout_NotToolTimeout (new formula test) - AssistantTurnStartEvent_LoggedInEvtDiagnostics_InSource (new source assert) - Regression_PR195_MultiAgentWorkers_DeltasKeepElapsedSmall (updated regression test) - ExhaustiveMatrix and NonResumed tests updated to reflect 120s for multi-agent no-tool Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n#332) ## Summary Rescues chat sessions that get stuck for 600s waiting for a SessionIdleEvent that never arrives (SDK bug PureWeen#299 / network hiccup). Adds 41 behavioral safety tests exercising the real CopilotService pipeline via reflection. ## Root-cause bugs fixed 1. **TurnEnd fallback permanently disabled for tool sessions**: When HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned early with a skip log — meaning tool sessions had *zero* fallback if SessionIdleEvent was lost. The session would sit stuck until the 600s watchdog killed it with an error message, discarding the (complete) response. 2. **Watchdog couldn't distinguish active tools from lost SDK events**: The watchdog treated all `elapsed >= effectiveTimeout` the same — kill with error. But there are three distinct situations: - **Tool actively running** (`ActiveToolCallCount > 0`, server alive): should reset timer, not kill - **Tool finished, SessionIdleEvent lost** (`ActiveToolCallCount = 0`, `HasUsedToolsThisTurn = true`): should complete cleanly (response IS done) - **Genuinely stuck / server dead**: should kill with error (existing behavior) ## Four fixes ### 1. TurnEnd 30s fallback for tool sessions Instead of skipping the fallback entirely when tools were used, wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart arrives in that window, SessionIdleEvent was lost → fire CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s. ### 2. Watchdog 3-way branch When `elapsed >= effectiveTimeout` and max time not exceeded: - **Case A** (active tool + server alive): reset LastEventAtTicks, continue loop - **Case B** (no active tool + tools were used): CompleteResponse cleanly (no error message) - **Case C** (max time exceeded / server dead / demo): kill with error (unchanged) ### 3. Periodic watchdog flush Every watchdog check interval, if CurrentResponse has accumulated content, flush it to History. This ensures partial responses are visible in the chat even while IsProcessing=true. ### 4. ExternalToolRequestedEvent in EventMatrix Classified as TimelineOnly to prevent "Unhandled" log spam while still allowing LastEventAtTicks updates. ## The stuck session scenario this resolves Before this fix: User sends a prompt that triggers tool use → tools execute → LLM responds → AssistantTurnEndEvent fires → fallback timer starts but immediately returns ("tools were used, TurnStart likely coming") → SessionIdleEvent is lost (SDK bug PureWeen#299) → session sits stuck showing "Thinking..." for 600s → watchdog kills it with error message, discarding the complete response. After this fix: Same scenario → fallback timer waits 34s total → no TurnStart arrives → CompleteResponse fires → user sees the complete response with no error. If a tool is genuinely still running, the watchdog probes server TCP liveness and resets its timer. ## Tests - **41 behavioral safety tests** (ChatExperienceSafetyTests.cs): Exercise real CompleteResponse, FlushCurrentResponse, AbortSessionAsync via reflection. Cover INV-1 cleanup, generation guards, content preservation, TCS completion, dedup guards, race conditions. - **Watchdog decision tests** (ProcessingWatchdogTests.cs): Verify the 3-way branch logic, ExternalToolRequestedEvent classification, periodic flush presence/ordering. - **TurnEnd fallback tests** (TurnEndFallbackTests.cs): Verify TurnEndIdleToolFallbackAdditionalMs bounds, total delay < watchdog timeout, cancellation-by-TurnStart, eventual-fire. All 2344 tests pass (3 pre-existing failures: 2 symlink tests on Windows, 1 flaky unobserved-task test). --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Rescues chat sessions that get stuck for 600s waiting for a SessionIdleEvent that never arrives (SDK bug #299 / network hiccup). Adds 41 behavioral safety tests exercising the real CopilotService pipeline via reflection.
Root-cause bugs fixed
TurnEnd fallback permanently disabled for tool sessions: When HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned early with a skip log — meaning tool sessions had zero fallback if SessionIdleEvent was lost. The session would sit stuck until the 600s watchdog killed it with an error message, discarding the (complete) response.
Watchdog couldn't distinguish active tools from lost SDK events: The watchdog treated all
elapsed >= effectiveTimeoutthe same — kill with error. But there are three distinct situations:ActiveToolCallCount > 0, server alive): should reset timer, not killActiveToolCallCount = 0,HasUsedToolsThisTurn = true): should complete cleanly (response IS done)Four fixes
1. TurnEnd 30s fallback for tool sessions
Instead of skipping the fallback entirely when tools were used, wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart arrives in that window, SessionIdleEvent was lost → fire CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s.
2. Watchdog 3-way branch
When
elapsed >= effectiveTimeoutand max time not exceeded:3. Periodic watchdog flush
Every watchdog check interval, if CurrentResponse has accumulated content, flush it to History. This ensures partial responses are visible in the chat even while IsProcessing=true.
4. ExternalToolRequestedEvent in EventMatrix
Classified as TimelineOnly to prevent "Unhandled" log spam while still allowing LastEventAtTicks updates.
The stuck session scenario this resolves
Before this fix: User sends a prompt that triggers tool use → tools execute → LLM responds → AssistantTurnEndEvent fires → fallback timer starts but immediately returns ("tools were used, TurnStart likely coming") → SessionIdleEvent is lost (SDK bug #299) → session sits stuck showing "Thinking..." for 600s → watchdog kills it with error message, discarding the complete response.
After this fix: Same scenario → fallback timer waits 34s total → no TurnStart arrives → CompleteResponse fires → user sees the complete response with no error. If a tool is genuinely still running, the watchdog probes server TCP liveness and resets its timer.
Tests
All 2344 tests pass (3 pre-existing failures: 2 symlink tests on Windows, 1 flaky unobserved-task test).