fix: rescue stuck chat sessions + 41 behavioral safety tests by PureWeen · Pull Request #332 · PureWeen/PolyPilot

PureWeen · 2026-03-09T20:20:37Z

Summary

Rescues chat sessions that get stuck for 600s waiting for a SessionIdleEvent that never arrives (SDK bug #299 / network hiccup). Adds 41 behavioral safety tests exercising the real CopilotService pipeline via reflection.

Root-cause bugs fixed

TurnEnd fallback permanently disabled for tool sessions: When HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned early with a skip log — meaning tool sessions had zero fallback if SessionIdleEvent was lost. The session would sit stuck until the 600s watchdog killed it with an error message, discarding the (complete) response.
Watchdog couldn't distinguish active tools from lost SDK events: The watchdog treated all elapsed >= effectiveTimeout the same — kill with error. But there are three distinct situations:
- Tool actively running (ActiveToolCallCount > 0, server alive): should reset timer, not kill
- Tool finished, SessionIdleEvent lost (ActiveToolCallCount = 0, HasUsedToolsThisTurn = true): should complete cleanly (response IS done)
- Genuinely stuck / server dead: should kill with error (existing behavior)

Four fixes

1. TurnEnd 30s fallback for tool sessions

Instead of skipping the fallback entirely when tools were used, wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart arrives in that window, SessionIdleEvent was lost → fire CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s.

2. Watchdog 3-way branch

When elapsed >= effectiveTimeout and max time not exceeded:

Case A (active tool + server alive): reset LastEventAtTicks, continue loop
Case B (no active tool + tools were used): CompleteResponse cleanly (no error message)
Case C (max time exceeded / server dead / demo): kill with error (unchanged)

3. Periodic watchdog flush

Every watchdog check interval, if CurrentResponse has accumulated content, flush it to History. This ensures partial responses are visible in the chat even while IsProcessing=true.

4. ExternalToolRequestedEvent in EventMatrix

Classified as TimelineOnly to prevent "Unhandled" log spam while still allowing LastEventAtTicks updates.

The stuck session scenario this resolves

Before this fix: User sends a prompt that triggers tool use → tools execute → LLM responds → AssistantTurnEndEvent fires → fallback timer starts but immediately returns ("tools were used, TurnStart likely coming") → SessionIdleEvent is lost (SDK bug #299) → session sits stuck showing "Thinking..." for 600s → watchdog kills it with error message, discarding the complete response.

After this fix: Same scenario → fallback timer waits 34s total → no TurnStart arrives → CompleteResponse fires → user sees the complete response with no error. If a tool is genuinely still running, the watchdog probes server TCP liveness and resets its timer.

Tests

41 behavioral safety tests (ChatExperienceSafetyTests.cs): Exercise real CompleteResponse, FlushCurrentResponse, AbortSessionAsync via reflection. Cover INV-1 cleanup, generation guards, content preservation, TCS completion, dedup guards, race conditions.
Watchdog decision tests (ProcessingWatchdogTests.cs): Verify the 3-way branch logic, ExternalToolRequestedEvent classification, periodic flush presence/ordering.
TurnEnd fallback tests (TurnEndFallbackTests.cs): Verify TurnEndIdleToolFallbackAdditionalMs bounds, total delay < watchdog timeout, cancellation-by-TurnStart, eventual-fire.

All 2344 tests pass (3 pre-existing failures: 2 symlink tests on Windows, 1 flaky unobserved-task test).

Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…unning tools from lost SDK events Root causes of two chat reliability bugs: 1. 'Messages stop streaming / user has to prod' — AssistantTurnEndEvent fallback suppressed CompleteResponse permanently when HasUsedToolsThisTurn=true (preventing premature completion between tool rounds). But if SessionIdleEvent was lost (SDK bug #299), nothing would ever complete the session. Fixed: wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs) then fire CompleteResponse anyway if no new TurnStart arrives. Total wait = 4s + 30s = 34s for tool sessions. 2. 'Long-running processes killed at ~10 min' — watchdog uses SDK event inactivity as proxy for stuck, but silent tools (builds, test runs) produce zero events during execution. Fixed: check ActiveToolCallCount at timeout: - ActiveToolCallCount > 0 (tool running): probe server TCP liveness; reset timer if alive, kill if server is dead - ActiveToolCallCount = 0 AND tools were used (SessionIdleEvent lost): complete cleanly via CompleteResponse — no error message - Max time exceeded or demo/remote: kill with error (original behavior) The key insight: ActiveToolCallCount=0 means the tool finished; the session just needs the terminal SessionIdleEvent that was dropped. Completing cleanly is correct here. Tests added: - TurnEndFallbackTests: TurnEndIdleToolFallbackAdditionalMs reasonable, total < watchdog, cancelled-by-TurnStart, eventual fire - ProcessingWatchdogTests: ActiveTool+ServerAlive resets timer, NoActiveTool+ToolsUsed completes cleanly, MaxTime always kills Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses two remaining items from stuck-session analysis: 1. ExternalToolRequestedEvent in SdkEventMatrix (Rec 3): This event was arriving as 'Unhandled', causing log spam and potentially updating LastEventAtTicks on sessions that shouldn't be kept alive. Now explicitly classified as TimelineOnly so it's handled silently. 2. Periodic mid-watchdog flush (Rec 2): If content has accumulated in CurrentResponse during a watchdog check interval, flush it to History immediately. This ensures partial responses are visible even while IsProcessing=true — covering cases where streaming stalled and TurnEnd->Idle fallback hasn't fired yet. Flush occurs every WatchdogCheckIntervalSeconds (15s) as long as there's pending content. Combined with the previous commit (30s extended TurnEnd fallback + server liveness check), all 4 recommendations from the stuck-session analysis are now addressed: Rec 1: TurnEnd fallback works for tool sessions (30s extended delay) ✓ Rec 2: Periodic watchdog flush of CurrentResponse to History ✓ Rec 3: ExternalToolRequestedEvent in EventMatrix (no more log spam) ✓ Rec 4: Case B clean CompleteResponse when ActiveToolCallCount=0+tools used ✓ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-09T23:49:00Z

🔍 PR Review Squad — Round 1 Review (5-model consensus)

Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex

Production Code Analysis

1. 🟡 MODERATE — Case A effectively disables the 10-min tool timeout (3/5 consensus)

CopilotService.Events.cs — When hasActiveTool && serverAlive, the watchdog resets LastEventAtTicks and continues. After the reset, elapsed drops to ~15s, far below effectiveTimeout (600s), so the 600s tool timeout block is never re-entered until another 600s of silence. The 10-minute tool execution timeout is effectively bypassed for any session where the server process stays alive.

The session IS still bounded by WatchdogMaxProcessingTimeSeconds (3600s / 60 min) — totalProcessingSeconds is recomputed from ProcessingStartedAt each iteration, so after 60 min total, exceededMaxTime triggers the kill. This is NOT an infinite loop. But the effective timeout jumps from 10 min to 60 min for stuck-tool-with-alive-server scenarios.

Suggestion: Check totalProcessingSeconds >= WatchdogToolExecutionTimeoutSeconds * N (e.g., 3x) inside Case A to cap at ~30 min.

2. 🟡 MODERATE — Periodic flush fires too eagerly during normal tool execution (3/5 consensus)

CopilotService.Events.cs:1512-1522 — The condition elapsed >= WatchdogCheckIntervalSeconds (15s) is true on virtually every watchdog cycle where no events arrived in the last 15s. During normal tool execution (e.g., a 60s build), there are 15s+ gaps between events. The flush will split a single response into multiple History entries, producing fragmented chat messages.

Suggestion: Gate on a higher threshold (e.g., elapsed >= 60) to only flush when content has been sitting for a significant time.

3. 🟡 MODERATE — 34s TurnEnd fallback may be too short for complex tool gaps (3/5 consensus)

CopilotService.Events.cs:489-503 — The TurnEndIdleToolFallbackAdditionalMs (30s) + base 4s = 34s total. In multi-agent or complex reasoning scenarios, the LLM may legitimately take >30s between TurnEnd and next TurnStart.

Mitigating factor: The fallback token is cancelled by TurnStart events, so this only fires if NO new TurnStart arrives in 34s.

Suggestion: Consider 60s for the additional delay.

4. 🟡 MODERATE — Source-code string scanning tests are brittle (4/5 consensus)

ProcessingWatchdogTests.cs:2157-2303 — 7 tests use File.ReadAllText + IndexOf to verify code structure. These break on refactoring even when behavior is preserved.

Suggestion: Convert to behavioral tests using the existing reflection infrastructure.

5. 🟡 MODERATE — Duplicate `ComputeEffectiveTimeout` helpers (2/5 consensus)

ChatExperienceSafetyTests.cs:419 and ProcessingWatchdogTests.cs:895 copy the production timeout formula with different parameter orders. If production formula changes, both silently drift.

Suggestion: Extract to a single internal static method on CopilotService.

Safe / No Issues Found

Double-completion (TurnEnd fallback + watchdog Case B) — safe, generation guards + IsProcessing early-exit (5/5 agree)
IsServerRunning TCP probe on watchdog thread — safe, new TcpClient per call (3/5 agree)
ExternalToolRequestedEvent classification as TimelineOnly — correct

CI Status: ⚠️ No checks configured

Existing Reviews: None

Test Coverage

✅ 34 behavioral tests via reflection (CompleteResponse, AbortSession, FlushCurrentResponse, INV-1)
✅ TurnEnd fallback boundary tests (constant bounds, cancellation, fire)
⚠️ Gap: SessionErrorEvent INV-1 cleanup path has no behavioral test
⚠️ Gap: No behavioral test for watchdog Case B clean completion

Verdict: ⚠️ Request Changes

Actionable asks:

Raise periodic flush threshold from 15s to 60s+ (prevents message fragmentation)
Consider raising TurnEnd tool fallback from 30s to 60s
Extract ComputeEffectiveTimeout to single source of truth

…st robustness - [HIGH] Add ProcessingGeneration guard to periodic watchdog flush: capture flushGen before InvokeOnUI and validate inside lambda, matching the same pattern used by Case B and Case C. Prevents stale watchdog tick from flushing new-turn content into old-turn history after abort+resend. - [MEDIUM] Switch TurnEnd fallback Task.Run closures from local Invoke() to class-level InvokeOnUI() for clarity. The local Invoke captures _syncContext via closure correctly, but InvokeOnUI makes the UI-thread dispatch intent explicit when reading cross-thread code. - [MEDIUM] Fix flaky timing in ToolFallback_CancelledByTurnStart_DoesNotFire: replace production 4000ms constant with 50ms accelerated delay, matching all other tests in the file. - [MEDIUM] Make watchdog source-parsing tests use dynamic method bounds: find end of RunProcessingWatchdogAsync by locating the next class member (_permissionRecoveryAttempts) instead of fixed 3500/8000 char substrings. - [MEDIUM] Add break; control-flow assertion to NoActiveTool test: verifies Case B exits via break before reaching Case C error path, catching accidental fallthrough regressions. - [MEDIUM] Add WatchdogPeriodicFlush_HasGenerationGuard test: verifies the generation guard is present in the flush block via source assertion. - [LOW] Add TODO comment near reflection helpers in ChatExperienceSafetyTests suggesting InternalsVisibleTo as a compile-time safety improvement. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Bump header: 9 invariants → 13 invariants, 8 PRs → 9 PRs, 8 paths → 9 paths - Add 9th IsProcessing path: watchdog Case B (clean CompleteResponse on lost terminal event, no error message) — distinguishable from the kill path - Add INV-10: TurnEnd fallback must not be permanently suppressed by HasUsedToolsThisTurn; use ActiveToolCallCount guard + 30s extended delay - Add INV-11: Watchdog 3-way branch (Case A: liveness probe, Case B: clean complete, Case C: error kill) to distinguish active tools from lost events - Add INV-12: All background→UI dispatches must capture+validate ProcessingGeneration before/inside InvokeOnUI lambda - Add INV-13: Use class-level InvokeOnUI() not local Invoke() in Task.Run - Update Top 5 Recurring Mistakes: replace 'ActiveToolCallCount as sole signal' with the more impactful 'Suppressing TurnEnd fallback entirely' - Update regression chain: add #332 to the sequence - Add full PR #332 entry to regression-history.md with root causes, 3-way branch details, periodic flush, and session recovery improvement Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… (PR #332 followup) Root cause analysis of the stuck orchestrator session (gen=28, 00:37:19-00:39:40): 1. AssistantTurnEndEvent arrived → 4s TurnEnd fallback started 2. AssistantTurnStartEvent arrived within 4s → CancelTurnEndFallback called SILENTLY (TurnStartEvent was not logged in [EVT], making it invisible in diagnostics) 3. SDK dropped terminal events for the follow-on sub-turn 4. isMultiAgentSession=true forced 600s watchdog timeout even with no tools 5. Session stuck until user clicked Stop (~2.5 min later) Fix 1 — Diagnostic: Add AssistantTurnStartEvent to [EVT] log filter The [EVT] filter previously only logged SessionIdleEvent, AssistantTurnEndEvent, and SessionErrorEvent. TurnStart events were invisible, making it impossible to see that a TurnStart cancelled the TurnEnd fallback. Now all critical lifecycle events are logged. Fix 2 — Behavioral: Remove isMultiAgentSession from useToolTimeout isMultiAgentSession alone no longer extends the watchdog to 600s. Workers doing text-heavy PR reviews stream delta events continuously, keeping elapsed small regardless of timeout. The 600s guard was protecting against a scenario that couldn't occur during active streaming. Pure reasoning/text sub-turns now use the 120s inactivity timeout. Fix 3 — Case B expansion: multi-agent + server alive → clean complete Case B (clean complete, no error message) now fires for multi-agent sessions when isMultiAgentSession=true and server is alive, not just when hasUsedTools=true. This prevents the error kill message when a session completed successfully but SDK dropped terminal events. Tests (3 new, 4 updated): - WatchdogTimeoutSelection_MultiAgent_NoTools_UsesInactivityTimeout (renamed+updated) - WatchdogDecision_MultiAgent_NoTools_CaseBIncludesMultiAgent_InSource (new source assert) - MultiAgent_NoTools_UseInactivityTimeout_NotToolTimeout (new formula test) - AssistantTurnStartEvent_LoggedInEvtDiagnostics_InSource (new source assert) - Regression_PR195_MultiAgentWorkers_DeltasKeepElapsedSmall (updated regression test) - ExhaustiveMatrix and NonResumed tests updated to reflect 120s for multi-agent no-tool Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…n#332) ## Summary Rescues chat sessions that get stuck for 600s waiting for a SessionIdleEvent that never arrives (SDK bug PureWeen#299 / network hiccup). Adds 41 behavioral safety tests exercising the real CopilotService pipeline via reflection. ## Root-cause bugs fixed 1. **TurnEnd fallback permanently disabled for tool sessions**: When HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned early with a skip log — meaning tool sessions had *zero* fallback if SessionIdleEvent was lost. The session would sit stuck until the 600s watchdog killed it with an error message, discarding the (complete) response. 2. **Watchdog couldn't distinguish active tools from lost SDK events**: The watchdog treated all `elapsed >= effectiveTimeout` the same — kill with error. But there are three distinct situations: - **Tool actively running** (`ActiveToolCallCount > 0`, server alive): should reset timer, not kill - **Tool finished, SessionIdleEvent lost** (`ActiveToolCallCount = 0`, `HasUsedToolsThisTurn = true`): should complete cleanly (response IS done) - **Genuinely stuck / server dead**: should kill with error (existing behavior) ## Four fixes ### 1. TurnEnd 30s fallback for tool sessions Instead of skipping the fallback entirely when tools were used, wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart arrives in that window, SessionIdleEvent was lost → fire CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s. ### 2. Watchdog 3-way branch When `elapsed >= effectiveTimeout` and max time not exceeded: - **Case A** (active tool + server alive): reset LastEventAtTicks, continue loop - **Case B** (no active tool + tools were used): CompleteResponse cleanly (no error message) - **Case C** (max time exceeded / server dead / demo): kill with error (unchanged) ### 3. Periodic watchdog flush Every watchdog check interval, if CurrentResponse has accumulated content, flush it to History. This ensures partial responses are visible in the chat even while IsProcessing=true. ### 4. ExternalToolRequestedEvent in EventMatrix Classified as TimelineOnly to prevent "Unhandled" log spam while still allowing LastEventAtTicks updates. ## The stuck session scenario this resolves Before this fix: User sends a prompt that triggers tool use → tools execute → LLM responds → AssistantTurnEndEvent fires → fallback timer starts but immediately returns ("tools were used, TurnStart likely coming") → SessionIdleEvent is lost (SDK bug PureWeen#299) → session sits stuck showing "Thinking..." for 600s → watchdog kills it with error message, discarding the complete response. After this fix: Same scenario → fallback timer waits 34s total → no TurnStart arrives → CompleteResponse fires → user sees the complete response with no error. If a tool is genuinely still running, the watchdog probes server TCP liveness and resets its timer. ## Tests - **41 behavioral safety tests** (ChatExperienceSafetyTests.cs): Exercise real CompleteResponse, FlushCurrentResponse, AbortSessionAsync via reflection. Cover INV-1 cleanup, generation guards, content preservation, TCS completion, dedup guards, race conditions. - **Watchdog decision tests** (ProcessingWatchdogTests.cs): Verify the 3-way branch logic, ExternalToolRequestedEvent classification, periodic flush presence/ordering. - **TurnEnd fallback tests** (TurnEndFallbackTests.cs): Verify TurnEndIdleToolFallbackAdditionalMs bounds, total delay < watchdog timeout, cancellation-by-TurnStart, eventual-fire. All 2344 tests pass (3 pre-existing failures: 2 symlink tests on Windows, 1 flaky unobserved-task test). --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen changed the title ~~test: add 41 behavioral safety tests for chat watchdog/processing state~~ fix: rescue stuck chat sessions — 30s fallback + smart watchdog + 41 safety tests Mar 9, 2026

PureWeen changed the title ~~fix: rescue stuck chat sessions — 30s fallback + smart watchdog + 41 safety tests~~ fix: rescue stuck chat sessions + 41 behavioral safety tests Mar 9, 2026

PureWeen and others added 3 commits March 9, 2026 18:21

PureWeen force-pushed the chat-safety-tests branch from aebfe6f to 2785239 Compare March 9, 2026 23:22

PureWeen force-pushed the chat-safety-tests branch from a91a605 to 978b752 Compare March 10, 2026 00:30

PureWeen merged commit 3ef667e into main Mar 10, 2026

PureWeen deleted the chat-safety-tests branch March 10, 2026 01:25

PureWeen mentioned this pull request Mar 10, 2026

Fix stuck sessions: cap watchdog Case A resets for dead connections #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: rescue stuck chat sessions + 41 behavioral safety tests#332

fix: rescue stuck chat sessions + 41 behavioral safety tests#332
PureWeen merged 6 commits intomainfrom
chat-safety-tests

PureWeen commented Mar 9, 2026 •

edited

Loading

Uh oh!

PureWeen commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root-cause bugs fixed

Four fixes

1. TurnEnd 30s fallback for tool sessions

2. Watchdog 3-way branch

3. Periodic watchdog flush

4. ExternalToolRequestedEvent in EventMatrix

The stuck session scenario this resolves

Tests

Uh oh!

PureWeen commented Mar 9, 2026

🔍 PR Review Squad — Round 1 Review (5-model consensus)

Production Code Analysis

1. 🟡 MODERATE — Case A effectively disables the 10-min tool timeout (3/5 consensus)

2. 🟡 MODERATE — Periodic flush fires too eagerly during normal tool execution (3/5 consensus)

3. 🟡 MODERATE — 34s TurnEnd fallback may be too short for complex tool gaps (3/5 consensus)

4. 🟡 MODERATE — Source-code string scanning tests are brittle (4/5 consensus)

5. 🟡 MODERATE — Duplicate ComputeEffectiveTimeout helpers (2/5 consensus)

Safe / No Issues Found

CI Status: ⚠️ No checks configured

Existing Reviews: None

Test Coverage

Verdict: ⚠️ Request Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PureWeen commented Mar 9, 2026 •

edited

Loading

5. 🟡 MODERATE — Duplicate `ComputeEffectiveTimeout` helpers (2/5 consensus)