Skip to content

fix: rescue stuck chat sessions + 41 behavioral safety tests#332

Merged
PureWeen merged 6 commits intomainfrom
chat-safety-tests
Mar 10, 2026
Merged

fix: rescue stuck chat sessions + 41 behavioral safety tests#332
PureWeen merged 6 commits intomainfrom
chat-safety-tests

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 9, 2026

Summary

Rescues chat sessions that get stuck for 600s waiting for a SessionIdleEvent that never arrives (SDK bug #299 / network hiccup). Adds 41 behavioral safety tests exercising the real CopilotService pipeline via reflection.

Root-cause bugs fixed

  1. TurnEnd fallback permanently disabled for tool sessions: When HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned early with a skip log — meaning tool sessions had zero fallback if SessionIdleEvent was lost. The session would sit stuck until the 600s watchdog killed it with an error message, discarding the (complete) response.

  2. Watchdog couldn't distinguish active tools from lost SDK events: The watchdog treated all elapsed >= effectiveTimeout the same — kill with error. But there are three distinct situations:

    • Tool actively running (ActiveToolCallCount > 0, server alive): should reset timer, not kill
    • Tool finished, SessionIdleEvent lost (ActiveToolCallCount = 0, HasUsedToolsThisTurn = true): should complete cleanly (response IS done)
    • Genuinely stuck / server dead: should kill with error (existing behavior)

Four fixes

1. TurnEnd 30s fallback for tool sessions

Instead of skipping the fallback entirely when tools were used, wait an additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart arrives in that window, SessionIdleEvent was lost → fire CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s.

2. Watchdog 3-way branch

When elapsed >= effectiveTimeout and max time not exceeded:

  • Case A (active tool + server alive): reset LastEventAtTicks, continue loop
  • Case B (no active tool + tools were used): CompleteResponse cleanly (no error message)
  • Case C (max time exceeded / server dead / demo): kill with error (unchanged)

3. Periodic watchdog flush

Every watchdog check interval, if CurrentResponse has accumulated content, flush it to History. This ensures partial responses are visible in the chat even while IsProcessing=true.

4. ExternalToolRequestedEvent in EventMatrix

Classified as TimelineOnly to prevent "Unhandled" log spam while still allowing LastEventAtTicks updates.

The stuck session scenario this resolves

Before this fix: User sends a prompt that triggers tool use → tools execute → LLM responds → AssistantTurnEndEvent fires → fallback timer starts but immediately returns ("tools were used, TurnStart likely coming") → SessionIdleEvent is lost (SDK bug #299) → session sits stuck showing "Thinking..." for 600s → watchdog kills it with error message, discarding the complete response.

After this fix: Same scenario → fallback timer waits 34s total → no TurnStart arrives → CompleteResponse fires → user sees the complete response with no error. If a tool is genuinely still running, the watchdog probes server TCP liveness and resets its timer.

Tests

  • 41 behavioral safety tests (ChatExperienceSafetyTests.cs): Exercise real CompleteResponse, FlushCurrentResponse, AbortSessionAsync via reflection. Cover INV-1 cleanup, generation guards, content preservation, TCS completion, dedup guards, race conditions.
  • Watchdog decision tests (ProcessingWatchdogTests.cs): Verify the 3-way branch logic, ExternalToolRequestedEvent classification, periodic flush presence/ordering.
  • TurnEnd fallback tests (TurnEndFallbackTests.cs): Verify TurnEndIdleToolFallbackAdditionalMs bounds, total delay < watchdog timeout, cancellation-by-TurnStart, eventual-fire.

All 2344 tests pass (3 pre-existing failures: 2 symlink tests on Windows, 1 flaky unobserved-task test).

@PureWeen PureWeen changed the title test: add 41 behavioral safety tests for chat watchdog/processing state fix: rescue stuck chat sessions — 30s fallback + smart watchdog + 41 safety tests Mar 9, 2026
@PureWeen PureWeen changed the title fix: rescue stuck chat sessions — 30s fallback + smart watchdog + 41 safety tests fix: rescue stuck chat sessions + 41 behavioral safety tests Mar 9, 2026
PureWeen and others added 3 commits March 9, 2026 18:21
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330)
covering the invariants documented in processing-state-safety skill:

- INV-1: All 8 termination paths clear state correctly (CompleteResponse,
  SessionErrorEvent, AbortSessionAsync, watchdog)
- INV-2: State mutations marshaled to UI thread via InvokeOnUI
- INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns
- INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just
  while tools are active)
- INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both
  SendPromptAsync and RestoreSingleSessionAsync paths

Behavioral tests (demo mode integration):
- Multi-turn message preservation (5 sequential turns, all history retained)
- Abort clears all 9 INV-1 fields, fires OnSessionComplete
- Post-abort send succeeds without deadlock (SendingFlag cleared)
- Session isolation (stuck session doesn't block others)
- WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds
- WatchdogMaxProcessingTimeSeconds >= 30 minutes

Source-code assertion tests (regression guards against future refactors):
- useToolTimeout formula has all 4 conditions (INV-5)
- TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse
- FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix)
- FlushCurrentResponse dedup guard prevents SDK-replay duplicates
- CompleteResponse cancels watchdog before cleanup
- Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn

These tests are designed to catch the class of regressions documented in
regression-history.md (PRs #141-#284).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…unning tools from lost SDK events

Root causes of two chat reliability bugs:

1. 'Messages stop streaming / user has to prod' — AssistantTurnEndEvent fallback
   suppressed CompleteResponse permanently when HasUsedToolsThisTurn=true (preventing
   premature completion between tool rounds). But if SessionIdleEvent was lost (SDK bug
   #299), nothing would ever complete the session. Fixed: wait an additional 30s
   (TurnEndIdleToolFallbackAdditionalMs) then fire CompleteResponse anyway if no new
   TurnStart arrives. Total wait = 4s + 30s = 34s for tool sessions.

2. 'Long-running processes killed at ~10 min' — watchdog uses SDK event inactivity as
   proxy for stuck, but silent tools (builds, test runs) produce zero events during
   execution. Fixed: check ActiveToolCallCount at timeout:
   - ActiveToolCallCount > 0 (tool running): probe server TCP liveness; reset timer if
     alive, kill if server is dead
   - ActiveToolCallCount = 0 AND tools were used (SessionIdleEvent lost): complete
     cleanly via CompleteResponse — no error message
   - Max time exceeded or demo/remote: kill with error (original behavior)

The key insight: ActiveToolCallCount=0 means the tool finished; the session just needs
the terminal SessionIdleEvent that was dropped. Completing cleanly is correct here.

Tests added:
- TurnEndFallbackTests: TurnEndIdleToolFallbackAdditionalMs reasonable, total < watchdog,
  cancelled-by-TurnStart, eventual fire
- ProcessingWatchdogTests: ActiveTool+ServerAlive resets timer, NoActiveTool+ToolsUsed
  completes cleanly, MaxTime always kills

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses two remaining items from stuck-session analysis:

1. ExternalToolRequestedEvent in SdkEventMatrix (Rec 3):
   This event was arriving as 'Unhandled', causing log spam and potentially
   updating LastEventAtTicks on sessions that shouldn't be kept alive. Now
   explicitly classified as TimelineOnly so it's handled silently.

2. Periodic mid-watchdog flush (Rec 2):
   If content has accumulated in CurrentResponse during a watchdog check
   interval, flush it to History immediately. This ensures partial responses
   are visible even while IsProcessing=true — covering cases where streaming
   stalled and TurnEnd->Idle fallback hasn't fired yet. Flush occurs every
   WatchdogCheckIntervalSeconds (15s) as long as there's pending content.

Combined with the previous commit (30s extended TurnEnd fallback + server
liveness check), all 4 recommendations from the stuck-session analysis are
now addressed:
  Rec 1: TurnEnd fallback works for tool sessions (30s extended delay) ✓
  Rec 2: Periodic watchdog flush of CurrentResponse to History ✓
  Rec 3: ExternalToolRequestedEvent in EventMatrix (no more log spam) ✓
  Rec 4: Case B clean CompleteResponse when ActiveToolCallCount=0+tools used ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the chat-safety-tests branch from aebfe6f to 2785239 Compare March 9, 2026 23:22
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Mar 9, 2026

🔍 PR Review Squad — Round 1 Review (5-model consensus)

Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex

Production Code Analysis

1. 🟡 MODERATE — Case A effectively disables the 10-min tool timeout (3/5 consensus)

CopilotService.Events.cs — When hasActiveTool && serverAlive, the watchdog resets LastEventAtTicks and continues. After the reset, elapsed drops to ~15s, far below effectiveTimeout (600s), so the 600s tool timeout block is never re-entered until another 600s of silence. The 10-minute tool execution timeout is effectively bypassed for any session where the server process stays alive.

The session IS still bounded by WatchdogMaxProcessingTimeSeconds (3600s / 60 min) — totalProcessingSeconds is recomputed from ProcessingStartedAt each iteration, so after 60 min total, exceededMaxTime triggers the kill. This is NOT an infinite loop. But the effective timeout jumps from 10 min to 60 min for stuck-tool-with-alive-server scenarios.

Suggestion: Check totalProcessingSeconds >= WatchdogToolExecutionTimeoutSeconds * N (e.g., 3x) inside Case A to cap at ~30 min.

2. 🟡 MODERATE — Periodic flush fires too eagerly during normal tool execution (3/5 consensus)

CopilotService.Events.cs:1512-1522 — The condition elapsed >= WatchdogCheckIntervalSeconds (15s) is true on virtually every watchdog cycle where no events arrived in the last 15s. During normal tool execution (e.g., a 60s build), there are 15s+ gaps between events. The flush will split a single response into multiple History entries, producing fragmented chat messages.

Suggestion: Gate on a higher threshold (e.g., elapsed >= 60) to only flush when content has been sitting for a significant time.

3. 🟡 MODERATE — 34s TurnEnd fallback may be too short for complex tool gaps (3/5 consensus)

CopilotService.Events.cs:489-503 — The TurnEndIdleToolFallbackAdditionalMs (30s) + base 4s = 34s total. In multi-agent or complex reasoning scenarios, the LLM may legitimately take >30s between TurnEnd and next TurnStart.

Mitigating factor: The fallback token is cancelled by TurnStart events, so this only fires if NO new TurnStart arrives in 34s.

Suggestion: Consider 60s for the additional delay.

4. 🟡 MODERATE — Source-code string scanning tests are brittle (4/5 consensus)

ProcessingWatchdogTests.cs:2157-2303 — 7 tests use File.ReadAllText + IndexOf to verify code structure. These break on refactoring even when behavior is preserved.

Suggestion: Convert to behavioral tests using the existing reflection infrastructure.

5. 🟡 MODERATE — Duplicate ComputeEffectiveTimeout helpers (2/5 consensus)

ChatExperienceSafetyTests.cs:419 and ProcessingWatchdogTests.cs:895 copy the production timeout formula with different parameter orders. If production formula changes, both silently drift.

Suggestion: Extract to a single internal static method on CopilotService.

Safe / No Issues Found

  • Double-completion (TurnEnd fallback + watchdog Case B) — safe, generation guards + IsProcessing early-exit (5/5 agree)
  • IsServerRunning TCP probe on watchdog thread — safe, new TcpClient per call (3/5 agree)
  • ExternalToolRequestedEvent classification as TimelineOnly — correct

CI Status: ⚠️ No checks configured

Existing Reviews: None

Test Coverage

  • ✅ 34 behavioral tests via reflection (CompleteResponse, AbortSession, FlushCurrentResponse, INV-1)
  • ✅ TurnEnd fallback boundary tests (constant bounds, cancellation, fire)
  • ⚠️ Gap: SessionErrorEvent INV-1 cleanup path has no behavioral test
  • ⚠️ Gap: No behavioral test for watchdog Case B clean completion

Verdict: ⚠️ Request Changes

Actionable asks:

  1. Raise periodic flush threshold from 15s to 60s+ (prevents message fragmentation)
  2. Consider raising TurnEnd tool fallback from 30s to 60s
  3. Extract ComputeEffectiveTimeout to single source of truth

…st robustness

- [HIGH] Add ProcessingGeneration guard to periodic watchdog flush: capture
  flushGen before InvokeOnUI and validate inside lambda, matching the same
  pattern used by Case B and Case C. Prevents stale watchdog tick from
  flushing new-turn content into old-turn history after abort+resend.

- [MEDIUM] Switch TurnEnd fallback Task.Run closures from local Invoke() to
  class-level InvokeOnUI() for clarity. The local Invoke captures _syncContext
  via closure correctly, but InvokeOnUI makes the UI-thread dispatch intent
  explicit when reading cross-thread code.

- [MEDIUM] Fix flaky timing in ToolFallback_CancelledByTurnStart_DoesNotFire:
  replace production 4000ms constant with 50ms accelerated delay, matching
  all other tests in the file.

- [MEDIUM] Make watchdog source-parsing tests use dynamic method bounds:
  find end of RunProcessingWatchdogAsync by locating the next class member
  (_permissionRecoveryAttempts) instead of fixed 3500/8000 char substrings.

- [MEDIUM] Add break; control-flow assertion to NoActiveTool test: verifies
  Case B exits via break before reaching Case C error path, catching
  accidental fallthrough regressions.

- [MEDIUM] Add WatchdogPeriodicFlush_HasGenerationGuard test: verifies the
  generation guard is present in the flush block via source assertion.

- [LOW] Add TODO comment near reflection helpers in ChatExperienceSafetyTests
  suggesting InternalsVisibleTo as a compile-time safety improvement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Mar 10, 2026
- Bump header: 9 invariants → 13 invariants, 8 PRs → 9 PRs, 8 paths → 9 paths
- Add 9th IsProcessing path: watchdog Case B (clean CompleteResponse on lost
  terminal event, no error message) — distinguishable from the kill path
- Add INV-10: TurnEnd fallback must not be permanently suppressed by
  HasUsedToolsThisTurn; use ActiveToolCallCount guard + 30s extended delay
- Add INV-11: Watchdog 3-way branch (Case A: liveness probe, Case B: clean
  complete, Case C: error kill) to distinguish active tools from lost events
- Add INV-12: All background→UI dispatches must capture+validate
  ProcessingGeneration before/inside InvokeOnUI lambda
- Add INV-13: Use class-level InvokeOnUI() not local Invoke() in Task.Run
- Update Top 5 Recurring Mistakes: replace 'ActiveToolCallCount as sole
  signal' with the more impactful 'Suppressing TurnEnd fallback entirely'
- Update regression chain: add #332 to the sequence
- Add full PR #332 entry to regression-history.md with root causes,
  3-way branch details, periodic flush, and session recovery improvement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Bump header: 9 invariants → 13 invariants, 8 PRs → 9 PRs, 8 paths → 9 paths
- Add 9th IsProcessing path: watchdog Case B (clean CompleteResponse on lost
  terminal event, no error message) — distinguishable from the kill path
- Add INV-10: TurnEnd fallback must not be permanently suppressed by
  HasUsedToolsThisTurn; use ActiveToolCallCount guard + 30s extended delay
- Add INV-11: Watchdog 3-way branch (Case A: liveness probe, Case B: clean
  complete, Case C: error kill) to distinguish active tools from lost events
- Add INV-12: All background→UI dispatches must capture+validate
  ProcessingGeneration before/inside InvokeOnUI lambda
- Add INV-13: Use class-level InvokeOnUI() not local Invoke() in Task.Run
- Update Top 5 Recurring Mistakes: replace 'ActiveToolCallCount as sole
  signal' with the more impactful 'Suppressing TurnEnd fallback entirely'
- Update regression chain: add #332 to the sequence
- Add full PR #332 entry to regression-history.md with root causes,
  3-way branch details, periodic flush, and session recovery improvement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the chat-safety-tests branch from a91a605 to 978b752 Compare March 10, 2026 00:30
… (PR #332 followup)

Root cause analysis of the stuck orchestrator session (gen=28, 00:37:19-00:39:40):
  1. AssistantTurnEndEvent arrived → 4s TurnEnd fallback started
  2. AssistantTurnStartEvent arrived within 4s → CancelTurnEndFallback called SILENTLY
     (TurnStartEvent was not logged in [EVT], making it invisible in diagnostics)
  3. SDK dropped terminal events for the follow-on sub-turn
  4. isMultiAgentSession=true forced 600s watchdog timeout even with no tools
  5. Session stuck until user clicked Stop (~2.5 min later)

Fix 1 — Diagnostic: Add AssistantTurnStartEvent to [EVT] log filter
  The [EVT] filter previously only logged SessionIdleEvent, AssistantTurnEndEvent, and
  SessionErrorEvent. TurnStart events were invisible, making it impossible to see that a
  TurnStart cancelled the TurnEnd fallback. Now all critical lifecycle events are logged.

Fix 2 — Behavioral: Remove isMultiAgentSession from useToolTimeout
  isMultiAgentSession alone no longer extends the watchdog to 600s. Workers doing text-heavy
  PR reviews stream delta events continuously, keeping elapsed small regardless of timeout.
  The 600s guard was protecting against a scenario that couldn't occur during active streaming.
  Pure reasoning/text sub-turns now use the 120s inactivity timeout.

Fix 3 — Case B expansion: multi-agent + server alive → clean complete
  Case B (clean complete, no error message) now fires for multi-agent sessions when
  isMultiAgentSession=true and server is alive, not just when hasUsedTools=true. This
  prevents the error kill message when a session completed successfully but SDK dropped
  terminal events.

Tests (3 new, 4 updated):
  - WatchdogTimeoutSelection_MultiAgent_NoTools_UsesInactivityTimeout (renamed+updated)
  - WatchdogDecision_MultiAgent_NoTools_CaseBIncludesMultiAgent_InSource (new source assert)
  - MultiAgent_NoTools_UseInactivityTimeout_NotToolTimeout (new formula test)
  - AssistantTurnStartEvent_LoggedInEvtDiagnostics_InSource (new source assert)
  - Regression_PR195_MultiAgentWorkers_DeltasKeepElapsedSmall (updated regression test)
  - ExhaustiveMatrix and NonResumed tests updated to reflect 120s for multi-agent no-tool

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 3ef667e into main Mar 10, 2026
@PureWeen PureWeen deleted the chat-safety-tests branch March 10, 2026 01:25
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…n#332)

## Summary

Rescues chat sessions that get stuck for 600s waiting for a
SessionIdleEvent that never arrives (SDK bug PureWeen#299 / network hiccup).
Adds 41 behavioral safety tests exercising the real CopilotService
pipeline via reflection.

## Root-cause bugs fixed

1. **TurnEnd fallback permanently disabled for tool sessions**: When
HasUsedToolsThisTurn was true, the TurnEnd→Idle 4s fallback returned
early with a skip log — meaning tool sessions had *zero* fallback if
SessionIdleEvent was lost. The session would sit stuck until the 600s
watchdog killed it with an error message, discarding the (complete)
response.

2. **Watchdog couldn't distinguish active tools from lost SDK events**:
The watchdog treated all `elapsed >= effectiveTimeout` the same — kill
with error. But there are three distinct situations:
- **Tool actively running** (`ActiveToolCallCount > 0`, server alive):
should reset timer, not kill
- **Tool finished, SessionIdleEvent lost** (`ActiveToolCallCount = 0`,
`HasUsedToolsThisTurn = true`): should complete cleanly (response IS
done)
- **Genuinely stuck / server dead**: should kill with error (existing
behavior)

## Four fixes

### 1. TurnEnd 30s fallback for tool sessions
Instead of skipping the fallback entirely when tools were used, wait an
additional 30s (TurnEndIdleToolFallbackAdditionalMs). If no TurnStart
arrives in that window, SessionIdleEvent was lost → fire
CompleteResponse. Total wait for tool sessions = 4s + 30s = 34s.

### 2. Watchdog 3-way branch
When `elapsed >= effectiveTimeout` and max time not exceeded:
- **Case A** (active tool + server alive): reset LastEventAtTicks,
continue loop
- **Case B** (no active tool + tools were used): CompleteResponse
cleanly (no error message)
- **Case C** (max time exceeded / server dead / demo): kill with error
(unchanged)

### 3. Periodic watchdog flush
Every watchdog check interval, if CurrentResponse has accumulated
content, flush it to History. This ensures partial responses are visible
in the chat even while IsProcessing=true.

### 4. ExternalToolRequestedEvent in EventMatrix
Classified as TimelineOnly to prevent "Unhandled" log spam while still
allowing LastEventAtTicks updates.

## The stuck session scenario this resolves

Before this fix: User sends a prompt that triggers tool use → tools
execute → LLM responds → AssistantTurnEndEvent fires → fallback timer
starts but immediately returns ("tools were used, TurnStart likely
coming") → SessionIdleEvent is lost (SDK bug PureWeen#299) → session sits stuck
showing "Thinking..." for 600s → watchdog kills it with error message,
discarding the complete response.

After this fix: Same scenario → fallback timer waits 34s total → no
TurnStart arrives → CompleteResponse fires → user sees the complete
response with no error. If a tool is genuinely still running, the
watchdog probes server TCP liveness and resets its timer.

## Tests

- **41 behavioral safety tests** (ChatExperienceSafetyTests.cs):
Exercise real CompleteResponse, FlushCurrentResponse, AbortSessionAsync
via reflection. Cover INV-1 cleanup, generation guards, content
preservation, TCS completion, dedup guards, race conditions.
- **Watchdog decision tests** (ProcessingWatchdogTests.cs): Verify the
3-way branch logic, ExternalToolRequestedEvent classification, periodic
flush presence/ordering.
- **TurnEnd fallback tests** (TurnEndFallbackTests.cs): Verify
TurnEndIdleToolFallbackAdditionalMs bounds, total delay < watchdog
timeout, cancellation-by-TurnStart, eventual-fire.

All 2344 tests pass (3 pre-existing failures: 2 symlink tests on
Windows, 1 flaky unobserved-task test).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant