Fix stuck sessions: watchdog, SEND/COMPLETE race, abort flush, and bug report UI #147

Merged: PureWeen merged 8 commits into main from fix/watchdog-tool-timeout on Feb 19, 2026

PureWeen commented on Feb 19, 2026

Summary

Fixes multiple causes of sessions getting stuck in "Thinking" state and improves the bug reporting UX.

Bugs Fixed

1. Watchdog too aggressive during tool execution

The 2-minute watchdog was killing sessions running legitimate long tools (UI test builds taking 3-5 min). Now uses a tiered timeout: 2 min when idle, 10 min during active tool execution. Tracks tool calls via ActiveToolCallCount (incremented on ToolExecutionStartEvent, decremented on ToolExecutionCompleteEvent, reset on AssistantTurnStartEvent).
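
A minimal sketch of the tiered timeout, written in Python rather than the project's C# purely for illustration. The class, method, and idle-constant names are hypothetical, and clamping the counter at zero is an assumption not stated in this PR; only the 2 min / 10 min values and the increment/decrement/reset rules come from the description above.

```python
from dataclasses import dataclass

# Timeout values from the PR description. The idle constant name is
# hypothetical; the tool constant mirrors WatchdogToolExecutionTimeoutSeconds.
WATCHDOG_INACTIVITY_TIMEOUT_SECONDS = 120
WATCHDOG_TOOL_EXECUTION_TIMEOUT_SECONDS = 600


@dataclass
class SessionState:
    """Hypothetical stand-in for the per-session state the watchdog reads."""
    active_tool_call_count: int = 0

    def on_tool_execution_start(self) -> None:
        self.active_tool_call_count += 1

    def on_tool_execution_complete(self) -> None:
        # Assumption: clamp at zero so an unmatched completion event
        # cannot drive the counter negative.
        self.active_tool_call_count = max(0, self.active_tool_call_count - 1)

    def on_turn_start(self) -> None:
        # A new turn starts with no tools in flight.
        self.active_tool_call_count = 0


def effective_timeout_seconds(state: SessionState) -> int:
    """Pick the long timeout only while at least one tool is running."""
    if state.active_tool_call_count > 0:
        return WATCHDOG_TOOL_EXECUTION_TIMEOUT_SECONDS
    return WATCHDOG_INACTIVITY_TIMEOUT_SECONDS
```

Using a counter rather than a boolean lets overlapping tool calls keep the long timeout until the last one completes.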

2. SEND/COMPLETE race condition

When SessionIdleEvent queues CompleteResponse via SyncContext.Post(), a new SendPromptAsync could sneak in before the callback runs. The stale CompleteResponse would then clear the new turn's IsProcessing, turning all subsequent events into ghost events. Fixed with a generation counter: SendPromptAsync increments it, SessionIdleEvent captures it, and CompleteResponse verifies it still matches before proceeding.

Evidence from diagnostic log:

13:00:00.238 [EVT]      SessionIdleEvent → queued CompleteResponse
13:00:00.251 [SEND]     new prompt sets IsProcessing=true (9ms later)
13:00:00.261 [COMPLETE] runs with responseLen=0 → clears the WRONG turn
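
The generation guard can be sketched as follows. This is an illustrative Python model, not the project's C# code: a plain deque stands in for the SyncContext.Post() queue, and all class and function names are hypothetical. It replays the ordering from the log above (idle event queued, new send, then the stale completion runs).

```python
from collections import deque
from typing import Callable, Deque


class Session:
    """Hypothetical model of the fields involved in the race."""

    def __init__(self) -> None:
        self.is_processing = False
        self.processing_generation = 0
        self.completed_turns = 0

    def send_prompt(self) -> None:
        # Each new turn gets a fresh generation number.
        self.processing_generation += 1
        self.is_processing = True

    def complete_response(self, expected_generation: int) -> None:
        # A completion queued for an older turn must not touch the new one.
        if expected_generation != self.processing_generation:
            return
        self.is_processing = False
        self.completed_turns += 1


# Stand-in for callbacks posted to the UI thread.
ui_queue: Deque[Callable[[], None]] = deque()


def on_session_idle(session: Session) -> None:
    # Capture the generation now, at queue time, not when the callback runs.
    captured = session.processing_generation
    ui_queue.append(lambda: session.complete_response(captured))


def drain_ui_queue() -> None:
    while ui_queue:
        ui_queue.popleft()()


# Replay of the logged ordering.
session = Session()
session.send_prompt()        # turn 1
on_session_idle(session)     # [EVT] queues CompleteResponse for turn 1
session.send_prompt()        # [SEND] turn 2 starts before the callback runs
drain_ui_queue()             # [COMPLETE] runs late and is ignored as stale
```

After the replay, turn 2 still owns `is_processing`; without the generation check the late callback would have cleared it.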

3. No watchdog on restored sessions

Sessions restored after app relaunch with isStillProcessing=true had a 10s initial-event timeout but no ongoing watchdog. If the CLI went silent after resume, the session was stuck forever. Now calls StartProcessingWatchdog during restore.
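
A rough sketch of the restore path, assuming a polling-style watchdog with an injected clock so the behavior is easy to follow. The Python names are hypothetical, and only the idle timeout is modeled here; the real implementation is C# and applies the tiered timeout.

```python
from dataclasses import dataclass
from typing import Optional

INACTIVITY_TIMEOUT_SECONDS = 120.0  # idle tier from the PR description


@dataclass
class RestoredSession:
    """Hypothetical model of a session rebuilt after app relaunch."""
    is_processing: bool = False
    watchdog_started_at: Optional[float] = None  # None means no watchdog
    last_event_at: float = 0.0


def start_processing_watchdog(session: RestoredSession, now: float) -> None:
    session.watchdog_started_at = now
    session.last_event_at = now


def restore_session(is_still_processing: bool, now: float) -> RestoredSession:
    session = RestoredSession(is_processing=is_still_processing)
    if is_still_processing:
        # The fix: a restored mid-turn session gets the same ongoing
        # watchdog that a freshly sent prompt would get.
        start_processing_watchdog(session, now)
    return session


def watchdog_tick(session: RestoredSession, now: float) -> bool:
    """Return True if the watchdog fired and cleared the stuck session."""
    if session.watchdog_started_at is None or not session.is_processing:
        return False
    if now - session.last_event_at < INACTIVITY_TIMEOUT_SECONDS:
        return False
    session.is_processing = False
    session.watchdog_started_at = None
    return True
```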

4. Stop button discards partial response

AbortSessionAsync was clearing IsProcessing without saving the accumulated CurrentResponse to history. Clicking Stop on a stuck session made the streaming content disappear. Now flushes partial response to history before clearing state.
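
The ordering that matters here (flush first, then clear) can be sketched like this. It is an illustrative Python model with hypothetical names; persistence to the database is omitted, and the boolean return value is an assumption used to show the no-op case.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChatSession:
    """Hypothetical model of the state touched by an abort."""
    is_processing: bool = False
    current_response: str = ""
    active_tool_call_count: int = 0
    history: List[str] = field(default_factory=list)


def abort_session(session: ChatSession) -> bool:
    """Abort the current turn. Returns False when there is nothing to abort."""
    if not session.is_processing:
        return False  # no-op abort: leave state and history untouched

    # Flush the partial response BEFORE clearing state, so the streamed
    # text the user could already see survives as a history entry.
    if session.current_response:
        session.history.append(session.current_response)
    session.current_response = ""

    # Also drop any stale tool count so the next turn starts on the
    # short idle timeout.
    session.active_tool_call_count = 0
    session.is_processing = False
    return True
```

Returning early on a second call keeps repeated aborts idempotent: the partial response is stored once and later calls change nothing.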

5. Stale ActiveToolCallCount after abort/watchdog

ActiveToolCallCount was only reset on AssistantTurnStartEvent. After watchdog fire or abort, stale count caused 10-min timeout on dead connections. Now reset in SendPromptAsync, AbortSessionAsync, and watchdog fire path.

Features Added

Session selector in bug report UI

  • Session dropdown in Report Bug and Fix It panels — select which session broke, shows (Thinking) for stuck sessions
  • Selected session's debug info (IsProcessing, Model, MessageCount, WorkDir) auto-included in reports
  • New context menu items: 🐛 Report Bug and 🔧 Fix with Copilot on each session's ⋯ menu
  • Aria-label accessibility attributes on all new controls

Files Changed

| File | Changes |
|------|---------|
| CopilotService.cs | ProcessingGeneration field, ActiveToolCallCount reset in send/abort, watchdog on restore, abort flushes response |
| CopilotService.Events.cs | Tiered watchdog timeout, generation check in CompleteResponse, tool call tracking |
| SessionListItem.razor | 🐛 Report Bug and 🔧 Fix with Copilot menu items |
| SessionSidebar.razor | Session dropdown in bug report/fix panels, aria-labels |
| SessionSidebar.razor.css | .bug-report-select dropdown styling |
| ProcessingWatchdogTests.cs | 14 new tests (abort recovery, rapid sends, concurrent sessions, history integrity) |
| ScenarioReferenceTests.cs | Updated scenario references |
| mode-switch-scenarios.json | Updated scenario descriptions |

Testing

  • 655 tests passing (641 existing + 14 new)
  • Mac Catalyst builds clean
  • 4-model code review (Sonnet 4.6, Opus 4.6, GPT-5.3 Codex, Gemini 3 Pro) — all findings addressed

PureWeen and others added 3 commits February 19, 2026 08:29
The 2-minute inactivity watchdog was incorrectly triggering during
legitimate long-running tool executions (e.g., UI tests, builds).

- Track active tool call count via ToolExecutionStart/CompleteEvent
- Use 10-minute timeout when tools are running (WatchdogToolExecutionTimeoutSeconds)
- Keep 2-minute timeout when no tool is active (thinking/generating)
- Reset tool count on each new turn start
- Update message from 'Connection lost' to 'Session appears stuck' (more accurate)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…learing new turn

When SessionIdleEvent queues CompleteResponse via SyncContext.Post(), a new
SendPromptAsync can execute before the callback runs. Without verification,
CompleteResponse would clear IsProcessing for the WRONG turn, causing all
subsequent events to become ghost events (IsProcessing=false).

Evidence from diagnostic log (13:00:00.238–13:00:00.261):
  [EVT]  SessionIdleEvent → queued CompleteResponse
  [SEND] new prompt sets IsProcessing=true (9ms later)
  [COMPLETE] runs with responseLen=0 → clears the NEW turn's state

Fix: Add ProcessingGeneration counter to SessionState. SendPromptAsync
increments it; SessionIdleEvent captures it before Invoke(); CompleteResponse
checks the captured value matches current before proceeding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

14 new tests covering:
- Rapid sequential sends maintain clean state
- Abort clears stuck sessions regardless of generation
- Abort-then-resend flow (exact user-reported scenario)
- Concurrent sessions have independent state
- Multiple rapid aborts are idempotent
- History integrity preserved across abort/resend cycles
- OnStateChanged fires on abort but not on no-op abort
- Debug infrastructure wired up correctly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/watchdog-tool-timeout branch from 43844ff to 9b50636 on February 19, 2026 14:30
PureWeen and others added 4 commits February 19, 2026 08:34
- Session dropdown in Report Bug and Fix It panels lets users select
  which session had the issue. Shows (Thinking) indicator for stuck sessions.
- Selected session's debug info (IsProcessing, Model, MessageCount, etc.)
  is included in the bug report.
- New menu items in session context menu: 🐛 Report Bug and 🔧 Fix with
  Copilot — pre-selects the session in the corresponding panel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review findings from 4-model review (Sonnet 4.6, Opus 4.6, GPT-5.3, Gemini 3):

- Reset ActiveToolCallCount to 0 in SendPromptAsync, AbortSessionAsync,
  and watchdog fire path. Prevents stale count from forcing 10-min timeout
  on dead connections after a stuck session is cleared.
- Add aria-label attributes to bug report dropdown, textarea, and close
  buttons for screen reader accessibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restored sessions with isStillProcessing=true had a 10s initial-event
timeout but NO ongoing watchdog. If the CLI goes silent after resume
(as happened with FixUITestFromFreezing at 14:43 — stuck for 26 min
with no recovery), there was nothing to catch it.

Now StartProcessingWatchdog is called during restore, same as
SendPromptAsync, ensuring the 2min/10min tiered timeout applies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AbortSessionAsync was clearing IsProcessing without saving the
accumulated CurrentResponse to history. When the user clicked Stop
on a stuck session, the streaming content they could see disappeared
instead of being preserved as a message in the chat history.

Now flushes CurrentResponse to history and DB before clearing state,
so the partial response is preserved when Stop is clicked.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen changed the title from "Use longer watchdog timeout during active tool execution" to "Fix stuck sessions: watchdog, SEND/COMPLETE race, abort flush, and bug report UI" on Feb 19, 2026

Bug A: Watchdog callback missing ProcessingGeneration guard. Fix: capture generation before Post, verify inside callback.

Bug B: Resume fallback mutated state from background thread. Fix: marshal through InvokeOnUI, use Volatile.Read/Write.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen merged commit 49972f9 into main on Feb 19, 2026
6 checks passed
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern
(7 PRs, 16 fix/regression cycles) to copilot-instructions.md:

- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158,
#163, #164 and should prevent future regressions by making the invariants
explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen deleted the fix/watchdog-tool-timeout branch on February 22, 2026 00:15
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show
**Thinking...** with a Stop button. The user must manually click Stop
every time. The existing watchdog waited 600s (10 min!) before clearing
stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero
SDK events after restart. If no events flow within 30s of app start, the
session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6,
Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence** — short enough users don't wait, long enough for
SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated**: only fires when `HasReceivedEventsSinceResume ==
false`. Once events start flowing, it transitions to the normal 120s/600s
timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models
independently flagged that seeding from events.jsonl would cause
immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup
code, all 8 invariants preserved
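
Design decision 3 can be illustrated with a small sketch. It is written in Python for illustration only, with hypothetical names and made-up timestamps; the only values taken from the text are the 30s quiescence window and the idea of seeding from "now" at app start rather than from the last write time of events.jsonl.

```python
RESUME_QUIESCENCE_SECONDS = 30.0


def quiescence_expired(seed_time: float, now: float) -> bool:
    """True once the quiescence window, measured from seed_time, has elapsed."""
    return now - seed_time >= RESUME_QUIESCENCE_SECONDS


# Illustrative timeline (seconds on an arbitrary clock).
app_start = 1_000.0
last_file_write = app_start - 45.0  # session's event log last written 45s before restart

# Seeding from the file time: the window has already elapsed at startup,
# so the session would be cleared before the SDK had any chance to reconnect.
killed_immediately = quiescence_expired(last_file_write, app_start)

# Seeding from the app start time: the session gets the full 30s window.
still_waiting = not quiescence_expired(app_start, app_start + 5.0)
fires_after_window = quiescence_expired(app_start, app_start + 30.0)
```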

### Timeout tiers (3-tier, was 2-tier):
| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |
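
The table can be read as a small decision function. The sketch below is Python for illustration, not the production `ComputeEffectiveTimeout` helper; the parameter names are hypothetical, and checking the quiescence tier first (even when tools or multi-agent flags are set) is an assumption, since the table does not state precedence.

```python
RESUME_QUIESCENCE_SECONDS = 30
INACTIVITY_TIMEOUT_SECONDS = 120
TOOL_EXECUTION_TIMEOUT_SECONDS = 600


def compute_effective_timeout(
    *,
    is_resumed: bool,
    has_received_events_since_resume: bool,
    has_active_tools: bool,
    is_multi_agent: bool,
) -> int:
    """Return the watchdog timeout in seconds for the current session state."""
    # Tier 1: resumed session that has seen zero events since restart.
    # Assumption: this tier is checked before the others.
    if is_resumed and not has_received_events_since_resume:
        return RESUME_QUIESCENCE_SECONDS
    # Tier 3: active tools, resumed with events flowing, or multi-agent.
    if has_active_tools or is_resumed or is_multi_agent:
        return TOOL_EXECUTION_TIMEOUT_SECONDS
    # Tier 2: normal processing with no tools running.
    return INACTIVITY_TIMEOUT_SECONDS
```

Keyword-only arguments keep call sites readable when four booleans are passed together.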

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed
time safety, exhaustive timeout matrix
- Updated existing tests to use `ComputeEffectiveTimeout` helper
mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs
#141, #147, #148, #153, #158, #163, #164). The most relevant precedent: PR
#148 added a 10s resume timeout that killed active sessions. Our 30s
timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>