Fix stuck sessions: watchdog, SEND/COMPLETE race, abort flush, and bug report UI #147
Merged
The 2-minute inactivity watchdog was incorrectly triggering during legitimate long-running tool executions (e.g., UI tests, builds).
- Track active tool call count via ToolExecutionStart/CompleteEvent
- Use 10-minute timeout when tools are running (WatchdogToolExecutionTimeoutSeconds)
- Keep 2-minute timeout when no tool is active (thinking/generating)
- Reset tool count on each new turn start
- Update message from 'Connection lost' to 'Session appears stuck' (more accurate)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
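The tiered timeout described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `WatchdogState`, the constant `WatchdogInactivityTimeoutSeconds`, the method names, and the clamp-at-zero behavior on completion are all assumptions. Only `ActiveToolCallCount` and `WatchdogToolExecutionTimeoutSeconds` are names taken from the commit messages.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the per-session watchdog state.
public sealed class WatchdogState
{
    // Assumed name for the 2-minute tier; the commit only names the 10-minute constant.
    public const int WatchdogInactivityTimeoutSeconds = 120;
    public const int WatchdogToolExecutionTimeoutSeconds = 600;

    private int _activeToolCallCount;

    public int ActiveToolCallCount => Volatile.Read(ref _activeToolCallCount);

    // Called when a tool execution start event arrives.
    public void OnToolExecutionStart() => Interlocked.Increment(ref _activeToolCallCount);

    // Called when a tool execution complete event arrives.
    // Assumption: clamp at zero so an unmatched completion cannot go negative.
    public void OnToolExecutionComplete()
    {
        int current;
        do
        {
            current = Volatile.Read(ref _activeToolCallCount);
            if (current == 0) return;
        } while (Interlocked.CompareExchange(ref _activeToolCallCount, current - 1, current) != current);
    }

    // Called at the start of each new turn to discard any stale count.
    public void OnTurnStart() => Interlocked.Exchange(ref _activeToolCallCount, 0);

    // 10 minutes while any tool is running, otherwise 2 minutes.
    public TimeSpan EffectiveTimeout =>
        TimeSpan.FromSeconds(ActiveToolCallCount > 0
            ? WatchdogToolExecutionTimeoutSeconds
            : WatchdogInactivityTimeoutSeconds);

    // True when no event has been seen for at least the effective timeout.
    public bool IsStuck(DateTime lastEventUtc, DateTime nowUtc) =>
        nowUtc - lastEventUtc >= EffectiveTimeout;
}
```

With this shape, a session that has been silent for five minutes is treated as stuck while idle but not while a tool call is outstanding.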
…learing new turn
When SessionIdleEvent queues CompleteResponse via SyncContext.Post(), a new SendPromptAsync can execute before the callback runs. Without verification, CompleteResponse would clear IsProcessing for the WRONG turn, causing all subsequent events to become ghost events (IsProcessing=false).
Evidence from diagnostic log (13:00:00.238–13:00:00.261):
[EVT] SessionIdleEvent → queued CompleteResponse
[SEND] new prompt sets IsProcessing=true (9ms later)
[COMPLETE] runs with responseLen=0 → clears the NEW turn's state
Fix: Add ProcessingGeneration counter to SessionState. SendPromptAsync increments it; SessionIdleEvent captures it before Invoke(); CompleteResponse checks the captured value matches current before proceeding.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
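The generation guard can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the PR's implementation: the `SessionState` shape, the `BeginTurn` method (standing in for what `SendPromptAsync` does), and the `Queue<Action>` used in place of `SyncContext.Post()` are all hypothetical. Only the names `ProcessingGeneration`, `IsProcessing`, and `CompleteResponse` come from the commit message.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical, simplified session state.
public sealed class SessionState
{
    private long _processingGeneration;

    public bool IsProcessing { get; private set; }
    public long ProcessingGeneration => Interlocked.Read(ref _processingGeneration);

    // Stand-in for the part of SendPromptAsync that starts a new turn.
    public void BeginTurn()
    {
        Interlocked.Increment(ref _processingGeneration);
        IsProcessing = true;
    }

    // Returns false (and changes nothing) when the captured generation is stale.
    public bool CompleteResponse(long capturedGeneration)
    {
        if (capturedGeneration != ProcessingGeneration) return false;
        IsProcessing = false;
        return true;
    }
}

public static class RaceDemo
{
    public static (bool StaleApplied, bool StillProcessing) Run()
    {
        var session = new SessionState();
        var uiQueue = new Queue<Action>(); // stand-in for posted UI callbacks
        bool applied = true;

        session.BeginTurn();                           // turn 1 starts
        long captured = session.ProcessingGeneration;  // idle event captures the generation
        uiQueue.Enqueue(() => applied = session.CompleteResponse(captured));

        session.BeginTurn();                           // turn 2 sneaks in before the callback

        while (uiQueue.Count > 0) uiQueue.Dequeue()(); // queued callback finally runs

        return (applied, session.IsProcessing);
    }
}
```

In this simulation the queued completion is rejected because the generation moved on, so the second turn keeps `IsProcessing` set.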
14 new tests covering:
- Rapid sequential sends maintain clean state
- Abort clears stuck sessions regardless of generation
- Abort-then-resend flow (exact user-reported scenario)
- Concurrent sessions have independent state
- Multiple rapid aborts are idempotent
- History integrity preserved across abort/resend cycles
- OnStateChanged fires on abort but not on no-op abort
- Debug infrastructure wired up correctly
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 43844ff to 9b50636
- Session dropdown in Report Bug and Fix It panels lets users select which session had the issue. Shows (Thinking) indicator for stuck sessions.
- Selected session's debug info (IsProcessing, Model, MessageCount, etc.) is included in the bug report.
- New menu items in session context menu: 🐛 Report Bug and 🔧 Fix with Copilot. Each pre-selects the session in the corresponding panel.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review findings from 4-model review (Sonnet 4.6, Opus 4.6, GPT-5.3, Gemini 3):
- Reset ActiveToolCallCount to 0 in SendPromptAsync, AbortSessionAsync, and watchdog fire path. Prevents stale count from forcing 10-min timeout on dead connections after a stuck session is cleared.
- Add aria-label attributes to bug report dropdown, textarea, and close buttons for screen reader accessibility.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restored sessions with isStillProcessing=true had a 10s initial-event timeout but NO ongoing watchdog. If the CLI goes silent after resume (as happened with FixUITestFromFreezing at 14:43 — stuck for 26 min with no recovery), there was nothing to catch it. Now StartProcessingWatchdog is called during restore, same as SendPromptAsync, ensuring the 2min/10min tiered timeout applies. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AbortSessionAsync was clearing IsProcessing without saving the accumulated CurrentResponse to history. When the user clicked Stop on a stuck session, the streaming content they could see disappeared instead of being preserved as a message in the chat history. Now flushes CurrentResponse to history and DB before clearing state, so the partial response is preserved when Stop is clicked. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
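The flush-before-clear ordering can be sketched as below. This is an illustrative model only: the `AbortableSession` class, its members, and the in-memory `History` list are assumptions, and the database write mentioned in the commit is left out. The boolean return value is one hypothetical way to let a caller raise `OnStateChanged` only when the abort actually changed something.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified session used to show the ordering of an abort.
public sealed class AbortableSession
{
    public bool IsProcessing { get; private set; }
    public StringBuilder CurrentResponse { get; } = new StringBuilder();
    public List<string> History { get; } = new List<string>();

    public void BeginTurn() => IsProcessing = true;

    // Accumulates streamed content for the turn in progress.
    public void AppendDelta(string text) => CurrentResponse.Append(text);

    // Returns true if state changed; a second abort on an idle session is a no-op.
    public bool Abort()
    {
        if (!IsProcessing) return false;

        // Flush the partial response BEFORE clearing state so it stays in history.
        if (CurrentResponse.Length > 0)
        {
            History.Add(CurrentResponse.ToString());
            CurrentResponse.Clear();
        }

        IsProcessing = false;
        return true;
    }
}
```

Calling `Abort()` twice in a row leaves exactly one history entry for the partial response, which matches the idempotency the tests in this PR describe.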
Bug A: Watchdog callback missing ProcessingGeneration guard. Fix: capture generation before Post, verify inside callback.
Bug B: Resume fallback mutated state from background thread. Fix: marshal through InvokeOnUI, use Volatile.Read/Write.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge
Add comprehensive documentation of the recurring stuck-session bug pattern (7 PRs, 16 fix/regression cycles) to copilot-instructions.md:
- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions
This knowledge was hard-won across PRs #141, #147, #148, #153, #158, #163, #164 and should prevent future regressions by making the invariants explicit and discoverable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show **Thinking...** with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1):
1. **30s quiescence**: short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated**: only fires when `HasReceivedEventsSinceResume == false`. Once events start flowing, transitions to normal 120s/600s timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time**: all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path**: no new IsProcessing cleanup code, all 8 invariants preserved

### Timeout tiers (3-tier, was 2-tier):

| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed time safety, exhaustive timeout matrix
- Updated existing tests to use `ComputeEffectiveTimeout` helper mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
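The three-tier table above can be read as a small pure function. The following is a hedged sketch: the name `ComputeEffectiveTimeout` appears in the commit message as a test helper, but its signature, the parameter names, the constant names, and the order in which the conditions are checked are assumptions made here for illustration. In particular, giving the 30s quiescence tier priority over the 600s tier is a guess, since the table does not say which wins when both apply.

```csharp
using System;

public static class WatchdogTimeouts
{
    // Values from the timeout table; constant names are hypothetical.
    public const int ResumeQuiescenceSeconds = 30;
    public const int InactivitySeconds = 120;
    public const int ToolExecutionSeconds = 600;

    // Hypothetical signature mirroring the three rows of the table.
    public static int ComputeEffectiveTimeout(
        bool isResumed,
        bool hasReceivedEventsSinceResume,
        bool hasActiveTools,
        bool isMultiAgent)
    {
        // Tier 1 (assumed to take priority): resumed, zero events since restart.
        if (isResumed && !hasReceivedEventsSinceResume)
            return ResumeQuiescenceSeconds;

        // Tier 3: active tools, resumed with events flowing, or multi-agent.
        if (hasActiveTools || isResumed || isMultiAgent)
            return ToolExecutionSeconds;

        // Tier 2: normal processing with no tools running.
        return InactivitySeconds;
    }
}
```

Keeping the tier selection in one side-effect-free function is what makes an exhaustive timeout matrix, like the one the tests describe, straightforward to write.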
Summary
Fixes multiple causes of sessions getting stuck in "Thinking" state and improves the bug reporting UX.
Bugs Fixed
1. Watchdog too aggressive during tool execution
The 2-minute watchdog was killing sessions running legitimate long tools (UI test builds taking 3-5 min). Now uses a tiered timeout: 2 min when idle, 10 min during active tool execution. Tracks tool calls via `ActiveToolCallCount` (increment on `ToolExecutionStartEvent`, decrement on `ToolExecutionCompleteEvent`, reset on `TurnStart`).
2. SEND/COMPLETE race condition
When `SessionIdleEvent` queues `CompleteResponse` via `SyncContext.Post()`, a new `SendPromptAsync` could sneak in before the callback runs. The stale `CompleteResponse` would then clear the new turn's `IsProcessing`, turning all subsequent events into ghost events. Fixed with a generation counter: `SendPromptAsync` increments it, `SessionIdleEvent` captures it, and `CompleteResponse` verifies it matches before proceeding.
Evidence from the diagnostic log (13:00:00.238–13:00:00.261): `SessionIdleEvent` queued `CompleteResponse`, a new prompt set `IsProcessing=true` 9 ms later, and `CompleteResponse` then ran with `responseLen=0` and cleared the new turn's state.
3. No watchdog on restored sessions
Sessions restored after app relaunch with `isStillProcessing=true` had a 10s initial-event timeout but no ongoing watchdog. If the CLI went silent after resume, the session was stuck forever. Now calls `StartProcessingWatchdog` during restore.
4. Stop button discards partial response
`AbortSessionAsync` was clearing `IsProcessing` without saving the accumulated `CurrentResponse` to history. Clicking Stop on a stuck session made the streaming content disappear. Now flushes partial response to history before clearing state.
5. Stale `ActiveToolCallCount` after abort/watchdog
`ActiveToolCallCount` was only reset on `AssistantTurnStartEvent`. After watchdog fire or abort, stale count caused 10-min timeout on dead connections. Now reset in `SendPromptAsync`, `AbortSessionAsync`, and watchdog fire path.
Features Added
Session selector in bug report UI
- Shows `(Thinking)` for stuck sessions
Files Changed
- `CopilotService.cs`: `ProcessingGeneration` field, `ActiveToolCallCount` reset in send/abort, watchdog on restore, abort flushes response
- `CopilotService.Events.cs`: `CompleteResponse`, tool call tracking
- `SessionListItem.razor`
- `SessionSidebar.razor`
- `SessionSidebar.razor.css`: `.bug-report-select` dropdown styling
- `ProcessingWatchdogTests.cs`
- `ScenarioReferenceTests.cs`
- `mode-switch-scenarios.json`
Testing