Fix watchdog stuck session and false 'stuck' warnings#230
Merged
Conversation
Three bugs fixed: 1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent (e.g., FailedDelegation) without terminal events. Root cause: HandleSessionEvent unconditionally updated LastEventAtTicks for ALL events, including metrics-only events that don't indicate progress. Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and AssistantUsageEvent. 2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety net. Even if progress events keep arriving, no turn should run for 60 minutes without user notification. Uses ProcessingStartedAt (set in SendPromptAsync) so it cannot be reset by events. 3. False 'session stuck' warnings after app restart. Root cause: GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds (120s) as file age threshold, but tool executions can go 5-10 minutes without events.jsonl writes. Fix: Use WatchdogToolExecutionTimeoutSeconds (600s) threshold. Added ~25 regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224, plus invariant tests for the 8 processing state safety invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… diagnostic tag
- Replace DateTime? auto-property with Interlocked long ticks backing field to prevent
torn reads between watchdog background thread (reader) and UI thread (writer).
A torn read could yield HasValue=true but Value=default(DateTime) (0 ticks),
causing a huge totalProcessingSeconds and a false-positive immediate watchdog timeout.
- Add [WATCHDOG] Debug tag inside InvokeOnUI callback at IsProcessing=false path so
watchdog-triggered completions are captured in event-diagnostics.log with a consistent
tag matching other processing-state-change paths ([IDLE], [ERROR], [ABORT], etc.)
- Update Debug() diagnostics filter to include StartsWith("[WATCHDOG") so tag-based
watchdog messages are persisted to file (message.Contains("watchdog") matched
lowercase only and missed the [WATCHDOG] uppercase tag at line 1292)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f9b6f5c to
9aac86f
Compare
Line 1293 used DateTime.Now (local time) while the watchdog computes totalProcessingSeconds as (DateTime.UtcNow - ProcessingStartedAt.Value). With the new Interlocked-backed getter that reconstructs as DateTimeKind.Utc, storing local-time ticks introduces UTC offset skew: - UTC+ zones: totalProcessingSeconds appears negative → exceededMaxTime never fires - UTC- zones: totalProcessingSeconds inflated by offset → premature 60-min cap fire Fix: use DateTime.UtcNow, consistent with all other write sites (line 1787, etc.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Snapshot ProcessingStartedAt once before use to avoid InvalidOperationException if CompleteResponse clears the value between .HasValue and .Value reads. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Three bugs in the processing watchdog:
Session stuck at 361 minutes — MergeNET11 session had
IsProcessing=truefor 361m. The SDK was sending repeatedSessionUsageInfoEvent(FailedDelegation) which resetLastEventAtTickswithout ever sending a terminal event. The watchdog timer kept resetting and never fired.No max processing time — Even with fix Polish UI, Rename Sessions, Markdown Output Support, Queued Messages #1, there was no absolute ceiling. A session could run forever as long as progress events arrived.
False 'stuck' warnings — After app restart, sessions got
⚠️ Session appears stuck — no events received for over 30 secondseven though they were actively working.GetEventsFileRestoreHintsused 120s threshold, but tool executions can go 5-10 minutes without events.jsonl writes.Fix
LastEventAtTicksupdate — Skip forSessionUsageInfoEventandAssistantUsageEvent(metrics-only events that don't indicate turn progress).WatchdogMaxProcessingTimeSeconds(3600s = 60 min) — Absolute safety net usingProcessingStartedAt, which cannot be reset by events.WatchdogInactivityTimeoutSeconds(120s) toWatchdogToolExecutionTimeoutSeconds(600s).Testing