Skip to content

Fix watchdog stuck session and false 'stuck' warnings#230

Merged
PureWeen merged 4 commits intomainfrom
fix/session-mergenet11-can-you-please-review-20260226-1947
Feb 26, 2026
Merged

Fix watchdog stuck session and false 'stuck' warnings#230
PureWeen merged 4 commits intomainfrom
fix/session-mergenet11-can-you-please-review-20260226-1947

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Three bugs in the processing watchdog:

  1. Session stuck at 361 minutes — MergeNET11 session had IsProcessing=true for 361m. The SDK was sending repeated SessionUsageInfoEvent (FailedDelegation) which reset LastEventAtTicks without ever sending a terminal event. The watchdog timer kept resetting and never fired.

  2. No max processing time — Even with fix Polish UI, Rename Sessions, Markdown Output Support, Queued Messages #1, there was no absolute ceiling. A session could run forever as long as progress events arrived.

  3. False 'stuck' warnings — After app restart, sessions got ⚠️ Session appears stuck — no events received for over 30 seconds even though they were actively working. GetEventsFileRestoreHints used 120s threshold, but tool executions can go 5-10 minutes without events.jsonl writes.

Fix

  1. Gate LastEventAtTicks update — Skip for SessionUsageInfoEvent and AssistantUsageEvent (metrics-only events that don't indicate turn progress).
  2. Add WatchdogMaxProcessingTimeSeconds (3600s = 60 min) — Absolute safety net using ProcessingStartedAt, which cannot be reset by events.
  3. Fix restore hints threshold — Changed from WatchdogInactivityTimeoutSeconds (120s) to WatchdogToolExecutionTimeoutSeconds (600s).

Testing

PureWeen and others added 2 commits February 26, 2026 16:07
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent
   (e.g., FailedDelegation) without terminal events. Root cause:
   HandleSessionEvent unconditionally updated LastEventAtTicks for ALL
   events, including metrics-only events that don't indicate progress.
   Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and
   AssistantUsageEvent.

2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety
   net. Even if progress events keep arriving, no turn should run for
   60 minutes without user notification. Uses ProcessingStartedAt
   (set in SendPromptAsync) so it cannot be reset by events.

3. False 'session stuck' warnings after app restart. Root cause:
   GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds
   (120s) as file age threshold, but tool executions can go 5-10
   minutes without events.jsonl writes. Fix: Use
   WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode
from PRs #148, #163, #195, #211, #224, plus invariant tests for
the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… diagnostic tag

- Replace DateTime? auto-property with Interlocked long ticks backing field to prevent
  torn reads between watchdog background thread (reader) and UI thread (writer).
  A torn read could yield HasValue=true but Value=default(DateTime) (0 ticks),
  causing a huge totalProcessingSeconds and a false-positive immediate watchdog timeout.
- Add [WATCHDOG] Debug tag inside InvokeOnUI callback at IsProcessing=false path so
  watchdog-triggered completions are captured in event-diagnostics.log with a consistent
  tag matching other processing-state-change paths ([IDLE], [ERROR], [ABORT], etc.)
- Update Debug() diagnostics filter to include StartsWith("[WATCHDOG") so tag-based
  watchdog messages are persisted to file (message.Contains("watchdog") matched
  lowercase only and missed the [WATCHDOG] uppercase tag at line 1292)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/session-mergenet11-can-you-please-review-20260226-1947 branch from f9b6f5c to 9aac86f Compare February 26, 2026 22:11
PureWeen and others added 2 commits February 26, 2026 16:17
Line 1293 used DateTime.Now (local time) while the watchdog computes
totalProcessingSeconds as (DateTime.UtcNow - ProcessingStartedAt.Value).
With the new Interlocked-backed getter that reconstructs as DateTimeKind.Utc,
storing local-time ticks introduces UTC offset skew:
- UTC+ zones: totalProcessingSeconds appears negative → exceededMaxTime never fires
- UTC- zones: totalProcessingSeconds inflated by offset → premature 60-min cap fire

Fix: use DateTime.UtcNow, consistent with all other write sites (line 1787, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Snapshot ProcessingStartedAt once before use to avoid InvalidOperationException
if CompleteResponse clears the value between .HasValue and .Value reads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 8ed8eec into main Feb 26, 2026
@PureWeen PureWeen deleted the fix/session-mergenet11-can-you-please-review-20260226-1947 branch February 26, 2026 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant