Skip to content

Fix stuck sessions: cap watchdog Case A resets for dead connections#342

Merged
PureWeen merged 4 commits intomainfrom
fix/session-ci-investigate-can-you-tell-if-t-20260310-1600
Mar 10, 2026
Merged

Fix stuck sessions: cap watchdog Case A resets for dead connections#342
PureWeen merged 4 commits intomainfrom
fix/session-ci-investigate-can-you-tell-if-t-20260310-1600

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Bug

Sessions get stuck with IsProcessing=true for hours when the JSON-RPC connection dies mid-tool-execution. Both CI-Investigate and CopilotImprovements sessions exhibited this.

Root Cause

The watchdog's Case A logic (tool active + server alive) resets LastEventAtTicks indefinitely when:

  1. ActiveToolCallCount > 0 — stuck because ToolExecutionCompleteEvent never arrived after ConnectionLostException
  2. Server is alive — the persistent server is shared and serves other sessions fine
  3. ProcessingStartedAt resets to DateTime.UtcNow on every app restart, so the 60-min absolute max safety net keeps getting pushed forward

Fix

  • Cap Case A resets via WatchdogMaxToolAliveResets = 3. After 3 consecutive resets (~30 min) with no real SDK events, assume the session's connection is dead and fall through to the timeout handler.
  • Reset counter on real events in HandleSessionEvent — proves the connection is alive.
  • Defensive ProcessingStartedAt null guard — if null while IsProcessing=true, treat as exceededMaxTime so Case A can't loop forever.

Testing

  • 7 new tests in ProcessingWatchdogTests.cs covering constant bounds, source-level verification of reset cap, counter clearing, and null guard.
  • All 2388 existing tests pass.

@PureWeen
Copy link
Copy Markdown
Owner Author

PR #342 Review — Fix stuck sessions: cap watchdog Case A resets for dead connections

CI Status: No CI checks on this branch (all 2388 existing tests pass per PR description)
Existing reviews: None
Models: claude-opus-4.6 x2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5)

The core fix logic is correct: capping WatchdogCaseAResets at 3 prevents the infinite-reset loop, the counter resets on real events and at StartProcessingWatchdog, and all accesses are via Interlocked.* for correct thread safety.


MINOR - Test off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime (5/5 consensus)

PolyPilot.Tests/ProcessingWatchdogTests.cs

The test asserts: WatchdogMaxToolAliveResets * WatchdogToolExecutionTimeoutSeconds < WatchdogMaxProcessingTimeSeconds = 3 x 600 = 1800 < 3600.

But the fire happens on the 4th timer trigger (resets = Increment -> 4; 4 > 3 -> terminate), after 3 full reset cycles of 600s each. Actual max stuck time is (WatchdogMaxToolAliveResets + 1) x 600 = 2400s.

Today this is safe (2400 < 3600). But the test creates false confidence: if WatchdogMaxToolAliveResets is bumped to 5, the test passes (5 x 600 = 3000 < 3600) while actual max is 6 x 600 = 3600s - equal to WatchdogMaxProcessingTimeSeconds. The cap would provide zero protection.

Fix: Change the assertion to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds.


MINOR - Fragile 300-char proximity window in CaseA_ResetCounter_ClearedOnRealEvents_InSource (2/5 consensus)

PolyPilot.Tests/ProcessingWatchdogTests.cs

The source.Substring(lastEventResetIdx, 300).Contains("WatchdogCaseAResets") proximity check is fragile: a future code change adding a comment or log line between the LastEventAtTicks reset and WatchdogCaseAResets reset could break this test without affecting correctness.


Below consensus - for author awareness

(2/5: gemini + sonnet) The new !startedAt.HasValue guard sets exceededMaxTime=true when ProcessingStartedAt is null. If Case B (the HasUsedToolsThisTurn / zero-idle path introduced in PR #332) is also gated by !exceededMaxTime, a restored session with null ProcessingStartedAt and HasUsedToolsThisTurn=true would skip Case B and fall to a stuck warning + forced CompleteResponse instead of a clean CompleteResponse. Reachability depends on whether RestoreSingleSessionAsync sets ProcessingStartedAt = DateTime.UtcNow for processing sessions (per INV-9, it should). Worth confirming.


Recommended Action

Request changes - one mechanical fix: correct the off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. The core bug fix is correct and safe today, but the imprecise test guard would pass silently if the constant is raised to a value that defeats the cap.

PureWeen and others added 3 commits March 10, 2026 12:56
When a session's JSON-RPC connection dies mid-tool-execution
(ConnectionLostException), the SDK never emits SessionErrorEvent so
IsProcessing stays true. The watchdog's Case A logic (tool active + server
alive) kept resetting LastEventAtTicks indefinitely because the persistent
server is shared and stays alive serving other sessions. Combined with
ProcessingStartedAt resetting to DateTime.UtcNow on every app restart,
the 60-minute absolute max safety net never fires.

Fix:
- Add WatchdogCaseAResets counter to SessionState, capped at 3 resets
  (WatchdogMaxToolAliveResets). After 3 consecutive resets (~30 min)
  with no real SDK events, assume the session connection is dead and
  fall through to the timeout handler.
- Reset the counter when real SDK events arrive (HandleSessionEvent),
  proving the connection is alive.
- Add defensive ProcessingStartedAt null guard: if null while
  IsProcessing=true, treat as exceededMaxTime so Case A cannot loop
  forever.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The reset counter is incremented before the comparison (Increment returns
new value), so the cap fires on the (N+1)th trigger. Actual max stuck
time is (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds.

Without this fix, bumping WatchdogMaxToolAliveResets to 5 would pass the
test (5*600=3000 < 3600) while actual max is 6*600=3600 — equal to the
absolute max, providing zero protection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a reconnect in SendPromptAsync, the new SessionState was inheriting
HasUsedToolsThisTurn=true from the dead connection, inflating the watchdog
timeout from 120s (inactivity) to 600s (tool execution). Additionally,
ProcessingStartedAt was not reset, so the 60-min max-time safety net
measured from the original send time rather than reconnect time.

Changes:
- Reset HasUsedToolsThisTurn=false in reconnect block (connection-specific)
- Reset ActiveToolCallCount=0 and SuccessfulToolCountThisTurn=0
- Reset ProcessingStartedAt=DateTime.UtcNow for fresh max-time deadline
- Add 3 source-verification tests for reconnect state resets
- Update MultiAgentRegressionTests to match new reconnect behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/session-ci-investigate-can-you-tell-if-t-20260310-1600 branch from 474341b to fbcee03 Compare March 10, 2026 18:03
When Case A reset cap is exceeded, the log now accurately says
reset cap exceeded instead of falling through to server not
responding. Also fix doc comment from 30 to 40 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 3a1ad92 into main Mar 10, 2026
@PureWeen PureWeen deleted the fix/session-ci-investigate-can-you-tell-if-t-20260310-1600 branch March 10, 2026 19:58
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #342 Review — Fix stuck sessions: cap watchdog Case A resets for dead connections

CI Status: No CI checks on this branch (all 2388 existing tests pass per PR description)
Existing reviews: None
Models: claude-opus-4.6 x2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5)

The core fix logic is correct: capping WatchdogCaseAResets at 3 prevents the infinite-reset loop, the counter resets on real events and at StartProcessingWatchdog, and all accesses are via Interlocked.* for correct thread safety.


MINOR - Test off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime (5/5 consensus)

PolyPilot.Tests/ProcessingWatchdogTests.cs

The test asserts: WatchdogMaxToolAliveResets * WatchdogToolExecutionTimeoutSeconds < WatchdogMaxProcessingTimeSeconds = 3 x 600 = 1800 < 3600.

But the fire happens on the 4th timer trigger (resets = Increment -> 4; 4 > 3 -> terminate), after 3 full reset cycles of 600s each. Actual max stuck time is (WatchdogMaxToolAliveResets + 1) x 600 = 2400s.

Today this is safe (2400 < 3600). But the test creates false confidence: if WatchdogMaxToolAliveResets is bumped to 5, the test passes (5 x 600 = 3000 < 3600) while actual max is 6 x 600 = 3600s - equal to WatchdogMaxProcessingTimeSeconds. The cap would provide zero protection.

Fix: Change the assertion to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds.


MINOR - Fragile 300-char proximity window in CaseA_ResetCounter_ClearedOnRealEvents_InSource (2/5 consensus)

PolyPilot.Tests/ProcessingWatchdogTests.cs

The source.Substring(lastEventResetIdx, 300).Contains("WatchdogCaseAResets") proximity check is fragile: a future code change adding a comment or log line between the LastEventAtTicks reset and WatchdogCaseAResets reset could break this test without affecting correctness.


Below consensus - for author awareness

(2/5: gemini + sonnet) The new !startedAt.HasValue guard sets exceededMaxTime=true when ProcessingStartedAt is null. If Case B (the HasUsedToolsThisTurn / zero-idle path introduced in PR #332) is also gated by !exceededMaxTime, a restored session with null ProcessingStartedAt and HasUsedToolsThisTurn=true would skip Case B and fall to a stuck warning + forced CompleteResponse instead of a clean CompleteResponse. Reachability depends on whether RestoreSingleSessionAsync sets ProcessingStartedAt = DateTime.UtcNow for processing sessions (per INV-9, it should). Worth confirming.


Recommended Action

Request changes - one mechanical fix: correct the off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. The core bug fix is correct and safe today, but the imprecise test guard would pass silently if the constant is raised to a value that defeats the cap.

PureWeen added a commit that referenced this pull request Mar 10, 2026
## Problem

`HasUsedToolsThisTurn` on `SessionState` is read by the watchdog timer
thread and SDK background threads while being written from the UI thread
and SDK event handlers. It had **inconsistent synchronization**: 5 sites
used `Volatile.Read/Write` while 16 used plain access.

On ARM (iOS/Android), plain writes from the UI thread may not be
immediately visible to background thread reads. This is a theoretical
data race that could cause the watchdog to use incorrect timeouts (120s
vs 600s).

Identified during code review of PR #342 (INV-7 from the
processing-state-safety skill).

## Fix

- Declare the field `volatile bool` — all reads/writes automatically
have correct memory barriers
- Remove the 5 now-redundant `Volatile.Read/Write` calls (would generate
CS0420 warnings)
- Remove a duplicate write in the reconnect path (line ~2613 was
redundant with ~2639)
- Replace the `HasUsedToolsThisTurn_VolatileConsistency` test with
`HasUsedToolsThisTurn_IsDeclaredVolatile` that uses reflection to verify
the volatile modifier

## Testing

All 2401 tests pass. No CS0420 warnings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #342 — Round 2 Re-review: "Fix stuck sessions: cap watchdog Case A resets for dead connections"

New commits since Round 1: a9748748, fbcee030, 9bd79887
Models: Worker 3 (claude-sonnet-4.6 re-review)


Previous Findings Status

  • Finding 1 (Test off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime): ✅ FIXED
    Test now uses (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds, correctly accounting for the increment-before-compare semantics.

  • Finding 2 (Fragile 300-char proximity window in CaseA_ResetCounter_ClearedOnRealEvents_InSource): ✅ Effectively resolved
    The proximity window remains (300 chars), but the production code now resets WatchdogCaseAResets on the very next line after LastEventAtTicks in HandleSessionEvent, making the test structurally sound in practice.

  • Finding 3 (state.Info vs newState inconsistency + stale tool state in reconnect): ✅ FIXED
    Commit fbcee030 resets all tool tracking for the new connection:

    • newState.HasUsedToolsThisTurn = false
    • Interlocked.Exchange(ref newState.ActiveToolCallCount, 0)
    • Interlocked.Exchange(ref newState.SuccessfulToolCountThisTurn, 0)
    • state.Info.ProcessingStartedAt = DateTime.UtcNow (shared Info object — correct)

    New regression tests cover all three (ReconnectPath_ResetsHasUsedToolsThisTurn_InSource, ReconnectPath_ResetsProcessingStartedAt_InSource, ReconnectPath_ResetsActiveToolCallCount_InSource).

  • Below-consensus: !startedAt.HasValue guard: ✅ Addressed
    Author added ExceededMaxTime_TrueWhenProcessingStartedAtNull_InSource test explicitly covering this path.


New Findings

None. The updated diff is clean.


Recommended Action

Approve — all Round 1 findings resolved, no new issues. The reconnect stale-state fix (fbcee030) is a meaningful correctness improvement beyond the original scope, and the test suite now comprehensively covers the new code paths.


Re-review by Worker 3 (claude-sonnet-4.6)

arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…ureWeen#342)

## Bug

Sessions get stuck with `IsProcessing=true` for hours when the JSON-RPC
connection dies mid-tool-execution. Both `CI-Investigate` and
`CopilotImprovements` sessions exhibited this.

## Root Cause

The watchdog's **Case A** logic (tool active + server alive) resets
`LastEventAtTicks` indefinitely when:
1. `ActiveToolCallCount > 0` — stuck because
`ToolExecutionCompleteEvent` never arrived after
`ConnectionLostException`
2. Server is alive — the persistent server is shared and serves other
sessions fine
3. `ProcessingStartedAt` resets to `DateTime.UtcNow` on every app
restart, so the 60-min absolute max safety net keeps getting pushed
forward

## Fix

- **Cap Case A resets** via `WatchdogMaxToolAliveResets = 3`. After 3
consecutive resets (~30 min) with no real SDK events, assume the
session's connection is dead and fall through to the timeout handler.
- **Reset counter on real events** in `HandleSessionEvent` — proves the
connection is alive.
- **Defensive `ProcessingStartedAt` null guard** — if null while
`IsProcessing=true`, treat as `exceededMaxTime` so Case A can't loop
forever.

## Testing

- 7 new tests in `ProcessingWatchdogTests.cs` covering constant bounds,
source-level verification of reset cap, counter clearing, and null
guard.
- All 2388 existing tests pass.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…een#344)

## Problem

`HasUsedToolsThisTurn` on `SessionState` is read by the watchdog timer
thread and SDK background threads while being written from the UI thread
and SDK event handlers. It had **inconsistent synchronization**: 5 sites
used `Volatile.Read/Write` while 16 used plain access.

On ARM (iOS/Android), plain writes from the UI thread may not be
immediately visible to background thread reads. This is a theoretical
data race that could cause the watchdog to use incorrect timeouts (120s
vs 600s).

Identified during code review of PR PureWeen#342 (INV-7 from the
processing-state-safety skill).

## Fix

- Declare the field `volatile bool` — all reads/writes automatically
have correct memory barriers
- Remove the 5 now-redundant `Volatile.Read/Write` calls (would generate
CS0420 warnings)
- Remove a duplicate write in the reconnect path (line ~2613 was
redundant with ~2639)
- Replace the `HasUsedToolsThisTurn_VolatileConsistency` test with
`HasUsedToolsThisTurn_IsDeclaredVolatile` that uses reflection to verify
the volatile modifier

## Testing

All 2401 tests pass. No CS0420 warnings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant