Conversation
PR #342 Review — Fix stuck sessions: cap watchdog Case A resets for dead connectionsCI Status: No CI checks on this branch (all 2388 existing tests pass per PR description) The core fix logic is correct: capping WatchdogCaseAResets at 3 prevents the infinite-reset loop, the counter resets on real events and at StartProcessingWatchdog, and all accesses are via Interlocked.* for correct thread safety. MINOR - Test off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime (5/5 consensus)PolyPilot.Tests/ProcessingWatchdogTests.cs The test asserts: WatchdogMaxToolAliveResets * WatchdogToolExecutionTimeoutSeconds < WatchdogMaxProcessingTimeSeconds = 3 x 600 = 1800 < 3600. But the fire happens on the 4th timer trigger (resets = Increment -> 4; 4 > 3 -> terminate), after 3 full reset cycles of 600s each. Actual max stuck time is (WatchdogMaxToolAliveResets + 1) x 600 = 2400s. Today this is safe (2400 < 3600). But the test creates false confidence: if WatchdogMaxToolAliveResets is bumped to 5, the test passes (5 x 600 = 3000 < 3600) while actual max is 6 x 600 = 3600s - equal to WatchdogMaxProcessingTimeSeconds. The cap would provide zero protection. Fix: Change the assertion to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. MINOR - Fragile 300-char proximity window in CaseA_ResetCounter_ClearedOnRealEvents_InSource (2/5 consensus)PolyPilot.Tests/ProcessingWatchdogTests.cs The source.Substring(lastEventResetIdx, 300).Contains("WatchdogCaseAResets") proximity check is fragile: a future code change adding a comment or log line between the LastEventAtTicks reset and WatchdogCaseAResets reset could break this test without affecting correctness. Below consensus - for author awareness(2/5: gemini + sonnet) The new !startedAt.HasValue guard sets exceededMaxTime=true when ProcessingStartedAt is null. If Case B (the HasUsedToolsThisTurn / zero-idle path introduced in PR #332) is also gated by !exceededMaxTime, a restored session with null ProcessingStartedAt and HasUsedToolsThisTurn=true would skip Case B and fall to a stuck warning + forced CompleteResponse instead of a clean CompleteResponse. Reachability depends on whether RestoreSingleSessionAsync sets ProcessingStartedAt = DateTime.UtcNow for processing sessions (per INV-9, it should). Worth confirming. Recommended ActionRequest changes - one mechanical fix: correct the off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. The core bug fix is correct and safe today, but the imprecise test guard would pass silently if the constant is raised to a value that defeats the cap. |
When a session's JSON-RPC connection dies mid-tool-execution (ConnectionLostException), the SDK never emits SessionErrorEvent so IsProcessing stays true. The watchdog's Case A logic (tool active + server alive) kept resetting LastEventAtTicks indefinitely because the persistent server is shared and stays alive serving other sessions. Combined with ProcessingStartedAt resetting to DateTime.UtcNow on every app restart, the 60-minute absolute max safety net never fires. Fix: - Add WatchdogCaseAResets counter to SessionState, capped at 3 resets (WatchdogMaxToolAliveResets). After 3 consecutive resets (~30 min) with no real SDK events, assume the session connection is dead and fall through to the timeout handler. - Reset the counter when real SDK events arrive (HandleSessionEvent), proving the connection is alive. - Add defensive ProcessingStartedAt null guard: if null while IsProcessing=true, treat as exceededMaxTime so Case A cannot loop forever. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The reset counter is incremented before the comparison (Increment returns new value), so the cap fires on the (N+1)th trigger. Actual max stuck time is (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. Without this fix, bumping WatchdogMaxToolAliveResets to 5 would pass the test (5*600=3000 < 3600) while actual max is 6*600=3600 — equal to the absolute max, providing zero protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a reconnect in SendPromptAsync, the new SessionState was inheriting HasUsedToolsThisTurn=true from the dead connection, inflating the watchdog timeout from 120s (inactivity) to 600s (tool execution). Additionally, ProcessingStartedAt was not reset, so the 60-min max-time safety net measured from the original send time rather than reconnect time. Changes: - Reset HasUsedToolsThisTurn=false in reconnect block (connection-specific) - Reset ActiveToolCallCount=0 and SuccessfulToolCountThisTurn=0 - Reset ProcessingStartedAt=DateTime.UtcNow for fresh max-time deadline - Add 3 source-verification tests for reconnect state resets - Update MultiAgentRegressionTests to match new reconnect behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
474341b to
fbcee03
Compare
When Case A reset cap is exceeded, the log now accurately says reset cap exceeded instead of falling through to server not responding. Also fix doc comment from 30 to 40 minutes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #342 Review — Fix stuck sessions: cap watchdog Case A resets for dead connectionsCI Status: No CI checks on this branch (all 2388 existing tests pass per PR description) The core fix logic is correct: capping WatchdogCaseAResets at 3 prevents the infinite-reset loop, the counter resets on real events and at StartProcessingWatchdog, and all accesses are via Interlocked.* for correct thread safety. MINOR - Test off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime (5/5 consensus)PolyPilot.Tests/ProcessingWatchdogTests.cs The test asserts: WatchdogMaxToolAliveResets * WatchdogToolExecutionTimeoutSeconds < WatchdogMaxProcessingTimeSeconds = 3 x 600 = 1800 < 3600. But the fire happens on the 4th timer trigger (resets = Increment -> 4; 4 > 3 -> terminate), after 3 full reset cycles of 600s each. Actual max stuck time is (WatchdogMaxToolAliveResets + 1) x 600 = 2400s. Today this is safe (2400 < 3600). But the test creates false confidence: if WatchdogMaxToolAliveResets is bumped to 5, the test passes (5 x 600 = 3000 < 3600) while actual max is 6 x 600 = 3600s - equal to WatchdogMaxProcessingTimeSeconds. The cap would provide zero protection. Fix: Change the assertion to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. MINOR - Fragile 300-char proximity window in CaseA_ResetCounter_ClearedOnRealEvents_InSource (2/5 consensus)PolyPilot.Tests/ProcessingWatchdogTests.cs The source.Substring(lastEventResetIdx, 300).Contains("WatchdogCaseAResets") proximity check is fragile: a future code change adding a comment or log line between the LastEventAtTicks reset and WatchdogCaseAResets reset could break this test without affecting correctness. Below consensus - for author awareness(2/5: gemini + sonnet) The new !startedAt.HasValue guard sets exceededMaxTime=true when ProcessingStartedAt is null. If Case B (the HasUsedToolsThisTurn / zero-idle path introduced in PR #332) is also gated by !exceededMaxTime, a restored session with null ProcessingStartedAt and HasUsedToolsThisTurn=true would skip Case B and fall to a stuck warning + forced CompleteResponse instead of a clean CompleteResponse. Reachability depends on whether RestoreSingleSessionAsync sets ProcessingStartedAt = DateTime.UtcNow for processing sessions (per INV-9, it should). Worth confirming. Recommended ActionRequest changes - one mechanical fix: correct the off-by-one in WatchdogMaxToolAliveResets_BoundsMaxStuckTime to use (WatchdogMaxToolAliveResets + 1) * WatchdogToolExecutionTimeoutSeconds. The core bug fix is correct and safe today, but the imprecise test guard would pass silently if the constant is raised to a value that defeats the cap. |
## Problem `HasUsedToolsThisTurn` on `SessionState` is read by the watchdog timer thread and SDK background threads while being written from the UI thread and SDK event handlers. It had **inconsistent synchronization**: 5 sites used `Volatile.Read/Write` while 16 used plain access. On ARM (iOS/Android), plain writes from the UI thread may not be immediately visible to background thread reads. This is a theoretical data race that could cause the watchdog to use incorrect timeouts (120s vs 600s). Identified during code review of PR #342 (INV-7 from the processing-state-safety skill). ## Fix - Declare the field `volatile bool` — all reads/writes automatically have correct memory barriers - Remove the 5 now-redundant `Volatile.Read/Write` calls (would generate CS0420 warnings) - Remove a duplicate write in the reconnect path (line ~2613 was redundant with ~2639) - Replace the `HasUsedToolsThisTurn_VolatileConsistency` test with `HasUsedToolsThisTurn_IsDeclaredVolatile` that uses reflection to verify the volatile modifier ## Testing All 2401 tests pass. No CS0420 warnings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #342 — Round 2 Re-review: "Fix stuck sessions: cap watchdog Case A resets for dead connections"New commits since Round 1: Previous Findings Status
New FindingsNone. The updated diff is clean. Recommended Action✅ Approve — all Round 1 findings resolved, no new issues. The reconnect stale-state fix ( Re-review by Worker 3 (claude-sonnet-4.6) |
…ureWeen#342) ## Bug Sessions get stuck with `IsProcessing=true` for hours when the JSON-RPC connection dies mid-tool-execution. Both `CI-Investigate` and `CopilotImprovements` sessions exhibited this. ## Root Cause The watchdog's **Case A** logic (tool active + server alive) resets `LastEventAtTicks` indefinitely when: 1. `ActiveToolCallCount > 0` — stuck because `ToolExecutionCompleteEvent` never arrived after `ConnectionLostException` 2. Server is alive — the persistent server is shared and serves other sessions fine 3. `ProcessingStartedAt` resets to `DateTime.UtcNow` on every app restart, so the 60-min absolute max safety net keeps getting pushed forward ## Fix - **Cap Case A resets** via `WatchdogMaxToolAliveResets = 3`. After 3 consecutive resets (~30 min) with no real SDK events, assume the session's connection is dead and fall through to the timeout handler. - **Reset counter on real events** in `HandleSessionEvent` — proves the connection is alive. - **Defensive `ProcessingStartedAt` null guard** — if null while `IsProcessing=true`, treat as `exceededMaxTime` so Case A can't loop forever. ## Testing - 7 new tests in `ProcessingWatchdogTests.cs` covering constant bounds, source-level verification of reset cap, counter clearing, and null guard. - All 2388 existing tests pass. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…een#344) ## Problem `HasUsedToolsThisTurn` on `SessionState` is read by the watchdog timer thread and SDK background threads while being written from the UI thread and SDK event handlers. It had **inconsistent synchronization**: 5 sites used `Volatile.Read/Write` while 16 used plain access. On ARM (iOS/Android), plain writes from the UI thread may not be immediately visible to background thread reads. This is a theoretical data race that could cause the watchdog to use incorrect timeouts (120s vs 600s). Identified during code review of PR PureWeen#342 (INV-7 from the processing-state-safety skill). ## Fix - Declare the field `volatile bool` — all reads/writes automatically have correct memory barriers - Remove the 5 now-redundant `Volatile.Read/Write` calls (would generate CS0420 warnings) - Remove a duplicate write in the reconnect path (line ~2613 was redundant with ~2639) - Replace the `HasUsedToolsThisTurn_VolatileConsistency` test with `HasUsedToolsThisTurn_IsDeclaredVolatile` that uses reflection to verify the volatile modifier ## Testing All 2401 tests pass. No CS0420 warnings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug
Sessions get stuck with
IsProcessing=truefor hours when the JSON-RPC connection dies mid-tool-execution. BothCI-InvestigateandCopilotImprovementssessions exhibited this.Root Cause
The watchdog's Case A logic (tool active + server alive) resets
LastEventAtTicksindefinitely when:ActiveToolCallCount > 0— stuck becauseToolExecutionCompleteEventnever arrived afterConnectionLostExceptionProcessingStartedAtresets toDateTime.UtcNowon every app restart, so the 60-min absolute max safety net keeps getting pushed forwardFix
WatchdogMaxToolAliveResets = 3. After 3 consecutive resets (~30 min) with no real SDK events, assume the session's connection is dead and fall through to the timeout handler.HandleSessionEvent— proves the connection is alive.ProcessingStartedAtnull guard — if null whileIsProcessing=true, treat asexceededMaxTimeso Case A can't loop forever.Testing
ProcessingWatchdogTests.cscovering constant bounds, source-level verification of reset cap, counter clearing, and null guard.