fix: reduce watchdog timeout for multi-agent sessions without tools #316
When a JSON-RPC connection dies (SocketException 10054), orchestrator sessions would previously wait up to 600 seconds (10 minutes) before the watchdog cleared IsProcessing. This is because all multi-agent sessions used WatchdogToolExecutionTimeoutSeconds regardless of tool activity.

Now multi-agent sessions WITHOUT active tools use a new moderate timeout of 180 seconds (3 minutes). This is:
- Long enough for legitimate model reasoning (typically 1-3 minutes)
- Short enough to not leave users waiting 10 minutes for dead connections
- Still shorter than the orchestration loop's own CancelAfter timeout

Sessions WITH tool activity (hasActiveTool or hasUsedTools) continue to use the full 600 second timeout since tools can legitimately run for many minutes.

The fix adds a new timeout tier:
1. Resume quiescence: 30s (no events since restart)
2. Standard inactivity: 120s (no tools, not multi-agent)
3. Multi-agent no-tool: 180s (multi-agent but no tool activity)
4. Tool execution: 600s (tools running or have been used)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
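The four tiers above can be read as a small decision function. The following is a minimal sketch in Python rather than the project's own C# code, which is not shown on this page. The tier values come from the commit message; the field names and, in particular, the order in which the checks are applied (tool activity first, then resume quiescence) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Tier values quoted in the commit message (seconds).
RESUME_QUIESCENCE_SECONDS = 30
STANDARD_INACTIVITY_SECONDS = 120
MULTI_AGENT_NO_TOOL_SECONDS = 180
TOOL_EXECUTION_SECONDS = 600


@dataclass(frozen=True)
class SessionState:
    # Hypothetical field names, loosely modelled on the flags named in the text.
    is_multi_agent: bool = False
    has_active_tool: bool = False
    has_used_tools: bool = False
    is_resumed: bool = False
    has_received_events: bool = False


def watchdog_timeout_seconds(state: SessionState) -> int:
    """Pick the watchdog timeout tier for a session.

    The precedence below is an assumption, not taken from the PR.
    """
    # Tools can legitimately run for many minutes, so tool activity
    # selects the longest tier.
    if state.has_active_tool or state.has_used_tools:
        return TOOL_EXECUTION_SECONDS
    # A resumed session that has seen no events since restart.
    if state.is_resumed and not state.has_received_events:
        return RESUME_QUIESCENCE_SECONDS
    # The new tier: multi-agent, but no tool activity.
    if state.is_multi_agent:
        return MULTI_AGENT_NO_TOOL_SECONDS
    return STANDARD_INACTIVITY_SECONDS
```

Under these assumptions, an orchestrator whose connection dies before any tool has run is released after 180 seconds instead of 600, while a session that has used tools keeps the full 600-second allowance.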
- InputValidationTests: Add Windows platform skips for symlink tests (Unix paths treated differently on Windows)
- MultiAgentRegressionTests: Increase substring search from 200→400 chars for ReconnectState_ShouldCarryIsMultiAgentSession test
- ProcessingWatchdogTests: Add comprehensive coverage for new timeout tier
  - 3 new InlineData rows for edge cases
  - 3 new named tests for clarity
  - Coverage for resumed+events, events-only, and tier transitions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 PR #316 Review: Multi-Model Consensus (5 models, 2+ agreement filter)

PR: fix: reduce watchdog timeout for multi-agent sessions without tools

Summary
4/5 models reviewed (gemini still running at synthesis time). No real bugs found by consensus.

Logic Verification: All Invariants Hold ✅
All four models traced every multi-agent flag combination:
The key invariant:

Is 180s Safe for Long Text Generation? ✅
The concern: an orchestrator generating a long text response (no tools) could be killed at 180s. This is NOT an issue because:
Test Coverage ✅
Tests are thorough:
Housekeeping (non-blocking)
🟢 MINOR (flagged by 2/4 models; no runtime impact)

Recommended Action: ✅ Approve
The four-tier timeout logic is correct, all invariants are preserved, test coverage is excellent, and the fix addresses a real user-visible problem (10-minute stuck orchestrators after dead connections). The only item is a stale comment that doesn't affect behavior.
CRITICAL BUG: If RestorePreviousSessionsAsync() threw an exception or hit the 'break' statement at line 420, IsRestoring was never reset to false. This left the entire UI unresponsive: Resume buttons disabled, session interactions blocked.

Root cause: IsRestoring=false was inside the inner try block, so it was skipped when:
- An exception occurred before reaching line 426
- The 'break' statement at line 420 exited the loop early
- The outer catch block (line 429) didn't reset IsRestoring

Fix: Use a finally block to GUARANTEE IsRestoring=false is called, even when restore fails. This follows the same pattern as other critical state cleanup in the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
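The cleanup pattern described in this commit can be illustrated with a small sketch. It is written in Python, not the project's C#, and every name in it (SessionRestorer, restore_previous_sessions, the None sentinel that triggers the early exit) is hypothetical. It only demonstrates that a finally block resets the flag on all three paths: normal completion, an early break, and an exception.

```python
class SessionRestorer:
    """Hypothetical stand-in for the service that owns the IsRestoring flag."""

    def __init__(self):
        self.is_restoring = False
        self.restored = []

    def restore_previous_sessions(self, saved_entries, load):
        self.is_restoring = True
        try:
            for entry in saved_entries:
                if entry is None:
                    # Early exit, comparable to the 'break' in the commit message.
                    break
                # load() may raise; the exception propagates to the caller.
                self.restored.append(load(entry))
        finally:
            # Runs on normal completion, on break, and on exceptions alike.
            self.is_restoring = False
```

Placing the reset in finally rather than at the end of the try body is what removes the dependence on control flow reaching one particular line.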
Defense-in-depth fix for the IsRestoring stuck bug. This wraps the outer call site (InitializeAsync) in try-finally in addition to the inner finally block inside RestorePreviousSessionsAsync.

If RestorePreviousSessionsAsync throws before setting IsRestoring=false in its own finally block, the outer finally ensures the UI isn't stuck.

Found by code review agent cross-referencing with processing-state-safety skill invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
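As a rough illustration of the outer guard, the sketch below (Python, with hypothetical names; the real call site is the C# InitializeAsync, which is not shown here) wraps the restore call in its own try-finally. The callee deliberately sets the flag and fails without cleaning up, so the only thing that resets it is the caller's finally.

```python
class UiState:
    """Hypothetical holder for the flag that gates the Resume buttons."""

    def __init__(self):
        self.is_restoring = False


def initialize(state, restore_previous_sessions):
    # Outer guard: even if the callee leaves the flag set, reset it here.
    try:
        restore_previous_sessions(state)
    finally:
        state.is_restoring = False


def leaky_restore(state):
    # Simulates a restore routine that fails without resetting the flag.
    state.is_restoring = True
    raise RuntimeError("restore failed before cleanup")
```

The exception still propagates out of initialize in this sketch; whether the real call site swallows or rethrows it is not stated in the commit message.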
🔧 Status Update
PR #302 was merged and created 3 merge conflicts with this PR:
The multi-agent-no-tool timeout tier ( |
Closing as superseded by PR #302 (merged). The smart Case A watchdog (events.jsonl mtime check) and |
Problem
Orchestrator sessions were stuck at IsProcessing=True for up to 10 minutes after a connection error (SocketException 10054).
Root Cause
Multi-agent sessions always used the 600-second timeout regardless of whether tools were active.
Solution
Add a new 180-second timeout tier for multi-agent sessions WITHOUT active tools:
Testing
All 133 ProcessingWatchdogTests pass.