Skip to content

fix: reduce watchdog timeout for multi-agent sessions without tools#316

Closed
PureWeen wants to merge 4 commits intomainfrom
fix/session-implement--challenge-orchestrato-20260308-2332
Closed

fix: reduce watchdog timeout for multi-agent sessions without tools#316
PureWeen wants to merge 4 commits intomainfrom
fix/session-implement--challenge-orchestrato-20260308-2332

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 8, 2026

Problem

Orchestrator sessions were stuck at IsProcessing=True for up to 10 minutes after a connection error (SocketException 10054).

Root Cause

Multi-agent sessions always used the 600-second timeout regardless of whether tools were active.

Solution

Add a new 180-second timeout tier for multi-agent sessions WITHOUT active tools:

  • Resume quiescence: 30s
  • Standard inactivity: 120s
  • Multi-agent no-tool: 180s (new)
  • Tool execution: 600s

Testing

All 133 ProcessingWatchdogTests pass.

PureWeen and others added 2 commits March 8, 2026 18:53
When a JSON-RPC connection dies (SocketException 10054), orchestrator
sessions would previously wait up to 600 seconds (10 minutes) before
the watchdog cleared IsProcessing. This is because all multi-agent
sessions used WatchdogToolExecutionTimeoutSeconds regardless of tool
activity.

Now multi-agent sessions WITHOUT active tools use a new moderate
timeout of 180 seconds (3 minutes). This is:
- Long enough for legitimate model reasoning (typically 1-3 minutes)
- Short enough to not leave users waiting 10 minutes for dead connections
- Still shorter than the orchestration loop's own CancelAfter timeout

Sessions WITH tool activity (hasActiveTool or hasUsedTools) continue
to use the full 600 second timeout since tools can legitimately run
for many minutes.

The fix adds a new timeout tier:
1. Resume quiescence: 30s (no events since restart)
2. Standard inactivity: 120s (no tools, not multi-agent)
3. Multi-agent no-tool: 180s (multi-agent but no tool activity)
4. Tool execution: 600s (tools running or have been used)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- InputValidationTests: Add Windows platform skips for symlink tests
  (Unix paths treated differently on Windows)

- MultiAgentRegressionTests: Increase substring search from 200→400 chars
  for ReconnectState_ShouldCarryIsMultiAgentSession test

- ProcessingWatchdogTests: Add comprehensive coverage for new timeout tier
  - 3 new InlineData rows for edge cases
  - 3 new named tests for clarity
  - Coverage for resumed+events, events-only, and tier transitions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Mar 9, 2026

🔍 PR #316 Review — Multi-Model Consensus (5 models, 2+ agreement filter)

PR: fix: reduce watchdog timeout for multi-agent sessions without tools
CI: No checks reported on this branch
Existing reviews: None (first review)


Summary

4/5 models reviewed (gemini still running at synthesis time). No real bugs found by consensus.


Logic Verification: All Invariants Hold ✅

All four models traced every multi-agent flag combination:

Scenario Old timeout New timeout Correct?
Multi-agent + active tool 600s 600s
Multi-agent + hasUsedTools 600s 600s
Multi-agent + resumed + events 600s 600s
Multi-agent + resumed + no events 30s 30s
Multi-agent + no tools + not resumed 600s 180s ✅ (intended)
Standard session (non-multi-agent) 120s 120s ✅ (unchanged)

The key invariant: useToolTimeout and useMultiAgentNoToolTimeout are mutually exclusive by construction


Is 180s Safe for Long Text Generation? ✅

The concern: an orchestrator generating a long text response (no tools) could be killed at 180s. This is NOT an issue because:

  • elapsed measures time since the last SDK event, not total turn time
  • Streaming delta events (AssistantMessageDeltaEvent) reset state.LastEventAtTicks on every token
  • 180s only fires when 3 minutes of complete silence (no events of any kind) — indicating a genuinely dead connection (SocketException 10054 scenario in the PR description)
  • PR Fix multi-agent bridge bugs and watchdog timeout for workers #195 regression (120s too short for multi-agent) is preserved: new minimum is 180s, not 120s

Test Coverage ✅

Tests are thorough:

  • New WatchdogMultiAgentNoToolTimeout_IsReasonable validates the constant is in range (120, 300)
  • Exhaustive matrix fully updated with 4 new rows covering the new tier
  • Regression tests for PR#195 updated with correct assertions
  • ComputeEffectiveTimeout helper is structurally identical to production logic (3/3 models confirmed)
  • Transition test (180s → 600s when tool starts) covers the mid-turn tier change

Housekeeping (non-blocking)

🟢 MINORCopilotService.Events.cs ~1371-1381 — the existing comment block above useToolTimeout still lists "4. Session is in a multi-agent group" as a reason for using 600s. This item is now stale — the new code at useMultiAgentNoToolTimeout correctly says 180s, but the old comment above still says 600s. A reader going top-to-bottom will see contradictory statements. Suggest removing item 4 from the old comment, since the new comment block below it is accurate.

(Flagged by 2/4 models; no runtime impact)


Recommended Action: ✅ Approve

The four-tier timeout logic is correct, all invariants are preserved, test coverage is excellent, and the fix addresses a real user-visible problem (10-minute stuck orchestrators after dead connections). The only item is a stale comment that doesn't affect behavior.

PureWeen and others added 2 commits March 8, 2026 19:23
CRITICAL BUG: If RestorePreviousSessionsAsync() threw an exception or
hit the 'break' statement at line 420, IsRestoring was never reset to
false. This left the entire UI unresponsive - Resume buttons disabled,
session interactions blocked.

Root cause: IsRestoring=false was inside the inner try block, so it was
skipped when:
- An exception occurred before reaching line 426
- The 'break' statement at line 420 exited the loop early
- The outer catch block (line 429) didn't reset IsRestoring

Fix: Use a finally block to GUARANTEE IsRestoring=false is called, even
when restore fails. This follows the same pattern as other critical
state cleanup in the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Defense-in-depth fix for the IsRestoring stuck bug. This wraps
the outer call site (InitializeAsync) in try-finally in addition
to the inner finally block inside RestorePreviousSessionsAsync.

If RestorePreviousSessionsAsync throws before setting IsRestoring=false
in its own finally block, the outer finally ensures the UI isn't stuck.

Found by code review agent cross-referencing with processing-state-safety
skill invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔧 Status Update

PR #302 was merged and created 3 merge conflicts with this PR:

  • CopilotService.Events.cs (watchdog timeout logic vs new escalation architecture)
  • ProcessingWatchdogTests.cs (test additions)
  • MultiAgentRegressionTests.cs (test additions)

The multi-agent-no-tool timeout tier (WatchdogMultiAgentNoToolTimeoutSeconds = 180) is still valuable alongside #302's tool-health escalation. Merging main and resolving conflicts now.

@dinisdopion-sys
Copy link
Copy Markdown

Проблема

После ошибки подключения (SocketException 10054) сеансы Orchestrator зависали в состоянии IsProcessing=True на срок до 10 минут.

Первопричина

В многоагентных сессиях всегда использовался 600-секундный тайм-аут, независимо от того, были ли активны какие-либо инструменты.

Решение

Добавлен новый уровень тайм-аута в 180 секунд для многоагентных сессий БЕЗ активных инструментов:

  • Возобновление периода покоя: 30 секунд
  • Стандартный период бездействия: 120 секунд
  • Многоагентный режим без инструментов: 180 секунд (новый)
  • Выполнение инструмента: 600 с

Тестирование

Все 133 теста ProcessingWatchdogTest пройдены успешно.

@PureWeen
Copy link
Copy Markdown
Owner Author

Closing as superseded by PR #302 (merged). The smart Case A watchdog (events.jsonl mtime check) and WatchdogUsedToolsIdleTimeoutSeconds = 180 already on main handle this more adaptively than a fixed timeout tier. Stale sessions are now killed in ~75s via events.jsonl staleness detection, and the 180s idle-after-tools timeout is already defined.

@PureWeen PureWeen closed this Mar 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants