
Fix watchdog stuck in 'tool running + server alive' state indefinitely #345

Closed
PureWeen wants to merge 1 commit into main from fix/session-skill-validator-anthropic-evalua-20260310-1950


@PureWeen
Owner

Problem

When the JSON-RPC connection dies mid-tool-execution but the server process stays alive:

  1. Tool starts → `ActiveToolCallCount` incremented to 1
  2. JSON-RPC connection dies → `ToolExecutionCompleteEvent` never arrives
  3. `ActiveToolCallCount` stays > 0 (zombie count)
  4. Watchdog checks: `hasActiveTool = (ActiveToolCallCount > 0)` evaluates to TRUE
  5. Watchdog checks: `serverAlive = _serverManager.IsServerRunning` evaluates to TRUE (the TCP port check passes)
  6. Watchdog logs: 'tool is running and server is alive — resetting timer'
  7. Loop repeats indefinitely → session stuck forever showing 'Thinking...'

Debug info from the bug report showed 607+ seconds of inactivity with the watchdog continuously resetting the timer.
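
The steps above can be sketched as a single decision function. This is only an illustration of the described behavior, not the PolyPilot source: `WatchdogAction`, `PreFixWatchdog`, and `Decide` are hypothetical names, and the fall-through to recovery for every other state is an assumption made to keep the sketch small.

```csharp
// Hypothetical sketch of the pre-fix watchdog decision; not the actual PolyPilot code.
public enum WatchdogAction { ResetTimer, Recover }

public static class PreFixWatchdog
{
    // Called each time the inactivity check fires.
    public static WatchdogAction Decide(int activeToolCallCount, bool serverAlive)
    {
        bool hasActiveTool = activeToolCallCount > 0;

        // A zombie count (> 0) combined with a passing TCP port check always
        // lands here, so a dead JSON-RPC connection is never detected.
        if (hasActiveTool && serverAlive)
            return WatchdogAction.ResetTimer;

        // Assumption: every other state proceeds to recovery.
        return WatchdogAction.Recover;
    }
}
```

Because neither input changes once the connection is dead, every call returns `ResetTimer`, which matches the continuous timer resets seen in the bug report.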

Solution

Add a `StaleToolLivenessChecks` counter that tracks consecutive watchdog resets in the 'tool running + server alive' state without any real events arriving.

  • When the watchdog sees 'tool running + server alive', it increments this counter before resetting the timer
  • If the counter exceeds `WatchdogMaxStaleToolLivenessChecks` (4 checks ≈ 60 seconds), the watchdog assumes the JSON-RPC connection is dead and proceeds to recovery
  • The counter resets to 0 when:
    • Any real SDK event arrives (proves connection is alive)
    • A new prompt is sent (fresh turn)
    • Session is aborted/completed
    • Error paths clear state
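
A minimal sketch of the counter logic described above, under stated assumptions. The names `StaleToolLivenessChecks`, `WatchdogMaxStaleToolLivenessChecks`, `ActiveToolCallCount`, and `SessionState` come from this description; the `Watchdog` class, the method names, and the shape of `SessionState` are hypothetical. The description says the counter must "exceed" the limit, so the sketch uses a strict `>`; the real comparison may differ.

```csharp
// Hypothetical sketch of the stale-liveness counter; not the actual PolyPilot code.
public sealed class SessionState
{
    public int ActiveToolCallCount;
    public int StaleToolLivenessChecks;
}

public static class Watchdog
{
    public const int WatchdogMaxStaleToolLivenessChecks = 4;

    // Returns true when the watchdog should proceed to recovery,
    // false when it should reset the inactivity timer and keep waiting.
    public static bool OnInactivityCheck(SessionState state, bool serverAlive)
    {
        bool hasActiveTool = state.ActiveToolCallCount > 0;

        if (hasActiveTool && serverAlive)
        {
            // Count this reset before deciding whether to wait again.
            state.StaleToolLivenessChecks++;

            // Too many consecutive resets with no real event in between:
            // assume the JSON-RPC connection is dead.
            return state.StaleToolLivenessChecks > WatchdogMaxStaleToolLivenessChecks;
        }

        // Assumption: every other state proceeds to recovery.
        return true;
    }

    // Any real SDK event proves the connection is alive, so the count starts over.
    // The same reset would apply on a new prompt, abort/complete, and error paths.
    public static void OnSdkEvent(SessionState state)
    {
        state.StaleToolLivenessChecks = 0;
    }
}
```

With a strict `>` and a limit of 4, the first four consecutive checks reset the timer and the fifth triggers recovery. The "≈ 60 seconds" figure implies roughly 15 seconds between checks, though the interval itself is not stated here.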

Testing

  • Added 10 new regression tests in `ProcessingWatchdogTests.cs`
  • All 2401 existing tests pass
  • Windows build verified

Files Changed

  • `CopilotService.cs`: Added `StaleToolLivenessChecks` field to `SessionState`, reset in all state cleanup paths
  • `CopilotService.Events.cs`: Added constant, increment/check logic in watchdog, reset on real events
  • `ProcessingWatchdogTests.cs`: 10 new tests covering the fix

When the JSON-RPC connection dies mid-tool-execution but the server process
stays alive, ActiveToolCallCount stays > 0 (zombie) because ToolExecutionCompleteEvent
never arrives. The watchdog's TCP port check passes (server alive), so it keeps
resetting the inactivity timer forever.

Fix: Add StaleToolLivenessChecks counter that increments each time the watchdog
resets the timer in the 'tool running + server alive' state. If this counter
exceeds WatchdogMaxStaleToolLivenessChecks (4 checks = ~60s) without any real
events arriving to reset it, assume the JSON-RPC connection is dead and proceed
to recovery instead of continuing to wait indefinitely.

The counter is reset to 0:
- When any real SDK event arrives (proves connection alive)
- On new prompt send (fresh turn)
- On abort/complete response (turn ended)
- On error handling paths

Includes comprehensive regression tests verifying:
- Constant exists and is reasonable (2-10 range)
- Field exists in SessionState
- Counter resets when real events arrive
- Counter increments in server-alive case
- Recovery triggered when limit exceeded
- Counter reset in all state cleanup paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

Closing as duplicate of PR #342 which was already merged to main.

The stuck session (Skill Validator-anthropic-evaluator) was running from worktree cf92f011 at commit 4dcf10f, which is before PR #342 was merged.

PR #342 introduced WatchdogMaxToolAliveResets=3 and a WatchdogCaseAResets counter, which is the same fix I implemented here under different names (WatchdogMaxStaleToolLivenessChecks=4 and StaleToolLivenessChecks).

The user just needs to rebuild PolyPilot from the latest main to get the fix.

@PureWeen PureWeen closed this Mar 10, 2026