Fix watchdog stuck in 'tool running + server alive' state indefinitely #345
Closed
When the JSON-RPC connection dies mid-tool-execution but the server process stays alive, ActiveToolCallCount stays > 0 (zombie) because ToolExecutionCompleteEvent never arrives. The watchdog's TCP port check passes (server alive), so it keeps resetting the inactivity timer forever.

Fix: Add StaleToolLivenessChecks counter that increments each time the watchdog resets the timer in the 'tool running + server alive' state. If this counter exceeds WatchdogMaxStaleToolLivenessChecks (4 checks = ~60s) without any real events arriving to reset it, assume the JSON-RPC connection is dead and proceed to recovery instead of continuing to wait indefinitely.

The counter is reset to 0:
- When any real SDK event arrives (proves connection alive)
- On new prompt send (fresh turn)
- On abort/complete response (turn ended)
- On error handling paths

Includes comprehensive regression tests verifying:
- Constant exists and is reasonable (2-10 range)
- Field exists in SessionState
- Counter resets when real events arrive
- Counter increments in server-alive case
- Recovery triggered when limit exceeded
- Counter reset in all state cleanup paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closing as a duplicate of PR #342, which was already merged to main. The stuck session (Skill Validator-anthropic-evaluator) was running from worktree cf92f011 at commit 4dcf10f, which predates the merge of PR #342. PR #342 introduced WatchdogMaxToolAliveResets=3 and the WatchdogCaseAResets counter, which is the same fix I implemented here under different names (WatchdogMaxStaleToolLivenessChecks=4 and StaleToolLivenessChecks). The user just needs to rebuild PolyPilot from the latest main to get the fix.
Problem
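When the JSON-RPC connection dies mid-tool-execution but the server process stays alive, `ActiveToolCallCount` stays > 0 because `ToolExecutionCompleteEvent` never arrives, and the watchdog's TCP port check keeps passing, so the inactivity timer is reset indefinitely.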
Debug info from the bug report showed 607+ seconds of inactivity with the watchdog continuously resetting the timer.
Solution
Add a `StaleToolLivenessChecks` counter that tracks consecutive watchdog resets in the 'tool running + server alive' state without any real events arriving.
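The counter logic can be illustrated with a minimal Python sketch. This is not the code from this PR: the field and constant names are transliterated from the description, while `on_sdk_event`, `on_watchdog_timeout`, and the `"reset_timer"` / `"recover"` return values are hypothetical names introduced for illustration. The sketch also assumes the handler runs each time the inactivity timer fires, and it reads "exceeds" as a strict `>` comparison.

```python
from dataclasses import dataclass

# Limit quoted in the PR description (4 checks, roughly 60 s in total).
WATCHDOG_MAX_STALE_TOOL_LIVENESS_CHECKS = 4


@dataclass
class SessionState:
    # Number of tool calls believed to be in flight; stays > 0 in the zombie case.
    active_tool_call_count: int = 0
    # Consecutive timer resets in the 'tool running + server alive' state.
    stale_tool_liveness_checks: int = 0


def on_sdk_event(state: SessionState) -> None:
    """Hypothetical hook: any real event proves the connection is alive."""
    state.stale_tool_liveness_checks = 0


def on_watchdog_timeout(state: SessionState, server_alive: bool) -> str:
    """Hypothetical handler, assumed to run when the inactivity timer fires.

    Returns "reset_timer" to keep waiting or "recover" to start recovery.
    """
    if state.active_tool_call_count > 0 and server_alive:
        state.stale_tool_liveness_checks += 1
        if state.stale_tool_liveness_checks > WATCHDOG_MAX_STALE_TOOL_LIVENESS_CHECKS:
            # Too many resets with no real event: assume the connection is dead.
            state.stale_tool_liveness_checks = 0  # cleanup path resets the counter
            return "recover"
        return "reset_timer"
    # Assumption: every other case falls through to recovery in this sketch.
    return "recover"
```

With a zombie tool call (`active_tool_call_count=1`, server still reachable) and no events arriving, the first four calls to `on_watchdog_timeout` return `"reset_timer"` and the fifth returns `"recover"`. Calling `on_sdk_event` at any point in between restarts the count from zero.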
Testing
Files Changed