fix: classify agent JSON-RPC errors as application errors, not transport #539
Merged
When sprout-agent returns a JSON-RPC error response (e.g. transient LLM failure), the harness classified it as AcpError::Protocol, a transport-class error that triggers agent respawn. But the stdio pipe is intact and the agent is healthy; only the upstream LLM call failed.
Add AcpError::AgentError for well-formed JSON-RPC error responses from the agent. This variant is not in the transport-error match in handle_prompt_result(), so the agent is returned to the pool instead of being killed and respawned.
Before: transient LLM timeout → agent respawn → session loss for all channels
After: transient LLM timeout → agent returned to pool → next prompt works
This was referenced May 28, 2026
Problem
When sprout-agent returns a JSON-RPC error response for a recoverable failure (e.g. transient LLM timeout, malformed provider response), sprout-acp classifies it as AcpError::Protocol, a transport-class error, and respawns the entire agent process. This destroys all session state across every channel on that agent slot and burns crash-circuit budget.
But the agent is fine. The stdio pipe is intact and the protocol worked correctly; the agent simply reported that its upstream LLM call failed. This is an application-level error, not transport corruption.
Before: transient LLM timeout → agent respawn → session loss for all channels
After: transient LLM timeout → agent returned to pool → next prompt works
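The layering argument above can be sketched in a few lines of Rust. This is only an illustration, not code from sprout-acp: `Reply`, `Failure`, and `classify` are hypothetical names, and the error code in the usage note is an arbitrary value from the JSON-RPC implementation-defined range. The point is that a JSON-RPC error reply arrives as a successfully read message, while a transport failure means no message could be read at all.

```rust
use std::io;

/// Hypothetical, simplified view of one JSON-RPC reply read from the agent's stdout.
#[derive(Debug, PartialEq)]
enum Reply {
    /// Shape: {"jsonrpc":"2.0","id":..,"result":..}
    Result(String),
    /// Shape: {"jsonrpc":"2.0","id":..,"error":{"code":..,"message":..}}
    Error { code: i64, message: String },
}

/// Which layer a failure belongs to (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Failure {
    /// The pipe itself failed (EOF, broken pipe): the agent process is unusable.
    Transport,
    /// The pipe and protocol worked; the agent reported that the request failed.
    Application,
}

/// Classify the outcome of reading one reply from the agent.
/// Returns `None` when the reply is a normal result.
fn classify(read: &io::Result<Reply>) -> Option<Failure> {
    match read {
        // Nothing could be read: this is the only case that implicates the pipe.
        Err(_) => Some(Failure::Transport),
        // A well-formed error reply proves the pipe and framing are intact.
        Ok(Reply::Error { .. }) => Some(Failure::Application),
        Ok(Reply::Result(_)) => None,
    }
}
```

Under these assumptions, a reply such as `Reply::Error { code: -32000, message: "upstream LLM timeout".into() }` classifies as `Failure::Application`, whereas an `io::Error` with `ErrorKind::UnexpectedEof` classifies as `Failure::Transport`.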
Root Cause
Two sites in acp.rs (read_until_response and read_until_response_with_idle_timeout) convert any JSON-RPC error response into AcpError::Protocol.
Then handle_prompt_result() in main.rs classifies Protocol as transport-fatal.
Fix
Add AcpError::AgentError(String) for well-formed JSON-RPC error responses from the agent. This variant is not in the transport-error match, so the agent is returned to the pool instead of being respawned.
All 14 other AcpError::Protocol construction sites (missing stdin/stdout handles, missing sessionId, oversized lines, malformed permission requests, unknown stopReason, etc.) remain correctly classified as genuine protocol violations.
Changes
acp.rs: Add AcpError::AgentError variant; use it at both msg.get("error") sites
main.rs or pool.rs: no changes; the new variant naturally falls through to the existing application-error path ("pipe intact, return agent to pool")
5 lines added, 2 changed.
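A minimal sketch of how the change plays out, assuming a much-reduced error type. Only `AcpError::Protocol`, `AcpError::AgentError(String)`, and the name `handle_prompt_result` come from the description above; the `Io` variant, the `Disposition` enum, `is_transport_error`, `error_from_response`, and all signatures are hypothetical stand-ins, and the real transport-error match may list different variants.

```rust
/// Reduced sketch of the error type. The real enum has more variants.
#[derive(Debug, PartialEq)]
enum AcpError {
    /// Hypothetical stand-in for pipe-level failures.
    Io(String),
    /// Genuine protocol violation (e.g. missing sessionId, oversized line).
    Protocol(String),
    /// Well-formed JSON-RPC error response from the agent (the new variant).
    AgentError(String),
}

/// Hypothetical outcome of handling a prompt result.
#[derive(Debug, PartialEq)]
enum Disposition {
    /// Kill the agent process and start a new one.
    Respawn,
    /// Pipe intact: return the agent to the pool.
    ReturnToPool,
}

/// Assumed shape of the transport-error match. `AgentError` is deliberately
/// absent, so it falls through to the application-error path.
fn is_transport_error(err: &AcpError) -> bool {
    matches!(err, AcpError::Io(_) | AcpError::Protocol(_))
}

/// Sketch of the decision made after a prompt completes or fails.
fn handle_prompt_result(result: &Result<String, AcpError>) -> Disposition {
    match result {
        Err(e) if is_transport_error(e) => Disposition::Respawn,
        // Success and application-level errors both keep the agent alive.
        _ => Disposition::ReturnToPool,
    }
}

/// Hypothetical helper mirroring the two `msg.get("error")` sites: a present
/// error field now maps to `AgentError` instead of `Protocol`.
fn error_from_response(error_message: Option<&str>) -> Option<AcpError> {
    error_message.map(|m| AcpError::AgentError(m.to_string()))
}
```

In this sketch, an error built by `error_from_response(Some("LLM request timed out"))` leads `handle_prompt_result` to `Disposition::ReturnToPool`, while `AcpError::Protocol("missing sessionId".into())` still leads to `Disposition::Respawn`. Keeping the new variant out of the transport match, rather than adding a special case to the handler, is what lets the change stay confined to acp.rs.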