Skip to content

fix: classify agent JSON-RPC errors as application errors, not transport#539

Merged
tlongwell-block merged 2 commits into
mainfrom
fix/acp-agent-error-classification
May 11, 2026
Merged

fix: classify agent JSON-RPC errors as application errors, not transport#539
tlongwell-block merged 2 commits into
mainfrom
fix/acp-agent-error-classification

Conversation

@tlongwell-block
Copy link
Copy Markdown
Collaborator

Problem

When sprout-agent returns a JSON-RPC error response for a recoverable failure (e.g. transient LLM timeout, malformed provider response), sprout-acp classifies it as AcpError::Protocol — a transport-class error — and respawns the entire agent process. This destroys all session state across every channel on that agent slot and burns crash-circuit budget.

But the agent is fine. The stdio pipe is intact, the protocol worked correctly, the agent simply reported that its upstream LLM call failed. This is an application-level error, not transport corruption.

Before: transient LLM timeout → agent respawn → session loss for all channels
After: transient LLM timeout → agent returned to pool → next prompt works

Root Cause

Two sites in acp.rs (read_until_response and read_until_response_with_idle_timeout) convert any JSON-RPC error response into AcpError::Protocol:

if let Some(error) = msg.get("error") {
    return Err(AcpError::Protocol(error.to_string()));
}

Then handle_prompt_result() in main.rs classifies Protocol as transport-fatal:

let is_transport_error = matches!(
    e,
    AcpError::Io(_) | AcpError::WriteTimeout(_) | AcpError::Timeout(_) | AcpError::Protocol(_)
);

Fix

Add AcpError::AgentError(String) for well-formed JSON-RPC error responses from the agent. This variant is not in the transport-error match, so the agent is returned to the pool instead of being respawned.

All 14 other AcpError::Protocol construction sites (missing stdin/stdout handles, missing sessionId, oversized lines, malformed permission requests, unknown stopReason, etc.) remain correctly classified as genuine protocol violations.

Changes

  • acp.rs: Add AcpError::AgentError variant; use it at both msg.get("error") sites
  • No changes to main.rs or pool.rs — the new variant naturally falls through to the existing application-error path ("pipe intact, return agent to pool")

5 lines added, 2 changed.

When sprout-agent returns a JSON-RPC error response (e.g. transient LLM
failure), the harness was classifying it as AcpError::Protocol — a
transport-class error that triggers agent respawn. But the stdio pipe is
intact and the agent is healthy; only the upstream LLM call failed.

Add AcpError::AgentError for well-formed JSON-RPC error responses from
the agent. This variant is not in the transport-error match in
handle_prompt_result(), so the agent is returned to the pool instead of
being killed and respawned.

Before: transient LLM timeout → agent respawn → session loss for all channels
After:  transient LLM timeout → agent returned to pool → next prompt works
…lassification

* origin/main:
  fix: default max_rounds to 0 (unlimited) (#540)
  feat(desktop): channel templates — reusable project settings (#538)
@tlongwell-block tlongwell-block merged commit 23f1637 into main May 11, 2026
15 checks passed
@tlongwell-block tlongwell-block deleted the fix/acp-agent-error-classification branch May 11, 2026 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant