Skip to content

chore(acp): raise default idle timeout from 320s to 620s#566

Merged
tlongwell-block merged 2 commits into
mainfrom
chore/acp-idle-timeout-default-620
May 13, 2026
Merged

chore(acp): raise default idle timeout from 320s to 620s#566
tlongwell-block merged 2 commits into
mainfrom
chore/acp-idle-timeout-default-620

Conversation

@tlongwell-block
Copy link
Copy Markdown
Collaborator

@tlongwell-block tlongwell-block commented May 13, 2026

Summary

Raise the default ACP idle timeout from 320s → 620s, and extract the value into a single DEFAULT_IDLE_TIMEOUT_SECS constant.

Why the bump

The default was sized as "goose's 300s turn timeout + 20s buffer". That's no longer a good fit for the agent shapes we actually run:

  • A sprout-agent running another agent as a tool naturally goes silent on its outer ACP channel while the inner agent works. The harness sees no stdout activity and assumes wedged.
  • codex/claude routinely do single tool calls (long compile, long network fetch, long shell command) that exceed 5 minutes with no streaming output on the ACP channel.
  • Model providers with slow generation can stall for minutes before the first token, especially for long contexts or reasoning models.

Each of these triggers idle timeout (320s) — no agent activity followed by session/cancel and a full agent subprocess respawn — losing session state, paying cold-start cost, and counting against the circuit breaker. The agent was healthy the whole time; it just wasn't talking on the outer channel.

Why the constant

The 620 literal lived at four real call sites: the actual default in from_args, the resolve_idle_timeout test helper that re-implements precedence, and two test_config() fixtures. The helper was a genuine drift hazard — a future bump would update the prod default but silently leave the test helper on the old value, masking the regression. One pub(crate) const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 620 with the rationale in its docstring fixes that. Two test assertions also now derive from the constant so they auto-track.

Commits

  1. chore(acp): raise default idle timeout from 320s to 620s — the bump itself, plus a fix for a stale README entry that documented the default as 300 (pre-existing drift).
  2. refactor(acp): extract DEFAULT_IDLE_TIMEOUT_SECS constant — single source of truth, no behavior change.

What does not change

  • max_turn_duration (the absolute wall-clock cap) stays at 3600s. A genuinely wedged agent still gets killed at the hard cap.
  • The respawn behavior on idle timeout is unchanged. Reducing how often the timeout fires is the goal of this PR; whether we should also stop respawning on clean cancels is being looked at separately.
  • The idle_timeout < max_turn_duration validation at startup is unchanged.
  • SPROUT_ACP_IDLE_TIMEOUT / --idle-timeout env-var and flag overrides are unchanged.

Operator-facing notes

Edge case worth flagging: operators who have set a custom --max-turn-duration ≤ 620 and rely on the default idle timeout will now fail validation at startup with:

idle_timeout (620s) must be less than max_turn_duration (Ns)

Resolution: either raise --max-turn-duration above 620, or set --idle-timeout explicitly to a value below your custom max. The default-default combo (idle=620, max=3600) is well within the constraint.

Test plan

  • cargo build -p sprout-acp — clean.
  • cargo clippy -p sprout-acp --all-targets -- -D warnings — clean.
  • cargo test -p sprout-acp — 259 passed, 0 failed.

What I considered and decided against

  • Updating the literal 320 in idle_timeout_error_includes_duration (acp.rs:1688). That test uses 320 as an arbitrary Duration value to verify error-message formatting; it doesn't reference the default. Touching it would be unrelated churn.
  • Going to 900s or higher. Reasonable, but 620 is the conservative bump — strictly better than 320 without entering "is this thing hung forever?" territory. Operators can override via SPROUT_ACP_IDLE_TIMEOUT if their workload needs more.

tlongwell-block pushed a commit that referenced this pull request May 13, 2026
Per review on #566: the 620 literal was repeated at four call sites
(the real default in `from_args`, the `resolve_idle_timeout` test
helper that re-implements precedence, and two `test_config()` fixtures)
plus two test assertions. The helper was the real drift hazard — a
future bump would update the prod default but silently leave the test
helper on the old value, masking the regression.

Single `pub(crate) const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 620` in
config.rs with the rationale in its docstring. All real call sites now
reference it; the test assertions derive from it so they auto-track.

No behavior change.
The default idle timeout was sized as goose's 300s turn timeout + 20s
buffer. In practice agents now go silent on the outer ACP channel for
much longer stretches — sprout-agents running other agents as tools,
codex/claude doing multi-minute single tool calls, model providers with
slow generation — and 320s causes spurious cancel+respawn on otherwise
healthy turns.

This bumps the default to 620s (600s working budget + 20s buffer). The
hard turn cap (`max_turn_duration`, default 3600s) is unchanged, so the
safety valve still fires for genuinely wedged agents. Operators with
custom `max_turn_duration` <= 620 and no explicit `--idle-timeout` will
now fail the existing `idle < max_turn` validation at startup; they
should either raise `max_turn_duration` or set `--idle-timeout`
explicitly.

Also corrects a stale README entry that documented the default as 300.

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Per review on #566: the 620 literal was repeated at four call sites
(the real default in `from_args`, the `resolve_idle_timeout` test
helper that re-implements precedence, and two `test_config()` fixtures)
plus two test assertions. The helper was the real drift hazard — a
future bump would update the prod default but silently leave the test
helper on the old value, masking the regression.

Single `pub(crate) const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 620` in
config.rs with the rationale in its docstring. All real call sites now
reference it; the test assertions derive from it so they auto-track.

No behavior change.

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
@tlongwell-block tlongwell-block force-pushed the chore/acp-idle-timeout-default-620 branch from 3ecdc62 to 807fcdb Compare May 13, 2026 14:39
@tlongwell-block tlongwell-block merged commit 8d5abb7 into main May 13, 2026
22 of 26 checks passed
@tlongwell-block tlongwell-block deleted the chore/acp-idle-timeout-default-620 branch May 13, 2026 15:04
tlongwell-block added a commit that referenced this pull request May 13, 2026
* origin/main:
  chore(acp): raise default idle timeout from 320s to 620s (#566)
  fix(cli): derive thread root from parent event tags (#564)
  fix: skip empty assistant turns instead of placeholder space (#560)

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant