fix(core): retry first-turn agent generation on transient errors (#125)#163
fix(core): retry first-turn agent generation on transient errors (#125)#163
Conversation
There was a problem hiding this comment.
Findings
- [Major] First-turn retry can replay non-idempotent tool side effects — the retry loop wraps the whole
agent.prompt() + agent.waitForIdle()call. If the first attempt already executed tool calls (e.g. file edits) and then fails on a transient upstream error, retrying sends the same turn again and can duplicate/corrupt output state, evidencepackages/core/src/agent.ts:832,packages/core/src/agent.ts:837,packages/core/src/agent.ts:852.
Suggested fix:// Only retry truly side-effect-free first turns. const canRetryFirstTurn = input.history.length === 0 && tools.length === 0; if (canRetryFirstTurn) { await withBackoff(sendOnce, retryOpts); } else { await sendOnce(); }
Summary
- Review mode: initial
- 1 major issue found in retry safety for agent first turn.
docs/VISION.mdanddocs/PRINCIPLES.md: Not found in repo/docs.
Testing
- Not run (automation). Suggested coverage: add a test that simulates first-turn tool execution followed by transient failure and asserts no automatic replay when tools are enabled.
open-codesign Bot
| }, | ||
| }; | ||
| if (input.signal) retryOpts.signal = input.signal; | ||
| await withBackoff(sendOnce, retryOpts); |
There was a problem hiding this comment.
withBackoff(sendOnce, ...) retries the full prompt + waitForIdle turn. If attempt 1 already ran tool side effects before a transient failure, attempt 2 replays the same turn and can duplicate/corrupt state. Consider gating retries to side-effect-free cases (e.g. tools.length === 0) or otherwise proving idempotency before replaying.
The agent runtime (USE_AGENT_RUNTIME, default on) invokes `agent.prompt()`
without any retry, so a single transient 5xx/429/network blip surfaces to
the user as an immediate failure. The legacy `completeWithRetry` path
already had backoff but it's gated off for ~99% of users.
Extract a generic `withBackoff<T>` helper in `@open-codesign/providers`
sharing the existing classify / jitter / Retry-After / abort logic.
`completeWithRetry` becomes a thin wrapper (behavior preserved, existing
tests unchanged).
In `generateViaAgent`, wrap `agent.prompt() + waitForIdle()` with
`withBackoff` only when `input.history.length === 0` — retrying a
multi-turn request would replay partial tool state and corrupt the
session. Retries surface via `log.warn('[generate] step=send_request.retry')`
and the existing `deps.onRetry` hook; UI toasts are follow-up.
Scope note: Codex 401 refresh-retry is a separate follow-up.
Signed-off-by: hqhq1025 <1506751656@qq.com>
9dbdb4a to
4a36931
Compare
There was a problem hiding this comment.
Findings
- [Major] First-turn retry can replay non-idempotent tool side effects —
withBackoff(sendOnce, ...)retries the fullagent.prompt() + agent.waitForIdle()call wheneverinput.history.length === 0. A first turn can still execute mutating tools (file edits, TODO updates, etc.) before a transient upstream failure, so replaying the turn can duplicate/corrupt state. Evidencepackages/core/src/agent.ts:831,packages/core/src/agent.ts:852.
Suggested fix:const canRetryFirstTurn = input.history.length === 0 && tools.length === 0; if (canRetryFirstTurn) { await withBackoff(sendOnce, retryOpts); } else { await sendOnce(); }
Summary
- Review mode: follow-up after new commits
- 1 major issue found in first-turn retry safety.
docs/VISION.mdanddocs/PRINCIPLES.md: Not found in repo/docs.
Testing
- Not run (automation). Suggested tests: add a first-turn tool-call scenario where attempt 1 performs a side effect then throws transiently, and assert no automatic replay when tools are enabled.
open-codesign Bot
| }, | ||
| }; | ||
| if (input.signal) retryOpts.signal = input.signal; | ||
| await withBackoff(sendOnce, retryOpts); |
There was a problem hiding this comment.
withBackoff(sendOnce, ...) retries the full first turn. On tool-enabled first turns, attempt 1 may already perform side effects before a transient failure, so attempt 2 replays those effects. Consider gating retry to side-effect-free first turns (e.g. tools.length === 0) or proving idempotency before replaying.
Signed-off-by: hqhq1025 <1506751656@qq.com>
There was a problem hiding this comment.
Findings
- 未在本次 diff 的新增/修改行中发现可高置信度报告的问题。
Summary
- Review mode: follow-up after new commits
- 已基于最新 head commit
38fc851083c6b896fea6c6507ffd8058362b6e07全量审查本次 PR diff(含 follow-up 上下文);当前未发现由本次变更引入或直接触发的问题。 docs/VISION.md与docs/PRINCIPLES.md:Not found in repo/docs。- 残余风险:当前测试主要通过
agent.state.messages增量来阻止重试回放;若上游未来出现“未落 assistant message 即产生副作用”的路径,现有单测可能无法覆盖该语义回归。
Testing
- Not run (automation)。建议补充一条更贴近 pi-agent-core 实际事件序列的集成测试:在 first-turn 中模拟工具副作用先发生、assistant 消息后写入/未写入时,断言不会发生自动重试回放。
open-codesign Bot
Summary
`USE_AGENT_RUNTIME` 默认 ON,99% 用户走 `generateViaAgent → agent.prompt()` 单次调用,没有任何重试。旧的 `completeWithRetry`(3 次指数退避)只在 USE_AGENT_RUNTIME=0 时才走。pi-ai 自身也无内建重试。结果:429/5xx/transient network 直接抛错给用户。
What changed
Test plan
Closes #125