fix(core): retry first-turn agent generation on transient errors (#125) by hqhq1025 · Pull Request #163 · OpenCoworkAI/open-codesign

hqhq1025 · 2026-04-22T14:15:24Z

Summary

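`USE_AGENT_RUNTIME` defaults to ON, so 99% of users go through a single `generateViaAgent → agent.prompt()` call with no retry at all. The old `completeWithRetry` (3 exponential backoffs) only runs when USE_AGENT_RUNTIME=0, and pi-ai has no built-in retry either. As a result, 429/5xx/transient network errors are thrown straight to the user.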

What changed

`packages/providers/src/retry.ts`: extracts a generic `withBackoff(fn, opts)` that owns all of the sleep / jitter / classify / Retry-After / abort logic. `completeWithRetry` is rewritten as a thin wrapper (injecting provider-error normalization and the `provider.error` log); its behavior is unchanged and the 15 existing tests are untouched
`packages/providers/src/index.ts`: exports `withBackoff` / `BackoffOptions` / `RetryDecision`
`packages/core/src/agent.ts`: when `input.history.length === 0` (first turn, idempotent), wraps `agent.prompt() + agent.waitForIdle()` in `withBackoff({ maxRetries: 3 })`. Non-first turns still go through `sendOnce()`, so a multi-turn conversation is never resent and its state is not corrupted
Retries are surfaced through `log.warn('[generate] step=send_request.retry', ...)`; UI feedback is left as a follow-up
The `USE_AGENT_RUNTIME` default is unchanged; the Codex 401 path is unchanged (follow-up)
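
For readers who have not opened `retry.ts`, the shape of a helper like `withBackoff` can be sketched roughly as follows. This is not the code from the PR: the option field names, the default delays, the `defaultClassify` rules, and the `retryAfterMs` property on errors are all assumptions made for illustration. The PR text also does not say whether `maxRetries` counts total attempts or retries after the first attempt; the sketch assumes the latter.

```typescript
// Hypothetical shapes; the real BackoffOptions / RetryDecision may differ.
interface RetryDecision {
  retry: boolean;
  retryAfterMs?: number; // server hint, e.g. derived from a Retry-After header
}

interface BackoffOptions {
  maxRetries?: number; // assumed: retries after the first attempt
  baseDelayMs?: number;
  maxDelayMs?: number;
  signal?: AbortSignal;
  classify?: (err: unknown) => RetryDecision;
  onRetry?: (info: { attempt: number; delayMs: number; err: unknown }) => void;
  random?: () => number; // injectable so tests can be deterministic
}

// Assumed default: retry 429 and 5xx, plus errors without a status
// (treated as network failures); never retry other 4xx.
function defaultClassify(err: unknown): RetryDecision {
  const status = (err as { status?: number } | null)?.status;
  if (status === undefined) return { retry: true };
  if (status === 429 || status >= 500) {
    const retryAfterMs = (err as { retryAfterMs?: number }).retryAfterMs;
    return retryAfterMs === undefined ? { retry: true } : { retry: true, retryAfterMs };
  }
  return { retry: false };
}

// Exponential backoff with full jitter: uniform in [0, min(max, base * 2^attempt)).
function computeDelayMs(attempt: number, base: number, max: number, rand: number): number {
  return Math.floor(rand * Math.min(max, base * 2 ** attempt));
}

// Sleep that rejects as soon as the signal aborts.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(new Error("aborted"));
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function withBackoff<T>(fn: () => Promise<T>, opts: BackoffOptions = {}): Promise<T> {
  const { maxRetries = 3, baseDelayMs = 500, maxDelayMs = 8000, signal } = opts;
  const classify = opts.classify ?? defaultClassify;
  const random = opts.random ?? Math.random;
  for (let attempt = 0; ; attempt++) {
    if (signal?.aborted) throw new Error("aborted");
    try {
      return await fn();
    } catch (err) {
      const decision = classify(err);
      // Give up on non-retryable errors or once the retry budget is spent.
      if (!decision.retry || attempt >= maxRetries) throw err;
      const delayMs =
        decision.retryAfterMs ?? computeDelayMs(attempt, baseDelayMs, maxDelayMs, random());
      opts.onRetry?.({ attempt: attempt + 1, delayMs, err });
      await sleep(delayMs, signal);
    }
  }
}
```

With a shape like this, a thin wrapper such as `completeWithRetry` would only need to supply its own `classify` and `onRetry`, which matches the split described in the list above.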

Test plan

9 new `withBackoff` tests: first-try success / 503→ok / 429 Retry-After / 4xx no retry / exhaustion / pre-aborted signal / mid-backoff abort / custom classify
4 new agent tests: first-turn 500→success / first-turn 3×500 exhaustion / first-turn 401 no retry / non-first-turn 500 no retry (key: protects multi-turn state)
`pnpm test` providers (139) + core (220) all green
`pnpm typecheck` + `pnpm lint` clean
changeset added (patch bump)
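
The `429 Retry-After` case depends on turning the header value into a delay. HTTP allows that header to carry either a whole number of seconds or an HTTP-date, so a parser could look like the sketch below. `parseRetryAfterMs` is a hypothetical name; the PR text does not show how `retry.ts` actually reads the header.

```typescript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Returns undefined when the header is absent or unparseable, so the caller
// can fall back to its own exponential backoff.
function parseRetryAfterMs(
  value: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (value == null) return undefined;
  const trimmed = value.trim();
  if (trimmed === "") return undefined;

  // Form 1: delay-seconds, a non-negative integer.
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: HTTP-date, interpreted relative to "now"; never negative.
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);
}
```

Passing `nowMs` explicitly keeps the date branch deterministic in tests, which is useful when asserting that a retry waits the server-requested time rather than the jittered default.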

Closes #125

github-actions

Findings

[Major] First-turn retry can replay non-idempotent tool side effects — the retry loop wraps the whole agent.prompt() + agent.waitForIdle() call. If the first attempt already executed tool calls (e.g. file edits) and then fails on a transient upstream error, retrying sends the same turn again and can duplicate/corrupt output state, evidence packages/core/src/agent.ts:832, packages/core/src/agent.ts:837, packages/core/src/agent.ts:852.
Suggested fix:
```
// Only retry truly side-effect-free first turns.
const canRetryFirstTurn = input.history.length === 0 && tools.length === 0;

if (canRetryFirstTurn) {
  await withBackoff(sendOnce, retryOpts);
} else {
  await sendOnce();
}
```

Summary

Review mode: initial
1 major issue found in retry safety for agent first turn.
docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.

Testing

Not run (automation). Suggested coverage: add a test that simulates first-turn tool execution followed by transient failure and asserts no automatic replay when tools are enabled.

open-codesign Bot

github-actions · 2026-04-22T14:18:35Z

+        },
+      };
+      if (input.signal) retryOpts.signal = input.signal;
+      await withBackoff(sendOnce, retryOpts);


withBackoff(sendOnce, ...) retries the full prompt + waitForIdle turn. If attempt 1 already ran tool side effects before a transient failure, attempt 2 replays the same turn and can duplicate/corrupt state. Consider gating retries to side-effect-free cases (e.g. tools.length === 0) or otherwise proving idempotency before replaying.

The agent runtime (USE_AGENT_RUNTIME, default on) invokes `agent.prompt()` without any retry, so a single transient 5xx/429/network blip surfaces to the user as an immediate failure. The legacy `completeWithRetry` path already had backoff but it's gated off for ~99% of users.

Extract a generic `withBackoff<T>` helper in `@open-codesign/providers` sharing the existing classify / jitter / Retry-After / abort logic. `completeWithRetry` becomes a thin wrapper (behavior preserved, existing tests unchanged).

In `generateViaAgent`, wrap `agent.prompt() + waitForIdle()` with `withBackoff` only when `input.history.length === 0`, because retrying a multi-turn request would replay partial tool state and corrupt the session. Retries surface via `log.warn('[generate] step=send_request.retry')` and the existing `deps.onRetry` hook; UI toasts are follow-up.

Scope note: Codex 401 refresh-retry is a separate follow-up.

Signed-off-by: hqhq1025 <1506751656@qq.com>

github-actions

Findings

[Major] First-turn retry can replay non-idempotent tool side effects — withBackoff(sendOnce, ...) retries the full agent.prompt() + agent.waitForIdle() call whenever input.history.length === 0. A first turn can still execute mutating tools (file edits, TODO updates, etc.) before a transient upstream failure, so replaying the turn can duplicate/corrupt state. Evidence packages/core/src/agent.ts:831, packages/core/src/agent.ts:852.
Suggested fix:
```
const canRetryFirstTurn = input.history.length === 0 && tools.length === 0;

if (canRetryFirstTurn) {
  await withBackoff(sendOnce, retryOpts);
} else {
  await sendOnce();
}
```

Summary

Review mode: follow-up after new commits
1 major issue found in first-turn retry safety.
docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.

Testing

Not run (automation). Suggested tests: add a first-turn tool-call scenario where attempt 1 performs a side effect then throws transiently, and assert no automatic replay when tools are enabled.

open-codesign Bot

github-actions · 2026-04-22T14:21:44Z

+        },
+      };
+      if (input.signal) retryOpts.signal = input.signal;
+      await withBackoff(sendOnce, retryOpts);


withBackoff(sendOnce, ...) retries the full first turn. On tool-enabled first turns, attempt 1 may already perform side effects before a transient failure, so attempt 2 replays those effects. Consider gating retry to side-effect-free first turns (e.g. tools.length === 0) or proving idempotency before replaying.

Signed-off-by: hqhq1025 <1506751656@qq.com>

github-actions

Findings

No issues that can be reported with high confidence were found in the lines added or modified by this diff.

Summary

Review mode: follow-up after new commits
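Reviewed the full PR diff (including follow-up context) against the latest head commit 38fc851083c6b896fea6c6507ffd8058362b6e07; no issues introduced or directly triggered by this change were found.
docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.
Residual risk: the current tests mainly rely on the `agent.state.messages` delta to block retry replay. If upstream ever gains a path where a side effect happens before any assistant message is recorded, the existing unit tests may not cover that semantic regression.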
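
To make the residual risk concrete, a guard keyed on the message-count delta might look like the sketch below. It is an illustration only: `AgentLike`, `isTransient`, and `guardByMessageDelta` are hypothetical names, and whether the user prompt itself counts toward the baseline depends on pi-agent-core behaviour that this PR page does not describe.

```typescript
// Minimal stand-in for the agent; only the field mentioned in the review is modelled.
interface AgentLike {
  state: { messages: unknown[] };
}

// Assumed transient set: 429 and 5xx.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === 429 || (status !== undefined && status >= 500);
}

// Wraps a retry predicate so that it vetoes retries once the agent's message
// list has grown past the baseline captured before the first send.
function guardByMessageDelta(
  agent: AgentLike,
  baseline: number,
  inner: (err: unknown) => boolean,
): (err: unknown) => boolean {
  return (err) => agent.state.messages.length <= baseline && inner(err);
}

// Usage: capture the baseline before sending, then consult the guard on failure.
const agent: AgentLike = { state: { messages: [] } };
const canRetry = guardByMessageDelta(agent, agent.state.messages.length, isTransient);

const beforeAnyMessage = canRetry({ status: 503 }); // nothing recorded yet
agent.state.messages.push({ role: "assistant", content: "tool call issued" });
const afterAssistantMessage = canRetry({ status: 503 }); // progress recorded
```

A guard of this kind only sees what lands in `messages`. A tool that mutates state before anything is appended would leave the count at the baseline, so the retry would still be allowed, which is the gap the review describes.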

Testing

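Not run (automation). Suggest adding an integration test closer to pi-agent-core's actual event sequence: in a first turn, simulate a tool side effect happening first, with the assistant message written afterwards or not at all, and assert that no automatic retry replay occurs.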

open-codesign Bot

github-actions Bot added labels docs (Documentation), area:core (packages/core, generation orchestration), area:providers (packages/providers, pi-ai adapter, model calls) Apr 22, 2026

github-actions Bot reviewed Apr 22, 2026

hqhq1025 force-pushed the worktree-agent-ab3a501e branch from 9dbdb4a to 4a36931 Compare April 22, 2026 14:19

github-actions Bot reviewed Apr 22, 2026

fix(core): only retry first-turn send before any tool side effect (#125)

38fc851

Signed-off-by: hqhq1025 <1506751656@qq.com>

github-actions Bot reviewed Apr 23, 2026

hqhq1025 merged commit 63fa316 into main Apr 23, 2026
7 checks passed

hqhq1025 deleted the worktree-agent-ab3a501e branch April 23, 2026 02:53

hqhq1025 merged 2 commits into main from worktree-agent-ab3a501e
