fix(core): retry first-turn agent generation on transient errors (#125)#163

Merged
hqhq1025 merged 2 commits into main from worktree-agent-ab3a501e on Apr 23, 2026

@hqhq1025
Collaborator

Summary

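`USE_AGENT_RUNTIME` defaults to ON, so 99% of users go through a single `generateViaAgent → agent.prompt()` call with no retry at all. The old `completeWithRetry` (exponential backoff, 3 times) only runs when USE_AGENT_RUNTIME=0, and pi-ai has no built-in retry either. Result: 429/5xx/transient network errors are thrown straight to the user.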

What changed

  • `packages/providers/src/retry.ts`: extracted a generic `withBackoff(fn, opts)` that owns all of the sleep / jitter / classify / Retry-After / abort logic. `completeWithRetry` is rewritten as a thin wrapper (injecting provider-error normalization and the `provider.error` log); its behavior is unchanged and the original 15 tests are untouched
  • `packages/providers/src/index.ts`: export `withBackoff` / `BackoffOptions` / `RetryDecision`
  • `packages/core/src/agent.ts`: when `input.history.length === 0` (first turn, idempotent), wrap `agent.prompt() + agent.waitForIdle()` in `withBackoff({ maxRetries: 3 })`. Non-first turns still go through `sendOnce()`, so a resend cannot pollute the state of a multi-turn conversation
  • Retries are surfaced via `log.warn('[generate] step=send_request.retry', ...)`; UI feedback is left as a follow-up
  • The `USE_AGENT_RUNTIME` default is unchanged; the Codex 401 path is unchanged (follow-up)
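
A helper of the kind described above could look roughly like the following. This is a minimal sketch, not the code in `packages/providers/src/retry.ts`: the field names on `BackoffOptions` and `RetryDecision`, the `baseDelayMs` default, the error shape (`status`, `retryAfterMs`), and the `defaultClassify` and `sleep` helpers are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real BackoffOptions / RetryDecision may differ.
type RetryDecision = { retry: boolean; retryAfterMs?: number | undefined };

interface BackoffOptions {
  maxRetries: number;
  baseDelayMs?: number;
  classify?: (err: unknown) => RetryDecision;
  onRetry?: (attempt: number, err: unknown, delayMs: number) => void;
  signal?: AbortSignal;
}

// Assumed default classifier: retry 429 and 5xx, never other statuses.
function defaultClassify(err: unknown): RetryDecision {
  const e = err as { status?: number; retryAfterMs?: number } | null;
  const status = e?.status;
  if (status === 429 || (status !== undefined && status >= 500)) {
    return { retry: true, retryAfterMs: e?.retryAfterMs };
  }
  return { retry: false };
}

// Sleep that rejects as soon as the signal aborts.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function withBackoff<T>(fn: () => Promise<T>, opts: BackoffOptions): Promise<T> {
  const classify = opts.classify ?? defaultClassify;
  const base = opts.baseDelayMs ?? 500;
  for (let attempt = 0; ; attempt++) {
    if (opts.signal?.aborted) throw new Error("aborted");
    try {
      return await fn();
    } catch (err) {
      const decision = classify(err);
      // Give up on non-retryable errors or once the retry budget is spent.
      if (!decision.retry || attempt >= opts.maxRetries) throw err;
      // A server-provided Retry-After wins; otherwise exponential backoff with jitter.
      const delayMs = decision.retryAfterMs ?? Math.random() * base * 2 ** attempt;
      opts.onRetry?.(attempt + 1, err, delayMs);
      await sleep(delayMs, opts.signal);
    }
  }
}
```

With this shape, a caller would pass its single-attempt function plus `{ maxRetries: 3 }`, and could use `onRetry` as the place to emit a warning log for each retry.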

Test plan

  • 9 new `withBackoff` tests: first-try success / 503→ok / 429 Retry-After / 4xx no retry / exhaustion / pre-aborted signal / mid-backoff abort / custom classify
  • 4 new agent tests: first-turn 500→success / first-turn 3×500 exhaustion / first-turn 401 no retry / non-first-turn 500 no retry (the key case: it protects multi-turn state)
  • `pnpm test` providers (139) + core (220) all green
  • `pnpm typecheck` + `pnpm lint` clean
  • changeset added (patch bump)

Closes #125

@github-actions github-actions Bot added docs Documentation area:core packages/core (generation orchestration) area:providers packages/providers (pi-ai adapter, model calls) labels Apr 22, 2026
Contributor

@github-actions github-actions Bot left a comment

Findings

  • [Major] First-turn retry can replay non-idempotent tool side effects: the retry loop wraps the whole agent.prompt() + agent.waitForIdle() call. If the first attempt has already executed tool calls (e.g. file edits) and then fails on a transient upstream error, retrying sends the same turn again and can duplicate or corrupt output state. Evidence: packages/core/src/agent.ts:832, packages/core/src/agent.ts:837, packages/core/src/agent.ts:852.
    Suggested fix:
    // Only retry truly side-effect-free first turns.
    const canRetryFirstTurn = input.history.length === 0 && tools.length === 0;
    
    if (canRetryFirstTurn) {
      await withBackoff(sendOnce, retryOpts);
    } else {
      await sendOnce();
    }

Summary

  • Review mode: initial
  • 1 major issue found in retry safety for agent first turn.
  • docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.

Testing

  • Not run (automation). Suggested coverage: add a test that simulates first-turn tool execution followed by transient failure and asserts no automatic replay when tools are enabled.
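
The suggested coverage can be sketched as a small simulation. Everything here is hypothetical scaffolding: `retryNoDelay` stands in for the real retry helper, `sendTurn` applies the `tools.length === 0` gate proposed in the finding above (which is not necessarily the guard that was merged), and `makeFlakyTurn` fakes a turn that performs a tool side effect and then fails transiently on its first attempt.

```typescript
type Tool = { name: string };

// Stand-in for a retry helper: retries up to maxRetries times with no delay.
async function retryNoDelay<T>(fn: () => Promise<T>, maxRetries: number): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
    }
  }
}

// The gate proposed in the review: only tool-free first turns are retried.
async function sendTurn(
  historyLength: number,
  tools: Tool[],
  sendOnce: () => Promise<void>,
): Promise<void> {
  const canRetryFirstTurn = historyLength === 0 && tools.length === 0;
  if (canRetryFirstTurn) {
    await retryNoDelay(sendOnce, 3);
  } else {
    await sendOnce();
  }
}

// Simulated turn: optionally performs a side effect, then throws a 503 on attempt 1 only.
function makeFlakyTurn(hasTools: boolean) {
  const state = { attempts: 0, fileWrites: 0 };
  const sendOnce = async (): Promise<void> => {
    state.attempts++;
    if (hasTools) state.fileWrites++; // models a mutating tool call
    if (state.attempts === 1) {
      throw Object.assign(new Error("upstream 503"), { status: 503 });
    }
  };
  return { state, sendOnce };
}
```

A test built on this would run a tool-enabled first turn, expect the transient error to propagate, and assert that `fileWrites` is still 1, i.e. the side effect was not replayed.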

open-codesign Bot

},
};
if (input.signal) retryOpts.signal = input.signal;
await withBackoff(sendOnce, retryOpts);
Contributor

withBackoff(sendOnce, ...) retries the full prompt + waitForIdle turn. If attempt 1 already ran tool side effects before a transient failure, attempt 2 replays the same turn and can duplicate/corrupt state. Consider gating retries to side-effect-free cases (e.g. tools.length === 0) or otherwise proving idempotency before replaying.

The agent runtime (USE_AGENT_RUNTIME, default on) invokes `agent.prompt()`
without any retry, so a single transient 5xx/429/network blip surfaces to
the user as an immediate failure. The legacy `completeWithRetry` path
already had backoff but it's gated off for ~99% of users.

Extract a generic `withBackoff<T>` helper in `@open-codesign/providers`
sharing the existing classify / jitter / Retry-After / abort logic.
`completeWithRetry` becomes a thin wrapper (behavior preserved, existing
tests unchanged).

In `generateViaAgent`, wrap `agent.prompt() + waitForIdle()` with
`withBackoff` only when `input.history.length === 0` — retrying a
multi-turn request would replay partial tool state and corrupt the
session. Retries surface via `log.warn('[generate] step=send_request.retry')`
and the existing `deps.onRetry` hook; UI toasts are follow-up.

Scope note: Codex 401 refresh-retry is a separate follow-up.

Signed-off-by: hqhq1025 <1506751656@qq.com>
@hqhq1025 hqhq1025 force-pushed the worktree-agent-ab3a501e branch from 9dbdb4a to 4a36931 Compare April 22, 2026 14:19
Contributor

@github-actions github-actions Bot left a comment

Findings

  • [Major] First-turn retry can replay non-idempotent tool side effects — withBackoff(sendOnce, ...) retries the full agent.prompt() + agent.waitForIdle() call whenever input.history.length === 0. A first turn can still execute mutating tools (file edits, TODO updates, etc.) before a transient upstream failure, so replaying the turn can duplicate/corrupt state. Evidence packages/core/src/agent.ts:831, packages/core/src/agent.ts:852.
    Suggested fix:
    const canRetryFirstTurn = input.history.length === 0 && tools.length === 0;
    
    if (canRetryFirstTurn) {
      await withBackoff(sendOnce, retryOpts);
    } else {
      await sendOnce();
    }

Summary

  • Review mode: follow-up after new commits
  • 1 major issue found in first-turn retry safety.
  • docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.

Testing

  • Not run (automation). Suggested tests: add a first-turn tool-call scenario where attempt 1 performs a side effect then throws transiently, and assert no automatic replay when tools are enabled.

open-codesign Bot

},
};
if (input.signal) retryOpts.signal = input.signal;
await withBackoff(sendOnce, retryOpts);
Contributor

withBackoff(sendOnce, ...) retries the full first turn. On tool-enabled first turns, attempt 1 may already perform side effects before a transient failure, so attempt 2 replays those effects. Consider gating retry to side-effect-free first turns (e.g. tools.length === 0) or proving idempotency before replaying.

Contributor

@github-actions github-actions Bot left a comment

Findings

  • No issues that can be reported with high confidence were found in the lines added or modified by this diff.

Summary

  • Review mode: follow-up after new commits
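  • The full PR diff (including follow-up context) was reviewed against the latest head commit 38fc851083c6b896fea6c6507ffd8058362b6e07; no issues introduced or directly triggered by this change were found.
  • docs/VISION.md and docs/PRINCIPLES.md: Not found in repo/docs.
  • Residual risk: the current tests block retry replay mainly by checking whether agent.state.messages has grown. If upstream ever gains a path that produces side effects before any assistant message is recorded, the existing unit tests may not catch that semantic regression.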

Testing

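  • Not run (automation). Suggested addition: an integration test that follows the actual pi-agent-core event sequence more closely. In a first turn, simulate a tool side effect happening first, with the assistant message written afterwards or not written at all, and assert that no automatic retry replay occurs.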
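
The guard and its residual risk can be illustrated with a small sketch. This is an assumption about the general idea (refuse to retry once `agent.state.messages` has grown), not the merged implementation: `AgentLike`, `isTransient`, and `makeReplayGuard` are hypothetical names, and the real code may compare against a different baseline, for example one that accounts for the user message added by `agent.prompt()`.

```typescript
// Hypothetical minimal view of the agent; only the message list matters here.
interface AgentLike {
  state: { messages: unknown[] };
}

// Assumed transient check: 429 and 5xx only (network errors are omitted for brevity).
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === 429 || (status !== undefined && status >= 500);
}

// Returns a predicate that allows a retry only if no message was recorded
// since the guard was created, i.e. the attempt left no visible progress.
function makeReplayGuard(agent: AgentLike): (err: unknown) => boolean {
  const baseline = agent.state.messages.length;
  return (err) => isTransient(err) && agent.state.messages.length === baseline;
}
```

The residual risk is visible in this shape: if a tool mutates a file but the attempt fails before any message is appended, `messages.length` still equals the baseline and the guard would allow a replay. That is the sequence the suggested integration test is meant to pin down.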

open-codesign Bot

@hqhq1025 hqhq1025 merged commit 63fa316 into main Apr 23, 2026
7 checks passed
@hqhq1025 hqhq1025 deleted the worktree-agent-ab3a501e branch April 23, 2026 02:53

Development

Successfully merging this pull request may close these issues.

GPT call failures have no retry mechanism