Skip to content

feat(backend): add automatic failover and retry for multi-provider requests#50

Merged
pescn merged 5 commits into
mainfrom
feat/automatic-failover
Jan 19, 2026
Merged

feat(backend): add automatic failover and retry for multi-provider requests#50
pescn merged 5 commits into
mainfrom
feat/automatic-failover

Conversation

@pescn
Copy link
Copy Markdown
Contributor

@pescn pescn commented Jan 19, 2026

Summary

  • Add automatic failover service that switches to alternate providers when requests fail
  • Implement exponential backoff with jitter for retries
  • Support both same-provider retries (for transient errors) and cross-provider failover
  • Include weighted load balancing fix (supersedes fix(backend): implement weighted load balancing in selectModel #49)

Features

Failover Service (backend/src/services/failover.ts)

  • Configurable retry behavior: max attempts, retriable status codes, timeouts
  • Exponential backoff with jitter: prevents thundering herd
  • Same-provider retries: for transient errors like 429 rate limits
  • Cross-provider failover: switches to next provider on persistent failures
  • Weighted candidate selection: respects model weights for failover order
  • Detailed error tracking: logs all attempted providers and errors

Configuration Defaults

{
  maxProviderAttempts: 3,
  sameProviderRetries: 1,
  retriableStatusCodes: [429, 500, 502, 503, 504],
  retriableErrorCodes: ['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED', ...],
  baseDelayMs: 100,
  maxDelayMs: 5000,
  timeoutMs: 120000
}

Failover Flow

Request arrives
    ↓
Select candidates (weighted random order)
    ↓
For each candidate:
  ├─ Try request
  ├─ If success: return response
  ├─ If retriable error:
  │   ├─ Same-provider retry (if attempts remain)
  │   └─ Move to next provider
  └─ If non-retriable: return error
    ↓
All providers exhausted → return 502

Affected Endpoints

  • /v1/chat/completions
  • /v1/messages
  • /v1/responses

Also Includes

This PR also includes the weighted load balancing fix from #49:

  • Implement proper weighted random selection in selectModel()
  • Add filterCandidates() helper function

Test plan

  • Type check passes (bun run check)
  • Lint passes (bun run lint)
  • Manual testing with multiple providers configured for same model
  • Verify failover triggers on simulated provider errors

🤖 Generated with Claude Code

Summary by CodeRabbit

发布说明

  • 新功能

    • 增加可配置的故障切换服务,支持跨多个提供方的重试、回退与加权候选选择,提升流式与非流式请求鲁棒性。
    • 新的候选过滤与多候选选择策略,支持按目标提供方路由并优先/重排候选。
  • 重构

    • 统一上游请求构建、流式与非流式响应处理、错误处理与时序采集(开始时间)以便度量与追踪。
    • 统一失败处理与日志记录,耗尽后返回标准化上游错误响应并改进重试/超时行为。

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 19, 2026

Warning

Rate limit exceeded

@pescn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 36 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 4392023 and b0e896d.

📒 Files selected for processing (2)
  • backend/src/api/v1/messages.ts
  • backend/src/utils/api-helpers.ts

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

本次更改引入集中式故障转移引擎并将其集成到 completions、messages、responses 三个 v1 API 路径;重构请求构建与流/非流处理器以支持跨多提供商的重试、回退与超时控制;新增候选过滤与多候选选择逻辑并统一错误/完成记录与时序度量。

Changes

Cohort / File(s) 更改摘要
故障转移服务
backend/src/services/failover.ts
新增完整 Failover 服务:FailoverConfigDEFAULT_FAILOVER_CONFIG、错误/结果模型(FailoverErrorFailoverResult)、执行引擎 executeWithFailover、候选选择/重排序与 selectMultipleCandidates、退避/超时/重试与辅助函数(fetchWithTimeoutsleep、退避计算等)。
核心 v1 API(故障转移集成)
backend/src/api/v1/completions.ts, backend/src/api/v1/messages.ts, backend/src/api/v1/responses.ts
将上游请求改为基于故障转移的流:新增模块级 FAILOVER_CONFIG 常量、使用 filterCandidates + selectMultipleCandidates 选择候选、通过 executeWithFailover 发起上游请求;将 handleNonStreamingRequestprocessNonStreamingResponsehandleStreamingRequestprocessStreamingResponse(签名变更为接收 Responsebegin;返回 Promise<string> / AsyncGenerator<string,...>);去除模块内直接 fetch,改用 adapter 构建请求并统一错误/日志/完成记录与时序计算。
工具与模型选择
backend/src/utils/api-helpers.ts
新增 filterCandidates(modelsWithProviders, targetProvider?),并调整 selectModel 行为以利用过滤逻辑;新增 Failover 错误类型与处理工具:FailoverErrorResultgetErrorSummaryprocessFailoverError,用于把 failover 结果映射到完成记录与标准化错误响应。
导入/使用点调整
backend/src/api/v1/*.ts, 其他引用处
替换或新增导入:executeWithFailoverselectMultipleCandidatestype FailoverConfig(来自 @/services/failover);将原先直接 selectModel 的用法改为 filterCandidates + 多候选选择。
流/非流处理器签名变更
backend/src/api/v1/*.ts
将原先负责发起 fetch 的 handler 重命名并改为以上游 Response 为输入的处理函数(processNonStreamingResponse(resp, completion, bearer, providerType, apiKeyRecord, begin)processStreamingResponse(...)),并在调用链中传递 begin 用于时延计算。
日志与错误/完成记录增强
多处文件
在故障转移耗尽、处理错误、无响应体或流无 chunk 等场景统一记录并返回标准化上游失败响应;调用 processFailoverError 更新完成记录并返回结构化错误结果。

Sequence Diagrams

sequenceDiagram
    participant Client
    participant API as "API (completions/messages/responses)"
    participant Failover as "Failover 引擎"
    participant ProviderA as "提供商 A"
    participant ProviderB as "提供商 B"
    participant Logging as "日志 / 完成记录"

    Client->>API: 发送请求(含 model/provider 目标)
    API->>API: 捕获 begin 时间;filterCandidates -> selectMultipleCandidates
    API->>Failover: executeWithFailover(候选, buildRequestPerProvider)
    Failover->>ProviderA: 构建并发送上游请求
    ProviderA-->>Failover: 响应(成功或错误)
    alt 可重试失败
        Failover->>Failover: 退避等待(exponential backoff)
        Failover->>ProviderB: 构建并发送上游请求
        ProviderB-->>Failover: 响应
    end
    alt 成功响应
        Failover-->>API: FailoverResult(success, response)
        API->>API: processNonStreamingResponse / processStreamingResponse
        API->>Logging: 记录完成成功与时延
        API-->>Client: 返回处理后响应
    else 所有候选耗尽
        Failover-->>API: FailoverResult(failure, errors)
        API->>Logging: 记录失败详情
        API-->>Client: 返回标准化上游失败(502 / 错误 payload)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 洞口闻到变更香,
故障转移跳起轻快步,
多家提供商并肩守望,
退避与重试织细网,
请求安然归回草坡上 🌿

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed PR标题清晰准确地概括了主要变更:实现多提供商请求的自动故障转移和重试功能,与代码变更内容高度相关且具体明确。
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @pescn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the backend's fault tolerance and reliability by introducing an automatic failover and retry mechanism for multi-provider requests. It ensures that API calls to external models are more resilient to transient network issues and provider outages, gracefully attempting alternative providers or retrying failed requests with intelligent backoff. This change also refines the existing weighted model selection to work seamlessly with the new failover capabilities.

Highlights

  • Automatic Failover Service: Introduced a new service (backend/src/services/failover.ts) that automatically switches to alternate providers when requests fail, enhancing system resilience.
  • Retry Mechanism with Exponential Backoff and Jitter: Implemented exponential backoff with jitter for retries, preventing 'thundering herd' issues and allowing for configurable retry behavior based on status codes and network errors.
  • Configurable Retry Strategies: Supports both same-provider retries for transient errors (e.g., 429 rate limits) and cross-provider failover for persistent failures, with configurable maximum attempts and delays.
  • Weighted Load Balancing Enhancement: The model selection logic now includes a proper weighted random selection for load balancing across multiple providers, addressing and superseding a previous TODO item (fix(backend): implement weighted load balancing in selectModel #49).
  • API Endpoint Integration: The new failover and retry logic has been integrated into the /v1/chat/completions, /v1/messages, and /v1/responses API endpoints, making them more robust.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust failover and retry mechanism, which is a great enhancement for reliability. The core failover service is well-structured. However, I've identified a critical issue in how non-retriable upstream errors are handled in the API endpoints, which currently results in incorrect 502 error responses instead of propagating the original error. Additionally, there's significant code duplication across the completions, messages, and responses API endpoints that should be refactored into a generic handler to improve maintainability. I've also noted a minor duplication in api-helpers.ts. Addressing these points will make the implementation more robust and easier to maintain.

Comment thread backend/src/api/v1/completions.ts Outdated
Comment thread backend/src/api/v1/messages.ts Outdated
Comment thread backend/src/api/v1/responses.ts Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/src/api/v1/completions.ts (1)

200-228: 无 chunk 时先记为 completed 会产生重复/错误日志

若 upstream 没有任何 chunk,当前实现会先记录 completed,再抛错记录 failed,导致重复记录/指标污染。建议把“无 chunk”检查放在完成记录之前。

🛠️ 建议修复
-    // Finalize completion record
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
     completion.completion = [
       {
         role: undefined,
         content:
           (thinkingParts.length > 0
             ? `<think>${thinkingParts.join("")}</think>\n`
             : "") + textParts.join(""),
       },
     ];
@@
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
backend/src/api/v1/messages.ts (1)

257-285: 无 chunk 时先记 completed 会导致双记录

若没有任何 chunk,会先记录 completed 再抛错记录 failed,日志与指标会被污染。建议把“无 chunk”检查前置。

🛠️ 建议修复
-    // Finalize completion record
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
     completion.completion = [
       {
         role: undefined,
         content:
           (thinkingParts.length > 0
             ? `<think>${thinkingParts.join("")}</think>\n`
             : "") + textParts.join(""),
       },
     ];
@@
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
backend/src/api/v1/responses.ts (1)

273-301: 无 chunk 时先记 completed 会导致重复/错误记录

如果 upstream 没有任何 chunk,当前实现会先记录 completed,再抛错记录 failed。建议把“无 chunk”检查放到完成记录之前。

🛠️ 建议修复
-    // Finalize completion record
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
     completion.completion = [
       {
         role: undefined,
         content:
           (thinkingParts.length > 0
             ? `<think>${thinkingParts.join("")}</think>\n`
             : "") + textParts.join(""),
       },
     ];
@@
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
🧹 Nitpick comments (1)
backend/src/utils/api-helpers.ts (1)

124-166: 建议 selectModel 复用 filterCandidates,避免重复分支

当前过滤与 warning 逻辑重复,后续容易产生行为分叉;直接复用可保持一致性。

♻️ 建议重构
-  // Filter by target provider if specified
-  let candidates = modelsWithProviders;
-  if (targetProvider) {
-    const filtered = modelsWithProviders.filter(
-      (mp) => mp.provider.name === targetProvider,
-    );
-    if (filtered.length > 0) {
-      candidates = filtered;
-    } else {
-      logger.warn(
-        `Provider '${targetProvider}' does not offer requested model, falling back to available providers`,
-      );
-    }
-  }
+  // Filter by target provider if specified
+  const candidates = filterCandidates(modelsWithProviders, targetProvider);
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae4895b and f0a7b47.

📒 Files selected for processing (5)
  • backend/src/api/v1/completions.ts
  • backend/src/api/v1/messages.ts
  • backend/src/api/v1/responses.ts
  • backend/src/services/failover.ts
  • backend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (1)
backend/src/utils/api-helpers.ts (2)
backend/src/utils/model.ts (1)
  • ModelWithProvider (11-14)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
🔇 Additional comments (6)
backend/src/services/failover.ts (2)

86-98: 指数退避 + 抖动实现清晰

实现直观、可调,没发现问题。


251-288: 重试失败响应前应释放连接资源,避免连接池枯竭

在重试或切换 provider 的循环中,若响应体未被消费或显式释放,可能导致底层连接被占用、无法返回连接池,最终造成连接池枯竭。在进入下一次尝试前应确保清理前次响应;需确认此代码运行环境(Node.js fetch/undici、浏览器或其他客户端)及响应体是否在他处被消费。

🛠️ 建议修复
         logger.warn("Request failed with HTTP error", {
           provider: provider.provider.name,
           status: response.status,
           retriable,
           attempt: totalAttempts,
         });
+
+        if (retriable) {
+          await response.body?.cancel();
+        }
 
         // If not retriable, return the response (let caller handle the error)
         if (!retriable) {
backend/src/utils/api-helpers.ts (1)

82-110: 候选 provider 过滤封装得很清晰

便于 failover 与单次选择共享同一入口。

backend/src/api/v1/completions.ts (1)

300-322: 候选过滤 + 加权选择接入 failover 的流程很顺

链路清晰,便于后续扩展与调参。

backend/src/api/v1/messages.ts (1)

359-382: 候选过滤 + 多 provider 选择接入自然

逻辑清晰,便于 failover 顺序控制。

backend/src/api/v1/responses.ts (1)

375-398: 候选过滤 + 加权选择接入合理

failover 入口更清晰,可读性不错。

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

…quests

Implement automatic failover capability that switches to alternate providers
when a request fails with retriable errors (5xx, 429, network timeouts).

Key features:
- Failover service with configurable retry behavior
- Exponential backoff with jitter for retries
- Same-provider retries for transient errors (429)
- Cross-provider failover for persistent failures
- Weighted random selection for failover candidates
- Detailed error logging with all attempted providers

Also includes the weighted load balancing fix from PR #49:
- Implement proper weighted random selection in selectModel()
- Add filterCandidates() helper for failover use

Affected endpoints:
- /v1/chat/completions
- /v1/messages
- /v1/responses

Configuration defaults:
- maxProviderAttempts: 3
- sameProviderRetries: 1
- retriableStatusCodes: [429, 500, 502, 503, 504]
- timeoutMs: 120000 (2 minutes)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pescn pescn force-pushed the feat/automatic-failover branch from f0a7b47 to cd819f0 Compare January 19, 2026 21:28
@pescn
Copy link
Copy Markdown
Contributor Author

pescn commented Jan 19, 2026

/gemini review

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/src/api/v1/responses.ts`:
- Around line 546-549: The SSE error in the catch block currently yields only a
data line and uses the wrong field names; update the catch handler that sets
set.status and yields the error stream so it emits an SSE error event line
("event: error") followed by a data line whose JSON uses the OpenAI fields:
{"type":"error","error":{"code":"internal_error","message":"Stream processing
error","param":null,"help_url":null}}; keep set.status = 500 and ensure the
yield includes the "event: error" prefix, the correctly structured JSON payload,
and the terminating double newline to conform to SSE.

In `@backend/src/services/failover.ts`:
- Around line 142-161: The fetchWithTimeout wrapper needs to ensure
timeout-triggered aborts are classified as retriable: catch errors from fetch
inside fetchWithTimeout (or have fetchWithTimeout rethrow) and, when the error
is due to the AbortController (e.g., error.name === "AbortError" or the
controller.signal.aborted), convert or annotate that error so
isRetriableNetworkError can detect it as a timeout (for example by throwing a
dedicated TimeoutError or appending "timeout" to the error message); update
fetchWithTimeout (and any callers) to surface that timeout-specific error
instead of leaving the raw AbortError that lacks "timeout" in its message.
♻️ Duplicate comments (2)
backend/src/api/v1/messages.ts (1)

400-409: 与 completions.ts 相同的浅拷贝问题

此处 buildRequestForProvider 使用相同的浅拷贝模式,存在与 completions.ts 相同的潜在问题。

backend/src/api/v1/responses.ts (1)

416-425: 与其他 API 文件相同的浅拷贝模式

此处 buildRequestForProvider 使用与 completions.tsmessages.ts 相同的浅拷贝模式。已在 completions.ts 中标记此问题。

🧹 Nitpick comments (5)
backend/src/services/failover.ts (2)

89-98: 退避延迟计算正确,但抖动仅增加延迟

当前实现中,抖动值始终为正数(cappedDelay * jitterFactor * Math.random()),这意味着实际延迟总是 >= cappedDelay。更常见的做法是使用双向抖动,让延迟可以在基准值上下浮动。

当前实现可以正常工作,但如果希望更均匀地分散重试请求,可以考虑:

♻️ 可选:使用双向抖动
 export function calculateBackoffDelay(
   attempt: number,
   config: FailoverConfig,
 ): number {
   const exponentialDelay =
     config.baseDelayMs * Math.pow(config.exponentialBase, attempt);
   const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
-  const jitter = cappedDelay * config.jitterFactor * Math.random();
-  return Math.floor(cappedDelay + jitter);
+  // 双向抖动:延迟可以在 cappedDelay ± (cappedDelay * jitterFactor) 范围内
+  const jitter = cappedDelay * config.jitterFactor * (Math.random() * 2 - 1);
+  return Math.floor(Math.max(0, cappedDelay + jitter));
 }

192-200: 提供商选择逻辑存在潜在的低效问题

当前实现在每次迭代时使用 candidates.find() 从数组开头搜索未尝试的提供商。由于候选者已经按权重随机排序,这在功能上是正确的,但每次调用 find() 都是 O(n) 操作。

对于少量候选者(通常 3 个)这不是问题,但代码可以更清晰地表达意图:直接按顺序遍历候选者数组。

♻️ 可选:简化提供商遍历逻辑
-  for (let providerIndex = 0; providerIndex < maxProviders; providerIndex++) {
-    // Select next provider (skip already tried ones)
-    const provider = candidates.find(
-      (c) => !triedProviders.has(c.provider.id),
-    );
-    if (!provider) {
-      logger.debug("No more providers to try");
-      break;
-    }
+  for (const provider of candidates.slice(0, maxProviders)) {
+    if (triedProviders.has(provider.provider.id)) {
+      continue;
+    }
backend/src/utils/api-helpers.ts (1)

83-111: 新增 filterCandidates 函数逻辑正确

该函数为故障转移场景提供候选过滤,当指定的提供商不可用时会回退到所有可用提供商并记录警告。

注意:此函数与 selectModel 函数(第 127-138 行)存在重复的过滤逻辑。如果 selectModel 在其他地方仍被使用,可以考虑让它内部调用 filterCandidates 来减少重复。

♻️ 可选:复用 filterCandidates 减少重复
 export function selectModel(
   modelsWithProviders: ModelWithProvider[],
   targetProvider?: string,
 ): ModelWithProvider | null {
   if (modelsWithProviders.length === 0) {
     return null;
   }

-  // Filter by target provider if specified
-  let candidates = modelsWithProviders;
-  if (targetProvider) {
-    const filtered = modelsWithProviders.filter(
-      (mp) => mp.provider.name === targetProvider,
-    );
-    if (filtered.length > 0) {
-      candidates = filtered;
-    } else {
-      logger.warn(
-        `Provider '${targetProvider}' does not offer requested model, falling back to available providers`,
-      );
-    }
-  }
+  // Filter by target provider if specified (reuse filterCandidates)
+  const candidates = filterCandidates(modelsWithProviders, targetProvider);

   // Single candidate, return directly
   if (candidates.length === 1) {
backend/src/api/v1/completions.ts (2)

589-592: 流式处理错误的 catch 块未记录错误原因

processStreamingResponse 抛出异常时,catch 块使用空参数 catch {},丢失了错误信息。虽然 processStreamingResponse 内部已经记录了日志,但外层 catch 块不知道是什么类型的错误。

建议至少记录一下这是来自流处理的错误:

♻️ 可选:在 catch 块中添加日志
-        } catch {
+        } catch (error) {
+          logger.debug("Stream processing threw error", error);
           set.status = 500;
           yield JSON.stringify({ error: "Stream processing error" });
         }

481-554: 建议:提取故障转移错误处理的公共逻辑

流式和非流式路径中的 !result.success 错误处理逻辑几乎相同(构建 completion、记录日志、区分非可重试错误和耗尽场景)。同样的模式在 messages.tsresponses.ts 中也重复出现。

考虑将这些公共逻辑提取为辅助函数,以减少代码重复并便于维护。

♻️ 建议:创建通用的故障转移错误处理函数

可以在 failover.tsapi-helpers.ts 中添加类似以下的辅助函数:

export interface FailoverErrorHandlerOptions {
  result: FailoverResult<Response>;
  completion: Completion;
  bearer: string;
  set: { status: number };
  errorType: 'streaming' | 'non-streaming';
  formatError: (msg: string, type: string) => string;
}

export async function handleFailoverError(opts: FailoverErrorHandlerOptions): Promise<string | null> {
  // 统一处理非可重试错误转发和耗尽场景
  // 返回需要 yield 的响应字符串,或 null 表示需要继续处理
}

Also applies to: 601-674

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0a7b47 and cd819f0.

📒 Files selected for processing (5)
  • backend/src/api/v1/completions.ts
  • backend/src/api/v1/messages.ts
  • backend/src/api/v1/responses.ts
  • backend/src/services/failover.ts
  • backend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (3)
backend/src/api/v1/completions.ts (4)
backend/src/services/failover.ts (3)
  • FailoverConfig (18-37)
  • selectMultipleCandidates (377-413)
  • executeWithFailover (179-348)
backend/src/utils/api-helpers.ts (1)
  • filterCandidates (87-111)
backend/src/utils/model.ts (1)
  • ModelWithProvider (11-14)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
backend/src/api/v1/responses.ts (4)
backend/src/services/failover.ts (3)
  • FailoverConfig (18-37)
  • selectMultipleCandidates (377-413)
  • executeWithFailover (179-348)
backend/src/utils/api-helpers.ts (1)
  • filterCandidates (87-111)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
backend/src/adapters/index.ts (1)
  • getUpstreamAdapter (85-92)
backend/src/utils/api-helpers.ts (2)
backend/src/utils/model.ts (1)
  • ModelWithProvider (11-14)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
🔇 Additional comments (8)
backend/src/services/failover.ts (2)

1-57: 配置和类型定义设计合理

故障转移配置接口设计完整,包含了必要的参数:最大提供商尝试次数、同提供商重试次数、可重试状态码、退避策略参数等。默认配置值合理,特别是包含了常见的 5xx 错误和网络错误码。


377-413: 加权随机选择实现正确

该函数正确实现了无放回的加权随机选择,用于按权重优先级选择多个候选提供商。实现逻辑清晰,包括边界检查(候选数量小于等于请求数量时直接返回副本)。

backend/src/api/v1/completions.ts (2)

358-364: 故障转移配置定义合理

配置值适合 completions API 的使用场景,2 分钟超时对于长时间运行的 LLM 请求是合理的。


481-554: 错误处理逻辑正确,已解决之前的审查意见

代码正确区分了两种失败场景:

  1. 非可重试的上游错误(result.response 存在):转发原始状态码和响应体
  2. 所有提供商耗尽(result.response 为 undefined):返回 502

这解决了之前审查中指出的问题。

backend/src/api/v1/messages.ts (2)

303-309: Messages API 故障转移配置与 Completions 一致

配置值与 completions API 相同,保持了一致性。


530-533: 流式错误格式正确使用 SSE event 格式

使用 event: error\ndata: ... 格式符合 Anthropic API 的 SSE 错误响应规范。

backend/src/api/v1/responses.ts (2)

319-325: Responses API 故障转移配置与其他 API 一致

配置值与 completions 和 messages API 相同,保持了系统一致性。


436-507: 故障转移错误处理逻辑正确实现

completions.tsmessages.ts 一样,正确区分了非可重试上游错误和所有提供商耗尽的场景,解决了之前审查中提出的问题。

三个 API 文件之间存在大量重复代码,建议后续迭代中考虑提取公共的故障转移错误处理逻辑。

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread backend/src/api/v1/responses.ts Outdated
Comment thread backend/src/services/failover.ts
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust failover and retry mechanism for multi-provider requests, which is a great enhancement for reliability. The core logic in the new failover.ts service is well-structured, implementing exponential backoff with jitter and configurable retry policies.

My review focuses on improving maintainability by reducing code duplication in the API endpoint files and fixing a potential database integrity issue related to model ID handling on failure.

Key feedback points:

  • Code Duplication: The error handling logic for failover results is duplicated across streaming and non-streaming paths in completions.ts, messages.ts, and responses.ts. Refactoring this into a shared helper function would significantly improve maintainability.
  • Database Integrity: There's an issue where a fallback model ID of 0 is used when a request fails before a provider is chosen. This is incorrect for a database identity column and should be undefined instead.
  • Efficiency: A minor efficiency improvement is suggested for the weighted random selection algorithm.

Overall, this is a solid implementation of a complex feature. Addressing the feedback will make the code more robust and easier to maintain.

Comment thread backend/src/api/v1/messages.ts
Comment thread backend/src/api/v1/messages.ts Outdated
Comment thread backend/src/api/v1/responses.ts Outdated
Comment thread backend/src/api/v1/completions.ts
Comment thread backend/src/api/v1/completions.ts Outdated
Comment thread backend/src/api/v1/responses.ts Outdated
Comment thread backend/src/api/v1/responses.ts Outdated
Comment thread backend/src/api/v1/messages.ts Outdated
Comment thread backend/src/api/v1/completions.ts Outdated
Comment thread backend/src/services/failover.ts
- Change modelId fallback from ?? 0 to undefined to avoid database integrity issues
- Add AbortError handling in isRetriableNetworkError for timeout scenarios
- Fix SSE error format in responses.ts to match OpenAI Responses API spec
- Add error capture to catch blocks for better logging
- Optimize selectMultipleCandidates by calculating totalWeight once

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pescn
Copy link
Copy Markdown
Contributor Author

pescn commented Jan 19, 2026

/gemini review

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/src/api/v1/responses.ts (1)

273-301: 无 chunk 时会先记录为完成,导致重复/错误日志

当前在 isFirstChunk 仍为 true 时,先写入“completed”记录再抛错,会产生一条成功记录+一条失败记录,且可能影响后续统计。建议先检查再落库/记账。

🛠️ 建议修复
-    // Finalize completion record
-    completion.completion = [
-      {
-        role: undefined,
-        content:
-          (thinkingParts.length > 0
-            ? `<think>${thinkingParts.join("")}</think>\n`
-            : "") + textParts.join(""),
-      },
-    ];
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
+    completion.completion = [
+      {
+        role: undefined,
+        content:
+          (thinkingParts.length > 0
+            ? `<think>${thinkingParts.join("")}</think>\n`
+            : "") + textParts.join(""),
+      },
+    ];
...
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
backend/src/api/v1/messages.ts (1)

257-285: 无 chunk 时会先记录为完成,导致重复/错误日志

这里同样在 isFirstChunktrue 时先写“completed”再抛错,会产生重复/错误完成记录。建议先检查再落库。

🛠️ 建议修复
-    // Finalize completion record
-    completion.completion = [
-      {
-        role: undefined,
-        content:
-          (thinkingParts.length > 0
-            ? `<think>${thinkingParts.join("")}</think>\n`
-            : "") + textParts.join(""),
-      },
-    ];
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
+    completion.completion = [
+      {
+        role: undefined,
+        content:
+          (thinkingParts.length > 0
+            ? `<think>${thinkingParts.join("")}</think>\n`
+            : "") + textParts.join(""),
+      },
+    ];
...
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
backend/src/api/v1/completions.ts (1)

309-340: 无 chunk 时会先记录为完成,导致重复/错误日志

在流式无任何 chunk 的场景里,会先写入成功记录再抛错,造成重复与误导性统计。建议先检查再记录。

🛠️ 建议修复
-    // Finalize completion record
-    const contentText =
+    // Handle case where no chunks were received
+    if (isFirstChunk) {
+      throw new Error("No chunk received from upstream");
+    }
+
+    // Finalize completion record
+    const contentText =
       (thinkingParts.length > 0
         ? `<think>${thinkingParts.join("")}</think>\n`
         : "") + textParts.join("");
...
-    // Handle case where no chunks were received
-    if (isFirstChunk) {
-      throw new Error("No chunk received from upstream");
-    }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd819f0 and d4e3e9e.

📒 Files selected for processing (4)
  • backend/src/api/v1/completions.ts
  • backend/src/api/v1/messages.ts
  • backend/src/api/v1/responses.ts
  • backend/src/services/failover.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/src/services/failover.ts
🧰 Additional context used
🧬 Code graph analysis (2)
backend/src/api/v1/responses.ts (5)
backend/src/services/failover.ts (3)
  • FailoverConfig (18-37)
  • selectMultipleCandidates (382-420)
  • executeWithFailover (184-353)
backend/src/utils/api-helpers.ts (1)
  • filterCandidates (87-111)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
backend/src/adapters/index.ts (1)
  • getUpstreamAdapter (85-92)
backend/src/utils/redisClient.ts (1)
  • set (71-85)
backend/src/api/v1/completions.ts (4)
backend/src/services/failover.ts (3)
  • FailoverConfig (18-37)
  • selectMultipleCandidates (382-420)
  • executeWithFailover (184-353)
backend/src/utils/api-helpers.ts (1)
  • filterCandidates (87-111)
backend/src/utils/model.ts (1)
  • ModelWithProvider (11-14)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
🔇 Additional comments (12)
backend/src/api/v1/responses.ts (4)

113-161: buildCompletionRecord 的 modelId 可选化处理合理

请求模型 ID 缺省时保持 undefined,记录结构也更稳健。


167-207: 非流式处理的耗时与 token 记账逻辑清晰

begin 传入并统一计算 ttft/duration 与后置消耗,整体很顺。


319-325: Failover 配置集中化清晰

将响应端的重试/超时参数下沉到常量,有利于统一治理。


436-507: 区分不可重试错误与耗尽失败的处理方式正确

现在会转发上游不可重试错误并保留状态码,和“全部失败”分开处理,方向很好。

Also applies to: 559-629

backend/src/api/v1/messages.ts (4)

124-151: buildCompletionRecord 的可选 modelId 设计合理

对于未命中模型 ID 的场景更安全,记录结构一致。


156-196: 非流处理逻辑清晰,计时点一致

begin 贯穿到记录与 token 计费,职责划分清晰。


303-309: Failover 配置集中化清晰

便于消息端统一调整重试与超时策略。


420-491: 不可重试错误直返 + 耗尽失败独立处理很到位

错误分流逻辑更准确,避免一概 502。

Also applies to: 543-614

backend/src/api/v1/completions.ts (4)

110-137: buildCompletionRecord 的 modelId 可选化处理正确

避免无效 ID 写入,结构更安全。


142-186: 非流式处理的记录与计时逻辑清楚

ttft/duration 与 token 计费的时机一致。


358-364: Failover 配置集中化清晰

完成端口的重试/超时策略更易维护。


481-554: 不可重试上游错误转发与耗尽失败区分明确

避免错误被统一包装成 502,行为更符合预期。

Also applies to: 602-675

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust automatic failover and retry mechanism for multi-provider requests, which is a significant improvement for the service's reliability. The new failover service is well-structured with configurable exponential backoff, jitter, and timeout support. The refactoring of the API endpoints (completions, messages, responses) to use this new service is extensive and separates concerns nicely. My main feedback focuses on reducing significant code duplication in the error handling paths within the API endpoints and improving the clarity of the provider iteration logic in the failover service. Addressing these points will enhance the maintainability of this new, powerful feature.

Comment thread backend/src/api/v1/completions.ts
Comment thread backend/src/api/v1/messages.ts
Comment thread backend/src/api/v1/responses.ts
Comment thread backend/src/services/failover.ts
Comment thread backend/src/services/failover.ts Outdated
pescn and others added 2 commits January 20, 2026 05:51
…cation

- Add processFailoverError helper function in api-helpers.ts
- Refactor completions.ts, messages.ts, responses.ts to use shared helper
- Reduces ~120 lines of duplicated error handling code
- Maintains same behavior: forward non-retriable errors, return 502 on exhaustion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use more specific error message patterns in isRetriableNetworkError
  to avoid false positives (e.g., "network error" instead of "network")
- Simplify provider iteration loop using for...of with entries()
  since candidates are already unique and ordered from selectMultipleCandidates
- Remove redundant triedProviders Set

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pescn
Copy link
Copy Markdown
Contributor Author

pescn commented Jan 19, 2026

/gemini review

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/src/api/v1/messages.ts`:
- Around line 485-489: The SSE error event currently emits an
Anthropic-incompatible error type ("api_error"); update the SSE payload in the
catch block where logger.error, set.status and the yield are used so the inner
error.type is "server_error" instead of "api_error" (i.e., change the JSON
payload produced by the yield statement to use "server_error" for stream
processing errors) while keeping the rest of the structure and status handling
intact.
🧹 Nitpick comments (1)
backend/src/utils/api-helpers.ts (1)

251-252: 建议将 import 语句移至文件顶部

当前 import 语句位于文件中间(第 251-252 行),这不符合 TypeScript/JavaScript 的常规代码组织习惯。建议将这些 import 移至文件顶部与其他 import 语句放在一起,以提高代码可读性和一致性。

♻️ 建议的修改

将以下 import 移至文件顶部:

import type { FailoverResult, FailoverError } from "@/services/failover";
import { addCompletions, type Completion } from "@/utils/completions";
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4e3e9e and 4392023.

📒 Files selected for processing (5)
  • backend/src/api/v1/completions.ts
  • backend/src/api/v1/messages.ts
  • backend/src/api/v1/responses.ts
  • backend/src/services/failover.ts
  • backend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (1)
backend/src/utils/api-helpers.ts (5)
backend/src/utils/model.ts (1)
  • ModelWithProvider (11-14)
backend/src/adapters/types.ts (1)
  • ModelWithProvider (313-316)
backend/src/services/failover.ts (2)
  • FailoverError (63-71)
  • FailoverResult (73-80)
backend/src/db/index.ts (1)
  • Completion (33-33)
backend/src/utils/completions.ts (1)
  • addCompletions (29-64)
🔇 Additional comments (19)
backend/src/utils/api-helpers.ts (3)

83-111: LGTM!

filterCandidates 函数实现正确,包含合理的回退逻辑:当指定的 provider 没有匹配的候选项时,会记录警告并返回所有候选项,确保服务可用性。


304-316: fire-and-forget 模式下的错误处理

addCompletions 调用使用 .catch() 进行 fire-and-forget 模式,这在日志记录场景中是合理的设计。但需要确认这种模式符合预期——如果日志记录失败,请求仍会正常返回,只是在 catch 块中记录错误。

当前实现是可接受的,因为日志记录失败不应阻塞主请求流程。

Also applies to: 332-344


288-350: processFailoverError 实现良好

该函数有效地集中处理了 failover 错误,解决了之前审查中提到的代码重复问题。它正确区分了上游错误(转发原始响应)和 failover 耗尽(返回 502)两种场景。

backend/src/api/v1/messages.ts (3)

304-310: LGTM!

Failover 配置合理,2 分钟超时对于 messages API 是合适的。


421-446: 错误处理已正确重构

之前审查中指出的重复代码问题已通过 processFailoverError 辅助函数解决。现在流式和非流式路径都正确区分了上游错误(转发原始响应)和 failover 耗尽(返回 502)两种情况。

Also applies to: 498-523


283-286: 流式处理中无数据时抛出错误是正确的设计

当没有收到任何 chunk 时抛出错误,并在 catch 块中重新抛出,这允许调用者(主处理器)捕获并返回适当的错误响应。这是合理的错误传播模式。

Also applies to: 299-301

backend/src/services/failover.ts (5)

113-139: 网络错误重试判断已改进

已正确处理 AbortError(第 122-124 行),解决了之前审查中关于超时中止错误可能未被正确分类的问题。错误消息匹配也使用了更具体的短语(如 "timed out", "connection reset")来减少误判。


389-411: 加权选择算法已优化

totalWeight 现在只计算一次并增量更新(第 387、409 行),解决了之前审查中关于效率的问题。默认选择最后一个元素作为浮点精度边界情况的回退(第 394 行)也是正确的。

第 397 行的 if (item) 检查虽然在理论上是多余的(因为 remaining 数组不应包含 undefined),但保留它作为防御性编程也是可以接受的。


201-327: LGTM!

executeWithFailover 实现清晰,使用 for...ofentries() 迭代(如之前审查建议),正确处理了:

  • 同提供商重试(用于 429 等瞬态错误)
  • 跨提供商故障转移
  • 指数退避与抖动
  • 可重试与不可重试错误的区分

151-170: fetchWithTimeout 实现正确

使用 AbortController 实现超时,并在 finally 块中清理 timeout,防止内存泄漏。


86-98: 退避延迟计算建议验证

当前实现使用 attempt 参数计算指数退避:baseDelayMs * Math.pow(exponentialBase, attempt)

第 214 行调用时使用 totalAttempts - 1,意味着第一次重试(totalAttempts=2)的 attempt 为 1,延迟为 100 * 2^1 = 200ms。这是合理的,但建议确认这符合预期的退避曲线。

backend/src/api/v1/completions.ts (4)

359-365: LGTM!

Completions API 的 failover 配置与 messages API 保持一致,2 分钟超时对于 chat completions 是合理的。


482-510: 错误处理已正确重构

流式和非流式路径的错误处理已通过 processFailoverError 统一,正确区分上游错误和 failover 耗尽。modelId 回退逻辑使用 candidates[0]?.model.id(第 485、560 行),解决了之前 ?? 0 的问题。

Also applies to: 557-585


454-463: 请求构建函数设计合理

buildRequestForProvider 函数为每个候选 provider 构建请求,正确处理了 remoteId 回退到 systemName 的逻辑。使用浅拷贝 { ...internalRequest } 来避免修改原始请求对象。


544-548: 流式处理中的错误需遵循 SSE 格式规范

根据 OpenAI 文档,流式传输中的错误分为两种:

  • 流开始前的错误:返回 HTTP 非 2xx 状态码 + JSON 错误体
  • 流传输中的错误:必须以 SSE 事件格式发送,包含 data: 前缀和 \n\n 结尾

当前实现使用 yield 返回 JSON.stringify({ error: "Stream processing error" }),缺少 SSE 格式包装。根据 OpenAI 规范,中途错误应包含错误对象、choices 数组,以及 finish_reason: "error" 标识流终止:

data: {"id":"...","object":"chat.completion.chunk","created":...,"model":"...","error":{"code":"...","message":"..."},"choices":[{"index":0,"delta":{"content":""},"finish_reason":"error"}]}

此外,set.status = 500 在流已启动后可能无法生效,应考虑在流开始前进行错误验证或在中途错误中用 finish_reason 通知客户端。

backend/src/api/v1/responses.ts (4)

320-326: LGTM!

Responses API 的 failover 配置与其他 API 保持一致。


501-505: 流式错误格式已修复

之前审查指出的 SSE 错误格式问题已解决。现在正确使用 event: error 前缀,并且 JSON 结构符合 OpenAI Responses API 规范(包含 code, message, param, help_url 字段)。


437-462: 错误处理已正确重构

与 completions.ts 和 messages.ts 一样,错误处理通过 processFailoverError 统一处理,正确区分上游错误和 failover 耗尽。

Also applies to: 514-540


417-426: 请求构建函数与其他 API 保持一致

buildRequestForProvider 实现与 completions.ts 和 messages.ts 中的相同,保持了代码一致性。

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread backend/src/api/v1/messages.ts
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new failover service (services/failover.ts) to handle upstream request retries and provider failover with exponential backoff and jitter. This service is integrated into the completions, messages, and responses API handlers (api/v1/*.ts). The integration involves refactoring the API handlers to use the new executeWithFailover function, replacing the previous selectModel logic with filterCandidates and selectMultipleCandidates, and centralizing error processing with a new processFailoverError utility function (utils/api-helpers.ts). The handleNonStreamingRequest and handleStreamingRequest functions in the API handlers were also refactored into processNonStreamingResponse and processStreamingResponse to focus solely on processing successful responses, with initial fetch and error handling now managed by the failover service. Review comments highlight that despite these changes, there is still significant code duplication in the API handlers between streaming and non-streaming request paths for failover result processing, suggesting further refactoring. Additionally, a discrepancy was noted in the FailoverErrorResult type, which includes an 'internal_error' case that is not currently returned by the processFailoverError function.

Comment thread backend/src/api/v1/messages.ts
Comment thread backend/src/api/v1/responses.ts
Comment thread backend/src/utils/api-helpers.ts
- Fix SSE error type in messages.ts to use "server_error" per Anthropic spec
- Remove unused "internal_error" case from FailoverErrorResult type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pescn
Copy link
Copy Markdown
Contributor Author

pescn commented Jan 19, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive failover and retry mechanism for upstream API requests across completions.ts, messages.ts, and responses.ts. A new failover.ts service was added, which handles weighted random selection of model providers, exponential backoff, and retries for transient network and HTTP errors (e.g., 429, 5xx). The existing API routes were refactored to integrate this service, extracting the core fetch and initial error handling into the executeWithFailover function. New utility functions like filterCandidates and processFailoverError were added to api-helpers.ts to support this. Review comments suggest enhancing the isRetriableNetworkError function in failover.ts to inspect error.cause for more reliable network error classification, as the current string-based checks are considered too broad. Additionally, a comment in executeWithFailover was noted as inconsistent with the implementation, which correctly returns a FailoverResult object rather than throwing an error for non-retriable HTTP errors.

Comment on lines +127 to +138
const message = error.message.toLowerCase();
return (
message.includes("timed out") ||
message.includes("timeout") ||
message.includes("the operation was aborted") ||
message.includes("connection reset") ||
message.includes("connection refused") ||
message.includes("network error") ||
message.includes("network request failed") ||
message.includes("socket hang up") ||
message.includes("fetch failed")
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The string-based checks for retriable network errors are quite broad (e.g., "network error", "fetch failed") and could lead to false positives, causing retries on non-transient errors. Modern fetch implementations, especially in Node.js, often include a cause property on the thrown TypeError which contains the underlying system error.

Checking (error as any).cause could provide more reliable error details than parsing the message string. I suggest enhancing this function to inspect error.cause if it exists for more robust error classification.

attempt: totalAttempts,
});

// If not retriable, throw immediately
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment on this line is inconsistent with the implementation. The code correctly returns a FailoverResult object, which aligns with the handling of non-retriable HTTP errors. Throwing an error here would create an inconsistency in the function's return behavior. Please update the comment to reflect the actual logic.

Suggested change
// If not retriable, throw immediately
// If not retriable, return a failure result immediately

@pescn pescn merged commit 5969191 into main Jan 19, 2026
2 checks passed
@pescn pescn deleted the feat/automatic-failover branch January 24, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant