feat(backend): add automatic failover and retry for multi-provider requests#50
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthrough本次更改引入集中式故障转移引擎并将其集成到 completions、messages、responses 三个 v1 API 路径;重构请求构建与流/非流处理器以支持跨多提供商的重试、回退与超时控制;新增候选过滤与多候选选择逻辑并统一错误/完成记录与时序度量。 Changes
Sequence DiagramssequenceDiagram
participant Client
participant API as "API (completions/messages/responses)"
participant Failover as "Failover 引擎"
participant ProviderA as "提供商 A"
participant ProviderB as "提供商 B"
participant Logging as "日志 / 完成记录"
Client->>API: 发送请求(含 model/provider 目标)
API->>API: 捕获 begin 时间;filterCandidates -> selectMultipleCandidates
API->>Failover: executeWithFailover(候选, buildRequestPerProvider)
Failover->>ProviderA: 构建并发送上游请求
ProviderA-->>Failover: 响应(成功或错误)
alt 可重试失败
Failover->>Failover: 退避等待(exponential backoff)
Failover->>ProviderB: 构建并发送上游请求
ProviderB-->>Failover: 响应
end
alt 成功响应
Failover-->>API: FailoverResult(success, response)
API->>API: processNonStreamingResponse / processStreamingResponse
API->>Logging: 记录完成成功与时延
API-->>Client: 返回处理后响应
else 所有候选耗尽
Failover-->>API: FailoverResult(failure, errors)
API->>Logging: 记录失败详情
API-->>Client: 返回标准化上游失败(502 / 错误 payload)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Summary of ChangesHello @pescn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the backend's fault tolerance and reliability by introducing an automatic failover and retry mechanism for multi-provider requests. It ensures that API calls to external models are more resilient to transient network issues and provider outages, gracefully attempting alternative providers or retrying failed requests with intelligent backoff. This change also refines the existing weighted model selection to work seamlessly with the new failover capabilities. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a robust failover and retry mechanism, which is a great enhancement for reliability. The core failover service is well-structured. However, I've identified a critical issue in how non-retriable upstream errors are handled in the API endpoints, which currently results in incorrect 502 error responses instead of propagating the original error. Additionally, there's significant code duplication across the completions, messages, and responses API endpoints that should be refactored into a generic handler to improve maintainability. I've also noted a minor duplication in api-helpers.ts. Addressing these points will make the implementation more robust and easier to maintain.
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/src/api/v1/completions.ts (1)
200-228: 无 chunk 时先记为 completed 会产生重复/错误日志若 upstream 没有任何 chunk,当前实现会先记录 completed,再抛错记录 failed,导致重复记录/指标污染。建议把“无 chunk”检查放在完成记录之前。
🛠️ 建议修复
- // Finalize completion record + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record completion.completion = [ { role: undefined, content: (thinkingParts.length > 0 ? `<think>${thinkingParts.join("")}</think>\n` : "") + textParts.join(""), }, ]; @@ - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }backend/src/api/v1/messages.ts (1)
257-285: 无 chunk 时先记 completed 会导致双记录若没有任何 chunk,会先记录 completed 再抛错记录 failed,日志与指标会被污染。建议把“无 chunk”检查前置。
🛠️ 建议修复
- // Finalize completion record + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record completion.completion = [ { role: undefined, content: (thinkingParts.length > 0 ? `<think>${thinkingParts.join("")}</think>\n` : "") + textParts.join(""), }, ]; @@ - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }backend/src/api/v1/responses.ts (1)
273-301: 无 chunk 时先记 completed 会导致重复/错误记录如果 upstream 没有任何 chunk,当前实现会先记录 completed,再抛错记录 failed。建议把“无 chunk”检查放到完成记录之前。
🛠️ 建议修复
- // Finalize completion record + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record completion.completion = [ { role: undefined, content: (thinkingParts.length > 0 ? `<think>${thinkingParts.join("")}</think>\n` : "") + textParts.join(""), }, ]; @@ - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }
🧹 Nitpick comments (1)
backend/src/utils/api-helpers.ts (1)
124-166: 建议 selectModel 复用 filterCandidates,避免重复分支当前过滤与 warning 逻辑重复,后续容易产生行为分叉;直接复用可保持一致性。
♻️ 建议重构
- // Filter by target provider if specified - let candidates = modelsWithProviders; - if (targetProvider) { - const filtered = modelsWithProviders.filter( - (mp) => mp.provider.name === targetProvider, - ); - if (filtered.length > 0) { - candidates = filtered; - } else { - logger.warn( - `Provider '${targetProvider}' does not offer requested model, falling back to available providers`, - ); - } - } + // Filter by target provider if specified + const candidates = filterCandidates(modelsWithProviders, targetProvider);
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/src/api/v1/completions.tsbackend/src/api/v1/messages.tsbackend/src/api/v1/responses.tsbackend/src/services/failover.tsbackend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (1)
backend/src/utils/api-helpers.ts (2)
backend/src/utils/model.ts (1)
ModelWithProvider(11-14)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)
🔇 Additional comments (6)
backend/src/services/failover.ts (2)
86-98: 指数退避 + 抖动实现清晰实现直观、可调,没发现问题。
251-288: 重试失败响应前应释放连接资源,避免连接池枯竭在重试或切换 provider 的循环中,若响应体未被消费或显式释放,可能导致底层连接被占用、无法返回连接池,最终造成连接池枯竭。在进入下一次尝试前应确保清理前次响应;需确认此代码运行环境(Node.js fetch/undici、浏览器或其他客户端)及响应体是否在他处被消费。
🛠️ 建议修复
logger.warn("Request failed with HTTP error", { provider: provider.provider.name, status: response.status, retriable, attempt: totalAttempts, }); + + if (retriable) { + await response.body?.cancel(); + } // If not retriable, return the response (let caller handle the error) if (!retriable) {backend/src/utils/api-helpers.ts (1)
82-110: 候选 provider 过滤封装得很清晰便于 failover 与单次选择共享同一入口。
backend/src/api/v1/completions.ts (1)
300-322: 候选过滤 + 加权选择接入 failover 的流程很顺链路清晰,便于后续扩展与调参。
backend/src/api/v1/messages.ts (1)
359-382: 候选过滤 + 多 provider 选择接入自然逻辑清晰,便于 failover 顺序控制。
backend/src/api/v1/responses.ts (1)
375-398: 候选过滤 + 加权选择接入合理failover 入口更清晰,可读性不错。
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
…quests Implement automatic failover capability that switches to alternate providers when a request fails with retriable errors (5xx, 429, network timeouts). Key features: - Failover service with configurable retry behavior - Exponential backoff with jitter for retries - Same-provider retries for transient errors (429) - Cross-provider failover for persistent failures - Weighted random selection for failover candidates - Detailed error logging with all attempted providers Also includes the weighted load balancing fix from PR #49: - Implement proper weighted random selection in selectModel() - Add filterCandidates() helper for failover use Affected endpoints: - /v1/chat/completions - /v1/messages - /v1/responses Configuration defaults: - maxProviderAttempts: 3 - sameProviderRetries: 1 - retriableStatusCodes: [429, 500, 502, 503, 504] - timeoutMs: 120000 (2 minutes) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
f0a7b47 to
cd819f0
Compare
|
/gemini review |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@backend/src/api/v1/responses.ts`:
- Around line 546-549: The SSE error in the catch block currently yields only a
data line and uses the wrong field names; update the catch handler that sets
set.status and yields the error stream so it emits an SSE error event line
("event: error") followed by a data line whose JSON uses the OpenAI fields:
{"type":"error","error":{"code":"internal_error","message":"Stream processing
error","param":null,"help_url":null}}; keep set.status = 500 and ensure the
yield includes the "event: error" prefix, the correctly structured JSON payload,
and the terminating double newline to conform to SSE.
In `@backend/src/services/failover.ts`:
- Around line 142-161: The fetchWithTimeout wrapper needs to ensure
timeout-triggered aborts are classified as retriable: catch errors from fetch
inside fetchWithTimeout (or have fetchWithTimeout rethrow) and, when the error
is due to the AbortController (e.g., error.name === "AbortError" or the
controller.signal.aborted), convert or annotate that error so
isRetriableNetworkError can detect it as a timeout (for example by throwing a
dedicated TimeoutError or appending "timeout" to the error message); update
fetchWithTimeout (and any callers) to surface that timeout-specific error
instead of leaving the raw AbortError that lacks "timeout" in its message.
♻️ Duplicate comments (2)
backend/src/api/v1/messages.ts (1)
400-409: 与 completions.ts 相同的浅拷贝问题此处
buildRequestForProvider使用相同的浅拷贝模式,存在与completions.ts相同的潜在问题。backend/src/api/v1/responses.ts (1)
416-425: 与其他 API 文件相同的浅拷贝模式此处
buildRequestForProvider使用与completions.ts和messages.ts相同的浅拷贝模式。已在completions.ts中标记此问题。
🧹 Nitpick comments (5)
backend/src/services/failover.ts (2)
89-98: 退避延迟计算正确,但抖动仅增加延迟当前实现中,抖动值始终为正数(
cappedDelay * jitterFactor * Math.random()),这意味着实际延迟总是 >=cappedDelay。更常见的做法是使用双向抖动,让延迟可以在基准值上下浮动。当前实现可以正常工作,但如果希望更均匀地分散重试请求,可以考虑:
♻️ 可选:使用双向抖动
export function calculateBackoffDelay( attempt: number, config: FailoverConfig, ): number { const exponentialDelay = config.baseDelayMs * Math.pow(config.exponentialBase, attempt); const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs); - const jitter = cappedDelay * config.jitterFactor * Math.random(); - return Math.floor(cappedDelay + jitter); + // 双向抖动:延迟可以在 cappedDelay ± (cappedDelay * jitterFactor) 范围内 + const jitter = cappedDelay * config.jitterFactor * (Math.random() * 2 - 1); + return Math.floor(Math.max(0, cappedDelay + jitter)); }
192-200: 提供商选择逻辑存在潜在的低效问题当前实现在每次迭代时使用
candidates.find()从数组开头搜索未尝试的提供商。由于候选者已经按权重随机排序,这在功能上是正确的,但每次调用find()都是 O(n) 操作。对于少量候选者(通常 3 个)这不是问题,但代码可以更清晰地表达意图:直接按顺序遍历候选者数组。
♻️ 可选:简化提供商遍历逻辑
- for (let providerIndex = 0; providerIndex < maxProviders; providerIndex++) { - // Select next provider (skip already tried ones) - const provider = candidates.find( - (c) => !triedProviders.has(c.provider.id), - ); - if (!provider) { - logger.debug("No more providers to try"); - break; - } + for (const provider of candidates.slice(0, maxProviders)) { + if (triedProviders.has(provider.provider.id)) { + continue; + }backend/src/utils/api-helpers.ts (1)
83-111: 新增filterCandidates函数逻辑正确该函数为故障转移场景提供候选过滤,当指定的提供商不可用时会回退到所有可用提供商并记录警告。
注意:此函数与
selectModel函数(第 127-138 行)存在重复的过滤逻辑。如果selectModel在其他地方仍被使用,可以考虑让它内部调用filterCandidates来减少重复。♻️ 可选:复用 filterCandidates 减少重复
export function selectModel( modelsWithProviders: ModelWithProvider[], targetProvider?: string, ): ModelWithProvider | null { if (modelsWithProviders.length === 0) { return null; } - // Filter by target provider if specified - let candidates = modelsWithProviders; - if (targetProvider) { - const filtered = modelsWithProviders.filter( - (mp) => mp.provider.name === targetProvider, - ); - if (filtered.length > 0) { - candidates = filtered; - } else { - logger.warn( - `Provider '${targetProvider}' does not offer requested model, falling back to available providers`, - ); - } - } + // Filter by target provider if specified (reuse filterCandidates) + const candidates = filterCandidates(modelsWithProviders, targetProvider); // Single candidate, return directly if (candidates.length === 1) {backend/src/api/v1/completions.ts (2)
589-592: 流式处理错误的 catch 块未记录错误原因当
processStreamingResponse抛出异常时,catch 块使用空参数catch {},丢失了错误信息。虽然processStreamingResponse内部已经记录了日志,但外层 catch 块不知道是什么类型的错误。建议至少记录一下这是来自流处理的错误:
♻️ 可选:在 catch 块中添加日志
- } catch { + } catch (error) { + logger.debug("Stream processing threw error", error); set.status = 500; yield JSON.stringify({ error: "Stream processing error" }); }
481-554: 建议:提取故障转移错误处理的公共逻辑流式和非流式路径中的
!result.success错误处理逻辑几乎相同(构建 completion、记录日志、区分非可重试错误和耗尽场景)。同样的模式在messages.ts和responses.ts中也重复出现。考虑将这些公共逻辑提取为辅助函数,以减少代码重复并便于维护。
♻️ 建议:创建通用的故障转移错误处理函数
可以在
failover.ts或api-helpers.ts中添加类似以下的辅助函数:export interface FailoverErrorHandlerOptions { result: FailoverResult<Response>; completion: Completion; bearer: string; set: { status: number }; errorType: 'streaming' | 'non-streaming'; formatError: (msg: string, type: string) => string; } export async function handleFailoverError(opts: FailoverErrorHandlerOptions): Promise<string | null> { // 统一处理非可重试错误转发和耗尽场景 // 返回需要 yield 的响应字符串,或 null 表示需要继续处理 }Also applies to: 601-674
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/src/api/v1/completions.tsbackend/src/api/v1/messages.tsbackend/src/api/v1/responses.tsbackend/src/services/failover.tsbackend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (3)
backend/src/api/v1/completions.ts (4)
backend/src/services/failover.ts (3)
FailoverConfig(18-37)selectMultipleCandidates(377-413)executeWithFailover(179-348)backend/src/utils/api-helpers.ts (1)
filterCandidates(87-111)backend/src/utils/model.ts (1)
ModelWithProvider(11-14)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)
backend/src/api/v1/responses.ts (4)
backend/src/services/failover.ts (3)
FailoverConfig(18-37)selectMultipleCandidates(377-413)executeWithFailover(179-348)backend/src/utils/api-helpers.ts (1)
filterCandidates(87-111)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)backend/src/adapters/index.ts (1)
getUpstreamAdapter(85-92)
backend/src/utils/api-helpers.ts (2)
backend/src/utils/model.ts (1)
ModelWithProvider(11-14)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)
🔇 Additional comments (8)
backend/src/services/failover.ts (2)
1-57: 配置和类型定义设计合理故障转移配置接口设计完整,包含了必要的参数:最大提供商尝试次数、同提供商重试次数、可重试状态码、退避策略参数等。默认配置值合理,特别是包含了常见的 5xx 错误和网络错误码。
377-413: 加权随机选择实现正确该函数正确实现了无放回的加权随机选择,用于按权重优先级选择多个候选提供商。实现逻辑清晰,包括边界检查(候选数量小于等于请求数量时直接返回副本)。
backend/src/api/v1/completions.ts (2)
358-364: 故障转移配置定义合理配置值适合 completions API 的使用场景,2 分钟超时对于长时间运行的 LLM 请求是合理的。
481-554: 错误处理逻辑正确,已解决之前的审查意见代码正确区分了两种失败场景:
- 非可重试的上游错误(
result.response存在):转发原始状态码和响应体- 所有提供商耗尽(
result.response为 undefined):返回 502这解决了之前审查中指出的问题。
backend/src/api/v1/messages.ts (2)
303-309: Messages API 故障转移配置与 Completions 一致配置值与 completions API 相同,保持了一致性。
530-533: 流式错误格式正确使用 SSE event 格式使用
event: error\ndata: ...格式符合 Anthropic API 的 SSE 错误响应规范。backend/src/api/v1/responses.ts (2)
319-325: Responses API 故障转移配置与其他 API 一致配置值与 completions 和 messages API 相同,保持了系统一致性。
436-507: 故障转移错误处理逻辑正确实现与
completions.ts和messages.ts一样,正确区分了非可重试上游错误和所有提供商耗尽的场景,解决了之前审查中提出的问题。三个 API 文件之间存在大量重复代码,建议后续迭代中考虑提取公共的故障转移错误处理逻辑。
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Code Review
This pull request introduces a robust failover and retry mechanism for multi-provider requests, which is a great enhancement for reliability. The core logic in the new failover.ts service is well-structured, implementing exponential backoff with jitter and configurable retry policies.
My review focuses on improving maintainability by reducing code duplication in the API endpoint files and fixing a potential database integrity issue related to model ID handling on failure.
Key feedback points:
- Code Duplication: The error handling logic for failover results is duplicated across streaming and non-streaming paths in
completions.ts,messages.ts, andresponses.ts. Refactoring this into a shared helper function would significantly improve maintainability. - Database Integrity: There's an issue where a fallback model ID of
0is used when a request fails before a provider is chosen. This is incorrect for a database identity column and should beundefinedinstead. - Efficiency: A minor efficiency improvement is suggested for the weighted random selection algorithm.
Overall, this is a solid implementation of a complex feature. Addressing the feedback will make the code more robust and easier to maintain.
- Change modelId fallback from ?? 0 to undefined to avoid database integrity issues - Add AbortError handling in isRetriableNetworkError for timeout scenarios - Fix SSE error format in responses.ts to match OpenAI Responses API spec - Add error capture to catch blocks for better logging - Optimize selectMultipleCandidates by calculating totalWeight once Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/src/api/v1/responses.ts (1)
273-301: 无 chunk 时会先记录为完成,导致重复/错误日志当前在
isFirstChunk仍为true时,先写入“completed”记录再抛错,会产生一条成功记录+一条失败记录,且可能影响后续统计。建议先检查再落库/记账。🛠️ 建议修复
- // Finalize completion record - completion.completion = [ - { - role: undefined, - content: - (thinkingParts.length > 0 - ? `<think>${thinkingParts.join("")}</think>\n` - : "") + textParts.join(""), - }, - ]; + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record + completion.completion = [ + { + role: undefined, + content: + (thinkingParts.length > 0 + ? `<think>${thinkingParts.join("")}</think>\n` + : "") + textParts.join(""), + }, + ]; ... - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }backend/src/api/v1/messages.ts (1)
257-285: 无 chunk 时会先记录为完成,导致重复/错误日志这里同样在
isFirstChunk为true时先写“completed”再抛错,会产生重复/错误完成记录。建议先检查再落库。🛠️ 建议修复
- // Finalize completion record - completion.completion = [ - { - role: undefined, - content: - (thinkingParts.length > 0 - ? `<think>${thinkingParts.join("")}</think>\n` - : "") + textParts.join(""), - }, - ]; + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record + completion.completion = [ + { + role: undefined, + content: + (thinkingParts.length > 0 + ? `<think>${thinkingParts.join("")}</think>\n` + : "") + textParts.join(""), + }, + ]; ... - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }backend/src/api/v1/completions.ts (1)
309-340: 无 chunk 时会先记录为完成,导致重复/错误日志在流式无任何 chunk 的场景里,会先写入成功记录再抛错,造成重复与误导性统计。建议先检查再记录。
🛠️ 建议修复
- // Finalize completion record - const contentText = + // Handle case where no chunks were received + if (isFirstChunk) { + throw new Error("No chunk received from upstream"); + } + + // Finalize completion record + const contentText = (thinkingParts.length > 0 ? `<think>${thinkingParts.join("")}</think>\n` : "") + textParts.join(""); ... - // Handle case where no chunks were received - if (isFirstChunk) { - throw new Error("No chunk received from upstream"); - }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/src/api/v1/completions.tsbackend/src/api/v1/messages.tsbackend/src/api/v1/responses.tsbackend/src/services/failover.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/src/services/failover.ts
🧰 Additional context used
🧬 Code graph analysis (2)
backend/src/api/v1/responses.ts (5)
backend/src/services/failover.ts (3)
FailoverConfig(18-37)selectMultipleCandidates(382-420)executeWithFailover(184-353)backend/src/utils/api-helpers.ts (1)
filterCandidates(87-111)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)backend/src/adapters/index.ts (1)
getUpstreamAdapter(85-92)backend/src/utils/redisClient.ts (1)
set(71-85)
backend/src/api/v1/completions.ts (4)
backend/src/services/failover.ts (3)
FailoverConfig(18-37)selectMultipleCandidates(382-420)executeWithFailover(184-353)backend/src/utils/api-helpers.ts (1)
filterCandidates(87-111)backend/src/utils/model.ts (1)
ModelWithProvider(11-14)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)
🔇 Additional comments (12)
backend/src/api/v1/responses.ts (4)
113-161: buildCompletionRecord 的 modelId 可选化处理合理请求模型 ID 缺省时保持
undefined,记录结构也更稳健。
167-207: 非流式处理的耗时与 token 记账逻辑清晰将
begin传入并统一计算 ttft/duration 与后置消耗,整体很顺。
319-325: Failover 配置集中化清晰将响应端的重试/超时参数下沉到常量,有利于统一治理。
436-507: 区分不可重试错误与耗尽失败的处理方式正确现在会转发上游不可重试错误并保留状态码,和“全部失败”分开处理,方向很好。
Also applies to: 559-629
backend/src/api/v1/messages.ts (4)
124-151: buildCompletionRecord 的可选 modelId 设计合理对于未命中模型 ID 的场景更安全,记录结构一致。
156-196: 非流处理逻辑清晰,计时点一致
begin贯穿到记录与 token 计费,职责划分清晰。
303-309: Failover 配置集中化清晰便于消息端统一调整重试与超时策略。
420-491: 不可重试错误直返 + 耗尽失败独立处理很到位错误分流逻辑更准确,避免一概 502。
Also applies to: 543-614
backend/src/api/v1/completions.ts (4)
110-137: buildCompletionRecord 的 modelId 可选化处理正确避免无效 ID 写入,结构更安全。
142-186: 非流式处理的记录与计时逻辑清楚ttft/duration 与 token 计费的时机一致。
358-364: Failover 配置集中化清晰完成端口的重试/超时策略更易维护。
481-554: 不可重试上游错误转发与耗尽失败区分明确避免错误被统一包装成 502,行为更符合预期。
Also applies to: 602-675
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Code Review
This pull request introduces a robust automatic failover and retry mechanism for multi-provider requests, which is a significant improvement for the service's reliability. The new failover service is well-structured with configurable exponential backoff, jitter, and timeout support. The refactoring of the API endpoints (completions, messages, responses) to use this new service is extensive and separates concerns nicely. My main feedback focuses on reducing significant code duplication in the error handling paths within the API endpoints and improving the clarity of the provider iteration logic in the failover service. Addressing these points will enhance the maintainability of this new, powerful feature.
…cation - Add processFailoverError helper function in api-helpers.ts - Refactor completions.ts, messages.ts, responses.ts to use shared helper - Reduces ~120 lines of duplicated error handling code - Maintains same behavior: forward non-retriable errors, return 502 on exhaustion Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use more specific error message patterns in isRetriableNetworkError to avoid false positives (e.g., "network error" instead of "network") - Simplify provider iteration loop using for...of with entries() since candidates are already unique and ordered from selectMultipleCandidates - Remove redundant triedProviders Set Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/src/api/v1/messages.ts`:
- Around line 485-489: The SSE error event currently emits an
Anthropic-incompatible error type ("api_error"); update the SSE payload in the
catch block where logger.error, set.status and the yield are used so the inner
error.type is "server_error" instead of "api_error" (i.e., change the JSON
payload produced by the yield statement to use "server_error" for stream
processing errors) while keeping the rest of the structure and status handling
intact.
🧹 Nitpick comments (1)
backend/src/utils/api-helpers.ts (1)
251-252: 建议将 import 语句移至文件顶部当前 import 语句位于文件中间(第 251-252 行),这不符合 TypeScript/JavaScript 的常规代码组织习惯。建议将这些 import 移至文件顶部与其他 import 语句放在一起,以提高代码可读性和一致性。
♻️ 建议的修改
将以下 import 移至文件顶部:
import type { FailoverResult, FailoverError } from "@/services/failover"; import { addCompletions, type Completion } from "@/utils/completions";
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/src/api/v1/completions.tsbackend/src/api/v1/messages.tsbackend/src/api/v1/responses.tsbackend/src/services/failover.tsbackend/src/utils/api-helpers.ts
🧰 Additional context used
🧬 Code graph analysis (1)
backend/src/utils/api-helpers.ts (5)
backend/src/utils/model.ts (1)
ModelWithProvider(11-14)backend/src/adapters/types.ts (1)
ModelWithProvider(313-316)backend/src/services/failover.ts (2)
FailoverError(63-71)FailoverResult(73-80)backend/src/db/index.ts (1)
Completion(33-33)backend/src/utils/completions.ts (1)
addCompletions(29-64)
🔇 Additional comments (19)
backend/src/utils/api-helpers.ts (3)
83-111: LGTM!
filterCandidates函数实现正确,包含合理的回退逻辑:当指定的 provider 没有匹配的候选项时,会记录警告并返回所有候选项,确保服务可用性。
304-316: fire-and-forget 模式下的错误处理
addCompletions调用使用.catch()进行 fire-and-forget 模式,这在日志记录场景中是合理的设计。但需要确认这种模式符合预期——如果日志记录失败,请求仍会正常返回,只是在 catch 块中记录错误。当前实现是可接受的,因为日志记录失败不应阻塞主请求流程。
Also applies to: 332-344
288-350: processFailoverError 实现良好该函数有效地集中处理了 failover 错误,解决了之前审查中提到的代码重复问题。它正确区分了上游错误(转发原始响应)和 failover 耗尽(返回 502)两种场景。
backend/src/api/v1/messages.ts (3)
304-310: LGTM!Failover 配置合理,2 分钟超时对于 messages API 是合适的。
421-446: 错误处理已正确重构之前审查中指出的重复代码问题已通过
processFailoverError辅助函数解决。现在流式和非流式路径都正确区分了上游错误(转发原始响应)和 failover 耗尽(返回 502)两种情况。Also applies to: 498-523
283-286: 流式处理中无数据时抛出错误是正确的设计当没有收到任何 chunk 时抛出错误,并在 catch 块中重新抛出,这允许调用者(主处理器)捕获并返回适当的错误响应。这是合理的错误传播模式。
Also applies to: 299-301
backend/src/services/failover.ts (5)
113-139: 网络错误重试判断已改进已正确处理
AbortError(第 122-124 行),解决了之前审查中关于超时中止错误可能未被正确分类的问题。错误消息匹配也使用了更具体的短语(如"timed out","connection reset")来减少误判。
389-411: 加权选择算法已优化
totalWeight现在只计算一次并增量更新(第 387、409 行),解决了之前审查中关于效率的问题。默认选择最后一个元素作为浮点精度边界情况的回退(第 394 行)也是正确的。第 397 行的
if (item)检查虽然在理论上是多余的(因为remaining数组不应包含 undefined),但保留它作为防御性编程也是可以接受的。
201-327: LGTM!
executeWithFailover实现清晰,使用for...of和entries()迭代(如之前审查建议),正确处理了:
- 同提供商重试(用于 429 等瞬态错误)
- 跨提供商故障转移
- 指数退避与抖动
- 可重试与不可重试错误的区分
151-170: fetchWithTimeout 实现正确使用
AbortController实现超时,并在finally块中清理 timeout,防止内存泄漏。
86-98: 退避延迟计算建议验证当前实现使用
attempt参数计算指数退避:baseDelayMs * Math.pow(exponentialBase, attempt)。第 214 行调用时使用
totalAttempts - 1,意味着第一次重试(totalAttempts=2)的 attempt 为 1,延迟为100 * 2^1 = 200ms。这是合理的,但建议确认这符合预期的退避曲线。backend/src/api/v1/completions.ts (4)
359-365: LGTM!Completions API 的 failover 配置与 messages API 保持一致,2 分钟超时对于 chat completions 是合理的。
482-510: 错误处理已正确重构流式和非流式路径的错误处理已通过
processFailoverError统一,正确区分上游错误和 failover 耗尽。modelId回退逻辑使用candidates[0]?.model.id(第 485、560 行),解决了之前?? 0的问题。Also applies to: 557-585
454-463: 请求构建函数设计合理
buildRequestForProvider函数为每个候选 provider 构建请求,正确处理了remoteId回退到systemName的逻辑。使用浅拷贝{ ...internalRequest }来避免修改原始请求对象。
544-548: 流式处理中的错误需遵循 SSE 格式规范根据 OpenAI 文档,流式传输中的错误分为两种:
- 流开始前的错误:返回 HTTP 非 2xx 状态码 + JSON 错误体
- 流传输中的错误:必须以 SSE 事件格式发送,包含
data:前缀和\n\n结尾当前实现使用
yield返回JSON.stringify({ error: "Stream processing error" }),缺少 SSE 格式包装。根据 OpenAI 规范,中途错误应包含错误对象、choices 数组,以及finish_reason: "error"标识流终止:data: {"id":"...","object":"chat.completion.chunk","created":...,"model":"...","error":{"code":"...","message":"..."},"choices":[{"index":0,"delta":{"content":""},"finish_reason":"error"}]}此外,
set.status = 500在流已启动后可能无法生效,应考虑在流开始前进行错误验证或在中途错误中用 finish_reason 通知客户端。backend/src/api/v1/responses.ts (4)
320-326: LGTM!Responses API 的 failover 配置与其他 API 保持一致。
501-505: 流式错误格式已修复之前审查指出的 SSE 错误格式问题已解决。现在正确使用
event: error前缀,并且 JSON 结构符合 OpenAI Responses API 规范(包含code,message,param,help_url字段)。
437-462: 错误处理已正确重构与 completions.ts 和 messages.ts 一样,错误处理通过
processFailoverError统一处理,正确区分上游错误和 failover 耗尽。Also applies to: 514-540
417-426: 请求构建函数与其他 API 保持一致
buildRequestForProvider实现与 completions.ts 和 messages.ts 中的相同,保持了代码一致性。
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Code Review
This pull request introduces a new failover service (services/failover.ts) to handle upstream request retries and provider failover with exponential backoff and jitter. This service is integrated into the completions, messages, and responses API handlers (api/v1/*.ts). The integration involves refactoring the API handlers to use the new executeWithFailover function, replacing the previous selectModel logic with filterCandidates and selectMultipleCandidates, and centralizing error processing with a new processFailoverError utility function (utils/api-helpers.ts). The handleNonStreamingRequest and handleStreamingRequest functions in the API handlers were also refactored into processNonStreamingResponse and processStreamingResponse to focus solely on processing successful responses, with initial fetch and error handling now managed by the failover service. Review comments highlight that despite these changes, there is still significant code duplication in the API handlers between streaming and non-streaming request paths for failover result processing, suggesting further refactoring. Additionally, a discrepancy was noted in the FailoverErrorResult type, which includes an 'internal_error' case that is not currently returned by the processFailoverError function.
- Fix SSE error type in messages.ts to use "server_error" per Anthropic spec - Remove unused "internal_error" case from FailoverErrorResult type Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive failover and retry mechanism for upstream API requests across completions.ts, messages.ts, and responses.ts. A new failover.ts service was added, which handles weighted random selection of model providers, exponential backoff, and retries for transient network and HTTP errors (e.g., 429, 5xx). The existing API routes were refactored to integrate this service, extracting the core fetch and initial error handling into the executeWithFailover function. New utility functions like filterCandidates and processFailoverError were added to api-helpers.ts to support this. Review comments suggest enhancing the isRetriableNetworkError function in failover.ts to inspect error.cause for more reliable network error classification, as the current string-based checks are considered too broad. Additionally, a comment in executeWithFailover was noted as inconsistent with the implementation, which correctly returns a FailoverResult object rather than throwing an error for non-retriable HTTP errors.
| const message = error.message.toLowerCase(); | ||
| return ( | ||
| message.includes("timed out") || | ||
| message.includes("timeout") || | ||
| message.includes("the operation was aborted") || | ||
| message.includes("connection reset") || | ||
| message.includes("connection refused") || | ||
| message.includes("network error") || | ||
| message.includes("network request failed") || | ||
| message.includes("socket hang up") || | ||
| message.includes("fetch failed") | ||
| ); |
There was a problem hiding this comment.
The string-based checks for retriable network errors are quite broad (e.g., "network error", "fetch failed") and could lead to false positives, causing retries on non-transient errors. Modern fetch implementations, especially in Node.js, often include a cause property on the thrown TypeError which contains the underlying system error.
Checking (error as any).cause could provide more reliable error details than parsing the message string. I suggest enhancing this function to inspect error.cause if it exists for more robust error classification.
| attempt: totalAttempts, | ||
| }); | ||
|
|
||
| // If not retriable, throw immediately |
There was a problem hiding this comment.
The comment on this line is inconsistent with the implementation. The code correctly returns a FailoverResult object, which aligns with the handling of non-retriable HTTP errors. Throwing an error here would create an inconsistency in the function's return behavior. Please update the comment to reflect the actual logic.
| // If not retriable, throw immediately | |
| // If not retriable, return a failure result immediately |
Summary
Features
Failover Service (
backend/src/services/failover.ts)Configuration Defaults
Failover Flow
Affected Endpoints
/v1/chat/completions/v1/messages/v1/responsesAlso Includes
This PR also includes the weighted load balancing fix from #49:
selectModel()filterCandidates()helper functionTest plan
bun run check)bun run lint)🤖 Generated with Claude Code
Summary by CodeRabbit
发布说明
新功能
重构
✏️ Tip: You can customize this high-level summary in your review settings.