fix(auth): honor disable-cooling and enrich no-auth errors #2576

Merged
luispater merged 1 commit into router-for-me:dev from zilianpn:fix/disable-cooling-auth-errors
Apr 7, 2026

@zilianpn zilianpn commented Apr 6, 2026

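Summary

This PR fixes two issues visible in production:

  1. The auth_not_found / auth_unavailable error messages are too generic, which makes troubleshooting expensive.
  2. disable-cooling: true does not take full effect on some failure paths, so an auth/model can still be put into an "unavailable window".

Related: #1706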

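Production symptoms (before the fix)

Problem A: the "no auth available" response is not diagnosable

  • Typical scenario: a /v1/messages (Anthropic Messages format) request fails.
  • The response is usually "auth unavailable: no auth available", and in many cases it is returned as a 500.
  • Key information is missing: which providers were selected for this request, which model, and whether the request went through the Claude auth path.
  • Result: users cannot quickly tell whether the login session has expired, the auth file is missing, or the provider routing does not match.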

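Problem B: with disable-cooling: true, an auth can still "suddenly become unavailable" during use

  • Typical scenario: an auth hits consecutive 403 / 429 responses.
  • Expected: with disable-cooling enabled, it should not enter cooldown/blackout.
  • Actual:
    • on the model-level failure path, statuses such as 403 can still write an unavailable-until time or trigger suspension;
    • 429 can still trigger the quota-suspension side effect even when it produces no backoff duration;
    • the auth-level failure path still writes NextRetryAfter for 401/402/403/404/408/5xx.
  • Result: later requests cannot select any usable auth, which surfaces as "no auth available" and looks like a sudden breakage.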

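Root cause analysis

Root cause A (insufficient error observability)

  • The handler layer passes auth errors through unchanged.
  • When the original error carries no HTTP status code, the framework defaults to 500.
  • The message includes no provider/model context and no troubleshooting guidance for the Claude case.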

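Root cause B (incomplete disable-cooling coverage)

  • The disable-cooling check is scattered across multiple branches with inconsistent semantics.
  • Some branches in the model-level and auth-level failure paths miss the check.
  • The 429 branch has a side-effect path that suspends even though no cooldown is set.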

Changes

1) Richer auth error messages (API Handler)

Files involved:

  • sdk/api/handlers/handlers.go
  • sdk/api/handlers/handlers_error_response_test.go

Changes:

  • Add enrichAuthSelectionError(err, providers, model).
  • ExecuteWithAuthManager, ExecuteCountWithAuthManager, and ExecuteStreamWithAuthManager all call it.
  • For auth_not_found / auth_unavailable:
    • the message gains providers=... and model=...;
    • when the providers include claude, a /v0/management/auth-files troubleshooting hint is added;
    • when the original error has no status code, it defaults to 503 Service Unavailable;
    • when the original error has an explicit status code, that code is kept.

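2) disable-cooling semantics fix (Auth Conductor)

Files involved:

  • sdk/cliproxy/auth/conductor.go
  • sdk/cliproxy/auth/conductor_overrides_test.go

Changes:

  • Model-level failure handling now uses a single disableCooling check.
  • When disable-cooling=true:
    • 401/402/403/404/408/5xx no longer set NextRetryAfter;
    • 429 no longer triggers the model-suspension/quota-suspension side effects (avoiding the blackout state).
  • The auth-level applyAuthFailureState applies the same rules, removing the difference between paths.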

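Behavior comparison (Before / After)

| Scenario | Before | After |
| --- | --- | --- |
| no-auth error response | Commonly a generic 500 with only "no auth available" | message includes provider/model context that pinpoints the failure; 503 when there is no explicit status code |
| Claude routing troubleshooting | No clear hint | message carries a /v0/management/auth-files hint |
| disable-cooling=true + 403 | May enter the unavailable window | No cooldown written; no blackout |
| disable-cooling=true + 429 | May still trigger suspension side effects | No suspension side effects, so later routing does not blacklist the auth |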

Tests

Added/updated:

  • TestEnrichAuthSelectionError_DefaultsTo503WithContext
  • TestEnrichAuthSelectionError_PreservesExplicitStatus
  • TestEnrichAuthSelectionError_IgnoresOtherErrors
  • TestManager_MarkResult_RespectsAuthDisableCoolingOverride_On403
  • TestManager_Execute_DisableCooling_DoesNotBlackoutAfter403

Test run:

go test -count=1 ./sdk/cliproxy/auth ./sdk/api/handlers

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces error enrichment for authentication failures and adds support for disabling the cooldown mechanism via metadata. Specifically, the enrichAuthSelectionError function was implemented to provide detailed context and troubleshooting hints for authentication errors. Additionally, the authentication manager was updated to respect a disable_cooling flag in the auth metadata, preventing automatic suspension or retry delays for specific error codes. Unit tests were included to verify both the error enrichment logic and the cooldown override behavior. I have no feedback to provide.

@zilianpn zilianpn force-pushed the fix/disable-cooling-auth-errors branch from 4c3c643 to 56acff1 Compare April 6, 2026 16:44

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56acff1049

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread sdk/cliproxy/auth/conductor.go
@zilianpn zilianpn force-pushed the fix/disable-cooling-auth-errors branch from 56acff1 to 1a3b64c Compare April 6, 2026 16:57

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a3b64cf34

Comment thread sdk/api/handlers/handlers.go
@zilianpn zilianpn force-pushed the fix/disable-cooling-auth-errors branch from 1a3b64c to 0ea7680 Compare April 6, 2026 17:19

@luispater luispater left a comment

Summary:
The disable-cooling handling is now applied consistently across both model- and auth-level failure paths, and the auth selection error enrichment makes the failure mode materially more diagnosable.

Key findings:

  • No blocking findings.

Test plan:

  • Reviewed the new handler/auth conductor regression tests.

This is an automated Codex review result and still requires manual verification by a human reviewer.

@luispater luispater added the codex label Apr 7, 2026

@luispater luispater left a comment

Summary:
This makes disable-cooling behavior more consistent across model-level and auth-level failure paths, and the error enrichment is backed by focused handler/auth regression tests.

Key findings:

  • No blocking issues found in the diff I reviewed.

Test plan:

  • Reviewed the sdk/cliproxy/auth/conductor.go changes.
  • Reviewed the added tests in sdk/api/handlers and sdk/cliproxy/auth.

This is an automated Codex review result and still requires manual verification by a human reviewer.

@luispater luispater changed the base branch from main to dev April 7, 2026 01:56
@luispater luispater merged commit 6a27bce into router-for-me:dev Apr 7, 2026
2 checks passed