Stop app-server auth refresh storms after permanent token failure by etraut-openai · Pull Request #14256 · openai/codex

etraut-openai · 2026-03-10T21:25:49Z

This PR addresses a hole in PR 11802. The previous PR assumed that app server clients would respond to token refresh failures by presenting the user with an error ("you must log in again") and then not making further attempts to call network endpoints using the expired token. While they do present the user with this error, they don't prevent further attempts to call network endpoints and can repeatedly call getAuthStatus(refreshToken=true) resulting in many failed calls to the token refresh endpoint.

There are three solutions I considered here:

1. Change the getAuthStatus app server call to return a null auth if the caller specified "refreshToken" on input and the refresh attempt fails. This will cause clients to immediately log out the user and return them to the login screen. This is a really bad user experience. It's also a breaking change in the app server contract that could break third-party clients.
2. Augment the getAuthStatus app server call to return an additional field that indicates the state of "token could not be refreshed". This is a non-breaking change to the app server API, but it requires non-trivial changes for all clients to handle this new field properly.
3. Change the getAuthStatus implementation to handle the case where a token refresh fails by marking the AuthManager's in-memory access and refresh tokens as "poisoned" so that they are no longer used. This is the simplest fix and requires no client changes.

I chose option 3.

Here's Codex's explanation of this change:

When an app-server client asks getAuthStatus(refreshToken=true), we may try to
refresh a stale ChatGPT access token. If that refresh fails permanently
(for example refresh_token_reused, expired, or revoked), the old behavior was
bad in two ways:

1. We kept the in-memory auth snapshot alive as if it were still usable.
2. Later auth checks could retry refresh again and again, creating a storm of
   doomed /oauth/token requests and repeatedly surfacing the same failure.

This is especially painful for app-server clients because they poll auth status
and can keep driving the refresh path without any real chance of recovery.

This change makes permanent refresh failures terminal for the current managed
auth snapshot without changing the app-server API contract.

What changed:

- AuthManager now poisons the current managed auth snapshot in memory after a
  permanent refresh failure, keyed to the unchanged AuthDotJson.
- Once poisoned, later refresh attempts for that same snapshot fail fast locally
  without calling the auth service again.
- The poison is cleared automatically when auth materially changes, such as a
  new login, logout, or reload of different auth state from storage.
- getAuthStatus(includeToken=true) now omits authToken after a permanent
  refresh failure instead of handing out the stale cached bearer token.

This keeps the current auth method visible to clients, avoids forcing an
immediate logout flow, and stops repeated refresh attempts for credentials that
cannot recover.
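The poisoning behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `AuthSnapshot`, `RefreshError`, `set_auth`, `refresh_with`, and `token_for_status` are hypothetical names, `AuthSnapshot` stands in for `AuthDotJson`, and the auth service call is passed in as a closure so the sketch stays self-contained.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the stored auth state (`AuthDotJson` in the PR).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct AuthSnapshot {
    pub access_token: String,
    pub refresh_token: String,
}

/// Hypothetical error type separating doomed refreshes from retryable ones.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum RefreshError {
    /// Retrying cannot help (e.g. refresh_token_reused, expired, revoked).
    Permanent(String),
    /// Retrying later may succeed (e.g. a network timeout).
    Transient(String),
}

struct State {
    current: Option<AuthSnapshot>,
    /// The snapshot whose refresh failed permanently, plus the recorded error.
    poisoned: Option<(AuthSnapshot, RefreshError)>,
}

pub struct AuthManager {
    state: Mutex<State>,
}

impl AuthManager {
    pub fn new(current: Option<AuthSnapshot>) -> Self {
        Self { state: Mutex::new(State { current, poisoned: None }) }
    }

    /// Login, logout, or reload. The poison survives only if the incoming
    /// snapshot is identical to the poisoned one.
    pub fn set_auth(&self, next: Option<AuthSnapshot>) {
        let mut s = self.state.lock().unwrap();
        let still_poisoned =
            matches!((&s.poisoned, &next), (Some((p, _)), Some(n)) if p == n);
        if !still_poisoned {
            s.poisoned = None;
        }
        s.current = next;
    }

    /// Refresh via `call_auth_service`, failing fast for a poisoned snapshot.
    /// The lock is held across the call only to keep the sketch short.
    pub fn refresh_with<F>(&self, call_auth_service: F) -> Result<(), RefreshError>
    where
        F: FnOnce(&AuthSnapshot) -> Result<AuthSnapshot, RefreshError>,
    {
        let mut s = self.state.lock().unwrap();
        let Some(current) = s.current.clone() else {
            return Err(RefreshError::Permanent("no managed auth".into()));
        };
        if let Some((snap, err)) = &s.poisoned {
            if *snap == current {
                // Same doomed credentials: do not contact the auth service.
                return Err(err.clone());
            }
        }
        match call_auth_service(&current) {
            Ok(fresh) => {
                s.current = Some(fresh);
                s.poisoned = None;
                Ok(())
            }
            Err(err @ RefreshError::Permanent(_)) => {
                s.poisoned = Some((current, err.clone()));
                Err(err)
            }
            Err(err) => Err(err),
        }
    }

    /// Token to report in an auth status response; omitted once poisoned.
    pub fn token_for_status(&self) -> Option<String> {
        let s = self.state.lock().unwrap();
        let current = s.current.as_ref()?;
        match &s.poisoned {
            Some((snap, _)) if snap == current => None,
            _ => Some(current.access_token.clone()),
        }
    }
}
```

In this sketch a client that keeps polling after a permanent failure gets the recorded error back immediately, and only a different snapshot (for example after a new login) re-enables calls to the auth service. Transient errors deliberately leave the snapshot unpoisoned so that a later retry can still succeed.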

## Summary

Foundation PR only (base for PR #3).

This PR contains the SDK runtime foundation and generated artifacts:

- pinned runtime binary in `sdk/python/bin/` (`codex` or `codex.exe` by platform)
- single maintenance script: `sdk/python/scripts/update_sdk_artifacts.py`
- generated protocol/types artifacts under:
  - `sdk/python/src/codex_app_server/generated/protocol_types.py`
  - `sdk/python/src/codex_app_server/generated/schema_types.py`
  - `sdk/python/src/codex_app_server/generated/v2_all/*`
- generation-contract test wiring (`tests/test_contract_generation.py`)

## Release asset behavior

`update_sdk_artifacts.py` now:

- selects latest release by channel (`--channel stable|alpha`)
- resolves the correct asset for current OS/arch
- extracts platform binary (`codex` on macOS/Linux, `codex.exe` on Windows)
- keeps runtime on single pinned binary source in `sdk/python/bin/`

## Scope boundary

- ✅ PR #2 = binary + generation pipeline + generated types foundation
- ❌ PR #2 does **not** include examples/integration logic polish (that is PR #3)

## Validation

- Ran: `python scripts/update_sdk_artifacts.py --channel stable`
- Regenerated and committed resulting generated artifacts
- Local tests pass on branch

Addresses #13586

This doesn't affect our CI scripts. It was user-reported.

Summary

- add `wiremock::ResponseTemplate` and `body_string_contains` imports behind `#[cfg(not(debug_assertions))]` in `codex-rs/core/tests/suite/view_image.rs` so release builds only pull the helpers they actually use
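The general pattern of gating an import with the same `cfg` as the only code that uses it can be sketched as below. This is an illustration only and not the contents of `view_image.rs`: `release_only_counts` and `profile` are hypothetical names, and a standard-library type stands in for the wiremock helpers.

```rust
// Needed only by the release-only helper below, so it is gated the same way.
// Without the attribute, profiles that skip the helper would see an unused import.
#[cfg(not(debug_assertions))]
use std::collections::BTreeMap;

/// Compiled only when debug assertions are off (typically `--release`).
#[cfg(not(debug_assertions))]
pub fn release_only_counts(words: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for w in words {
        *counts.entry((*w).to_string()).or_insert(0) += 1;
    }
    counts
}

/// Available in every profile; reports which configuration was compiled.
pub fn profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}
```

Keeping the attribute on both the `use` line and its consumer means each build profile compiles exactly the items it needs, which avoids unused-import warnings when warnings are treated as errors.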

Replace the Unix shell lookup path in `codex-rs/core/src/shell.rs` to use `libc::getpwuid_r()` instead of `libc::getpwuid()` when resolving the current user's shell.

Why:

- `getpwuid()` can return pointers into libc-managed shared storage
- on the musl static Linux build, concurrent callers can race on that storage
- this matches the crash pattern reported in tmux/Linux sessions with parallel shell activity

Refs:

- Fixes #13842

There are some bug investigations that currently require us to ask users for their user ID even though they've already uploaded logs and session details via `/feedback`. This frustrates users and increases the time for diagnosis. This PR includes the ChatGPT user ID in the metadata uploaded for `/feedback` (both the TUI and app-server).

celia-oai · 2026-03-23T20:58:48Z

I'm continuing this in #15530.

@etraut-openai

…ure (#15530) built from #14256, carrying over the PR description from @etraut-openai.

Co-authored-by: Eric Traut <etraut@openai.com>

celia-oai · 2026-03-24T19:40:52Z

#15530 is merged so good to close this one out

shaqayeq-oai and others added 5 commits March 10, 2026 01:00

Mitigate token refresh storms

a7f7568

etraut-openai changed the title ~~Handle permanent refresh failures without retry storms~~ Stop app-server auth refresh storms after permanent token failure Mar 10, 2026

etraut-openai added 2 commits March 10, 2026 15:56

Document poisoned_managed_auth field

d9d6129

Document permanent refresh failure

b8951c9

bolinfest force-pushed the main branch from 402ca6c to f5bb338 March 11, 2026 19:49

etraut-openai marked this pull request as draft March 16, 2026 09:38

celia-oai mentioned this pull request Mar 23, 2026

chore: stop app-server auth refresh storms after permanent token failure #15530

Merged

celia-oai closed this Mar 24, 2026

etraut-openai wants to merge 7 commits into main from etraut/issue-9634-token-refresh-storm
3 participants