[aw-failures] 6h Failure Report 2026-04-22: Codex 401 Auth Loop — Lock File Root Cause Identified #27729

@github-actions

Overview

Two scheduled workflows failed in the 6-hour window ending 2026-04-22T01:11Z, both with 401 Unauthorized from api.openai.com/v1/responses. Root cause identified: the workflows are running from lock files compiled without the openai-proxy provider config introduced in PR #27711. They attempt to reach api.openai.com directly, but the OPENAI_API_KEY only works via the internal proxy (172.30.0.30:10000). The fix is already tracked in #27724 (recompile lock files); this report links the failures to that issue.

Failure Clusters

| Run ID | Workflow | Engine | Conclusion | Error | Tracked |
|---|---|---|---|---|---|
| §24752310887 | AI Moderator | codex | failure | 401 Unauthorized api.openai.com | #27678 (older run) |
| §24752303616 | Daily Observability Report | codex | failure | 401 Unauthorized api.openai.com | #27716 |
| §24752934132 | Smoke CI | copilot | cancelled | Superseded by next push (47s) | |
| §24752949457 | Smoke CI | copilot | cancelled | Superseded by next push (2.7m) | |

Evidence

Audit-diff: failed AI Moderator (24752310887) vs successful Design Decision Gate (24752301186)
  • Firewall: Failed run made 13 allowed calls to api.openai.com:443 and 1 blocked to chatgpt.com:443; successful run only contacted api.anthropic.com:443
  • Token output: Failed run recorded 0 tokens — engine crashed before producing any output
  • Duration: Failed run ran 8m57s vs 3m25s for the successful run — codex was retrying reconnections (Reconnecting... 1/5 through 5/5) before giving up
  • No MCP tools called: Failed run never reached safe_outputs — engine failure was terminal
Error signature (from issues #27678, #27716)
Reconnecting... 2/5
Reconnecting... 3/5
Reconnecting... 4/5
Reconnecting... 5/5
Reconnecting... 1/5
unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, 
url: (api.openai.com/redacted), cf-ray: ..., request id: req_...

Root Cause Analysis

PR #27711 (commit 28d8df1) added logic to inject an openai-proxy provider into the generated codex config, routing codex through (172.30.0.30/redacted) instead of directly to api.openai.com.

But the lock files (.lock.yml) have not been recompiled since this change, as tracked in #27724. Currently running codex workflows therefore use stale lock files that send requests directly to api.openai.com with an API key that is only valid when routed through the internal AWF proxy.

Chain: PR #27711 merged → lock files stale → codex workflows use old config → 401 at api.openai.com
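
One way to list the lock files still affected by this chain is to scan them for the proxy provider. The sketch below is illustrative only: it assumes that a recompiled lock file contains the literal string `openai-proxy` and that codex workflows mention `codex` somewhere in the file, neither of which this report shows directly. The function name and both markers are hypothetical.

```python
from pathlib import Path

# Assumption: this string appears verbatim in lock files compiled after PR #27711.
PROXY_MARKER = "openai-proxy"


def find_stale_codex_locks(root: str) -> list[str]:
    """Return .lock.yml paths under root that mention codex but not the proxy provider."""
    stale = []
    for path in sorted(Path(root).rglob("*.lock.yml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Heuristic: a codex workflow without the proxy marker is treated as stale.
        if "codex" in text.lower() and PROXY_MARKER not in text:
            stale.append(str(path))
    return stale
```

Calling `find_stale_codex_locks(".github/workflows")` (the directory is an assumption about where the lock files live) would return the candidates to recompile; an empty list suggests every codex lock file already carries the proxy config.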

Existing Issue Correlation

| Issue | Status | Notes |
|---|---|---|
| #27724 | Open | Lock files out of sync; the fix (recompile) |
| #27716 | Open | Daily Obs Report 401 failure; tracked |
| #27678 | Open, expires ~9 AM | AI Moderator 401 failure; older run tracked, new run 24752310887 not yet in scope |
| #27689 | Open, expires ~9 AM | Smoke Codex 401 failure; same pattern |

Proposed Fix Roadmap

| Priority | Action | Owner |
|---|---|---|
| P0 | Recompile lock files per #27724 (see sub-issue #27731) | On-call |
| P1 | After recompile, verify AI Moderator and Daily Obs Report on next scheduled run | Auto |
| P2 | Consider adding a CI check to block merges when lock files are stale | Dev |
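
The P2 item could be approached by recompiling in CI and failing the job if any lock file changes as a result. The helper below is a minimal sketch of the detection half only: it parses the output of `git status --porcelain` and returns the `.lock.yml` paths that differ. It assumes the CI job has already run the recompile step (presumably `gh aw compile`, which this report does not name); the function name and the example paths are hypothetical.

```python
def changed_lock_files(porcelain_output: str) -> list[str]:
    """Extract .lock.yml paths from `git status --porcelain` output."""
    changed = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY <path>"
        path = line[3:]  # skip the two status characters and the space
        if " -> " in path:
            # Rename entries look like "old -> new"; keep the new path.
            path = path.split(" -> ", 1)[1]
        path = path.strip('"')  # git quotes paths with special characters
        if path.endswith(".lock.yml"):
            changed.append(path)
    return changed
```

A CI step could then exit non-zero whenever the returned list is non-empty, which would block the merge until the lock files are committed in their recompiled form.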


6h Follow-Up Window: 2026-04-22T01:11Z → 07:14Z

P0 fix (#27731) still unresolved — Codex 401 failures continue on main.

New failure in this window

| Run ID | Workflow | Engine | Time (UTC) | Branch | Status |
|---|---|---|---|---|---|
| §24762849092 | AI Moderator | Codex 0.121.0 | 06:00 | main | failure (401 loop) |
| §24762218468 | Smoke Codex | Codex 0.121.0 | 05:40 | PR branch (now merged) | failure (401 loop) |
| §24762218497 | Smoke Gemini | Gemini CLI | 05:40 | PR branch (now merged) | failure (awf-api-proxy unhealthy) |

Evidence — AI Moderator (06:00 UTC, post-previous-report)

Same error signature as prior window — sk-place***roxy key sent directly to api.openai.com:

startup websocket prewarm setup failed: unexpected status 401 Unauthorized:
Incorrect API key provided: sk-place****************roxy.
url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 2/5 ... 3/5 ... (agent exits, safe_outputs: no output)

Note: PR #27762 (container image digest pins, merged in same window) does not fix the openai-proxy lock-file issue. The Codex 401 root cause from #27731 remains outstanding.

Smoke Gemini — separate failure (awf-api-proxy)

Different root cause from Codex 401. The Gemini API proxy sidecar container failed its health check on the PR branch:

Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy

Tracked by #27688 (smoke test on now-merged PR branch — lower priority).

Assessment

No new P0 failure clusters. #27731 (recompile lock files) remains the single highest-priority fix. Codex workflows running on main will continue failing at every scheduled trigger until the proxy config is added to the lock files.


Generated by [aw] Failure Investigator (6h) · ● 235.5K ·



6h Window: 2026-04-22T07:14Z → 13:14Z

P0 fix (#27731 — recompile lock files) remains unresolved. Codex 401 failures continue. A new untracked P0 cluster identified: Copilot node: command not found.

Failure Clusters (8 runs, 100% error rate)

| Run ID | Workflow | Engine | Time (UTC) | Cluster | Tracked? |
|---|---|---|---|---|---|
| §24771662916 | AI Moderator | Codex | 09:46 | Codex 401 loop | #27789, #27731 |
| §24773828176 | Daily Issues Report Generator | Copilot | 10:39 | node not found | NEW → #aw_node404 |
| §24774440815 | Daily Documentation Updater | Claude | 10:54 | Protected files block | #27801 |
| §24775941358 | Daily Fact About gh-aw | Codex | 11:32 | Codex 401 loop | #27810, #27731 |
| §24777073236 | Duplicate Code Detector | Codex | 12:00 | Codex 401 + GitHub blocked | #27816, #27731 |
| §24778392225 | AI Moderator | Codex | 12:31 | Codex 401 loop | #27789, #27731 |
| §24779123568 | Auto-Triage Issues | Copilot | 12:48 | Copilot engine crash | #27827 |
| §24769919588 | Daily News | Copilot | 09:05 | node not found | NEW → #aw_node404 |

Key Evidence from Audit

Codex 401 (still root-caused to stale lock files — #27724/#27731):

  • sk-place***roxy key sent directly to api.openai.com/v1/responses → 401, retries 1–5, engine crash
  • Additional signal: Codex is also attempting chatgpt.com:443 (blocked by firewall) — secondary concern
  • Duplicate Code Detector (§24777073236) additionally missing api.github.com:443 and github.com:443 from allow-list
  • 0 tokens produced across all 4 Codex failures — engine crashes pre-inference

Copilot node not found (NEW — no prior tracking):

/bin/bash: line 1: node: command not found
  • Affects Daily News (§24769919588) and Daily Issues Report Generator (§24773828176)
  • 0 turns, 0 tool calls — engine cannot initialize without Node.js in PATH
  • Sub-issue created: see #aw_node404
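
The three pre-inference signatures seen so far (Codex 401, node not found, awf-api-proxy unhealthy) are distinct enough to match with plain regular expressions. The sketch below shows one way a triage script might bucket a run's log. The cluster labels and the function name are hypothetical, and the patterns are taken only from the log excerpts quoted in this report, so they may miss variants.

```python
import re

# Patterns copied from the error excerpts in this report; labels are illustrative.
SIGNATURES = [
    ("codex-401", re.compile(r"401 Unauthorized")),
    ("node-not-found", re.compile(r"node: command not found")),
    ("awf-api-proxy-unhealthy", re.compile(r"container awf-api-proxy is unhealthy")),
]


def classify(log_text: str) -> str:
    """Return the first matching cluster label, or 'unclassified'."""
    for name, pattern in SIGNATURES:
        if pattern.search(log_text):
            return name
    return "unclassified"
```

Post-inference failures (for example the protected-files block) would fall through to `unclassified` here, which seems reasonable given that they need a per-workflow look anyway.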

Daily Documentation Updater (post-inference failure, $2.68 spent):

  • Agent completed 81 turns successfully; create_pull_request blocked by protected files
  • Files blocked: .github/aw/create-agentic-workflow.md, .github/aw/github-agentic-workflows.md
  • Fix: add protected-files: fallback-to-issue to workflow frontmatter

Assessment

The Codex 401 cluster has no new root cause: #27731 remains its single highest-priority fix. All Codex workflows on main will continue failing at every scheduled trigger until lock files are recompiled with the openai-proxy provider config.

The Copilot node not found pattern is a newly identified P0 (see sub-issue).


Generated by [aw] Failure Investigator (6h) · ● 350.4K ·



6h Window: 2026-04-22T13:14Z → 19:14Z

P0 #27731 (Codex 401 / lock file recompile) remains unresolved — AI Moderator failed 6 more times. New P0 identified: awf-api-proxy sidecar now unhealthy on main-branch workflows.

Failure Clusters (18 runs)

| Run ID | Workflow | Engine | Time (UTC) | Cluster | Tracked? |
|---|---|---|---|---|---|
| §24795392631 | AI Moderator | Codex | 18:25 | Codex 401 loop | #27731 |
| §24791757918 | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| §24791742300 | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| §24791727612 | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| §24791210275 | AI Moderator | Codex | 16:53 | Codex 401 loop | #27731 |
| §24789574257 | AI Moderator | Codex | 16:18 | Codex 401 loop | #27731 |
| §24785716916 | DeepReport | Claude | 15:01 | awf-api-proxy unhealthy | NEW → sub-issue |
| §24786173739 | Smoke CI | Copilot | 15:10 | awf-api-proxy unhealthy | NEW → sub-issue |
| §24786862508 | Test Quality Sentinel | Copilot | 15:24 | awf-api-proxy unhealthy | NEW → sub-issue |
| §24787456440 | Smoke OpenCode | OpenCode | 15:35 | awf-api-proxy unhealthy | NEW → sub-issue |
| §24787456577 | Smoke Codex | Codex | 15:35 | Codex 401 loop | #27731 |
| §24787456533 | Changeset Generator | Codex | 15:35 | Codex 401 loop | #27731 |
| §24788136304 | Smoke Copilot | Copilot | 15:49 | Post-inference failure (30 turns) | auto-issue |
| §24787456512 | Smoke Copilot | Copilot | 15:35 | Post-inference failure (22 turns) | auto-issue |
| §24786862492 | Design Decision Gate | Claude | 15:24 | Post-inference failure (3 turns) | auto-issue |
| §24786542053 | Breaking Change Checker | Copilot | 15:17 | Post-inference failure (10 turns) | auto-issue |
| §24787456499 | Smoke Gemini | Gemini | 15:35 | Engine exit code 144 | auto-issue |
| §24787456576 | Smoke Crush | Crush | 15:35 | Engine exit code 1 | auto-issue |

Key Evidence

Codex 401 (unchanged root cause — stale lock files, #27731):

  • AI Moderator triggered 6× between 16:18 and 18:25 UTC; all fail at 0 turns/0 tokens
  • Confirmed trace: wss://api.openai.com/v1/responses called directly; sk-place***roxy key rejected with 401
  • Smoke Codex + Changeset Generator on PR branch copilot/disable-shell-history-expansion hit same pattern (Codex 0.121.0)

awf-api-proxy unhealthy — NEW P0 (previously PR-branch only via #27688):

Container awf-api-proxy  Started
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
  • Now affecting main-branch scheduled (DeepReport) and push-triggered (Smoke CI) workflows
  • Failures clustered 15:01–15:35 UTC across 3+ engines — points to infra-level issue, not engine config
  • Smoke CI succeeded again at 15:24 UTC onward — may be intermittent (resource spike or flapping health check)
  • 0 turns, 0 tokens, 0 tool calls in all 4 affected runs — complete pre-inference block
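
The "clustered in time, spread across engines" observation can be checked mechanically from the failure table. The sketch below summarizes one cluster's time span and the engines it touched; the function name is hypothetical, and the sample rows are a subset copied from this window's table.

```python
from datetime import datetime


def cluster_summary(runs, cluster):
    """Summarize runs given as (time 'HH:MM', engine, cluster) for one cluster label."""
    hits = [(datetime.strptime(t, "%H:%M"), e) for t, e, c in runs if c == cluster]
    if not hits:
        return None
    times = [t for t, _ in hits]
    return {
        "count": len(hits),
        "first": min(times).strftime("%H:%M"),
        "last": max(times).strftime("%H:%M"),
        "engines": sorted({e for _, e in hits}),
    }


# Rows copied from the failure table for this window (UTC times).
RUNS = [
    ("15:01", "Claude", "awf-api-proxy unhealthy"),
    ("15:10", "Copilot", "awf-api-proxy unhealthy"),
    ("15:24", "Copilot", "awf-api-proxy unhealthy"),
    ("15:35", "OpenCode", "awf-api-proxy unhealthy"),
    ("15:35", "Codex", "Codex 401 loop"),
]

summary = cluster_summary(RUNS, "awf-api-proxy unhealthy")
```

A narrow time span combined with several distinct engines is the pattern that points at shared infrastructure rather than a single engine's configuration.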

Assessment

Two active P0 root causes in this window:

  1. #27731 (Codex 401, recompile lock files): ongoing since previous windows; all Codex workflows on main continue failing at every scheduled trigger
  2. awf-api-proxy unhealthy: escalated from a PR-branch issue to main-branch; sub-issue created for actionable follow-up


Note

🔒 Integrity filter blocked 5 items

The following items were blocked because they don't meet the GitHub integrity level.

  • push_to_pull_request_branch does not support multi-repo (side-repo) checkout pattern #27757 list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27880 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27881 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27882 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27883 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none
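
The filter behavior described above can be pictured as a comparison against an ordered list of levels. The sketch below is an illustration, not the actual implementation: it assumes the four values in the comment are ordered from lowest (`none`) to highest (`merged`), which the message implies but does not state, and the function name is hypothetical.

```python
# Assumed ordering, lowest to highest; inferred from the frontmatter comment above.
LEVELS = ["none", "unapproved", "approved", "merged"]


def is_readable(item_level: str, min_integrity: str) -> bool:
    """Return True if an item at item_level passes a filter set to min_integrity."""
    return LEVELS.index(item_level) >= LEVELS.index(min_integrity)
```

Under that assumption, an item at `unapproved` is blocked while `min-integrity` is `approved` and would only become readable if the setting were lowered to `unapproved` or `none`.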

Generated by [aw] Failure Investigator (6h) · ● 497.5K ·
