[aw-failures] 6h Failure Report 2026-04-22: Codex 401 Auth Loop — Lock File Root Cause Identified #27729

### Overview

Two scheduled workflows failed in the 6-hour window ending 2026-04-22T01:11Z, both with `401 Unauthorized` from `api.openai.com/v1/responses`. Root cause: the workflows are running from lock files compiled without the `openai-proxy` provider config introduced in PR #27711. They attempt to reach `api.openai.com` directly, but the `OPENAI_API_KEY` only works via the internal proxy (`172.30.0.30:10000`). The fix is already tracked in #27724 (recompile lock files); this report links the failures to that issue.

### Failure Clusters

| Run ID | Workflow | Engine | Conclusion | Error | Tracked |
|--------|----------|--------|------------|-------|---------|
| [§24752310887](https://github.com/github/gh-aw/actions/runs/24752310887) | AI Moderator | codex | failure | 401 Unauthorized `api.openai.com` | #27678 (older run) |
| [§24752303616](https://github.com/github/gh-aw/actions/runs/24752303616) | Daily Observability Report | codex | failure | 401 Unauthorized `api.openai.com` | #27716 |
| [§24752934132](https://github.com/github/gh-aw/actions/runs/24752934132) | Smoke CI | copilot | cancelled | Superseded by next push (47s) | — |
| [§24752949457](https://github.com/github/gh-aw/actions/runs/24752949457) | Smoke CI | copilot | cancelled | Superseded by next push (2.7m) | — |

### Evidence

<details>
<summary>Audit-diff: failed AI Moderator (24752310887) vs successful Design Decision Gate (24752301186)</summary>

- **Firewall**: Failed run made 13 allowed calls to `api.openai.com:443` and 1 blocked to `chatgpt.com:443`; successful run only contacted `api.anthropic.com:443`
- **Token output**: Failed run recorded **0 tokens** — engine crashed before producing any output
- **Duration**: Failed run ran 8m57s vs 3m25s for the successful run — codex was retrying reconnections (`Reconnecting... 1/5` through `5/5`) before giving up
- **No MCP tools called**: Failed run never reached safe_outputs — engine failure was terminal

</details>
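
The firewall comparison above amounts to a set difference over contacted hosts. A minimal sketch of that comparison follows, assuming firewall log entries can be reduced to `(host, verdict)` pairs. That record shape and the `hosts_only_in` helper are hypothetical, and the single entry for the successful run is a placeholder because the report does not state its call count.

```python
from collections import Counter

# Hypothetical record shape: one (host, verdict) pair per firewall log entry.
# Counts for the failed run come from the audit-diff above.
failed_run = [("api.openai.com:443", "allowed")] * 13 + [("chatgpt.com:443", "blocked")]
# The report gives no call count for the successful run; one entry is a placeholder.
successful_run = [("api.anthropic.com:443", "allowed")]


def hosts_only_in(first, second):
    """Return {host: Counter(verdict -> count)} for hosts seen in `first` but never in `second`."""
    seen_in_second = {host for host, _ in second}
    summary = {}
    for host, verdict in first:
        if host not in seen_in_second:
            summary.setdefault(host, Counter())[verdict] += 1
    return summary


diff = hosts_only_in(failed_run, successful_run)
for host, verdicts in sorted(diff.items()):
    print(host, dict(verdicts))
```

Hosts that appear only in the failed run are the first place to look when a run produces zero tokens.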

<details>
<summary>Error signature (from issues #27678, #27716)</summary>

```
Reconnecting... 2/5
Reconnecting... 3/5
Reconnecting... 4/5
Reconnecting... 5/5
Reconnecting... 1/5
unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, 
url: (api.openai.com/redacted), cf-ray: ..., request id: req_...
```

</details>

### Root Cause Analysis

PR #27711 (commit `28d8df1`) added logic to inject an `openai-proxy` provider into the generated codex config, routing codex through `(172.30.0.30/redacted)` instead of `api.openai.com` directly.

**But** the lock files (`.lock.yml`) have not been recompiled since this change, as tracked in #27724. As a result, codex workflows currently run from stale lock files that send requests directly to `api.openai.com` with an API key that is only valid when routed through the internal AWF proxy.

**Chain**: PR #27711 merged → lock files stale → codex workflows use old config → 401 at `api.openai.com`
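
One way to confirm which workflows are still affected is to scan the lock files for the provider name. The sketch below is only a heuristic under stated assumptions: the `.github/workflows` location, the idea that a Codex lock file contains the string `codex`, and matching on the raw string `openai-proxy` are all guesses for illustration, not the compiler's actual layout.

```python
from pathlib import Path

# Provider name from PR #27711; matching on the raw string is an assumption.
PROXY_MARKER = "openai-proxy"


def uses_codex(lock_text: str) -> bool:
    # Assumption: a compiled Codex workflow mentions "codex" somewhere in its lock file.
    return "codex" in lock_text.lower()


def missing_proxy_config(lock_text: str) -> bool:
    """True when a Codex lock file shows no trace of the proxy provider."""
    return uses_codex(lock_text) and PROXY_MARKER not in lock_text


def find_stale_lock_files(root: str = ".github/workflows") -> list[str]:
    # The default root is a hypothetical location for the *.lock.yml files.
    base = Path(root)
    if not base.is_dir():
        return []
    return [
        str(path)
        for path in sorted(base.glob("*.lock.yml"))
        if missing_proxy_config(path.read_text(encoding="utf-8"))
    ]


if __name__ == "__main__":
    for name in find_stale_lock_files():
        print(f"stale: {name}")
```

A non-empty result would list the Codex workflows that still need a recompile.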

### Existing Issue Correlation

| Issue | Status | Notes |
|-------|--------|-------|
| #27724 | Open | Lock files out of sync — the **fix** (recompile) |
| #27716 | Open | Daily Obs Report 401 failure — tracked |
| #27678 | Open, expires ~9 AM | AI Moderator 401 failure — older run tracked, new run 24752310887 not yet in scope |
| #27689 | Open, expires ~9 AM | Smoke Codex 401 failure — same pattern |

### Proposed Fix Roadmap

| Priority | Action | Owner |
|----------|--------|-------|
| **P0** | Recompile lock files per #27724 — see sub-issue #27731 | On-call |
| P1 | After recompile, verify AI Moderator and Daily Obs Report on next scheduled run | Auto |
| P2 | Consider adding a CI check to block merges when lock files are stale | Dev |
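
The P2 item could be approached by recompiling in CI and comparing the result with what is committed. The sketch below shows only the comparison step and assumes some earlier step has produced both sets of lock file contents. The function name, the file name, and the placeholder contents are hypothetical.

```python
def stale_lock_files(committed: dict[str, str], recompiled: dict[str, str]) -> list[str]:
    """Return paths whose committed lock content is missing or differs from a fresh compile."""
    stale = []
    for path, fresh in sorted(recompiled.items()):
        if committed.get(path) != fresh:
            stale.append(path)
    return stale


# Placeholder contents for illustration only; real lock files are full workflow YAML.
committed = {"ai-moderator.lock.yml": "# compiled before PR #27711\n"}
recompiled = {"ai-moderator.lock.yml": "# compiled after PR #27711\n"}

stale = stale_lock_files(committed, recompiled)
print(stale)
```

A CI job built on this would exit non-zero whenever the list is non-empty, blocking the merge until the lock files are recompiled.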

### Sub-Issues Created

- #27731 — Recompile lock files to unblock Codex 401 failures

**References:**
- [§24752310887](https://github.com/github/gh-aw/actions/runs/24752310887) — AI Moderator failure (6h window)
- [§24752303616](https://github.com/github/gh-aw/actions/runs/24752303616) — Daily Obs Report failure (6h window)
- [§24754857804](https://github.com/github/gh-aw/actions/runs/24754857804) — This investigator run

---


### 6h Follow-Up Window: 2026-04-22T01:11Z → 07:14Z

**P0 fix (#27731) still unresolved — Codex 401 failures continue on `main`.**

#### New failures in this window

| Run ID | Workflow | Engine | Time (UTC) | Branch | Status |
|--------|----------|--------|------------|--------|--------|
| [§24762849092](https://github.com/github/gh-aw/actions/runs/24762849092) | AI Moderator | Codex 0.121.0 | 06:00 | main | failure — 401 loop |
| [§24762218468](https://github.com/github/gh-aw/actions/runs/24762218468) | Smoke Codex | Codex 0.121.0 | 05:40 | PR branch (now merged) | failure — 401 loop |
| [§24762218497](https://github.com/github/gh-aw/actions/runs/24762218497) | Smoke Gemini | Gemini CLI | 05:40 | PR branch (now merged) | failure — awf-api-proxy unhealthy |

#### Evidence — AI Moderator (06:00 UTC, post-previous-report)

Same error signature as prior window — `sk-place***roxy` key sent directly to `api.openai.com`:

```
startup websocket prewarm setup failed: unexpected status 401 Unauthorized:
Incorrect API key provided: sk-place****************roxy.
url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 2/5 ... 3/5 ... (agent exits, safe_outputs: no output)
```

Note: PR #27762 (container image digest pins, merged in same window) does **not** fix the openai-proxy lock-file issue. The Codex 401 root cause from #27731 remains outstanding.

#### Smoke Gemini — separate failure (awf-api-proxy)

Different root cause from Codex 401. The Gemini API proxy sidecar container failed its health check on the PR branch:
```
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```
Tracked by #27688 (smoke test on now-merged PR branch — lower priority).

#### Assessment

No new P0 failure clusters. **#27731 (recompile lock files) remains the single highest-priority fix.** Codex workflows running on `main` will continue failing at every scheduled trigger until the proxy config is added to the lock files.

**References:**
- [§24762849092](https://github.com/github/gh-aw/actions/runs/24762849092) — AI Moderator failure (this window)
- [§24765410232](https://github.com/github/gh-aw/actions/runs/24765410232) — This investigator run

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24765410232/agentic_workflow) · ● 235.5K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---


### 6h Window: 2026-04-22T07:14Z → 13:14Z

**P0 fix (#27731 — recompile lock files) remains unresolved. Codex 401 failures continue. A new untracked P0 cluster identified: Copilot `node: command not found`.**

#### Failure Clusters (8 runs, 100% error rate)

| Run ID | Workflow | Engine | Time (UTC) | Cluster | Tracked? |
|--------|----------|--------|------------|---------|----------|
| [§24771662916](https://github.com/github/gh-aw/actions/runs/24771662916) | AI Moderator | Codex | 09:46 | Codex 401 loop | #27789, #27731 |
| [§24773828176](https://github.com/github/gh-aw/actions/runs/24773828176) | Daily Issues Report Generator | Copilot | 10:39 | `node not found` | NEW → #aw_node404 |
| [§24774440815](https://github.com/github/gh-aw/actions/runs/24774440815) | Daily Documentation Updater | Claude | 10:54 | Protected files block | #27801 |
| [§24775941358](https://github.com/github/gh-aw/actions/runs/24775941358) | Daily Fact About gh-aw | Codex | 11:32 | Codex 401 loop | #27810, #27731 |
| [§24777073236](https://github.com/github/gh-aw/actions/runs/24777073236) | Duplicate Code Detector | Codex | 12:00 | Codex 401 + GitHub blocked | #27816, #27731 |
| [§24778392225](https://github.com/github/gh-aw/actions/runs/24778392225) | AI Moderator | Codex | 12:31 | Codex 401 loop | #27789, #27731 |
| [§24779123568](https://github.com/github/gh-aw/actions/runs/24779123568) | Auto-Triage Issues | Copilot | 12:48 | Copilot engine crash | #27827 |
| [§24769919588](https://github.com/github/gh-aw/actions/runs/24769919588) | Daily News | Copilot | 09:05 | `node not found` | NEW → #aw_node404 |

#### Key Evidence from Audit

**Codex 401 (still root-caused to stale lock files — #27724/#27731):**
- `sk-place***roxy` key sent directly to `api.openai.com/v1/responses` → 401, retries 1–5, engine crash
- Additional signal: Codex is also attempting `chatgpt.com:443` (blocked by firewall) — secondary concern
- Duplicate Code Detector (§24777073236) additionally missing `api.github.com:443` and `github.com:443` from allow-list
- **0 tokens produced across all 4 Codex failures** — engine crashes pre-inference

**Copilot `node not found` (NEW — no prior tracking):**
```
/bin/bash: line 1: node: command not found
```
- Affects Daily News (§24769919588) and Daily Issues Report Generator (§24773828176)
- 0 turns, 0 tool calls — engine cannot initialize without Node.js in PATH
- Sub-issue created: see #aw_node404

**Daily Documentation Updater (post-inference failure, $2.68 spent):**
- Agent completed 81 turns successfully; `create_pull_request` blocked by protected files
- Files blocked: `.github/aw/create-agentic-workflow.md`, `.github/aw/github-agentic-workflows.md`
- Fix: add `protected-files: fallback-to-issue` to workflow frontmatter
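
The log signatures quoted in this report are distinct enough to match mechanically. As a rough sketch, a classifier could map a log excerpt to a cluster. The patterns are copied from the excerpts in this report, while the cluster labels and the `classify` helper are hypothetical names chosen for illustration.

```python
import re

# Patterns copied from log excerpts in this report; the labels are this sketch's own.
SIGNATURES = [
    ("codex-401", re.compile(r"401 Unauthorized")),
    ("awf-api-proxy-unhealthy", re.compile(r"container awf-api-proxy is unhealthy")),
    ("node-not-found", re.compile(r"node: command not found")),
]


def classify(log_text: str) -> str:
    """Return the label of the first signature found in the log, else 'unclassified'."""
    for label, pattern in SIGNATURES:
        if pattern.search(log_text):
            return label
    return "unclassified"


samples = [
    "unexpected status 401 Unauthorized: Missing bearer or basic authentication in header",
    "dependency failed to start: container awf-api-proxy is unhealthy",
    "/bin/bash: line 1: node: command not found",
]
for line in samples:
    print(classify(line), "<-", line)
```

Signature order matters only if one log contains several patterns; here the first match wins.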

#### Assessment

The Codex 401 cluster is unchanged: **#27731 remains the single highest-priority fix**. All Codex workflows on `main` will continue failing at every scheduled trigger until lock files are recompiled with the `openai-proxy` provider config.

The Copilot `node not found` pattern is a newly identified P0 (see sub-issue).

**References:**
- [§24778392225](https://github.com/github/gh-aw/actions/runs/24778392225) — AI Moderator (this window)
- [§24777073236](https://github.com/github/gh-aw/actions/runs/24777073236) — Duplicate Code Detector (this window)
- [§24780078226](https://github.com/github/gh-aw/actions/runs/24780078226) — This investigator run

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24780078226/agentic_workflow) · ● 350.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---


### 6h Window: 2026-04-22T13:14Z → 19:14Z

**P0 #27731 (Codex 401 / lock file recompile) remains unresolved — AI Moderator failed 6 more times. New P0 identified: `awf-api-proxy` sidecar now unhealthy on main-branch workflows.**

#### Failure Clusters (18 runs)

| Run ID | Workflow | Engine | Time (UTC) | Cluster | Tracked? |
|--------|----------|--------|------------|---------|----------|
| [§24795392631](https://github.com/github/gh-aw/actions/runs/24795392631) | AI Moderator | Codex | 18:25 | Codex 401 loop | #27731 |
| [§24791757918](https://github.com/github/gh-aw/actions/runs/24791757918) | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| [§24791742300](https://github.com/github/gh-aw/actions/runs/24791742300) | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| [§24791727612](https://github.com/github/gh-aw/actions/runs/24791727612) | AI Moderator | Codex | 17:05 | Codex 401 loop | #27731 |
| [§24791210275](https://github.com/github/gh-aw/actions/runs/24791210275) | AI Moderator | Codex | 16:53 | Codex 401 loop | #27731 |
| [§24789574257](https://github.com/github/gh-aw/actions/runs/24789574257) | AI Moderator | Codex | 16:18 | Codex 401 loop | #27731 |
| [§24785716916](https://github.com/github/gh-aw/actions/runs/24785716916) | DeepReport | Claude | 15:01 | awf-api-proxy unhealthy | NEW → sub-issue |
| [§24786173739](https://github.com/github/gh-aw/actions/runs/24786173739) | Smoke CI | Copilot | 15:10 | awf-api-proxy unhealthy | NEW → sub-issue |
| [§24786862508](https://github.com/github/gh-aw/actions/runs/24786862508) | Test Quality Sentinel | Copilot | 15:24 | awf-api-proxy unhealthy | NEW → sub-issue |
| [§24787456440](https://github.com/github/gh-aw/actions/runs/24787456440) | Smoke OpenCode | OpenCode | 15:35 | awf-api-proxy unhealthy | NEW → sub-issue |
| [§24787456577](https://github.com/github/gh-aw/actions/runs/24787456577) | Smoke Codex | Codex | 15:35 | Codex 401 loop | #27731 |
| [§24787456533](https://github.com/github/gh-aw/actions/runs/24787456533) | Changeset Generator | Codex | 15:35 | Codex 401 loop | #27731 |
| [§24788136304](https://github.com/github/gh-aw/actions/runs/24788136304) | Smoke Copilot | Copilot | 15:49 | Post-inference failure (30 turns) | auto-issue |
| [§24787456512](https://github.com/github/gh-aw/actions/runs/24787456512) | Smoke Copilot | Copilot | 15:35 | Post-inference failure (22 turns) | auto-issue |
| [§24786862492](https://github.com/github/gh-aw/actions/runs/24786862492) | Design Decision Gate | Claude | 15:24 | Post-inference failure (3 turns) | auto-issue |
| [§24786542053](https://github.com/github/gh-aw/actions/runs/24786542053) | Breaking Change Checker | Copilot | 15:17 | Post-inference failure (10 turns) | auto-issue |
| [§24787456499](https://github.com/github/gh-aw/actions/runs/24787456499) | Smoke Gemini | Gemini | 15:35 | Engine exit code 144 | auto-issue |
| [§24787456576](https://github.com/github/gh-aw/actions/runs/24787456576) | Smoke Crush | Crush | 15:35 | Engine exit code 1 | auto-issue |
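
As a quick cross-check of the table above, the Cluster column can be tallied per root cause. This is a sketch only: the `normalize` helper and the grouping of the two engine exit codes under one label are this example's own choices, not part of the investigator workflow.

```python
import re
from collections import Counter

# Cluster column of the table above, in row order.
clusters = (
    ["Codex 401 loop"] * 6            # AI Moderator runs
    + ["awf-api-proxy unhealthy"] * 4
    + ["Codex 401 loop"] * 2          # Smoke Codex, Changeset Generator
    + [
        "Post-inference failure (30 turns)",
        "Post-inference failure (22 turns)",
        "Post-inference failure (3 turns)",
        "Post-inference failure (10 turns)",
        "Engine exit code 144",
        "Engine exit code 1",
    ]
)


def normalize(cluster: str) -> str:
    """Drop per-run details so rows with the same kind of failure group together."""
    cluster = re.sub(r"\s*\(\d+ turns\)$", "", cluster)
    return re.sub(r"^Engine exit code \d+$", "Engine exit (non-zero)", cluster)


tally = Counter(normalize(c) for c in clusters)
for name, count in tally.most_common():
    print(f"{count:2d}  {name}")
```

The counts sum to the 18 runs in the heading, with the Codex 401 loop accounting for 8 of them.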

#### Key Evidence

**Codex 401 (unchanged root cause — stale lock files, #27731):**
- AI Moderator triggered 6× between 16:18 and 18:25 UTC; all fail at 0 turns/0 tokens
- Confirmed trace: `wss://api.openai.com/v1/responses` called directly; `sk-place***roxy` key rejected with 401
- Smoke Codex + Changeset Generator on PR branch `copilot/disable-shell-history-expansion` hit same pattern (Codex 0.121.0)

**`awf-api-proxy` unhealthy — NEW P0 (previously PR-branch only via #27688):**

```
Container awf-api-proxy  Started
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```

- Now affecting main-branch scheduled (DeepReport) and push-triggered (Smoke CI) workflows
- Failures clustered 15:01–15:35 UTC across 3+ engines — points to infra-level issue, not engine config
- Smoke CI succeeded again from 15:24 UTC onward, so the failure may be intermittent (resource spike or flapping health check)
- 0 turns, 0 tokens, 0 tool calls in all 4 affected runs — complete pre-inference block

#### Assessment

Two active P0 root causes in this window:
1. **#27731 (Codex 401)** — ongoing since previous windows; all Codex workflows on `main` continue to fail at every scheduled trigger
2. **`awf-api-proxy` unhealthy** — escalated from PR-branch issue to main-branch; sub-issue created below for actionable follow-up

**References:**
- [§24795392631](https://github.com/github/gh-aw/actions/runs/24795392631) — AI Moderator (latest Codex 401 in window)
- [§24785716916](https://github.com/github/gh-aw/actions/runs/24785716916) — DeepReport (awf-api-proxy failure on main)
- [§24797384460](https://github.com/github/gh-aw/actions/runs/24797384460) — This investigator run

> [!NOTE]
> <details>
> <summary>🔒 Integrity filter blocked 5 items</summary>
>
> The following items were blocked because they don't meet the GitHub integrity level.
>
> - github/gh-aw#27757 `list_issues`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27880](https://github.com/github/gh-aw/issues/27880) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27881](https://github.com/github/gh-aw/issues/27881) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27882](https://github.com/github/gh-aw/issues/27882) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27883](https://github.com/github/gh-aw/issues/27883) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
>
> To allow these resources, lower `min-integrity` in your GitHub frontmatter:
>
> ```yaml
> tools:
>   github:
>     min-integrity: approved  # merged | approved | unapproved | none
> ```
>
> </details>


> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24797384460/agentic_workflow) · ● 497.5K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
