Executive Summary
Investigated 33 runs across the 6-hour window ending ~01:11 UTC on April 17, 2026. 18 failure conclusions were observed, collapsing into three active root causes plus one already-fixed regression. The dominant story is a Docker socket GID shell-expansion bug (introduced in `d73f38a`, fixed in `ae832fb` / PR #26771) that blanketed pre-fix runs with gateway crashes. A second, still-active cluster affects all Codex-engine workflows running in the firewall agent container: a 100% failure rate that was not previously tracked.
Failure Clusters

| # | Cluster | Runs Affected | Engine(s) | Status | Severity |
|---|---------|---------------|-----------|--------|----------|
| 1 | MCP Gateway crash — Docker socket GID parsed as `--cpu-shares` flag | 10+ runs on SHAs `d73f38a`/`798da498` | all | ✅ Fixed by `ae832fb` | Was P0 |
| 2 | Codex agent container — `Read-only file system (os error 30)` | 24539895233, 24541851028, 24541517687 | codex | ❌ Active (100% fail) | P0 |
| 3 | Serena MCP — EOF / connection closed before valid response | 24541850986 (Smoke Copilot test 3) | copilot | ❌ Active | P2 |
| 4 | Agentic Workflows MCP `status` tool — "failed to get workflow statuses" | 24541851021 (Smoke Claude test 10) | claude | ❌ Active | P2 |

Evidence
<details>
<summary>Cluster 1 — Docker Socket GID Shell Expansion (FIXED)</summary>
Gateway stderr on pre-fix runs:
```
invalid argument "'%g'" for "-c, --cpu-shares" flag:
strconv.ParseInt: parsing "'%g'": invalid syntax
```
The `start_mcp_gateway.cjs` script was building a docker command containing `--group-add $(stat -c '%g' /var/run/docker.sock)` and passing it to Node.js `spawn` with `shell: false`, so the `$(...)` was never evaluated. Docker received the literal string and misinterpreted `-c '%g'` as the `--cpu-shares` (`-c`) flag value.
Fix: `ae832fb` (PR #26771) computes the GID separately before constructing the docker command.
Affected runs (sample): [§24540514568](https://github.com/github/gh-aw/actions/runs/24540514568), [§24542320562](https://github.com/github/gh-aw/actions/runs/24542320562), [§24542200764](https://github.com/github/gh-aw/actions/runs/24542200764), [§24542086157](https://github.com/github/gh-aw/actions/runs/24542086157)
</details>
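
The Cluster 1 failure mode can be illustrated with a minimal sketch. It is not the code from `start_mcp_gateway.cjs` or from `ae832fb`: the function names are hypothetical, and the whitespace split is an assumption about how the command string reached `spawn`. It shows why `-c` ends up as a standalone argument when no shell expands `$(...)`, and one shell-free way to obtain the GID (via `fs.statSync`).

```javascript
// Sketch only: names are hypothetical, not taken from start_mcp_gateway.cjs.
const fs = require('node:fs');
const os = require('node:os');

// Assumed failure mode: a shell-style string is split on whitespace and the
// pieces are passed as argv without a shell, so `$(...)` is never evaluated.
function naiveArgs(socketPath) {
  return `--group-add $(stat -c '%g' ${socketPath})`.split(' ');
}

// Shell-free alternative: read the socket's group ID with fs.statSync and
// pass it as a plain argv element.
function groupAddArgs(socketPath) {
  const gid = fs.statSync(socketPath).gid;
  return ['--group-add', String(gid)];
}

// `-c` is now its own argument, followed by the literal `'%g'`.
const broken = naiveArgs('/var/run/docker.sock');
console.log(broken);

// The temp directory is a stand-in so the sketch runs without Docker.
const fixed = groupAddArgs(os.tmpdir());
console.log(fixed);
```

With the second form the argument list contains no shell syntax at all, so it behaves the same whether or not a shell is involved.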
<details>
<summary>Cluster 2 — Codex Agent Container Read-Only Filesystem (ACTIVE)</summary>
From `agent-stdio.log` of run [§24541851028](https://github.com/github/gh-aw/actions/runs/24541851028) (Changeset Generator, codex):
```
[entrypoint] CLI bin directory locked (read-only): /home/runner/work/_temp/gh-aw/mcp-cli/bin
WARNING: proceeding, even though we could not update PATH: Read-only file system (os error 30)
Error: Read-only file system (os error 30)
```
And from run [§24539895233](https://github.com/github/gh-aw/actions/runs/24539895233) (Daily Observability Report, codex):
```
CLI bin directory locked (read-only): /home/runner/work/_temp/gh-aw/mcp-cli/bin
Error: Read-only file system (os error 30)
```
The container entrypoint drops to chroot mode (`CAP_SYS_CHROOT` is held during setup). The `codex` CLI then attempts to write to its PATH during first-run setup. The `/home/runner/work/_temp/gh-aw/mcp-cli/bin` directory is mounted read-only inside the chroot, and `codex` cannot proceed.

Note: `[entrypoint][WARN] Failed to transfer /host/home/runner/work/_temp/gh-aw/safeoutputs ownership to chroot user` also appears, suggesting ownership/permission issues in the mount setup.

All 3 Codex-in-container runs in this window failed with exit code 1 before executing any agent turns.
</details>
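
When investigating Cluster 2, a small preflight probe could help distinguish a read-only mount (`EROFS`) from an ownership problem (`EACCES`) before the CLI starts. The following is a sketch under assumptions: `probeWritable` is a hypothetical helper, not part of the entrypoint or the codex CLI, and the temp directory stands in for the real `mcp-cli/bin` path.

```javascript
// Sketch only: probeWritable is a hypothetical diagnostic helper.
const fs = require('node:fs');
const os = require('node:os');

// Returns { writable, code }. When the write-permission check fails, `code`
// is the errno name reported by Node (for example EROFS, EACCES, ENOENT).
function probeWritable(dir) {
  try {
    fs.accessSync(dir, fs.constants.W_OK);
    return { writable: true, code: null };
  } catch (err) {
    return { writable: false, code: err.code };
  }
}

// Stand-in path; in the container this would be the mcp-cli/bin directory.
const binDir = os.tmpdir();
const result = probeWritable(binDir);
const status = result.writable ? 'writable' : `not writable (${result.code})`;
console.log(`${binDir}: ${status}`);
```

Logging the errno name alongside the path would make the cause visible in `agent-stdio.log` without changing how the directory is mounted.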
Cluster 3 — Serena MCP EOF Errors
From Smoke Copilot issue #26773, run §24541850986:
> Test 3 — Serena CLI: The Serena MCP server returned EOF / connection closed errors on both `activate_project` and `find_symbol` calls. The server accepted the connection and returned HTTP 200 but then closed the connection before delivering a valid result.
Cluster 4 — Agentic Workflows MCP `status` Tool Error
From Smoke Claude report #26776, run §24541851021:
> Test 10 — Agentic Workflows MCP: `status` tool returned error: "failed to get workflow statuses"
Existing Issue Correlation

| Open Issue | Title | Relationship |
|------------|-------|--------------|
| #26778 | PR Triage Agent failed | Cluster 1 (pre-fix, expires today) |
| #26775 | Issue Monster failed | Cluster 1 (pre-fix, expires today) |
| #26770 | Test Quality Sentinel failed | Cluster 1 (pre-fix, expires today) |
| #26769 | Design Decision Gate failed | Cluster 1 (pre-fix, expires today) |
| #26768 | Smoke Claude failed | Cluster 1 (pre-fix, expires today) |
| #26767 | Smoke Codex failed | Cluster 1 (pre-fix, expires today) |
| #26766 | Agent Container Smoke Test failed | Cluster 1 (pre-fix, expires today) |
| #26762 | Lockfile Statistics Analysis Agent failed | Cluster 1 (pre-fix, expires today) |
| #26761 | Daily Observability Report failed | Cluster 2 — Codex read-only FS (active) |
| #26351 | Smoke Gemini failed (API key invalid) | Separate: Gemini API key expired (recurring) |
| #26393 | Daily Issues Report Generator: `node: command not found` | Separate: copilot path issue (recurring) |

Cluster 1 issues will auto-expire today — no manual action needed.
Cluster 2 has no adequate tracking issue yet → see sub-issue #26781.
Proposed Fix Roadmap

| Priority | Item | Action |
|----------|------|--------|
| P0 | Codex agent container — read-only FS crash | Investigate why `mcp-cli/bin` is mounted read-only in chroot; fix PATH bootstrap or codex CLI startup. See #26781 |
| P1 | Verify Cluster 1 fully resolved on current HEAD | Run smoke suite on `ae832fb` and confirm gateway starts cleanly (the current CI run should confirm this) |
| P2 | Serena MCP EOF | Investigate Serena backend stability; add retry logic or graceful failure in smoke test |
| P2 | Agentic Workflows MCP `status` tool | Investigate `failed to get workflow statuses` error in the MCP server |

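
For the P2 Serena item, the "retry logic" option could look roughly like the sketch below. It is an illustration only: `withRetry` and `flakyCall` are hypothetical names, the attempt count and delay are arbitrary, and the real smoke test's MCP client API is not shown here.

```javascript
// Sketch only: a generic retry wrapper, not the smoke test's actual code.
// Retries an async call up to `attempts` times with a linearly growing delay.
async function withRetry(fn, { attempts = 3, delayMs = 100 } = {}) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * i));
      }
    }
  }
  // All attempts failed: surface the last error so the test can report it.
  throw lastError;
}

// Stand-in for a flaky MCP tool call: fails twice with an EOF-style error,
// then succeeds on the third call.
let calls = 0;
async function flakyCall() {
  calls += 1;
  if (calls < 3) throw new Error('EOF');
  return { ok: true };
}

const demo = withRetry(flakyCall, { attempts: 3, delayMs: 10 });
demo.then((result) => console.log(result, `after ${calls} calls`));
```

Retries would only mask the symptom in the smoke test, so the backend stability investigation in the same roadmap row still applies.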
Sub-Issues Created
References: