[aw-failures] Failure Investigation Report – April 17, 2026 (last 6h) #26779

@github-actions

Executive Summary

Investigated 33 runs across the 6-hour window ending ~01:11 UTC April 17, 2026. 18 failure conclusions were observed, collapsing into three active root causes plus one already-fixed regression. The dominant cause is a Docker socket GID shell-expansion bug (introduced in d73f38a, fixed in ae832fb/PR #26771) that crashed the gateway on pre-fix runs. A second, still-active cluster affects all Codex-engine workflows running in the firewall agent container, with a 100% failure rate in this window; it was not previously tracked.


Failure Clusters

| # | Cluster | Runs Affected | Engine(s) | Status | Severity |
|---|---------|---------------|-----------|--------|----------|
| 1 | MCP Gateway crash: Docker socket GID parsed as `--cpu-shares` flag | 10+ runs on SHAs d73f38a/798da498 | all | ✅ Fixed by ae832fb | Was P0 |
| 2 | Codex agent container: Read-only file system (os error 30) | 24539895233, 24541851028, 24541517687 | codex | ❌ Active (100% fail) | P0 |
| 3 | Serena MCP: EOF / connection closed before valid response | 24541850986 (Smoke Copilot test 3) | copilot | ❌ Active | P2 |
| 4 | Agentic Workflows MCP `status` tool: "failed to get workflow statuses" | 24541851021 (Smoke Claude test 10) | claude | ❌ Active | P2 |

Evidence

<details>
<summary>Cluster 1: Docker Socket GID Shell Expansion (FIXED)</summary>

Gateway stderr on pre-fix runs:

```
invalid argument "'%g'" for "-c, --cpu-shares" flag:
strconv.ParseInt: parsing "'%g'": invalid syntax
```

The `start_mcp_gateway.cjs` script built a docker command containing `--group-add $(stat -c '%g' /var/run/docker.sock)` and passed it to Node.js `spawn` with `shell: false`, so the `$(...)` command substitution was never evaluated. Docker received the tokens literally and parsed `-c` as the short form of `--cpu-shares`, with the still-quoted `'%g'` as its value, which fails integer parsing.

Fix: `ae832fb` (PR #26771) computes the GID separately before constructing the docker command.
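The report does not show the diff in `ae832fb`, so the following is only a minimal sketch of the general approach: resolve the GID in-process and pass it as a plain argument. The helper names `socketGid` and `buildDockerArgs` and the image name in the usage comment are hypothetical, not taken from the repository.

```javascript
const fs = require('node:fs');

// Hypothetical helper: read the group ID of the socket in-process instead of
// relying on shell command substitution, which never runs with shell: false.
function socketGid(socketPath) {
  return fs.statSync(socketPath).gid;
}

// Hypothetical helper: build the argv array for `docker run`.
// Every element is passed to docker verbatim, so it must already be resolved.
function buildDockerArgs(gid, image) {
  return ['run', '--rm', '--group-add', String(gid), image];
}

// Usage (not executed here; the image name is a placeholder):
// const { spawn } = require('node:child_process');
// const gid = socketGid('/var/run/docker.sock');
// spawn('docker', buildDockerArgs(gid, 'example/gateway:latest'), { shell: false });
```

Keeping `shell: false` and resolving values before building the array avoids reintroducing shell parsing just to obtain one number.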

Affected runs (sample): [§24540514568](https://github.com/github/gh-aw/actions/runs/24540514568), [§24542320562](https://github.com/github/gh-aw/actions/runs/24542320562), [§24542200764](https://github.com/github/gh-aw/actions/runs/24542200764), [§24542086157](https://github.com/github/gh-aw/actions/runs/24542086157)

</details>

<details>
<summary>Cluster 2 — Codex Agent Container Read-Only Filesystem (ACTIVE)</summary>

From `agent-stdio.log` of run [§24541851028](https://github.com/github/gh-aw/actions/runs/24541851028) (Changeset Generator, codex):
```
[entrypoint] CLI bin directory locked (read-only): /home/runner/work/_temp/gh-aw/mcp-cli/bin
WARNING: proceeding, even though we could not update PATH: Read-only file system (os error 30)
Error: Read-only file system (os error 30)
```

And from run [§24539895233](https://github.com/github/gh-aw/actions/runs/24539895233) (Daily Observability Report, codex):
```
CLI bin directory locked (read-only): /home/runner/work/_temp/gh-aw/mcp-cli/bin
Error: Read-only file system (os error 30)
```

The container entrypoint drops into chroot mode (CAP_SYS_CHROOT is held during setup). During first-run setup, the codex CLI then attempts a write while updating its PATH. Because `/home/runner/work/_temp/gh-aw/mcp-cli/bin` is mounted read-only inside the chroot, the write fails with os error 30 and codex cannot proceed.

Note: the log line `[entrypoint][WARN] Failed to transfer /host/home/runner/work/_temp/gh-aw/safeoutputs ownership to chroot user` also appears, suggesting ownership or permission issues in the mount setup.
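To separate a read-only mount from an ownership problem while investigating, a small probe along these lines could be run inside the chroot against each suspect directory. This is a sketch only: `probeWritable` is a hypothetical helper, and it assumes Node.js is available in the container, which this report does not confirm.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical diagnostic: try to create and remove a probe file in `dir`
// and classify the outcome by the errno code Node.js reports.
function probeWritable(dir) {
  const probe = path.join(dir, `.write-probe-${process.pid}`);
  try {
    fs.writeFileSync(probe, '');
    fs.unlinkSync(probe);
    return 'writable';
  } catch (err) {
    // EROFS is errno 30 on Linux, the same "os error 30" seen in the logs.
    if (err.code === 'EROFS') return 'read-only filesystem';
    if (err.code === 'EACCES' || err.code === 'EPERM') return 'permission denied';
    if (err.code === 'ENOENT') return 'missing';
    return `other: ${err.code}`;
  }
}
```

A `read-only filesystem` result points at the mount options, whereas `permission denied` points at the ownership transfer mentioned in the warning.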

All 3 Codex-in-container runs in this window failed with exit code 1 before executing any agent turns.

</details>

Cluster 3 — Serena MCP EOF Errors

From Smoke Copilot issue #26773, run §24541850986:

Test 3 — Serena CLI: The Serena MCP server returned EOF / connection closed errors on both activate_project and find_symbol calls. The server accepted the connection and returned HTTP 200 but then closed the connection before delivering a valid result.

Cluster 4 — Agentic Workflows MCP `status` Tool Error

From Smoke Claude report #26776, run §24541851021:

Test 10 — Agentic Workflows MCP: status tool returned error: "failed to get workflow statuses"


Existing Issue Correlation

| Open Issue | Title | Relationship |
|------------|-------|--------------|
| #26778 | PR Triage Agent failed | Cluster 1 (pre-fix, expires today) |
| #26775 | Issue Monster failed | Cluster 1 (pre-fix, expires today) |
| #26770 | Test Quality Sentinel failed | Cluster 1 (pre-fix, expires today) |
| #26769 | Design Decision Gate failed | Cluster 1 (pre-fix, expires today) |
| #26768 | Smoke Claude failed | Cluster 1 (pre-fix, expires today) |
| #26767 | Smoke Codex failed | Cluster 1 (pre-fix, expires today) |
| #26766 | Agent Container Smoke Test failed | Cluster 1 (pre-fix, expires today) |
| #26762 | Lockfile Statistics Analysis Agent failed | Cluster 1 (pre-fix, expires today) |
| #26761 | Daily Observability Report failed | Cluster 2: Codex read-only FS (active) |
| #26351 | Smoke Gemini failed (API key invalid) | Separate: Gemini API key expired (recurring) |
| #26393 | Daily Issues Report Generator: node: command not found | Separate: copilot path issue (recurring) |

Cluster 1 issues will auto-expire today — no manual action needed.
Cluster 2 has no adequate tracking issue yet → see sub-issue #26781.


Proposed Fix Roadmap

| Priority | Item | Action |
|----------|------|--------|
| P0 | Codex agent container: read-only FS crash | Investigate why `mcp-cli/bin` is mounted read-only in chroot; fix PATH bootstrap or codex CLI startup. See #26781 |
| P1 | Verify Cluster 1 fully resolved on current HEAD | Run smoke suite on ae832fb and confirm gateway starts cleanly (the current CI run should confirm this) |
| P2 | Serena MCP EOF | Investigate Serena backend stability; add retry logic or graceful failure in smoke test |
| P2 | Agentic Workflows MCP `status` tool | Investigate `failed to get workflow statuses` error in the MCP server |
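
The retry logic suggested for the Serena smoke test could look roughly like the sketch below. `withRetry` is a hypothetical helper, and the attempt count and delay are placeholder values; how the smoke test actually invokes the MCP tools is not shown in this report.

```javascript
// Hypothetical helper: retry an async call a fixed number of times with a
// constant delay, rethrowing the last error if every attempt fails.
async function withRetry(fn, attempts = 3, delayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Wrapping only the individual tool calls keeps a persistent backend failure visible: the last error still propagates once the attempts are exhausted.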
