
[aw-failures] Failure Investigation Report — 6h window (2026-04-24) #28267

@github-actions

Executive Summary

52 workflow runs in the last 6 hours (approximate window: 07:00–13:14 UTC); 5 failures across 4 distinct clusters. Three clusters are already tracked by open issues. One new P0 root cause identified: MCP Gateway v0.2.30 schema validation breaking codex-engine workflows that use the mempalace MCP server. A false-positive "engine failure" classification on a successful $1.37 claude run also warrants investigation.

Failure Clusters

| Cluster | Affected Runs | Engine | Existing Tracking |
|---|---|---|---|
| node: command not found (exit 127) | §24881782690, §24885324351 | copilot | #28224, #28233 |
| Model Not Supported | §24885748725 | copilot | #28235 |
| MCP Gateway schema validation failure | §24887335913 | codex | NEW #28269 |
| safeoutputs false-positive classification | §24888785593 | claude | #28263 (misclassified) |

Evidence

Cluster 1: node not found (copilot)

Daily News and Daily Issues Report Generator fail at agent execution with /bin/bash: line 1: node: command not found (exit 127). Chroot-mode agent setup uses command -v node inside the chroot but node is not available at that path inside the container.

Confirmed from agent-stdio.log for run §24881782690:

[entrypoint] Executing command: ... "$GH_AW_NODE_EXEC" ... copilot_driver.cjs ...
[entrypoint] Chroot mode: running command inside host filesystem (/host)
/bin/bash: line 1: node: command not found
[WARN] Command completed with exit code: 127
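
The chroot lookup failure above can be checked ahead of time from the host side. The following is a minimal sketch, not the actual gh-aw setup code: the function name `find_in_chroot` and the candidate directory list are hypothetical, and it assumes the chroot root (here `/host`) is visible as an ordinary directory.

```python
import os

def find_in_chroot(chroot_root, binary, path_dirs):
    """Return the first in-chroot path where `binary` is an executable file, else None.

    chroot_root: host-side directory that becomes "/" inside the chroot.
    path_dirs:   absolute directories as they would appear on PATH inside the chroot.
    """
    for d in path_dirs:
        # Map the in-chroot directory onto the host-side view of the chroot.
        host_side = os.path.join(chroot_root, d.lstrip("/"), binary)
        if os.path.isfile(host_side) and os.access(host_side, os.X_OK):
            return os.path.join(d, binary)
    return None
```

A call such as `find_in_chroot("/host", "node", ["/usr/local/bin", "/usr/bin"])` returning `None` would indicate the same condition that produced exit 127. Absolute symlinks inside the chroot are resolved against the host root by this check, so it can give misleading answers for symlinked binaries.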

Cluster 2: Model Not Supported (copilot)

Daily Community Attribution Updater fails immediately with 400 The requested model is not supported. Copilot driver exits after 2 seconds without retrying — this is a subscription-tier configuration issue, not a transient failure.

Cluster 3: MCP Gateway schema validation (codex) — NEW P0

Daily Fact About gh-aw (run §24887335913) uses codex engine with gpt-5.1-codex-mini v0.121.0 and MCP Gateway v0.2.30. Agent setup fails at the gh-aw.agent.setup span (status=ERROR) with 0 turns, 0 tokens, after 95 seconds.

Error from workflow-logs/4_agent.txt:

jsonschema: '/mcpServers/mempalace' does not validate with
mcp-gateway-config.schema.json#/.../oneOf/0/$ref/required:
missing properties: 'container'

Configuration validation error (MCP Gateway version: v0.2.30):
    Error: oneOf failed
        Error: not failed

The mempalace MCP server (Python package v3.2.0, chromadb-backed) is configured without the container property now required by the updated Gateway schema. The agent cannot start.
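
A pre-flight check along these lines could surface the missing property before the Gateway rejects the config. This is a rough sketch under stated assumptions: the config is assumed to be a dict with a top-level `mcpServers` mapping (as the error path `/mcpServers/mempalace` suggests), the example field values are hypothetical, and only the `container` requirement named in the error is checked. The schema's other `oneOf` branches are not modeled, so an entry flagged here may still be valid under a different branch.

```python
def servers_missing_container(config):
    """Return sorted names of mcpServers entries that have no 'container' key."""
    servers = config.get("mcpServers", {})
    return sorted(name for name, spec in servers.items() if "container" not in spec)

example = {
    "mcpServers": {
        "mempalace": {"command": "python"},           # hypothetical fields
        "other": {"container": "example/image:tag"},  # hypothetical value
    }
}
print(servers_missing_container(example))  # ['mempalace']
```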

Cluster 4: safeoutputs false-positive (claude)

GitHub MCP Structural Analysis (run §24888785593) ran for 36 turns, 18.4 min, cost $1.37. Agent output shows explicit success (terminal_reason: "completed", stop_reason: "end_turn"), discussion created, 4 charts uploaded. However:

  • SafeItemsCount = 0 in run_summary
  • All safeoutputs tool calls (upload_asset ×4, create_discussion ×1) show status: "unknown" in audit
  • Workflow classified as "Engine Failure: terminated unexpectedly" despite the agent stating success

This appears to be a safeoutputs MCP reliability issue or audit tracking gap. The agent's work was completed but safe outputs were not registered, triggering a false-positive failure.
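
One way to avoid this false positive is to cross-check the agent's own terminal state before declaring an engine failure. The sketch below is illustrative only: the function and the labels `needs_review`, `success`, and `engine_failure` are hypothetical, and the field values are the ones quoted above, not a documented classifier API.

```python
def classify_run(terminal_reason, stop_reason, safe_items_count, tool_statuses):
    """Classify a run, separating 'agent finished but outputs unconfirmed' from real failures."""
    agent_completed = terminal_reason == "completed" and stop_reason == "end_turn"
    outputs_unconfirmed = safe_items_count == 0 or any(
        status == "unknown" for status in tool_statuses
    )
    if agent_completed and outputs_unconfirmed:
        # The agent reported success but safe outputs were not registered:
        # flag for review instead of reporting an engine failure.
        return "needs_review"
    if agent_completed:
        return "success"
    return "engine_failure"
```

With the values from run §24888785593 (completed, end_turn, SafeItemsCount 0, five tool calls with status "unknown"), this sketch returns `needs_review` instead of an engine failure.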

Existing Issue Correlation

Proposed Fix Roadmap

P0 — Fix mempalace MCP server config to satisfy MCP Gateway v0.2.30 container schema requirement → see sub-issue below

P1 — Investigate safeoutputs status: "unknown" in claude runs (§24888785593); determine if safeoutputs MCP has reliability regression causing false-positive "engine failure" classification

P2 — Fix node: command not found in copilot chroot execution (Daily News, Daily Issues Report)

P2 — Update model configuration for Daily Community Attribution Updater to use supported subscription tier

Sub-Issues Created

  • See sub-issue #28269 — P0: mempalace MCP Gateway schema validation failure

References:

  • §24887335913 — Daily Fact About gh-aw (MCP Gateway schema failure, P0)
  • §24888785593 — GitHub MCP Structural Analysis (false-positive failure, $1.37 wasted)
  • §24881782690 — Daily News (node not found)


Updated Window: 13:05–19:05 UTC 2026-04-24

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Design Decision Gate: max-turns ($0.72) | §24899268907 | claude | NEW sub-issue → #aw_DDGmax |
| Issue Monster: Copilot GraphQL failure (×4) | §24900325460, §24901478151, §24902781262, §24903839716 | copilot | NEW sub-issue → #aw_IMdual |
| Issue Monster: bash markers in comment bodies | §24902781262, §24903839716 | copilot | covered by #aw_IMdual |

All other workflows: 0 failures. Issue Monster self-recovered by 18:23 UTC (§24905258841 succeeded as no-op).

Key Findings

  • Design Decision Gate hit error_max_turns (15 turns) because every Bash command was permission-denied (reads of /tmp/gh-aw/agent/*.json context files). $0.72 wasted per occurrence.
  • Issue Monster backend silently swallows assign_to_agent GraphQL errors — agent reports success while gh-aw-bot posts failure comments on each targeted issue.
  • Comment body corruption: Internal Claude Code bash marker strings (___BEGIN___COMMAND_OUTPUT_MARKER___) leaked into two Issue Monster add_comment bodies; security scanner flagged both runs.
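
The marker leak in the last finding could be caught by scanning comment bodies before they are posted. A minimal sketch follows. Only the `___BEGIN___COMMAND_OUTPUT_MARKER___` string appears in the report; the generalized pattern (any uppercase word in place of `BEGIN`) and the function name are assumptions.

```python
import re

# Matches ___BEGIN___COMMAND_OUTPUT_MARKER___ and, by assumption, sibling
# markers with a different uppercase word in the first slot.
MARKER_RE = re.compile(r"___[A-Z]+___COMMAND_OUTPUT_MARKER___")

def find_leaked_markers(body):
    """Return every internal marker string found in a comment body."""
    return MARKER_RE.findall(body)
```

A caller could refuse to post, or strip the matches, whenever the returned list is non-empty.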

No Previously-Tracked Issues to Close

The clusters from the prior report (node not found, model not supported, MCP Gateway schema) did not recur in this window; they remain unresolved.

Generated by [aw] Failure Investigator (6h) · ● 341K ·



Updated Window: 19:11–01:11 UTC 2026-04-24/25

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Smoke Gemini: API_KEY_INVALID | §24911755836 | gemini | NEW sub-issue #aw_SGkey → #28268 |
| Smoke Crush: EROFS read-only hostedtoolcache | §24911755864 | crush | NEW sub-issue #aw_SCerof → #28268 |
| Smoke OpenCode: no safe outputs | §24909063346 | opencode | #28330 (auto-triage only) |
| Smoke CI cancelled (5 errors) | §24914380736 | | cascade from above three |
| Go Logger Enhancement: 413 turns, 9 anomalies, terminated | §24912564019 | claude | #28357 (auto-triage only) |
| Step Name Alignment: safeoutputs MCP drop @ 149s | §24908320676 | claude | P1 carried from prior window |
| Audit Agent false positive ($2.37 wasted) | §24911879231 | claude | P1 carried from prior window |

Key Findings

  • Smoke Gemini returned 400 API_KEY_INVALID: the GEMINI_API_KEY secret is expired or revoked; zero tokens consumed, zero turns, agent cannot start. Gemini smoke coverage is entirely blocked.
  • Smoke Crush cannot install the CLI globally: EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/.../bin' — npm global install targets a read-only path on hosted runners.
  • Smoke CI cancelled with 5 errors — cascading from Gemini/Crush/OpenCode breakage; Smoke CI run §24914380736 aborted within 1.1 minutes.
  • Go Logger Enhancement ran 413 turns over 17 minutes with 9 anomalous events and a high severity anomaly signal before terminating (36 tool types, exploratory path). Root cause unclear — may be an unguarded loop or context overflow. Not max-turns (Design Decision Gate pattern) but warrants investigation.
  • safeoutputs MCP drop (Step Name Alignment): HTTP connection dropped after 149s uptime, with the same "The operation was aborted" error as Design Decision Gate in the prior window. P1 still unresolved.
  • Audit Agent false positive: Run §24911879231 completed with 61 turns, 1 safe output, $2.37 cost, terminal_reason: completed — yet classified as engine failure. Same false-positive detection gap as prior window P1.
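
For the Smoke Crush EROFS finding above, an installer could probe for a writable directory before attempting a global install. This is a rough sketch, not the workflow's actual install step: the function name and any candidate list passed to it are hypothetical.

```python
import os

def first_writable_dir(candidates):
    """Return the first existing directory the current user can write into, else None."""
    for path in candidates:
        # On a read-only filesystem the write-access check fails, so such
        # directories are skipped.
        if os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None
```

If the hostedtoolcache path is read-only, a helper like this would fall through to the next candidate (for example a directory under the user's home) rather than failing at `mkdir`.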

No Previously-Tracked Issues Closed

No prior root causes appear fixed in this window. Design Decision Gate ran successfully multiple times (e.g. §24917291168) but the underlying safeoutputs MCP stability issue is unresolved. Issue Monster self-recovered (§24918568357 succeeded as baseline).

Generated by [aw] Failure Investigator (6h) · 29 runs · 6h window · 7 failure signals




Updated Window: 01:10–07:10 UTC 2026-04-25

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub Remote MCP Auth Test: gpt-5.4-mini not accessible | §24922384597 | copilot | #28393 (auto-triage) |
| Smoke CI cancelled (timeout) | §24921318705 | | transient, no tracking |

Key Findings

  • GitHub Remote MCP Authentication Test failed with 400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint. All 4 attempts (1 initial + 3 retries) failed identically within ~4 seconds. Copilot driver exhausted retries and exited code 1. The model name gpt-5.4-mini is either invalid, renamed, or unavailable for this Copilot subscription tier. Fix: update the workflow engine config to a supported model (e.g., gpt-4o-mini). Tracked in auto-generated issue [aw] GitHub Remote MCP Authentication Test failed #28393.

  • Smoke CI (run §24921318705) was cancelled due to a job-level timeout firing the instant the last of 6 Docker image pulls completed. The agent never started (0 tokens, 0 turns). This is a transient timing event — 4 other Smoke CI runs in the same window completed successfully. No tracking created.
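
The GitHub Remote MCP Authentication Test finding above shows four identical 400 responses, which suggests the retries added no value. The sketch below illustrates a retry policy that skips non-transient client errors. It is not the Copilot driver's actual logic: the status set, the function name, and the default of 3 retries (taken from the report) are assumptions.

```python
# Statuses commonly treated as transient; a 400 such as "model not accessible"
# is a configuration error and is deliberately excluded.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def should_retry(status_code, retries_done, max_retries=3):
    """Return True only for transient statuses while retries remain."""
    if retries_done >= max_retries:
        return False
    return status_code in RETRYABLE_STATUSES
```

Under this policy the workflow would fail after the first attempt, which surfaces the configuration error sooner.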

Previously-Tracked Issues — Status in This Window

| Issue | Last Known State | New Evidence |
|---|---|---|
| #28345 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window; cannot confirm fixed |
| #28344 Smoke Crush EROFS | open | Not scheduled in this window; cannot confirm fixed |
| #28330 Smoke OpenCode no safe outputs | open | Not scheduled in this window; cannot confirm fixed |
| #28357 Go Logger Enhancement | open | Not scheduled in this window |
| #28356 Audit Agent false positive | open | Not scheduled in this window |

No Previously-Tracked Issues Closed

None of the prior root causes reappeared in this window to confirm resolution, and none have been confirmed fixed externally.

Generated by [aw] Failure Investigator (6h) · 39 runs · 6h window · 1 new failure cluster




Updated Window: 07:07–13:07 UTC 2026-04-25

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Smoke Gemini: untrusted directory (exit 55) | §24931278139 | gemini | NEW sub-issue → #28268 |
| Smoke Crush: EROFS read-only hostedtoolcache | §24931278150 | crush | Already tracked → #28382 |
| Workflow Health Manager: ERR_SYSTEM runtime import not found | §24930436676 | n/a (activation fail) | NEW sub-issue → #28268 |

Overall: 37 runs, 3 failures, 31 succeeded, 3 in-progress at query time. $7.29 total cost, 11.5M tokens.

Key Findings

  • Smoke Gemini now fails with a different root cause from the prior API_KEY_INVALID issue. Gemini CLI v1.x added a "trusted folders" security model: the workflow passes --yolo but the CLI overrides it to "default" when the workspace is untrusted, then exits with code 55 before executing any turns. Fix: set GEMINI_CLI_TRUST_WORKSPACE=true in the workflow env or add --skip-trust to the invocation.

  • Smoke Crush repeats the same EROFS error already tracked in [aw-failures] smoke-crush: EROFS on npm global install to read-only hostedtoolcache #28382 (npm global install into read-only /opt/hostedtoolcache). No fix has landed yet.

  • Workflow Health Manager - Meta-Orchestrator (scheduled) failed in 31 seconds during the activation job with ERR_SYSTEM: Runtime import file not found: .github/workflows/workflow-health-manager.md → <path>. The prior baseline run §24888666710 on 2026-04-24 succeeded with the same trigger — this is a recent regression from a missing or renamed import file. Audit-diff classification: stable (no behavioral change in agent itself, since the agent never started).

Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Confirmed recurring: same error, same run |
| #28345 Smoke Gemini API_KEY_INVALID | open | Root cause has changed; now "untrusted directory"; new sub-issue created |
| #28419 Daily Issues Report Generator (node not found) | open | No new run in this window |

Sub-Issues Created

References:

Generated by [aw] Failure Investigator (6h) · 37 runs · 6h window · 3 failure signals




Updated Window: 01:10–07:10 UTC 2026-04-26

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub Remote MCP Auth Test: gpt-5.4-mini not accessible | §24948237798 | copilot | #28540 (auto-triage) + new sub-issue (→ #28268) |
| Smoke Gemini: API_KEY_INVALID | §24945190974 | gemini | #28530 (auto-triage) |
| Smoke Crush: EROFS read-only hostedtoolcache | §24945190952 | crush | #28531 (auto-triage) + #28382 (detailed) |

Overall: 48 runs, 3 failures, plus 1 in-progress (the current run). All other workflows succeeded.

Key Findings

  • GitHub Remote MCP Authentication Test failed again with 400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint, the same error as run §24922384597 from 2026-04-25. The Copilot driver exhausted all 3 retries (4 total attempts at ~4s each; ~54s elapsed overall). No tokens consumed, no turns completed. This is the third consecutive run of this workflow failing with the same error. A sub-issue with a concrete fix proposal (update the model name to one that is accessible via this endpoint) was added to [aw-failures] [aw] Failure Investigator (6h) - Issue Group #28268.

  • Smoke Gemini returned 400 API_KEY_INVALID — reverted to the API key error seen in the 19:11-01:11 UTC 2026-04-24/25 window. This run was triggered by a pull_request event on branch copilot/add-support-object-form-otlp-headers. The Gemini CLI (v1.x, model auto-gemini-3) could not authenticate. 0 tokens, 0 turns. Note: the prior window's Gemini failure was an "untrusted directory" issue (exit 55); this is a credential failure, which may affect only PR-triggered runs (vs. the untrusted-dir issue on scheduled runs).

  • Smoke Crush hit the same EROFS install failure tracked in [aw-failures] smoke-crush: EROFS on npm global install to read-only hostedtoolcache #28382: Installation failed: EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/24.14.1/x64/lib/node_modules/@charmland/crush/bin'. No fix has landed. Same error, same path, new run.

Stale Issues Closed

Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Confirmed recurring: same error path |
| #28529 AWF binary download HTTP 502 | open | Smoke CI passing; transient CDN issue appears resolved |
| #28393 GitHub Remote MCP Auth Test model unavailable | open | Confirmed recurring: 3rd consecutive failure |

Sub-Issues Created/Updated

References:

Generated by [aw] Failure Investigator (6h) · 48 runs · 6h window · 3 failure clusters




Updated Window: 07:07–13:07 UTC 2026-04-26

Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub MCP Tools Report: protected-files blocking PR | §24956724357 | claude | #28599 + new sub-issue |
| Daily Go Function Namer: exit code 22 (HTTP error) | §24955120726 | claude | #28582 (auto-triage) |
| Constraint Solving: detection job failure | §24957055408 | copilot | #28601 (auto-triage) |

Overall: 41 runs, 3 hard failures, 32 succeeded, 6 cancelled (Smoke CI push-burst supersession). $13.55 total cost, 30.9M tokens.

Key Findings

  • GitHub MCP Remote Server Tools Report Generator ran to completion ($1.64, 24 turns, 10.9 min) but safe_outputs job failed because the patch touches .github/aw/github-mcp-server.md, a protected file. Error: Cannot create pull request: patch modifies protected files. Set protected-files: fallback-to-issue to create a review issue instead. Fix: add protected-files: fallback-to-issue to workflow frontmatter. Sub-issue created.

  • Daily Go Function Namer (Claude Code) failed at agent job with exit code 22 (CURLE_HTTP_RETURNED_ERROR). The agent started (plan event logged), made 14 tool calls, hit 2 errors, and exited in 1.9 min with no output. No firewall blocks, no rate-limit pressure. Root cause: an HTTP 4xx/5xx from a tool call (likely transient external API unavailability). Auto-tracked in [aw] Daily Go Function Namer failed #28582; insufficient signal for a new sub-issue.

  • Constraint Solving — Problem of the Day failed at the detection job (37s, 29-byte log = effectively empty). Agent itself ran successfully (discussion created, safeoutputs called). A cache_memory_miss was reported (first-run / post-expiry). Detection infrastructure failure appears independent of the agent completing successfully. Auto-tracked in [aw] Constraint Solving — Problem of the Day failed #28601.

  • Smoke CI cancellations (6 runs, 11:36–11:58 UTC): Burst of pushes to main caused pipeline supersession. All expected — not real failures. Subsequent Smoke CI runs succeeded.
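
The supersession reasoning in the last finding (a cancelled run is benign if a later run of the same workflow succeeded shortly afterwards) can be expressed as a small heuristic. This is a minimal sketch under stated assumptions: the function name, the 10-minute window, and the `(started_at, conclusion)` tuple shape are all hypothetical, not part of the investigator's actual logic.

```python
from datetime import datetime, timedelta

def is_superseded(cancelled_at, later_runs, window=timedelta(minutes=10)):
    """Return True if a successful run of the same workflow started within
    `window` after the cancellation.

    later_runs: iterable of (started_at, conclusion) tuples for the same workflow.
    """
    return any(
        conclusion == "success" and cancelled_at <= started_at <= cancelled_at + window
        for started_at, conclusion in later_runs
    )
```

A cancelled run for which this returns `False` would still deserve a closer look, since no later success explains the cancellation.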

Stale Issues Closed

Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Not scheduled in this window |
| #28540 GitHub Remote MCP Auth Test gpt-5.4-mini | open | Not scheduled in this window |
| #28530 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window |
| #28529 AWF binary download HTTP 502 | CLOSED | Smoke CI consistently passing |

Sub-Issues Created

References:

  • §24956724357 — GitHub MCP Tools Report (protected-files failure, $1.64 wasted per run)
  • §24955120726 — Daily Go Function Namer (exit code 22)
  • §24957055408 — Constraint Solving (detection job failure)

Generated by [aw] Failure Investigator (6h) · ● 553.8K ·



Updated Window: ~19:15 UTC 2026-04-26 – 01:15 UTC 2026-04-27

Failure Clusters

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Go Logger Enhancement: MCP timeout kills build | §24967310561 | claude | #28639 (auto-triage) + new sub-issue #aw_GoMCP1 → #28268 |
| Smoke CI cancelled | §24966690457 | copilot | push-burst supersession; transient, no tracking |

Overall: 45 runs, 1 hard failure, 1 cancelled (push-burst supersession), 43 succeeded/in-progress. $16.52 total cost, 40.3M tokens.

Key Findings

  • Go Logger Enhancement (§24967310561) failed at the agent job after 18.5 min. Root cause confirmed via agent-stdio.log: all three MCP servers (github, mcpscripts, safeoutputs) timed out at 21:25:55 UTC (~5 min into session) with Connection error: The operation timed out. When mcpscripts.make was called at 21:35:52 to verify the build, the transport was gone: MCP error -32003: context canceled / client is closing, then Unable to connect. The agent had already edited 11 files via native tools (Read/Grep/Edit) but the build verification step — the only MCP-dependent step — was lost. This is the second Go Logger failure with MCP connection issues; the prior run §24912564019 (2026-04-24/25 window) was also flagged as "9 anomalies, unguarded loop or context overflow." Both failures share a pattern of long agent turns (avg TBT: 9.2m) exceeding MCP connection idle timeouts. Auto-triage issue [aw] Go Logger Enhancement failed #28639 captures the symptom; sub-issue #aw_GoMCP1 captures the root cause and remediation.

  • Smoke CI (§24966690457) was cancelled at the activation job after 18s. Triggered by a push event at 20:48:03 UTC. Two subsequent Smoke CI runs (§24966702612 at 20:48:36, §24966772718 at 20:52:09) both succeeded. Classic push-burst supersession pattern — not a real failure.

  • 3 missing-tool events on successful runs: Agentic Workflow Audit Agent, GitHub API Consumption Report Agent, and Daily Regulatory Report Generator each hit a missing_tool event reporting that the agentic-workflows MCP server (status/logs tool) was not available in their runtime. All three completed successfully via safeoutputs.missing_tool. Not P1.
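
The Go Logger Enhancement finding above attributes the failure to long gaps between MCP calls exceeding a connection idle timeout. A log-analysis helper along these lines could flag such gaps. It is only a sketch: the function name is hypothetical, and the idle timeout value is not stated in the report, so any threshold passed in is an assumption.

```python
def gaps_exceeding_idle_timeout(call_times_s, idle_timeout_s):
    """Return (previous, next) timestamp pairs of consecutive MCP calls whose
    gap is longer than idle_timeout_s.

    call_times_s: ascending timestamps (seconds) of MCP tool calls in one session.
    """
    return [
        (prev, nxt)
        for prev, nxt in zip(call_times_s, call_times_s[1:])
        if nxt - prev > idle_timeout_s
    ]
```

Any pair returned marks a stretch in which the agent worked with native tools only. If the real idle timeout is shorter than that stretch, the next MCP call would be expected to hit a closed transport, as happened with mcpscripts.make.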

Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Not scheduled in this window |
| #28540 GitHub Remote MCP Auth Test gpt-5.4-mini | open | Not scheduled in this window |
| #28530 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window |

Sub-Issues Created

References:

  • §24967310561 — Go Logger Enhancement (MCP timeout, build verification failure)
  • §24966690457 — Smoke CI (push-burst cancellation, transient)

Generated by [aw] Failure Investigator (6h) · 45 runs · 6h window · 1 hard failure

