[aw-failures] Failure Investigation Report — 6h window (2026-04-24)

### Executive Summary

52 workflow runs in the last 6 hours (approximate window: 07:00–13:14 UTC); **5 failures** across 4 distinct clusters. Three clusters are already tracked by open issues. One new P0 root cause identified: **MCP Gateway v0.2.30 schema validation breaking codex-engine workflows** that use the `mempalace` MCP server. A false-positive "engine failure" classification on a successful $1.37 claude run also warrants investigation.

### Failure Clusters

| Cluster | Affected Runs | Engine | Existing Tracking |
|---|---|---|---|
| `node: command not found` (exit 127) | [§24881782690](https://github.com/github/gh-aw/actions/runs/24881782690), [§24885324351](https://github.com/github/gh-aw/actions/runs/24885324351) | copilot | #28224, #28233 |
| Model Not Supported | [§24885748725](https://github.com/github/gh-aw/actions/runs/24885748725) | copilot | #28235 |
| **MCP Gateway schema validation failure** | [§24887335913](https://github.com/github/gh-aw/actions/runs/24887335913) | codex | **NEW** → #28269 |
| safeoutputs false-positive classification | [§24888785593](https://github.com/github/gh-aw/actions/runs/24888785593) | claude | #28263 (misclassified) |

### Evidence

<details>
<summary>Cluster 1: node not found (copilot)</summary>

Daily News and Daily Issues Report Generator fail at agent execution with `/bin/bash: line 1: node: command not found` (exit 127). In chroot mode the entrypoint runs the command inside the host filesystem (`/host`). Agent setup resolves the binary with `command -v node`, but `node` is not on the `PATH` inside the chroot, so the shell cannot find it.

Confirmed from `agent-stdio.log` for run §24881782690:
```
[entrypoint] Executing command: ... "$GH_AW_NODE_EXEC" ... copilot_driver.cjs ...
[entrypoint] Chroot mode: running command inside host filesystem (/host)
/bin/bash: line 1: node: command not found
[WARN] Command completed with exit code: 127
```

</details>

<details>
<summary>Cluster 2: Model Not Supported (copilot)</summary>

Daily Community Attribution Updater fails immediately with `400 The requested model is not supported`. The Copilot driver exits after 2 seconds without retrying; this is a subscription-tier configuration issue, not a transient failure.

</details>

<details>
<summary>Cluster 3: MCP Gateway schema validation (codex) — NEW P0</summary>

Daily Fact About gh-aw (run [§24887335913](https://github.com/github/gh-aw/actions/runs/24887335913)) uses the codex engine (v0.121.0) with model `gpt-5.1-codex-mini` and MCP Gateway v0.2.30. Agent setup fails at the `gh-aw.agent.setup` span (status=ERROR) after 95 seconds, with 0 turns and 0 tokens.

Error from `workflow-logs/4_agent.txt`:
```
jsonschema: '/mcpServers/mempalace' does not validate with
mcp-gateway-config.schema.json#/.../oneOf/0/$ref/required:
missing properties: 'container'

Configuration validation error (MCP Gateway version: v0.2.30):
    Error: oneOf failed
        Error: not failed
```

The `mempalace` MCP server (Python package v3.2.0, chromadb-backed) is configured without the `container` property now required by the updated Gateway schema. The agent cannot start.

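A pre-flight check before the agent job could surface this in seconds instead of after a 95-second setup. The sketch below is hypothetical: `missing_required` is not a gh-aw or Gateway API, the `mempalace` entry body is invented, and it models only the one requirement the error message reports (`container` for `oneOf/0`).

```python
import json

# Assumption: per the error above, schema branch oneOf/0 requires "container".
# The other oneOf branches of the real MCP Gateway schema are NOT modeled here.
REQUIRED_KEYS = ("container",)


def missing_required(config: dict) -> dict:
    """Map each mcpServers entry name to the required keys it lacks."""
    problems = {}
    for name, server in config.get("mcpServers", {}).items():
        missing = [key for key in REQUIRED_KEYS if key not in server]
        if missing:
            problems[name] = missing
    return problems


# Hypothetical entry shaped like the failing one: no "container" property.
example = json.loads('{"mcpServers": {"mempalace": {"command": "mempalace"}}}')

for server, keys in missing_required(example).items():
    print(f"/mcpServers/{server}: missing properties: {', '.join(keys)}")
```

A production check would validate against `mcp-gateway-config.schema.json` itself, so that every `oneOf` branch is honored rather than a single hard-coded key.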
</details>

<details>
<summary>Cluster 4: safeoutputs false-positive (claude)</summary>

GitHub MCP Structural Analysis (run [§24888785593](https://github.com/github/gh-aw/actions/runs/24888785593)) ran for 36 turns, 18.4 min, cost $1.37. Agent output shows explicit success (`terminal_reason: "completed"`, `stop_reason: "end_turn"`), discussion created, 4 charts uploaded. However:

- `SafeItemsCount = 0` in run_summary
- All safeoutputs tool calls (`upload_asset` ×4, `create_discussion` ×1) show `status: "unknown"` in audit
- Workflow classified as "Engine Failure: terminated unexpectedly" despite the agent stating success

This appears to be a safeoutputs MCP reliability issue or audit tracking gap. The agent's work was completed but safe outputs were not registered, triggering a false-positive failure.

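The misclassification suggests the failure detector weighs the missing safe outputs above the agent's own terminal state. A minimal sketch of a triage rule that separates the two cases, assuming the audit exposes the fields quoted above; `RunSummary`, `classify`, and the returned labels are hypothetical names, not gh-aw's actual classifier.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    terminal_reason: str
    stop_reason: str
    safe_items_count: int
    # Status strings reported by the audit for each safeoutputs tool call.
    safeoutput_statuses: list = field(default_factory=list)


def classify(run: RunSummary) -> str:
    """Hypothetical triage rule: trust the agent's own terminal state first."""
    agent_completed = (
        run.terminal_reason == "completed" and run.stop_reason == "end_turn"
    )
    if not agent_completed:
        return "engine_failure"
    if run.safe_items_count == 0 and "unknown" in run.safeoutput_statuses:
        # Agent finished, but its outputs were not registered: a tracking gap
        # in safeoutputs/audit, not an engine crash.
        return "safeoutputs_tracking_gap"
    return "success"


# Shaped like §24888785593: upload_asset x4 + create_discussion x1, all "unknown".
run = RunSummary("completed", "end_turn", 0, ["unknown"] * 5)
print(classify(run))  # safeoutputs_tracking_gap
```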
</details>

### Existing Issue Correlation

- #28245: `[aw] Daily Fact About gh-aw failed` — matches Cluster 3, no root-cause tracking
- #28263: `[aw] GitHub MCP Structural Analysis failed` — matches Cluster 4, misclassified as engine failure
- #28224, #28233: Cluster 1 (node not found) — open, no fix yet
- #28235: Cluster 2 (model not supported) — open, config fix needed

### Proposed Fix Roadmap

**P0** — Fix `mempalace` MCP server config to satisfy MCP Gateway v0.2.30 `container` schema requirement → see sub-issue below

**P1** — Investigate safeoutputs `status: "unknown"` in claude runs (§24888785593); determine if safeoutputs MCP has reliability regression causing false-positive "engine failure" classification

**P2** — Fix `node: command not found` in copilot chroot execution (Daily News, Daily Issues Report)

**P2** — Update model configuration for Daily Community Attribution Updater to use supported subscription tier

### Sub-Issues Created

- See sub-issue `#28269` — P0: `mempalace` MCP Gateway schema validation failure

**References:**
- [§24887335913](https://github.com/github/gh-aw/actions/runs/24887335913) — Daily Fact About gh-aw (MCP Gateway schema failure, P0)
- [§24888785593](https://github.com/github/gh-aw/actions/runs/24888785593) — GitHub MCP Structural Analysis (false-positive failure, $1.37 wasted)
- [§24881782690](https://github.com/github/gh-aw/actions/runs/24881782690) — Daily News (node not found)

---

---

### Updated Window: 13:05–19:05 UTC 2026-04-24

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Design Decision Gate: max-turns ($0.72) | [§24899268907](https://github.com/github/gh-aw/actions/runs/24899268907) | claude | NEW sub-issue → #aw_DDGmax |
| Issue Monster: Copilot GraphQL failure (×4) | [§24900325460](https://github.com/github/gh-aw/actions/runs/24900325460), [§24901478151](https://github.com/github/gh-aw/actions/runs/24901478151), [§24902781262](https://github.com/github/gh-aw/actions/runs/24902781262), [§24903839716](https://github.com/github/gh-aw/actions/runs/24903839716) | copilot | NEW sub-issue → #aw_IMdual |
| Issue Monster: bash markers in comment bodies | [§24902781262](https://github.com/github/gh-aw/actions/runs/24902781262), [§24903839716](https://github.com/github/gh-aw/actions/runs/24903839716) | copilot | covered by #aw_IMdual |

**All other workflows**: 0 failures. Issue Monster self-recovered by 18:23 UTC ([§24905258841](https://github.com/github/gh-aw/actions/runs/24905258841) succeeded as no-op).

#### Key Findings

- **Design Decision Gate** hit `error_max_turns` (15 turns) because every Bash command was permission-denied (reads of `/tmp/gh-aw/agent/*.json` context files). $0.72 wasted per occurrence.
- **Issue Monster** backend silently swallows `assign_to_agent` GraphQL errors — agent reports success while `gh-aw-bot` posts failure comments on each targeted issue.
- **Comment body corruption**: Internal Claude Code bash marker strings (`___BEGIN___COMMAND_OUTPUT_MARKER___`) leaked into two Issue Monster `add_comment` bodies; the security scanner flagged both runs.

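Both Issue Monster problems are the kind a thin output guard can catch. This is a sketch under stated assumptions: GraphQL APIs commonly report failures as an HTTP 200 response carrying a top-level `errors` array, so a handler that checks only the status code reports success; and the only marker string known from these runs is `___BEGIN___COMMAND_OUTPUT_MARKER___`. `raise_on_graphql_errors` and `scrub_comment_body` are hypothetical helpers, not part of the `assign_to_agent` or `add_comment` implementations, and the example payload is invented.

```python
class GraphQLError(RuntimeError):
    """Raised when a GraphQL response carries a top-level "errors" array."""


def raise_on_graphql_errors(payload: dict) -> dict:
    # A GraphQL failure can arrive with HTTP 200, so inspect the body itself.
    errors = payload.get("errors") or []
    if errors:
        raise GraphQLError("; ".join(e.get("message", "<no message>") for e in errors))
    return payload.get("data") or {}


# Assumption: only this marker is known from the logs; extend as needed.
COMMAND_MARKERS = ("___BEGIN___COMMAND_OUTPUT_MARKER___",)


def scrub_comment_body(body: str) -> str:
    """Drop any line containing an internal marker before the comment is posted."""
    return "\n".join(
        line
        for line in body.splitlines()
        if not any(marker in line for marker in COMMAND_MARKERS)
    )


# Usage: surface the failure instead of reporting success.
try:
    raise_on_graphql_errors({"data": None, "errors": [{"message": "assignment failed"}]})
except GraphQLError as exc:
    print(f"assign_to_agent failed: {exc}")
```

Dropping the whole line is a blunt choice; if a marker can share a line with real content, removing only the marker substring may be preferable.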
#### No Previously-Tracked Issues to Close

The clusters from the prior report (node not found, model not supported, MCP Gateway schema) did not recur in this window, but none has a confirmed fix, so all remain unresolved.

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24907049726/agentic_workflow) · ● 341K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: 19:11–01:11 UTC 2026-04-24/25

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Smoke Gemini: `API_KEY_INVALID` | [§24911755836](https://github.com/github/gh-aw/actions/runs/24911755836) | gemini | NEW sub-issue #aw_SGkey → #28268 |
| Smoke Crush: EROFS read-only hostedtoolcache | [§24911755864](https://github.com/github/gh-aw/actions/runs/24911755864) | crush | NEW sub-issue #aw_SCerof → #28268 |
| Smoke OpenCode: no safe outputs | [§24909063346](https://github.com/github/gh-aw/actions/runs/24909063346) | opencode | #28330 (auto-triage only) |
| Smoke CI cancelled (5 errors) | [§24914380736](https://github.com/github/gh-aw/actions/runs/24914380736) | — | cascade from above three |
| Go Logger Enhancement: 413 turns, 9 anomalies, terminated | [§24912564019](https://github.com/github/gh-aw/actions/runs/24912564019) | claude | #28357 (auto-triage only) |
| Step Name Alignment: safeoutputs MCP drop @ 149s | [§24908320676](https://github.com/github/gh-aw/actions/runs/24908320676) | claude | P1 carried from prior window |
| Audit Agent false positive ($2.37 wasted) | [§24911879231](https://github.com/github/gh-aw/actions/runs/24911879231) | claude | P1 carried from prior window |

#### Key Findings

- **Smoke Gemini** returned `400 API_KEY_INVALID` — `GEMINI_API_KEY` secret is expired or revoked; zero tokens consumed, zero turns, agent cannot start. Gemini smoke coverage is entirely blocked.
- **Smoke Crush** cannot install the CLI globally: `EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/.../bin'` — npm global install targets a read-only path on hosted runners.
- **Smoke CI cancelled with 5 errors** — cascading from Gemini/Crush/OpenCode breakage; Smoke CI run [§24914380736](https://github.com/github/gh-aw/actions/runs/24914380736) aborted within 1.1 minutes.
- **Go Logger Enhancement** ran 413 turns over 17 minutes, logging 9 anomalous events and a `high`-severity anomaly signal before terminating (36 tool types, exploratory path). The root cause is unclear, possibly an unguarded loop or context overflow. Unlike Design Decision Gate, it did not hit max-turns, but it warrants investigation.
- **safeoutputs MCP drop** (Step Name Alignment): HTTP connection dropped after 149s uptime — same `The operation was aborted` error as Design Decision Gate in prior window. P1 still unresolved.
- **Audit Agent false positive**: Run [§24911879231](https://github.com/github/gh-aw/actions/runs/24911879231) completed with 61 turns, 1 safe output, $2.37 cost, `terminal_reason: completed` — yet classified as engine failure. Same false-positive detection gap as prior window P1.

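The Crush failure comes from `npm install --global` writing under the toolcache Node directory, which is mounted read-only. One possible mitigation, sketched here under the assumption that the install step can be pointed at another directory, is to choose a writable global prefix: npm's `--prefix` option sets where global packages are installed, with executables under `<prefix>/bin` on Linux. `pick_npm_prefix` and `npm_global_install_cmd` are hypothetical helpers, and the candidate paths are illustrative.

```python
import os
import tempfile


def pick_npm_prefix(candidates: list) -> str:
    """Return the first candidate that is an existing, writable directory."""
    for path in candidates:
        # os.access reports False for W_OK on a read-only mount (EROFS).
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable npm prefix among candidates")


def npm_global_install_cmd(package: str, prefix: str) -> list:
    # Global installs land under <prefix> instead of the toolcache; the agent
    # then needs <prefix>/bin on PATH to find the CLI.
    return ["npm", "install", "--global", "--prefix", prefix, package]


if __name__ == "__main__":
    writable = tempfile.mkdtemp(prefix="npm-global-")  # stand-in for a real dir
    prefix = pick_npm_prefix(["/opt/hostedtoolcache/node", writable])
    print(npm_global_install_cmd("@charmland/crush", prefix))
```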
#### No Previously-Tracked Issues Closed

No prior root causes appear fixed in this window. Design Decision Gate ran successfully multiple times (e.g. [§24917291168](https://github.com/github/gh-aw/actions/runs/24917291168)) but the underlying safeoutputs MCP stability issue is unresolved. Issue Monster self-recovered ([§24918568357](https://github.com/github/gh-aw/actions/runs/24918568357) succeeded as baseline).

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24918925182/agentic_workflow) · 29 runs · 6h window · 7 failure signals · ● 274.8K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: 01:10–07:10 UTC 2026-04-25

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub Remote MCP Auth Test: `gpt-5.4-mini` not accessible | [§24922384597](https://github.com/github/gh-aw/actions/runs/24922384597) | copilot | #28393 (auto-triage) |
| Smoke CI cancelled (timeout) | [§24921318705](https://github.com/github/gh-aw/actions/runs/24921318705) | — | transient — no tracking |

#### Key Findings

- **GitHub Remote MCP Authentication Test** failed with `400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint`. All 4 attempts (1 initial + 3 retries) failed identically, each within ~4 seconds, and the Copilot driver exited with code 1 after exhausting its retries. The model name `gpt-5.4-mini` is either invalid, renamed, or unavailable on this Copilot subscription tier. Fix: update the workflow engine config to a supported model (e.g., `gpt-4o-mini`). Tracked in auto-generated issue #28393.

- **Smoke CI** (run [§24921318705](https://github.com/github/gh-aw/actions/runs/24921318705)) was cancelled due to a job-level timeout firing the instant the last of 6 Docker image pulls completed. The agent never started (0 tokens, 0 turns). This is a transient timing event — 4 other Smoke CI runs in the same window completed successfully. No tracking created.

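The model errors in this report are `400` responses that no retry can fix, yet in the GitHub Remote MCP Authentication Test run the driver spent three retries on one. A sketch of a status-based retry policy, assuming (as is common for HTTP APIs) that `429` and `5xx` are the transient cases; `is_retryable` and `attempts_made` are hypothetical and do not describe the Copilot driver's actual logic.

```python
def is_retryable(status: int) -> bool:
    """Treat throttling (429) and server errors (5xx) as transient."""
    return status == 429 or 500 <= status <= 599


def attempts_made(status: int, max_retries: int) -> int:
    """Attempts a driver would make if it retried only transient errors."""
    return 1 + max_retries if is_retryable(status) else 1


# The run above: a 400 with 3 retries configured.
print(attempts_made(400, max_retries=3))  # 1 (the driver actually made 4)
print(attempts_made(503, max_retries=3))  # 4
```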
#### Previously-Tracked Issues — Status in This Window

| Issue | Last Known State | New Evidence |
|---|---|---|
| #28345 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window — cannot confirm fixed |
| #28344 Smoke Crush EROFS | open | Not scheduled in this window — cannot confirm fixed |
| #28330 Smoke OpenCode no safe outputs | open | Not scheduled in this window — cannot confirm fixed |
| #28357 Go Logger Enhancement | open | Not scheduled in this window |
| #28356 Audit Agent false positive | open | Not scheduled in this window |

#### No Previously-Tracked Issues Closed

None of the prior root causes reappeared in this window to confirm resolution, and none have been confirmed fixed externally.

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24925380396/agentic_workflow) · 39 runs · 6h window · 1 new failure cluster · ● 601.3K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: 07:07–13:07 UTC 2026-04-25

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Smoke Gemini: untrusted directory (exit 55) | [§24931278139](https://github.com/github/gh-aw/actions/runs/24931278139) | gemini | NEW sub-issue → #28268 |
| Smoke Crush: EROFS read-only hostedtoolcache | [§24931278150](https://github.com/github/gh-aw/actions/runs/24931278150) | crush | Already tracked → #28382 |
| Workflow Health Manager: ERR_SYSTEM runtime import not found | [§24930436676](https://github.com/github/gh-aw/actions/runs/24930436676) | — (activation fail) | NEW sub-issue → #28268 |

**Overall: 37 runs, 3 failures, 31 succeeded, 3 in-progress at query time. $7.29 total cost, 11.5M tokens.**

#### Key Findings

- **Smoke Gemini** now fails with a **different root cause** from the prior `API_KEY_INVALID` issue. Gemini CLI v1.x added a "trusted folders" security model: the workflow passes `--yolo`, but when the workspace is untrusted the CLI overrides the approval mode to `"default"` and exits with code 55 before executing any turns. Fix: set `GEMINI_CLI_TRUST_WORKSPACE=true` in the workflow env or add `--skip-trust` to the invocation.

- **Smoke Crush** repeats the same EROFS error already tracked in #28382 (npm global install into read-only `/opt/hostedtoolcache`). No fix has landed yet.

- **Workflow Health Manager - Meta-Orchestrator** (scheduled) failed in 31 seconds during the `activation` job with `ERR_SYSTEM: Runtime import file not found: .github/workflows/workflow-health-manager.md → <path>`. The prior baseline run [§24888666710](https://github.com/github/gh-aw/actions/runs/24888666710) on 2026-04-24 succeeded with the same trigger, which points to a recent regression from a missing or renamed import file. Audit-diff classification: **stable** (no behavioral change in the agent itself, since the agent never started).

#### Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Confirmed recurring: same error, new run |
| #28345 Smoke Gemini API_KEY_INVALID | open | Root cause has changed; now "untrusted directory" — new sub-issue created |
| #28419 Daily Issues Report Generator (node not found) | open | No new run in this window |

#### Sub-Issues Created

- Smoke Gemini: untrusted directory — `GEMINI_CLI_TRUST_WORKSPACE=true` fix needed (linked to #28268)
- Workflow Health Manager: runtime import file not found — activation regression (linked to #28268)

**References:**
- [§24931278139](https://github.com/github/gh-aw/actions/runs/24931278139) — Smoke Gemini (untrusted directory, exit 55)
- [§24931278150](https://github.com/github/gh-aw/actions/runs/24931278150) — Smoke Crush (EROFS repeat)
- [§24930436676](https://github.com/github/gh-aw/actions/runs/24930436676) — Workflow Health Manager (activation runtime import failure)

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24931607739/agentic_workflow) · 37 runs · 6h window · 3 failure signals · ● 416.5K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: 01:10–07:10 UTC 2026-04-26

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub Remote MCP Auth Test: `gpt-5.4-mini` not accessible | [§24948237798](https://github.com/github/gh-aw/actions/runs/24948237798) | copilot | #28540 (auto-triage) + new sub-issue (→ #28268) |
| Smoke Gemini: `API_KEY_INVALID` | [§24945190974](https://github.com/github/gh-aw/actions/runs/24945190974) | gemini | #28530 (auto-triage) |
| Smoke Crush: EROFS read-only hostedtoolcache | [§24945190952](https://github.com/github/gh-aw/actions/runs/24945190952) | crush | #28531 (auto-triage) + #28382 (detailed) |

**Overall: 48 runs, 3 failures (+ 1 in-progress = current run). All other workflows succeeded.**

#### Key Findings

- **GitHub Remote MCP Authentication Test** failed again with `400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint`, the same error as run [§24922384597](https://github.com/github/gh-aw/actions/runs/24922384597) from 2026-04-25. The Copilot driver exhausted all 3 retries (4 attempts in ~54s total; at ~4s per attempt, most of that time was presumably spent waiting between attempts). No tokens were consumed and no turns completed. This is the third consecutive run of this workflow to fail with the same error. A sub-issue with a concrete fix proposal (update the model name to a valid endpoint-accessible model) was added to #28268.

- **Smoke Gemini** returned `400 API_KEY_INVALID` — reverted to the API key error seen in the 19:11-01:11 UTC 2026-04-24/25 window. This run was triggered by a `pull_request` event on branch `copilot/add-support-object-form-otlp-headers`. The Gemini CLI (v1.x, model `auto-gemini-3`) could not authenticate. 0 tokens, 0 turns. Note: the prior window's Gemini failure was an "untrusted directory" issue (exit 55); this is a credential failure, which may affect only PR-triggered runs (vs. the untrusted-dir issue on scheduled runs).

- **Smoke Crush** hit the same EROFS install failure tracked in #28382: `Installation failed: EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/24.14.1/x64/lib/node_modules/@charmland/crush/bin'`. No fix has landed. Same error, same path, new run.

#### Stale Issues Closed

- **#28521** (Smoke CI failed): Closed. Smoke CI passed in 3 consecutive runs in this window ([§24948584769](https://github.com/github/gh-aw/actions/runs/24948584769), [§24948501105](https://github.com/github/gh-aw/actions/runs/24948501105), [§24945826070](https://github.com/github/gh-aw/actions/runs/24945826070)). The failure was caused by a transient CDN 502 on the AWF binary download (tracked in #28529), which resolved on its own.

#### Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Confirmed recurring — same error path |
| #28529 AWF binary download HTTP 502 | open | Smoke CI passing — transient CDN issue appears resolved |
| #28393 GitHub Remote MCP Auth Test model unavailable | open | Confirmed recurring — 3rd consecutive failure |

#### Sub-Issues Created/Updated

- New sub-issue added to #28268: `gpt-5.4-mini` model unavailability fix — update workflow to use a valid `/chat/completions`-accessible model (e.g. `gpt-4o-mini`)

**References:**
- [§24948237798](https://github.com/github/gh-aw/actions/runs/24948237798) — GitHub Remote MCP Auth Test (gpt-5.4-mini, 3rd occurrence)
- [§24945190974](https://github.com/github/gh-aw/actions/runs/24945190974) — Smoke Gemini (API_KEY_INVALID)
- [§24945190952](https://github.com/github/gh-aw/actions/runs/24945190952) — Smoke Crush (EROFS repeat)

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24950907950/agentic_workflow) · 48 runs · 6h window · 3 failure clusters · ● 355.9K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: 07:07–13:07 UTC 2026-04-26

#### Failure Clusters (new window)

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| GitHub MCP Tools Report: protected-files blocking PR | [§24956724357](https://github.com/github/gh-aw/actions/runs/24956724357) | claude | #28599 + new sub-issue |
| Daily Go Function Namer: exit code 22 (HTTP error) | [§24955120726](https://github.com/github/gh-aw/actions/runs/24955120726) | claude | #28582 (auto-triage) |
| Constraint Solving: detection job failure | [§24957055408](https://github.com/github/gh-aw/actions/runs/24957055408) | copilot | #28601 (auto-triage) |

**Overall: 41 runs, 3 hard failures, 32 succeeded, 6 cancelled (Smoke CI push-burst supersession). $13.55 total cost, 30.9M tokens.**

#### Key Findings

- **GitHub MCP Remote Server Tools Report Generator** ran to completion ($1.64, 24 turns, 10.9 min) but `safe_outputs` job failed because the patch touches `.github/aw/github-mcp-server.md`, a protected file. Error: `Cannot create pull request: patch modifies protected files. Set protected-files: fallback-to-issue to create a review issue instead.` Fix: add `protected-files: fallback-to-issue` to workflow frontmatter. Sub-issue created.

- **Daily Go Function Namer** (Claude Code) failed at the `agent` job with exit code 22 (`CURLE_HTTP_RETURNED_ERROR`). The agent started (plan event logged), made 14 tool calls, hit 2 errors, and exited after 1.9 min with no output. There were no firewall blocks and no rate-limit pressure. Likely root cause: an HTTP 4xx/5xx response to a tool call, possibly transient external API unavailability. Auto-tracked in #28582; insufficient signal for a new sub-issue.

- **Constraint Solving — Problem of the Day** failed at the `detection` job (37s, 29-byte log = effectively empty). Agent itself ran successfully (discussion created, safeoutputs called). A `cache_memory_miss` was reported (first-run / post-expiry). Detection infrastructure failure appears independent of the agent completing successfully. Auto-tracked in #28601.

- **Smoke CI cancellations** (6 runs, 11:36–11:58 UTC): Burst of pushes to main caused pipeline supersession. All expected — not real failures. Subsequent Smoke CI runs succeeded.

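The error message describes two behaviours for a patch that touches a protected file: fail the job, or fall back to a review issue when `protected-files: fallback-to-issue` is set. A minimal model of that decision; the glob `.github/aw/*` is only an assumption chosen to cover the file in this run (the real protected list is not reproduced here), and `choose_output` is a hypothetical name.

```python
from fnmatch import fnmatch

# Assumption: illustrative pattern only; not gh-aw's actual protected-file list.
PROTECTED_PATTERNS = (".github/aw/*",)


def touched_protected(paths: list) -> list:
    """Return the changed paths that match any protected pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in PROTECTED_PATTERNS)]


def choose_output(paths: list, protected_files_policy: str) -> str:
    """Decide between a pull request, a fallback issue, or a hard failure."""
    hits = touched_protected(paths)
    if not hits:
        return "pull_request"
    if protected_files_policy == "fallback-to-issue":
        return "issue"
    raise ValueError(f"patch modifies protected files: {hits}")


print(choose_output([".github/aw/github-mcp-server.md"], "fallback-to-issue"))  # issue
```

With the policy unset, the same patch raises, which corresponds to the `safe_outputs` failure seen in §24956724357.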
#### Stale Issues Closed

- **#28529** (AWF binary download HTTP 502): Smoke CI passed in 5+ consecutive runs across the last two windows. Transient CDN outage resolved — closing.

#### Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Not scheduled in this window |
| #28540 GitHub Remote MCP Auth Test `gpt-5.4-mini` | open | Not scheduled in this window |
| #28530 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window |
| #28529 AWF binary download HTTP 502 | **CLOSED** | Smoke CI consistently passing |

#### Sub-Issues Created

- New sub-issue linked to #28268: GitHub MCP Tools Report Generator — add `protected-files: fallback-to-issue` to fix recurring `safe_outputs` job failure

**References:**
- [§24956724357](https://github.com/github/gh-aw/actions/runs/24956724357) — GitHub MCP Tools Report (protected-files failure, $1.64 wasted per run)
- [§24955120726](https://github.com/github/gh-aw/actions/runs/24955120726) — Daily Go Function Namer (exit code 22)
- [§24957055408](https://github.com/github/gh-aw/actions/runs/24957055408) — Constraint Solving (detection job failure)

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24957383887/agentic_workflow) · ● 553.8K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

---

### Updated Window: ~19:15 UTC 2026-04-26 – 01:15 UTC 2026-04-27

#### Failure Clusters

| Cluster | Runs | Engine | Tracking |
|---|---|---|---|
| Go Logger Enhancement: MCP timeout kills build | [§24967310561](https://github.com/github/gh-aw/actions/runs/24967310561) | claude | #28639 (auto-triage) + new sub-issue `#aw_GoMCP1` → #28268 |
| Smoke CI cancelled | [§24966690457](https://github.com/github/gh-aw/actions/runs/24966690457) | copilot | push-burst supersession — transient, no tracking |

**Overall: 45 runs, 1 hard failure, 1 cancelled (push-burst supersession), 43 succeeded/in-progress. $16.52 total cost, 40.3M tokens.**

#### Key Findings

- **Go Logger Enhancement** ([§24967310561](https://github.com/github/gh-aw/actions/runs/24967310561)) failed at the `agent` job after 18.5 min. Root cause confirmed via `agent-stdio.log`: all three MCP servers (github, mcpscripts, safeoutputs) timed out at 21:25:55 UTC (~5 min into the session) with `Connection error: The operation timed out.` When `mcpscripts.make` was called at 21:35:52 to verify the build, the transport was already gone: `MCP error -32003: context canceled / client is closing`, followed by `Unable to connect.` The agent had already edited 11 files with native tools (Read/Grep/Edit), but the build verification step, the only MCP-dependent step, was lost. This is the **second Go Logger failure** in this report; the prior run [§24912564019](https://github.com/github/gh-aw/actions/runs/24912564019) (2026-04-24/25 window) was flagged for 9 anomalies with a suspected unguarded loop or context overflow, and may share the same MCP connection root cause. Both runs show long agent turns (avg TBT: 9.2m) that can exceed MCP connection idle timeouts. Auto-triage issue #28639 captures the symptom; sub-issue `#aw_GoMCP1` captures the root cause and remediation.

- **Smoke CI** ([§24966690457](https://github.com/github/gh-aw/actions/runs/24966690457)) was cancelled at the activation job after 18s. Triggered by a `push` event at 20:48:03 UTC. Two subsequent Smoke CI runs ([§24966702612](https://github.com/github/gh-aw/actions/runs/24966702612) at 20:48:36, [§24966772718](https://github.com/github/gh-aw/actions/runs/24966772718) at 20:52:09) both succeeded. Classic push-burst supersession pattern — not a real failure.

- **3 missing-tool events** on successful runs: Agentic Workflow Audit Agent, GitHub API Consumption Report Agent, and Daily Regulatory Report Generator each hit a `missing_tool` event reporting that the `agentic-workflows` MCP server (status/logs tool) was not available in their runtime. All three still completed successfully and reported the gap via `safeoutputs.missing_tool`. Not P1.

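The Go Logger timeline can be checked with plain date arithmetic: the `mcpscripts.make` call came almost ten minutes after the transports timed out, a gap longer than the run's average time between turns. Only the two timestamps and the 9.2-minute average come from the logs; the comparison is illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Timestamps quoted from agent-stdio.log for run §24967310561 (UTC, same day).
mcp_timed_out = datetime.strptime("21:25:55", FMT)
make_called = datetime.strptime("21:35:52", FMT)

# How long after all three MCP transports timed out the build step was attempted.
gap = make_called - mcp_timed_out
print(gap)  # 0:09:57

# Reported average time between turns: 9.2 min = 552 s.
avg_tbt = timedelta(seconds=552)
print(gap > avg_tbt)  # True
```

With turn gaps of that size, any MCP-dependent step late in a session may find the transport closed, which is why the sub-issue proposes running build verification through `Bash`.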
#### Previously-Tracked Issues — Status in This Window

| Issue | State | New Evidence |
|---|---|---|
| #28382 Smoke Crush EROFS | open | Not scheduled in this window |
| #28540 GitHub Remote MCP Auth Test `gpt-5.4-mini` | open | Not scheduled in this window |
| #28530 Smoke Gemini API_KEY_INVALID | open | Not scheduled in this window |

#### Sub-Issues Created

- New sub-issue `#aw_GoMCP1` → #28268: Go Logger Enhancement MCP connection timeout kills build — use `Bash` for build verification instead of `mcpscripts.make` to survive MCP idle timeouts

**References:**
- [§24967310561](https://github.com/github/gh-aw/actions/runs/24967310561) — Go Logger Enhancement (MCP timeout, build verification failure)
- [§24966690457](https://github.com/github/gh-aw/actions/runs/24966690457) — Smoke CI (push-burst cancellation, transient)

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24971997349/agentic_workflow) · 45 runs · 6h window · 1 hard failure · ● 399K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)

[aw-failures] Failure Investigation Report — 6h window (2026-04-24) #28267
