Executive Summary
52 workflow runs in the last 6 hours (approximate window: 07:00–13:14 UTC); 5 failures across 4 distinct clusters. Three clusters are already tracked by open issues. One new P0 root cause identified: MCP Gateway v0.2.30 schema validation breaks codex-engine workflows that use the mempalace MCP server. A false-positive "engine failure" classification on a successful $1.37 claude run also warrants investigation.
Failure Clusters
node: command not found (exit 127)
Evidence
Cluster 1: node not found (copilot)
Daily News and Daily Issues Report Generator fail at agent execution with /bin/bash: line 1: node: command not found (exit 127). Chroot-mode agent setup uses command -v node inside the chroot, but node is not available at that path inside the container.
Confirmed from agent-stdio.log for run §24881782690:
[entrypoint] Executing command: ... "$GH_AW_NODE_EXEC" ... copilot_driver.cjs ...
[entrypoint] Chroot mode: running command inside host filesystem (/host)
/bin/bash: line 1: node: command not found
[WARN] Command completed with exit code: 127
Cluster 2: Model Not Supported (copilot)
Daily Community Attribution Updater fails immediately with 400 The requested model is not supported. Copilot driver exits after 2 seconds without retrying; this is a subscription-tier configuration issue, not a transient failure.
Cluster 3: MCP Gateway schema validation (codex), NEW P0
Daily Fact About gh-aw (run §24887335913) uses codex engine with gpt-5.1-codex-mini v0.121.0 and MCP Gateway v0.2.30. Agent setup fails at the gh-aw.agent.setup span (status=ERROR) with 0 turns, 0 tokens, after 95 seconds.
Error from workflow-logs/4_agent.txt:
jsonschema: '/mcpServers/mempalace' does not validate with
mcp-gateway-config.schema.json#/.../oneOf/0/$ref/required:
missing properties: 'container'
Configuration validation error (MCP Gateway version: v0.2.30):
Error: oneOf failed
Error: not failed
The mempalace MCP server (Python package v3.2.0, chromadb-backed) is configured without the container property now required by the updated Gateway schema. The agent cannot start.
Cluster 4: safeoutputs false-positive (claude)
GitHub MCP Structural Analysis (run §24888785593) ran for 36 turns, 18.4 min, cost $1.37. Agent output shows explicit success (terminal_reason: "completed", stop_reason: "end_turn"), discussion created, 4 charts uploaded. However:
SafeItemsCount = 0 in run_summary
All safeoutputs tool calls (upload_asset ×4, create_discussion ×1) show status: "unknown" in audit
Workflow classified as "Engine Failure: terminated unexpectedly" despite the agent stating success
This appears to be a safeoutputs MCP reliability issue or an audit tracking gap. The agent's work was completed but safe outputs were not registered, triggering a false-positive failure.
Existing Issue Correlation
[aw] Daily Fact About gh-aw failed: matches Cluster 3, no root-cause tracking
[aw] GitHub MCP Structural Analysis failed: matches Cluster 4, misclassified as engine failure
Proposed Fix Roadmap
P0: Fix mempalace MCP server config to satisfy the MCP Gateway v0.2.30 container schema requirement (see sub-issue below)
P1: Investigate safeoutputs status: "unknown" in claude runs (§24888785593); determine whether the safeoutputs MCP has a reliability regression causing false-positive "engine failure" classification
P2: Fix node: command not found in copilot chroot execution (Daily News, Daily Issues Report)
P2: Update model configuration for Daily Community Attribution Updater to use a supported subscription tier
Sub-Issues Created
#28269: P0: mempalace MCP Gateway schema validation failure
Updated Window: 13:05–19:05 UTC 2026-04-24
Failure Clusters (new window)
All other workflows: 0 failures. Issue Monster self-recovered by 18:23 UTC (§24905258841 succeeded as no-op).
Key Findings
Design Decision Gate hit error_max_turns (15 turns) because every Bash command was permission-denied (reads of /tmp/gh-aw/agent/*.json context files). $0.72 wasted per occurrence.
Issue Monster backend silently swallows assign_to_agent GraphQL errors: the agent reports success while gh-aw-bot posts failure comments on each targeted issue.
Comment body corruption: internal Claude Code bash marker strings (___BEGIN___COMMAND_OUTPUT_MARKER___) leaked into two Issue Monster add_comment bodies; the security scanner flagged both runs.
No Previously-Tracked Issues to Close
The clusters from the prior report (node not found, model not supported, MCP Gateway schema) did not recur in this window and remain unresolved.
Updated Window: 19:11–01:11 UTC 2026-04-24/25
Failure Clusters (new window)
API_KEY_INVALID
Key Findings
Smoke Gemini returned 400 API_KEY_INVALID: the GEMINI_API_KEY secret is expired or revoked; zero tokens consumed, zero turns, agent cannot start. Gemini smoke coverage is entirely blocked.
Smoke Crush cannot install the CLI globally: EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/.../bin'. The npm global install targets a read-only path on hosted runners.
Smoke CI cancelled with 5 errors, cascading from Gemini/Crush/OpenCode breakage; Smoke CI run §24914380736 aborted within 1.1 minutes.
Go Logger Enhancement ran 413 turns over 17 minutes with 9 anomalous events and a high severity anomaly signal before terminating (36 tool types, exploratory path). Root cause unclear; it may be an unguarded loop or context overflow. This is not the max-turns pattern seen in Design Decision Gate, but it warrants investigation.
safeoutputs MCP drop (Step Name Alignment): the HTTP connection dropped after 149s uptime, with the same The operation was aborted error as Design Decision Gate in the prior window. P1 still unresolved.
Audit Agent false positive: run §24911879231 completed with 61 turns, 1 safe output, $2.37 cost, terminal_reason: completed, yet was classified as engine failure. Same false-positive detection gap as the prior window's P1.
No Previously-Tracked Issues Closed
No prior root causes appear fixed in this window. Design Decision Gate ran successfully multiple times (e.g. §24917291168) but the underlying safeoutputs MCP stability issue is unresolved. Issue Monster self-recovered (§24918568357 succeeded as baseline).
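The false-positive pattern reported in this and the prior window (a run that ends with terminal_reason: completed yet is classified as an engine failure) could be surfaced by a simple consistency check. The following is only a sketch: the dictionary layout and the field names classification and safe_outputs are hypothetical stand-ins for whatever the audit tooling actually records, and the second run entry is made up for contrast.

```python
from typing import Any


def is_suspect_engine_failure(run: dict[str, Any]) -> bool:
    """Flag runs labelled as failures although the agent reported a clean finish."""
    finished_cleanly = run.get("terminal_reason") == "completed"
    labelled_failure = run.get("classification") == "engine_failure"
    return finished_cleanly and labelled_failure


# First entry mirrors the figures reported for run 24911879231;
# the second entry is a hypothetical genuine failure.
runs = [
    {"id": "24911879231", "terminal_reason": "completed",
     "classification": "engine_failure", "safe_outputs": 1},
    {"id": "hypothetical-crash", "terminal_reason": "error",
     "classification": "engine_failure", "safe_outputs": 0},
]

suspects = [run["id"] for run in runs if is_suspect_engine_failure(run)]
print(suspects)
```

A check like this does not explain why safe outputs went unregistered; it only separates runs worth re-auditing from genuine engine failures.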
Updated Window: 01:10–07:10 UTC 2026-04-25
Failure Clusters (new window)
gpt-5.4-mini not accessible
Key Findings
GitHub Remote MCP Authentication Test failed with 400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint. All 4 attempts (1 initial + 3 retries) failed identically within ~4 seconds. Copilot driver exhausted retries and exited with code 1. The model name gpt-5.4-mini is either invalid, renamed, or unavailable for this Copilot subscription tier. Fix: update the workflow engine config to a supported model (e.g., gpt-4o-mini). Tracked in auto-generated issue [aw] GitHub Remote MCP Authentication Test failed #28393.
Smoke CI (run §24921318705) was cancelled due to a job-level timeout firing the instant the last of 6 Docker image pulls completed. The agent never started (0 tokens, 0 turns). This is a transient timing event; 4 other Smoke CI runs in the same window completed successfully. No tracking created.
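The gpt-5.4-mini failure above is a configuration error, so the three retries could not have changed the outcome. A driver could skip retries for such responses. The sketch below is an assumption about how that decision might look, not the Copilot driver's actual logic; should_retry is a hypothetical helper that treats only rate limiting (429) and server errors (5xx) as transient.

```python
def should_retry(status: int) -> bool:
    """Retry only responses that are plausibly transient."""
    # 429 signals rate limiting; 5xx signals a server-side fault.
    # Other 4xx responses (such as a 400 for an inaccessible model)
    # describe a problem with the request itself, so retrying is pointless.
    return status == 429 or 500 <= status < 600


for status in (400, 429, 503):
    print(status, "retry" if should_retry(status) else "fail fast")
```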
Previously-Tracked Issues — Status in This Window
No Previously-Tracked Issues Closed
None of the prior root causes reappeared in this window to confirm resolution, and none have been confirmed fixed externally.
Updated Window: 07:07–13:07 UTC 2026-04-25
Failure Clusters (new window)
Overall: 37 runs, 3 failures, 31 succeeded, 3 in-progress at query time. $7.29 total cost, 11.5M tokens.
Key Findings
Smoke Gemini now fails with a different root cause from the prior API_KEY_INVALID issue. Gemini CLI v1.x added a "trusted folders" security model: the workflow passes --yolo but the CLI overrides it to "default" when the workspace is untrusted, then exits with code 55 before executing any turns. Fix: set GEMINI_CLI_TRUST_WORKSPACE=true in the workflow env or add --skip-trust to the invocation.
Smoke Crush repeats the same EROFS error already tracked in [aw-failures] smoke-crush: EROFS on npm global install to read-only hostedtoolcache #28382 (npm global install into read-only /opt/hostedtoolcache). No fix has landed yet.
Workflow Health Manager - Meta-Orchestrator (scheduled) failed in 31 seconds during the activation job with ERR_SYSTEM: Runtime import file not found: .github/workflows/workflow-health-manager.md → <path>. The prior baseline run §24888666710 on 2026-04-24 succeeded with the same trigger, so this is a recent regression from a missing or renamed import file. Audit-diff classification: stable (no behavioral change in agent itself, since the agent never started).
Previously-Tracked Issues — Status in This Window
Sub-Issues Created
GEMINI_CLI_TRUST_WORKSPACE=true fix needed (linked to [aw-failures] [aw] Failure Investigator (6h) - Issue Group #28268)
Updated Window: 01:10–07:10 UTC 2026-04-26
Failure Clusters (new window)
gpt-5.4-mini not accessible
API_KEY_INVALID
Overall: 48 runs, 3 failures (+ 1 in-progress = current run). All other workflows succeeded.
Key Findings
GitHub Remote MCP Authentication Test failed again with 400 model "gpt-5.4-mini" is not accessible via the /chat/completions endpoint, an identical error to run §24922384597 from 2026-04-25. The Copilot driver exhausted all 3 retries (4 total attempts, each failing in ~4s; ~54s total elapsed). No tokens consumed, no turns completed. This is the third consecutive run of this workflow failing with the same error. A sub-issue with a concrete fix proposal (update model name to a valid endpoint-accessible model) was added to [aw-failures] [aw] Failure Investigator (6h) - Issue Group #28268.
Smoke Gemini returned 400 API_KEY_INVALID, reverting to the API key error seen in the 19:11–01:11 UTC 2026-04-24/25 window. This run was triggered by a pull_request event on branch copilot/add-support-object-form-otlp-headers. The Gemini CLI (v1.x, model auto-gemini-3) could not authenticate. 0 tokens, 0 turns. Note: the prior window's Gemini failure was an "untrusted directory" issue (exit 55); this is a credential failure, which may affect only PR-triggered runs (vs. the untrusted-dir issue on scheduled runs).
Smoke Crush hit the same EROFS install failure tracked in [aw-failures] smoke-crush: EROFS on npm global install to read-only hostedtoolcache #28382: Installation failed: EROFS: read-only file system, mkdir '/opt/hostedtoolcache/node/24.14.1/x64/lib/node_modules/@charmland/crush/bin'. No fix has landed. Same error, same path, new run.
Stale Issues Closed
Previously-Tracked Issues — Status in This Window
Sub-Issues Created/Updated
gpt-5.4-mini model unavailability fix: update workflow to use a valid /chat/completions-accessible model (e.g. gpt-4o-mini)
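For the Smoke Crush EROFS failure above, one possible workaround is to point npm's global prefix at a directory the runner can write to instead of the read-only hostedtoolcache. This is a sketch under that assumption, not the fix tracked in #28382. pick_npm_prefix and npm_env are hypothetical helpers, and the candidate paths are illustrative. npm does read its prefix setting from the NPM_CONFIG_PREFIX environment variable.

```python
import os
import tempfile
from typing import Optional


def pick_npm_prefix(candidates: list[str]) -> Optional[str]:
    """Return the first existing directory the current user can write to."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None


def npm_env(prefix: str) -> dict[str, str]:
    """Copy the current environment and point npm's global prefix at `prefix`."""
    env = dict(os.environ)
    env["NPM_CONFIG_PREFIX"] = prefix
    return env


# Illustrative candidates: a path that does not exist, then a fresh temp dir.
writable_dir = tempfile.mkdtemp()
chosen = pick_npm_prefix(["/nonexistent-toolcache-example", writable_dir])
env = npm_env(chosen) if chosen else dict(os.environ)
```

The resulting env could then be passed to whatever process runs the global install, leaving the toolcache untouched.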
Updated Window: 07:07–13:07 UTC 2026-04-26
Failure Clusters (new window)
Overall: 41 runs, 3 hard failures, 32 succeeded, 6 cancelled (Smoke CI push-burst supersession). $13.55 total cost, 30.9M tokens.
Key Findings
GitHub MCP Remote Server Tools Report Generator ran to completion ($1.64, 24 turns, 10.9 min) but the safe_outputs job failed because the patch touches .github/aw/github-mcp-server.md, a protected file. Error: Cannot create pull request: patch modifies protected files. Set protected-files: fallback-to-issue to create a review issue instead. Fix: add protected-files: fallback-to-issue to workflow frontmatter. Sub-issue created.
Daily Go Function Namer (Claude Code) failed at the agent job with exit code 22 (CURLE_HTTP_RETURNED_ERROR). The agent started (plan event logged), made 14 tool calls, hit 2 errors, and exited in 1.9 min with no output. No firewall blocks, no rate-limit pressure. Root cause: an HTTP 4xx/5xx from a tool call (likely transient external API unavailability). Auto-tracked in [aw] Daily Go Function Namer failed #28582; insufficient signal for a new sub-issue.
Constraint Solving — Problem of the Day failed at the detection job (37s, 29-byte log = effectively empty). Agent itself ran successfully (discussion created, safeoutputs called). A cache_memory_miss was reported (first-run / post-expiry). Detection infrastructure failure appears independent of the agent completing successfully. Auto-tracked in [aw] Constraint Solving — Problem of the Day failed #28601.
Smoke CI cancellations (6 runs, 11:36–11:58 UTC): a burst of pushes to main caused pipeline supersession. All expected, not real failures. Subsequent Smoke CI runs succeeded.
Stale Issues Closed
Previously-Tracked Issues — Status in This Window
gpt-5.4-mini
Sub-Issues Created
protected-files: fallback-to-issue to fix recurring safe_outputs job failure
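The protected-files behaviour described in the first finding of this window can be pictured as a small decision: if a patch touches a protected path, either fail or fall back to an issue. The sketch below is an illustration only, not the gh-aw implementation. The pattern list and the helpers touches_protected and choose_output are hypothetical; the only protected path confirmed by the report is .github/aw/github-mcp-server.md.

```python
from fnmatch import fnmatch

# Assumed pattern for illustration; the real protected set is not given in the report.
PROTECTED_PATTERNS = [".github/aw/*"]


def touches_protected(paths: list[str]) -> list[str]:
    """Return the patch paths that match any protected pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in PROTECTED_PATTERNS)]


def choose_output(paths: list[str], fallback_to_issue: bool) -> str:
    """Decide how a patch is delivered: pull request, review issue, or error."""
    if not touches_protected(paths):
        return "pull_request"
    return "issue" if fallback_to_issue else "error"


patch = [".github/aw/github-mcp-server.md"]
print(choose_output(patch, fallback_to_issue=False))
print(choose_output(patch, fallback_to_issue=True))
```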
Updated Window: ~19:15 UTC 2026-04-26 – 01:15 UTC 2026-04-27
Failure Clusters
#aw_GoMCP1 → #28268
Overall: 45 runs, 1 hard failure, 1 cancelled (push-burst supersession), 43 succeeded/in-progress. $16.52 total cost, 40.3M tokens.
Key Findings
Go Logger Enhancement (§24967310561) failed at the agent job after 18.5 min. Root cause confirmed via agent-stdio.log: all three MCP servers (github, mcpscripts, safeoutputs) timed out at 21:25:55 UTC (~5 min into session) with Connection error: The operation timed out. When mcpscripts.make was called at 21:35:52 to verify the build, the transport was gone: MCP error -32003: context canceled / client is closing, then Unable to connect. The agent had already edited 11 files via native tools (Read/Grep/Edit) but the build verification step, the only MCP-dependent step, was lost. This is the second Go Logger failure with MCP connection issues; the prior run §24912564019 (2026-04-24/25 window) was also flagged as "9 anomalies, unguarded loop or context overflow." Both failures share a pattern of long agent turns (avg TBT: 9.2m) exceeding MCP connection idle timeouts. Auto-triage issue [aw] Go Logger Enhancement failed #28639 captures the symptom; sub-issue #aw_GoMCP1 captures the root cause and remediation.
Smoke CI (§24966690457) was cancelled at the activation job after 18s. Triggered by a push event at 20:48:03 UTC. Two subsequent Smoke CI runs (§24966702612 at 20:48:36, §24966772718 at 20:52:09) both succeeded. Classic push-burst supersession pattern, not a real failure.
3 missing-tool events on successful runs: Agentic Workflow Audit Agent, GitHub API Consumption Report Agent, and Daily Regulatory Report Generator each hit a missing_tool event reporting that the agentic-workflows MCP server (status/logs tool) was not available in their runtime. All three completed successfully via safeoutputs.missing_tool. Not P1.
Previously-Tracked Issues — Status in This Window
gpt-5.4-mini
Sub-Issues Created
#aw_GoMCP1 → [aw-failures] [aw] Failure Investigator (6h) - Issue Group #28268: Go Logger Enhancement MCP connection timeout kills build; use Bash for build verification instead of mcpscripts.make to survive MCP idle timeouts
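The two timestamps in the Go Logger finding show how long the MCP transport had been dead before the build step needed it. A short worked calculation, using only the times quoted from agent-stdio.log (both on the same UTC day):

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps as reported in the finding above.
mcp_timed_out = datetime.strptime("21:25:55", FMT)
make_called = datetime.strptime("21:35:52", FMT)

gap_seconds = int((make_called - mcp_timed_out).total_seconds())
print(f"{gap_seconds // 60}m{gap_seconds % 60:02d}s")  # prints "9m57s"
```

The connection had been gone for nearly ten minutes when mcpscripts.make was invoked, which is consistent with the recommendation to verify the build through Bash rather than an MCP tool.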