
[aw-failures] Failure Investigation Report — 6h window (2026-04-27) #28673

Executive Summary

Four workflow failures were detected in the 6-hour window ending 2026-04-27 ~08:00 UTC. Two failures require actionable fixes (P0/P1); two are transient or partially successful. One sub-issue was created for the blocking P0 configuration error.


Failure Clusters

| Workflow | Run | Engine | Root Cause | Priority | Tracking |
|---|---|---|---|---|---|
| GitHub Remote MCP Authentication Test | §24976660123 | Copilot | 400 The requested model is not supported: the workflow requests a model not available on the subscription tier | P1 | #28660 |
| Documentation Unbloat | §24975734231 | Claude Code | Execute Claude Code CLI timed out after 30 minutes; container post-processing (threat detection + artifact upload) took 19 min after claude-code exited, and orphan process awf-cmd-1.sh was killed at the deadline | P2 | #28659 |
| Daily CLI Tools Exploratory Tester | §24978441315 | Copilot | API rate limit exceeded for installation; safe_outputs create_issue failed after 4 attempts, and the agent's valid bug report (compile --workflow_name naming inconsistency) was lost | P1 | (none) |
| Schema Feature Coverage Checker | §24981796377 | Codex | All 10 create_pull_request calls blocked: patch touches .github/workflows/schema-demo-*.md, which are protected files by default | P0 | #28671, #28674 |

Evidence

GitHub Remote MCP Auth Test — model not supported

From agent-stdio.log of run §24976660123:

400 The requested model is not supported.

Copilot-driver message:

model not supported — not retrying (the requested model is unavailable for this subscription tier; 
specify a supported model in the workflow frontmatter)

Duration: 1.8 minutes. Agent never started. 1 error, 0 turns.

Documentation Unbloat — 30-minute step timeout

From 5_agent.txt (agent job log) for run §24975734231:

##[error]The action 'Execute Claude Code CLI' has timed out after 30 minutes.

Timeline:

  • Container started: ~03:56 UTC
  • Claude Code agent finished (exit 0): 04:07:31 UTC (11 min)
  • Container still running (post-processing): 04:07–04:26 UTC (19 more minutes)
  • Step timeout hit: 04:26:32 UTC (30 min total)
  • Orphan processes killed: awf-cmd-1.sh, bash

PR was successfully created: branch docs/unbloat-daily-ops-2a3a65767cbfabee, PR #28658. The agent's actual work completed; the failure is an instrumentation-level false positive.
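The timeline arithmetic above can be checked with a short sketch. It assumes the 30-minute step timer started when the container started, so the start time is derived from the timeout rather than taken from the approximate "~03:56" value; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above (UTC, 2026-04-27).
agent_exit = datetime(2026, 4, 27, 4, 7, 31)
step_timeout = datetime(2026, 4, 27, 4, 26, 32)
step_limit = timedelta(minutes=30)

# Assumption: the step timer began when the container started.
implied_start = step_timeout - step_limit
agent_runtime = agent_exit - implied_start
post_processing = step_timeout - agent_exit


def minutes(delta: timedelta) -> float:
    """Convert a timedelta to fractional minutes."""
    return delta.total_seconds() / 60


print(f"implied start:   {implied_start:%H:%M:%S} UTC")
print(f"agent runtime:   {minutes(agent_runtime):.1f} min")
print(f"post-processing: {minutes(post_processing):.1f} min")
```

Under that assumption, roughly 11 of the 30 minutes went to the agent itself and about 19 to post-processing, which is consistent with the timeline.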

Daily CLI Tools Exploratory Tester — API rate limit in safe_outputs

From 1_safe_outputs.txt for run §24978441315:

##[warning]create_issue in github/gh-aw failed (attempt 1/4): API rate limit exceeded for installation.
##[warning]create_issue in github/gh-aw failed (attempt 2/4): API rate limit exceeded for installation.
##[warning]create_issue in github/gh-aw failed (attempt 3/4): API rate limit exceeded for installation.
##[error]✗ Failed to create issue "[cli-tools-test] compile tool: using `--workflow_name`..."

All 4 attempts, spaced at 36–53 s retry intervals, were exhausted. The agent successfully identified a real bug (the compile tool uses --workflows while logs uses --workflow_name, causing a cryptic MCP schema error), but the report was never filed.

Agent itself concluded "success" (copilot-driver exit 0); the safe_outputs job failed, which caused the overall run conclusion to be "failure".
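One way to avoid losing a valid report when every attempt is rate-limited is to persist the payload once retries are exhausted. The following is a minimal sketch of that idea, not the safe_outputs implementation: `RateLimitError`, `create_with_fallback`, the backoff values, and the `unfiled-issue.json` file name are all hypothetical.

```python
import json
import random
import tempfile
import time
from pathlib import Path
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def create_with_fallback(
    create: Callable[[dict], str],
    payload: dict,
    fallback_dir: Path,
    attempts: int = 4,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call create(payload); if every attempt is rate-limited, save the payload to disk."""
    for attempt in range(1, attempts + 1):
        try:
            return create(payload)
        except RateLimitError:
            if attempt < attempts:
                # Exponential backoff with a little jitter between attempts.
                delay = base_delay * 2 ** (attempt - 1)
                sleep(delay + random.uniform(0, delay / 10))
    # All attempts exhausted: keep the report instead of dropping it.
    fallback_dir.mkdir(parents=True, exist_ok=True)
    path = fallback_dir / "unfiled-issue.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return f"saved:{path}"


def always_limited(_payload: dict) -> str:
    """Stand-in for an API call that is always rate-limited."""
    raise RateLimitError("API rate limit exceeded for installation")


# Demo with sleeping disabled so it finishes immediately.
outcome = create_with_fallback(
    always_limited,
    {"title": "[cli-tools-test] example report"},
    Path(tempfile.mkdtemp()),
    sleep=lambda seconds: None,
)
print(outcome)
```

A saved payload could then be uploaded as an artifact or re-filed by a later run; whether that fits the safe_outputs design is a separate question.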

Schema Feature Coverage Checker — protected files block all PRs

From 0_conclusion.txt (conclusion job) for run §24981796377:

GH_AW_CODE_PUSH_FAILURE_COUNT: 10

Each of 10 branches failed with:

Cannot create pull request: patch modifies protected files (.github/workflows/schema-demo-*.md).
Add them to the allowed-files configuration field or set protected-files: fallback-to-issue.

The agent correctly identified 10 uncovered schema fields and prepared valid patches; all were blocked by the default protected-files policy which covers **.github/workflows/**.
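The interaction between the protected-files default and an allowed-files entry can be illustrated with simple glob matching. This is only a sketch: it assumes the default behaves like the glob `.github/workflows/**` and that an allowed pattern overrides protection. The real gh-aw matching rules may differ, and `schema-demo-example.md` is a hypothetical file name.

```python
from fnmatch import fnmatchcase

# Assumed patterns; not taken from the gh-aw source.
PROTECTED = [".github/workflows/**"]
ALLOWED = [".github/workflows/schema-demo-*.md"]


def is_blocked(path, protected=PROTECTED, allowed=()):
    """Return True when path matches a protected pattern and no allowed pattern."""
    is_protected = any(fnmatchcase(path, pattern) for pattern in protected)
    is_allowed = any(fnmatchcase(path, pattern) for pattern in allowed)
    return is_protected and not is_allowed


demo = ".github/workflows/schema-demo-example.md"  # hypothetical file name

print(is_blocked(demo))                                        # default policy
print(is_blocked(demo, allowed=ALLOWED))                       # with allowed-files entry
print(is_blocked(".github/workflows/ci.md", allowed=ALLOWED))  # other workflow file
```

With the narrow allowed pattern, only the schema-demo files are unblocked; other workflow files stay protected, which is the intent of the proposed P0 fix.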


Existing Issue Correlation


Proposed Fix Roadmap

| Priority | Item | Effort |
|---|---|---|
| P0 | Schema Feature Coverage Checker: add .github/workflows/schema-demo-*.md to allowed-files (sub-issue #28674) | Low |
| P1 | GitHub Remote MCP Auth Test: update frontmatter to supported Copilot model (see #28660) | Low |
| P1 | Daily CLI Tools: investigate installation API rate limit budget; safe_outputs create_issue was rate-limited at 05:46–05:49 UTC | Medium |
| P2 | Documentation Unbloat: review 30-min step timeout for 58-turn runs; either raise the timeout or optimize post-processing | Medium |

Sub-Issues Created

References:



6h Window Update — 2026-04-27 ~07:13–13:13 UTC

Overview

37 runs in window (16 Claude, 17 Copilot, 4 Codex) · 0 classified failures · 3 runs with error_count > 0 · all individual failures auto-tracked via [aw] issues · overall health improved vs. prior window.

Failure Clusters

| Pattern | Affected workflows | Runs | Tracking |
|---|---|---|---|
| node: command not found (Copilot engine) | Daily News, Daily Issues Report Generator | §24986870660, §24990655972 | Sub-issue #aw_node1 |
| codex: command not found (Codex engine) | Daily Fact About gh-aw | §24992928191 | #28703 |
| No safe outputs emitted | Package Specification Enforcer | §24991256961 | #28692 |
| Missing mcp__playwright__browser_run_code | Multi-Device Docs Tester | §24994599602 | #28717 |
| Missing agentic-workflows MCP status tool | Daily Rendering Scripts Verifier | §24992350068 | (not tracked) |
| Broken links in Copilot PRs (starlight-links-validator) | Visual Regression Checker | §24993828520, §24995461013 | #28677 |

Previously Tracked Items — Status

| Item | Status |
|---|---|
| P0 Schema Feature Coverage Checker (#28674) | Open; config fix not yet merged |
| P1 GitHub Remote MCP Auth Test (#28660) | Open; no retry observed in window |
| P1 Daily CLI Tools rate limit | No recurrence in current window |
| P2 Documentation Unbloat (#28659) | No recurrence in current window |

Observability

Firewall block rate: 15% (139/916 requests blocked), improved significantly from the 48% noted in §24978441315. Dominant blocked domain: (unknown) (119 requests). proxy.golang.org was blocked in Refactoring Cadence (10 requests) despite being in the global allowlist, so the per-workflow firewall config is likely more restrictive; the run completed successfully.
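The block-rate figures above can be reproduced with a few lines of arithmetic. The sketch assumes the 10 proxy.golang.org requests are counted within the 139 blocked requests, which the report does not state explicitly.

```python
# Counts copied from the observability note above.
blocked, total = 139, 916
unknown_domain = 119
golang_proxy = 10  # assumed to be part of the 139 blocked requests

block_rate = blocked / total
unknown_share = unknown_domain / blocked
other_blocked = blocked - unknown_domain - golang_proxy

print(f"block rate:             {block_rate:.1%}")
print(f"unknown-domain share:   {unknown_share:.1%}")
print(f"other blocked requests: {other_blocked}")
```

Most blocked requests fall under the (unknown) domain, so attributing those is likely to explain more of the block rate than the proxy.golang.org case.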

Sub-Issues Added

  • #aw_node1 — Engine binary missing at runner startup (node/codex not found) — P1

References:

  • §24992928191 — Daily Fact About gh-aw (codex not found)
  • §24991256961 — Package Specification Enforcer (no safe output)
  • §24992350068 — Daily Rendering Scripts Verifier (missing aw-mcp status tool)

Generated by [aw] Failure Investigator (6h) · ● 631.2K ·



6h Window Update — 2026-04-27 ~19:30 – 2026-04-28 ~01:30 UTC

Overview

2 auto-generated failure issues in the window. 1 is a true failure (recurrence of existing P1); 1 is a likely false positive (agent completed, workflow still concluded failure). No P0 failures; no new blocking issues.

Failure Clusters

| Workflow | Run | Engine | Conclusion | Root Cause | Priority | Tracking |
|---|---|---|---|---|---|---|
| Go Logger Enhancement | §25020571393 | Claude Code | failure | Engine killed at 21:46:43 UTC mid-API-call; mcpscripts.make called twice immediately before kill (21:46:19, 21:46:37) | P1 | Recurrence of #28653 |
| Agentic Workflow Audit Agent | §25019817167 | Claude Code | failure | Agent completed (terminal_reason: completed, 53 turns, $1.91, created discussion #28804) but workflow concluded failure; auto-issue #28806 created, likely a false positive | P1 | New; sub-issue #aw_audit1 |

Evidence

Go Logger Enhancement — engine killed mid-session (§25020571393)

From agent-stdio.log:

2026-04-27T21:46:19.739Z  mcpscripts.make called (call 1)
2026-04-27T21:46:37.706Z  mcpscripts.make called (call 2)
2026-04-27T21:46:43.546Z [DEBUG] autocompact: tokens=58283 threshold=167000
2026-04-27T21:46:43.548Z [DEBUG] [API REQUEST] /v1/messages source=sdk

The log ends abruptly: no API response, no terminal_reason. The token count (58K) is well below the 167K compaction threshold, ruling out context pressure. mcpscripts.make was invoked (so MCP was alive), then the engine was killed about 6 s later during the next API call. The pattern matches #28653.
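The two numbers this analysis relies on, the gap between the last mcpscripts.make call and the final log entry and the token headroom, can be recomputed from the excerpt. This is a small sketch over the four log lines quoted above; the parsing helper is illustrative and assumes each line starts with an ISO 8601 timestamp.

```python
import re
from datetime import datetime

LOG = """\
2026-04-27T21:46:19.739Z  mcpscripts.make called (call 1)
2026-04-27T21:46:37.706Z  mcpscripts.make called (call 2)
2026-04-27T21:46:43.546Z [DEBUG] autocompact: tokens=58283 threshold=167000
2026-04-27T21:46:43.548Z [DEBUG] [API REQUEST] /v1/messages source=sdk
"""


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp; 'Z' is rewritten for older Python versions."""
    stamp = line.split()[0].replace("Z", "+00:00")
    return datetime.fromisoformat(stamp)


lines = LOG.splitlines()
last_make = parse_ts(next(l for l in reversed(lines) if "mcpscripts.make" in l))
last_entry = parse_ts(lines[-1])
gap_seconds = (last_entry - last_make).total_seconds()

match = re.search(r"tokens=(\d+) threshold=(\d+)", LOG)
tokens, threshold = int(match.group(1)), int(match.group(2))

print(f"gap between last make call and final entry: {gap_seconds:.3f} s")
print(f"context usage: {tokens / threshold:.1%} of compaction threshold")
```

The gap comes out just under 6 seconds and the context usage at roughly a third of the threshold, matching the reasoning above.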

Agentic Workflow Audit Agent — agent success, workflow failure (§25019817167)

From agent-stdio.log:

2026-04-27T21:44:49.870Z  create_discussion completed successfully in 76ms
2026-04-27T21:45:00.080Z {"type":"result","subtype":"success","terminal_reason":"completed","num_turns":53,"total_cost_usd":1.91}

The agent completed 53 turns, created discussion #28804 (audits category), and exited cleanly. Auto-issue #28806 was created at 21:47 UTC (2 min later) with "Engine Failure: The claude engine terminated unexpectedly." The auto-issue body contains the complete terminal_reason: completed JSON, yet the harness still fired the failure signal. This indicates the workflow's GitHub Actions conclusion was set to failure by a step other than the agent job, possibly the conclusion/safe-outputs job or a post-processing step.
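A check along the following lines could flag this kind of mismatch before an "Engine Failure" issue is filed. It is a sketch under assumptions: the function name is hypothetical, and it treats the result line quoted above as the only signal, whereas a real harness would likely also inspect per-job conclusions.

```python
import json

# Result line copied from agent-stdio.log above.
RESULT_LINE = (
    '{"type":"result","subtype":"success","terminal_reason":"completed",'
    '"num_turns":53,"total_cost_usd":1.91}'
)


def likely_false_positive(result_line: str, workflow_conclusion: str) -> bool:
    """True when the agent reported clean completion but the workflow concluded failure."""
    try:
        result = json.loads(result_line)
    except json.JSONDecodeError:
        return False
    if not isinstance(result, dict):
        return False
    agent_ok = (
        result.get("type") == "result"
        and result.get("subtype") == "success"
        and result.get("terminal_reason") == "completed"
    )
    return agent_ok and workflow_conclusion == "failure"


print(likely_false_positive(RESULT_LINE, "failure"))
print(likely_false_positive("log ends abruptly", "failure"))
```

The second call models the Go Logger case, where no result line exists; that run is correctly not treated as a false positive.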

Previously Tracked Items — Status

| Item | Status |
|---|---|
| P1 Go Logger MCP timeout (#28653) | Recurrence confirmed; run §25020571393 shows identical failure signature |
| P0 Schema Feature Coverage protected-files (#28674) | Open; config fix not yet merged |
| P1 Copilot/Codex binary missing (#28726) | Active; Issue Monster triaged (22:54 UTC), firewall tracking linked |
| Design Decision Gate safeoutputs drop (#28740) | Outside window; auto-issue expires 2026-04-28 02:29 UTC |

Sub-Issues Added

References:

Generated by [aw] Failure Investigator (6h) · ● 347.8K ·
