[aw-failures] [aw] Failure Investigation Report 2026-05-01 (6h window): 3 clusters, 7 failures #29540

@github-actions

Executive Summary

7 failures in the 6h window (2026-05-01T11:32–12:28Z) across 6 workflows. Three distinct root-cause clusters identified:

  • Cluster A — GitHub API Rate Limiting (4/7 runs): Concurrent safe-output writes exhausted the installation-token quota; create_issue, add_labels, and lock issue all failed after retries.
  • Cluster B — Missing CLI binaries in AWF chroot (2/7 runs): codex not on PATH (Codex engine) and Node.js unavailable (Copilot CLI engine), both resulting in exit 127.
  • Cluster C — Python pip dependency failure (1/7 run): scipy fails to generate package metadata; follow-on pip install pandas matplotlib seaborn timed out after 2× 120 s waits.

One sub-issue for Cluster A (P0) is linked below. Clusters B and C are documented here as P1 and P2 for follow-up.


Failure Cluster Table

| Run ID | Workflow | Engine | Cluster | Conclusion | Run URL |
|---|---|---|---|---|---|
| 25212818396 | Daily Fact About gh-aw | Codex | B | exit 127 (codex not found) | §25212818396 |
| 25213299148 | Daily Skill Optimizer Improvements | Copilot | A | create_issue rate-limit after 3 retries | §25213299148 |
| 25213666728 | AI Moderator | Codex | A | add_labels rate-limit | §25213666728 |
| 25213669352 | Step Name Alignment | Claude | A | create_issue rate-limit after 3 retries | §25213669352 |
| 25213746690 | Daily Issues Report Generator | Copilot | B | exit 127 (Node.js not in chroot) | §25213746690 |
| 25213787885 | GitHub MCP Structural Analysis | Claude | C | scipy install error → pandas timeout | §25213787885 |
| 25214243935 | AI Moderator | Codex | A | lock issue rate-limit | §25214243935 |

Evidence

Cluster A — GitHub API Rate Limiting (4 runs)

All four failures share the same error pattern from the safe_outputs workflow step:

API rate limit exceeded for installation. request ID ...
timestamp 2026-05-01 12:09:07 UTC

Timeline: a burst of concurrent runs started between 12:05 and 12:10 UTC; by 12:09 UTC the installation token's quota was exhausted. A second, isolated hit at 12:28 UTC (AI Moderator on an issue_comment event) shows the quota had not fully recovered.

Affected operations:

  • create_issue (Daily Skill Optimizer, Step Name Alignment): 3 retries each, ~90 s total wait, still failed
  • add_labels (AI Moderator): the first attempt failed and no retry succeeded
  • lock issue (AI Moderator lock workflow): pre-activation step failed

All failures are in the safe-outputs processing layer, not the agent itself. Agent work was completed successfully in all four cases.

Comparator: Copilot CLI Deep Research Agent (§25213682014) started at 12:06Z and succeeded — it used the noop safe output which requires no write API calls, so it was unaffected.
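
The retry behavior described under "Affected operations" waited roughly 90 s and still failed, which suggests the wait was shorter than the quota reset. A minimal sketch of how a retry wait could be derived from GitHub's documented rate-limit response headers (retry-after, x-ratelimit-remaining, x-ratelimit-reset) follows. The function name rate_limit_wait_seconds and its defaults are hypothetical and illustrative; the report does not show how the safe-outputs handler actually computes its delays.

```python
import random
import time
from typing import Mapping, Optional


def rate_limit_wait_seconds(
    headers: Mapping[str, str],
    attempt: int,
    now: Optional[float] = None,
    base: float = 60.0,   # illustrative default, not taken from the report
    cap: float = 900.0,   # illustrative upper bound for the fallback backoff
) -> float:
    """Return how many seconds to wait before retrying a rate-limited call.

    The caller decides whether the returned wait fits its time budget.
    """
    now = time.time() if now is None else now
    lowered = {key.lower(): value.strip() for key, value in headers.items()}

    # 1. An explicit Retry-After value (in seconds) takes priority.
    retry_after = lowered.get("retry-after", "")
    if retry_after.isdigit():
        return float(retry_after)

    # 2. Quota exhausted: wait until the advertised reset (UTC epoch seconds).
    if lowered.get("x-ratelimit-remaining") == "0":
        reset = lowered.get("x-ratelimit-reset", "")
        if reset.isdigit():
            return max(float(reset) - now, 0.0)

    # 3. No usable headers: exponential backoff plus up to 10% jitter so that
    #    concurrent runs do not all retry at the same instant.
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, 0.1 * delay)
```

With this shape, a handler can skip retries that are certain to fail: if the header-derived wait exceeds the job's remaining time, it may be better to record the pending output and stop than to spend three retries inside the same exhausted window.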

Cluster B — Missing CLI binaries in AWF chroot (2 runs)

Daily Fact About gh-aw (Codex engine, §25212818396):

/bin/bash: line 1: codex: command not found
Process exiting with code: 127

The entrypoint tries to run codex exec but the binary is absent from PATH inside the chroot. The run never reached the agent.

Daily Issues Report Generator (Copilot CLI engine, §25213746690):

[entrypoint][ERROR] Copilot CLI requires Node.js, but 'node' is not available inside AWF chroot.
[entrypoint][ERROR] Ensure Node.js is installed on the runner and reachable from PATH inside the chroot.
Process exiting with code: 127

The Copilot CLI harness detects that Node.js is missing and exits deliberately with code 127 rather than crashing. The runner likely used a different image or lost a cached tool.

Both failures are pre-agent (harness launch failures) with zero tokens consumed.
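
Both failures come down to a required binary not being resolvable on PATH. A minimal sketch of a pre-flight check that reports every missing binary at once is shown below. The helper names missing_binaries and preflight are hypothetical, and the real AWF entrypoint is not shown in this report, so this only illustrates the idea using Python's standard shutil.which.

```python
import shutil
import sys
from typing import Iterable, List, Optional


def missing_binaries(required: Iterable[str], path: Optional[str] = None) -> List[str]:
    """Return the names in `required` that cannot be resolved on PATH."""
    # shutil.which returns None when no executable with that name is found.
    return [name for name in required if shutil.which(name, path=path) is None]


def preflight(required: Iterable[str]) -> int:
    """Print one actionable line per missing binary; return a shell-style code."""
    missing = missing_binaries(required)
    for name in missing:
        print(
            f"[preflight][ERROR] '{name}' is not available on PATH; "
            "check the runner image and any bind-mounts into the chroot.",
            file=sys.stderr,
        )
    # 127 mirrors the "command not found" exit code seen in both runs.
    return 127 if missing else 0
```

For example, a Codex-engine run might call preflight(["codex"]) and a Copilot CLI run preflight(["node"]) before launching the agent, so the failure is reported in the same format for both engines.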

Cluster C — Python pip dependency failure (1 run)

GitHub MCP Structural Analysis (Claude engine, §25213787885):

The workflow installs Python packages at runtime. scipy fails first:

× Encountered error while generating package metadata.
╰─> scipy
note: This is an issue with the package mentioned above, not pip.

The agent recovers and tries pip install pandas matplotlib seaborn. The install runs in the background; the agent polls it twice with a 120 s timeout each, and both polls time out. Total runtime was 17.4 min before the workflow failed.

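Probable root cause: no binary wheel was used for scipy, so pip attempted a source build, which depends on native compilation tools (Fortran/C) that are not available in the sandbox. The pandas/matplotlib/seaborn install did not finish within the two 120 s polls (>240 s), which points to either a slow download or a fallback to source builds where binary wheels were missing.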

Remediation: pre-install or pin dependencies in the workflow setup, or use a requirements file with only binary-wheel packages.
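
One way to apply the "binary-wheel packages only" remediation is pip's --only-binary=:all: option, which makes pip fail fast instead of attempting a source build. The sketch below assumes the install is driven from Python; the function names and the 240 s default timeout are hypothetical choices for illustration, not part of the workflow described here.

```python
import subprocess
import sys
from typing import List, Sequence


def build_pip_command(packages: Sequence[str]) -> List[str]:
    """Build a pip invocation that refuses to compile anything from source."""
    return [
        sys.executable, "-m", "pip", "install",
        "--only-binary=:all:",          # fail instead of building an sdist
        "--disable-pip-version-check",  # skip the extra version lookup
        *packages,
    ]


def install_wheels_only(packages: Sequence[str], timeout_s: float = 240.0) -> bool:
    """Run the install with a hard timeout; return True only on success."""
    try:
        result = subprocess.run(
            build_pip_command(packages), timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        return False
    return result.returncode == 0
```

Run in a setup step before the agent starts, a call such as install_wheels_only(["scipy", "pandas", "matplotlib", "seaborn"]) would surface a missing wheel as an immediate, explicit failure rather than a metadata-generation error partway through the agent session.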


Existing Issue Correlation

Unable to read existing open issues (GitHub API not authenticated in this context), so sub-issues are created de novo. Reviewers should check for duplicates among existing agentic-workflows issues for Clusters B and C.


Proposed Fix Roadmap

| Priority | Cluster | Fix |
|---|---|---|
| P0 | A (rate limiting) | Stagger concurrent workflow trigger times to avoid burst; add rate-limit backoff/retry logic in safe-outputs handler; see sub-issue #29541 |
| P1 | B (missing CLI binaries) | Add pre-flight binary checks to entrypoint with actionable error messages and runner image pinning; verify codex binary deploy pipeline and Node.js bind-mount |
| P2 | C (pip deps) | Pre-install required Python packages in workflow setup step or use a pre-built requirements image for GitHub MCP Structural Analysis |
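
For the P0 item, staggering could be done by giving each scheduled workflow a stable minute offset derived from its name, so that workflows sharing an hour do not all fire together. This is a minimal sketch under that assumption; stagger_minute is a hypothetical helper, and it applies only to schedule-triggered workflows, not to event-triggered ones such as AI Moderator on issue_comment.

```python
import hashlib


def stagger_minute(workflow_name: str, window_minutes: int = 60) -> int:
    """Map a workflow name to a stable minute offset in [0, window_minutes)."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    # A cryptographic hash is used only for its even spread and its stability
    # across runs; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


if __name__ == "__main__":
    # Workflow names are taken from the failure table in this report.
    for name in (
        "Daily Skill Optimizer Improvements",
        "Step Name Alignment",
        "Daily Issues Report Generator",
    ):
        print(f"{name}: minute {stagger_minute(name)}")
```

Two names can still hash to the same minute, so the offsets reduce the chance of a burst rather than rule it out; an explicit, hand-assigned schedule is the alternative when the set of workflows is small.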

Sub-issues Created

References:
