[aw-failures] [aw] Failure Investigator (6h) - Issue Group #29232

@github-actions

[aw] Failure Investigator (6h)

Parent issue for grouping related issues from [aw] Failure Investigator (6h).

Sub-issues are automatically linked below (max 64 per parent).

Workflow: [aw] Failure Investigator (6h)

  • expires on May 8, 2026, 1:30 AM UTC

Update — 2026-04-30 ~19:00 UTC (run §25184286958)

Failures in last 6h window (39 runs, 4 true failures)

| Cluster | Runs | Status | Tracking |
| --- | --- | --- | --- |
| Design Decision Gate GitHub MCP drops (~100s timeout) | 25179263531, 25181104179, 25177070075 | Recurring P1 | New sub-issue created |
| Smoke Claude safe_outputs resolve_thread failure | 25181816514 | New P2 | New sub-issue created |
| Design Decision Gate gate rejection (expected) | 25181238397 | Expected behavior | No action needed |

Root Causes Confirmed

DDG MCP drops: All three runs show an identical signature. The GitHub MCP HTTP connection drops at ~100–110s uptime with Terminal connection error 1/3 → 2/3 → exit 1. The DDG agent takes 8–13 minutes in total, so the MCP connection always drops before the agent finishes. The ~100s cutoff appears to be a hard limit on the GitHub MCP server connection.

Smoke Claude: Agent completed all 19 smoke tests successfully. The resolve_pull_request_review_thread safe output failed because thread PRRT_kwDOPc1QR85-0kYa on PR #29360 had already been resolved. 1 of 11 safe outputs failed, which caused the safe_outputs step to return failure.
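
The Smoke Claude failure mode above (one benign failure out of 11 outputs failing the whole step) can be sketched as follows. This is only an illustration, not the actual safe_outputs implementation: the `SafeOutputResult` type, the `step_failed` function, the `tolerate_benign` flag, the `"other"` placeholder kind, and the error text are all hypothetical, since the report does not show the real error message or the other ten outputs.

```python
from dataclasses import dataclass

# Hypothetical marker: the real error text from the safe_outputs step
# is not shown in this report.
ALREADY_RESOLVED_MARKER = "already resolved"


@dataclass
class SafeOutputResult:
    kind: str        # e.g. "resolve_pull_request_review_thread"
    ok: bool
    error: str = ""


def is_benign(result: SafeOutputResult) -> bool:
    """True when a failure only reports that the thread was already resolved."""
    return (
        not result.ok
        and result.kind == "resolve_pull_request_review_thread"
        and ALREADY_RESOLVED_MARKER in result.error.lower()
    )


def step_failed(results, tolerate_benign: bool = False) -> bool:
    """Strict mode fails on any failed output; tolerant mode skips benign ones."""
    for result in results:
        if result.ok:
            continue
        if tolerate_benign and is_benign(result):
            continue
        return True
    return False


# 10 successful outputs (placeholder kind) plus the one failed resolve call.
results = [SafeOutputResult("other", True) for _ in range(10)]
results.append(
    SafeOutputResult(
        "resolve_pull_request_review_thread",
        False,
        "thread PRRT_kwDOPc1QR85-0kYa was already resolved",  # assumed wording
    )
)

strict = step_failed(results)
tolerant = step_failed(results, tolerate_benign=True)
```

In strict mode the single failure marks the step as failed, which matches the observed behavior; treating an already-resolved thread as a no-op is one possible way to make the call idempotent, if that fits the workflow's intent.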

No stale issues closed

Open issues reviewed — #29355, #29279, #29275 are per-run auto-expiring reports; #29231/#29233 remain active tracking. No actionable closures warranted.

Generated by [aw] Failure Investigator (6h) · ● 397.8K


Update — 2026-05-01 ~01:30 UTC (run §25197698525)

Failures in last 6h window (21 runs, 2 true failures)

| Cluster | Runs | Status | Tracking |
| --- | --- | --- | --- |
| DDG error_max_turns on complex PRs (bash safeoutputs not authorized) | §25196075860, §25196318154 | New P1 (distinct from prior MCP-drop pattern) | New sub-issue created |

Root Cause Confirmed — Different from prior DDG MCP-drop failures

Both new DDG failures are caused by error_max_turns, NOT an active GitHub MCP connection drop. Evidence from agent-stdio.log:

terminal_reason: max_turns
errors: ["Reached maximum number of turns (12)"]
num_turns: 13

The GitHub MCP "drop" messages in the failure reporter appear only during post-session cleanup (after the agent exited). This distinguishes these from runs 25177070075/25178770914/etc. where the GitHub MCP dropped mid-session.
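
The distinction drawn above, between a max_turns exit and a mid-session MCP drop, could be automated during triage with something like the sketch below. It assumes the relevant fields appear as plain `key: value` lines, as in the excerpt; the real agent-stdio.log may be structured differently (for example as JSON). The `parse_fields` and `classify` functions and the `mcp_drop_before_exit` flag are hypothetical; the caller would have to derive that flag from log timestamps.

```python
import re


def parse_fields(log_text: str) -> dict:
    """Collect simple `key: value` lines into a dict (last occurrence wins)."""
    fields = {}
    for line in log_text.splitlines():
        match = re.match(r"^\s*([A-Za-z_]+):\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def classify(log_text: str, mcp_drop_before_exit: bool) -> str:
    """Label a failed run; a max_turns exit takes priority over MCP drop messages."""
    fields = parse_fields(log_text)
    if fields.get("terminal_reason") == "max_turns":
        return "max_turns"
    if mcp_drop_before_exit:
        return "mcp_drop_mid_session"
    return "unknown"


# Excerpt quoted in this update.
excerpt = """\
terminal_reason: max_turns
errors: ["Reached maximum number of turns (12)"]
num_turns: 13
"""

label = classify(excerpt, mcp_drop_before_exit=False)
```

Checking `terminal_reason` first reflects the finding in this update: drop messages that appear only during post-session cleanup should not be counted as the cause.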

What actually happened (both runs):

  • Agent successfully analyzed the PR diff on complex PRs (docs/variable changes) — 13 turns consumed
  • ADR draft written and pushed to PR branch via MCP push_to_pull_request_branch (2 successful calls)
  • Agent attempted bash safeoutputs add_comment to post the PR review comment → permission denied (not pre-authorized in DDG workflow)
  • Agent hit max_turns (12) trying to find an alternative path
  • Pre-step file /tmp/gh-aw/agent/adr-prefetch-summary.json was absent, costing extra analysis turns (was present in Apr 21 runs)

Successful DDG runs in same window — 4 runs all completed in 4–5 turns on simple fix PRs. No max_turns issue.

Historical context

Prior sub-issue #27470 (closed as completed on Apr 21) raised max_turns from 5 to 7. The limit is now 12, but complex PRs with large diffs and a missing pre-step file need 13+ turns. The adr-prefetch-summary.json pre-step appears to have regressed since April 21.

No stale issues closed

Generated by [aw] Failure Investigator (6h)

Note

🔒 Integrity filter blocked 3 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by [aw] Failure Investigator (6h) · ● 697.5K
