[aw-failures] Workflow timing out at 40min — MCP get_file_contents 37–71s per call, LLM turns 4–10min #27556

@viktoriyabogdanova

Hi,
We are seeing consistent 40-minute timeouts on our scaffold workflow starting today (Apr 21). Two of our runs hit the hard timeout before the agent could commit/push the scaffold branch. This was working reliably in ~15 minutes last week.

Affected runs (today):

09:26:47  ● Read module standard
09:29:11  ● Create feature branch              (2m 24s gap)
09:31:50  ● List AVM module files              (2m 39s gap)
09:34:00  ● Check existing renovate            (2m 10s gap)
09:35:36  ● Create root outputs.tf
09:37:51  ● Create examples/data.tf            (2m 15s gap)
09:45:21  ● Create storage_account.tftest.hcl  (4m 36s gap)
09:47:42  ● Create README.md                   (2m 21s gap)
09:48:46  ● List all terraform files
09:59:14  ● Response was interrupted due to a server error. Retrying... (10m 28s gap)
10:06:01  ##[error] The action 'Execute GitHub Copilot CLI' has timed out after 40 minutes.
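
The gap annotations above were worked out by hand. As a rough sketch, something like the following could recompute them from any `HH:MM:SS  ● step` timeline. The function name `step_gaps` is made up for illustration, and the parser assumes every step falls on the same day (no midnight wrap).

```python
from datetime import datetime

def step_gaps(log_text):
    """Return (step, seconds since the previous step) for each timestamped line."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        try:
            ts = datetime.strptime(parts[0], "%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        rows.append((ts, parts[1].lstrip("● ").strip()))
    # Pair each step with its predecessor and measure the difference.
    return [
        (step, int((ts - prev_ts).total_seconds()))
        for (prev_ts, _), (ts, step) in zip(rows, rows[1:])
    ]

# First four steps of the failing run, without the hand-written annotations.
sample = """\
09:26:47  ● Read module standard
09:29:11  ● Create feature branch
09:31:50  ● List AVM module files
09:34:00  ● Check existing renovate
"""

for step, seconds in step_gaps(sample):
    print(f"{seconds // 60}m {seconds % 60:02d}s  {step}")
```

For these four lines it reports gaps of 2m 24s, 2m 39s and 2m 10s, matching the annotations.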

Working run for comparison (Apr 16, different module):

13:33:42  ● Read module standard
13:34:30  ● Check current git state            (48s gap — longest)
13:34:36  ● Check AVM module variables         (6s gap)
13:38:29  ● Create feature branch              (LLM planning)
13:38:35  ● Create required directories        (6s)
13:39:20  ● Create examples/provider.tf        (22s)
... all remaining gaps 48s or less (apart from the LLM planning step), completed in 11m 38s total

gh-aw version: v0.68.3

Direct timing comparison:

| Metric | Apr 16 (working, 11m38s) | Apr 21 (failing, 40m timeout) |
| --- | --- | --- |
| LLM turn time (between tool calls) | 0–48 seconds max | 4–10 minutes |
| AVM module file reads | 6–19s (shell curl/gh api) | 37–71s each (get_file_contents MCP) |
| File creation tool calls | 0–22s | 1–4 minutes between each |
| Server errors | None | "Response was interrupted due to a server error. Retrying..." at 09:59 |
| Completed? | ✅ Yes | ❌ Both timed out |

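Notable: last week the agent read AVM module files via shell (curl/gh api), taking 6–19s each. Today it routes through get_file_contents (MCP: github), taking 37–71s each: roughly 4–6x slower range for range, and close to 12x when the slowest MCP call (71s) is set against the fastest shell read (6s). LLM turns went from 48 seconds at most to 4–10 minutes, at least 5x longer than last week's slowest turn.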
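
The file-read slowdown can be checked directly from the two ranges quoted above. This is only a small arithmetic sketch; the helper name `slowdown` and its dictionary keys are illustrative, not part of any tool.

```python
def slowdown(before, after):
    """Compare two (min, max) latency ranges given in seconds.

    Returns like-for-like ratios (min vs min, max vs max) plus the
    worst case (slowest 'after' against fastest 'before').
    """
    b_lo, b_hi = before
    a_lo, a_hi = after
    return {
        "min_vs_min": a_lo / b_lo,
        "max_vs_max": a_hi / b_hi,
        "worst_case": a_hi / b_lo,
    }

# Shell reads on Apr 16 (6–19s) vs get_file_contents on Apr 21 (37–71s).
reads = slowdown(before=(6, 19), after=(37, 71))
for name, ratio in reads.items():
    print(f"{name}: {ratio:.1f}x")
```

The worst-case figure (71s against 6s) comes out just under 12x, while the like-for-like ratios are lower.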

Observed symptoms:

get_file_contents (MCP: github) calls taking 37–71 seconds each, reading individual files from Azure/terraform-azurerm-avm-res-storage-storageaccount via the GitHub MCP server. Example from run 1:

    variables.tf → 71s
    variables.storage.tf → 51s (then failed with path-not-found)
    variables.storageaccount.tf → 62s

LLM turns taking 4–10 minutes, e.g. the gap from 09:48:46 to 09:59:14 (10m28s), which ended with "Response was interrupted due to a server error. Retrying...".

Neither run completed — both exhausted the full 40-minute budget without reaching the git commit/push step.
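
To help separate MCP server latency from GitHub API latency, one option is to time the same file read through `gh api`, which is how the agent fetched these files last week. The sketch below is an assumption about how that baseline could be measured, not something we have run as part of the workflow. `timed` and `fetch_via_gh` are hypothetical helper names, and the call needs an authenticated gh CLI.

```python
import subprocess
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fetch_via_gh(repo, path):
    """Fetch a file through the GitHub REST contents endpoint via `gh api`."""
    return subprocess.run(
        ["gh", "api", f"repos/{repo}/contents/{path}"],
        capture_output=True, text=True, check=True,
    ).stdout

# Usage (requires network access and gh auth, so it is left commented out):
# _, secs = timed(
#     fetch_via_gh,
#     "Azure/terraform-azurerm-avm-res-storage-storageaccount",
#     "variables.tf",
# )
# print(f"variables.tf via gh api: {secs:.1f}s")
```

If the direct read still takes a few seconds while get_file_contents takes 37–71s, that would point at the MCP path rather than the GitHub API itself.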

Environment:

Lock file pins: ghcr.io/github/gh-aw-mcpg:v0.2.19, github-mcp-server:v0.32.0, gh-aw-firewall:0.25.20
gh-aw-actions@ba90f2186d7ad780ec640f364005fa24e797b360
No changes on our side between Apr 16 (working) and Apr 21 (failing)

Suspected cause: Copilot API rate limiting / degradation noted in #27339 (P1: rate limits) combined with elevated GitHub MCP server response times. The combination means a workflow that normally completes in 15 minutes cannot finish within the 40-minute cap.
