Safe-outputs MCP transport silently closes on idle during long agent runs — no outputs produced, workflow reports success #20885

@benvillalobos

AI-assisted report

Summary

When an agent spends several minutes on analysis before invoking safe-output tools, the HTTP connection to the safe-outputs MCP server goes idle and is dropped. The agent's subsequent safe-output tool calls cannot reach the server, so the run produces zero outputs, yet the workflow still reports success. This is a silent failure: no labels, assignments, or comments are applied.

Impact

In a sample of the last 20 issue-triage workflow runs in microsoft/vscode-engineering, 7 out of 20 runs (35%) produced zero safe outputs due to this failure. All 7 reported workflow conclusion success. The affected issues remained untriaged with triage-needed still applied.

Reproduction

  1. Create an agentic workflow with safe-output tools (e.g., add_labels, assign_to_user)
  2. Give the agent a task that requires multi-step analysis before invoking safe-output tools (reading files, classification, searching references)
  3. Observe that if the agent spends roughly 5 minutes or more analyzing before its first safe-output tool call, the MCP transport closes

Error Sequence (from agent logs)

2026-03-13T13:43:16.749Z [ERROR] MCP client for safeoutputs errored TypeError: fetch failed
2026-03-13T13:43:16.750Z [ERROR] MCP client for safeoutputs errored TypeError: fetch failed
2026-03-13T13:44:29.233Z [ERROR] MCP transport for safeoutputs closed
2026-03-13T13:44:29.233Z [ERROR] MCP client for safeoutputs closed

The agent completes its analysis and attempts to invoke safeoutputs-add_labels, but the transport is already dead. No reconnect is attempted, and outputs.jsonl is never written.
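To spot affected runs without reading each log by hand, the error sequence above can be matched mechanically. The following is a minimal sketch that assumes agent logs keep the `timestamp [LEVEL] message` shape shown in the excerpt; the function names are hypothetical and not part of any runtime.

```python
import re

# Matches lines shaped like the excerpt above: "<timestamp> [LEVEL] <message>".
LOG_LINE = re.compile(r"^(?P<ts>\S+) \[(?P<level>[A-Z]+)\] (?P<msg>.*)$")


def find_transport_failures(log_text: str, server: str = "safeoutputs") -> list[tuple[str, str]]:
    """Return (timestamp, message) pairs for ERROR lines about the given MCP server."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match or match.group("level") != "ERROR":
            continue
        msg = match.group("msg")
        if f"MCP client for {server}" in msg or f"MCP transport for {server}" in msg:
            hits.append((match.group("ts"), msg))
    return hits


def transport_closed(log_text: str, server: str = "safeoutputs") -> bool:
    """True if the log records the transport for `server` being closed."""
    closed_msg = f"MCP transport for {server} closed"
    return any(msg == closed_msg for _, msg in find_transport_failures(log_text, server))


# The four log lines quoted in this report.
SAMPLE = """\
2026-03-13T13:43:16.749Z [ERROR] MCP client for safeoutputs errored TypeError: fetch failed
2026-03-13T13:43:16.750Z [ERROR] MCP client for safeoutputs errored TypeError: fetch failed
2026-03-13T13:44:29.233Z [ERROR] MCP transport for safeoutputs closed
2026-03-13T13:44:29.233Z [ERROR] MCP client for safeoutputs closed
"""
```

Running `transport_closed` over the agent log of each completed run would flag the runs in which the safe-outputs transport died, independent of the workflow conclusion.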

Root Cause Analysis

The safe-outputs MCP server communicates over HTTP through the MCP Gateway. During the agent's analysis phase (reading issue data, skill files, working-areas references), no requests are sent to the safe-outputs server. After ~5 minutes of idle time, the HTTP connection is dropped — likely by the Docker network stack, OS TCP keepalive, or the gateway's connection management.
Key gaps identified:

| Gap | Location | Detail |
| --- | --- | --- |
| No HTTP keepalive/ping | MCP HTTP transport | No heartbeat mechanism to keep idle connections alive |
| No auto-reconnect | copilot-agent-runtime StreamableHTTPClientTransport | Client does not retry on TypeError: fetch failed |
| timeout config unused | MCPServerConfig in copilot-agent-runtime | Field exists but is never wired to the HTTP transport layer |
| Silent success | Workflow conclusion job | outputs.jsonl missing → no actions taken, but workflow reports success |
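The "silent success" gap could be narrowed by a post-agent check along the following lines. This is only a sketch: the default path `outputs.jsonl` and the function names are assumptions for illustration, and where gh-aw actually writes the file is not stated in this report. The `::warning` line uses the standard GitHub Actions workflow-command syntax for annotations.

```python
import os


def check_safe_outputs(path: str) -> tuple[bool, str]:
    """Return (ok, message); ok is False when the file is missing or has no entries."""
    if not os.path.isfile(path):
        return False, f"safe-outputs file {path} is missing; no outputs were produced"
    with open(path, encoding="utf-8") as handle:
        # Each non-blank line of a JSONL file is one recorded output.
        count = sum(1 for line in handle if line.strip())
    if count == 0:
        return False, f"safe-outputs file {path} is empty; no outputs were produced"
    return True, f"{count} safe output(s) recorded"


def report(path: str = "outputs.jsonl") -> int:
    """Print a status line (a warning annotation on failure) and return an exit code."""
    ok, message = check_safe_outputs(path)
    if ok:
        print(message)
        return 0
    # GitHub Actions workflow command: surfaces as a warning annotation on the run.
    print(f"::warning title=Safe outputs missing::{message}")
    return 1
```

Whether zero outputs should fail the job or only warn is a policy choice, since some runs may legitimately decide to take no action; the non-zero return code here is one option, not a recommendation from the runtime.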

Suggested Fixes

  1. Add keepalive/heartbeat to MCP HTTP transport — periodic pings to prevent idle connection closure
  2. Implement auto-reconnect — detect fetch failed and re-establish the transport before retrying the tool call
  3. Increase visibility — emit a warning annotation when the safe-outputs artifact is missing despite the agent job completing
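Fixes 1 and 2 can be sketched together as a thin wrapper around the transport. The code below is an illustration under stated assumptions, not the copilot-agent-runtime or MCP SDK API: the `Transport` interface, `ResilientClient`, and its parameters are all hypothetical, and `ConnectionError` stands in for the runtime's `TypeError: fetch failed`.

```python
import time
from typing import Any, Callable, Optional, Protocol


class Transport(Protocol):
    """Hypothetical minimal transport interface (not the real SDK)."""

    def connect(self) -> None: ...
    def ping(self) -> None: ...
    def call_tool(self, name: str, args: dict) -> Any: ...


class ResilientClient:
    """Wraps a transport with idle pings and reconnect-then-retry on failure."""

    def __init__(
        self,
        make_transport: Callable[[], Transport],
        ping_interval_s: float = 60.0,
        max_attempts: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._make_transport = make_transport
        self._ping_interval_s = ping_interval_s
        self._max_attempts = max_attempts
        self._clock = clock
        self._transport: Optional[Transport] = None
        self._last_activity = clock()

    def _ensure_connected(self) -> Transport:
        # Lazily (re)create the transport after a failure or on first use.
        if self._transport is None:
            transport = self._make_transport()
            transport.connect()
            self._transport = transport
            self._last_activity = self._clock()
        return self._transport

    def keepalive_tick(self) -> bool:
        """Call periodically; pings only when idle longer than the interval."""
        if self._clock() - self._last_activity < self._ping_interval_s:
            return False
        try:
            self._ensure_connected().ping()
        except ConnectionError:
            self._transport = None  # dead connection; the next call reconnects
            return False
        self._last_activity = self._clock()
        return True

    def call_tool(self, name: str, args: dict) -> Any:
        last_error: Optional[Exception] = None
        for _ in range(self._max_attempts):
            try:
                result = self._ensure_connected().call_tool(name, args)
                self._last_activity = self._clock()
                return result
            except ConnectionError as exc:
                last_error = exc
                self._transport = None  # drop the dead transport, then retry
        raise ConnectionError(
            f"tool call {name!r} failed after {self._max_attempts} attempts"
        ) from last_error
```

A ping interval well below the observed ~5 minute idle cutoff would be the natural setting. Retrying a tool call blindly is only safe if the call is idempotent or the failure is known to have occurred before the request reached the server, so a real implementation would likely need to distinguish those cases.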

Environment

  • gh-aw compiler: v0.50.0
  • Agent runtime: 0.0.415
  • AWF: v0.20.2
  • MCP Gateway: v0.1.5
  • GitHub MCP Server: v0.31.0
