Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# ADR-29354: Per-Workflow MCP Session Timeout via `engine.mcp` Frontmatter

**Date**: 2026-04-30
**Status**: Draft
**Deciders**: lpcox, Copilot SWE Agent

---

## Part 1 — Narrative (Human-Friendly)

### Context

All workflows sharing a single MCP gateway instance inherit the same session lifetime, controlled globally by the `MCP_GATEWAY_SESSION_TIMEOUT` environment variable set on the gateway container. There is no mechanism for individual workflow authors to declare the session lifetime their workflow needs. This is a problem because workflows have wildly different runtime profiles: a short code-review workflow may finish in minutes while a large-scale migration may run for several hours and require a long-lived MCP session to avoid mid-task reconnection failures.

### Decision

We will add an optional `engine.mcp.session-timeout` frontmatter field that workflow authors can use to declare the MCP gateway session lifetime required by their workflow. The compiler will parse the value, validate it is a Go duration string in the range 5m–12h, and propagate it as `sessionTimeout` into the gateway's stdin configuration JSON. The field coexists with the existing `MCP_GATEWAY_SESSION_TIMEOUT` env-var mechanism: the per-workflow value in the stdin config takes precedence over the env var, which takes precedence over the gateway's built-in default of 6h.

### Alternatives Considered

#### Alternative 1: `MCP_GATEWAY_SESSION_TIMEOUT` env-var-only (status quo)

The existing approach relies on a single environment variable that applies uniformly to all workflows on the gateway instance. It was considered acceptable because infrastructure operators can tune it. It was not chosen as the long-term solution because it requires infrastructure-level changes (redeploy) to accommodate per-workflow requirements and cannot express that two workflows running on the same gateway need different lifetimes.

#### Alternative 2: Place the field under `sandbox.mcp` instead of `engine.mcp`

The `sandbox.mcp` sub-object already hosts gateway-level settings like `trusted-bots` and `keepalive-interval`, which are operational concerns about how the gateway behaves. Session timeout was considered for placement there too. However, `engine.mcp` was chosen because session lifetime is an engine-level concern: it describes how long the engine's connection to the MCP gateway must remain alive, not a property of the sandbox environment itself. Keeping engine-specific knobs under `engine.*` maintains a cleaner frontmatter taxonomy.

#### Alternative 3: Integer seconds field instead of a Go duration string

Using an integer (seconds) would be simpler to parse and validate. A Go duration string was chosen instead because the rest of the codebase uses `time.Duration`-compatible strings for timeout configuration, it is more readable in YAML (`4h` versus `14400`), and it aligns with the format already used by `MCP_GATEWAY_SESSION_TIMEOUT` on the gateway side.

### Consequences

#### Positive
- Workflow authors can express session lifetime requirements declaratively alongside the workflow definition, without requiring infrastructure changes.
- Long-running workflows (multi-hour migrations, large batch operations) get appropriately long sessions; short workflows can use shorter values, freeing gateway resources sooner.
- Compile-time validation (5m–12h range, valid Go duration format) surfaces misconfiguration early with actionable error messages.

#### Negative
- Adds another frontmatter field and a new `engine.mcp` sub-object, increasing frontmatter surface area.
- The three-level precedence rule (stdin config > env var > gateway default) is non-obvious and must be documented; operators may be surprised that a per-workflow value overrides their env-var setting.
- Any workflow that sets `session-timeout` to a very long value could hold gateway resources for extended periods if a run hangs or is abandoned.

#### Neutral
- Constants `MCPSessionTimeoutMin` (5m) is added to `pkg/constants`, establishing a named home for the MCP session minimum bound that future policies can reference.
- Only the kebab-case (`session-timeout`) key spelling is accepted during extraction; the JSON Schema uses `additionalProperties: false` to reject any other keys under `engine.mcp`.
- The JSON Schema for `engine_config` gains a new `mcp` object definition with `additionalProperties: false`, which prevents unrecognized MCP sub-keys from silently passing validation.

---

## Part 2 — Normative Specification (RFC 2119)

> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).

### Frontmatter Field

1. The `engine.mcp.session-timeout` frontmatter field is **OPTIONAL**.
2. When present and non-empty, the value **MUST** be a valid Go duration string parseable by `time.ParseDuration`.
3. When present, the value **MUST** be greater than or equal to `5m` (as defined by `constants.MCPSessionTimeoutMin`); there is no upper bound.
4. Implementations **MUST** accept only kebab-case (`session-timeout`) for this key; camelCase is not supported.
5. Implementations **MUST NOT** accept any other spellings or sibling keys under `engine.mcp`; the `mcp` sub-object **MUST** be declared with `additionalProperties: false` in the JSON Schema.

### Validation

1. The compiler **MUST** validate `engine.mcp.session-timeout` during `ParseWorkflowFile`, before the workflow is further processed.
2. A validation error **MUST** produce a human-readable message that includes the offending value, the minimum bound, and a corrected example.
3. Workflows that omit `engine.mcp.session-timeout` or set it to an empty string **MUST** be treated as if the field were absent; no default value **SHALL** be injected by the compiler.

### Propagation

1. When `engine.mcp.session-timeout` is set to a valid non-empty value, the compiler **MUST** propagate it into `MCPGatewayRuntimeConfig.SessionTimeout`.
2. The renderer **MUST** emit `"sessionTimeout": "<value>"` inside the `gateway` JSON block when `MCPGatewayRuntimeConfig.SessionTimeout` is non-empty.
3. The renderer **MUST NOT** emit `"sessionTimeout"` when `MCPGatewayRuntimeConfig.SessionTimeout` is empty.
4. The emitted value **MUST** take precedence over the `MCP_GATEWAY_SESSION_TIMEOUT` environment variable at the gateway runtime.

### Conformance

An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.

---

*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/25178770914) workflow. The PR author must review, complete, and finalize this document before the PR can merge.*
55 changes: 53 additions & 2 deletions docs/src/content/docs/reference/mcp-gateway.md
Original file line number Diff line number Diff line change
Expand Up @@ -251,7 +251,8 @@ The `gateway` section is required and configures gateway-specific behavior:
| `payloadSizeThreshold` | integer | No | Size threshold in bytes for storing payloads to disk (default: 524288 = 512KB) |
| `trustedBots` | array[string] | No | Additional GitHub bot identity strings (e.g., `github-actions[bot]`) passed to the gateway and merged with its built-in trusted identity list. This field is additive — it extends the internal list but cannot remove built-in entries. |
| `keepaliveInterval` | integer | No | Keepalive ping interval in seconds for HTTP MCP backends. Prevents session expiry during long-running tasks. Use `-1` to disable, `0` or unset for gateway default (1500s = 25 min), or a positive integer for a custom interval. |
| `opentelemetry` | object | No | OpenTelemetry configuration for emitting distributed tracing events for MCP calls. See Section 4.1.3.6 for details. |
| `sessionTimeout` | string | No | Session timeout for MCP gateway sessions as a Go duration string (e.g. `"30m"`, `"4h"`, `"24h"`). Empty or omitted uses the gateway default (6h). Must be at least 5m when set by the workflow compiler (no upper bound; infrastructure operators may override via `MCP_GATEWAY_SESSION_TIMEOUT` env var). |
| `opentelemetry` | object | No | OpenTelemetry configuration for emitting distributed tracing events for MCP calls. See Section 4.1.3.7 for details. |

#### 4.1.3.1 Payload Directory Path Validation

Expand Down Expand Up @@ -449,7 +450,47 @@ sandbox:
- A value of `-1` disables keepalive pings entirely
- Any positive integer sets the keepalive interval in seconds

#### 4.1.3.6 OpenTelemetry Configuration
#### 4.1.3.6 Session Timeout Configuration

The optional `sessionTimeout` field in the gateway configuration controls how long stateful MCP sessions survive in both unified and routed modes.

| Value | Behavior |
|-------|----------|
| Unset / `""` | Gateway default: 6 hours (or `MCP_GATEWAY_SESSION_TIMEOUT` env var) |
| Go duration string | Custom session lifetime (e.g. `"30m"`, `"4h"`, `"6h"`, `"12h"`) |

**Precedence**: `sessionTimeout` in the stdin config JSON **>** `MCP_GATEWAY_SESSION_TIMEOUT` environment variable **>** gateway built-in default (6h).

**Configuration example (JSON)**:

```json
{
"gateway": {
"port": 8080,
"domain": "localhost",
"apiKey": "${MCP_GATEWAY_API_KEY}",
"sessionTimeout": "4h"
}
}
```

**Workflow frontmatter** (via `engine.mcp.session-timeout`):

```yaml
engine:
id: copilot
mcp:
session-timeout: 4h # 4-hour sessions for long-running migrations
```

**Compliance rules**:

- `sessionTimeout` MUST be a valid Go duration string when present (e.g. `"30m"`, `"4h"`)
- The workflow compiler enforces a minimum of `5m` for author-specified values (no upper bound)
- Infrastructure operators may set `MCP_GATEWAY_SESSION_TIMEOUT` on the gateway container to override the default for all workflows; a per-workflow `sessionTimeout` in the stdin config takes precedence
- When unset, the gateway uses its built-in default (6h)

#### 4.1.3.7 OpenTelemetry Configuration

The optional `opentelemetry` object in the gateway configuration enables the gateway to emit distributed tracing events for MCP calls using the [OpenTelemetry](https://opentelemetry.io/) standard. When configured, the gateway creates spans for each MCP tool invocation and exports them to the designated collector endpoint.

Expand Down Expand Up @@ -1852,6 +1893,16 @@ Content-Type: application/json
- **Added**: Normative references for W3C Trace Context and OTLP
- **Updated**: JSON Schema with `opentelemetry` property and `opentelemetryConfig` definition in `gatewayConfig`

### Version 1.11.0 (Draft)

- **Added**: `sessionTimeout` field to gateway configuration (Section 4.1.3, 4.1.3.6)
- Optional Go duration string for controlling MCP session lifetime (e.g. `"4h"`, `"30m"`)
- Precedence: stdin config `sessionTimeout` > `MCP_GATEWAY_SESSION_TIMEOUT` env var > gateway default (6h)
- Workflow authors configure via `engine.mcp.session-timeout` in frontmatter; the compiler validates (min 5m, no upper bound) and emits it into the gateway config JSON
- **Added**: Section 4.1.3.6 — Session Timeout Configuration
- **Renumbered**: Former Section 4.1.3.6 (OpenTelemetry) to 4.1.3.7
- **Updated**: JSON Schema with `sessionTimeout` property in `gatewayConfig` definition

### Version 1.10.0 (Draft)

- **Added**: `keepaliveInterval` field to gateway configuration (Section 4.1.3, 4.1.3.5)
Expand Down
3 changes: 3 additions & 0 deletions pkg/constants/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,9 @@ const DefaultToolTimeout = 60 * time.Second
// DefaultMCPStartupTimeout is the default timeout for MCP server startup
const DefaultMCPStartupTimeout = 120 * time.Second

// MCPSessionTimeoutMin is the minimum allowed value for engine.mcp.session-timeout (5 minutes).
const MCPSessionTimeoutMin = 5 * time.Minute

// DefaultActivationJobRunnerImage is the default runner image for activation and pre-activation jobs
const DefaultActivationJobRunnerImage = "ubuntu-slim"

Expand Down
12 changes: 12 additions & 0 deletions pkg/parser/schemas/main_workflow_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -9819,6 +9819,18 @@
"type": "boolean",
"description": "When true, disables automatic loading of context and custom instructions by the AI engine. The engine-specific flag depends on the engine: copilot uses --no-custom-instructions (suppresses .github/AGENTS.md and user-level custom instructions), claude uses --bare (suppresses CLAUDE.md memory files), codex uses --no-system-prompt (suppresses the default system prompt), gemini sets GEMINI_SYSTEM_MD=/dev/null (overrides the built-in system prompt with an empty one). Defaults to false.",
"default": false
},
"mcp": {
"type": "object",
"description": "Engine-level MCP gateway configuration. Settings here apply to the MCP gateway used by this engine.",
"properties": {
"session-timeout": {
"type": "string",
"description": "Session timeout for MCP gateway sessions as a Go duration string (e.g. \"30m\", \"4h\", \"24h\"). Must be at least 5m (no upper bound). Omitted or empty uses the effective gateway default (precedence: this field > MCP_GATEWAY_SESSION_TIMEOUT env var > built-in default 6h). Longer timeouts benefit multi-hour workflows such as large-scale migrations; shorter values free gateway resources sooner.",
"examples": ["30m", "1h", "4h", "6h", "12h"]
}
},
"additionalProperties": false
}
},
"required": ["id"],
Expand Down
5 changes: 5 additions & 0 deletions pkg/workflow/compiler_orchestrator_workflow.go
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,11 @@ func (c *Compiler) ParseWorkflowFile(markdownPath string) (*WorkflowData, error)
return nil, fmt.Errorf("%s: %w", cleanPath, err)
}

// Validate optional engine.mcp.session-timeout configuration.
if err := c.validateEngineMCPSessionTimeout(workflowData); err != nil {
return nil, fmt.Errorf("%s: %w", cleanPath, err)
}
Comment on lines +84 to +87

// Validate that inlined-imports is not used with agent file imports.
// Agent files require runtime access and cannot be resolved without sources.
if workflowData.InlinedImports && engineSetup.importsResult.AgentFile != "" {
Expand Down
5 changes: 5 additions & 0 deletions pkg/workflow/compiler_string_api.go
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,11 @@ func (c *Compiler) ParseWorkflowString(content string, virtualPath string) (*Wor
return nil, fmt.Errorf("%s: %w", cleanPath, err)
}

// Validate optional engine.mcp.session-timeout configuration.
if err := c.validateEngineMCPSessionTimeout(workflowData); err != nil {
return nil, fmt.Errorf("%s: %w", cleanPath, err)
}

// Validate GitHub tool configuration
if err := validateGitHubToolConfig(workflowData.ParsedTools, workflowData.Name); err != nil {
return nil, fmt.Errorf("%s: %w", cleanPath, err)
Expand Down
16 changes: 16 additions & 0 deletions pkg/workflow/engine.go
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,9 @@ type EngineConfig struct {

// Extended inline request shaping fields (engine.provider.request.*)
InlineProviderRequest *RequestShape // request shaping parsed from engine.provider.request

// MCP gateway configuration from engine.mcp sub-object
MCPSessionTimeout string // session-timeout: Go duration string for MCP gateway sessions (e.g. "4h", "30m")
}

// NetworkPermissions represents network access permissions for workflow execution
Expand Down Expand Up @@ -318,6 +321,19 @@ func (c *Compiler) ExtractEngineConfig(frontmatter map[string]any) (string, *Eng
}
}

// Extract optional 'mcp' sub-object (engine-level MCP gateway configuration)
if mcpVal, hasMCP := engineObj["mcp"]; hasMCP {
if mcpObj, ok := mcpVal.(map[string]any); ok {
// Extract session-timeout (kebab-case only; camelCase is not supported)
if stVal, hasSessionTimeout := mcpObj["session-timeout"]; hasSessionTimeout {
if stStr, ok := stVal.(string); ok && stStr != "" {
config.MCPSessionTimeout = stStr
engineLog.Printf("Extracted engine.mcp.session-timeout: %s", config.MCPSessionTimeout)
}
}
}
}

// Return the ID as the engineSetting for backwards compatibility
engineLog.Printf("Extracted engine configuration: ID=%s", config.ID)
return config.ID, config
Expand Down
55 changes: 55 additions & 0 deletions pkg/workflow/engine_config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -870,3 +870,58 @@ func TestBareMode_UnsupportedEngineNoFlag(t *testing.T) {
}
})
}

// TestEngineMCPSessionTimeoutExtraction tests extraction of engine.mcp.session-timeout.
func TestEngineMCPSessionTimeoutExtraction(t *testing.T) {
compiler := NewCompiler()

tests := []struct {
name string
frontmatter map[string]any
expectedTimeout string
}{
{
name: "extracts session-timeout from engine.mcp",
frontmatter: map[string]any{
"engine": map[string]any{
"id": "copilot",
"mcp": map[string]any{
"session-timeout": "4h",
},
},
},
expectedTimeout: "4h",
},
{
name: "no mcp section - empty session timeout",
frontmatter: map[string]any{
"engine": map[string]any{
"id": "copilot",
},
},
expectedTimeout: "",
},
{
name: "mcp section without session-timeout - empty",
frontmatter: map[string]any{
"engine": map[string]any{
"id": "copilot",
"mcp": map[string]any{},
},
},
expectedTimeout: "",
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
_, config := compiler.ExtractEngineConfig(tt.frontmatter)
if config == nil {
t.Fatal("Expected non-nil config")
}
if config.MCPSessionTimeout != tt.expectedTimeout {
t.Errorf("MCPSessionTimeout = %q, want %q", config.MCPSessionTimeout, tt.expectedTimeout)
}
})
}
}
23 changes: 23 additions & 0 deletions pkg/workflow/engine_validation.go
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ import (
"path/filepath"
"regexp"
"strings"
"time"

"github.com/github/gh-aw/pkg/console"
"github.com/github/gh-aw/pkg/constants"
Expand Down Expand Up @@ -108,6 +109,28 @@ func (c *Compiler) validateEngineHarnessScript(workflowData *WorkflowData) error
}
}

// validateEngineMCPSessionTimeout validates optional engine.mcp.session-timeout configuration.
// The value must be a valid Go duration string of at least 5m (no upper bound).
func (c *Compiler) validateEngineMCPSessionTimeout(workflowData *WorkflowData) error {
if workflowData == nil || workflowData.EngineConfig == nil || workflowData.EngineConfig.MCPSessionTimeout == "" {
return nil
}

raw := workflowData.EngineConfig.MCPSessionTimeout

d, err := time.ParseDuration(raw)
if err != nil {
return fmt.Errorf("engine.mcp.session-timeout: invalid duration %q. Must be a valid Go duration string (e.g. \"30m\", \"4h\", \"24h\").\n\nExamples:\n engine:\n mcp:\n session-timeout: 4h\n\nSee: %s", raw, constants.DocsEnginesURL)
}

if d < constants.MCPSessionTimeoutMin {
return fmt.Errorf("engine.mcp.session-timeout: %q is too short (minimum is 5m).\n\nExamples:\n session-timeout: 30m\n session-timeout: 4h\n\nSee: %s", raw, constants.DocsEnginesURL)
}

engineValidationLog.Printf("engine.mcp.session-timeout validated: %s (%s)", raw, d)
return nil
}

// validateEngineInlineDefinition validates an inline engine definition parsed from
// engine.runtime + optional engine.provider in the workflow frontmatter.
// Returns an error if:
Expand Down
Loading
Loading