MCP gateway fails on ARC self-hosted runners with dind sidecar — "Invalid container ID format" + "Docker socket not found"

## Summary

When an agentic workflow runs on a self-hosted runner deployed via [actions-runner-controller (ARC)](https://github.com/actions/actions-runner-controller) on Kubernetes with the recommended Docker-in-Docker (dind) sidecar pattern, the MCP gateway fails to start with `Docker socket not found at /var/run/docker.sock`. This happens even though a Unix Docker socket is correctly exposed on the runner pod and `DOCKER_HOST=unix:///var/run/docker.sock` is set on the runner container.

Workflow execution aborts at the gateway startup step.

## Environment

- `gh-aw` CLI: **v0.71.1**
- `gh-aw-mcpg` image: `ghcr.io/github/gh-aw-mcpg:v0.3.0`
- ARC: `gha-runner-scale-set` Helm chart **0.13.1** (OCI registry: `ghcr.io/actions/actions-runner-controller-charts`)
- Runner image: `ghcr.io/actions/actions-runner:2.333.1`
- Kubernetes platform: AWS EKS (`containerd` runtime)
- Pod Security Admission: namespace labelled `pod-security.kubernetes.io/enforce: privileged`

## Runner pod configuration (relevant excerpt)

Standard ARC dind sidecar pattern with K8s native sidecars (`restartPolicy: Always` on an initContainer) and a shared `emptyDir` for the Docker socket:

```yaml
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:2.333.1
        # copies runner externals to a shared emptyDir
      - name: dind
        image: docker:dind
        restartPolicy: Always   # K8s native sidecar
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)   # 123 in our setup
        securityContext:
          privileged: true
        volumeMounts:
          - name: dind-sock
            mountPath: /var/run
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:2.333.1
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        securityContext:
          privileged: true
        volumeMounts:
          - name: dind-sock
            mountPath: /var/run
    volumes:
      - name: dind-sock
        emptyDir: {}
```

This is the layout described in the [ARC documentation for Docker-in-Docker](https://github.com/actions/actions-runner-controller/blob/master/docs/preview/gha-runner-scale-set-controller/README.md#dind-mode). From the runner container's perspective, `/var/run/docker.sock` is a real Unix socket and `docker` commands work.

## Observed behaviour

The compiled workflow logs the gateway launch command:

```
[info] Starting gateway with container: docker run -i --rm --network host \
  --add-host host.docker.internal:127.0.0.1 \
  --user 1001:1001 --group-add 0 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e MCP_GATEWAY_PORT -e MCP_GATEWAY_DOMAIN ... \
  ghcr.io/github/gh-aw-mcpg:v0.3.0
```

The gateway crashes during initialization:

```
[INFO] Starting MCP Gateway in containerized mode...
[INFO] Auto-detected baked-in WASM guards at /guards
[INFO] MCP_GATEWAY_WASM_GUARDS_DIR=/guards
[INFO] Running in containerized environment
[WARN] Invalid container ID format: arc-gaw-xzpj8-runner-8lthc
[WARN] Could not determine container ID
Error:  Docker socket not found at /var/run/docker.sock
Error:  Mount the Docker socket: -v /var/run/docker.sock:/var/run/docker.sock

Error: Process completed with exit code 1.
```

## Root-cause analysis

We traced the failure to three independent issues in `gh-aw` and `gh-aw-mcpg`. All three are specific to Kubernetes/ARC environments and would not surface on GitHub-hosted runners.

### 1. Container-ID detection rejects K8s/containerd cgroup names

`gh-aw-mcpg/run_containerized.sh` (lines 49-53) parses the cgroup hierarchy and validates the extracted ID against the Docker hash format:

```bash
# Container IDs must be 12-64 hex characters only
if ! echo "$cid" | grep -qE '^[a-f0-9]{12,64}$'; then
    log_warn "Invalid container ID format: $cid"
    return 1
fi
```

On EKS with containerd (and on dind with default cgroup namespacing), the gateway's `/proc/self/cgroup` view contains the K8s pod name — `arc-gaw-xzpj8-runner-8lthc` in our log — not a Docker container hash. The regex rejects it and the script falls back to defaults.

This is logged as a `[WARN]` and is not directly fatal, but together with the second issue below it puts the gateway on an avoidable failure path.
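
One way to keep a detection miss informational is to classify the extracted value rather than reject everything that is not a Docker hash. The following is a minimal sketch, not code from `gh-aw-mcpg`: the function name `classify_container_id` and the labels it prints are hypothetical, and the second pattern only approximates Kubernetes object names.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of gh-aw-mcpg): label a candidate container ID
# instead of failing on anything that is not a Docker hash.
classify_container_id() {
    local cid="$1"
    if printf '%s\n' "$cid" | grep -qE '^[a-f0-9]{12,64}$'; then
        # Same check as the existing script: 12-64 lowercase hex characters.
        echo "docker-hash"
    elif printf '%s\n' "$cid" | grep -qE '^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$'; then
        # Assumption: lowercase alphanumerics, '-' and '.' look like a K8s object name.
        echo "k8s-name"
    else
        echo "unknown"
    fi
}
```

With the pod name from our log, `classify_container_id arc-gaw-xzpj8-runner-8lthc` prints `k8s-name`, which a caller could log at info level and continue.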

### 2. `DOCKER_HOST` is set on the runner but not propagated to the gateway container

The same script (lines 87-94) does honour `DOCKER_HOST` when present:

```bash
local socket_path="${DOCKER_HOST:-/var/run/docker.sock}"
socket_path="${socket_path#unix://}"

if [ ! -S "$socket_path" ]; then
    log_error "Docker socket not found at $socket_path"
    log_error "Mount the Docker socket: -v /var/run/docker.sock:/var/run/docker.sock"
    exit 1
fi
```

…but the `docker run` command generated by `gh-aw` to launch the gateway does **not** pass `-e DOCKER_HOST` from the runner to the gateway container. So even when the runner exports `DOCKER_HOST=unix:///var/run/docker.sock` (or `tcp://…` for TCP-only dind), the gateway always falls back to the hardcoded `/var/run/docker.sock`.

In our specific case the default path happens to be the same, so this is not what triggers the failure. It is still a latent bug: no custom socket path or TCP daemon can work while the variable is not propagated.
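
If `DOCKER_HOST` were propagated, the socket check would also need to distinguish Unix sockets from TCP endpoints before running `[ -S ... ]`. Below is a rough sketch of that distinction; `resolve_docker_endpoint` is a hypothetical name and the `kind value` output format is our own assumption, not something the gateway script defines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a DOCKER_HOST value into "<kind> <address>".
# Falls back to the same default socket path the existing script uses.
resolve_docker_endpoint() {
    local host="${1:-unix:///var/run/docker.sock}"
    case "$host" in
        unix://*) echo "unix ${host#unix://}" ;;   # strip the scheme, keep the path
        tcp://*)  echo "tcp ${host#tcp://}" ;;     # host:port, no socket file to test
        /*)       echo "unix $host" ;;             # bare absolute path
        *)        echo "unsupported $host"; return 1 ;;
    esac
}
```

A caller would run the `-S` test only for the `unix` kind and skip it for `tcp`.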

### 3. The bind-mounted socket isn't visible as a socket inside the gateway

The `[ -S /var/run/docker.sock ]` test fails inside the gateway container, even though the source path is a real Unix socket on the dind sidecar's filesystem and the bind mount is generated correctly by `gh-aw`.

We have not fully root-caused this yet. Plausible causes:

- **GID mismatch on the bind-mounted socket.** The dind daemon creates the socket with ownership `root:123` (custom `DOCKER_GROUP_GID`, common in `docker:dind`). The gateway runs as `--user 1001:1001 --group-add 0`. The hardcoded `0` matches root group on GitHub-hosted runners (the v0.68.6 fix from #26750/#26771) but is ineffective here. While `[ -S ]` should only require directory traversal permission, the actual `stat()` may fail under certain mount-namespace propagation modes when the file is unreadable to the calling user.
- **Mount-namespace propagation oddity** when `dockerd` bind-mounts its own listening socket into a child container while both processes share the same `emptyDir` parent mount.
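- **`--user 1001:1001` clashing with the file ownership** in a way that makes Docker silently substitute an empty directory. Docker is known to create an empty directory when a `-v` source path does not exist on the daemon's filesystem; whether the user or ownership settings can trigger something similar here is unverified.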
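
To tell these hypotheses apart, it would help to know what the gateway actually sees at the mount target. The sketch below is a small diagnostic under our own naming (`describe_path` is hypothetical); it only reports whether a path is a socket, a directory, some other file type, or missing.

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic: report what kind of filesystem object a path is.
describe_path() {
    local p="$1"
    if [ -S "$p" ]; then
        echo "socket"
    elif [ -d "$p" ]; then
        echo "directory"     # would match the "empty directory substituted" theory
    elif [ -e "$p" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Running the equivalent check in a throwaway container started with the gateway's flags (for example `docker run --rm --user 1001:1001 --group-add 0 -v /var/run/docker.sock:/var/run/docker.sock <image> ls -ld /var/run/docker.sock`, where `<image>` is any image that ships `ls`) should show whether the target is a directory, an unreadable socket, or absent.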

## Suggested directions

This is not a single bug — it's a class of incompatibilities with K8s-based self-hosted runners. We see three orthogonal improvements that would unblock a wide range of ARC setups:

1. **Propagate `DOCKER_HOST` to the MCP gateway container.** Add `-e DOCKER_HOST` to the generated `docker run` invocation. Smallest change, biggest impact for self-hosted users.
2. **Make `--group-add` configurable / auto-detected.** Either add a CLI flag (e.g. `--docker-group-gid`) or detect the docker socket's GID at startup (`stat -c '%g' "$socket_path"`) and pass it through. Hardcoding `0` only works on GitHub-hosted runners.
3. **Make container-ID detection in `gh-aw-mcpg` robust to non-Docker cgroup formats.** Add a containerd/Kubernetes path, or treat detection failure as informational rather than letting it influence downstream logic.
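
As an illustration of the second direction, the socket's GID could be read on the runner and turned into the `--group-add` argument. This is only a sketch: `group_add_args` is a hypothetical name, `stat -c` assumes GNU coreutils or BusyBox, and falling back to `0` simply mirrors the current hardcoded value.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the --group-add argument from the socket's owning group.
group_add_args() {
    local socket_path="$1" gid
    if gid=$(stat -c '%g' "$socket_path" 2>/dev/null); then
        echo "--group-add $gid"
    else
        # Assumption: keep today's behaviour when the GID cannot be read.
        echo "--group-add 0"
    fi
}
```

On our runner, where the socket is owned by `root:123`, this would yield `--group-add 123`.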

A documentation page on **"running gh-aw on ARC"** (analogous to [self-hosted runners](https://github.github.com/gh-aw/guides/self-hosted-runners/)) covering the dind sidecar pattern, the required `--group-add` GID, and the `DOCKER_HOST` propagation knob, would also be very welcome.

We're happy to test any patch on our ARC + EKS stack.

## Related

- #25511 (workflow-wide DinD breaks `gh aw` workflows) — closed by #26750 / #26771 in v0.68.6, which addresses Unix socket GID on GitHub-hosted runners only.
- #18188 (support execution on self-hosted without sudo) — closed as wontfix.
- #18385 (Squid config error on self-hosted ARC runners) — adjacent ARC-specific issue.
- `gh-aw-firewall/src/docker-manager.ts` `getLocalDockerEnv()` intentionally strips TCP `DOCKER_HOST` values to force usage of a local Unix socket — same design assumption that breaks on ARC.

