MCP gateway fails on ARC self-hosted runners with dind sidecar — "Invalid container ID format" + "Docker socket not found" #28888

@clementbolin

Summary

When running an agentic workflow on a self-hosted runner deployed via actions-runner-controller (ARC) on Kubernetes, with the recommended Docker-in-Docker (dind) sidecar pattern, the MCP gateway fails to start with "Docker socket not found at /var/run/docker.sock", even though a Unix Docker socket is correctly exposed on the runner pod and DOCKER_HOST=unix:///var/run/docker.sock is set on the runner container.

Workflow execution aborts at the gateway startup step.

Environment

  • gh-aw CLI: v0.71.1
  • gh-aw-mcpg image: ghcr.io/github/gh-aw-mcpg:v0.3.0
  • ARC: gha-runner-scale-set Helm chart 0.13.1 (OCI registry: ghcr.io/actions/actions-runner-controller-charts)
  • Runner image: ghcr.io/actions/actions-runner:2.333.1
  • Kubernetes platform: AWS EKS (containerd runtime)
  • Pod Security Admission: namespace labelled pod-security.kubernetes.io/enforce: privileged

Runner pod configuration (relevant excerpt)

Standard ARC dind sidecar pattern with K8s native sidecars (restartPolicy: Always on an initContainer) and a shared emptyDir for the Docker socket:

template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:2.333.1
        # copies runner externals to a shared emptyDir
      - name: dind
        image: docker:dind
        restartPolicy: Always   # K8s native sidecar
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)   # 123 in our setup
        securityContext:
          privileged: true
        volumeMounts:
          - name: dind-sock
            mountPath: /var/run
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:2.333.1
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        securityContext:
          privileged: true
        volumeMounts:
          - name: dind-sock
            mountPath: /var/run
    volumes:
      - name: dind-sock
        emptyDir: {}

This is the layout described in the ARC documentation for Docker-in-Docker. From the runner container's perspective, /var/run/docker.sock is a real Unix socket and docker commands work.

Observed behaviour

The compiled workflow logs the gateway launch command:

[info] Starting gateway with container: docker run -i --rm --network host \
  --add-host host.docker.internal:127.0.0.1 \
  --user 1001:1001 --group-add 0 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e MCP_GATEWAY_PORT -e MCP_GATEWAY_DOMAIN ... \
  ghcr.io/github/gh-aw-mcpg:v0.3.0

The gateway crashes during initialization:

[INFO] Starting MCP Gateway in containerized mode...
[INFO] Auto-detected baked-in WASM guards at /guards
[INFO] MCP_GATEWAY_WASM_GUARDS_DIR=/guards
[INFO] Running in containerized environment
[WARN] Invalid container ID format: arc-gaw-xzpj8-runner-8lthc
[WARN] Could not determine container ID
Error:  Docker socket not found at /var/run/docker.sock
Error:  Mount the Docker socket: -v /var/run/docker.sock:/var/run/docker.sock

Error: Process completed with exit code 1.

Root-cause analysis

We traced the failure to three independent issues in gh-aw and gh-aw-mcpg. All three are specific to Kubernetes/ARC environments and would not surface on GitHub-hosted runners.

1. Container-ID detection rejects K8s/containerd cgroup names

gh-aw-mcpg/run_containerized.sh (lines 49-53) parses the cgroup hierarchy and validates the extracted ID against the Docker hash format:

# Container IDs must be 12-64 hex characters only
if ! echo "$cid" | grep -qE '^[a-f0-9]{12,64}$'; then
    log_warn "Invalid container ID format: $cid"
    return 1
fi

On EKS with containerd (and on dind with default cgroup namespacing), the gateway's /proc/self/cgroup view contains the K8s pod name — arc-gaw-xzpj8-runner-8lthc in our log — not a Docker container hash. The regex rejects it and the script falls back to defaults.

This is logged as a [WARN] and is not directly fatal, but combined with point 2 below it leads to an avoidable failure path.
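To illustrate what a more tolerant detection could look like, here is a minimal sketch. The function names extract_container_id and detect_container_id are hypothetical and do not come from run_containerized.sh, and the assumption that the last 64-character hex run in a cgroup path is the innermost container is ours. We have not run this against the real script.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the functions in run_containerized.sh.

# Reads cgroup text (the format of /proc/self/cgroup) on stdin and prints
# the last 64-character hex run found. On nested setups we assume that
# run belongs to the innermost container.
extract_container_id() {
    local cid
    cid="$(grep -oE '[a-f0-9]{64}' | tail -n 1)"
    [ -n "$cid" ] || return 1
    printf '%s\n' "$cid"
}

# Treats a failed detection as informational: logs to stderr and
# returns success so that later steps are not affected.
detect_container_id() {
    local cgroup_file="${1:-/proc/self/cgroup}"
    if [ -r "$cgroup_file" ] && extract_container_id < "$cgroup_file"; then
        return 0
    fi
    echo "container ID not detected; continuing without it" >&2
    return 0
}
```

With a value like the pod name in our log there is no hex run to extract, so this sketch would report "not detected" and carry on rather than emit a format warning.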

2. DOCKER_HOST is set on the runner but not propagated to the gateway container

The same script (lines 87-94) does honour DOCKER_HOST when present:

local socket_path="${DOCKER_HOST:-/var/run/docker.sock}"
socket_path="${socket_path#unix://}"

if [ ! -S "$socket_path" ]; then
    log_error "Docker socket not found at $socket_path"
    log_error "Mount the Docker socket: -v /var/run/docker.sock:/var/run/docker.sock"
    exit 1
fi

…but the docker run command generated by gh-aw to launch the gateway does not pass -e DOCKER_HOST from the runner to the gateway container. So even when the runner exports DOCKER_HOST=unix:///var/run/docker.sock (or tcp://… for TCP-only dind), the gateway always falls back to the hardcoded /var/run/docker.sock.

In our specific case the default path happens to be the same as the one we set, so this is not what triggers the failure. It is still a latent bug that prevents any custom socket path or TCP daemon from ever working.
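A rough sketch of the propagation we have in mind follows. The helper name gateway_docker_args is hypothetical, and we do not know how gh-aw actually assembles its docker run command, so this only shows the intended argument logic: pass DOCKER_HOST through when the runner sets it, and mount the socket path it names when it is a unix:// address.

```shell
#!/usr/bin/env bash
# Sketch only: gateway_docker_args is a hypothetical helper, not gh-aw code.
# Prints extra `docker run` arguments, one per line.
gateway_docker_args() {
    local host="${DOCKER_HOST:-unix:///var/run/docker.sock}"
    case "$host" in
        unix://*)
            # Mount the socket at the same path inside the container,
            # since the gateway script tests the path taken from DOCKER_HOST.
            local sock="${host#unix://}"
            printf '%s\n' -v "${sock}:${sock}"
            ;;
    esac
    # `-e NAME` without a value forwards the caller's value of NAME.
    if [ -n "${DOCKER_HOST:-}" ]; then
        printf '%s\n' -e DOCKER_HOST
    fi
}

# Possible use (not executed here):
#   mapfile -t extra < <(gateway_docker_args)
#   docker run -i --rm "${extra[@]}" ghcr.io/github/gh-aw-mcpg:v0.3.0
```

For a tcp:// daemon this emits only -e DOCKER_HOST and no bind mount. The gateway's own socket check would still need to skip the [ -S ] test in that case.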

3. The bind-mounted socket isn't visible as a socket inside the gateway

The [ -S /var/run/docker.sock ] test fails inside the gateway container, even though the source path is a real Unix socket on the dind sidecar's filesystem and the bind mount is generated correctly by gh-aw.

We have not fully root-caused this yet. Plausible causes:

  • GID mismatch on the bind-mounted socket. The dind daemon creates the socket with ownership root:123 (custom DOCKER_GROUP_GID, common in docker:dind). The gateway runs as --user 1001:1001 --group-add 0. The hardcoded 0 matches the root group on GitHub-hosted runners (the v0.68.6 fix from #26750, "Add Docker socket supplementary group to MCP gateway container command", and #26771, "fix: compute Docker socket GID separately for shell expansion") but is ineffective here. In principle [ -S ] only needs search permission on the parent directories, so a GID mismatch alone should not make it fail; we suspect the underlying stat() may still fail under certain mount-namespace propagation modes when the file is unreadable to the calling user.
  • Mount-namespace propagation oddity when dockerd bind-mounts its own listening socket into a child container while both processes share the same emptyDir parent mount.
  • --user 1001:1001 clashing with the file ownership in a way that makes Docker silently substitute an empty directory (a behaviour that can occur when a bind mount target conflicts with image content).
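To narrow these hypotheses down, a small diagnostic along the following lines could be run once in the runner container and once in a container started with the same --user, --group-add and -v flags as the gateway. The helper names classify_path and report_socket are ours, and stat -c assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical diagnostic helpers for comparing how the
# socket path looks from different containers.

# Prints one of: socket, directory, other, missing.
classify_path() {
    local p="$1"
    if [ -S "$p" ]; then
        echo socket
    elif [ -d "$p" ]; then
        echo directory
    elif [ -e "$p" ]; then
        echo other
    else
        echo missing
    fi
}

# Prints the path type, its ownership and mode, and the caller's identity.
report_socket() {
    local p="${1:-/var/run/docker.sock}"
    echo "path:   $p"
    echo "type:   $(classify_path "$p")"
    if [ -e "$p" ]; then
        echo "owner:  $(stat -c '%u:%g mode %a' "$p")"
    fi
    echo "caller: uid=$(id -u) groups=$(id -G)"
}
```

A result of "directory" inside the gateway would point towards the substituted-directory hypothesis (Docker's -v flag is known to create a directory when the source path does not exist on the daemon's host), while "missing" would point towards a mount propagation problem. A result of "socket" with a GID absent from the caller's groups would suggest the check passes and the failure lies elsewhere.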

Suggested directions

This is not a single bug — it's a class of incompatibilities with K8s-based self-hosted runners. We see three orthogonal improvements that would unblock a wide range of ARC setups:

  1. Propagate DOCKER_HOST to the MCP gateway container. Add -e DOCKER_HOST to the generated docker run invocation. Smallest change, biggest impact for self-hosted users.
  2. Make --group-add configurable / auto-detected. Either add a CLI flag (e.g. --docker-group-gid) or detect the docker socket's GID at startup (stat -c '%g' "$socket_path") and pass it through. Hardcoding 0 only works on GitHub-hosted runners.
  3. Make container-ID detection in gh-aw-mcpg robust to non-Docker cgroup formats. Add a containerd/Kubernetes path, or treat detection failure as informational rather than letting it influence downstream logic.
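As a rough illustration of direction 2, the GID could be read from the socket at launch time and turned into a --group-add argument. The helper name group_add_for is hypothetical, and stat -c '%g' assumes GNU coreutils on the runner. The sketch only checks that the path exists, so it can be exercised against any file.

```shell
#!/usr/bin/env bash
# Sketch only: group_add_for is a hypothetical helper, not gh-aw code.
# Prints "--group-add <gid>" for the group owning the given path,
# or returns 1 without output when the path does not exist.
group_add_for() {
    local sock="${1:-/var/run/docker.sock}"
    local gid
    [ -e "$sock" ] || return 1
    gid="$(stat -c '%g' "$sock")" || return 1
    printf '%s\n' "--group-add $gid"
}

# Possible use (not executed here); falls back to the current hardcoded 0:
#   group_arg="$(group_add_for /var/run/docker.sock || echo '--group-add 0')"
#   docker run -i --rm --user 1001:1001 $group_arg ...
```

On our setup this would yield --group-add 123 instead of the hardcoded --group-add 0.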

A documentation page on "running gh-aw on ARC" (analogous to self-hosted runners) covering the dind sidecar pattern, the required --group-add GID, and the DOCKER_HOST propagation knob, would also be very welcome.

We're happy to test any patch on our ARC + EKS stack.
