[awf] Agent Container: intermittent engine binary missing at startup (node/codex not found) #2247

@lpcox

Problem

Agentic workflows using Copilot CLI (Node.js-based) and Codex fail at the very first step with node: command not found or codex: command not found. The agent job is killed before any agent output is produced.

Three workflows were affected in a burst of roughly 2 hours 15 minutes on 2026-04-27, 09:25–11:41 UTC.

Other Copilot CLI and Codex runs in the same 6-hour window succeeded, which indicates the failure is intermittent and likely specific to particular runner slots.

Context

Original report: github/gh-aw#28726

The agent execution container (containers/agent/) bind-mounts selected host system directories (/usr, /bin, /sbin, /lib, /lib64, /opt) read-only under /host/. The entrypoint.sh script chroots into /host before running the user command. If the runner slot that supplies these host paths has no node or codex binary in a PATH directory (e.g., /usr/local/bin/node or /usr/bin/node), the binary is absent inside the chroot and the command fails immediately.
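As a rough illustration of the mechanism above, the following shell sketch looks for an engine binary under the /host prefix before any chroot happens. The function name find_under_root and its arguments are hypothetical and are not taken from entrypoint.sh; the sketch also assumes the PATH that the command will see inside the chroot is known in advance.

```shell
# Hypothetical helper, not part of containers/agent/entrypoint.sh.
# find_under_root ROOT NAME [PATHLIST]
# Searches each directory of PATHLIST (colon-separated, defaults to $PATH),
# prefixed with ROOT, for an executable regular file called NAME.
# Prints the first match and returns 0; returns 1 if nothing is found.
find_under_root() {
    root="${1%/}"            # drop one trailing slash so paths join cleanly
    name="$2"
    pathlist="${3:-$PATH}"
    old_ifs="$IFS"
    IFS=:
    for dir in $pathlist; do
        candidate="$root$dir/$name"
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            IFS="$old_ifs"
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}
```

For example, find_under_root /host node "/usr/local/bin:/usr/bin" would print the first match or return 1. One caveat: from outside the chroot, an absolute symlink under /host is resolved against the container's own root, so this check can report a false miss. Resolving the command inside the chroot avoids that.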

Root Cause

Runner slot(s) were provisioned without the engine binary present on the host filesystem during the 09:25–11:41 UTC window. Because entrypoint.sh in containers/agent/ chroots to /host and runs the user command directly, any binary missing from the host's PATH will also be missing inside the container. There is no pre-flight check in containers/agent/entrypoint.sh to verify that the resolved command binary actually exists before executing.

Relevant files:

  • containers/agent/entrypoint.sh — chroot and command execution logic
  • src/docker-manager.ts — generates the Docker Compose config and bind mounts
  • src/cli.ts — orchestrates container startup and command execution

Proposed Solution

  1. Add a pre-flight binary check in containers/agent/entrypoint.sh: Before executing the user command, resolve the binary via command -v (or which) inside the chroot and emit a clear diagnostic message if it is not found. Fail fast with exit code 127 and a human-readable error pointing to missing PATH setup, rather than letting the shell print a cryptic command not found.

  2. Add a node --version / engine health-check step to the AWF wrapper CLI (src/cli.ts): After startContainers() and before runAgentCommand(), optionally execute a lightweight check that the target binary is resolvable inside the agent container (e.g., docker exec awf-agent sh -c 'command -v node'; command is a shell builtin, so docker exec needs a shell to run it). If the check fails, abort with a clear error before wasting runner time.

  3. Document the runner requirement: Update README / docs/environment.md to explicitly state that the host runner must have the engine binary (e.g., node, codex) in a system PATH directory that is bind-mounted into the agent container (/usr, /bin, /sbin, /opt).

  4. Investigate intermittent provisioning: Work with the runner infrastructure team to determine whether a specific runner pool/label intermittently lacks node or codex on the host filesystem.
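
A minimal sketch of the pre-flight check from item 1, assuming it runs in the same shell context that will execute the user command (that is, already inside the chroot). The function name preflight_check and the message wording are illustrative, not existing code in containers/agent/entrypoint.sh.

```shell
# Hypothetical pre-flight helper; names and messages are illustrative only.
# preflight_check CMD
# Returns 0 if CMD resolves in the current shell, otherwise prints a
# diagnostic to stderr and returns 127 (the conventional "not found" status).
preflight_check() {
    bin="$1"
    if [ -z "$bin" ]; then
        echo "[preflight] ERROR: no command given" >&2
        return 127
    fi
    if resolved="$(command -v "$bin" 2>/dev/null)"; then
        echo "[preflight] $bin resolved to $resolved"
        return 0
    fi
    echo "[preflight] ERROR: '$bin' not found on PATH ($PATH)." >&2
    echo "[preflight] The host runner must provide it in a bind-mounted directory (/usr, /bin, /sbin, /opt)." >&2
    return 127
}

# Possible call site, just before the user command is executed:
#   preflight_check "$1" || exit 127
#   exec "$@"
```

Note that command -v also succeeds for shell builtins and functions, so this confirms that the name is resolvable rather than that a file exists on disk. For engine binaries such as node or codex that distinction should not matter in practice.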

Generated by Firewall Issue Dispatcher
