Problem
The awf-api-proxy sidecar container intermittently fails its Docker health check during docker compose up. This blocks agent startup entirely across multiple engines (copilot, claude) and workflow types. At least 3 confirmed runs failed in the 2026-04-28 07:00–13:00 UTC window.
Affected runs:
- Sub-Issue Closer (copilot) — run 25049338576
- Daily Team Evolution Insights (claude) — run 25049605437
- Smoke CI (copilot on PR trigger) — run 25052667955
Context
Parent report: github/gh-aw#28947.
Detailed issue: github/gh-aw#28949.
This is a higher-frequency occurrence of the same pattern reported in #28898.
Root Cause
The awf-api-proxy health check in the generated docker-compose.yml (produced by src/docker-manager.ts → generateDockerCompose()) fires before the Node.js proxy process has bound its ports. The start_period is likely too short to accommodate:
- Container image cold pulls
- GitHub-hosted runner CPU contention under burst load
- Node.js module initialization latency
The containers/api-proxy/ healthcheck endpoint (likely GET /healthz or GET /) must respond within the configured timeout window, and when it doesn't, Docker marks the container unhealthy, triggering depends_on: condition: service_healthy failure in the agent service.
Proposed Solution
- Immediate (in
src/docker-manager.ts): Increase api-proxy healthcheck start_period to 30–45s and retries to 5–8. This is a low-risk change that directly reduces flap frequency.
- Medium-term: Add a lightweight
/healthz route to containers/api-proxy/ if one doesn't exist, returning 200 OK as soon as the HTTP server is listening — avoiding any dependency on upstream connectivity in the health check.
- Resilience: In
containers/agent/entrypoint.sh, add a pre-flight check that polls the proxy ports before proceeding, with a configurable timeout (e.g., AWF_PROXY_STARTUP_TIMEOUT_S=60).
- Observability: Log the proxy container startup time in debug mode so flap frequency can be correlated with runner load.
Generated by Firewall Issue Dispatcher · ● 436.3K · ◷
Problem
The
awf-api-proxysidecar container intermittently fails its Docker health check duringdocker compose up. This blocks agent startup entirely across multiple engines (copilot, claude) and workflow types. At least 3 confirmed runs failed in the 2026-04-28 07:00–13:00 UTC window.Affected runs:
Context
Parent report: github/gh-aw#28947.
Detailed issue: github/gh-aw#28949.
This is a higher-frequency occurrence of the same pattern reported in #28898.
Root Cause
The
awf-api-proxyhealth check in the generateddocker-compose.yml(produced bysrc/docker-manager.ts→generateDockerCompose()) fires before the Node.js proxy process has bound its ports. Thestart_periodis likely too short to accommodate:The
containers/api-proxy/healthcheck endpoint (likelyGET /healthzorGET /) must respond within the configuredtimeoutwindow, and when it doesn't, Docker marks the container unhealthy, triggeringdepends_on: condition: service_healthyfailure in the agent service.Proposed Solution
src/docker-manager.ts): Increase api-proxy healthcheckstart_periodto 30–45s andretriesto 5–8. This is a low-risk change that directly reduces flap frequency./healthzroute tocontainers/api-proxy/if one doesn't exist, returning200 OKas soon as the HTTP server is listening — avoiding any dependency on upstream connectivity in the health check.containers/agent/entrypoint.sh, add a pre-flight check that polls the proxy ports before proceeding, with a configurable timeout (e.g.,AWF_PROXY_STARTUP_TIMEOUT_S=60).