Problem
The api-proxy container healthcheck can flap during startup, causing Docker Compose to mark it as unhealthy. Since the agent container has depends_on: api-proxy: service_healthy, this cascades into a full startup failure with:
dependency failed to start: container awf-api-proxy is unhealthy
This is an intermittent failure — the api-proxy Node.js server may not have fully bound to port 10000 before the healthcheck fires, or transient resource contention on CI runners causes a brief unresponsiveness window.
Current healthcheck config
From src/docker-manager.ts:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:10000/health"]
interval: 1s
timeout: 1s
retries: 5
start_period: 2s
The start_period: 2s gives only 2 seconds of grace before failures count. Combined with interval: 1s, timeout: 1s, and only retries: 5, a slow Node.js startup can exhaust retries quickly.
Impact
This causes intermittent failures in gh-aw agentic workflows that use --enable-api-proxy. A workaround PR was opened in gh-aw to add a retry wrapper around AWF startup, but the root cause should be fixed here.
Possible fixes
- Increase
start_period — give the Node.js server more time to bind (e.g., 5s or 10s)
- Increase
retries — allow more attempts before marking unhealthy (e.g., 10 or 15)
- Increase
timeout — give each health probe more time (e.g., 2s or 3s)
- Add startup readiness signal — have the api-proxy write a ready file after all ports are bound, and check for that in the healthcheck instead of (or in addition to) curling the endpoint
- AWF-level retry — if Docker Compose reports an unhealthy dependency, AWF itself could retry startup (similar to what the gh-aw PR does externally)
References
Problem
The
api-proxycontainer healthcheck can flap during startup, causing Docker Compose to mark it as unhealthy. Since the agent container hasdepends_on: api-proxy: service_healthy, this cascades into a full startup failure with:This is an intermittent failure — the api-proxy Node.js server may not have fully bound to port 10000 before the healthcheck fires, or transient resource contention on CI runners causes a brief unresponsiveness window.
Current healthcheck config
From
src/docker-manager.ts:The
start_period: 2sgives only 2 seconds of grace before failures count. Combined withinterval: 1s,timeout: 1s, and onlyretries: 5, a slow Node.js startup can exhaust retries quickly.Impact
This causes intermittent failures in gh-aw agentic workflows that use
--enable-api-proxy. A workaround PR was opened in gh-aw to add a retry wrapper around AWF startup, but the root cause should be fixed here.Possible fixes
start_period— give the Node.js server more time to bind (e.g.,5sor10s)retries— allow more attempts before marking unhealthy (e.g.,10or15)timeout— give each health probe more time (e.g.,2sor3s)References
awf-api-proxyhealth check flaps gh-aw#27909src/docker-manager.ts(~line 1773)containers/api-proxy/server.js(~line 908)