[awf] API Proxy Sidecar: intermittent health check failure blocks agent startup #2274

@lpcox

Description

Problem

The awf-api-proxy sidecar container intermittently fails its Docker health check during docker compose up, causing the entire AWF agent startup to fail with exit code 1 before any agent turns run. Affected workflows include Smoke CI, Sub-Issue Closer, and Daily Team Evolution Insights.

Failure signature:

Container awf-api-proxy  Waiting
Container awf-squid      Healthy
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Command failed with exit code 1: docker compose up -d --pull never

Context

Root Cause

On some runners (resource contention, cold start) the awf-api-proxy Node.js HTTP server takes longer to bind its port than the Docker health check allows, so the container is marked unhealthy before it is ready. Because the Docker Compose configuration in src/docker-manager.ts declares depends_on: api-proxy: condition: service_healthy, a single unhealthy result aborts startup of the entire stack.

Specifically:

  • The HEALTHCHECK in containers/api-proxy/Dockerfile likely uses the default start period (0s) and a low retry count, which leaves no tolerance for runner load variance
  • The error path in src/docker-manager.ts does not capture docker logs awf-api-proxy, so the actual container failure reason is hidden
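
To make the timing argument concrete, here is a rough TypeScript sketch of how long a container that never answers can stay out of the unhealthy state. It relies on Docker's documented behaviour that probe failures during the start period do not count toward retries, and it additionally assumes every probe fails immediately (for example, connection refused), so the result is only an approximation, good to within about one interval. The HealthCheck type and graceSeconds function are hypothetical names for illustration, not part of AWF, and the "tight" values are invented, not read from the actual Dockerfile.

```typescript
// Hypothetical model of Docker HEALTHCHECK timing; all values are in seconds.
interface HealthCheck {
  interval: number;    // --interval
  startPeriod: number; // --start-period
  retries: number;     // --retries
}

// Approximate time before a never-ready container is marked unhealthy,
// assuming failures inside the start period are ignored and each probe
// fails immediately rather than running into --timeout.
function graceSeconds(hc: HealthCheck): number {
  return hc.startPeriod + hc.retries * hc.interval;
}

// Illustrative tight configuration (an assumption, not the real Dockerfile).
const tight: HealthCheck = { interval: 3, startPeriod: 0, retries: 2 };
// Values proposed in this issue.
const proposed: HealthCheck = { interval: 3, startPeriod: 15, retries: 5 };

console.log(graceSeconds(tight));    // 6
console.log(graceSeconds(proposed)); // 30
```

Under this model the proposed settings give the server roughly 30 seconds to bind, compared with only a few seconds for a configuration with no start period and few retries.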

Proposed Solution

  1. containers/api-proxy/Dockerfile: Increase the HEALTHCHECK --start-period to at least 15s and set --retries=5 to tolerate slow cold starts:

    HEALTHCHECK --interval=3s --timeout=5s --start-period=15s --retries=5 \
      CMD curl -f (localhost/redacted) || exit 1

  2. src/docker-manager.ts: In the error path for docker compose up failure, add a step to capture and log docker logs awf-api-proxy so the failure reason is surfaced in agent-stdio.log.

  3. src/docker-manager.ts: Consider adding a retry loop (up to 2 retries with docker compose up) specifically for transient health check failures, guarded by checking the exit message.

  4. src/docker-manager.ts: Review the api-proxy depends_on block. Either switch to service_started (instead of service_healthy) if the proxy path is non-critical, or keep service_healthy and fix the health check timing.
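
Items 2 and 3 could be combined along the following lines. This is a minimal sketch, not the actual code in src/docker-manager.ts: the Runner type, realRunner, and startStack are hypothetical names, the logger stands in for whatever writes to agent-stdio.log, and running docker compose down between attempts is an assumption about how to get a clean retry. The regular expression matches the failure signature quoted above.

```typescript
import { spawnSync } from "node:child_process";

// Exit code plus combined stdout/stderr of one command.
interface RunResult {
  code: number;
  output: string;
}

// Hypothetical seam so the retry logic can be exercised without Docker.
type Runner = (cmd: string, args: string[]) => RunResult;

const realRunner: Runner = (cmd, args) => {
  const r = spawnSync(cmd, args, { encoding: "utf8" });
  return { code: r.status ?? 1, output: `${r.stdout ?? ""}${r.stderr ?? ""}` };
};

// Failure signature taken from the error output in this issue.
const UNHEALTHY = /container awf-api-proxy is unhealthy/;

function startStack(
  run: Runner = realRunner,
  maxRetries = 2,
  log: (msg: string) => void = console.error,
): boolean {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const up = run("docker", ["compose", "up", "-d", "--pull", "never"]);
    if (up.code === 0) return true;

    // Item 2: surface the sidecar's own logs before anything is torn down.
    const logs = run("docker", ["logs", "awf-api-proxy"]);
    log(`[ERROR] awf-api-proxy logs (attempt ${attempt + 1}):\n${logs.output}`);

    // Item 3: retry only when the message looks like the transient failure.
    if (!UNHEALTHY.test(up.output)) return false;

    // Assumption: reset the stack so the next attempt starts clean.
    if (attempt < maxRetries) run("docker", ["compose", "down"]);
  }
  return false;
}
```

Calling startStack() with no arguments would run the real Docker commands. Passing a fake Runner makes it possible to check that unrelated failures are not retried and that the logs are captured on every failed attempt.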

Success criteria: Smoke CI passes 5 consecutive runs without awf-api-proxy Error.

Generated by Firewall Issue Dispatcher · ● 450K
