
[awf] api-proxy: Flaky awf-api-proxy unhealthy failure during docker compose up #2266

@lpcox


Problem

The awf-api-proxy sidecar container intermittently fails its health check during docker compose up, so the agent never starts. This has been reported on gh-aw v0.71.1 with firewall image 0.25.28 and matches a prior pattern from #27888 (now closed).

The compose health check configuration is:

healthcheck:
  test: [CMD, curl, '-f', '(localhost/redacted)
  interval: 1s
  timeout: 1s
  retries: 5
  start_period: 2s

The total window is only about 7 seconds (2s start_period + 5 retries × 1s interval), which may be insufficient on loaded GitHub-hosted runners.
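The arithmetic above can be written as a small helper, which makes it easy to compare the current settings with other candidates. This is only a sketch of the issue's own estimate (start_period plus retries × interval); the HealthCheck interface and the nominalWindowSec function are hypothetical names and are not taken from src/docker-manager.ts. The "proposed" values come from the Proposed Solution section, with the interval assumed to stay at 1s.

```typescript
// Hypothetical shape for the health check settings discussed in this issue.
interface HealthCheck {
  startPeriodSec: number;
  intervalSec: number;
  retries: number;
}

// Nominal window as estimated in the issue: the start period plus one
// interval per allowed retry. Probe run time is ignored in this simple model.
function nominalWindowSec(hc: HealthCheck): number {
  return hc.startPeriodSec + hc.retries * hc.intervalSec;
}

// Current values from the compose health check shown above.
const current: HealthCheck = { startPeriodSec: 2, intervalSec: 1, retries: 5 };

// Values from the Proposed Solution; interval assumed unchanged at 1s.
const proposed: HealthCheck = { startPeriodSec: 5, intervalSec: 1, retries: 10 };

console.log(nominalWindowSec(current));  // 7
console.log(nominalWindowSec(proposed)); // 15
```

Under this model the proposed settings roughly double the time the api-proxy has to become healthy.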

Context

Root Cause

Two compounding problems:

  1. Insufficient health check grace period. The start_period: 2s and 5 retries at a 1s interval give a maximum window of ~7s for the Node.js api-proxy process to start, bind its port, and pass the /health check. On a resource-constrained or busy runner, container startup alone can exceed this window.

  2. Missing log capture on failure. When docker compose up fails due to an unhealthy container, the api-proxy logs are not captured before containers are removed. The api-proxy-logs mount directory is absent from uploaded artifacts, making the failure undiagnosable.

Relevant source files:

  • src/docker-manager.ts — generates the Docker Compose config including the health check parameters
  • src/cli.ts — stopContainers() cleans up after failure; does not capture api-proxy logs before teardown
  • containers/api-proxy/ — the api-proxy container itself

Proposed Solution

  1. Increase health check tolerance in src/docker-manager.ts: raise start_period to 5s, retries to 10, and timeout to 3s to match the Squid health check's robustness.

  2. Capture api-proxy container logs on failure in src/cli.ts: before calling docker compose down, run docker compose logs api-proxy and write to the work directory's api-proxy-logs/ folder so it is included in uploaded artifacts.

  3. Always include api-proxy-logs/ in artifacts (even if empty) to make absence explicit and aid triage.

  4. Optionally retry docker compose up once if only the api-proxy health check fails, before treating the whole run as failed.
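Step 2 could look roughly like the sketch below. It is not the existing code in src/cli.ts: captureApiProxyLogs, the Runner type, and the compose-logs.txt file name are hypothetical, and the compose service is assumed to be named api-proxy, as in the command quoted above. The command runner is injectable so the helper can be exercised without Docker.

```typescript
import { execFileSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical abstraction over running a command and returning its stdout.
type Runner = (cmd: string, args: string[]) => string;

// Default runner: executes the command synchronously and returns stdout as text.
const defaultRunner: Runner = (cmd, args) =>
  execFileSync(cmd, args, { encoding: "utf8" });

// Writes the api-proxy service logs into <workDir>/api-proxy-logs/ and returns
// the path of the written file. Intended to be called before `docker compose down`.
function captureApiProxyLogs(workDir: string, run: Runner = defaultRunner): string {
  const logDir = join(workDir, "api-proxy-logs");
  // Create the directory even if log collection fails, so it always shows up
  // in uploaded artifacts (step 3).
  mkdirSync(logDir, { recursive: true });

  let output: string;
  try {
    output = run("docker", ["compose", "logs", "--no-color", "api-proxy"]);
  } catch (err) {
    // Log capture must never mask the original startup failure.
    output = `failed to collect api-proxy logs: ${String(err)}\n`;
  }

  const logFile = join(logDir, "compose-logs.txt"); // hypothetical file name
  writeFileSync(logFile, output);
  return logFile;
}
```

Recording the collection error in the log file, rather than rethrowing it, keeps the unhealthy-container failure as the error that surfaces to the user.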

Generated by Firewall Issue Dispatcher · ● 726.9K
