Skip to content

[awf] api-proxy: Intermittent health check failure blocks agent startup across multiple engines #2295

@lpcox

Description

@lpcox

Problem

The awf-api-proxy sidecar container intermittently fails its Docker health check during docker compose up. This blocks agent startup entirely across multiple engines (copilot, claude) and workflow types. At least 3 confirmed runs failed in the 2026-04-28 07:00–13:00 UTC window.

Affected runs:

  • Sub-Issue Closer (copilot) — run 25049338576
  • Daily Team Evolution Insights (claude) — run 25049605437
  • Smoke CI (copilot on PR trigger) — run 25052667955

Context

Parent report: github/gh-aw#28947.
Detailed issue: github/gh-aw#28949.

This is a higher-frequency occurrence of the same pattern reported in #28898.

Root Cause

The awf-api-proxy health check in the generated docker-compose.yml (produced by src/docker-manager.tsgenerateDockerCompose()) fires before the Node.js proxy process has bound its ports. The start_period is likely too short to accommodate:

  • Container image cold pulls
  • GitHub-hosted runner CPU contention under burst load
  • Node.js module initialization latency

The containers/api-proxy/ healthcheck endpoint (likely GET /healthz or GET /) must respond within the configured timeout window, and when it doesn't, Docker marks the container unhealthy, triggering depends_on: condition: service_healthy failure in the agent service.

Proposed Solution

  1. Immediate (in src/docker-manager.ts): Increase api-proxy healthcheck start_period to 30–45s and retries to 5–8. This is a low-risk change that directly reduces flap frequency.
  2. Medium-term: Add a lightweight /healthz route to containers/api-proxy/ if one doesn't exist, returning 200 OK as soon as the HTTP server is listening — avoiding any dependency on upstream connectivity in the health check.
  3. Resilience: In containers/agent/entrypoint.sh, add a pre-flight check that polls the proxy ports before proceeding, with a configurable timeout (e.g., AWF_PROXY_STARTUP_TIMEOUT_S=60).
  4. Observability: Log the proxy container startup time in debug mode so flap frequency can be correlated with runner load.

Generated by Firewall Issue Dispatcher · ● 436.3K ·

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions