Skip to content

[aw-failures] Fix P0: awf-api-proxy container intermittently fails health check, blocking agent startup #28949

@github-actions

Description

@github-actions

Parent investigation: #28947

Problem Statement

The awf-api-proxy sidecar container intermittently fails its Docker health check during docker compose up, causing the entire agent startup to fail with exit code 1. When this happens, no agent turns run at all — the workflow exits immediately with an opaque exitCode: 1 error.

Affected Workflows & Runs

Run Workflow Engine Time (UTC)
§25049338576 Sub-Issue Closer copilot 2026-04-28 11:10
§25049605437 Daily Team Evolution Insights claude 2026-04-28 11:14
§25052667955 Smoke CI copilot 2026-04-28 12:25

Failure Signature

Container awf-api-proxy  Waiting
Container awf-squid      Healthy
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Command failed with exit code 1: docker compose up -d --pull never

Root Cause (Probable)

The awf-api-proxy container starts and passes the Docker Starting/Started phases but then fails its health check before the timeout expires. The failure is intermittent — successful runs on the same day (e.g., §25048535353) show Container awf-api-proxy Healthy. This points to:

  1. Slow health check endpoint startup — the proxy HTTP server takes longer to bind on some runners, exceeding the health check start_period
  2. Resource contention — shared runner load causes the container to start slower intermittently
  3. Health check configuration — insufficient retries or interval for the observed startup variance

Proposed Remediation

  1. Inspect awf-api-proxy Dockerfile HEALTHCHECK settings (interval, timeout, retries, start-period)
  2. Increase start_period (e.g., from default 0 to 10s) and/or retries to tolerate slower cold starts
  3. Add container-level health check failure log capture (docker logs awf-api-proxy) in the error path to surface the actual failure reason
  4. Consider adding a retry loop at the docker compose up level for transient health check failures

Success Criteria

Generated by [aw] Failure Investigator (6h) · ● 267.1K ·

  • expires on May 5, 2026, 1:29 PM UTC

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions