[awf] api-proxy: Intermittent health check failure blocks agent startup across multiple engines

## Problem

The `awf-api-proxy` sidecar container intermittently fails its Docker health check during `docker compose up`. This blocks agent startup entirely across multiple engines (copilot, claude) and workflow types. At least 3 confirmed runs failed in the 2026-04-28 07:00–13:00 UTC window.

Affected runs:
- Sub-Issue Closer (copilot) — run 25049338576
- Daily Team Evolution Insights (claude) — run 25049605437
- Smoke CI (copilot on PR trigger) — run 25052667955

## Context

Parent report: [github/gh-aw#28947](https://github.com/github/gh-aw/issues/28947).  
Detailed issue: [github/gh-aw#28949](https://github.com/github/gh-aw/issues/28949).

This is a higher-frequency occurrence of the same pattern reported in #28898.

## Root Cause

The `awf-api-proxy` health check in the generated `docker-compose.yml` (produced by `src/docker-manager.ts` → `generateDockerCompose()`) fires before the Node.js proxy process has bound its ports. The `start_period` is likely too short to accommodate:
- Container image cold pulls
- GitHub-hosted runner CPU contention under burst load
- Node.js module initialization latency

The `containers/api-proxy/` healthcheck endpoint (likely `GET /healthz` or `GET /`) must respond within the configured `timeout` window, and when it doesn't, Docker marks the container unhealthy, triggering `depends_on: condition: service_healthy` failure in the agent service.

## Proposed Solution

1. **Immediate (in `src/docker-manager.ts`)**: Increase api-proxy healthcheck `start_period` to 30–45s and `retries` to 5–8. This is a low-risk change that directly reduces flap frequency.
2. **Medium-term**: Add a lightweight `/healthz` route to `containers/api-proxy/` if one doesn't exist, returning `200 OK` as soon as the HTTP server is listening — avoiding any dependency on upstream connectivity in the health check.
3. **Resilience**: In `containers/agent/entrypoint.sh`, add a pre-flight check that polls the proxy ports before proceeding, with a configurable timeout (e.g., `AWF_PROXY_STARTUP_TIMEOUT_S=60`).
4. **Observability**: Log the proxy container startup time in debug mode so flap frequency can be correlated with runner load.




> Generated by [Firewall Issue Dispatcher](https://github.com/github/gh-aw-firewall/actions/runs/25109775398/agentic_workflow) · ● 436.3K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw-firewall+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw-firewall%2Ffirewall-issue-dispatcher%22&type=issues)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[awf] api-proxy: Intermittent health check failure blocks agent startup across multiple engines #2295

Problem

Context

Root Cause

Proposed Solution

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[awf] api-proxy: Intermittent health check failure blocks agent startup across multiple engines #2295

Description

Problem

Context

Root Cause

Proposed Solution

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions