[awf] api-proxy: Flaky awf-api-proxy unhealthy failure during docker compose up #2266

## Problem

The `awf-api-proxy` sidecar container intermittently fails its health check during `docker compose up`, causing the agent to never start. This has been reported on `gh-aw v0.71.1` with firewall image `0.25.28` and matches a prior pattern from #27888 (now closed).

The compose health check configuration is:
```yaml
healthcheck:
  test: [CMD, curl, '-f', '(localhost/redacted)
  interval: 1s
  timeout: 1s
  retries: 5
  start_period: 2s
```

The total window is only about 7 seconds (2s start period + 5 retries × 1s interval), which may be insufficient on loaded GitHub-hosted runners.
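
The 7-second figure can be reproduced with a small calculation. The sketch below is illustrative only: `HealthcheckTiming` and `approxUnhealthyWindowSeconds` are hypothetical names, not part of `gh-aw` or Docker, and the formula is the approximation used in this issue (start period plus one interval per retry). It ignores the time each probe takes, which `timeout` bounds.

```typescript
// Hypothetical shape for the timing fields of a Compose healthcheck, in seconds.
interface HealthcheckTiming {
  intervalSeconds: number;
  retries: number;
  startPeriodSeconds: number;
}

// Approximate time before a container that never becomes healthy is marked
// unhealthy: the start period, plus one interval per allowed retry.
// Probe duration is ignored, so treat the result as an estimate.
function approxUnhealthyWindowSeconds(t: HealthcheckTiming): number {
  return t.startPeriodSeconds + t.retries * t.intervalSeconds;
}

// Values taken from the health check configuration quoted in this issue.
const current: HealthcheckTiming = {
  intervalSeconds: 1,
  retries: 5,
  startPeriodSeconds: 2,
};

console.log(approxUnhealthyWindowSeconds(current)); // 7
```

Substituting other values into the same function shows how much each field moves the window.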

## Context

- Original issue: https://github.com/github/gh-aw/issues/28898
- Affected version: `gh-aw v0.71.1`, firewall `0.25.28`
- Runner image: `ubuntu24`, `ImageVersion: 20260426.100.1`
- Downstream repo: `elastic/docs-content`
- Failed run: https://github.com/elastic/docs-content/actions/runs/25046213576

## Root Cause

Two compounding problems:

1. **Insufficient health check grace period.** With `start_period: 2s` and 5 retries at a 1s interval, the Node.js api-proxy process has a maximum window of ~7s to start, bind its port, and pass the `/health` check. On a resource-constrained or busy runner, container startup alone can exceed this window.

2. **Missing log capture on failure.** When `docker compose up` fails due to an unhealthy container, the api-proxy logs are not captured before containers are removed. The `api-proxy-logs` mount directory is absent from uploaded artifacts, making the failure undiagnosable.

Relevant source files:
- `src/docker-manager.ts` — generates the Docker Compose config including the health check parameters
- `src/cli.ts` — `stopContainers()` cleans up after failure; does not capture api-proxy logs before teardown
- `containers/api-proxy/` — the api-proxy container itself

## Proposed Solution

1. **Increase health check tolerance** in `src/docker-manager.ts`: raise `start_period` to `5s`, `retries` to `10`, and `timeout` to `3s` to match the Squid health check's robustness. If `interval` stays at `1s`, this widens the window from ~7s to roughly 15s.

2. **Capture api-proxy container logs on failure** in `src/cli.ts`: before calling `docker compose down`, run `docker compose logs api-proxy` and write the output to the work directory's `api-proxy-logs/` folder so that it is included in uploaded artifacts.

3. **Always include `api-proxy-logs/` in artifacts** (even if empty) to make absence explicit and aid triage.

4. **Optionally retry `docker compose up`** once if only the api-proxy health check fails, before treating the whole run as failed.
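
The log-capture steps (items 2 and 3) could look roughly like the sketch below. It is not the existing `src/cli.ts` code: `CommandRunner`, `realRunner`, `captureApiProxyLogs`, `teardownAfterFailure`, and the `api-proxy.log` file name are all hypothetical, and the sketch assumes the commands run from a directory where `docker compose` can find the generated compose file.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import { spawnSync } from "node:child_process";

// Runs a command and returns its exit status and combined stdout/stderr.
// Injected so the flow can be exercised without Docker.
type CommandRunner = (cmd: string, args: string[]) => { status: number; output: string };

// Default runner: invokes the real binary synchronously.
// Assumption: the current working directory contains the compose file.
const realRunner: CommandRunner = (cmd, args) => {
  const r = spawnSync(cmd, args, { encoding: "utf8" });
  return { status: r.status ?? 1, output: (r.stdout || "") + (r.stderr || "") };
};

// Hypothetical helper: write the api-proxy service logs into
// <workDir>/api-proxy-logs/. The directory and file are created even when
// the command produces no output, so their presence in artifacts is explicit.
function captureApiProxyLogs(workDir: string, run: CommandRunner = realRunner): string {
  const logDir = path.join(workDir, "api-proxy-logs");
  fs.mkdirSync(logDir, { recursive: true });
  const logs = run("docker", ["compose", "logs", "--no-color", "api-proxy"]);
  const logFile = path.join(logDir, "api-proxy.log");
  // Whatever was returned is written, regardless of the exit status.
  fs.writeFileSync(logFile, logs.output);
  return logFile;
}

// Hypothetical failure path: capture logs first, then tear the stack down.
function teardownAfterFailure(workDir: string, run: CommandRunner = realRunner): string {
  const logFile = captureApiProxyLogs(workDir, run);
  run("docker", ["compose", "down"]);
  return logFile;
}
```

The ordering is the point of the sketch: `docker compose logs` runs while the container still exists, and `docker compose down` runs only afterwards. Passing a fake `CommandRunner` lets that ordering be checked without starting containers.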




> Generated by [Firewall Issue Dispatcher](https://github.com/github/gh-aw-firewall/actions/runs/25053954854/agentic_workflow) · ● 726.9K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw-firewall+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw-firewall%2Ffirewall-issue-dispatcher%22&type=issues)
