
fix: retry awf-api-proxy health check failures and improve startup error messaging #2255

Merged
lpcox merged 4 commits into main from copilot/fix-awf-api-proxy-health-check on Apr 28, 2026

Contributor

Copilot AI commented Apr 28, 2026

awf-api-proxy intermittently fails its Docker health check on slow/busy runners (observed ~33% on Azure-hosted runners), hard-failing the entire AWF step. When this happens, the misleading downstream error blames the model for never producing output rather than surfacing the actual container startup failure.

Changes

Health check tolerances (src/docker-manager.ts)

  • start_period: 10s → 30s
  • retries: 10 → 15
  • interval: 1s → 2s, timeout: 2s → 3s

These give slower runners more headroom before declaring the container unhealthy.
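
As a rough illustration of the added headroom, the sketch below estimates how long Docker could wait before marking the container unhealthy under the old and new settings. It assumes a simplified model (start period, then `retries` consecutive probes that each take one interval plus a full timeout); the `HealthcheckTiming` type and `worstCaseSeconds` function are hypothetical names used only for this example and are not code from this PR.

```typescript
// Hypothetical shape for the four timing fields changed in this PR (all in seconds).
interface HealthcheckTiming {
  startPeriod: number;
  interval: number;
  timeout: number;
  retries: number;
}

// Rough upper bound on time-to-unhealthy under the simplified model:
// failures inside the start period are not counted, and afterwards each
// counted probe costs one interval plus, at worst, one full timeout.
function worstCaseSeconds(h: HealthcheckTiming): number {
  return h.startPeriod + h.retries * (h.interval + h.timeout);
}

const before: HealthcheckTiming = { startPeriod: 10, interval: 1, timeout: 2, retries: 10 };
const after: HealthcheckTiming = { startPeriod: 30, interval: 2, timeout: 3, retries: 15 };

console.log(worstCaseSeconds(before)); // 40
console.log(worstCaseSeconds(after)); // 105
```

Under this model the startup budget grows from roughly 40 seconds to roughly 105 seconds; actual timing depends on how quickly each probe returns.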

Retry on transient api-proxy failure

When docker compose up fails with awf-api-proxy is unhealthy, startContainers now tears down and retries once. Most transient port-binding delays recover on the second attempt with no user-visible failure.

[WARN] awf-api-proxy failed its health check — this may be a transient startup failure, retrying once...

Clearer error on double failure

If api-proxy fails both attempts, a precise error is thrown instead of the generic compose error:

AWF firewall failed to start: awf-api-proxy failed its health check on both attempts.
The agent was never invoked. See awf-api-proxy container logs above for details.

The last 50 lines of awf-api-proxy container logs are dumped to stderr before the retry and on final failure.
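
The retry-once flow and the double-failure error described above can be sketched as follows. This is a minimal illustration, not the code in src/docker-manager.ts: the `StartupOps` interface and `startWithOneRetry` function are hypothetical, the Docker operations are injected so the flow can be exercised without Docker, and the warning text is paraphrased. Only the final error message is taken from this PR description.

```typescript
// Hypothetical dependencies, injected so the control flow is testable in isolation.
interface StartupOps {
  up: () => Promise<void>; // stands in for `docker compose up`
  down: () => Promise<void>; // stands in for the teardown between attempts
  isApiProxyUnhealthy: (err: unknown) => boolean;
  dumpApiProxyLogs: () => Promise<void>;
  warn: (msg: string) => void;
}

async function startWithOneRetry(ops: StartupOps): Promise<void> {
  try {
    await ops.up();
    return;
  } catch (firstError) {
    // Anything other than an api-proxy health failure is not retried.
    if (!ops.isApiProxyUnhealthy(firstError)) throw firstError;
    await ops.dumpApiProxyLogs();
    ops.warn('awf-api-proxy failed its health check; retrying once...');
    // Best-effort teardown: a failed teardown should not mask the retry.
    await ops.down().catch(() => undefined);
  }

  try {
    await ops.up();
  } catch (retryError) {
    // A different failure on the retry is propagated unchanged.
    if (!ops.isApiProxyUnhealthy(retryError)) throw retryError;
    await ops.dumpApiProxyLogs();
    throw new Error(
      'AWF firewall failed to start: awf-api-proxy failed its health check on both attempts.\n' +
        'The agent was never invoked. See awf-api-proxy container logs above for details.'
    );
  }
}
```

Injecting the operations keeps the retry policy separate from the Docker calls, which is one way to unit-test the "retry once, then fail clearly" behavior.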

Non-api-proxy health check failures

All other health check failures (e.g. squid-proxy) continue through the existing Squid log inspection path unchanged — no retry is attempted.

New helpers

  • isApiProxyUnhealthyError(msg) — detects Docker Compose errors specific to awf-api-proxy
  • logContainerLogsToStderr(containerName) — dumps last 50 lines of a container's logs to stderr for diagnosis
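
A detection helper of this kind could be as simple as a pattern match on the error message. The sketch below is an assumption about how such a check might look, not the implementation of isApiProxyUnhealthyError; the function name `looksLikeApiProxyUnhealthy` and the sample messages are illustrative only.

```typescript
// Hypothetical matcher: true when a single-line error message names
// awf-api-proxy and then reports it as unhealthy.
function looksLikeApiProxyUnhealthy(message: string): boolean {
  return /\bawf-api-proxy\b.*\bis unhealthy\b/i.test(message);
}

// Illustrative messages (assumed wording, not captured from a real run).
console.log(looksLikeApiProxyUnhealthy('dependency failed to start: container awf-api-proxy is unhealthy'));
console.log(looksLikeApiProxyUnhealthy('dependency failed to start: container squid-proxy is unhealthy'));
```

Matching on the container name keeps the retry narrowly scoped, so unhealthy reports for other containers are not retried.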

Copilot AI changed the title from "[WIP] Fix intermittent health check failures in awf-api-proxy container" to "fix: retry awf-api-proxy health check failures and improve startup error messaging" on Apr 28, 2026
Copilot AI requested a review from lpcox April 28, 2026 02:08
@lpcox lpcox marked this pull request as ready for review April 28, 2026 02:53
@lpcox lpcox requested a review from Mossaka as a code owner April 28, 2026 02:53
Copilot AI review requested due to automatic review settings April 28, 2026 02:53
@github-actions
Contributor

github-actions Bot commented Apr 28, 2026

✅ Coverage Check Passed

Overall Coverage

| Metric | Base | PR | Delta |
| --- | --- | --- | --- |
| Lines | 86.02% | 86.15% | 📈 +0.13% |
| Statements | 86.02% | 86.14% | 📈 +0.12% |
| Functions | 88.04% | 88.18% | 📈 +0.14% |
| Branches | 80.20% | 80.29% | 📈 +0.09% |

📁 Per-file Coverage Changes (1 file)

| File | Lines (Before → After) | Statements (Before → After) |
| --- | --- | --- |
| src/docker-manager.ts | 87.2% → 87.6% (+0.42%) | 86.8% → 87.2% (+0.42%) |

Coverage comparison generated by scripts/ci/compare-coverage.ts


Contributor

Copilot AI left a comment


Pull request overview

This PR improves reliability and debuggability of the awf-api-proxy sidecar startup by relaxing its Docker health check thresholds and adding a one-time retry flow when Docker Compose reports the api-proxy container as unhealthy, along with clearer failure messaging and targeted log dumping.

Changes:

  • Increased awf-api-proxy Docker healthcheck tolerances (interval/timeout/retries/start_period) to reduce flakiness on slow runners.
  • Added startContainers() retry-once behavior specifically for awf-api-proxy healthcheck failures, including best-effort teardown between attempts.
  • Added helper functions to detect api-proxy unhealthy errors and to dump last 50 lines of container logs for diagnosis, plus improved final error message after double failure.

| File | Description |
| --- | --- |
| src/docker-manager.ts | Relaxes api-proxy healthcheck config; adds api-proxy-specific unhealthy detection, container log dumping, and a retry-once startup flow with clearer terminal error on repeated failure. |
| src/docker-manager.test.ts | Updates healthcheck expectation assertions and adds tests covering retry behavior, final error message on double failure, and “no retry” for other unhealthy containers. |

Copilot's findings


  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Comment thread src/docker-manager.ts
Comment on lines +2443 to +2451
try {
  const result = await execa('docker', ['logs', '--tail', '50', containerName], {
    reject: false,
    env: getLocalDockerEnv(),
  });
  const combined = [result.stdout, result.stderr].filter(Boolean).join('\n').trim();
  if (combined) {
    logger.error(`${containerName} container logs (last 50 lines):\n${combined}`);
  }
Comment thread src/docker-manager.ts Outdated
Comment on lines +2544 to +2547
// Other error during retry — fall through to generic handler below
logger.error('Failed to start containers (retry):', retryError);
throw retryError;
}
@github-actions github-actions Bot mentioned this pull request Apr 28, 2026

@lpcox
Collaborator

lpcox commented Apr 28, 2026

@copilot address the review feedback

Contributor Author

Copilot AI commented Apr 28, 2026

> @copilot address the review feedback

Done in 8295ec4. Both review issues are addressed:

  1. logContainerLogsToStderr no longer emits "No such container" noise — now checks result.exitCode === 0 before emitting logs, and logs a debug message instead when docker logs exits non-zero.

  2. Retry errors route through the Squid diagnostic path — extracted a shared handleHealthcheckError() helper so both first-attempt and retry failures (that aren't api-proxy-specific) go through the Squid log check and produce the same actionable blocked-domain errors.

Two new tests cover these cases: the exit-code suppression and the retry-to-Squid-diagnostic routing.
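
The exit-code gating from item 1 can be sketched as a pure formatting step. This is an illustration under assumptions, not the code from 8295ec4: the `LogsResult` type and `formatContainerLogs` function are hypothetical, and the function only mirrors the `exitCode`, `stdout`, and `stderr` fields used in the reviewed snippet.

```typescript
// Minimal shape of a finished `docker logs` invocation (hypothetical type).
interface LogsResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

// Returns the text to emit, or null when nothing should be logged as an error.
function formatContainerLogs(containerName: string, result: LogsResult): string | null {
  // A non-zero exit (for example, the container does not exist) means stderr
  // holds a Docker error rather than container output, so it is suppressed.
  if (result.exitCode !== 0) return null;
  const combined = [result.stdout, result.stderr].filter(Boolean).join('\n').trim();
  return combined ? `${containerName} container logs (last 50 lines):\n${combined}` : null;
}

const missing = formatContainerLogs('awf-api-proxy', {
  exitCode: 1,
  stdout: '',
  stderr: 'Error response from daemon: No such container: awf-api-proxy',
});
console.log(missing); // null
```

A caller could log a debug message when the result is null and emit the returned string at error level otherwise.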


@github-actions
Contributor

Smoke Test Results

✅ GitHub MCP: chore: optimize test-coverage-improver workflow for ~50% token reduction
✅ GitHub MCP: [Test Coverage] Add comprehensive tests for image-tag module
✅ Playwright: Page title contains "GitHub"
✅ File Writing: Test file created successfully
✅ Bash: File verified

PASS

💥 [THE END] — Illustrated by Smoke Claude

@github-actions
Contributor

🔥 Smoke Test Results

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | ✅ |
| GitHub.com HTTP | ⚠️ N/A (template vars unresolved) |
| File write/read | ⚠️ N/A (template vars unresolved) |

PR: fix: retry awf-api-proxy health check failures and improve startup error messaging
Author: @Copilot | Assignees: @lpcox, @Copilot

Overall: PARTIAL. MCP ✅; HTTP/file tests skipped due to unresolved ${{ steps.smoke-data.outputs.* }} variables in the workflow.

📰 BREAKING: Report filed by Smoke Copilot

@github-actions
Contributor

Smoke Test

PRs: "fix: retry awf-api-proxy health check failures and improve startup error messaging"; "feat: preflight binary check for codex in AWF agent container"
Merged PR review: ✅
Safe Inputs GH CLI: ❌
Playwright GitHub title: ✅
Tavily search: ❌
File write/read: ✅
Discussion comment: ✅
Build (npm ci && npm run build): ✅
Overall status: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

@github-actions
Contributor

🧪 Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
| --- | --- | --- | --- |
| Python | Python 3.12.13 | Python 3.12.3 | ❌ NO |
| Node.js | v24.14.1 | v20.20.2 | ❌ NO |
| Go | go1.22.12 | go1.22.12 | ✅ YES |

Overall: ❌ Not all versions match — Python and Node.js versions differ between host and chroot environments.

Tested by Smoke Chroot

@github-actions
Contributor

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color passed ✅ PASS
Go env passed ✅ PASS
Go uuid passed ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx passed ✅ PASS
Node.js execa passed ✅ PASS
Node.js p-limit passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #2255 · ● 531.5K ·

@github-actions
Contributor

Smoke Test Results

  • Redis PING: ❌ (timeout/no response)
  • PostgreSQL pg_isready: ❌ (no response)
  • PostgreSQL SELECT 1: ❌ (skipped — host unreachable)

Overall: FAIL. host.docker.internal services are not reachable from this environment.

🔌 Service connectivity validated by Smoke Services

@lpcox lpcox merged commit 2f5cc71 into main Apr 28, 2026
64 of 68 checks passed
@lpcox lpcox deleted the copilot/fix-awf-api-proxy-health-check branch April 28, 2026 04:22
@github-actions
Contributor

Smoke Test: Copilot BYOK (Offline) Mode — Run #25054697630

Test Result
GitHub MCP connectivity
GitHub.com connectivity
File write/read
BYOK inference (this response)

Running in BYOK offline mode (COPILOT_OFFLINE=true) via api-proxy → api.githubcopilot.com.

Overall: PASS — triggered by @lpcox (scheduled run, no PR context)

🔑 BYOK report filed by Smoke Copilot BYOK


Development

Successfully merging this pull request may close these issues.

awf-api-proxy container fails health check intermittently, causing hard failures with misleading error

3 participants