Harden MCP Gateway startup health check against transient port-binding delays #26697

Merged
pelikhan merged 13 commits into main from copilot/fix-mcp-gateway-health-check
Apr 16, 2026

Copilot AI commented Apr 16, 2026

The Start MCP Gateway step could fail on transient startup races: the gateway process launched, but /health polling used rigid timing and could give up before port 80 became reachable under runner contention. This PR updates startup probing to tolerate short bind delays without changing failure semantics for genuinely unhealthy gateways.

  • Health-check timing now uses bounded exponential backoff

    • Updated both startup paths:
      • actions/setup/js/start_mcp_gateway.cjs
      • actions/setup/sh/start_mcp_gateway.sh
    • Retry cadence now ramps quickly (250ms -> 500ms -> 1s) and then stays capped, while preserving the existing long retry horizon (120 attempts).
    • This reduces false negatives immediately after container launch while still surfacing real startup failures.
  • Shell implementation avoids per-retry subprocess overhead

    • Replaced dynamic delay calculation via subprocess with direct branch-based delay tiers in the retry loop.
    • Keeps behavior explicit and deterministic in constrained runner environments.
  • Startup logging/attempt accounting tightened

    • JS path now reports actual attempts made for clearer diagnostics when readiness is reached early or fails late.
  • Focused shell test coverage updated for backoff config

    • actions/setup/sh/start_mcp_gateway_test.sh assertions were expanded to verify the expected backoff tiers and retry-delay usage in the startup script.
# start_mcp_gateway.sh retry tiering
if [ $RETRY_COUNT -eq 1 ]; then
  RETRY_DELAY="0.25"
elif [ $RETRY_COUNT -eq 2 ]; then
  RETRY_DELAY="0.5"
else
  RETRY_DELAY="1"
fi
sleep "$RETRY_DELAY"
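
For context, the tiering above could sit inside a polling loop along the following lines. This is a minimal sketch, not the script's actual code: `wait_for_health`, `ATTEMPTS_MADE`, and the `SLEEP_CMD` override are hypothetical names introduced for illustration. Only the delay tiers and the 120-attempt limit come from the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a health-check loop with capped backoff tiers.
# Only the tiers (0.25s, 0.5s, 1s) and the 120-attempt horizon come from the PR;
# every function and variable name here is an illustrative assumption.

MAX_RETRIES="${MAX_RETRIES:-120}"
SLEEP_CMD="${SLEEP_CMD:-sleep}"   # overridable so the loop can be exercised without waiting

# Usage: wait_for_health <probe command...>
# Runs the probe until it succeeds or MAX_RETRIES attempts have been made.
# Sets ATTEMPTS_MADE to the number of probes actually executed.
wait_for_health() {
  local retry_count=0
  local retry_delay
  ATTEMPTS_MADE=0
  while [ "$retry_count" -lt "$MAX_RETRIES" ]; do
    retry_count=$((retry_count + 1))
    ATTEMPTS_MADE=$retry_count
    if "$@"; then
      return 0
    fi
    # Same tiers as the snippet above: ramp quickly, then stay capped at 1s.
    if [ "$retry_count" -eq 1 ]; then
      retry_delay="0.25"
    elif [ "$retry_count" -eq 2 ]; then
      retry_delay="0.5"
    else
      retry_delay="1"
    fi
    "$SLEEP_CMD" "$retry_delay"
  done
  return 1
}
```

A caller might run `wait_for_health curl -fsS -o /dev/null http://localhost:80/health`, where the host name is an assumption (the PR only mentions port 80 and the /health path). Tracking `ATTEMPTS_MADE` separately mirrors the "actual attempts made" logging described for the JS path.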

Copilot AI and others added 7 commits April 16, 2026 17:51
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix MCP Gateway port 80 health check to include retry logic" to "Harden MCP Gateway startup health check against transient port-binding delays" Apr 16, 2026
Copilot AI requested a review from pelikhan April 16, 2026 18:09
let httpCode = 0;
let healthBody = "";
let succeeded = false;
let attemptsMade = 0;
Collaborator

@copilot use withRetry helper

Contributor Author

Updated to use the shared withRetry helper for MCP gateway health polling in actions/setup/js/start_mcp_gateway.cjs (commit 7682cd2), with equivalent retry behavior preserved.

@pelikhan pelikhan marked this pull request as ready for review April 16, 2026 18:36
Copilot AI review requested due to automatic review settings April 16, 2026 18:36
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0e241816-4acc-4ea2-98b5-d6db16911e8a

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Hardens MCP Gateway startup readiness checks to reduce false negatives caused by transient port-binding delays by introducing bounded exponential backoff in both JS and shell startup paths.

Changes:

  • Updated health-check polling in JS and shell to use exponential backoff capped at 1s while keeping the same retry horizon.
  • Improved JS logging to report actual attempts made.
  • Expanded shell test assertions to validate the new backoff configuration.
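
To make the "same retry horizon" point concrete, the worst-case time spent sleeping can be estimated with a little arithmetic. The sketch below assumes the loop sleeps once after every failed attempt and ignores the time taken by each probe itself; `total_backoff_ms` is a hypothetical helper, not part of the PR.

```shell
#!/usr/bin/env bash
# Sum the sleep time (in ms) across a given number of failed attempts, using the
# tiers from the PR: 250ms, 500ms, then 1000ms for every later attempt.
# Integer milliseconds keep the arithmetic inside the shell, with no subprocess.
total_backoff_ms() {
  local sleeps="$1"
  local total=0
  local i=1
  while [ "$i" -le "$sleeps" ]; do
    case "$i" in
      1) total=$((total + 250)) ;;
      2) total=$((total + 500)) ;;
      *) total=$((total + 1000)) ;;
    esac
    i=$((i + 1))
  done
  echo "$total"
}

total_backoff_ms 3     # prints 1750
total_backoff_ms 120   # prints 118750
```

Under those assumptions, 120 failed attempts spend just under two minutes sleeping, while a gateway that becomes reachable within the first few attempts is detected in well under two seconds.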
Show a summary per file

| File | Description |
| --- | --- |
| actions/setup/sh/start_mcp_gateway.sh | Implements capped exponential backoff (250ms→500ms→1s) between health-check attempts. |
| actions/setup/js/start_mcp_gateway.cjs | Adds exponential backoff delays and correct attempt-count reporting in logs. |
| actions/setup/sh/start_mcp_gateway_test.sh | Adds assertions to confirm the backoff logic exists in the shell script. |
| .github/mcp.json | Introduces an MCP server configuration file (appears unrelated to gateway health-check changes). |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread .github/mcp.json Outdated
Comment on lines +1 to +11
{
  "mcpServers": {
    "github-agentic-workflows": {
      "command": "gh",
      "args": [
        "aw",
        "mcp-server"
      ]
    }
  }
}
No newline at end of file
Copilot AI Apr 16, 2026

This new MCP configuration file isn’t mentioned in the PR description and appears unrelated to hardening the gateway startup health check. If it’s intentional, please add rationale to the PR description; otherwise consider removing it from this PR to keep scope focused.
Comment on lines +286 to +294
if grep -q "RETRY_COUNT -eq 1" "$SCRIPT_PATH" &&
grep -q "RETRY_COUNT -eq 2" "$SCRIPT_PATH" &&
grep -q "elif \[ \$RETRY_COUNT -eq 2 \]" "$SCRIPT_PATH" &&
grep -q "else" "$SCRIPT_PATH" &&
grep -q "RETRY_DELAY=\"0.25\"" "$SCRIPT_PATH" &&
grep -q "RETRY_DELAY=\"0.5\"" "$SCRIPT_PATH" &&
grep -q "RETRY_DELAY=\"1\"" "$SCRIPT_PATH" &&
grep -q "attempt 3+ -> 1s" "$SCRIPT_PATH" &&
grep -q "sleep \"\$RETRY_DELAY\"" "$SCRIPT_PATH"; then
Copilot AI Apr 16, 2026

These greps are likely to produce false positives/negatives because they (a) match very generic tokens (e.g., else) that may appear elsewhere, and (b) over-constrain formatting (exact bracket/spacing/quoting) while not guaranteeing the lines are part of the same backoff block. Consider asserting a single, more specific pattern (e.g., matching the exact if/elif/else lines including sleep \"$RETRY_DELAY\"), or limiting the search to the health-check section (e.g., by extracting a window around the backoff comment and matching within it). This will make the test resilient to harmless refactors and more accurately validate intent.

Suggested change
- if grep -q "RETRY_COUNT -eq 1" "$SCRIPT_PATH" &&
- grep -q "RETRY_COUNT -eq 2" "$SCRIPT_PATH" &&
- grep -q "elif \[ \$RETRY_COUNT -eq 2 \]" "$SCRIPT_PATH" &&
- grep -q "else" "$SCRIPT_PATH" &&
- grep -q "RETRY_DELAY=\"0.25\"" "$SCRIPT_PATH" &&
- grep -q "RETRY_DELAY=\"0.5\"" "$SCRIPT_PATH" &&
- grep -q "RETRY_DELAY=\"1\"" "$SCRIPT_PATH" &&
- grep -q "attempt 3+ -> 1s" "$SCRIPT_PATH" &&
- grep -q "sleep \"\$RETRY_DELAY\"" "$SCRIPT_PATH"; then
+ if awk '
+ /if \[ \$RETRY_COUNT -eq 1 \]; then/ { saw_if=1 }
+ saw_if && /RETRY_DELAY="0\.25"/ { saw_delay1=1 }
+ saw_delay1 && /elif \[ \$RETRY_COUNT -eq 2 \]; then/ { saw_elif=1 }
+ saw_elif && /RETRY_DELAY="0\.5"/ { saw_delay2=1 }
+ saw_delay2 && /else/ { saw_else=1 }
+ saw_else && /RETRY_DELAY="1"/ { saw_delay3=1 }
+ saw_delay3 && /attempt 3\+ -> 1s/ { saw_comment=1 }
+ saw_comment && /sleep "\$RETRY_DELAY"/ { saw_sleep=1 }
+ END { exit saw_sleep ? 0 : 1 }
+ ' "$SCRIPT_PATH"; then
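
The other option raised in this comment, limiting the search to the backoff block itself, might look like the sketch below. `extract_backoff_block` and `backoff_block_ok` are hypothetical helper names, and the patterns are taken from the retry-tiering snippet in the PR description rather than from the full script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: extract only the backoff block, then assert within it.

# Print the lines from the first `if [ $RETRY_COUNT -eq 1 ]` through the next
# `sleep "$RETRY_DELAY"`, inclusive. Prints nothing if the block never starts.
extract_backoff_block() {
  awk '
    /if \[ \$RETRY_COUNT -eq 1 \]/ { in_block = 1 }
    in_block { print }
    in_block && /sleep "\$RETRY_DELAY"/ { exit }
  ' "$1"
}

# Succeed only if all three tiers and the sleep call appear inside that window.
# grep -F treats each pattern as a fixed string, so "." and "$" are literal.
backoff_block_ok() {
  local block
  block="$(extract_backoff_block "$1")"
  [ -n "$block" ] &&
    printf '%s\n' "$block" | grep -qF 'RETRY_DELAY="0.25"' &&
    printf '%s\n' "$block" | grep -qF 'RETRY_DELAY="0.5"' &&
    printf '%s\n' "$block" | grep -qF 'RETRY_DELAY="1"' &&
    printf '%s\n' "$block" | grep -qF 'sleep "$RETRY_DELAY"'
}
```

Because every assertion runs against the extracted window, a stray `else` or an unrelated `RETRY_DELAY` assignment elsewhere in the script can no longer satisfy the check.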
@github-actions github-actions Bot mentioned this pull request Apr 16, 2026
@github-actions
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: N/A

ℹ️ No Go or JavaScript test files were modified in this PR; the shell test is outside scoring scope.

| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 0 (Go / JS) |
| ✅ Design tests (behavioral contracts) | |
| ⚠️ Implementation tests (low value) | |
| Tests with error/edge cases | |
| Duplicate test clusters | 0 |
| Test inflation detected | No |
| 🚨 Coding-guideline violations | None |

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs, *.test.js): 0 tests (vitest)

ℹ️ Shell test detected but outside scoring scope: actions/setup/sh/start_mcp_gateway_test.sh was modified (+9 / -1 lines). Shell tests are detected but excluded from the Go/JS behavioral scoring rubric.


Shell Test Observation (Informational)

The modified shell test function checks that start_mcp_gateway.sh contains the new 3-tier hardcoded backoff constants by running grep against the script source. This is a structural/implementation-style check (it verifies code patterns are present rather than exercising the runtime backoff behavior), which is typical for shell "pattern tests" but worth noting:

  • ✅ Covers all three delay tiers ("0.25", "0.5", "1") and the comment attempt 3+ -> 1s
  • ✅ Verifies the sleep "$RETRY_DELAY" invocation is present
  • ✅ Reflects the refactoring from the awk-based formula to the explicit if/elif/else structure
  • ⚠️ Does not exercise the backoff at runtime (no integration test executing the retry loop with a mock HTTP endpoint), but that is expected for this style of lightweight shell pattern test
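
For comparison, a runtime check of the tiers could shadow `sleep` with a shell function and inspect the delays that were requested. The sketch below is illustrative only: `run_backoff` stands in for the real retry loop (its tier logic mirrors the snippet in the PR description), and `RECORDED_DELAYS` is a hypothetical variable.

```shell
#!/usr/bin/env bash
# Hypothetical behavioral check: shadow `sleep` with a function so the loop
# under test records its delays instead of waiting.

RECORDED_DELAYS=""

# Shell functions take precedence over external commands, so this replaces
# /bin/sleep for code running in this shell.
sleep() {
  RECORDED_DELAYS="${RECORDED_DELAYS}${RECORDED_DELAYS:+ }$1"
}

# Stand-in for the retry loop with a probe that always fails.
run_backoff() {
  local max_retries="$1"
  local RETRY_COUNT=0
  local RETRY_DELAY
  while [ "$RETRY_COUNT" -lt "$max_retries" ]; do
    RETRY_COUNT=$((RETRY_COUNT + 1))
    if [ "$RETRY_COUNT" -eq 1 ]; then
      RETRY_DELAY="0.25"
    elif [ "$RETRY_COUNT" -eq 2 ]; then
      RETRY_DELAY="0.5"
    else
      RETRY_DELAY="1"
    fi
    sleep "$RETRY_DELAY"
  done
}

run_backoff 4
unset -f sleep   # restore the real sleep command

if [ "$RECORDED_DELAYS" = "0.25 0.5 1 1" ]; then
  echo "PASS: backoff tiers observed at runtime"
else
  echo "FAIL: got '$RECORDED_DELAYS'"
fi
```

A test of the real script would need the same kind of stub for the HTTP probe as well. The point of the sketch is that the assertion targets observed delays, so it would survive a refactor of the if/elif/else structure.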

Test-to-production line ratio: ~1:1 (9 test lines vs 9 production lines), well within the 2:1 threshold.


Verdict

Check passed. No Go or JavaScript tests were added or modified. No coding-guideline violations detected. The shell test update correctly tracks the production code refactoring.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: [§24527431225]

🧪 Test quality analysis by Test Quality Sentinel · ● 577.1K ·

Contributor

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: N/A score (no Go/JS tests in scope). No coding-guideline violations detected. The only test file changed is a shell script (start_mcp_gateway_test.sh) which is outside the Go/JavaScript scoring rubric. The shell test correctly reflects the production code refactoring.

Copilot AI requested a review from pelikhan April 16, 2026 18:44
@pelikhan pelikhan merged commit 0443968 into main Apr 16, 2026
53 of 54 checks passed
@pelikhan pelikhan deleted the copilot/fix-mcp-gateway-health-check branch April 16, 2026 18:46
Development

Successfully merging this pull request may close these issues.

MCP Gateway: port 80 health check fails with no retry on transient container startup delay

3 participants