Harden MCP Gateway startup health check against transient port-binding delays#26697
Harden MCP Gateway startup health check against transient port-binding delays#26697
Conversation
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c923f8eb-a610-4433-b2c6-cfae71b2811d Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
| let httpCode = 0; | ||
| let healthBody = ""; | ||
| let succeeded = false; | ||
| let attemptsMade = 0; |
There was a problem hiding this comment.
Updated to use the shared withRetry helper for MCP gateway health polling in actions/setup/js/start_mcp_gateway.cjs (commit 7682cd2), with equivalent retry behavior preserved.
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0e241816-4acc-4ea2-98b5-d6db16911e8a Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Hardens MCP Gateway startup readiness checks to reduce false negatives caused by transient port-binding delays by introducing bounded exponential backoff in both JS and shell startup paths.
Changes:
- Updated health-check polling in JS and shell to use exponential backoff capped at 1s while keeping the same retry horizon.
- Improved JS logging to report actual attempts made.
- Expanded shell test assertions to validate the new backoff configuration.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/sh/start_mcp_gateway.sh | Implements capped exponential backoff (250ms→500ms→1s) between health-check attempts. |
| actions/setup/js/start_mcp_gateway.cjs | Adds exponential backoff delays and correct attempt-count reporting in logs. |
| actions/setup/sh/start_mcp_gateway_test.sh | Adds assertions to confirm the backoff logic exists in the shell script. |
| .github/mcp.json | Introduces an MCP server configuration file (appears unrelated to gateway health-check changes). |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 2
| { | ||
| "mcpServers": { | ||
| "github-agentic-workflows": { | ||
| "command": "gh", | ||
| "args": [ | ||
| "aw", | ||
| "mcp-server" | ||
| ] | ||
| } | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
This new MCP configuration file isn’t mentioned in the PR description and appears unrelated to hardening the gateway startup health check. If it’s intentional, please add rationale to the PR description; otherwise consider removing it from this PR to keep scope focused.
| if grep -q "RETRY_COUNT -eq 1" "$SCRIPT_PATH" && | ||
| grep -q "RETRY_COUNT -eq 2" "$SCRIPT_PATH" && | ||
| grep -q "elif \[ \$RETRY_COUNT -eq 2 \]" "$SCRIPT_PATH" && | ||
| grep -q "else" "$SCRIPT_PATH" && | ||
| grep -q "RETRY_DELAY=\"0.25\"" "$SCRIPT_PATH" && | ||
| grep -q "RETRY_DELAY=\"0.5\"" "$SCRIPT_PATH" && | ||
| grep -q "RETRY_DELAY=\"1\"" "$SCRIPT_PATH" && | ||
| grep -q "attempt 3+ -> 1s" "$SCRIPT_PATH" && | ||
| grep -q "sleep \"\$RETRY_DELAY\"" "$SCRIPT_PATH"; then |
There was a problem hiding this comment.
These greps are likely to produce false positives/negatives because they (a) match very generic tokens (e.g., else) that may appear elsewhere, and (b) over-constrain formatting (exact bracket/spacing/quoting) while not guaranteeing the lines are part of the same backoff block. Consider asserting a single, more specific pattern (e.g., matching the exact if/elif/else lines including sleep \"$RETRY_DELAY\"), or limiting the search to the health-check section (e.g., by extracting a window around the backoff comment and matching within it). This will make the test resilient to harmless refactors and more accurately validate intent.
| if grep -q "RETRY_COUNT -eq 1" "$SCRIPT_PATH" && | |
| grep -q "RETRY_COUNT -eq 2" "$SCRIPT_PATH" && | |
| grep -q "elif \[ \$RETRY_COUNT -eq 2 \]" "$SCRIPT_PATH" && | |
| grep -q "else" "$SCRIPT_PATH" && | |
| grep -q "RETRY_DELAY=\"0.25\"" "$SCRIPT_PATH" && | |
| grep -q "RETRY_DELAY=\"0.5\"" "$SCRIPT_PATH" && | |
| grep -q "RETRY_DELAY=\"1\"" "$SCRIPT_PATH" && | |
| grep -q "attempt 3+ -> 1s" "$SCRIPT_PATH" && | |
| grep -q "sleep \"\$RETRY_DELAY\"" "$SCRIPT_PATH"; then | |
| if awk ' | |
| /if \[ \$RETRY_COUNT -eq 1 \]; then/ { saw_if=1 } | |
| saw_if && /RETRY_DELAY="0\.25"/ { saw_delay1=1 } | |
| saw_delay1 && /elif \[ \$RETRY_COUNT -eq 2 \]; then/ { saw_elif=1 } | |
| saw_elif && /RETRY_DELAY="0\.5"/ { saw_delay2=1 } | |
| saw_delay2 && /else/ { saw_else=1 } | |
| saw_else && /RETRY_DELAY="1"/ { saw_delay3=1 } | |
| saw_delay3 && /attempt 3\+ -> 1s/ { saw_comment=1 } | |
| saw_comment && /sleep "\$RETRY_DELAY"/ { saw_sleep=1 } | |
| END { exit saw_sleep ? 0 : 1 } | |
| ' "$SCRIPT_PATH"; then |
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0e241816-4acc-4ea2-98b5-d6db16911e8a Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0e241816-4acc-4ea2-98b5-d6db16911e8a Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
🧪 Test Quality Sentinel ReportTest Quality Score: N/Ai️ No Go or JavaScript test files were modified in this PR — shell test is outside scoring scope.
Language SupportTests analyzed:
Shell Test Observation (Informational)The modified shell test function checks that
Test-to-production line ratio: ~1.3:1 (9 test lines vs 9 production lines) — well within the 2:1 threshold. Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: [§24527431225]
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: N/A score (no Go/JS tests in scope). No coding-guideline violations detected. The only test file changed is a shell script (start_mcp_gateway_test.sh) which is outside the Go/JavaScript scoring rubric. The shell test correctly reflects the production code refactoring.
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0e241816-4acc-4ea2-98b5-d6db16911e8a Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Start MCP Gatewaycould fail on transient startup races: the gateway process launched, but/healthpolling used rigid timing and could exit before port 80 became reachable under runner contention. This updates startup probing to tolerate short bind delays without changing failure semantics for genuinely unhealthy gateways.Health-check timing now uses bounded exponential backoff
actions/setup/js/start_mcp_gateway.cjsactions/setup/sh/start_mcp_gateway.sh250ms -> 500ms -> 1s) and then stays capped, while preserving the existing long retry horizon (120attempts).Shell implementation avoids per-retry subprocess overhead
Startup logging/attempt accounting tightened
Focused shell test coverage updated for backoff config
actions/setup/sh/start_mcp_gateway_test.shassertions were expanded to verify the expected backoff tiers and retry-delay usage in the startup script.