[CI Failure Doctor] CI Failure Investigation - Run #36032 #16125

@github-actions

🏥 CI Failure Investigation - Run #36032

Summary

The Integration: CLI Completion & Other job fails because TestMCPRegistryClient_LiveGetServer calls the live MCP registry, and the registry is returning 503 (upstream connect error or disconnect/reset before headers, with delayed connect error: Connection refused). As a result the test cannot fetch io.github.netdata/mcp-server from the registry.

Failure Details

  • Run: 22068117409
  • Commit: 5e5b9d282752b1430867cdc76a09603348c08d4c
  • Trigger: push

Root Cause Analysis

  1. TestMCPRegistryClient_LiveGetServer exercises GetServer against the live MCP registry. The registry returned 503 (upstream connect error or disconnect/reset before headers), and the latest retry reported delayed connect error: Connection refused, so the subtests cannot complete.
  2. Both subtests (get_github_server and get_nonexistent_server) assert on specific output but receive the same 503, which is treated as a test failure instead of being skipped or mocked.

Failed Jobs and Errors

  • Integration: CLI Completion & Other: TestMCPRegistryClient_LiveGetServer/get_github_server
    • mcp_registry_live_test.go:141: GetServer failed for 'io.github.netdata/mcp-server': MCP registry returned status 503: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
  • Integration: CLI Completion & Other: TestMCPRegistryClient_LiveGetServer/get_nonexistent_server
    • mcp_registry_live_test.go:175: Expected error to contain 'not found in registry', got: MCP registry returned status 503: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused

Investigation Findings

  • Running go test -v -tags integration ./pkg/cli -run TestMCPRegistryClient_LiveGetServer against the live registry reproduces the 503/delayed connect error: the test requests io.github.netdata/mcp-server from the registry, and the registry is currently refusing connections.
  • The integration suite therefore fails before reporting a specific test since the package-level run detects the panic/failure and aborts, logging that no individual test passed cleanly.

Recommended Actions

  • Guard TestMCPRegistryClient_LiveGetServer (and similar MCP live tests) so that a 5xx or delayed-connect response skips the test instead of failing the suite, e.g., detect the 503 and call t.Skip when the registry is unreachable.
  • Replace the live MCP dependency in CI with a stub or canned response when possible so transient outages do not break the workflow.
  • Rerun the integration job after MCP connectivity is restored to confirm there are no additional regressions.
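The skip guard recommended above could look roughly like the following. This is a minimal sketch, not the project's code: isRegistryOutage is a hypothetical helper, and it assumes the client keeps reporting errors in the "MCP registry returned status NNN: ..." shape seen in the log.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// statusRE pulls the HTTP status out of messages shaped like
// "MCP registry returned status 503: ..." (the format seen in the CI log).
var statusRE = regexp.MustCompile(`returned status (\d{3})`)

// isRegistryOutage reports whether an error message describes an upstream
// outage (any 5xx status, or a refused connection when no status is present)
// rather than a genuine test failure.
// Hypothetical helper: the name and the string matching are assumptions.
func isRegistryOutage(msg string) bool {
	if m := statusRE.FindStringSubmatch(msg); m != nil {
		return m[1][0] == '5'
	}
	return strings.Contains(strings.ToLower(msg), "connection refused")
}

func main() {
	outage := "MCP registry returned status 503: upstream connect error or disconnect/reset before headers"
	fmt.Println(isRegistryOutage(outage))                             // true
	fmt.Println(isRegistryOutage("server 'x' not found in registry")) // false
}
```

Inside a live test, the guard would sit right after the GetServer call: if err is non-nil and isRegistryOutage(err.Error()) is true, call t.Skipf with the error instead of t.Fatalf. Matching on a typed error or status code would be sturdier than string matching if the client exposes one.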

Prevention Strategies

  • Avoid calling production MCP services directly from CI without handling known failure modes (503s, connection refused, etc.) and mark the tests as flaky or skipped when the service is down.
  • Use local stubs or recorded fixtures for MCP responses in GitHub Actions so network availability does not gate the whole suite.
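One way to stub the registry locally is Go's net/http/httptest package. The sketch below is illustrative only: the /servers/known path, the JSON payload, and the helper names newStubRegistry and fetch are assumptions, not the real MCP registry API or the project's client.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newStubRegistry starts a local HTTP server that returns canned responses.
// The route and payload are hypothetical placeholders for recorded fixtures.
func newStubRegistry() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/servers/known", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"name":"example/known-server"}`)
	})
	// Any other path behaves like a missing registry entry.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "not found in registry", http.StatusNotFound)
	})
	return httptest.NewServer(mux)
}

// fetch performs a GET and returns the status code and response body.
func fetch(url string) (int, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	srv := newStubRegistry()
	defer srv.Close()

	code, body, err := fetch(srv.URL + "/servers/known")
	fmt.Println(code, body, err)

	code, _, err = fetch(srv.URL + "/servers/missing")
	fmt.Println(code, err)
}
```

A test would point the registry client at srv.URL instead of the production endpoint, which keeps both the found and not-found paths deterministic in CI.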

AI Team Self-Improvement

  • When generating tests that talk to MCP or other external services, guard them with explicit skip/retry logic and explain that 5xx/delayed connect errors should not be treated as regressions.
  • Prefer mocking remote MCP responses in CI workflows so the tests stay deterministic even if the upstream service is temporarily unreachable.
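The retry half of that guard can be sketched as a small helper with exponential backoff. The names retry and errUnavailable are hypothetical, and the attempt count and delay are arbitrary illustration values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable stands in for a transient upstream error (hypothetical).
var errUnavailable = errors.New("registry unavailable")

// retry calls fn up to attempts times, doubling the delay between tries.
// It returns nil on the first success, or the last error wrapped with the
// attempt count once all tries are exhausted.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errUnavailable // first two calls simulate an outage
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

If every attempt still fails with an outage-style error, a live test would then skip rather than fail, so a prolonged registry outage is not reported as a regression.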

Historical Context

  • Run #35694 had the same TestMCPRegistryClient_LiveGetServer failure because the MCP registry returned a 503; see #15700 for the prior investigation.

🩺 Diagnosis provided by CI Failure Doctor

To install this workflow, run gh aw add githubnext/agentics/workflows/ci-doctor.md@ea350161ad5dcc9624cf510f134c6a9e39a6f94d. View source at https://github.com/githubnext/agentics/tree/ea350161ad5dcc9624cf510f134c6a9e39a6f94d/workflows/ci-doctor.md.

  • expires on Feb 17, 2026, 3:26 PM UTC

Labels: cookie (Issue Monster Loves Cookies!)
