
fix: defer heavy imports in MCP server for first-connect reliability#380

Merged
carlos-alm merged 5 commits into main from fix/mcp-first-connect-resilience
Mar 9, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Root cause: startMCPServer was await import()-ing queries.js and require()-ing better-sqlite3 before connecting the stdio transport. On cold start these imports are slow, so the client's initialize request arrives before the server is listening — the message is lost and connection fails.
  • Fix: Moved heavy imports into lazy-loaded helpers (getQueries() / getDatabase()) that run on first tool call and are cached thereafter. The transport now connects immediately after the MCP SDK loads, so the server is ready to receive initialize right away.
  • Reconnect always worked because Node.js caches the modules from the first attempt.
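
The lazy-loading helpers described above can be sketched roughly as follows. This is a minimal illustration, not the code from src/mcp.js: the built-in modules node:path and node:zlib stand in for queries.js and better-sqlite3, and the loadCount counter is a hypothetical addition that only exists to show each module is loaded once.

```javascript
// Minimal sketch of cached lazy loaders (CommonJS).
// Assumption: node:path and node:zlib are stand-ins for ./queries.js and
// better-sqlite3; loadCount is illustrative only.
let _queries;      // cached module namespace from the dynamic import
let _Database;     // cached export from the synchronous require
let loadCount = 0; // how many real loads have happened

async function getQueries() {
  if (!_queries) {
    loadCount += 1;
    _queries = await import('node:path'); // stand-in for import('./queries.js')
  }
  return _queries;
}

function getDatabase() {
  if (!_Database) {
    loadCount += 1;
    _Database = require('node:zlib'); // stand-in for require('better-sqlite3')
  }
  return _Database;
}
```

After the first call, each helper only checks a module-level variable, so repeated tool calls do not pay the load cost again.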

Test plan

  • All 37 MCP unit tests pass
  • Smoke test: piped initialize JSON-RPC request gets immediate response
  • Manual: verify /mcp connects on first try in Claude Code
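
The smoke test can be approximated by building the initialize message in a small script. The sketch below assumes the newline-delimited JSON framing used by MCP stdio transports; the protocolVersion string and the clientInfo values are placeholder assumptions, and the command that launches the server is left out because the PR does not name it.

```javascript
// Build a JSON-RPC 2.0 `initialize` request for an MCP stdio server.
// Assumption: protocolVersion and clientInfo are placeholder values.
function buildInitializeRequest(id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      capabilities: {},
      clientInfo: { name: 'smoke-test', version: '0.0.0' },
    },
  };
}

// stdio framing: one JSON message per line, terminated by a newline.
function frame(message) {
  return JSON.stringify(message) + '\n';
}
```

Writing frame(buildInitializeRequest()) to the server's stdin right after spawning it exercises the cold-start path that the fix targets.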

…liability

The MCP server was importing queries.js and better-sqlite3 before
connecting the stdio transport, causing the client's `initialize`
request to arrive before the server was listening on stdin. Now the
transport connects immediately after the MCP SDK loads, and heavy
modules are lazy-loaded on the first tool call (cached thereafter).

Impact: 3 functions changed, 2 affected
@claude

claude bot commented Mar 9, 2026

Claude finished @carlos-alm's task —— View job


PR Review: MCP Server First-Connect Reliability Fix

Todo List:

  • Read and analyze the changed file (src/mcp.js)
  • Apply Phase 1: Root Cause Analysis
  • Apply Phase 2: Configuration Change Detection
  • Apply Phase 3: Test Integrity Check
  • Apply Phase 4: Code Quality Assessment
  • Provide structured review with rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Race condition in which the client's initialize request arrives before the MCP server has finished loading its heavy dependencies (queries.js and better-sqlite3), causing first-connection failures.

Why This Approach: The solution is justified and addresses the actual root cause. By connecting the stdio transport immediately after MCP SDK loads, the server can receive the initialize request while still loading dependencies asynchronously. This is a genuine timing fix, not symptom masking.

Risk Assessment: LOW - The change preserves all functionality while improving reliability through proper async loading patterns.

Phase 2: Configuration Change Detection ✅

NO configuration changes detected - no linting rules modified, no test thresholds lowered, no build configuration changes.

Phase 3: Test Integrity ✅

NO test integrity issues - no test files modified, no assertions weakened, no test coverage compromised.

Phase 4: Code Quality Assessment ✅

Implementation Quality: The lazy loading pattern is well-implemented:

  • getQueries() and getDatabase() use proper caching with module-level variables
  • Error handling preserved in both sync and async paths
  • No memory leaks or duplicate imports possible

Potential Issues Identified:

  • Minor: The removed top-level import of diffImpactMermaid and other exports from queries.js (line 13) might cause issues if other modules expect these to be available - but analysis shows they're only used within the MCP tool handlers
  • Edge case: If getQueries() import fails, subsequent tool calls will retry the import - this is actually good behavior
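
The retry behavior in the edge case above follows from caching only a successful load. A minimal sketch, assuming a generic hypothetical `lazy` helper rather than the actual getQueries() implementation:

```javascript
// lazy(loader) returns a getter that runs `loader` at most once on success.
// A rejected load is not cached, so the next call tries again.
// Assumption: `lazy` and `getFlaky` are hypothetical names for illustration.
function lazy(loader) {
  let cached;  // resolved value, once a load has succeeded
  let pending; // in-flight promise, shared by concurrent callers
  return function get() {
    if (cached !== undefined) return Promise.resolve(cached);
    if (!pending) {
      pending = Promise.resolve()
        .then(loader)
        .then(
          (value) => { cached = value; pending = undefined; return value; },
          (err) => { pending = undefined; throw err; },
        );
    }
    return pending;
  };
}

// Demo loader that fails on its first attempt only.
let attempts = 0;
const getFlaky = lazy(async () => {
  attempts += 1;
  if (attempts === 1) throw new Error('module not ready');
  return { ok: true };
});
```

Sharing the in-flight promise also keeps two concurrent first calls from triggering two loads.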

Performance Impact: POSITIVE - eliminates ~100-500ms cold-start delay by deferring heavy SQLite native binding load.

Security: No security concerns - the changes don't introduce new attack vectors.

Maintainability: GOOD - the lazy loading pattern is clean and follows common Node.js practices.

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Race condition in MCP server startup where client `initialize` requests arrive before heavy dependency loading completes
**Why This Approach**: Properly addresses timing issue by connecting transport first, then lazy-loading dependencies on demand
**Risk Assessment**: LOW - maintains all functionality while improving first-connection reliability

## Critical Concerns
None identified. This is a clean architectural improvement that addresses the actual root cause.

## Final Recommendation
- Rating: ⭐⭐⭐⭐☆ (4/5)
- Action: APPROVE
- Reasoning: Well-architected fix addressing a real timing issue. Good use of lazy loading patterns with proper caching. All tests pass and implementation is clean.

Minor suggestion: Consider adding a comment in the code explaining the lazy loading rationale for future maintainers, but this is not blocking.


@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR fixes the first-connect reliability problem in the MCP server by deferring require('better-sqlite3') until the first tool call, so the stdio transport is connected and ready to receive the client's initialize request before any slow module work begins. The main fix is solid, and the previously flagged issues (lazy loads outside the try-catch, unconditional loading on every call) have been addressed by placing the lazy-load helpers inside the try-catch block, where errors are handled and the loaded modules are cached for subsequent calls.

Confidence Score: 5/5

  • This PR is safe to merge. The core fix correctly addresses the first-connect reliability issue by deferring heavy imports and ensuring the transport is ready before import work begins.
  • The fix is functionally correct and properly handles errors. The deferred imports are now safely placed inside the try-catch block where failures return well-formed MCP error responses. All 37 MCP unit tests pass. The reported first-connect issue is resolved.
  • No files require special attention

Sequence Diagram

sequenceDiagram
    participant Client as MCP Client
    participant Server as MCP Server (Node.js)
    participant SQLite as better-sqlite3 (lazy via getDatabase)

    Note over Server: startMCPServer() called
    Note over Server: _Database var declared
    Note over Server: server.setRequestHandler() registered (sync)
    Server->>Client: server.connect(transport) — transport ready NOW
    Client->>Server: initialize request
    Server->>Client: initialize response (immediate)

    Client->>Server: tool call (e.g. query)
    Note over Server: First tool call enters handler
    Server->>SQLite: require('better-sqlite3') → loads native addon
    SQLite-->>Server: Database constructor cached in _Database
    Server->>Server: execute tool logic
    Server-->>Client: tool result

    Client->>Server: subsequent tool calls
    Note over Server: _Database already cached → no reload
    Server-->>Client: tool result (fast path)

Last reviewed commit: 9250341
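
The ordering in the sequence diagram can be simulated without the MCP SDK. In this sketch, fakeTransport is a hypothetical stand-in for the stdio transport and node:crypto stands in for better-sqlite3; the events array only records the order in which things happen.

```javascript
// Simulated startup ordering: connect first, load the heavy module on the
// first tool call, reuse the cached module afterwards.
// Assumptions: fakeTransport and node:crypto are stand-ins, not the real
// transport or better-sqlite3.
const events = [];

const fakeTransport = {
  async start() {
    events.push('transport connected');
  },
};

let _heavy;
function getHeavy() {
  if (!_heavy) {
    events.push('heavy module loaded');
    _heavy = require('node:crypto');
  }
  return _heavy;
}

async function startServer(transport) {
  await transport.start(); // nothing heavy has been loaded at this point
  return async function handleToolCall(name) {
    const crypto = getHeavy(); // loads on the first call only
    events.push(`tool call: ${name}`);
    return crypto.createHash('sha256').update(name).digest('hex');
  };
}
```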

src/mcp.js Outdated
Comment on lines +872 to +888
const {
impactAnalysisData,
moduleMapData,
fileDepsData,
exportsData,
fnDepsData,
fnImpactData,
pathData,
contextData,
childrenData,
explainData,
whereData,
diffImpactData,
listFunctionsData,
rolesData,
} = await getQueries();
const Database = getDatabase();
Contributor

Lazy loads are outside the try-catch error boundary

getQueries() and getDatabase() are invoked at the top of the handler — before the try block on line 890. If import('./queries.js') fails (e.g. a missing dependency or a syntax error in that file) or if require('better-sqlite3') throws (native addon not found, platform mismatch — a common scenario), the error propagates as an unhandled promise rejection and bypasses the handler's own catch clause that returns a well-formed { content, isError: true } MCP response.

In the original code this was a startup failure (loud, obvious). With the new design it silently kills the first tool call and may crash the process, which is a harder regression to debug.

Move these two calls inside the try block:

Suggested change
- const {
-   impactAnalysisData,
-   moduleMapData,
-   fileDepsData,
-   exportsData,
-   fnDepsData,
-   fnImpactData,
-   pathData,
-   contextData,
-   childrenData,
-   explainData,
-   whereData,
-   diffImpactData,
-   listFunctionsData,
-   rolesData,
- } = await getQueries();
- const Database = getDatabase();
+ try {
+   const {
+     impactAnalysisData,
+     moduleMapData,
+     fileDepsData,
+     exportsData,
+     fnDepsData,
+     fnImpactData,
+     pathData,
+     contextData,
+     childrenData,
+     explainData,
+     whereData,
+     diffImpactData,
+     listFunctionsData,
+     rolesData,
+   } = await getQueries();
+   const Database = getDatabase();
+   if (!multiRepo && args.repo) {

Contributor Author

Fixed in 9250341 — moved await getQueries() and getDatabase() inside the try block. Load failures now return a well-formed MCP error response instead of surfacing as unhandled promise rejections.
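
The effect of moving the loads inside the try block can be sketched as below. The handler shape is a simplified assumption: loadDeps is a hypothetical stand-in for the getQueries() / getDatabase() pair, and deps.run is a made-up function representing the tool logic.

```javascript
// Sketch of a tool-call handler that keeps lazy loads inside the try block,
// so a load failure becomes a well-formed MCP error result.
// Assumption: loadDeps and deps.run are hypothetical stand-ins.
async function handleToolCall(name, loadDeps) {
  try {
    const deps = await loadDeps(); // may throw, e.g. native addon missing
    const result = await deps.run(name);
    return { content: [{ type: 'text', text: JSON.stringify(result) }] };
  } catch (err) {
    return {
      content: [{ type: 'text', text: `Error: ${err.message}` }],
      isError: true,
    };
  }
}
```

With the loads above the try block, the same failure would reject the handler's promise instead of producing the isError response.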

src/mcp.js Outdated
Comment on lines +872 to +888
const {
impactAnalysisData,
moduleMapData,
fileDepsData,
exportsData,
fnDepsData,
fnImpactData,
pathData,
contextData,
childrenData,
explainData,
whereData,
diffImpactData,
listFunctionsData,
rolesData,
} = await getQueries();
const Database = getDatabase();
Contributor

Queries and Database loaded eagerly on every tool call

getQueries() and getDatabase() are now called unconditionally at the top of every CallToolRequestSchema invocation, regardless of whether the active tool needs them. Tools like semantic_search, structure, co_changes, sequence, complexity, communities, code_owners, branch_compare, cfg, dataflow, ast_query, and list_repos all rely exclusively on their own case-scoped await import(...) calls and never touch the destructured query functions or the top-level Database reference.

This means queries.js and better-sqlite3 will both be forced to load on the very first tool call — even if that call is, say, semantic_search. The original intent of deferring heavy imports until they are actually needed is partially defeated for those code paths.

Consider keeping the lazy-load calls inside only the switch cases that require them (or at minimum document why eagerly loading on the first call is an acceptable trade-off).

Contributor Author

Addressed in 9250341 — moved the calls inside the try-catch. Since both helpers cache after first load, the per-call overhead after the first invocation is just a variable check. Keeping them at handler level (inside try) is the right tradeoff: proper error handling without splitting them across dozens of switch cases.
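
For comparison, the per-tool alternative raised in the review could look roughly like this. It is a sketch of that option, not what was merged: the NEEDS_QUERIES set and its tool names are illustrative assumptions, and node:path again stands in for queries.js.

```javascript
// Sketch: load the queries module only for tools that actually use it.
// Assumptions: NEEDS_QUERIES and its entries are illustrative; node:path is a
// stand-in for ./queries.js; queriesLoads exists only to show the effect.
const NEEDS_QUERIES = new Set(['query', 'impact']);

let _queries;
let queriesLoads = 0;

async function getQueries() {
  if (!_queries) {
    queriesLoads += 1;
    _queries = await import('node:path');
  }
  return _queries;
}

async function handleToolCall(name) {
  // Tools outside the set never trigger the load.
  const queries = NEEDS_QUERIES.has(name) ? await getQueries() : null;
  return { tool: name, usedQueries: queries !== null };
}
```

The merged version trades this selectivity for a single load site at handler level, which keeps the error handling in one place.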

@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Re: the minor suggestion about adding a comment explaining lazy loading rationale — this is already covered in the code at lines 840-842:

// Connect transport FIRST so the server can receive the client's
// `initialize` request while heavy modules (queries, better-sqlite3)
// are still loading.  These are lazy-loaded on the first tool call
// and cached for subsequent calls.

No additional changes needed.

@carlos-alm carlos-alm merged commit 417e402 into main Mar 9, 2026
16 checks passed
@carlos-alm carlos-alm deleted the fix/mcp-first-connect-resilience branch March 9, 2026 04:22
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2026