fix: MCP reliability — singleflight context poisoning, error observability, LLM retry #7

Merged
thebtf merged 3 commits into main from fix/mcp-reliability on Mar 14, 2026

Conversation

@thebtf
Owner

@thebtf thebtf commented Mar 14, 2026

Summary

Fixes two production reliability bugs in the engram MCP server (Docker/Unraid) that were discovered during the PR #6 review session.

  • singleflight context poisoning (search/manager.go): singleflight.Do replaced with DoChan + detached context. When concurrent identical search requests share a singleflight group, the first caller's HTTP context cancellation no longer poisons the shared pgvector query for all coalesced waiters.
  • Tool error observability (mcp/server.go): Error text now included in MCP Message field (not just Data), making errors visible in Claude Code SDK. Server-side logging added for all tool call failures.
  • LLM backend diagnostics (worker/sdk/processor.go): Expanded transient retry list (connection refused, no such host, 500, 502, 504). Startup log with backend config summary. Final error message includes backend state and last error for debugging.
  • Schema consistency (mcp/server.go): Added explicit empty required array to search tool schema.

Test plan

  • go build ./cmd/worker/ — compiles
  • go vet ./internal/mcp/ ./internal/search/ ./internal/worker/sdk/ — clean
  • go test ./internal/search/... — pass
  • go test ./internal/mcp/... — pass
  • go test ./internal/worker/sdk/... — pre-existing Windows path failures only
  • Deploy to Docker → verify /api/logs shows "SDK processor backend summary" at startup
  • Long session test: search tool should not degrade after 30+ min

Summary by CodeRabbit

Release notes

  • Bug fixes
    • Improved error messages on tool failure, with added detail.
    • Improved reliability of search request handling, with better protection against operation cancellation.
    • Expanded error diagnostics when the backend is unavailable, with more detailed information.

thebtf added 3 commits March 14, 2026 13:10

  • Tool errors now include the actual error text in the Message field
    (not just the Data field), making errors visible in Claude Code SDK.
    Added server-side logging of all tool call failures with tool name.
    Added explicit empty required array to search tool schema for consistency.
  • Replace singleflight.Do with DoChan + detached context (10s timeout).
    When concurrent identical searches share a singleflight group, the first
    caller's HTTP context cancellation no longer propagates to the shared
    pgvector query, preventing -32000 errors in long MCP sessions.
  • Add startup log with backend configuration summary (llm_configured,
    url, model). Expand transient error list with connection refused,
    no such host, 500, 502, 504. Include backend state and last error
    in the final "no LLM backend available" message for easier debugging.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several critical reliability issues within the MCP server, focusing on robust error handling and improved system stability. The changes prevent context propagation problems in concurrent search operations, enhance the visibility of tool execution errors for better debugging, and fortify the LLM integration with more comprehensive retry mechanisms and diagnostic logging.

Highlights

  • Context Poisoning Fix: Prevented singleflight context cancellation from affecting shared search requests by using DoChan with a detached context, ensuring independent execution.
  • Improved Tool Error Observability: Enhanced error reporting for tool calls by including detailed error messages in the Message field of JSON-RPC responses and adding server-side logging for failures.
  • LLM Backend Reliability: Expanded the list of transient errors that trigger LLM retry logic and improved error messages for unavailable LLM backends to provide more diagnostic information.
  • Schema Consistency: Ensured the search tool schema explicitly defines an empty required array for better consistency.
Changelog
  • internal/mcp/server.go
    • Added an explicit empty required array to the search tool's JSON schema definition.
    • Included the full error message in the Message field of JSON-RPC error responses for tool calls.
    • Added server-side error logging for failed tool calls.
  • internal/search/manager.go
    • Refactored UnifiedSearch to use singleflight.DoChan instead of singleflight.Do.
    • Implemented a detached context with a timeout for the executeSearch operation within singleflight to prevent context poisoning.
    • Adjusted error handling to correctly process results from the DoChan channel, checking for context cancellation and execution errors.
  • internal/worker/sdk/processor.go
    • Introduced a new startup log entry summarizing the SDK processor's backend configuration, including LLM and local verification status.
    • Expanded the list of transient errors that trigger LLM retry logic to include 'connection refused', 'no such host', '500', '502', and '504' HTTP status codes.
    • Modified the final error message for unavailable LLM backends to provide more diagnostic information, such as LLM configuration status and the last encountered error.
Activity
  • Issues were identified during a review session for a previous pull request (refactor: remove local-server assumptions, JS-only hooks #6).
  • The author has verified that the code compiles, passes go vet checks, and all relevant unit tests (internal/search/..., internal/mcp/..., internal/worker/sdk/...) pass, with existing known Windows path failures noted.
  • Further testing is planned, including deployment verification and long-session stability tests for the search tool.

@coderabbitai

coderabbitai Bot commented Mar 14, 2026

Walkthrough

The changes improve error handling and logging in the MCP server, optimize request coalescing in search by using a context with a timeout, and extend error diagnostics in the LLM processor with better handling of transient errors.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MCP tool server: internal/mcp/server.go | Added the "required": []string{} field to the input schema of the first tool; improved logging and error text on tool call failure for better diagnostics. |
| Search optimization: internal/search/manager.go | Replaced singleflight Do with DoChan for asynchronous result handling; introduced a separate context with a 10-second timeout to prevent cache poisoning when a request is canceled; switched to channel-based result handling with support for context cancellation. |
| LLM error handling: internal/worker/sdk/processor.go | Added diagnostic logging of the LLM backend configuration; expanded transient error handling (connection refused, no such host, HTTP 5xx); improved the error message with contextual information when all backends are unavailable. |
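
One plausible reason to write `[]string{}` explicitly, offered here as an assumption since the PR only cites consistency, is how `encoding/json` treats slices: a nil slice encodes as `null`, while an empty non-nil slice encodes as `[]`. The `inputSchema` struct and `encode` helper below are hypothetical and only demonstrate that difference.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inputSchema is a hypothetical, simplified tool input schema.
type inputSchema struct {
	Type       string         `json:"type"`
	Properties map[string]any `json:"properties"`
	Required   []string       `json:"required"`
}

// encode marshals a schema with the given required list.
func encode(required []string) string {
	b, err := json.Marshal(inputSchema{
		Type:       "object",
		Properties: map[string]any{},
		Required:   required,
	})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(nil))        // {"type":"object","properties":{},"required":null}
	fmt.Println(encode([]string{})) // {"type":"object","properties":{},"required":[]}
}
```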

Possibly related PRs

  • feat: collection MCP tools and instinct import #2: Both PRs modify the tool logic in internal/mcp/server.go (handleToolsList / handleToolsCall), which handles the tool catalog and tool calls.

Suggested labels

ai-review:passed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In search a magic channel waited its turn,
A timeout saves us from mischievous errors,
The logs now sing of the tools' misfortune,
The hare works with sharper precision today!
Network errors are recognized with honor! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the three main changes: singleflight context poisoning fix, error observability improvements, and LLM retry enhancements, matching the PR's objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several important reliability improvements. The fix for singleflight context poisoning in search/manager.go is well-implemented using DoChan and a detached context. The enhancements to error observability in mcp/server.go and worker/sdk/processor.go are valuable for debugging. The expanded list of transient errors for LLM retries is also a good addition. I've added one suggestion in worker/sdk/processor.go to improve the maintainability of the transient error check.

Comment thread internal/worker/sdk/processor.go
@thebtf
Owner Author

thebtf commented Mar 14, 2026

@coderabbitai review

@thebtf
Owner Author

thebtf commented Mar 14, 2026

@codex review

@coderabbitai

coderabbitai Bot commented Mar 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/worker/sdk/processor.go`:
- Around line 691-697: The retry logic in the strings.Contains checks (the
conditional using errStr and strings.Contains) must match the actual LLM client
error format ("LLM API error 500: ..."), so don't replace the existing "500"
check with patterns like "status 500" or "500 Internal Server Error"; instead
update the condition to look for the real token (e.g., strings.Contains(errStr,
"LLM API error 500") or at minimum strings.Contains(errStr, "API error 500")) or
keep the existing generic "500" check to preserve behavior; modify the
conditional in internal/worker/sdk/processor.go where errStr is examined to use
one of these exact substrings so 5xx errors from the LLM client are correctly
caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 296e9c10-4c36-46c4-aad3-ac881143a38e

📥 Commits

Reviewing files that changed from the base of the PR and between 7c8a144 and d194551.

📒 Files selected for processing (3)
  • internal/mcp/server.go
  • internal/search/manager.go
  • internal/worker/sdk/processor.go

Comment thread internal/worker/sdk/processor.go
@thebtf thebtf merged commit 654de91 into main Mar 14, 2026
2 checks passed
@thebtf thebtf deleted the fix/mcp-reliability branch March 14, 2026 11:12