fix: MCP reliability — singleflight context poisoning, error observability, LLM retry by thebtf · Pull Request #7 · thebtf/engram

thebtf · 2026-03-14T10:30:11Z

Summary

Fixes two production reliability bugs on the engram MCP server (Docker/Unraid), discovered during the PR #6 review session.

singleflight context poisoning (search/manager.go): singleflight.Do replaced with DoChan + detached context. When concurrent identical search requests share a singleflight group, the first caller's HTTP context cancellation no longer poisons the shared pgvector query for all coalesced waiters.
Tool error observability (mcp/server.go): Error text now included in MCP Message field (not just Data), making errors visible in Claude Code SDK. Server-side logging added for all tool call failures.
LLM backend diagnostics (worker/sdk/processor.go): Expanded transient retry list (connection refused, no such host, 500, 502, 504). Startup log with backend config summary. Final error message includes backend state and last error for debugging.
Schema consistency (mcp/server.go): Added explicit empty required array to search tool schema.
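On the schema item: one reason to spell out an empty slice in Go is that `encoding/json` encodes a nil slice as `null` but an empty slice as `[]`. The sketch below only illustrates that difference. It assumes the schema is built as a `map[string]any`, and `buildSchema` is a hypothetical helper, since the actual schema code in mcp/server.go is not shown in this description (nor whether the field was previously nil or simply absent).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildSchema is a hypothetical helper used only for illustration; it is not
// the schema construction from internal/mcp/server.go.
func buildSchema(required []string) string {
	schema := map[string]any{
		"type":       "object",
		"properties": map[string]any{},
		"required":   required,
	}
	out, err := json.Marshal(schema)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	var unset []string                   // nil slice
	fmt.Println(buildSchema(unset))      // "required" is encoded as null
	fmt.Println(buildSchema([]string{})) // "required" is encoded as []
}
```

Clients that validate tool schemas strictly may treat `null` and `[]` differently, which is a plausible motivation for the explicit `[]string{}`.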

Test plan

go build ./cmd/worker/ — compiles
go vet ./internal/mcp/ ./internal/search/ ./internal/worker/sdk/ — clean
go test ./internal/search/... — pass
go test ./internal/mcp/... — pass
go test ./internal/worker/sdk/... — pre-existing Windows path failures only
Deploy to Docker → verify /api/logs shows "SDK processor backend summary" at startup
Long session test: search tool should not degrade after 30+ min

Summary by CodeRabbit

Release notes

Bug fixes
- Improved error messages for tool failures by adding details.
- Improved the reliability of search request handling with better protection against operation cancellation.
- Expanded error diagnostics for an unavailable backend with more detailed information.

Tool errors now include the actual error text in the Message field (not just the Data field), making errors visible in Claude Code SDK. Added server-side logging of all tool call failures with tool name. Added explicit empty required array to search tool schema for consistency.
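A rough sketch of the idea in the commit message above: put the error text into the JSON-RPC `message` field as well as `data`, and log the failure with the tool name. The `rpcError` type, the `toolCallError` helper, the message wording, and the use of the standard `log` package are assumptions for illustration. The -32000 code is taken from the search fix description, and the server's real types and logger are not shown here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// rpcError mirrors the JSON-RPC 2.0 error object (code, message, data).
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
	Data    any    `json:"data,omitempty"`
}

// errDemo is a sample failure used by main below.
var errDemo = errors.New("context canceled")

// toolCallError is a hypothetical helper: it logs the failure server-side and
// copies the error text into Message so clients that only show Message see it.
func toolCallError(tool string, err error) rpcError {
	log.Printf("tool call failed: tool=%s err=%v", tool, err)
	return rpcError{
		Code:    -32000, // implementation-defined server error range
		Message: fmt.Sprintf("tool %q failed: %v", tool, err),
		Data:    err.Error(),
	}
}

func main() {
	out, err := json.Marshal(toolCallError("search", errDemo))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```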

Replace singleflight.Do with DoChan + detached context (10s timeout). When concurrent identical searches share a singleflight group, the first caller's HTTP context cancellation no longer propagates to the shared pgvector query, preventing -32000 errors in long MCP sessions.
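The detached-context half of this change can be sketched with the standard library alone. Here a plain buffered channel stands in for the `<-chan singleflight.Result` that `DoChan` returns, and request coalescing itself is not reproduced. `startShared`, `await`, and `demo` are hypothetical names, and the use of `context.Background()` for detaching is an assumption, since the PR description does not say how the context is detached.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// result stands in for singleflight.Result in this sketch.
type result struct {
	Val string
	Err error
}

// startShared runs work under a context that is NOT derived from any caller,
// so one caller's cancellation cannot abort it; it still has its own deadline.
// (context.WithoutCancel, Go 1.21+, is an alternative that keeps context values.)
func startShared(timeout time.Duration, work func(context.Context) (string, error)) <-chan result {
	ch := make(chan result, 1) // buffered so the worker never blocks on send
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		val, err := work(ctx)
		ch <- result{Val: val, Err: err}
	}()
	return ch
}

// await lets a single caller give up when its own context ends, without
// touching the shared work.
func await(ctx context.Context, ch <-chan result) (string, error) {
	select {
	case r := <-ch:
		return r.Val, r.Err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demo cancels the first caller and checks that the shared work still finishes.
func demo() (callerCanceled bool, sharedOK bool) {
	caller, cancel := context.WithCancel(context.Background())
	ch := startShared(10*time.Second, func(ctx context.Context) (string, error) {
		select {
		case <-time.After(50 * time.Millisecond): // simulated query
			return "rows", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	})
	cancel() // the first HTTP caller goes away
	_, err := await(caller, ch)
	r := <-ch // what a still-waiting caller would receive
	return err == context.Canceled, r.Err == nil && r.Val == "rows"
}

func main() {
	canceled, ok := demo()
	fmt.Println(canceled, ok)
}
```

With `singleflight.Do`, the shared function typically captures the first caller's context, so that caller's cancellation surfaces as an error for every coalesced waiter. Detaching the context and selecting on the channel confines the cancellation to the caller that actually went away.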

Add startup log with backend configuration summary (llm_configured, url, model). Expand transient error list with connection refused, no such host, 500, 502, 504. Include backend state and last error in the final "no LLM backend available" message for easier debugging.
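A minimal sketch of a table-driven transient check and an enriched final error, assuming substring matching on the error string as the review comments further down suggest. Only the markers named in this PR are listed because the pre-existing entries are not shown. `isTransient` and `unavailableMessage` are hypothetical names, and the exact fields appended to "no LLM backend available" are an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// transientMarkers holds only the entries this PR says it added.
var transientMarkers = []string{
	"connection refused",
	"no such host",
	"500",
	"502",
	"504",
}

// isTransient reports whether the error text contains any known marker.
func isTransient(errStr string) bool {
	for _, marker := range transientMarkers {
		if strings.Contains(errStr, marker) {
			return true
		}
	}
	return false
}

// unavailableMessage builds a final error that carries backend state and the
// last error; the field names here are illustrative, not the PR's format.
func unavailableMessage(llmConfigured bool, lastErr string) string {
	return fmt.Sprintf("no LLM backend available (llm_configured=%t, last_error=%q)",
		llmConfigured, lastErr)
}

func main() {
	fmt.Println(isTransient("dial tcp: connect: connection refused"))
	fmt.Println(isTransient("LLM API error 401: unauthorized"))
	fmt.Println(unavailableMessage(true, "LLM API error 502: bad gateway"))
}
```

Keeping the markers in one slice makes it easy to see at a glance which failures are retried.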

gemini-code-assist · 2026-03-14T10:30:27Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several critical reliability issues within the MCP server, focusing on robust error handling and improved system stability. The changes prevent context propagation problems in concurrent search operations, enhance the visibility of tool execution errors for better debugging, and fortify the LLM integration with more comprehensive retry mechanisms and diagnostic logging.

Highlights

Context Poisoning Fix: Prevented singleflight context cancellation from affecting shared search requests by using DoChan with a detached context, ensuring independent execution.
Improved Tool Error Observability: Enhanced error reporting for tool calls by including detailed error messages in the Message field of JSON-RPC responses and adding server-side logging for failures.
LLM Backend Reliability: Expanded the list of transient errors that trigger LLM retry logic and improved error messages for unavailable LLM backends to provide more diagnostic information.
Schema Consistency: Ensured the search tool schema explicitly defines an empty required array for better consistency.

Changelog

internal/mcp/server.go
- Added an explicit empty required array to the search tool's JSON schema definition.
- Included the full error message in the Message field of JSON-RPC error responses for tool calls.
- Added server-side error logging for failed tool calls.
internal/search/manager.go
- Refactored UnifiedSearch to use singleflight.DoChan instead of singleflight.Do.
- Implemented a detached context with a timeout for the executeSearch operation within singleflight to prevent context poisoning.
- Adjusted error handling to correctly process results from the DoChan channel, checking for context cancellation and execution errors.
internal/worker/sdk/processor.go
- Introduced a new startup log entry summarizing the SDK processor's backend configuration, including LLM and local verification status.
- Expanded the list of transient errors that trigger LLM retry logic to include 'connection refused', 'no such host', '500', '502', and '504' HTTP status codes.
- Modified the final error message for unavailable LLM backends to provide more diagnostic information, such as LLM configuration status and the last encountered error.

Activity

Issues were identified during a review session for a previous pull request (refactor: remove local-server assumptions, JS-only hooks #6).
The author has verified that the code compiles, passes go vet checks, and all relevant unit tests (internal/search/..., internal/mcp/..., internal/worker/sdk/...) pass, with existing known Windows path failures noted.
Further testing is planned, including deployment verification and long-session stability tests for the search tool.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-03-14T10:30:28Z

Overview

The changes include improved error handling and logging in the MCP server, optimized request coalescing in search using a context with a timeout, and expanded error diagnostics in the LLM processor with better handling of transient errors.

Changes

Cohort / File(s)	Summary
MCP tool server `internal/mcp/server.go`	Added a "required": []string{} field to the first tool's input schema; improved logging and error text on tool call failure for better diagnostics.
Search optimization `internal/search/manager.go`	Replaced singleflight Do with DoChan for asynchronous result handling; introduced a separate context with a 10-second timeout to prevent poisoning of the shared result when a request is canceled; switched to channel-based result handling with support for context cancellation.
LLM error handling `internal/worker/sdk/processor.go`	Added diagnostic logging of the LLM backend configuration; expanded transient error handling (connection refused, no such host, HTTP 5xx); improved the error message with contextual information when all backends are unavailable.

Possibly related PRs

feat: collection MCP tools and instinct import #2. Both PRs modify the tool logic in internal/mcp/server.go (handleToolsList / handleToolsCall), covering the tool catalog and tool call handling.

Suggested labels

ai-review:passed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In search, a magic channel waited its turn,
A timeout guards against mischievous errors,
The logs now sing of the tools' misfortunes,
The hare works more crisply than before!
Network errors are recognized with honor! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the three main changes: singleflight context poisoning fix, error observability improvements, and LLM retry enhancements, matching the PR's objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

Comment @coderabbitai help to get the list of available commands and usage tips.

gemini-code-assist

Code Review

This pull request introduces several important reliability improvements. The fix for singleflight context poisoning in search/manager.go is well-implemented using DoChan and a detached context. The enhancements to error observability in mcp/server.go and worker/sdk/processor.go are valuable for debugging. The expanded list of transient errors for LLM retries is also a good addition. I've added one suggestion in worker/sdk/processor.go to improve the maintainability of the transient error check.

thebtf · 2026-03-14T10:33:19Z

@coderabbitai review

thebtf · 2026-03-14T10:33:22Z

@codex review

coderabbitai · 2026-03-14T10:33:28Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/worker/sdk/processor.go`:
- Around line 691-697: The retry logic in the strings.Contains checks (the
conditional using errStr and strings.Contains) must match the actual LLM client
error format ("LLM API error 500: ..."), so don't replace the existing "500"
check with patterns like "status 500" or "500 Internal Server Error"; instead
update the condition to look for the real token (e.g., strings.Contains(errStr,
"LLM API error 500") or at minimum strings.Contains(errStr, "API error 500")) or
keep the existing generic "500" check to preserve behavior; modify the
conditional in internal/worker/sdk/processor.go where errStr is examined to use
one of these exact substrings so 5xx errors from the LLM client are correctly
caught.
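The reviewer's point about matching the client's actual error format can be checked with a few strings. This sketch assumes the "LLM API error 500: ..." prefix quoted in the comment above. The text after the colon and the helper names `isLLMServerError`, `bareMatch`, and `statusMatch` are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isLLMServerError matches the quoted client format for the 5xx codes
// named in this PR (500, 502, 504).
func isLLMServerError(errStr string) bool {
	for _, code := range []string{"500", "502", "504"} {
		if strings.Contains(errStr, "LLM API error "+code) {
			return true
		}
	}
	return false
}

// bareMatch is the generic check: any occurrence of "500" counts.
func bareMatch(errStr string) bool {
	return strings.Contains(errStr, "500")
}

// statusMatch is the kind of pattern the reviewer warns against, because it
// does not appear in the client's error strings.
func statusMatch(errStr string) bool {
	return strings.Contains(errStr, "status 500")
}

func main() {
	clientErr := "LLM API error 500: upstream failure"
	unrelated := "payload was 1500 bytes"

	fmt.Println(isLLMServerError(clientErr), bareMatch(clientErr), statusMatch(clientErr))
	fmt.Println(isLLMServerError(unrelated), bareMatch(unrelated), statusMatch(unrelated))
}
```

The prefix-based check catches the client's 5xx errors without also firing on unrelated text that happens to contain "500". The bare check is broader but does preserve existing behavior, which is the trade-off the comment describes.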

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 296e9c10-4c36-46c4-aad3-ac881143a38e

📥 Commits

Reviewing files that changed from the base of the PR and between 7c8a144 and d194551.

📒 Files selected for processing (3)

internal/mcp/server.go
internal/search/manager.go
internal/worker/sdk/processor.go

thebtf added 3 commits March 14, 2026 13:10

gemini-code-assist Bot reviewed Mar 14, 2026

Comment thread internal/worker/sdk/processor.go

coderabbitai Bot added the ai-review:passed label Mar 14, 2026

thebtf added ai-review:active and removed ai-review:passed labels Mar 14, 2026

coderabbitai Bot requested changes Mar 14, 2026

Comment thread internal/worker/sdk/processor.go

coderabbitai Bot approved these changes Mar 14, 2026

thebtf added ai-review:passed and removed ai-review:active labels Mar 14, 2026

thebtf merged commit 654de91 into main Mar 14, 2026
2 checks passed

thebtf deleted the fix/mcp-reliability branch March 14, 2026 11:12

coderabbitai Bot mentioned this pull request Mar 14, 2026

refactor: MCP tools — type coercion, tiering, consolidation #8

Merged

8 tasks

This was referenced Mar 24, 2026

fix: Phase 1 — Security & Reliability (P0) #57

Merged

feat: document MCP tools + OpenClaw message classification #60

Merged

coderabbitai Bot mentioned this pull request Apr 13, 2026

refactor: Phase 0 cleanup — remove MCP HTTP transports + dead code #148

Merged

5 tasks

thebtf merged 3 commits into main from fix/mcp-reliability
