fix: MCP reliability — singleflight context poisoning, error observability, LLM retry#7
Conversation
Tool errors now include the actual error text in the Message field (not just the Data field), making errors visible in Claude Code SDK. Added server-side logging of all tool call failures with tool name. Added explicit empty required array to search tool schema for consistency.
Replace singleflight.Do with DoChan + detached context (10s timeout). When concurrent identical searches share a singleflight group, the first caller's HTTP context cancellation no longer propagates to the shared pgvector query, preventing -32000 errors in long MCP sessions.
Add startup log with backend configuration summary (llm_configured, url, model). Expand transient error list with connection refused, no such host, 500, 502, 504. Include backend state and last error in the final "no LLM backend available" message for easier debugging.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses several critical reliability issues within the MCP server, focusing on robust error handling and improved system stability. The changes prevent context propagation problems in concurrent search operations, enhance the visibility of tool execution errors for better debugging, and fortify the LLM integration with more comprehensive retry mechanisms and diagnostic logging. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
ОбзорИзменения включают улучшение обработки ошибок и логирования в MCP-сервере, оптимизацию сокращения запросов в поиске с использованием контекста с timeout'ом, и расширение диагностики ошибок в обработчике LLM с лучшей обработкой временных ошибок. Изменения
Возможно связанные PR
Предлагаемые метки
Оценка сложности код-ревью🎯 3 (Moderate) | ⏱️ ~20 minutes Стихотворение
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces several important reliability improvements. The fix for singleflight context poisoning in search/manager.go is well-implemented using DoChan and a detached context. The enhancements to error observability in mcp/server.go and worker/sdk/processor.go are valuable for debugging. The expanded list of transient errors for LLM retries is also a good addition. I've added one suggestion in worker/sdk/processor.go to improve the maintainability of the transient error check.
|
@coderabbitai review |
|
@codex review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/worker/sdk/processor.go`:
- Around line 691-697: The retry logic in the strings.Contains checks (the
conditional using errStr and strings.Contains) must match the actual LLM client
error format ("LLM API error 500: ..."), so don't replace the existing "500"
check with patterns like "status 500" or "500 Internal Server Error"; instead
update the condition to look for the real token (e.g., strings.Contains(errStr,
"LLM API error 500") or at minimum strings.Contains(errStr, "API error 500")) or
keep the existing generic "500" check to preserve behavior; modify the
conditional in internal/worker/sdk/processor.go where errStr is examined to use
one of these exact substrings so 5xx errors from the LLM client are correctly
caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 296e9c10-4c36-46c4-aad3-ac881143a38e
📒 Files selected for processing (3)
internal/mcp/server.gointernal/search/manager.gointernal/worker/sdk/processor.go
Summary
Fixes two production reliability bugs on engram MCP server (Docker/Unraid), discovered during PR #6 review session.
search/manager.go):singleflight.Doreplaced withDoChan+ detached context. When concurrent identical search requests share a singleflight group, the first caller's HTTP context cancellation no longer poisons the shared pgvector query for all coalesced waiters.mcp/server.go): Error text now included in MCPMessagefield (not justData), making errors visible in Claude Code SDK. Server-side logging added for all tool call failures.worker/sdk/processor.go): Expanded transient retry list (connection refused,no such host,500,502,504). Startup log with backend config summary. Final error message includes backend state and last error for debugging.mcp/server.go): Added explicit emptyrequiredarray tosearchtool schema.Test plan
go build ./cmd/worker/— compilesgo vet ./internal/mcp/ ./internal/search/ ./internal/worker/sdk/— cleango test ./internal/search/...— passgo test ./internal/mcp/...— passgo test ./internal/worker/sdk/...— pre-existing Windows path failures only/api/logsshows "SDK processor backend summary" at startupsearchtool should not degrade after 30+ minSummary by CodeRabbit
Примечания к выпуску