feat: serve MCP over HTTP transport from daemon #388
Merged
CalebisGross merged 2 commits into feat/exp25-faithfulness-probe on Apr 9, 2026
…uning
Extract mnemonic's own 1.5-2B model from Gemma 4 31B (30.7B dense, 60 layers) via Sheared-LLaMA-style targeted structural pruning.
Phases: full fine-tune baseline → learned pruning masks → continued pretraining → standalone GGUF export.
Progressive targets 8B→4B→2B→1.5B to find the quality cliff.
Target: >200 tok/s, <1.5GB VRAM, match EXP-26 faithfulness metrics.
Hardware: MI300X for pruning, local 7800 XT for deployment.
Tracking: #386
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add POST /mcp endpoint to the daemon API, eliminating the need for
per-session stdio subprocesses. Claude Code connects via HTTP transport
to the already-running daemon, sharing its LLM, store, and agents.
- New SessionManager (internal/mcp/session.go) creates and caches
MCPServer instances per session ID with 30-minute idle expiry
- HTTP handler (internal/api/routes/mcp.go) accepts JSON-RPC requests,
generates session IDs on first request (returned via Mcp-Session-Id
header), routes subsequent requests to existing sessions
- Export JSONRPCRequest/Response types and HandleSingleRequest for
the HTTP transport layer
- Wire session manager into daemon serve pipeline
Claude Code config changes from stdio to HTTP transport:
{"type": "http", "url": "http://127.0.0.1:9999/mcp"}
Result: N sessions x ~3GB VRAM each → one daemon, one model, ~3GB total.
The mcp subcommand remains as fallback for offline/no-daemon usage.
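
As a rough check of the VRAM claim above, the arithmetic can be written out. The ~3GB per model load is the approximate figure from this description; the helper name `vramGB` and the choice of four sessions are only for illustration.

```go
package main

import "fmt"

// vramGB estimates total VRAM in GB for a number of loaded model copies.
func vramGB(perModelGB float64, copies int) float64 {
	return perModelGB * float64(copies)
}

func main() {
	const perModelGB = 3.0 // approximate per-load footprint quoted in the description
	sessions := 4

	before := vramGB(perModelGB, sessions+1) // every stdio session plus the daemon loads a model
	after := vramGB(perModelGB, 1)           // only the daemon loads a model

	fmt.Printf("before: ~%.0fGB, after: ~%.0fGB\n", before, after) // before: ~15GB, after: ~3GB
}
```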
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Add `POST /mcp` endpoint to the daemon API: Claude Code connects via HTTP transport instead of spawning stdio subprocesses
- `SessionManager` creates/caches MCPServer instances per session ID with 30-minute idle expiry and background reaper
- Handler generates session IDs on first request (returned via `Mcp-Session-Id` response header), routes subsequent requests to existing sessions
- Export `JSONRPCRequest`/`JSONRPCResponse` types and `HandleSingleRequest` for the HTTP transport layer
- Claude Code config: `{"type": "http", "url": "http://127.0.0.1:9999/mcp"}`

Before: Each Claude Code session spawned a `mnemonic mcp` subprocess → loaded Qwen model → ~3GB VRAM per session. 4 sessions + daemon = 15GB on a 16GB GPU.

After: All sessions share the daemon's single model load. Zero VRAM per session. Stale process accumulation eliminated.
Test plan
- `go vet` and `go test` pass for all changed packages
- `ROCM=1 make build-embedded` compiles
- `POST /mcp` initialize without session ID → returns session ID in header
- `DELETE /mcp` with session ID cleans up session
- `rocm-smi --showpids` shows single daemon process on GPU

Closes #384
🤖 Generated with Claude Code