feat: serve MCP over HTTP transport from daemon to eliminate per-session subprocess spawning #384

@CalebisGross

Problem

Every Claude Code chat session spawns a separate mnemonic mcp subprocess via stdio transport. Each subprocess independently loads the full Qwen model into GPU VRAM (~3GB). A handful of concurrent sessions, plus the daemon, is enough to exhaust the 16GB of VRAM on the RX 7800 XT:

Session 1 → mnemonic mcp → loads Qwen → 3GB VRAM
Session 2 → mnemonic mcp → loads Qwen → 3GB VRAM
Session 3 → mnemonic mcp → loads Qwen → 3GB VRAM
Daemon    → mnemonic serve → loads Qwen → 3GB VRAM
                                Total:    ~12-15GB

Stale MCP processes from closed sessions often linger, compounding the issue. This is a recurring problem — not a one-time cleanup.

Root Cause

The MCP server config in ~/.claude/settings.local.json uses stdio transport:

"mnemonic": {
  "command": "/home/hubcaps/Projects/mem/bin/mnemonic",
  "args": ["mcp"]
}

Each session spawns a full process that initializes its own LLM provider, store connection, encoding agent, and retrieval agent — duplicating what the already-running daemon has.

Solution

Serve MCP protocol over HTTP transport from the daemon. The daemon is already running with the model loaded, store open, and all agents active.

Key findings driving the approach:

  • The MCP server is pure request/response — no server-initiated notifications. SSE is unnecessary.
  • Claude Code supports "type": "http" transport — SSE is deprecated. HTTP is recommended for already-running servers.
  • handleRequest() is already transport-agnostic — it takes a jsonRPCRequest and returns a jsonRPCResponse. Only the transport layer needs to change.

Implementation:

  1. Add POST /mcp endpoint to daemon API — accepts JSON-RPC request body, calls existing handleRequest(), returns JSON-RPC response
  2. Per-session state via headers — track session ID (e.g. Mcp-Session-Id header) so per-session memory tracking and onSessionEnd() still work
  3. Session lifecycle management — detect when a session disconnects (no requests for N minutes) and call onSessionEnd() for cleanup
  4. Update Claude Code config to:
    "mnemonic": {
      "type": "http",
      "url": "http://127.0.0.1:9999/mcp"
    }
  5. Keep mnemonic mcp (stdio) as fallback for when the daemon isn't running

Result:

Session 1 ──┐
Session 2 ──┼── POST /mcp ──→ daemon :9999 (one process, one model, ~3GB VRAM)
Session 3 ──┘

What changes:

  • New HTTP handler in internal/api/routes/ (~200 lines)
  • Session multiplexing in the MCP server (~100 lines)
  • Session timeout/cleanup logic
  • Claude Code MCP config

What doesn't change:

  • All 24 MCP tool handlers (zero changes)
  • Store, LLM provider, event bus (shared from daemon)
  • mnemonic mcp subcommand (kept as offline fallback)

Impact

  • Eliminates GPU VRAM exhaustion from concurrent sessions
  • Eliminates stale subprocess accumulation
  • Reduces per-session overhead from ~1.4GB RAM + 3GB VRAM to effectively zero (sessions share the daemon's process and model)
  • Faster session startup (no model load, no DB open — daemon already has everything)

Metadata

Labels: enhancement (New feature or request)
