feat: serve MCP over HTTP transport from daemon to eliminate per-session subprocess spawning

## Problem

Every Claude Code chat session spawns a separate `mnemonic mcp` subprocess via stdio transport. Each subprocess independently loads the full Qwen model into GPU VRAM (~3GB). With several concurrent sessions plus the daemon, this exhausts the 16GB of VRAM on the RX 7800 XT:

```
Session 1 → mnemonic mcp → loads Qwen → 3GB VRAM
Session 2 → mnemonic mcp → loads Qwen → 3GB VRAM
Session 3 → mnemonic mcp → loads Qwen → 3GB VRAM
Daemon    → mnemonic serve → loads Qwen → 3GB VRAM
                                Total:    ~12-15GB
```

Stale MCP processes from closed sessions often linger, compounding the issue. The problem recurs with normal use, so a one-time cleanup does not fix it.

## Root Cause

The MCP server config in `~/.claude/settings.local.json` uses stdio transport:

```json
"mnemonic": {
  "command": "/home/hubcaps/Projects/mem/bin/mnemonic",
  "args": ["mcp"]
}
```

Each session spawns a full process that initializes its own LLM provider, store connection, encoding agent, and retrieval agent — duplicating what the already-running daemon has.

## Solution

Serve MCP over HTTP from the daemon. The daemon is already running with the model loaded, the store open, and all agents active, so sessions can reuse that state instead of rebuilding it in a subprocess.

### Key findings driving the approach:
- **The MCP server is pure request/response** — no server-initiated notifications. SSE is unnecessary.
- **Claude Code supports `"type": "http"` transport** — SSE is deprecated. HTTP is recommended for already-running servers.
- **`handleRequest()` is already transport-agnostic** — it takes a `jsonRPCRequest` and returns a `jsonRPCResponse`. Only the transport layer needs to change.
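
To make the last finding concrete, here is a minimal sketch of what "transport-agnostic" means in practice. The type and function names `jsonRPCRequest`, `jsonRPCResponse`, and `handleRequest` come from this issue, but their fields and the `ping` method are assumptions based on JSON-RPC 2.0, and `handleRaw` is a hypothetical helper, not existing code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field layout is assumed from JSON-RPC 2.0; the real structs may differ.
type jsonRPCRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params,omitempty"`
}

type jsonRPCError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type jsonRPCResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Result  any             `json:"result,omitempty"`
	Error   *jsonRPCError   `json:"error,omitempty"`
}

// handleRequest is a stand-in for the existing dispatcher. It knows nothing
// about stdio or HTTP: one request value in, one response value out.
func handleRequest(req jsonRPCRequest) jsonRPCResponse {
	switch req.Method {
	case "ping": // placeholder method for this sketch
		return jsonRPCResponse{JSONRPC: "2.0", ID: req.ID, Result: "pong"}
	default:
		return jsonRPCResponse{
			JSONRPC: "2.0",
			ID:      req.ID,
			Error:   &jsonRPCError{Code: -32601, Message: "method not found: " + req.Method},
		}
	}
}

// handleRaw (hypothetical) is all a transport has to call: it decodes the
// bytes it received, dispatches, and encodes the reply.
func handleRaw(body []byte) ([]byte, error) {
	var req jsonRPCRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("parse request: %w", err)
	}
	return json.Marshal(handleRequest(req))
}
```

Under this shape, the stdio loop and an HTTP handler would differ only in where the bytes come from and where the reply is written.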

### Implementation:

1. **Add `POST /mcp` endpoint to daemon API** — accepts JSON-RPC request body, calls existing `handleRequest()`, returns JSON-RPC response
2. **Per-session state via headers** — track session ID (e.g. `Mcp-Session-Id` header) so per-session memory tracking and `onSessionEnd()` still work
3. **Session lifecycle management** — detect when a session disconnects (no requests for N minutes) and call `onSessionEnd()` for cleanup
4. **Update Claude Code config** to:
   ```json
   "mnemonic": {
     "type": "http",
     "url": "http://127.0.0.1:9999/mcp"
   }
   ```
5. **Keep `mnemonic mcp` (stdio) as fallback** for when the daemon isn't running
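
Steps 1 and 2 could be sketched roughly as follows. This is an illustration under assumptions, not the daemon's actual code: `mcpHandler`, `dispatchFunc`, `newSessionID`, and `postJSON` are hypothetical names, the struct fields are assumed from JSON-RPC 2.0, and the rule "issue a session ID on `initialize`, require it afterwards" is one reading of how `Mcp-Session-Id` is meant to be used. Notifications and batching are left out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Assumed JSON-RPC 2.0 shapes; the real structs may carry more fields.
type jsonRPCRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params,omitempty"`
}

type jsonRPCResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Result  any             `json:"result,omitempty"`
}

// dispatchFunc stands in for handleRequest() extended with a session ID so
// per-session memory tracking keeps working (an assumed signature).
type dispatchFunc func(sessionID string, req jsonRPCRequest) jsonRPCResponse

// newSessionID returns 16 random bytes as 32 hex characters.
func newSessionID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// mcpHandler accepts one JSON-RPC request per POST and replies with JSON.
func mcpHandler(dispatch dispatchFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		var req jsonRPCRequest
		if err := json.Unmarshal(body, &req); err != nil {
			http.Error(w, "invalid JSON-RPC request", http.StatusBadRequest)
			return
		}
		sid := r.Header.Get("Mcp-Session-Id")
		if req.Method == "initialize" {
			sid = newSessionID() // a new session starts here
		} else if sid == "" {
			http.Error(w, "missing Mcp-Session-Id", http.StatusBadRequest)
			return
		}
		w.Header().Set("Mcp-Session-Id", sid)
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(dispatch(sid, req))
	})
}

// postJSON exercises the handler in-process, without opening a socket.
func postJSON(h http.Handler, sessionID, body string) (status int, sid string, respBody string) {
	req := httptest.NewRequest(http.MethodPost, "/mcp", strings.NewReader(body))
	if sessionID != "" {
		req.Header.Set("Mcp-Session-Id", sessionID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("Mcp-Session-Id"), rec.Body.String()
}
```

In the daemon this would presumably be registered next to the existing API routes, for example with `mux.Handle("/mcp", mcpHandler(...))`, and bound to `127.0.0.1` only, matching the URL in step 4.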

### Result:

```
Session 1 ──┐
Session 2 ──┼── POST /mcp ──→ daemon :9999 (one process, one model, ~3GB VRAM)
Session 3 ──┘
```

### What changes:
- New HTTP handler in `internal/api/routes/` (~200 lines)
- Session multiplexing in the MCP server (~100 lines)
- Session timeout/cleanup logic
- Claude Code MCP config
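
The session timeout/cleanup item could look something like the sketch below. `sessionTracker`, `touch`, and `sweep` are hypothetical names, timestamps are plain Unix seconds to keep the example small, and the `onEnd` callback stands in for the existing `onSessionEnd()`, whose real signature is not shown in this issue:

```go
package main

import (
	"sort"
	"sync"
)

// sessionTracker records when each session last made a request and ends
// sessions that have been idle for longer than idleSeconds.
type sessionTracker struct {
	mu          sync.Mutex
	idleSeconds int64
	lastSeen    map[string]int64     // session ID -> Unix seconds of last request
	onEnd       func(sessionID string) // stand-in for onSessionEnd()
}

func newSessionTracker(idleSeconds int64, onEnd func(string)) *sessionTracker {
	return &sessionTracker{
		idleSeconds: idleSeconds,
		lastSeen:    make(map[string]int64),
		onEnd:       onEnd,
	}
}

// touch marks a session as active; the HTTP handler would call it per request.
func (t *sessionTracker) touch(id string, nowUnix int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[id] = nowUnix
}

// sweep removes every session idle for more than idleSeconds, calls onEnd
// for each one, and returns the ended IDs in sorted order.
func (t *sessionTracker) sweep(nowUnix int64) []string {
	t.mu.Lock()
	var expired []string
	for id, seen := range t.lastSeen {
		if nowUnix-seen > t.idleSeconds {
			expired = append(expired, id)
			delete(t.lastSeen, id)
		}
	}
	t.mu.Unlock()

	sort.Strings(expired)
	for _, id := range expired {
		t.onEnd(id) // run callbacks outside the lock so they may call touch
	}
	return expired
}
```

The daemon would call `touch(id, time.Now().Unix())` from the `/mcp` handler and run `sweep` periodically from a background goroutine. Because HTTP has no disconnect signal here, the idle timeout (the "N minutes" in step 3) is the only way a closed session gets noticed, so it trades cleanup latency against ending a session that was merely quiet.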

### What doesn't change:
- All 24 MCP tool handlers (zero changes)
- Store, LLM provider, event bus (shared from daemon)
- `mnemonic mcp` subcommand (kept as offline fallback)

## Impact

- Eliminates GPU VRAM exhaustion from concurrent sessions
- Eliminates stale subprocess accumulation
- Reduces per-session overhead from ~1.4GB RAM + 3GB VRAM to effectively zero, since every session shares the daemon's process and model
- Faster session startup (no model load, no DB open — daemon already has everything)
