
feat: Model Control Center with embedded LLM inference#380

Merged
CalebisGross merged 1 commit into main from feat/model-control-center
Apr 8, 2026
Conversation

@CalebisGross (Collaborator)

Summary

  • Self-contained embedded inference: mnemonic binary runs its own LLM via llama.cpp with ROCm GPU acceleration. No LM Studio, no external servers, no API dependencies required.
  • Model Control Center dashboard: new Models tab with runtime model management — hot-swap GGUF models, toggle between embedded and Gemini API, view model status and swap logs.
  • SwitchableProvider: wraps embedded + API providers with runtime toggle. Switching to API unloads models from VRAM; switching back reloads them.
  • ChatML compatibility: fixed prompt formatting for Qwen 3.5 (the code was using the Felix-LM format), added /no_think to disable reasoning tokens, <think> block stripping, and <|im_end|> stop sequences.
  • Qwen spoke fusion: the export script now fuses spoke matrices for fewer GPU kernel launches; fixed the RQ4 quantizer's inner dimension for ssm_conv1d.

Files Changed

| Area | Files | What |
| --- | --- | --- |
| Provider | embedded.go, switchable.go, provider.go | Hot-swap, Unload/Reload, manifest listing, SwitchableProvider, ModelManager interface |
| API | routes/models.go, server.go | GET/POST /api/v1/models, /api/v1/models/active |
| Dashboard | models.js, index.html, nav.js, app.js | Models tab, status cards, swap UI, provider toggle |
| Startup | runtime.go, serve.go | SwitchableProvider wiring, ModelManager injection |
| Training | export_qwen35_spokes.py, quantize_rq4.py | Spoke fusion, inner-dim quantization fix |
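A minimal sketch of how the /api/v1/models/active route could sit behind the ModelManager interface. The handler body and the in-memory manager below are assumptions for illustration; only the route, the ModelManager name, and SwapChatModel come from this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ModelManager decouples the API routes from provider internals.
// (Method set here is a guess based on the PR description.)
type ModelManager interface {
	ActiveModel() string
	SwapChatModel(name string) error
}

// memManager is a stand-in implementation for this sketch.
type memManager struct{ active string }

func (m *memManager) ActiveModel() string             { return m.active }
func (m *memManager) SwapChatModel(name string) error { m.active = name; return nil }

// activeHandler serves GET (report the active model) and POST (hot-swap).
func activeHandler(mgr ModelManager) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			json.NewEncoder(w).Encode(map[string]string{"active": mgr.ActiveModel()})
		case http.MethodPost:
			var req struct {
				Model string `json:"model"`
			}
			if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			if err := mgr.SwapChatModel(req.Model); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.WriteHeader(http.StatusNoContent)
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	}
}

func main() {
	srv := httptest.NewServer(activeHandler(&memManager{active: "qwen3.5.gguf"}))
	defer srv.Close()
	resp, _ := http.Get(srv.URL)
	var body map[string]string
	json.NewDecoder(resp.Body).Decode(&body)
	fmt.Println(body["active"]) // → qwen3.5.gguf
}
```

Because the handler only sees the interface, swapping the embedded provider for the Gemini API (or a test double, as above) requires no route changes.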

Test plan

  • `go vet ./...` clean
  • `go test ./...` all pass
  • `golangci-lint run` zero issues in changed packages
  • API endpoints respond correctly (embedded and API modes)
  • Dashboard Models tab renders, swap buttons work
  • VRAM released on Gemini switch, reloaded on embedded switch
  • Dreaming/consolidation/abstraction agents produce content with ChatML fix
  • Full lifecycle test with fused RQ4 spokes

🤖 Generated with Claude Code

Self-contained embedded inference via llama.cpp with ROCm GPU acceleration.
No external servers needed — the mnemonic binary IS the inference engine.

- EmbeddedProvider: hot-swap models (SwapChatModel/SwapEmbedModel), Unload/Reload
  for VRAM management, manifest-based model listing (models.json)
- SwitchableProvider: runtime toggle between embedded and Gemini API without restart
- ModelManager interface decouples API routes from provider internals
- API: GET/POST /api/v1/models, GET/POST /api/v1/models/active
- Dashboard: Models tab with status cards, model table, swap buttons, provider toggle
- ChatML prompt format with /no_think for Qwen 3.5 compatibility
- Strip <think>...</think> tokens from model output
- <|im_end|> stop sequence for proper turn boundaries
- Qwen spoke fusion in export script, inner-dim fix in RQ4 quantizer
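The unload-on-switch behavior above can be sketched as follows. The interface shapes and struct internals are assumptions; only the SwitchableProvider, Unload, and Reload names come from this PR:

```go
package main

import (
	"fmt"
	"sync"
)

// Provider is the minimal surface both backends share (illustrative).
type Provider interface {
	Name() string
}

// Unloader is implemented by the embedded provider so VRAM can be released
// when switching to the API backend, and reclaimed when switching back.
type Unloader interface {
	Unload() error
	Reload() error
}

// SwitchableProvider toggles between the embedded llama.cpp backend and an
// API backend at runtime, without a process restart.
type SwitchableProvider struct {
	mu       sync.Mutex
	embedded Provider
	api      Provider
	useAPI   bool
}

// Active returns the backend currently in use.
func (s *SwitchableProvider) Active() Provider {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.useAPI {
		return s.api
	}
	return s.embedded
}

// Switch flips the active backend, unloading embedded models from VRAM when
// moving to the API and reloading them when moving back.
func (s *SwitchableProvider) Switch(toAPI bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if toAPI == s.useAPI {
		return nil // already on the requested backend
	}
	if u, ok := s.embedded.(Unloader); ok {
		if toAPI {
			if err := u.Unload(); err != nil {
				return err
			}
		} else if err := u.Reload(); err != nil {
			return err
		}
	}
	s.useAPI = toAPI
	return nil
}

// stub is a toy backend for demonstration; loaded mimics VRAM residency.
type stub struct {
	name   string
	loaded bool
}

func (s *stub) Name() string  { return s.name }
func (s *stub) Unload() error { s.loaded = false; return nil }
func (s *stub) Reload() error { s.loaded = true; return nil }

func main() {
	emb := &stub{name: "embedded", loaded: true}
	sp := &SwitchableProvider{embedded: emb, api: &stub{name: "gemini"}}
	_ = sp.Switch(true)
	fmt.Println(sp.Active().Name(), emb.loaded) // → gemini false
	_ = sp.Switch(false)
	fmt.Println(sp.Active().Name(), emb.loaded) // → embedded true
}
```

The mutex keeps a swap and an in-flight Active() lookup from racing, which matters once the dashboard toggle and agent requests share the provider.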

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross CalebisGross merged commit ad2a86b into main Apr 8, 2026
@CalebisGross CalebisGross deleted the feat/model-control-center branch April 8, 2026 02:20
CalebisGross added a commit that referenced this pull request Apr 10, 2026
The test expected the old Felix-LM format (<|system|>), but the code switched to
ChatML (<|im_start|>system) with /no_think in PR #380.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>