
feat: Model Control Center with embedded LLM inference#380

Merged
CalebisGross merged 1 commit into main from feat/model-control-center
Apr 8, 2026
Conversation

@CalebisGross (Collaborator)

Summary

  • Self-contained embedded inference: mnemonic binary runs its own LLM via llama.cpp with ROCm GPU acceleration. No LM Studio, no external servers, no API dependencies required.
  • Model Control Center dashboard: new Models tab with runtime model management — hot-swap GGUF models, toggle between embedded and Gemini API, view model status and swap logs.
  • SwitchableProvider: wraps embedded + API providers with runtime toggle. Switching to API unloads models from VRAM; switching back reloads them.
  • ChatML compatibility: fixed prompt formatting for Qwen 3.5 (the code was using the Felix-LM format), added /no_think to disable reasoning tokens, <think> block stripping, and <|im_end|> stop sequences.
  • Qwen spoke fusion: the export script now fuses spoke matrices for fewer GPU kernel launches; fixed the RQ4 quantizer's inner dimension for ssm_conv1d.

Files Changed

| Area | Files | What |
| --- | --- | --- |
| Provider | embedded.go, switchable.go, provider.go | Hot-swap, Unload/Reload, manifest listing, SwitchableProvider, ModelManager interface |
| API | routes/models.go, server.go | GET/POST /api/v1/models, /api/v1/models/active |
| Dashboard | models.js, index.html, nav.js, app.js | Models tab, status cards, swap UI, provider toggle |
| Startup | runtime.go, serve.go | SwitchableProvider wiring, ModelManager injection |
| Training | export_qwen35_spokes.py, quantize_rq4.py | Spoke fusion, inner-dim quantization fix |
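A minimal sketch of how the /api/v1/models/active route could sit behind the ModelManager interface. The handler body and the in-memory manager below are assumptions for illustration; only the route, the ModelManager name, and SwapChatModel come from this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ModelManager decouples the API routes from provider internals.
// (Method set here is a guess based on the PR description.)
type ModelManager interface {
	ActiveModel() string
	SwapChatModel(name string) error
}

// memManager is a stand-in implementation for this sketch.
type memManager struct{ active string }

func (m *memManager) ActiveModel() string             { return m.active }
func (m *memManager) SwapChatModel(name string) error { m.active = name; return nil }

// activeHandler serves GET (report the active model) and POST (hot-swap).
func activeHandler(mgr ModelManager) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			json.NewEncoder(w).Encode(map[string]string{"active": mgr.ActiveModel()})
		case http.MethodPost:
			var req struct {
				Model string `json:"model"`
			}
			if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			if err := mgr.SwapChatModel(req.Model); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.WriteHeader(http.StatusNoContent)
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	}
}

func main() {
	srv := httptest.NewServer(activeHandler(&memManager{active: "qwen3.5.gguf"}))
	defer srv.Close()
	resp, _ := http.Get(srv.URL)
	var body map[string]string
	json.NewDecoder(resp.Body).Decode(&body)
	fmt.Println(body["active"]) // → qwen3.5.gguf
}
```

Because the handler only sees the interface, swapping the embedded provider for the Gemini API (or a test double, as above) requires no route changes.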

Test plan

  • `go vet ./...` clean
  • `go test ./...` all pass
  • `golangci-lint run` zero issues in changed packages
  • API endpoints respond correctly (embedded and API modes)
  • Dashboard Models tab renders, swap buttons work
  • VRAM released on Gemini switch, reloaded on embedded switch
  • Dreaming/consolidation/abstraction agents produce content with ChatML fix
  • Full lifecycle test with fused RQ4 spokes

🤖 Generated with Claude Code

Self-contained embedded inference via llama.cpp with ROCm GPU acceleration.
No external servers needed — the mnemonic binary IS the inference engine.

- EmbeddedProvider: hot-swap models (SwapChatModel/SwapEmbedModel), Unload/Reload
  for VRAM management, manifest-based model listing (models.json)
- SwitchableProvider: runtime toggle between embedded and Gemini API without restart
- ModelManager interface decouples API routes from provider internals
- API: GET/POST /api/v1/models, GET/POST /api/v1/models/active
- Dashboard: Models tab with status cards, model table, swap buttons, provider toggle
- ChatML prompt format with /no_think for Qwen 3.5 compatibility
- Strip <think>...</think> tokens from model output
- <|im_end|> stop sequence for proper turn boundaries
- Qwen spoke fusion in export script, inner-dim fix in RQ4 quantizer
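The unload-on-switch behavior above can be sketched as follows. The interface shapes and struct internals are assumptions; only the SwitchableProvider, Unload, and Reload names come from this PR:

```go
package main

import (
	"fmt"
	"sync"
)

// Provider is the minimal surface both backends share (illustrative).
type Provider interface {
	Name() string
}

// Unloader is implemented by the embedded provider so VRAM can be released
// when switching to the API backend, and reclaimed when switching back.
type Unloader interface {
	Unload() error
	Reload() error
}

// SwitchableProvider toggles between the embedded llama.cpp backend and an
// API backend at runtime, without a process restart.
type SwitchableProvider struct {
	mu       sync.Mutex
	embedded Provider
	api      Provider
	useAPI   bool
}

// Active returns the backend currently in use.
func (s *SwitchableProvider) Active() Provider {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.useAPI {
		return s.api
	}
	return s.embedded
}

// Switch flips the active backend, unloading embedded models from VRAM when
// moving to the API and reloading them when moving back.
func (s *SwitchableProvider) Switch(toAPI bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if toAPI == s.useAPI {
		return nil // already on the requested backend
	}
	if u, ok := s.embedded.(Unloader); ok {
		if toAPI {
			if err := u.Unload(); err != nil {
				return err
			}
		} else if err := u.Reload(); err != nil {
			return err
		}
	}
	s.useAPI = toAPI
	return nil
}

// stub is a toy backend for demonstration; loaded mimics VRAM residency.
type stub struct {
	name   string
	loaded bool
}

func (s *stub) Name() string  { return s.name }
func (s *stub) Unload() error { s.loaded = false; return nil }
func (s *stub) Reload() error { s.loaded = true; return nil }

func main() {
	emb := &stub{name: "embedded", loaded: true}
	sp := &SwitchableProvider{embedded: emb, api: &stub{name: "gemini"}}
	_ = sp.Switch(true)
	fmt.Println(sp.Active().Name(), emb.loaded) // → gemini false
	_ = sp.Switch(false)
	fmt.Println(sp.Active().Name(), emb.loaded) // → embedded true
}
```

The mutex keeps a swap and an in-flight Active() lookup from racing, which matters once the dashboard toggle and agent requests share the provider.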

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross CalebisGross merged commit ad2a86b into main Apr 8, 2026
@CalebisGross CalebisGross deleted the feat/model-control-center branch April 8, 2026 02:20
CalebisGross added a commit that referenced this pull request Apr 10, 2026
The test expected the old Felix-LM format (<|system|>), but the code switched to
ChatML (<|im_start|>system) with /no_think in PR #380.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>