feat: live dashboard monitor + serve loop improvements#5

Open
msitarzewski wants to merge 1 commit into danveloper:main from msitarzewski:feat/dashboard-serve-improvements

Conversation

@msitarzewski

Summary

  • ncurses dashboard — htop-style terminal monitor (dashboard.c) that reads /tmp/flash-moe-stats.json and shows real-time inference status, progress bars, TTFT, tok/s, and rolling averages
  • SSE streaming — per-token Server-Sent Events for /v1/chat/completions (OpenAI-compatible streaming)
  • Dashboard stats reporting — server writes live state (prefill progress, generation metrics, uptime) to JSON for the dashboard
  • Tool call parsing — detects <tool_call> blocks in model output and returns structured tool_calls in the response
  • Session state — save/restore KV cache and linear attention state for multi-turn conversations
  • GPU KV buffer — increased pre-allocation from 8K to 32K tokens
  • CPU 2-bit expert path — fallback compute path for 2-bit quantized experts

Testing

Tested end-to-end on Apple M5 Max (128GB RAM):

  • make && make dashboard builds cleanly
  • ./infer --serve 6601 --2bit + ./dashboard — live monitoring works across idle/prefilling/generating states
  • ./infer --serve 6601 (4-bit) — serves at 10.5 tok/s with correct SSE streaming
  • Terminal resize, disconnect/reconnect, and q exit all work correctly
  • Verified dashboard borders render correctly at various terminal widths

Test plan

  • make clean && make && make chat && make dashboard — all targets build
  • ./infer --serve 6601 --2bit + ./chat --port 6601 — interactive chat works
  • ./dashboard shows live stats during generation
  • Dashboard shows DISCONNECTED when server is stopped
  • Dashboard adapts to terminal resize

🤖 Generated with Claude Code

Dashboard:
- ncurses-based htop-style terminal monitor (dashboard.c)
- Reads /tmp/flash-moe-stats.json written by the inference server
- Shows real-time status, progress bars, TTFT, tok/s, rolling averages
- Auto-adapts to terminal width, clean exit with q or Ctrl+C

Serve loop:
- SSE streaming with per-token delta events
- Dashboard stats reporting (server state, prefill progress, generation metrics)
- Tool call parsing from model output (<tool_call> blocks)
- Session state save/restore for multi-turn conversations
- GPU KV buffer increased to 32K pre-allocation
- CPU 2-bit expert forward path for fallback compute

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AndrewFarley added a commit to AndrewFarley/flash-moe that referenced this pull request Apr 29, 2026
Applied PRs danveloper#5 and danveloper#14:

Dashboard:
- ncurses-based htop-style terminal monitor (dashboard.c)
- Reads /tmp/flash-moe-stats.json written by inference server
- Real-time status, progress bars, TTFT, tok/s, rolling averages

Serve loop improvements:
- SSE streaming with per-token delta events
- Non-streaming JSON response mode (stream: false)
- Tool call parsing from <tool_call> blocks in model output
- Full OpenAI messages array parsing for generic clients
- Dashboard stats reporting (server state, prefill progress, generation)
- GPU KV buffer increased to 32K pre-allocation
- CPU 2-bit expert forward path for fallback compute
- CMD1+CMD2 merge optimization for linear attention layers
- select() loop for idle stats updates, GET /stats endpoint

8-bit gate dequant (PR danveloper#14):
- dequant_matvec_8bit Metal kernel (FMA-optimized)
- cpu_dequant_matvec_8bit CPU fallback
- BatchMatvecSpec.bits field for per-tensor bit-width dispatch
- Auto-detection of gate quantization from config.json
- Gate bits applied dynamically (4-bit or 8-bit based on model)

Additional fixes:
- BPE byte marker decoding in SSE output (Ġ→space, Ċ→newline)