feat: RQ4 GPU inference, spoke fusion, handoff recall fixes#379

Merged

CalebisGross merged 23 commits into main from feat/exp20-data-quality on Apr 8, 2026

Conversation

@CalebisGross (Collaborator)

Summary

  • RQ4 GPU inference: Fixed 3 bugs in the RotorQ GPU kernels (dequant element ordering, codebook mismatch, vec_dot scaling). Implemented dp4a integer SIMD with AMD byte interleaving. 120 tok/s base model, 101 tok/s with spokes on RX 7800 XT.
  • Spoke fusion: Pre-concatenate Felix-LM spoke matrices at GGUF export time. Reduces kernel launches from 280 to 70 per token, for a 9.4% generation speedup (see the sketch after this list).
  • RQ3 (3-bit) experiment: Full pipeline implemented; quality collapsed at 8 centroids. Documented as a negative result.
  • Handoff recall fix: Added SearchByType to store, skip MMR diversity for type filters, exclude handoffs from consolidation merging.
  • Encoding prompt: Added conciseness guidance for structured_concepts.
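
A minimal sketch of the pre-concatenation idea, assuming "pre-concatenate" means stacking the spoke projection rows under the base weight so a single GEMM serves both paths; the names and shapes here are illustrative, not the actual export code:

```python
# Hypothetical sketch: one fused GEMM replaces separate base + spoke launches.
import numpy as np

def fuse_spoke(w_base: np.ndarray, w_spoke: np.ndarray) -> np.ndarray:
    """Stack spoke rows under the base weight: (d_out + r, d_in)."""
    assert w_base.shape[1] == w_spoke.shape[1], "in-features must match"
    return np.concatenate([w_base, w_spoke], axis=0)

def split_output(y_fused: np.ndarray, d_out: int):
    """Recover base and spoke activations from the fused GEMM output."""
    return y_fused[..., :d_out], y_fused[..., d_out:]

x = np.random.randn(8, 256).astype(np.float32)     # (tokens, d_in)
w = np.random.randn(1024, 256).astype(np.float32)  # base projection
s = np.random.randn(16, 256).astype(np.float32)    # rank-16 spoke rows
y_base, y_spoke = split_output(x @ fuse_spoke(w, s).T, d_out=1024)
```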

Performance

Config              Generation (tok/s)   Notes
RQ4 base            120.3                30% faster than Q4_K_M
RQ4 fused spokes    101.3                With Felix-LM adapters
Q4_K_M baseline     92.2                 Standard llama.cpp

Full research report: ~/Documents/rotorq_inference_report_2026-04-07.md

Test plan

  • make check (go fmt + go vet)
  • make test (all pass)
  • RQ4 GPU inference verified correct output at all ngl values
  • Spoke fusion backward compatible (old GGUFs still load)
  • Encoding quality verified via stress test (7 inputs, valid JSON)
  • Full lifecycle test with fused RQ4 spokes model

Generated with Claude Code

CalebisGross and others added 23 commits April 4, 2026 10:44
…prep

- Add training_constants.py as a single source of truth for enums, system
  prompts, and required fields. Reconcile an enum mismatch across 8 scripts
  (emotional_tone had diverged between validate.py and the Gemini prompts)
- Upgrade validate.py with a 3-level quality pipeline (sketched after this
  commit message):
  Level 1: Schema (field types, enums, constraints)
  Level 2: Semantic fidelity (file:line preservation, entity preservation,
  proportionality, fabrication detection)
  Level 3: Dataset health (duplicate gists, concept diversity, balance)
- Add 19 tests covering all validation levels including fidelity checks
- Add generate_targeted_data.py for 5 failure-mode categories:
  A: Stack traces (file:line preservation)
  B: Named entities (person name preservation)
  C: Sparse inputs (minimal output for minimal input)
  D: Domain terms (no synonym substitution)
  E: Numerical precision (exact number preservation)
- Add batch_generate_targeted.py for Gemini Batch API pipeline
  (server-side queuing, zero rate limits, 50% cheaper)
- Add setup_droplet.sh for DO MI300X (ROCm 7.2, Ubuntu 24.04)
- Pre-register EXP-20 in experiment registry
- Update system prompt to explicitly instruct file:line and entity
  preservation in the content field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
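
A minimal sketch of the 3-level pipeline, assuming illustrative field names, enum values, and a simple file:line regex; the real validate.py checks more than this:

```python
import re

# Assumed enum values, for illustration only
ALLOWED_TONES = {"neutral", "frustrated", "excited", "concerned", "reflective"}

def check_schema(example: dict) -> list[str]:
    """Level 1: field types, enums, constraints."""
    errors = []
    if not isinstance(example.get("gist"), str):
        errors.append("gist must be a string")
    if example.get("emotional_tone") not in ALLOWED_TONES:
        errors.append(f"unknown emotional_tone: {example.get('emotional_tone')}")
    return errors

def check_fidelity(example: dict) -> list[str]:
    """Level 2: file:line references in the input must survive encoding."""
    refs = re.findall(r"\w+\.\w+:\d+", example.get("input", ""))
    return [f"dropped reference: {r}" for r in refs
            if r not in example.get("content", "")]

def check_dataset_health(examples: list[dict]) -> list[str]:
    """Level 3: duplicate gists across the whole dataset."""
    seen, errors = set(), []
    for ex in examples:
        if ex["gist"] in seen:
            errors.append(f"duplicate gist: {ex['gist'][:40]}")
        seen.add(ex["gist"])
    return errors
```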
…ixes

- Add --checkpoint, --skip-gemma, --skip-gemini CLI args to
  stress_test_hallucination.py for droplet use and iterative testing
- Update batch_encode.py model to gemini-3.1-pro-preview (the previous
  gemini-3-flash-preview is currently returning 503s)
- Persist cleaned v5 data to training/data/finetune_qwen_v5_cleaned/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g token limit

- Add generate_mnemonic_scenarios.py with 210 scenarios covering every
  mnemonic subsystem: perception, encoding, retrieval, consolidation,
  dreaming, episoding, abstraction, metacognition, orchestrator, reactor,
  store, MCP, API, LLM providers, watchers, daemon, events, config.
  All scenarios use real file paths, function names, and struct names.
- Add generate_mnemonic_bespoke.py for OpenRouter Qwen 3.6 generation
  (conservative rate limiting: 3 concurrent, 4s delay, daily limit detection)
- Fix batch_encode.py max_output_tokens: 2048 -> 8192 (encoding output
  was being truncated, causing a 92% failure rate on structured JSON)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sparse input templates now use a per-input mapping instead of random
gist assignment. Each input gets a semantically correct gist, matching
concepts, and appropriate emotional tone. Deduplicated to 51 unique
examples by gist to avoid template memorization (mapping sketched after
this commit message).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
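
A sketch of the per-input mapping with made-up template data; the real script carries one entry per sparse input:

```python
# Illustrative templates only: each sparse input owns its gist/concepts/tone.
SPARSE_TEMPLATES = {
    "bumped gguf dep to 0.10": {
        "gist": "Routine dependency bump, no behavior change.",
        "concepts": ["dependencies"],
        "emotional_tone": "neutral",
    },
    # ... one entry per sparse input ...
}

def build_examples(templates: dict) -> list[dict]:
    seen_gists, examples = set(), []
    for text, fields in templates.items():
        if fields["gist"] in seen_gists:  # dedupe by gist to avoid
            continue                      # template memorization
        seen_gists.add(fields["gist"])
        examples.append({"input": text, **fields})
    return examples
```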
…tion

- Add generate_distribution_balance.py with 114 scenarios across 4 categories:
  long_form (19): 400+ word debugging narratives, architecture docs, incidents
  code_format (25): raw Go code, JSON, YAML, shell output, log excerpts
  low_significance (40): routine config tweaks, dep updates, formatting fixes
  emotional_variety (30): frustrated, excited, concerned, reflective observations
- Fix batch_encode.py to preserve source/category from raw inputs instead of
  hardcoding 'swebench_unknown'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add procedural_generator.py: generates mnemonic-specific observations by
  combining real agent names, file paths, functions, structs, MCP tools,
  event types from the codebase with randomized realistic numbers. Produces
  500+ varied observations covering agent operations, errors, store ops,
  MCP calls, watcher events, config changes, performance metrics, training,
  collaboration, and decisions. (Generator shape sketched after this
  commit message.)
- Add generate_mnemonic_scenarios_v2.py: 96 hand-written scenarios covering
  short/medium/long observations, varied emotions, code references,
  multi-topic notes, cross-session context, and training process observations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
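
The shape of the generator, with stand-in vocabulary; the real procedural_generator.py draws its agent names, file paths, functions, and event types from the codebase:

```python
import random

AGENTS = ["encoding", "retrieval", "consolidation", "dreaming"]
EVENTS = ["memory.encoded", "gist.merged", "handoff.created"]
TEMPLATE = "{agent} agent processed {n} memories in {ms}ms, emitted {event}"

def generate(count: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [
        TEMPLATE.format(
            agent=rng.choice(AGENTS),
            n=rng.randint(1, 200),   # randomized realistic numbers
            ms=rng.randint(5, 900),
            event=rng.choice(EVENTS),
        )
        for _ in range(count)
    ]
```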
Smoke test results on v6 dataset (1000 steps, RX 7800 XT):
- Eval loss: 0.9354 -> 0.6319 (32% improvement)
- Stress test: 7/7 (up from 5/7 on v5 data)
- Both previously failing tests now pass:
  - Stack trace: preserves spread.go:142 and agent.go:89
  - Multi-topic: preserves "Jason" entity name

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EXP-20: updated with actual v6 dataset (4,255 train / 472 eval),
smoke test results (7/7 stress), and final MI300X config (batch 16,
8 epochs, eval_interval 100).

EXP-21: bottleneck rotation (per_spoke_rope) on same v6 data.
Sequential run on same MI300X droplet. Tests whether rotation
helps with clean data (EXP-15b tested on poisoned v1 data).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eference

Daemon integration:
- CompositeProvider routes completions to spoke, embeddings to separate provider
- SpokeConfig in config with validation (enabled/endpoint/model/tasks)
- serve.go wrap() creates composite for spoke-enabled agent tasks
- Relax API key check for localhost in lifecycle-test and benchmark-quality

Inference optimization:
- Refactor QwenWithSpokes from forward hooks to an inline SpokeWrappedLayer
  (torch.compile compatible, no graph breaks; sketched after this commit
  message)
- serve_spokes.py: /v1/embeddings endpoint, torch.compile, TF32 matmul,
  SDPA attention, --no-compile and --embedding-model flags
- GGUF export script: subclasses convert_hf_to_gguf.py for Qwen 3.5 + spokes
- llama.cpp delivers 95.7 tok/s (3.8x vs PyTorch) on RX 7800 XT

TurboQuant:
- Reference implementation verified on ROCm (3-bit: 4.9x compression,
  cosine 0.973, quantize 1024 tokens in 0.40ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
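
A minimal sketch of the hook-to-inline refactor; the class name follows the commit, but the internals (a low-rank spoke path added to the base output) are assumed:

```python
import torch
import torch.nn as nn

class SpokeWrappedLayer(nn.Module):
    """Inline spoke path: traceable by torch.compile, unlike forward hooks,
    which introduce graph breaks."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

layer = torch.compile(SpokeWrappedLayer(nn.Linear(256, 256), rank=16))
```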
Three fixes and one improvement:

- Fix GGUF export: spoke tensors were silently dropped by the converter's
  tensor mapping pipeline. Rewrote export_qwen35_spokes.py as a two-phase
  approach: convert base model with standard converter, then read-copy-patch
  with gguf library to add spoke tensors and metadata directly. Also fixed
  tensor shape (removed incorrect transpose) and registered spoke tensors
  in the QWEN35 arch tensor set in llama.cpp. (The read-copy-patch phase is
  sketched after this commit message.)

- Fix gist merge UNIQUE constraint: consolidation agent reused cluster[0]'s
  raw_id for gist memories, causing UNIQUE constraint violations on repeated
  merge cycles. Gists now get their own UUID (source tracking via gist_of).

- Bump max token limits from 1024 to 4096 for encoding completions,
  retrieval synthesis, and global LLM cap. 32% of v6 training data exceeds
  1024 tokens — truncation was silently degrading encoding quality.

- Update CLAUDE.md: Qwen 3.5 2B as production model, add llama.cpp
  inference section, spoke routing convention, models/checkpoints in layout,
  Linux as primary dev platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
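
A condensed sketch of the patch phase using the gguf Python library; metadata copying and quantized-dtype handling are elided, and the arch string and metadata key are assumptions:

```python
import numpy as np
from gguf import GGUFReader, GGUFWriter

def patch_spokes(base_path: str, out_path: str, spokes: dict[str, np.ndarray]):
    """Phase 2: copy the converted base GGUF and append spoke tensors so the
    converter's tensor-name mapping never sees (and drops) them."""
    reader = GGUFReader(base_path)
    writer = GGUFWriter(out_path, arch="qwen35")        # arch name assumed

    for t in reader.tensors:               # copy base tensors (assumes f16/f32;
        writer.add_tensor(t.name, t.data)  # quantized dtypes need raw_dtype)
    for name, data in spokes.items():
        writer.add_tensor(name, data)      # stored as trained: (out, in),
                                           # no transpose
    writer.add_uint32("qwen35.num_spokes", len(spokes))  # key name assumed

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()
```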
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TurboQuant: 4.2x prompt cache compression in llama-server fork.
  KV layers compressed with rotation+codebook (3-bit K, 4-bit V).
  Recurrent layers compressed with Q8 affine quantization (scheme sketched
  after this commit message).

- Pre-registered EXP-22 (TurboQuant Phase 1) in experiment registry.

- Added generate_turboquant_tables.py (dev-time codebook generator).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
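
Per-tensor Q8 affine quantization in its simplest form, for reference; the fork may quantize per block or per row instead:

```python
import numpy as np

def q8_affine_quantize(x: np.ndarray):
    """Map floats to int8 with a scale and zero point (affine scheme)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0   # guard against constant tensors
    zero = round(-lo / scale) - 128    # maps lo -> -128, hi -> ~127
    q = np.clip(np.round(x / scale) + zero, -128, 127).astype(np.int8)
    return q, scale, zero

def q8_affine_dequantize(q: np.ndarray, scale: float, zero: int) -> np.ndarray:
    return (q.astype(np.float32) - zero) * scale
```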
- Gist memories now use NULL raw_id (via nullableString) instead of a
  random UUID that violates the raw_memories FK constraint.

- SearchByConcepts and SearchByConceptsInProject now use aliased column
  names (m.content etc.) to avoid ambiguity when JOINing with memories_fts.

Also adds training/scripts/rotorq_proof.py, which validates that random
orthogonal rotation before 4-bit quantization reduces MSE by 28% on
average across Qwen 3.5 2B weight matrices (condensed sketch after this
commit message).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
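
A condensed version of the rotorq_proof.py experiment. On a Gaussian test matrix the two paths score about the same; the 28% gain comes from real weight matrices, whose outliers inflate the absmax quantization step until a random rotation spreads them across coordinates:

```python
import numpy as np

def quantize_4bit(x: np.ndarray) -> np.ndarray:
    """Absmax 4-bit quantize-dequantize (16 levels)."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

def mse_after_quant(w: np.ndarray, rotate: bool, seed: int = 0) -> float:
    if rotate:
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.standard_normal((w.shape[1], w.shape[1])))
        w_hat = quantize_4bit(w @ q) @ q.T   # rotate, quantize, rotate back
    else:
        w_hat = quantize_4bit(w)
    return float(np.mean((w - w_hat) ** 2))

# Load a real weight matrix here to reproduce the reported gap.
```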
EXP-20 was originally registered as Qwen 3.5 on local RX 7800 XT,
completed with eval loss 0.5346 (checkpoints/exp20_v6_local/). The
registry entry was later reframed to Gemma 4 on MI300X by another
session without preserving the original Qwen results. Split into:

- EXP-20a: Qwen 3.5 2B, local, COMPLETED (eval 0.5346, deployed)
- EXP-20b: Gemma 4 E2B, MI300X, COMPLETED (eval 0.6082, pending stress)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add wandb logging to train_qwen_spokes.py (--wandb-name, --no-wandb)
- Add no_quantize + SDPA support to gemma_spoke_adapter.py for high-VRAM
- Add prepare_gemma_finetune_data.py (tokenize v6 for Gemma with EOS)
- Add prepare_synthesis_data.py (tokenize synthesis distillation data)
- Add MI300X droplet scripts (setup, download, EXP-20/20b/20d/21/23/24)
- Register EXP-20b/20c/20d/21/23/24 in experiment registry
- EXP-20b: Gemma eval 0.6082, stress test 6/7 (best ever)
- EXP-21: Rotation inconclusive (delta 0.0009)
- EXP-23: Synthesis proof-of-concept confirmed
- EXP-24: Multi-task 0.6291 (3.4% above encoding-only, within 5% target)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --skip-qwen, --gemma-checkpoint, --no-quantize, --batch flags
- Add run_model_batched() for MI300X parallel generation (3-5x speedup)
- Fix JSON parser: use brace-depth tracking to extract the first complete
  object (the model generates valid JSON, then continues with extra
  objects; sketched after this commit message)
- Strip Gemma turn markers that survive skip_special_tokens=True
- Pass eos_token_id and attention_mask to generate() for clean stopping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
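
The brace-depth extraction in miniature, with string-literal handling so braces inside JSON strings don't skew the count; this mirrors the described fix rather than the script's exact code:

```python
import json

def first_json_object(text: str) -> dict:
    """Return the first balanced top-level JSON object in `text`."""
    start = text.index("{")
    depth, in_str, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("no complete JSON object found")

# '{"gist": "a"} {"gist": "extra"}' -> only the first object is parsed
```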
Adapted from export_qwen35_spokes.py with arch=gemma4, metadata
prefix gemma4.num_spokes/gemma4.spoke_rank, and Gemma 4 base GGUF
conversion. Used with checkpoints from MI300X EXP-20 training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- quantize_rq4.py: produces GGML_TYPE_RQ4 GGUF from f16 input
  using TurboQuant Beta-distribution codebook (3.6x weight compression)
- rotorq_quantize_gguf.py: alternative custom format quantizer (unused)
- rotorq_preprocess_gguf.py: weight rotation preprocessor (unused)
- benchmark_quants.sh: automated quant sweep benchmark

RQ4 GGUF loads in llama-server but segfaults during warmup
(graph splits = 344, likely load_tiles_rq4 memory access bug).
Needs GPU debugging in next session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SearchByType to store for explicit type-filtered memory retrieval
- Skip the MMR diversity filter for explicit type filters (handoffs are
  similar by nature, so the diversity filter was dropping newer ones;
  sketched after this commit message)
- Exclude handoff memories from lossy consolidation merging
- Add feedback score tests for ranking adjustments

Fixes recall failing to surface the most recent handoff memories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
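
An illustration in Python (the store itself is Go; names and scoring are illustrative): classic MMR greedily trades relevance against similarity to already-selected results, which is exactly what drops near-duplicate handoffs, so explicit type filters bypass it:

```python
def rerank(candidates, query_sim, pairwise_sim, k, type_filter=None, lam=0.7):
    if type_filter is not None:
        # Explicit type filter: pure relevance order, so near-duplicate
        # handoffs (including the newest) all survive.
        return sorted(candidates, key=query_sim, reverse=True)[:k]
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:  # classic MMR greedy selection
        best = max(
            pool,
            key=lambda c: lam * query_sim(c)
            - (1 - lam) * max((pairwise_sim(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```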
Instruct encoding agent to keep structured_concepts arrays to 3-5 items
with short strings. Reduces token usage for verbose local models while
preserving encoding quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RotorQ inference breakthrough session:
- Fixed 3 bugs in RQ4 GPU kernels (dequant ordering, codebook, vec_dot scaling)
- Implemented dp4a integer SIMD vec_dot with AMD perm byte interleaving
  (semantics sketched after this commit message)
- Added RQ3 (3-bit) type: full pipeline, negative result (quality collapsed)
- GGUF-level spoke fusion: pre-concatenate matrices, 9.4% speedup
- Modified quantize_rq4.py to quantize fused spoke matrices
- Added fused tensor export to export_gemma4_spokes.py

Performance: 120 tok/s base, 101 tok/s fused spokes on RX 7800 XT.
See ~/Documents/rotorq_inference_report_2026-04-07.md for full analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
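
A NumPy simulation of the dp4a semantics the kernel relies on: each uint32 packs four signed bytes, and dp4a(a, b, c) accumulates their dot product into c. The AMD path must additionally permute operand bytes to match the lane layout; the permutation below is a stand-in, not the kernel's actual pattern:

```python
import numpy as np  # assumes a little-endian host

def dp4a(a: int, b: int, c: int) -> int:
    """c + dot(four packed int8 lanes of a, four packed int8 lanes of b)."""
    ab = np.array([a, b], dtype=np.uint32).view(np.uint8)
    a4 = ab[:4].astype(np.int8).astype(np.int32)
    b4 = ab[4:].astype(np.int8).astype(np.int32)
    return int(c + (a4 * b4).sum())

def interleave_bytes(x: int) -> int:
    """Stand-in byte permutation (0123 -> 0213); the real interleave depends
    on how the quantized tiles are laid out for the v_perm path."""
    b = np.array([x], dtype=np.uint32).view(np.uint8)
    return int(np.array([b[0], b[2], b[1], b[3]], dtype=np.uint8).view(np.uint32)[0])
```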
- test_rq4_config.yaml: points daemon at llama-server on port 8899
- test_rq4_quality.py: 7-input stress test for encoding JSON quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CalebisGross merged commit de1efd5 into main on Apr 8, 2026
CalebisGross deleted the feat/exp20-data-quality branch on April 8, 2026 at 00:02