feat: RQ4 GPU inference, spoke fusion, handoff recall fixes #379
Merged
CalebisGross merged 23 commits into main (Apr 8, 2026)
Conversation
…prep

- Add training_constants.py as the single source of truth for enums, system prompts, and required fields. Reconcile enum mismatch across 8 scripts (emotional_tone had diverged between validate.py and the Gemini prompts)
- Upgrade validate.py with a 3-level quality pipeline:
  - Level 1: Schema (field types, enums, constraints)
  - Level 2: Semantic fidelity (file:line preservation, entity preservation, proportionality, fabrication detection)
  - Level 3: Dataset health (duplicate gists, concept diversity, balance)
- Add 19 tests covering all validation levels, including fidelity checks
- Add generate_targeted_data.py for 5 failure-mode categories:
  - A: Stack traces (file:line preservation)
  - B: Named entities (person name preservation)
  - C: Sparse inputs (minimal output for minimal input)
  - D: Domain terms (no synonym substitution)
  - E: Numerical precision (exact number preservation)
- Add batch_generate_targeted.py for the Gemini Batch API pipeline (server-side queuing, no rate limits, 50% cheaper)
- Add setup_droplet.sh for the DO MI300X droplet (ROCm 7.2, Ubuntu 24.04)
- Pre-register EXP-20 in the experiment registry
- Update the system prompt to explicitly instruct file:line and entity preservation in the content field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ixes

- Add --checkpoint, --skip-gemma, and --skip-gemini CLI args to stress_test_hallucination.py for droplet use and iterative testing
- Update batch_encode.py model to gemini-3.1-pro-preview (gemini-3-flash-preview is currently returning 503s)
- Persist cleaned v5 data to training/data/finetune_qwen_v5_cleaned/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g token limit

- Add generate_mnemonic_scenarios.py with 210 scenarios covering every mnemonic subsystem: perception, encoding, retrieval, consolidation, dreaming, episoding, abstraction, metacognition, orchestrator, reactor, store, MCP, API, LLM providers, watchers, daemon, events, config. All scenarios use real file paths, function names, and struct names.
- Add generate_mnemonic_bespoke.py for OpenRouter Qwen 3.6 generation (conservative rate limiting: 3 concurrent requests, 4s delay, daily-limit detection)
- Fix batch_encode.py max_output_tokens: 2048 -> 8192 (encoding output was being truncated, causing a 92% failure rate on structured JSON)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sparse input templates now use a per-input mapping instead of random gist assignment. Each input gets a semantically correct gist, matching concepts, and an appropriate emotional tone. Deduplicated to 51 unique examples by gist to avoid template memorization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
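The dedup-by-gist step described above can be sketched as follows. This is a minimal illustration, not the actual script; the field names and normalization are assumptions.

```python
# Deduplicate sparse-input examples by their (normalized) gist so the model
# cannot memorize a template that appears many times with different inputs.
def dedup_by_gist(examples):
    seen = set()
    unique = []
    for ex in examples:
        key = ex["gist"].strip().lower()  # assumed normalization
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

examples = [
    {"gist": "Updated config", "input": "tweak A"},
    {"gist": "updated config ", "input": "tweak B"},  # duplicate after normalization
    {"gist": "Fixed typo", "input": "fix C"},
]
unique = dedup_by_gist(examples)  # keeps 2 of the 3 examples
```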
…tion

- Add generate_distribution_balance.py with 114 scenarios across 4 categories:
  - long_form (19): 400+ word debugging narratives, architecture docs, incidents
  - code_format (25): raw Go code, JSON, YAML, shell output, log excerpts
  - low_significance (40): routine config tweaks, dep updates, formatting fixes
  - emotional_variety (30): frustrated, excited, concerned, reflective observations
- Fix batch_encode.py to preserve source/category from raw inputs instead of hardcoding 'swebench_unknown'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add procedural_generator.py: generates mnemonic-specific observations by combining real agent names, file paths, functions, structs, MCP tools, and event types from the codebase with randomized realistic numbers. Produces 500+ varied observations covering agent operations, errors, store ops, MCP calls, watcher events, config changes, performance metrics, training, collaboration, and decisions.
- Add generate_mnemonic_scenarios_v2.py: 96 hand-written scenarios covering short/medium/long observations, varied emotions, code references, multi-topic notes, cross-session context, and training-process observations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
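The procedural approach can be sketched like this: fixed vocabularies of real codebase identifiers, combined with templated sentences and randomized realistic numbers. The vocabularies and templates below are illustrative stand-ins, not the actual contents of procedural_generator.py.

```python
import random

# Illustrative vocabularies; the real script draws these from the codebase.
AGENTS = ["encoding", "retrieval", "consolidation", "dreaming"]
FILES = ["internal/store/store.go", "internal/mcp/server.go"]
TEMPLATES = [
    "{agent} agent processed {n} memories in {ms}ms",
    "error in {file}: retry {n} of 3 after {ms}ms backoff",
]

def generate_observation(rng):
    # Pick a template, then fill every slot; str.format ignores unused kwargs,
    # so each template only consumes the slots it declares.
    template = rng.choice(TEMPLATES)
    return template.format(
        agent=rng.choice(AGENTS),
        file=rng.choice(FILES),
        n=rng.randint(1, 500),
        ms=rng.randint(5, 2000),
    )

rng = random.Random(42)
observations = [generate_observation(rng) for _ in range(5)]
```

A small template set with randomized slots is enough to produce hundreds of distinct surface forms without duplicating any single string.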
Smoke test results on the v6 dataset (1000 steps, RX 7800 XT):

- Eval loss: 0.9354 -> 0.6319 (32% improvement)
- Stress test: 7/7 (up from 5/7 on v5 data)
- Both previously failing tests now pass:
  - Stack trace: preserves spread.go:142 and agent.go:89
  - Multi-topic: preserves the "Jason" entity name

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EXP-20: updated with the actual v6 dataset (4,255 train / 472 eval), smoke test results (7/7 stress), and the final MI300X config (batch 16, 8 epochs, eval_interval 100).

EXP-21: bottleneck rotation (per_spoke_rope) on the same v6 data, run sequentially on the same MI300X droplet. Tests whether rotation helps with clean data (EXP-15b tested rotation on poisoned v1 data).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eference

Daemon integration:
- CompositeProvider routes completions to the spoke and embeddings to a separate provider
- SpokeConfig in config with validation (enabled/endpoint/model/tasks)
- serve.go wrap() creates the composite for spoke-enabled agent tasks
- Relax the API key check for localhost in lifecycle-test and benchmark-quality

Inference optimization:
- Refactor QwenWithSpokes from forward hooks to an inline SpokeWrappedLayer (torch.compile compatible, no graph breaks)
- serve_spokes.py: /v1/embeddings endpoint, torch.compile, TF32 matmul, SDPA attention, --no-compile and --embedding-model flags
- GGUF export script: subclasses convert_hf_to_gguf.py for Qwen 3.5 + spokes
- llama.cpp delivers 95.7 tok/s (3.8x vs PyTorch) on the RX 7800 XT

TurboQuant:
- Reference implementation verified on ROCm (3-bit: 4.9x compression, cosine 0.973, quantizes 1024 tokens in 0.40ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
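The routing idea behind CompositeProvider can be sketched in a few lines. The real implementation lives in the Go daemon; the class and method names below are assumptions for illustration only. Completion requests for spoke-enabled tasks go to the spoke endpoint, while embedding requests always go to the separate provider.

```python
class StubProvider:
    # Stand-in for a real LLM provider; records which backend handled the call.
    def __init__(self, name):
        self.name = name
    def complete(self, task, prompt):
        return f"{self.name}:complete:{task}"
    def embed(self, text):
        return f"{self.name}:embed"

class CompositeProvider:
    def __init__(self, spoke, fallback, spoke_tasks):
        self.spoke = spoke
        self.fallback = fallback
        self.spoke_tasks = set(spoke_tasks)

    def complete(self, task, prompt):
        # Route only the configured tasks to the spoke model.
        provider = self.spoke if task in self.spoke_tasks else self.fallback
        return provider.complete(task, prompt)

    def embed(self, text):
        # Embeddings never go to the spoke model.
        return self.fallback.embed(text)

composite = CompositeProvider(StubProvider("spoke"), StubProvider("base"),
                              spoke_tasks={"encoding"})
spoke_result = composite.complete("encoding", "some prompt")
embed_result = composite.embed("hello")
```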
Three fixes and one improvement:

- Fix GGUF export: spoke tensors were silently dropped by the converter's tensor-mapping pipeline. Rewrote export_qwen35_spokes.py as a two-phase approach: convert the base model with the standard converter, then read-copy-patch with the gguf library to add spoke tensors and metadata directly. Also fixed the tensor shape (removed an incorrect transpose) and registered the spoke tensors in the QWEN35 arch tensor set in llama.cpp.
- Fix gist-merge UNIQUE constraint: the consolidation agent reused cluster[0]'s raw_id for gist memories, causing UNIQUE constraint violations on repeated merge cycles. Gists now get their own UUID (source tracking via gist_of).
- Bump max token limits from 1024 to 4096 for encoding completions, retrieval synthesis, and the global LLM cap. 32% of the v6 training data exceeds 1024 tokens, so truncation was silently degrading encoding quality.
- Update CLAUDE.md: Qwen 3.5 2B as the production model, add a llama.cpp inference section, the spoke routing convention, models/checkpoints in the layout, and Linux as the primary dev platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TurboQuant: 4.2x prompt-cache compression in the llama-server fork. KV layers are compressed with rotation + codebook (3-bit K, 4-bit V); recurrent layers with Q8 affine quantization.
- Pre-registered EXP-22 (TurboQuant Phase 1) in the experiment registry.
- Added generate_turboquant_tables.py (dev-time codebook generator).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Gist memories now use a NULL raw_id (via nullableString) instead of a random UUID that violated the raw_memories FK constraint.
- SearchByConcepts and SearchByConceptsInProject now use aliased column names (m.content, etc.) to avoid ambiguity when JOINing with memories_fts.

Also adds training/scripts/rotorq_proof.py, which validates that a random orthogonal rotation before 4-bit quantization reduces MSE by 28% on average across Qwen 3.5 2B weight matrices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
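The rotation-before-quantization effect can be demonstrated on synthetic data. This is a sketch of the idea behind rotorq_proof.py, not the script itself: the test matrix, outlier injection, and symmetric int4 quantizer below are assumptions. A random orthogonal rotation spreads outliers across many coordinates, shrinking the quantization scale and hence the reconstruction error; since the rotation is orthogonal, rotating back preserves the error norm.

```python
import numpy as np

def quantize_4bit(w):
    # Symmetric int4 quantization: levels -7..7 shared across the matrix.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
w[rng.random(w.shape) < 0.01] *= 20.0  # inject ~1% large outliers

# Baseline: quantize directly. Outliers inflate the scale for everyone.
mse_plain = np.mean((w - quantize_4bit(w)) ** 2)

# Rotated: random orthogonal Q via QR decomposition, quantize, rotate back.
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))
w_back = quantize_4bit(w @ Q) @ Q.T
mse_rot = np.mean((w - w_back) ** 2)
```

On this synthetic matrix the rotated path gives a markedly lower MSE than direct quantization; the 28% figure in the commit is measured on real Qwen 3.5 2B weights, where the gap is smaller.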
EXP-20 was originally registered as Qwen 3.5 on the local RX 7800 XT and completed with eval loss 0.5346 (checkpoints/exp20_v6_local/). The registry entry was later reframed to Gemma 4 on the MI300X by another session without preserving the original Qwen results. Split into:

- EXP-20a: Qwen 3.5 2B, local, COMPLETED (eval 0.5346, deployed)
- EXP-20b: Gemma 4 E2B, MI300X, COMPLETED (eval 0.6082, stress test pending)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add wandb logging to train_qwen_spokes.py (--wandb-name, --no-wandb)
- Add no_quantize + SDPA support to gemma_spoke_adapter.py for high-VRAM runs
- Add prepare_gemma_finetune_data.py (tokenize v6 for Gemma with EOS)
- Add prepare_synthesis_data.py (tokenize synthesis distillation data)
- Add MI300X droplet scripts (setup, download, EXP-20/20b/20d/21/23/24)
- Register EXP-20b/20c/20d/21/23/24 in the experiment registry

Results:
- EXP-20b: Gemma eval 0.6082, stress test 6/7 (best ever)
- EXP-21: rotation inconclusive (delta 0.0009)
- EXP-23: synthesis proof-of-concept confirmed
- EXP-24: multi-task 0.6291 (3.4% above encoding-only, within the 5% target)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --skip-qwen, --gemma-checkpoint, --no-quantize, and --batch flags
- Add run_model_batched() for MI300X parallel generation (3-5x speedup)
- Fix the JSON parser: use brace-depth tracking to extract the first complete object (the model generates valid JSON, then continues with extra objects)
- Strip Gemma turn markers that survive skip_special_tokens=True
- Pass eos_token_id and attention_mask to generate() for clean stopping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
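The brace-depth extraction fix can be sketched as follows (a minimal illustration of the technique, not the actual parser): scan from the first `{`, track nesting depth while ignoring braces inside JSON strings, and cut at the first point depth returns to zero.

```python
import json

def extract_first_json(text):
    # Find where the first object starts; the model may emit a preamble.
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text[start:], start):
        if escaped:
            escaped = False          # previous char was a backslash
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # First complete object; ignore anything after it.
                    return json.loads(text[start:i + 1])
    raise ValueError("unbalanced JSON object")

# The model emits one valid object, then keeps going with extra objects.
obj = extract_first_json('{"gist": "ok {brace} in string"} {"extra": 1}')
```

Tracking the in-string state is what makes this robust: a naive `text[:text.index("}") + 1]` cut would break on any gist containing a brace.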
Adapted from export_qwen35_spokes.py with arch=gemma4, metadata prefix gemma4.num_spokes/gemma4.spoke_rank, and Gemma 4 base GGUF conversion. Used with checkpoints from MI300X EXP-20 training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- quantize_rq4.py: produces a GGML_TYPE_RQ4 GGUF from f16 input using the TurboQuant Beta-distribution codebook (3.6x weight compression)
- rotorq_quantize_gguf.py: alternative custom-format quantizer (unused)
- rotorq_preprocess_gguf.py: weight-rotation preprocessor (unused)
- benchmark_quants.sh: automated quant-sweep benchmark

The RQ4 GGUF loads in llama-server but segfaults during warmup (graph splits = 344; likely a load_tiles_rq4 memory-access bug). Needs GPU debugging in the next session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SearchByType to the store for explicit type-filtered memory retrieval
- Skip the MMR diversity filter for explicit type filters (handoffs are similar by nature, so the diversity filter was dropping newer ones)
- Exclude handoff memories from lossy consolidation merging
- Add feedback-score tests for ranking adjustments

Fixes recall failing to surface the most recent handoff memories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
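The MMR-skip logic can be sketched like this. The real code is in the Go store; the function names, fields, and plain-recency ordering below are assumptions for illustration. The point is the branch: when the caller passes an explicit type filter, diversity re-ranking is bypassed, because handoff memories are similar by construction and the diversity penalty was discarding the newest ones.

```python
def mmr_rerank(results):
    # Stand-in for the real MMR diversity re-ranker.
    return results

def rank_results(results, type_filter=None):
    if type_filter is not None:
        # Explicit type filter: plain recency order, no diversity penalty,
        # so the most recent handoff always surfaces first.
        return sorted(results, key=lambda r: r["timestamp"], reverse=True)
    return mmr_rerank(results)

handoffs = [
    {"id": "a", "timestamp": 1},
    {"id": "b", "timestamp": 3},
    {"id": "c", "timestamp": 2},
]
top = rank_results(handoffs, type_filter="handoff")[0]  # newest handoff
```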
Instruct the encoding agent to keep structured_concepts arrays to 3-5 items with short strings. This reduces token usage for verbose local models while preserving encoding quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RotorQ inference breakthrough session:

- Fixed 3 bugs in the RQ4 GPU kernels (dequant ordering, codebook, vec_dot scaling)
- Implemented a dp4a integer-SIMD vec_dot with AMD perm byte interleaving
- Added an RQ3 (3-bit) type: full pipeline, negative result (quality collapsed)
- GGUF-level spoke fusion: pre-concatenate matrices for a 9.4% speedup
- Modified quantize_rq4.py to quantize the fused spoke matrices
- Added fused-tensor export to export_gemma4_spokes.py

Performance: 120 tok/s base, 101 tok/s with fused spokes on the RX 7800 XT. See ~/Documents/rotorq_inference_report_2026-04-07.md for the full analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
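The fusion idea can be shown with a toy example. This sketch makes assumptions about the spoke architecture (that the base projection and the spoke projection both consume the same input vector); the shapes and names are illustrative. Stacking the two weight matrices along the output dimension at export time lets the runtime issue one matmul instead of two, then split the result, which is where the speedup comes from.

```python
import numpy as np

d_model, d_out, spoke_rank = 64, 128, 8
rng = np.random.default_rng(0)
w_base = rng.normal(size=(d_out, d_model))       # base projection weights
w_spoke = rng.normal(size=(spoke_rank, d_model))  # low-rank spoke projection

# Done once at GGUF export time: concatenate along the output dimension.
w_fused = np.concatenate([w_base, w_spoke], axis=0)

# At runtime: one matmul, then split the output.
x = rng.normal(size=(d_model,))
y = w_fused @ x
y_base, y_spoke = y[:d_out], y[d_out:]

# Fused result is exactly the two separate matmuls.
assert np.allclose(y_base, w_base @ x)
assert np.allclose(y_spoke, w_spoke @ x)
```

The fusion is exact, so it changes only kernel launch count and memory access patterns, not model output (up to float associativity).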
- test_rq4_config.yaml: points the daemon at llama-server on port 8899
- test_rq4_quality.py: 7-input stress test for encoding JSON quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Performance
Full research report: ~/Documents/rotorq_inference_report_2026-04-07.md

Test plan

- make check (go fmt + go vet)
- make test (all pass)

Generated with Claude Code