feat: RQ4 GPU inference, spoke fusion, handoff recall fixes#379

Merged

CalebisGross merged 23 commits into main from feat/exp20-data-quality on Apr 8, 2026

Conversation

@CalebisGross (Collaborator)

Summary

  • RQ4 GPU inference: Fixed 3 bugs in the RotorQ GPU kernels (dequant element ordering, codebook mismatch, vec_dot scaling). Implemented dp4a integer SIMD with AMD byte interleaving. 120 tok/s base model, 101 tok/s with spokes on RX 7800 XT.
  • Spoke fusion: Pre-concatenate Felix-LM spoke matrices at GGUF export time. Reduces kernel launches from 280 to 70 per token, for a 9.4% generation speedup (see the sketch after this list).
  • RQ3 (3-bit) experiment: Full pipeline implemented; quality collapsed at 8 centroids. Documented as a negative result.
  • Handoff recall fix: Added SearchByType to store, skip MMR diversity for type filters, exclude handoffs from consolidation merging.
  • Encoding prompt: Added conciseness guidance for structured_concepts.
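
A minimal sketch of the pre-concatenation idea, assuming "pre-concatenate" means stacking the spoke projection rows under the base weight so a single GEMM serves both paths; the names and shapes here are illustrative, not the actual export code:

```python
# Hypothetical sketch: one fused GEMM replaces separate base + spoke launches.
import numpy as np

def fuse_spoke(w_base: np.ndarray, w_spoke: np.ndarray) -> np.ndarray:
    """Stack spoke rows under the base weight: (d_out + r, d_in)."""
    assert w_base.shape[1] == w_spoke.shape[1], "in-features must match"
    return np.concatenate([w_base, w_spoke], axis=0)

def split_output(y_fused: np.ndarray, d_out: int):
    """Recover base and spoke activations from the fused GEMM output."""
    return y_fused[..., :d_out], y_fused[..., d_out:]

x = np.random.randn(8, 256).astype(np.float32)     # (tokens, d_in)
w = np.random.randn(1024, 256).astype(np.float32)  # base projection
s = np.random.randn(16, 256).astype(np.float32)    # rank-16 spoke rows
y_base, y_spoke = split_output(x @ fuse_spoke(w, s).T, d_out=1024)
```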

Performance

Config              Generation (tok/s)   Notes
RQ4 base            120.3                30% faster than Q4_K_M
RQ4 fused spokes    101.3                With Felix-LM adapters
Q4_K_M baseline     92.2                 Standard llama.cpp

Full research report: ~/Documents/rotorq_inference_report_2026-04-07.md

Test plan

  • make check (go fmt + go vet)
  • make test (all pass)
  • RQ4 GPU inference verified correct output at all ngl values
  • Spoke fusion backward compatible (old GGUFs still load)
  • Encoding quality verified via stress test (7 inputs, valid JSON)
  • Full lifecycle test with fused RQ4 spokes model

Generated with Claude Code

CalebisGross and others added 23 commits April 4, 2026 10:44
…prep

- Add training_constants.py as a single source of truth for enums, system
  prompts, and required fields. Reconcile an enum mismatch across 8 scripts
  (emotional_tone had diverged between validate.py and the Gemini prompts)
- Upgrade validate.py with a 3-level quality pipeline (sketched after this
  commit message):
  Level 1: Schema (field types, enums, constraints)
  Level 2: Semantic fidelity (file:line preservation, entity preservation,
  proportionality, fabrication detection)
  Level 3: Dataset health (duplicate gists, concept diversity, balance)
- Add 19 tests covering all validation levels including fidelity checks
- Add generate_targeted_data.py for 5 failure-mode categories:
  A: Stack traces (file:line preservation)
  B: Named entities (person name preservation)
  C: Sparse inputs (minimal output for minimal input)
  D: Domain terms (no synonym substitution)
  E: Numerical precision (exact number preservation)
- Add batch_generate_targeted.py for Gemini Batch API pipeline
  (server-side queuing, zero rate limits, 50% cheaper)
- Add setup_droplet.sh for DO MI300X (ROCm 7.2, Ubuntu 24.04)
- Pre-register EXP-20 in experiment registry
- Update system prompt to explicitly instruct file:line and entity
  preservation in the content field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
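
A minimal sketch of the 3-level pipeline, assuming illustrative field names, enum values, and a simple file:line regex; the real validate.py checks more than this:

```python
import re

# Assumed enum values, for illustration only
ALLOWED_TONES = {"neutral", "frustrated", "excited", "concerned", "reflective"}

def check_schema(example: dict) -> list[str]:
    """Level 1: field types, enums, constraints."""
    errors = []
    if not isinstance(example.get("gist"), str):
        errors.append("gist must be a string")
    if example.get("emotional_tone") not in ALLOWED_TONES:
        errors.append(f"unknown emotional_tone: {example.get('emotional_tone')}")
    return errors

def check_fidelity(example: dict) -> list[str]:
    """Level 2: file:line references in the input must survive encoding."""
    refs = re.findall(r"\w+\.\w+:\d+", example.get("input", ""))
    return [f"dropped reference: {r}" for r in refs
            if r not in example.get("content", "")]

def check_dataset_health(examples: list[dict]) -> list[str]:
    """Level 3: duplicate gists across the whole dataset."""
    seen, errors = set(), []
    for ex in examples:
        if ex["gist"] in seen:
            errors.append(f"duplicate gist: {ex['gist'][:40]}")
        seen.add(ex["gist"])
    return errors
```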
…ixes

- Add --checkpoint, --skip-gemma, --skip-gemini CLI args to
  stress_test_hallucination.py for droplet use and iterative testing
- Update batch_encode.py model to gemini-3.1-pro-preview (the previous
  gemini-3-flash-preview is currently returning 503s)
- Persist cleaned v5 data to training/data/finetune_qwen_v5_cleaned/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g token limit

- Add generate_mnemonic_scenarios.py with 210 scenarios covering every
  mnemonic subsystem: perception, encoding, retrieval, consolidation,
  dreaming, episoding, abstraction, metacognition, orchestrator, reactor,
  store, MCP, API, LLM providers, watchers, daemon, events, config.
  All scenarios use real file paths, function names, and struct names.
- Add generate_mnemonic_bespoke.py for OpenRouter Qwen 3.6 generation
  (conservative rate limiting: 3 concurrent, 4s delay, daily limit detection)
- Fix batch_encode.py max_output_tokens: 2048 -> 8192 (encoding output
  was being truncated, causing a 92% failure rate on structured JSON)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sparse input templates now use a per-input mapping instead of random
gist assignment. Each input gets a semantically correct gist, matching
concepts, and appropriate emotional tone. Deduplicated to 51 unique
examples by gist to avoid template memorization (mapping sketched after
this commit message).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
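
A sketch of the per-input mapping with made-up template data; the real script carries one entry per sparse input:

```python
# Illustrative templates only: each sparse input owns its gist/concepts/tone.
SPARSE_TEMPLATES = {
    "bumped gguf dep to 0.10": {
        "gist": "Routine dependency bump, no behavior change.",
        "concepts": ["dependencies"],
        "emotional_tone": "neutral",
    },
    # ... one entry per sparse input ...
}

def build_examples(templates: dict) -> list[dict]:
    seen_gists, examples = set(), []
    for text, fields in templates.items():
        if fields["gist"] in seen_gists:  # dedupe by gist to avoid
            continue                      # template memorization
        seen_gists.add(fields["gist"])
        examples.append({"input": text, **fields})
    return examples
```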
…tion

- Add generate_distribution_balance.py with 114 scenarios across 4 categories:
  long_form (19): 400+ word debugging narratives, architecture docs, incidents
  code_format (25): raw Go code, JSON, YAML, shell output, log excerpts
  low_significance (40): routine config tweaks, dep updates, formatting fixes
  emotional_variety (30): frustrated, excited, concerned, reflective observations
- Fix batch_encode.py to preserve source/category from raw inputs instead of
  hardcoding 'swebench_unknown'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add procedural_generator.py: generates mnemonic-specific observations by
  combining real agent names, file paths, functions, structs, MCP tools,
  event types from the codebase with randomized realistic numbers. Produces
  500+ varied observations covering agent operations, errors, store ops,
  MCP calls, watcher events, config changes, performance metrics, training,
  collaboration, and decisions. (Generator shape sketched after this
  commit message.)
- Add generate_mnemonic_scenarios_v2.py: 96 hand-written scenarios covering
  short/medium/long observations, varied emotions, code references,
  multi-topic notes, cross-session context, and training process observations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
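
The shape of the generator, with stand-in vocabulary; the real procedural_generator.py draws its agent names, file paths, functions, and event types from the codebase:

```python
import random

AGENTS = ["encoding", "retrieval", "consolidation", "dreaming"]
EVENTS = ["memory.encoded", "gist.merged", "handoff.created"]
TEMPLATE = "{agent} agent processed {n} memories in {ms}ms, emitted {event}"

def generate(count: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [
        TEMPLATE.format(
            agent=rng.choice(AGENTS),
            n=rng.randint(1, 200),   # randomized realistic numbers
            ms=rng.randint(5, 900),
            event=rng.choice(EVENTS),
        )
        for _ in range(count)
    ]
```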
Smoke test results on v6 dataset (1000 steps, RX 7800 XT):
- Eval loss: 0.9354 -> 0.6319 (32% improvement)
- Stress test: 7/7 (up from 5/7 on v5 data)
- Both previously failing tests now pass:
  - Stack trace: preserves spread.go:142 and agent.go:89
  - Multi-topic: preserves "Jason" entity name

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EXP-20: updated with actual v6 dataset (4,255 train / 472 eval),
smoke test results (7/7 stress), and final MI300X config (batch 16,
8 epochs, eval_interval 100).

EXP-21: bottleneck rotation (per_spoke_rope) on same v6 data.
Sequential run on same MI300X droplet. Tests whether rotation
helps with clean data (EXP-15b tested on poisoned v1 data).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eference

Daemon integration:
- CompositeProvider routes completions to spoke, embeddings to separate provider
- SpokeConfig in config with validation (enabled/endpoint/model/tasks)
- serve.go wrap() creates composite for spoke-enabled agent tasks
- Relax API key check for localhost in lifecycle-test and benchmark-quality

Inference optimization:
- Refactor QwenWithSpokes from forward hooks to an inline SpokeWrappedLayer
  (torch.compile compatible, no graph breaks; sketched after this commit
  message)
- serve_spokes.py: /v1/embeddings endpoint, torch.compile, TF32 matmul,
  SDPA attention, --no-compile and --embedding-model flags
- GGUF export script: subclasses convert_hf_to_gguf.py for Qwen 3.5 + spokes
- llama.cpp delivers 95.7 tok/s (3.8x vs PyTorch) on RX 7800 XT

TurboQuant:
- Reference implementation verified on ROCm (3-bit: 4.9x compression,
  cosine 0.973, quantize 1024 tokens in 0.40ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
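
A minimal sketch of the hook-to-inline refactor; the class name follows the commit, but the internals (a low-rank spoke path added to the base output) are assumed:

```python
import torch
import torch.nn as nn

class SpokeWrappedLayer(nn.Module):
    """Inline spoke path: traceable by torch.compile, unlike forward hooks,
    which introduce graph breaks."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

layer = torch.compile(SpokeWrappedLayer(nn.Linear(256, 256), rank=16))
```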
Three fixes and one improvement:

- Fix GGUF export: spoke tensors were silently dropped by the converter's
  tensor mapping pipeline. Rewrote export_qwen35_spokes.py as a two-phase
  approach: convert base model with standard converter, then read-copy-patch
  with gguf library to add spoke tensors and metadata directly. Also fixed
  tensor shape (removed incorrect transpose) and registered spoke tensors
  in the QWEN35 arch tensor set in llama.cpp. (The read-copy-patch phase is
  sketched after this commit message.)

- Fix gist merge UNIQUE constraint: consolidation agent reused cluster[0]'s
  raw_id for gist memories, causing UNIQUE constraint violations on repeated
  merge cycles. Gists now get their own UUID (source tracking via gist_of).

- Bump max token limits from 1024 to 4096 for encoding completions,
  retrieval synthesis, and global LLM cap. 32% of v6 training data exceeds
  1024 tokens — truncation was silently degrading encoding quality.

- Update CLAUDE.md: Qwen 3.5 2B as production model, add llama.cpp
  inference section, spoke routing convention, models/checkpoints in layout,
  Linux as primary dev platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
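
A condensed sketch of the patch phase using the gguf Python library; metadata copying and quantized-dtype handling are elided, and the arch string and metadata key are assumptions:

```python
import numpy as np
from gguf import GGUFReader, GGUFWriter

def patch_spokes(base_path: str, out_path: str, spokes: dict[str, np.ndarray]):
    """Phase 2: copy the converted base GGUF and append spoke tensors so the
    converter's tensor-name mapping never sees (and drops) them."""
    reader = GGUFReader(base_path)
    writer = GGUFWriter(out_path, arch="qwen35")        # arch name assumed

    for t in reader.tensors:               # copy base tensors (assumes f16/f32;
        writer.add_tensor(t.name, t.data)  # quantized dtypes need raw_dtype)
    for name, data in spokes.items():
        writer.add_tensor(name, data)      # stored as trained: (out, in),
                                           # no transpose
    writer.add_uint32("qwen35.num_spokes", len(spokes))  # key name assumed

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()
```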
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TurboQuant: 4.2x prompt cache compression in llama-server fork.
  KV layers compressed with rotation+codebook (3-bit K, 4-bit V).
  Recurrent layers compressed with Q8 affine quantization (scheme sketched
  after this commit message).

- Pre-registered EXP-22 (TurboQuant Phase 1) in experiment registry.

- Added generate_turboquant_tables.py (dev-time codebook generator).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
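
Per-tensor Q8 affine quantization in its simplest form, for reference; the fork may quantize per block or per row instead:

```python
import numpy as np

def q8_affine_quantize(x: np.ndarray):
    """Map floats to int8 with a scale and zero point (affine scheme)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0   # guard against constant tensors
    zero = round(-lo / scale) - 128    # maps lo -> -128, hi -> ~127
    q = np.clip(np.round(x / scale) + zero, -128, 127).astype(np.int8)
    return q, scale, zero

def q8_affine_dequantize(q: np.ndarray, scale: float, zero: int) -> np.ndarray:
    return (q.astype(np.float32) - zero) * scale
```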
- Gist memories now use NULL raw_id (via nullableString) instead of a
  random UUID that violates the raw_memories FK constraint.

- SearchByConcepts and SearchByConceptsInProject now use aliased column
  names (m.content etc.) to avoid ambiguity when JOINing with memories_fts.

Also adds training/scripts/rotorq_proof.py, which validates that random
orthogonal rotation before 4-bit quantization reduces MSE by 28% on
average across Qwen 3.5 2B weight matrices (condensed sketch after this
commit message).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
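
A condensed version of the rotorq_proof.py experiment. On a Gaussian test matrix the two paths score about the same; the 28% gain comes from real weight matrices, whose outliers inflate the absmax quantization step until a random rotation spreads them across coordinates:

```python
import numpy as np

def quantize_4bit(x: np.ndarray) -> np.ndarray:
    """Absmax 4-bit quantize-dequantize (16 levels)."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

def mse_after_quant(w: np.ndarray, rotate: bool, seed: int = 0) -> float:
    if rotate:
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.standard_normal((w.shape[1], w.shape[1])))
        w_hat = quantize_4bit(w @ q) @ q.T   # rotate, quantize, rotate back
    else:
        w_hat = quantize_4bit(w)
    return float(np.mean((w - w_hat) ** 2))

# Load a real weight matrix here to reproduce the reported gap.
```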
EXP-20 was originally registered as Qwen 3.5 on local RX 7800 XT,
completed with eval loss 0.5346 (checkpoints/exp20_v6_local/). The
registry entry was later reframed to Gemma 4 on MI300X by another
session without preserving the original Qwen results. Split into:

- EXP-20a: Qwen 3.5 2B, local, COMPLETED (eval 0.5346, deployed)
- EXP-20b: Gemma 4 E2B, MI300X, COMPLETED (eval 0.6082, pending stress)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add wandb logging to train_qwen_spokes.py (--wandb-name, --no-wandb)
- Add no_quantize + SDPA support to gemma_spoke_adapter.py for high-VRAM
- Add prepare_gemma_finetune_data.py (tokenize v6 for Gemma with EOS)
- Add prepare_synthesis_data.py (tokenize synthesis distillation data)
- Add MI300X droplet scripts (setup, download, EXP-20/20b/20d/21/23/24)
- Register EXP-20b/20c/20d/21/23/24 in experiment registry
- EXP-20b: Gemma eval 0.6082, stress test 6/7 (best ever)
- EXP-21: Rotation inconclusive (delta 0.0009)
- EXP-23: Synthesis proof-of-concept confirmed
- EXP-24: Multi-task 0.6291 (3.4% above encoding-only, within 5% target)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --skip-qwen, --gemma-checkpoint, --no-quantize, --batch flags
- Add run_model_batched() for MI300X parallel generation (3-5x speedup)
- Fix JSON parser: use brace-depth tracking to extract the first complete
  object (the model generates valid JSON, then continues with extra
  objects; sketched after this commit message)
- Strip Gemma turn markers that survive skip_special_tokens=True
- Pass eos_token_id and attention_mask to generate() for clean stopping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
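
The brace-depth extraction in miniature, with string-literal handling so braces inside JSON strings don't skew the count; this mirrors the described fix rather than the script's exact code:

```python
import json

def first_json_object(text: str) -> dict:
    """Return the first balanced top-level JSON object in `text`."""
    start = text.index("{")
    depth, in_str, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("no complete JSON object found")

# '{"gist": "a"} {"gist": "extra"}' -> only the first object is parsed
```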
Adapted from export_qwen35_spokes.py with arch=gemma4, metadata
prefix gemma4.num_spokes/gemma4.spoke_rank, and Gemma 4 base GGUF
conversion. Used with checkpoints from MI300X EXP-20 training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- quantize_rq4.py: produces GGML_TYPE_RQ4 GGUF from f16 input
  using TurboQuant Beta-distribution codebook (3.6x weight compression)
- rotorq_quantize_gguf.py: alternative custom format quantizer (unused)
- rotorq_preprocess_gguf.py: weight rotation preprocessor (unused)
- benchmark_quants.sh: automated quant sweep benchmark

RQ4 GGUF loads in llama-server but segfaults during warmup
(graph splits = 344, likely load_tiles_rq4 memory access bug).
Needs GPU debugging in next session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SearchByType to store for explicit type-filtered memory retrieval
- Skip the MMR diversity filter for explicit type filters (handoffs are
  similar by nature, so the diversity filter was dropping newer ones;
  sketched after this commit message)
- Exclude handoff memories from lossy consolidation merging
- Add feedback score tests for ranking adjustments

Fixes recall failing to surface the most recent handoff memories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
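
An illustration in Python (the store itself is Go; names and scoring are illustrative): classic MMR greedily trades relevance against similarity to already-selected results, which is exactly what drops near-duplicate handoffs, so explicit type filters bypass it:

```python
def rerank(candidates, query_sim, pairwise_sim, k, type_filter=None, lam=0.7):
    if type_filter is not None:
        # Explicit type filter: pure relevance order, so near-duplicate
        # handoffs (including the newest) all survive.
        return sorted(candidates, key=query_sim, reverse=True)[:k]
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:  # classic MMR greedy selection
        best = max(
            pool,
            key=lambda c: lam * query_sim(c)
            - (1 - lam) * max((pairwise_sim(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```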
Instruct encoding agent to keep structured_concepts arrays to 3-5 items
with short strings. Reduces token usage for verbose local models while
preserving encoding quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RotorQ inference breakthrough session:
- Fixed 3 bugs in RQ4 GPU kernels (dequant ordering, codebook, vec_dot scaling)
- Implemented dp4a integer SIMD vec_dot with AMD perm byte interleaving
  (semantics sketched after this commit message)
- Added RQ3 (3-bit) type: full pipeline, negative result (quality collapsed)
- GGUF-level spoke fusion: pre-concatenate matrices, 9.4% speedup
- Modified quantize_rq4.py to quantize fused spoke matrices
- Added fused tensor export to export_gemma4_spokes.py

Performance: 120 tok/s base, 101 tok/s fused spokes on RX 7800 XT.
See ~/Documents/rotorq_inference_report_2026-04-07.md for full analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
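
A NumPy simulation of the dp4a semantics the kernel relies on: each uint32 packs four signed bytes, and dp4a(a, b, c) accumulates their dot product into c. The AMD path must additionally permute operand bytes to match the lane layout; the permutation below is a stand-in, not the kernel's actual pattern:

```python
import numpy as np  # assumes a little-endian host

def dp4a(a: int, b: int, c: int) -> int:
    """c + dot(four packed int8 lanes of a, four packed int8 lanes of b)."""
    ab = np.array([a, b], dtype=np.uint32).view(np.uint8)
    a4 = ab[:4].astype(np.int8).astype(np.int32)
    b4 = ab[4:].astype(np.int8).astype(np.int32)
    return int(c + (a4 * b4).sum())

def interleave_bytes(x: int) -> int:
    """Stand-in byte permutation (0123 -> 0213); the real interleave depends
    on how the quantized tiles are laid out for the v_perm path."""
    b = np.array([x], dtype=np.uint32).view(np.uint8)
    return int(np.array([b[0], b[2], b[1], b[3]], dtype=np.uint8).view(np.uint32)[0])
```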
- test_rq4_config.yaml: points daemon at llama-server on port 8899
- test_rq4_quality.py: 7-input stress test for encoding JSON quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CalebisGross merged commit de1efd5 into main on Apr 8, 2026
CalebisGross deleted the feat/exp20-data-quality branch on April 8, 2026 at 00:02