feat: Gemma 4 E2B spoke training — 25/25 schema compliance #400
Merged
CalebisGross merged 11 commits into main on Apr 13, 2026
Conversation
…dict

The export script converted sliding_window_pattern from arr[BOOL] to arr[INT32], silently corrupting attention layer assignments in llama.cpp. Keeping the native bool type preserves correct SWA/global attention routing.

EXP-30 verdict updated to CONFIRMED (training) / BLOCKED (deployment): spokes produce valid, faithful JSON via Python HF, but llama.cpp Gemma 4 generation is broken at the engine level (the base model also fails).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
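To make the type issue concrete, here is a minimal sketch of the fix, assuming the export script writes metadata through the gguf Python package; the function name and metadata key below are illustrative, not the script's actual code:

```python
# Hypothetical sketch: preserve the native bool element type so the GGUF
# writer emits arr[BOOL]; coercing to int silently yields arr[INT32] and
# corrupts SWA/global attention routing in llama.cpp.
from gguf import GGUFWriter

def write_attention_pattern(writer: GGUFWriter, pattern: list) -> None:
    # Buggy version: [int(x) for x in pattern]  ->  arr[INT32]
    # Fix: keep bools; the writer infers the array subtype per element.
    writer.add_array("gemma4.attention.sliding_window_pattern",  # key assumed
                     [bool(x) for x in pattern])
```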
OpenAI-compatible HTTP server for serving Gemma 4 E2B + trained Felix spokes via HuggingFace generate(). Drop-in replacement for llama-server or LM Studio — the daemon connects via its existing LMStudioProvider.

- Loads NF4-quantized base model with spoke adapters injected at all 35 decoder layers (~110MB spoke overhead on GPU)
- Serves /v1/chat/completions, /v1/embeddings, /v1/models, /health
- Strips markdown code fences from model output (Gemma chat quirk)
- Optional torch.compile, PLE offloading, bf16 mode via CLI flags
- Spokes kept in fp32 on GPU (SpokeLayer.forward() casts to fp32 internally for numerical stability)

Tested: valid JSON generation at ~14.6 tok/s on RX 7800 XT (NF4, no torch.compile). Schema compliance is partial without grammar enforcement — content is faithful but field structure varies. Grammar enforcement (outlines/GBNF) or a bespoke inference engine is the next step for production deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
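For readers unfamiliar with the endpoint shape, a minimal sketch of an OpenAI-compatible /v1/chat/completions handler, assuming FastAPI. The real serve_gemma_spokes.py additionally loads the NF4 base model and injects spoke adapters; the model id and fence-stripping regex here are assumptions:

```python
import re

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-E2B-it"  # assumed id; real server also adds spokes

app = FastAPI()
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Gemma chat models tend to wrap JSON in markdown fences; strip them.
FENCE = re.compile(r"^\s*`{3}(?:json)?\s*|\s*`{3}\s*$")

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 512

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    prompt = tok.apply_chat_template(
        req.messages, tokenize=False, add_generation_prompt=True)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=req.max_tokens)
    text = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": FENCE.sub("", text)},
        }],
    }
```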
EXP-30 systematic characterization (10/25 gold probes + diagnostics) revealed structural schema compliance is broken despite clean training data. Content faithfulness is confirmed but field structure collapses: concepts as dict instead of list[str], missing summary field, mixed types in structured_concepts, truncated JSON on longer outputs.

Root cause: PPL 3.3 leaves too much per-token uncertainty on structural tokens, allowing the base model's JSON priors to override spoke training. Grammar enforcement (outlines) fails — the model distribution fights the grammar constraints. Training data audit: 5,880/5,880 targets correct.

EXP-31 pre-registered: constant LR 3e-5 (eliminating the wasteful high-LR phase from EXP-30), Karpathy overfit test first, evaluation via characterize_serve_output.py on all 25 gold probes. Also adds characterize_serve_output.py for systematic schema compliance measurement against the serve endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
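The failure modes above are mechanical to test for. A hedged sketch of the per-probe check a tool like characterize_serve_output.py performs; the actual script is not shown in this PR excerpt, and the field names below are taken from the failures listed above:

```python
import json

def check_schema(raw: str) -> list[str]:
    """Return structural violations for one probe output (empty = compliant)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["truncated or invalid JSON"]
    errors = []
    if "summary" not in obj:
        errors.append("missing summary field")
    concepts = obj.get("concepts")
    if not (isinstance(concepts, list)
            and all(isinstance(c, str) for c in concepts)):
        errors.append("concepts is not list[str]")  # e.g. emitted as a dict
    sc = obj.get("structured_concepts")
    if isinstance(sc, list) and len({type(x) for x in sc}) > 1:
        errors.append("mixed types in structured_concepts")
    return errors
```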
…aining support

- Renamed script to reflect it handles both Qwen and Gemma (--model-type flag)
- Added --no-quantize flag for bf16 training (train full precision, quantize after)
- Fixed gradient checkpointing: HF's gradient_checkpointing_enable() works with bf16 base models. SpokeWrappedLayer's custom checkpointing removed — ISWA attention masks cause shape mismatches during manual checkpoint recomputation. NF4 models skip checkpointing (quantized layers can't recompute).
- Updated CLAUDE.md training section with current script names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
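A sketch of the resulting CLI surface; the flag names come from this commit, but the choices and the wiring below are assumptions, not the script's actual code:

```python
import argparse

parser = argparse.ArgumentParser(prog="train_spokes.py")
parser.add_argument("--model-type", choices=["qwen", "gemma"], required=True,
                    help="base architecture to wrap with spokes")
parser.add_argument("--no-quantize", action="store_true",
                    help="train the base model in bf16; quantize to NF4 after")
args = parser.parse_args()

# Quantized (NF4) layers cannot recompute activations, so gradient
# checkpointing is only enabled on the bf16 (--no-quantize) path.
use_checkpointing = args.no_quantize
```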
HF's gradient_checkpointing_enable() forces use_cache=False, which breaks Gemma 4's ISWA attention. KV-sharing layers fall back to value_states=key_states when past_key_values=None, producing PPL 2.7M (2% accuracy vs 68.6% with the cache present). Every prior Gemma training run was training on corrupted output.

Fix: SpokeWrappedLayer owns gradient checkpointing instead of using HF's implementation. TrainingCache wraps DynamicCache with an idempotent update() to handle checkpoint recomputation without doubling KV entries. train_spokes.py routes Gemma models to the custom checkpointing path.

Validated: overfit test (10 examples) loss 1.86→0.0096 (PPL 1.0); inference produces valid JSON with all 10 schema fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
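A reconstruction of the TrainingCache idea from the description above (not the PR's actual code), assuming the classic DynamicCache.update(key_states, value_states, layer_idx, cache_kwargs) interface:

```python
from transformers import DynamicCache

class TrainingCache:
    """Idempotent cache wrapper for gradient checkpointing.

    Checkpoint recomputation calls update() a second time for each layer;
    appending again would double the KV entries, so a repeated update at
    the same sequence length returns the previously cached result instead.
    """

    def __init__(self):
        self._inner = DynamicCache()
        self._last = {}  # layer_idx -> (seq_len, (key_states, value_states))

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        seq_len = key_states.shape[-2]
        seen = self._last.get(layer_idx)
        if seen is not None and seen[0] == seq_len:
            return seen[1]  # recomputation pass: no second append
        out = self._inner.update(key_states, value_states, layer_idx,
                                 cache_kwargs)
        self._last[layer_idx] = (seq_len, out)
        return out

    def __getattr__(self, name):
        # Delegate everything else (get_seq_length, etc.) to the real cache.
        return getattr(self._inner, name)
```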
…CLAUDE.md

- CLAUDE.md: add critical Gemma 4 gradient checkpointing warning, update current state to reflect EXP-31, add Gemma dataset path
- Experiment registry: EXP-31 status REGISTERED → RUNNING

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemma 4 E2B spokes achieve full schema compliance on all 25 gold probes after fixing the use_cache=False bug. Eval loss 0.5217 (PPL 1.7), 48 consecutive new bests, zero regressions. 17.1h training on RX 7800 XT. Remaining: inference speed (17 tok/s vs Qwen's 95).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The actual run used cosine decay (warmup 50 opt steps, min LR 3e-5), not WSD as originally registered. WSD was discussed but never implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
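For reference, a sketch of the schedule as described: linear warmup over 50 optimizer steps, then cosine decay to a 3e-5 floor. The warmup length and floor come from the correction above; the peak LR and total step count are placeholders, not the run's actual values:

```python
import math

from torch.optim.lr_scheduler import LambdaLR

def cosine_with_floor(optimizer, warmup=50, total=1000,
                      peak_lr=1e-4, min_lr=3e-5):
    """Warmup then cosine decay from peak_lr to min_lr.

    Assumes the optimizer's base lr is peak_lr; LambdaLR multiplies the
    base lr by the factor returned for each step.
    """
    def factor(step):
        if step < warmup:
            return step / max(1, warmup)  # linear warmup to peak
        t = min(1.0, (step - warmup) / max(1, total - warmup))
        cos = 0.5 * (1.0 + math.cos(math.pi * t))
        return (min_lr + (peak_lr - min_lr) * cos) / peak_lr  # decay to floor
    return LambdaLR(optimizer, factor)
```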
…ments

- transformers 5.5.3 includes the fix for the Gemma 4 KV sharing bug (huggingface/transformers#45312) that caused all our training failures
- Also updated: datasets 4.8.4, sentence-transformers 5.4.0, wandb 0.25.1
- Removed unused: outlines, flash-linear-attention, causal-conv1d and their deps
- Updated comments to reference the upstream fix while keeping our TrainingCache workaround as a safety net

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stress test was using google/gemma-4-E2B (the base model) instead of the instruction-tuned -it variant the spokes were trained on. Also adds EXP-31 stress test results: 4/7 pass, 3 fail from JSON truncation (not hallucination), 0 hallucination failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test plan complete — all checks pass
PR ready for review.
Summary
gradient_checkpointing_enable() forces use_cache=False, breaking ISWA KV sharing layers (garbage output, PPL 2.7M). Built custom gradient checkpointing in SpokeWrappedLayer + TrainingCache wrapper as a workaround. Bug also fixed upstream in transformers 5.5.3 ([gemma4] Dissociate kv states sharing from the Cache, huggingface/transformers#45312).

Adds serve_gemma_spokes.py inference server and diagnose_gemma_spokes.py diagnostic tool; renamed train_qwen_spokes.py → train_spokes.py with multi-model support.

What's in this PR
45515b26a3c409e0fe2debc7c48cb1eaa8e950043e–65388cb

Test plan
characterize_serve_output.py

🤖 Generated with Claude Code