Status
Qwen3.5-4B GGUF loads successfully but inference produces garbage. The existing DeltaNet implementation handles ~80% of the forward pass, but several Qwen3.5-specific features are missing.
GGUF Inspection Results (2026-04-13)
Architecture: qwen35, 32 layers (24 DeltaNet + 8 full attention)
DeltaNet layers (0,1,2, 4,5,6, 8,9,10, ...):
blk.N.ssm_a F32 [32] — decay parameter
blk.N.ssm_alpha.weight Q8_0 [2560,32] — alpha projection
blk.N.ssm_beta.weight Q8_0 [2560,32] — beta projection
blk.N.ssm_conv1d.weight F32 [4,8192] — causal conv1d
blk.N.ssm_dt.bias F32 [32] — dt bias
blk.N.ssm_norm.weight F32 [128] — value norm
blk.N.ssm_out.weight Q5_K [4096,2560] — output projection
blk.N.attn_qkv.weight Q5_K [2560,8192] — fused QKV for conv input
blk.N.attn_gate.weight Q4_K [2560,4096] — attention gate
Full attention layers (3, 7, 11, 15, ...):
blk.N.attn_q.weight Q4_K [2560,8192] — Q + gate (doubled!)
blk.N.attn_k.weight Q4_K [2560,1024]
blk.N.attn_v.weight Q6_K [2560,1024]
blk.N.attn_output.weight Q4_K [4096,2560]
blk.N.attn_q_norm.weight F32 [256] — QK-norm per head
blk.N.attn_k_norm.weight F32 [256]
Metadata:
head_count = 16, head_count_kv = 4, key_length = 256
rope.freq_base = 10,000,000
rope.dimension_count = 64 (partial RoPE: 64/256 = 25%)
full_attention_interval = 4
ssm: v_heads=32, k_heads=16, key_dim=128, val_dim=128, conv=4
vocab = 248,320
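As a cross-check, the full-attention projection widths follow directly from this metadata (hidden size 2560 is inferred from the weight shapes above, not from a metadata key listed here); a minimal compile-time sanity check:

```cpp
// Sanity check: projection widths implied by the GGUF metadata.
constexpr int n_heads    = 16;   // head_count
constexpr int n_kv_heads = 4;    // head_count_kv
constexpr int head_dim   = 256;  // key_length
constexpr int rope_dims  = 64;   // rope.dimension_count

static_assert(2 * n_heads * head_dim == 8192, "attn_q.weight width: Q half + gate half");
static_assert(n_kv_heads * head_dim  == 1024, "attn_k / attn_v width");
static_assert(n_heads * head_dim     == 4096, "attn_output input width");
static_assert(rope_dims < head_dim,           "partial RoPE: only the first 64 dims are rotated");
```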
Missing/Broken Features
P0 — Must fix for inference
- Partial RoPE: `rope.dimension_count = 64` means only the first 64 of the 256 head dims get the RoPE rotation; we currently rotate all 256. See the partial-RoPE sketch after this list.
- attn_output_gate on full attention layers: the Q weight is [2560,8192] = [hidden, 2 × n_heads × head_dim]; the first half is Q, the second half is a gate. The existing `attn_output_gate` detection may not trigger on the GGUF path. See the gate-split sketch after this list.
- full_attention_interval: not read from GGUF metadata, so we cannot tell from config which layers are DeltaNet vs full attention; currently this relies on the presence of the `ssm_a` tensor. See the layer-classification sketch after this list.
- Post-attention norm: `post_attention_norm.weight` is present on ALL layers (DeltaNet and full attention) and must be applied to the attention/DeltaNet output, before the FFN. See the forward-ordering sketch after this list.
- ssm_out projection: DeltaNet layers have a separate `ssm_out.weight` [4096,2560] that projects the DeltaNet output back to the hidden dim; it is not the same tensor as `attn_output`.
- attn_gate on DeltaNet layers: `attn_gate.weight` [2560,4096] gates the DeltaNet output (sigmoid gate, like Gemma 4 PLE?); it is distinct from the Q-gate on the full attention layers.
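A minimal sketch of the partial-RoPE fix (first item above): rotate only the first `rope.dimension_count` = 64 dims of each 256-dim head and pass the remaining 192 through untouched. Interleaved pairing is shown here; the NeoX variant is sketched under P1. Function name and signature are placeholders, not the existing `quant.cpp` API.

```cpp
#include <math.h>

// Apply RoPE to only the first rope_dims of each head; dims [rope_dims, head_dim)
// are left untouched. Interleaved pairing (2i with 2i+1) shown here.
static void rope_partial(float *head, int head_dim, int rope_dims,
                         int pos, float freq_base /* 1e7 for this model */) {
    for (int i = 0; i < rope_dims; i += 2) {
        const float freq  = powf(freq_base, -(float)i / (float)rope_dims);
        const float theta = (float)pos * freq;
        const float c = cosf(theta), s = sinf(theta);
        const float x0 = head[i], x1 = head[i + 1];
        head[i]     = x0 * c - x1 * s;
        head[i + 1] = x0 * s + x1 * c;
    }
    (void)head_dim;  // dims rope_dims..head_dim-1 pass through unrotated
}
```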
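For the `attn_output_gate` item: a sketch of splitting the fused 8192-wide Q projection into its Q and gate halves, and applying the gate as a sigmoid over the attention output. The half/half split follows the inspection above; applying the gate immediately before `attn_output` is an assumption.

```cpp
#include <math.h>

// After x @ attn_q.weight produces qg[8192] per token:
//   qg[0    .. 4095] -> Q    (16 heads x 256 dims)
//   qg[4096 .. 8191] -> gate (same layout)
static void split_q_and_gate(const float *qg, float *q, float *gate, int q_dim /* 4096 */) {
    for (int i = 0; i < q_dim; ++i) {
        q[i]    = qg[i];
        gate[i] = qg[q_dim + i];
    }
}

// Sigmoid-gate the attention output (assumed to happen before the attn_output matmul).
static void apply_output_gate(float *attn_out /* 4096 */, const float *gate, int n) {
    for (int i = 0; i < n; ++i) {
        attn_out[i] *= 1.0f / (1.0f + expf(-gate[i]));
    }
}
```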
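For `full_attention_interval`: the observed layout (full attention at layers 3, 7, 11, 15, ...) is every 4th layer. A minimal classification sketch, assuming the interval value has already been read from GGUF metadata (exact key name not shown here), with the current `ssm_a`-presence check kept as a fallback:

```cpp
// Layers 3, 7, 11, ... are full attention when full_attn_interval == 4;
// everything else is DeltaNet. Falls back to the existing ssm_a-presence
// heuristic if the metadata key is absent (interval <= 0).
static bool is_full_attention_layer(int layer_idx, int full_attn_interval,
                                    bool has_ssm_a_tensor) {
    if (full_attn_interval > 0) {
        return (layer_idx + 1) % full_attn_interval == 0;
    }
    return !has_ssm_a_tensor;
}
```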
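The last three P0 items are all about ordering in the per-layer forward pass. A rough end-of-layer sketch for a DeltaNet layer under those assumptions; helper names, the weight layout assumed in `matvec`, and RMSNorm as the norm type are placeholders, not the `quant.cpp` API.

```cpp
#include <math.h>

// Hypothetical helpers for illustration only. matvec assumes W laid out as
// [in_dim][out_dim]; the actual GGUF layout may be transposed.
static void matvec(const float *W, const float *x, float *out, int in_dim, int out_dim) {
    for (int o = 0; o < out_dim; ++o) out[o] = 0.0f;
    for (int i = 0; i < in_dim; ++i)
        for (int o = 0; o < out_dim; ++o)
            out[o] += x[i] * W[i * out_dim + o];
}

static void rmsnorm(float *x, const float *w, int n) {
    float ss = 0.0f;
    for (int i = 0; i < n; ++i) ss += x[i] * x[i];
    const float inv = 1.0f / sqrtf(ss / n + 1e-6f);
    for (int i = 0; i < n; ++i) x[i] *= inv * w[i];
}

// DeltaNet core output -> sigmoid output gate -> ssm_out projection
// -> post_attention_norm -> residual add -> FFN (not shown).
static void deltanet_layer_tail(const float *x_in,          // normed layer input, [2560]
                                float       *dn,             // DeltaNet core output, [4096] = 32 x 128
                                const float *attn_gate_w,    // [2560, 4096]
                                const float *ssm_out_w,      // [4096, 2560]
                                const float *post_norm_w,    // [2560] (shape assumed)
                                float       *residual) {     // hidden stream, [2560]
    static float gate[4096], y[2560];

    matvec(attn_gate_w, x_in, gate, 2560, 4096);   // gate from the same input as attn_qkv (assumed)
    for (int i = 0; i < 4096; ++i)
        dn[i] *= 1.0f / (1.0f + expf(-gate[i]));    // DeltaNet output gate, distinct from the full-attn Q-gate

    matvec(ssm_out_w, dn, y, 4096, 2560);           // project to hidden via ssm_out, NOT attn_output
    rmsnorm(y, post_norm_w, 2560);                  // post_attention_norm: ALL layers, before the FFN

    for (int i = 0; i < 2560; ++i) residual[i] += y[i];
}
```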
P1 — Quality/performance
- DeltaNet QKV currently goes through a full dequant to FP32 (Q5_K → FP32), which is why we only see ~0.7 tok/s. Needs a Q4/Q8 fast path.
- 248K vocab → large lm_head. Similar to Qwen3 issue.
- NeoX RoPE vs interleaved: the full attention layers may need NeoX pairing, since head_dim = 256 ≠ hidden/n_heads. See the pairing sketch below.
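For the NeoX-vs-interleaved question: the two layouts differ only in which dims are paired for rotation. A sketch of the NeoX pairing over the 64 rotated dims, to contrast with the interleaved version under P0:

```cpp
#include <math.h>

// NeoX-style pairing: dim i rotates with dim i + rope_dims/2
// (vs. interleaved, where 2i rotates with 2i+1).
static void rope_partial_neox(float *head, int rope_dims, int pos, float freq_base) {
    const int half = rope_dims / 2;
    for (int i = 0; i < half; ++i) {
        const float freq  = powf(freq_base, -2.0f * (float)i / (float)rope_dims);
        const float theta = (float)pos * freq;
        const float c = cosf(theta), s = sinf(theta);
        const float x0 = head[i], x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
}
```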
Existing Infrastructure
`quant.cpp` already has:
- ✅ `deltanet_forward` with NEON-optimized recurrent update
- ✅ Causal conv1d + SiLU
- ✅ L2 normalization on Q, K
- ✅ DeltaNet state management (`conv_state`, `delta_state`)
- ✅ Hybrid layer detection (via `layer->delta_a_log`)
- ✅ `attn_output_gate` support (for Gemma 4)
- ✅ `post_attention_norm` support (for Gemma 3)
- ✅ QK-norm support
References