feat: Qwen3.5-4B full support — hybrid DeltaNet + partial RoPE #94

@unamedkr

Description

Status

Qwen3.5-4B GGUF loads successfully but inference produces garbage. The existing DeltaNet implementation handles ~80% of the forward pass, but several Qwen3.5-specific features are missing.

GGUF Inspection Results (2026-04-13)

Architecture: qwen35, 32 layers (24 DeltaNet + 8 full attention)

DeltaNet layers (0,1,2, 4,5,6, 8,9,10, ...):

blk.N.ssm_a              F32   [32]        — decay parameter
blk.N.ssm_alpha.weight   Q8_0  [2560,32]   — alpha projection
blk.N.ssm_beta.weight    Q8_0  [2560,32]   — beta projection
blk.N.ssm_conv1d.weight  F32   [4,8192]    — causal conv1d
blk.N.ssm_dt.bias        F32   [32]        — dt bias
blk.N.ssm_norm.weight    F32   [128]       — value norm
blk.N.ssm_out.weight     Q5_K  [4096,2560] — output projection
blk.N.attn_qkv.weight    Q5_K  [2560,8192] — fused QKV for conv input
blk.N.attn_gate.weight   Q4_K  [2560,4096] — attention gate
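
The shapes line up: 8192 = 16·128 (Q) + 16·128 (K) + 32·128 (V) for the fused QKV, and 4096 = 32·128 for the DeltaNet inner dim. Below is a minimal sketch of the output path these tensors imply (the function names are hypothetical, and the gate placement, epsilon, and per-head norm layout are assumptions read off the shapes, not confirmed against the reference implementation; see P0 items 5–6 below):

#include <cmath>
#include <cstddef>

// Hedged sketch of the DeltaNet output path implied by the tensors above.
// dn_out: recurrence output [4096] = 32 value heads * 128 dims (already
// produced by the existing deltanet_forward); x: layer input [2560].
// Weights are assumed row-major [out][in] after dequant.

static void matvec(const float *W, const float *x, float *y, int n_out, int n_in) {
    for (int o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (int i = 0; i < n_in; ++i) acc += W[(size_t) o * n_in + i] * x[i];
        y[o] = acc;
    }
}

void deltanet_output_path(const float *dn_out, const float *x,
                          const float *ssm_norm_w,   // [128], shared across heads
                          const float *attn_gate_w,  // [4096 x 2560]
                          const float *ssm_out_w,    // [2560 x 4096]
                          float *y /* [2560] */) {
    const int n_heads = 32, d_head = 128, d_inner = 4096, d_model = 2560;
    float v_norm[4096], gate[4096];

    // 1. RMS-norm each 128-dim value head with the shared ssm_norm weight.
    for (int h = 0; h < n_heads; ++h) {
        const float *v = dn_out + h * d_head;
        float ss = 0.0f;
        for (int i = 0; i < d_head; ++i) ss += v[i] * v[i];
        const float scale = 1.0f / std::sqrt(ss / d_head + 1e-6f);
        for (int i = 0; i < d_head; ++i)
            v_norm[h * d_head + i] = v[i] * scale * ssm_norm_w[i];
    }

    // 2. Sigmoid output gate computed from the layer input via attn_gate.
    matvec(attn_gate_w, x, gate, d_inner, d_model);
    for (int i = 0; i < d_inner; ++i)
        v_norm[i] *= 1.0f / (1.0f + std::exp(-gate[i]));

    // 3. Project back to hidden dim with ssm_out, NOT attn_output.
    matvec(ssm_out_w, v_norm, y, d_model, d_inner);
}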

Full attention layers (3, 7, 11, 15, ...):

blk.N.attn_q.weight      Q4_K  [2560,8192] — Q + gate (doubled!)
blk.N.attn_k.weight      Q4_K  [2560,1024]
blk.N.attn_v.weight      Q6_K  [2560,1024]
blk.N.attn_output.weight Q4_K  [4096,2560]
blk.N.attn_q_norm.weight F32   [256]       — QK-norm per head
blk.N.attn_k_norm.weight F32   [256]
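
These shapes are consistent with the metadata below: 1024 = 4 KV heads × 256 (GQA), and the doubled Q projection gives 8192 = 2 × 16 × 256. Encoded as compile-time checks:

// Shape sanity checks for the full attention tensors, using the metadata
// below (head_count = 16, head_count_kv = 4, key_length = 256):
static_assert(4 * 256 == 1024, "attn_k/attn_v out dim = n_kv_heads * head_dim");
static_assert(2 * 16 * 256 == 8192, "attn_q out dim = 2 * n_heads * head_dim (Q + gate)");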

Metadata:

head_count = 16, head_count_kv = 4, key_length = 256
rope.freq_base = 10,000,000
rope.dimension_count = 64 (partial RoPE: 64/256 = 25%)
full_attention_interval = 4
ssm: v_heads=32, k_heads=16, key_dim=128, val_dim=128, conv=4
vocab = 248,320
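
Assuming the interval counts the way the layer listing above suggests (full attention on layers 3, 7, 11, ...), layer typing could be driven by this metadata key instead of probing for tensors; a hypothetical sketch:

// Hypothetical metadata-driven layer typing; matches the observed pattern,
// where interval = 4 makes layers 3, 7, 11, 15, ... full attention.
static bool is_full_attention_layer(int il, int full_attention_interval) {
    return full_attention_interval > 0 && (il + 1) % full_attention_interval == 0;
}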

Missing/Broken Features

P0 — Must fix for inference

  1. Partial RoPE: rope.dimension_count=64 means only 64 of the 256 head dims get the RoPE rotation; we currently apply it to all dims. See the sketch after this list.
  2. attn_output_gate on full attention layers: the Q weight is [2560,8192] = [hidden, 2·n_heads·head_dim]. First half is Q, second half is the gate. attn_output_gate detection already exists but may not trigger on the GGUF path (also covered in the sketch after this list).
  3. full_attention_interval: not read from GGUF metadata yet. Need to derive which layers are DeltaNet vs. full attention from it (see the metadata-driven sketch above); currently we rely on the presence of the ssm_a tensor.
  4. Post-attention norm: post_attention_norm.weight present on ALL layers (DeltaNet + full attn). Needs to be applied after attention/DeltaNet output, before FFN.
  5. ssm_out projection: DeltaNet layers have a separate ssm_out.weight [4096,2560] that maps the DeltaNet output back to the hidden dim. Not the same as attn_output (see the output-path sketch above).
  6. attn_gate on DeltaNet layers: attn_gate.weight [2560,4096] — gates the DeltaNet output (sigmoid gate like Gemma 4 PLE?). Different from the Q-gate on full attention layers.
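
To make items 1 and 2 concrete, here is a hedged sketch of the full-attention Q path. The per-head [Q | gate] layout for the doubled projection is an assumption (a block split of the whole 8192-dim output is also possible) and needs checking against the reference model; NeoX-style pairing is likewise an assumption pending item 3 of P1. Function names are hypothetical; the dims and freq_base come from the metadata above.

#include <cmath>

constexpr int n_heads  = 16;
constexpr int head_dim = 256;  // key_length; decoupled from hidden/n_heads
constexpr int rot_dim  = 64;   // rope.dimension_count

// qg: doubled Q projection output [n_heads * 2 * head_dim] = 8192.
// ASSUMPTION: each head stores [Q | gate] contiguously.
void split_q_and_gate(const float *qg, float *q, float *gate) {
    for (int h = 0; h < n_heads; ++h) {
        const float *src = qg + h * 2 * head_dim;
        for (int i = 0; i < head_dim; ++i) {
            q[h * head_dim + i]    = src[i];             // first half: Q
            gate[h * head_dim + i] = src[head_dim + i];  // second half: gate
        }
    }
}

// Partial RoPE, NeoX-style pairing (i, i + rot_dim/2), one head at a time.
// freq_base = 10,000,000 per the metadata; dims >= rot_dim pass through.
void rope_partial_neox(float *head, int pos, float freq_base = 1e7f) {
    const int half = rot_dim / 2;
    for (int i = 0; i < half; ++i) {
        const float theta = (float) pos * std::pow(freq_base, -2.0f * i / rot_dim);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = head[i], x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
    // head[rot_dim .. head_dim) intentionally left unrotated
}

After attention, the gate half would be applied element-wise as out[i] *= sigmoid(gate[i]) before the attn_output projection, with post_attention_norm (item 4) then applied to the projected output before the FFN.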

P1 — Quality/performance

  1. DeltaNet QKV still goes through the FP32 dequant path (Q5_K→FP32), which holds decode to ~0.7 tok/s. Needs a Q4/Q8 fast path.
  2. 248K vocab → large lm_head. Similar to Qwen3 issue.
  3. NeoX RoPE vs. interleaved: full attention layers may need NeoX since head_dim=256 ≠ hidden/n_heads (2560/16 = 160). See the sketch below.
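
For item 3, the two layouts rotate the same angles over different dim pairings. A small sketch of the interleaved variant, for comparison with the NeoX one above (which layout Qwen3.5 actually expects still needs confirming):

#include <cmath>

// Interleaved pairing (2i, 2i+1), vs. NeoX's (i, i + rot_dim/2) in the sketch
// above. Same angles, different memory layout; using the wrong one produces
// exactly the garbage output described under Status.
void rope_partial_interleaved(float *head, int pos, int rot_dim, float freq_base) {
    for (int i = 0; i < rot_dim / 2; ++i) {
        const float theta = (float) pos * std::pow(freq_base, -2.0f * i / rot_dim);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = head[2 * i], x1 = head[2 * i + 1];
        head[2 * i]     = x0 * c - x1 * s;
        head[2 * i + 1] = x0 * s + x1 * c;
    }
}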

Existing Infrastructure

quant.cpp already has:

  • ✅ deltanet_forward with NEON-optimized recurrent update
  • ✅ Causal conv1d + SiLU
  • ✅ L2 normalization on Q, K
  • ✅ DeltaNet state management (conv_state, delta_state)
  • ✅ Hybrid layer detection (via layer->delta_a_log)
  • ✅ attn_output_gate support (for Gemma 4)
  • ✅ post_attention_norm support (for Gemma 3)
  • ✅ QK-norm support
