feat: Qwen3.5-4B full support — hybrid DeltaNet + partial RoPE #94

@unamedkr

Description

Status

Qwen3.5-4B GGUF loads successfully but inference produces garbage. The existing DeltaNet implementation handles ~80% of the forward pass, but several Qwen3.5-specific features are missing.

GGUF Inspection Results (2026-04-13)

Architecture: qwen35, 32 layers (24 DeltaNet + 8 full attention)

DeltaNet layers (0,1,2, 4,5,6, 8,9,10, ...):

blk.N.ssm_a              F32   [32]        — decay parameter
blk.N.ssm_alpha.weight   Q8_0  [2560,32]   — alpha projection
blk.N.ssm_beta.weight    Q8_0  [2560,32]   — beta projection
blk.N.ssm_conv1d.weight  F32   [4,8192]    — causal conv1d
blk.N.ssm_dt.bias        F32   [32]        — dt bias
blk.N.ssm_norm.weight    F32   [128]       — value norm
blk.N.ssm_out.weight     Q5_K  [4096,2560] — output projection
blk.N.attn_qkv.weight    Q5_K  [2560,8192] — fused QKV for conv input
blk.N.attn_gate.weight   Q4_K  [2560,4096] — attention gate
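
The shapes line up: 8192 = 16·128 (Q) + 16·128 (K) + 32·128 (V) for the fused QKV, and 4096 = 32·128 for the DeltaNet inner dim. Below is a minimal sketch of the output path these tensors imply (the function names are hypothetical, and the gate placement, epsilon, and per-head norm layout are assumptions read off the shapes, not confirmed against the reference implementation; see P0 items 5–6 below):

#include <cmath>
#include <cstddef>

// Hedged sketch of the DeltaNet output path implied by the tensors above.
// dn_out: recurrence output [4096] = 32 value heads * 128 dims (already
// produced by the existing deltanet_forward); x: layer input [2560].
// Weights are assumed row-major [out][in] after dequant.

static void matvec(const float *W, const float *x, float *y, int n_out, int n_in) {
    for (int o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (int i = 0; i < n_in; ++i) acc += W[(size_t) o * n_in + i] * x[i];
        y[o] = acc;
    }
}

void deltanet_output_path(const float *dn_out, const float *x,
                          const float *ssm_norm_w,   // [128], shared across heads
                          const float *attn_gate_w,  // [4096 x 2560]
                          const float *ssm_out_w,    // [2560 x 4096]
                          float *y /* [2560] */) {
    const int n_heads = 32, d_head = 128, d_inner = 4096, d_model = 2560;
    float v_norm[4096], gate[4096];

    // 1. RMS-norm each 128-dim value head with the shared ssm_norm weight.
    for (int h = 0; h < n_heads; ++h) {
        const float *v = dn_out + h * d_head;
        float ss = 0.0f;
        for (int i = 0; i < d_head; ++i) ss += v[i] * v[i];
        const float scale = 1.0f / std::sqrt(ss / d_head + 1e-6f);
        for (int i = 0; i < d_head; ++i)
            v_norm[h * d_head + i] = v[i] * scale * ssm_norm_w[i];
    }

    // 2. Sigmoid output gate computed from the layer input via attn_gate.
    matvec(attn_gate_w, x, gate, d_inner, d_model);
    for (int i = 0; i < d_inner; ++i)
        v_norm[i] *= 1.0f / (1.0f + std::exp(-gate[i]));

    // 3. Project back to hidden dim with ssm_out, NOT attn_output.
    matvec(ssm_out_w, v_norm, y, d_model, d_inner);
}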

Full attention layers (3, 7, 11, 15, ...):

blk.N.attn_q.weight      Q4_K  [2560,8192] — Q + gate (doubled!)
blk.N.attn_k.weight      Q4_K  [2560,1024]
blk.N.attn_v.weight      Q6_K  [2560,1024]
blk.N.attn_output.weight Q4_K  [4096,2560]
blk.N.attn_q_norm.weight F32   [256]       — QK-norm per head
blk.N.attn_k_norm.weight F32   [256]
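
These shapes are consistent with the metadata below: 1024 = 4 KV heads × 256 (GQA), and the doubled Q projection gives 8192 = 2 × 16 × 256. Encoded as compile-time checks:

// Shape sanity checks for the full attention tensors, using the metadata
// below (head_count = 16, head_count_kv = 4, key_length = 256):
static_assert(4 * 256 == 1024, "attn_k/attn_v out dim = n_kv_heads * head_dim");
static_assert(2 * 16 * 256 == 8192, "attn_q out dim = 2 * n_heads * head_dim (Q + gate)");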

Metadata:

head_count = 16, head_count_kv = 4, key_length = 256
rope.freq_base = 10,000,000
rope.dimension_count = 64 (partial RoPE: 64/256 = 25%)
full_attention_interval = 4
ssm: v_heads=32, k_heads=16, key_dim=128, val_dim=128, conv=4
vocab = 248,320
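
Assuming the interval counts the way the layer listing above suggests (full attention on layers 3, 7, 11, ...), layer typing could be driven by this metadata key instead of probing for tensors; a hypothetical sketch:

// Hypothetical metadata-driven layer typing; matches the observed pattern,
// where interval = 4 makes layers 3, 7, 11, 15, ... full attention.
static bool is_full_attention_layer(int il, int full_attention_interval) {
    return full_attention_interval > 0 && (il + 1) % full_attention_interval == 0;
}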

Missing/Broken Features

P0 — Must fix for inference

  1. Partial RoPE: rope.dimension_count=64 means only 64 of the 256 head dims get the RoPE rotation; we currently apply it to all dims. See the sketch after this list.
  2. attn_output_gate on full attention layers: the Q weight is [2560,8192] = [hidden, 2·n_heads·head_dim]. First half is Q, second half is the gate. attn_output_gate detection already exists but may not trigger on the GGUF path (also covered in the sketch after this list).
  3. full_attention_interval: not read from GGUF metadata yet. Need to derive which layers are DeltaNet vs. full attention from it (see the metadata-driven sketch above); currently we rely on the presence of the ssm_a tensor.
  4. Post-attention norm: post_attention_norm.weight present on ALL layers (DeltaNet + full attn). Needs to be applied after attention/DeltaNet output, before FFN.
  5. ssm_out projection: DeltaNet layers have a separate ssm_out.weight [4096,2560] that maps the DeltaNet output back to the hidden dim. Not the same as attn_output (see the output-path sketch above).
  6. attn_gate on DeltaNet layers: attn_gate.weight [2560,4096] — gates the DeltaNet output (sigmoid gate like Gemma 4 PLE?). Different from the Q-gate on full attention layers.
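
To make items 1 and 2 concrete, here is a hedged sketch of the full-attention Q path. The per-head [Q | gate] layout for the doubled projection is an assumption (a block split of the whole 8192-dim output is also possible) and needs checking against the reference model; NeoX-style pairing is likewise an assumption pending item 3 of P1. Function names are hypothetical; the dims and freq_base come from the metadata above.

#include <cmath>

constexpr int n_heads  = 16;
constexpr int head_dim = 256;  // key_length; decoupled from hidden/n_heads
constexpr int rot_dim  = 64;   // rope.dimension_count

// qg: doubled Q projection output [n_heads * 2 * head_dim] = 8192.
// ASSUMPTION: each head stores [Q | gate] contiguously.
void split_q_and_gate(const float *qg, float *q, float *gate) {
    for (int h = 0; h < n_heads; ++h) {
        const float *src = qg + h * 2 * head_dim;
        for (int i = 0; i < head_dim; ++i) {
            q[h * head_dim + i]    = src[i];             // first half: Q
            gate[h * head_dim + i] = src[head_dim + i];  // second half: gate
        }
    }
}

// Partial RoPE, NeoX-style pairing (i, i + rot_dim/2), one head at a time.
// freq_base = 10,000,000 per the metadata; dims >= rot_dim pass through.
void rope_partial_neox(float *head, int pos, float freq_base = 1e7f) {
    const int half = rot_dim / 2;
    for (int i = 0; i < half; ++i) {
        const float theta = (float) pos * std::pow(freq_base, -2.0f * i / rot_dim);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = head[i], x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
    // head[rot_dim .. head_dim) intentionally left unrotated
}

After attention, the gate half would be applied element-wise as out[i] *= sigmoid(gate[i]) before the attn_output projection, with post_attention_norm (item 4) then applied to the projected output before the FFN.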

P1 — Quality/performance

  1. DeltaNet QKV still goes through the FP32 dequant path (Q5_K→FP32), which holds decode to ~0.7 tok/s. Needs a Q4/Q8 fast path.
  2. 248K vocab → large lm_head. Similar to Qwen3 issue.
  3. NeoX RoPE vs. interleaved: full attention layers may need NeoX since head_dim=256 ≠ hidden/n_heads (2560/16 = 160). See the sketch below.
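
For item 3, the two layouts rotate the same angles over different dim pairings. A small sketch of the interleaved variant, for comparison with the NeoX one above (which layout Qwen3.5 actually expects still needs confirming):

#include <cmath>

// Interleaved pairing (2i, 2i+1), vs. NeoX's (i, i + rot_dim/2) in the sketch
// above. Same angles, different memory layout; using the wrong one produces
// exactly the garbage output described under Status.
void rope_partial_interleaved(float *head, int pos, int rot_dim, float freq_base) {
    for (int i = 0; i < rot_dim / 2; ++i) {
        const float theta = (float) pos * std::pow(freq_base, -2.0f * i / rot_dim);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = head[2 * i], x1 = head[2 * i + 1];
        head[2 * i]     = x0 * c - x1 * s;
        head[2 * i + 1] = x0 * s + x1 * c;
    }
}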

Existing Infrastructure

quant.cpp already has:

  • ✅ deltanet_forward with NEON-optimized recurrent update
  • ✅ Causal conv1d + SiLU
  • ✅ L2 normalization on Q, K
  • ✅ DeltaNet state management (conv_state, delta_state)
  • ✅ Hybrid layer detection (via layer->delta_a_log)
  • ✅ attn_output_gate support (for Gemma 4)
  • ✅ post_attention_norm support (for Gemma 3)
  • ✅ QK-norm support
