Name and Version
version: 7779 (6df686b)
Operating systems
Linux
GGML backends
CUDA
Hardware
4x 4090, 1x 3090
Models
GLM 4.7 Flash Q4_K_M
Problem description & steps to reproduce
This model uses a custom variant of DeepSeek's MLA, and the tensor sizes it passes to FLASH_ATTN_EXT are unsupported. For n_tokens = 2:
Q = [576, 2, 20, 1]
K = [576, 256, 1, 1]
First Bad Commit
No response
Relevant log output
common_debug_cb_eval: fattn-0 = (f32) FLASH_ATTN_EXT(CPU#Qcur-0 (view) (permuted)#0{576, 2, 20, 1}, CPU#cache_k_l0 (view) (permuted)#0{576, 256, 1, 1}}) = {512, 20, 2, 1}
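For reference, the shapes in the log line above can be decoded with a small script. This is only a sketch: it assumes the usual ggml layout for FLASH_ATTN_EXT (Q as [head_dim_k, n_tokens, n_head, batch], K as [head_dim_k, n_kv, n_head_kv, batch], result as [head_dim_v, n_head, n_tokens, batch]), and `decode_fattn_shapes` is a hypothetical helper, not part of llama.cpp.

```python
from typing import NamedTuple, Sequence


class FattnShapes(NamedTuple):
    head_dim_k: int  # per-head size of Q and K
    head_dim_v: int  # per-head size of V, inferred from the output
    n_tokens: int    # query tokens in this batch
    n_head: int      # query heads
    n_head_kv: int   # K/V heads
    n_kv: int        # cached positions attended over
    gqa_ratio: int   # query heads sharing each K/V head


def decode_fattn_shapes(q: Sequence[int], k: Sequence[int],
                        out: Sequence[int]) -> FattnShapes:
    """Decode FLASH_ATTN_EXT shapes (hypothetical helper, assumed ggml layout)."""
    if q[0] != k[0]:
        raise ValueError("Q and K head sizes differ")
    if out[1] != q[2] or out[2] != q[1]:
        raise ValueError("output shape does not match Q heads/tokens")
    if q[2] % k[2] != 0:
        raise ValueError("n_head is not a multiple of n_head_kv")
    return FattnShapes(
        head_dim_k=q[0],
        head_dim_v=out[0],
        n_tokens=q[1],
        n_head=q[2],
        n_head_kv=k[2],
        n_kv=k[1],
        gqa_ratio=q[2] // k[2],
    )


# Shapes copied from the fattn-0 log line for n_tokens = 2.
shapes = decode_fattn_shapes((576, 2, 20, 1), (576, 256, 1, 1), (512, 20, 2, 1))
print(shapes)
```

Under those assumptions, the failing case has a K head size of 576, a V head size of 512, and 20 query heads sharing a single K/V head.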