Eval bug: FLASH_ATTN_EXT GLM 4.7 Flash tensor schema not supported on CUDA

### Name and Version

version: 7779 (6df686bee)

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

4x RTX 4090, 1x RTX 3090

### Models

GLM 4.7 Flash Q4_K_M

### Problem description & steps to reproduce

This model uses a custom variant of DeepSeek's MLA, and the resulting tensor shapes passed to `FLASH_ATTN_EXT` are not supported on CUDA. For `n_tokens = 2`:

Q = [576, 2, 20, 1]
K = [576, 256, 1, 1]

### First Bad Commit

_No response_

### Relevant log output

common_debug_cb_eval:              __fattn__-0 = (f32) FLASH_ATTN_EXT(CPU#Qcur-0 (view) (permuted)#0{576, 2, 20, 1}, CPU#cache_k_l0 (view) (permuted)#0{576, 256, 1, 1}}) = {512, 20, 2, 1}
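
For readers less familiar with ggml's dimension ordering, here is a rough sketch of how the shapes in the log line above can be decoded. It assumes the usual `ggml_flash_attn_ext` layout (Q as `{head_dim, n_tokens, n_head, batch}`, K as `{head_dim, n_kv, n_head_kv, batch}`, result as `{head_dim_v, n_head, n_tokens, batch}`). The `FlashAttnShapes` class and `decode_shapes` helper are hypothetical names made up for illustration; they are not part of llama.cpp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashAttnShapes:
    """Hypothetical container for the parameters implied by the tensor shapes."""
    head_dim_qk: int  # per-head size of Q and K
    head_dim_v: int   # per-head size of V (and of the result)
    n_tokens: int     # tokens in the current batch
    n_head_q: int     # number of query heads
    n_head_kv: int    # number of K/V heads
    n_kv: int         # number of KV cache cells being attended to

    @property
    def gqa_ratio(self) -> int:
        # Query heads sharing each K/V head.
        return self.n_head_q // self.n_head_kv


def decode_shapes(q, k, out):
    """Decode ggml-ordered shapes (ne[0] first) into named parameters.

    Assumed layout, not taken from the report itself:
      q   = {head_dim_qk, n_tokens, n_head_q,  batch}
      k   = {head_dim_qk, n_kv,     n_head_kv, batch}
      out = {head_dim_v,  n_head_q, n_tokens,  batch}
    """
    if q[0] != k[0]:
        raise ValueError("Q and K must share the same head dimension")
    if out[1] != q[2] or out[2] != q[1]:
        raise ValueError("result shape does not match Q heads/tokens")
    if q[2] % k[2] != 0:
        raise ValueError("query heads must be a multiple of K/V heads")
    return FlashAttnShapes(
        head_dim_qk=q[0],
        head_dim_v=out[0],
        n_tokens=q[1],
        n_head_q=q[2],
        n_head_kv=k[2],
        n_kv=k[1],
    )


# Shapes copied from the log line in this report.
shapes = decode_shapes(q=(576, 2, 20, 1), k=(576, 256, 1, 1), out=(512, 20, 2, 1))
print(shapes, "gqa_ratio =", shapes.gqa_ratio)
```

Under that reading, the failing op has a Q/K head size of 576, a V head size of 512, 20 query heads sharing a single K/V head, and 256 KV cells for 2 tokens. Which of these parameters the CUDA kernels reject is not established by the log alone.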

Eval bug: FLASH_ATTN_EXT GLM 4.7 Flash tensor schema not supported on CUDA #18944
