
Eval bug: FLASH_ATTN_EXT GLM 4.7 Flash tensor schema not supported on CUDA #18944

@pwilkin

Description

Name and Version

version: 7779 (6df686b)

Operating systems

Linux

GGML backends

CUDA

Hardware

4x 4090, 1x 3090

Models

GLM 4.7 Flash Q4_K_M

Problem description & steps to reproduce

This model uses a custom variant of DeepSeek's MLA, and the resulting attention tensor shapes are not supported by the CUDA FLASH_ATTN_EXT kernel. For n_tokens = 2:

Q = [576, 2, 20, 1]
V = [576, 256, 1, 1]
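For readers unfamiliar with ggml's dimension ordering, here is a small hypothetical helper (names and the interpretation are mine, not llama.cpp code) that decodes the shapes above under the assumption that ggml lists dimensions fastest-first, i.e. ne = [head_dim, n_tokens/n_kv, n_heads, n_seq]:

```python
# Hypothetical sketch, not llama.cpp source: decode ggml-style attention
# shapes, assuming ne = [d0, d1, d2, d3] with d0 the fastest dimension.
def describe_attn_shapes(q_ne, k_ne):
    head_dim_q, n_tokens, n_heads_q, _n_seq = q_ne
    head_dim_k, n_kv, n_heads_k, _ = k_ne
    # Q and K must agree on the head dimension for QK^T to be defined.
    assert head_dim_q == head_dim_k, "Q and K must share a head dimension"
    # Fewer K heads than Q heads is fine if they broadcast (GQA/MLA style).
    assert n_heads_q % max(n_heads_k, 1) == 0, "heads must broadcast"
    return {
        "head_dim": head_dim_q,
        "n_tokens": n_tokens,
        "n_heads_q": n_heads_q,
        "n_kv": n_kv,
        "gqa_ratio": n_heads_q // max(n_heads_k, 1),
    }

info = describe_attn_shapes([576, 2, 20, 1], [576, 256, 1, 1])
print(info)  # head dim 576, 20 Q heads broadcasting over 1 shared K head
```

Under that reading, the op is 20 query heads of dimension 576 attending over a single shared 256-entry K cache view, and the {512, 20, 2, 1} output in the log implies a V head dimension of 512, the DeepSeek-MLA-style 576/512 split.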

First Bad Commit

No response

Relevant log output

common_debug_cb_eval: fattn-0 = (f32) FLASH_ATTN_EXT(CPU#Qcur-0 (view) (permuted)#0{576, 2, 20, 1}, CPU#cache_k_l0 (view) (permuted)#0{576, 256, 1, 1}}) = {512, 20, 2, 1}

Metadata

Assignees

No one assigned

Labels

CUDA (Related to the CUDA backend), bug (Something isn't working), performance (Speed related topics)
