Conversation
Just linking this old attempt at doing this: judging from @fairydreaming's CPU test and my CUDA test, the tile size was simply too large to be useful. @JohannesGaessler explained in #12227 (comment) why this likely failed for CUDA, and for the CPU I suspect the quadratic jump over the previous maximum tile size (256^2 --> 512^2) was so large that it no longer fits in cache.

One other very MLA-specific thing to think about: if the V-cache doesn't need transposing, the last 512 elements of each K-cache row hold the same values as the V-cache, so there would be no need to store these; a 2D view starting at element 64 with a row stride of 576 would yield the same data untransposed.
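A minimal sketch of the view trick described above, assuming the DeepSeek-style layout where each K-cache row is 576 elements (64 RoPE dims followed by a 512-dim latent) and the V-cache duplicates that latent part; the shapes and float32 storage here are illustrative assumptions, not the actual ggml buffers:

```python
import numpy as np

# Assumed MLA cache layout: each K row = 64 RoPE dims + 512 latent dims,
# and the V-cache stores the same 512 latent dims again.
n_tokens, n_rope, n_latent = 8, 64, 512

latent = np.random.rand(n_tokens, n_latent).astype(np.float32)
rope = np.random.rand(n_tokens, n_rope).astype(np.float32)

k_cache = np.concatenate([rope, latent], axis=1)  # rows of 576 elements
v_cache = latent                                  # duplicated data

# A 2D view into the flat K buffer, starting at element 64 with a row
# stride of 576, recovers the V values without storing them a second time.
flat = k_cache.reshape(-1)
v_view = np.lib.stride_tricks.as_strided(
    flat[n_rope:],
    shape=(n_tokens, n_latent),
    strides=(576 * 4, 4),  # float32: 4 bytes per element
)

assert np.array_equal(v_view, v_cache)
```

The same offset/stride arithmetic would apply to a `ggml` 2D view over the K buffer, which is why no transpose (and no separate V storage) would be needed.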
…-org#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models (ggml-ci)
* llama : minor naming updates (ggml-ci)
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes (ggml-ci)
Hi there, sorry to bother. I was testing DeepSeek V3 0324 on CPU + GPU, but when using FA I get this issue: … I'm running the model like this: … I did build from source with: … When not using -fa, it works correctly. Did I do the setup incorrectly? I raised an issue with more info here: #13252
cont #12801
For backends that support FA with different K and V head sizes, the FA path can be used. To support that, we decompress the FA result using the `v_mla` tensor.