ggml : add fallback to CPU for unsupported ops in scheduler #19884
angt wants to merge 1 commit into ggml-org:master from …
Conversation
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Not sure this is the proper fix.
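For context, the idea of the patch is roughly: when the scheduler assigns a backend to a graph node, verify that the backend actually supports the op, and fall back to the CPU backend otherwise. A minimal sketch of that idea (not the actual patch; `sched_select_backend` is a hypothetical helper — the real selection logic lives in `ggml-backend.cpp`):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Keep the preferred backend only if it can run this op; otherwise fall
// back to the CPU backend, which supports every op.
static ggml_backend_t sched_select_backend(ggml_backend_t preferred,
                                           ggml_backend_t cpu,
                                           const struct ggml_tensor * node) {
    if (ggml_backend_supports_op(preferred, node)) {
        return preferred;
    }
    return cpu;
}
```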
These logs indicate that the llama.cpp logic incorrectly decides to create AMX buffers for some of the weights, even though the computation is not supported (hence 0 AMX compute buffers). Rather, something in …
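To make the failure mode concrete: llama.cpp decides where a weight may live by asking the device whether it supports a dummy op built around that weight. A sketch of that kind of check, assuming a hypothetical helper and a Q4_0 weight (not the actual loader code):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Ask a device whether it can compute a MUL_MAT consuming the weight.
// Note the dummy activation has batch size 1 (ne[2] == ne[3] == 1), so a
// backend that rejects batched matmuls can still pass this check.
static bool weight_usable_on_dev(ggml_backend_dev_t dev, int64_t n_embd, int64_t n_out) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 8 * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // metadata only, no data allocation
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, n_embd, n_out);
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  n_embd, 1);
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);

    const bool ok = ggml_backend_dev_supports_op(dev, y);
    ggml_free(ctx);
    return ok;
}
```

If the backend only rejects the op once dims 2/3 exceed 1, a load-time probe like this passes, and the mismatch only surfaces at compute time.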
Looking into … Using …
It should be possible to infer these from the tensor and src shapes, I think?
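A sketch of what such an inference could look like for MUL_MAT, reading the batch dimensions directly off the node's sources (`mul_mat_is_batched` is a hypothetical helper):

```c
#include "ggml.h"

// For MUL_MAT, dims 2 and 3 of the sources are the batch/broadcast
// dimensions, so batching can be inferred from the shapes alone.
static bool mul_mat_is_batched(const struct ggml_tensor * op) {
    const struct ggml_tensor * src0 = op->src[0]; // weights
    const struct ggml_tensor * src1 = op->src[1]; // activations
    return src0->ne[2] > 1 || src0->ne[3] > 1 ||
           src1->ne[2] > 1 || src1->ne[3] > 1;
}
```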
So I think this earlier statement is not correct. It seems the compute buffer can remain 0-size while the backend is still used. At the moment, I am not sure what the proper fix is.
When it fails (parallel > 1) we have (custom logs to debug):

…

So we check AMX with this code, which is OK. We now have …
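For reference, the shape of the restriction being checked is roughly the following — a sketch of a `supports_op` limited to 2D matmuls, not the actual AMX backend code:

```c
#include "ggml.h"

// supports_op restricted to plain 2D mul_mat: any batch dimension > 1
// is rejected, which is the limitation discussed in this thread.
static bool amx_like_supports_op(const struct ggml_tensor * op) {
    if (op->op != GGML_OP_MUL_MAT) {
        return false;
    }
    const struct ggml_tensor * src0 = op->src[0];
    const struct ggml_tensor * src1 = op->src[1];
    return src0->ne[2] == 1 && src0->ne[3] == 1 &&
           src1->ne[2] == 1 && src1->ne[3] == 1;
}
```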
Got it. It's much simpler to extend the AMX ops so they don't impose a limit on dims 2 and 3.
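One way to lift that limit without writing new kernels is to decompose a batched mul_mat into a loop of 2D matmuls over the batch slices. A sketch under that assumption, with a hypothetical 2D kernel `compute_mul_mat_2d` standing in for the existing AMX kernel:

```c
#include "ggml.h"

// Hypothetical existing 2D kernel (stand-in for the real AMX mul_mat).
extern void compute_mul_mat_2d(const void * a, const void * b, void * d);

// Run a batched mul_mat as one 2D mul_mat per (i2, i3) slice.
static void mul_mat_batched(const struct ggml_tensor * src0,
                            const struct ggml_tensor * src1,
                            struct ggml_tensor * dst) {
    // ggml broadcast convention: src1 dims 2/3 are multiples of src0's
    const int64_t r2 = src1->ne[2] / src0->ne[2];
    const int64_t r3 = src1->ne[3] / src0->ne[3];

    for (int64_t i3 = 0; i3 < dst->ne[3]; i3++) {
        for (int64_t i2 = 0; i2 < dst->ne[2]; i2++) {
            const char * a = (const char *) src0->data + (i2 / r2) * src0->nb[2]
                                                       + (i3 / r3) * src0->nb[3];
            const char * b = (const char *) src1->data + i2 * src1->nb[2]
                                                       + i3 * src1->nb[3];
            char       * d = (char       *) dst->data  + i2 * dst->nb[2]
                                                       + i3 * dst->nb[3];
            compute_mul_mat_2d(a, b, d);
        }
    }
}
```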
I was worried, but it turns out all the code was already in place to allow …
Yes, this is the best solution. I had assumed it would be harder, and wanted to fix this issue. But the …
That was my initial thought; I was thinking of a max value of …

OK, so we should not have a backend that uses the …
For example, these usages of …:

llama.cpp/ggml/src/ggml-metal/ggml-metal-device.m, lines 1132 to 1151 at 9051663

Conditioning on … Technically, the …

Lines 203 to 207 at 9051663

It would also have to check support when …
True, but I still don't know how we could support a backend like AMX that restricts … So far, my options are:

…
I have a batched version for AMX; I'll push it tomorrow after more testing, so let's close this one.
I found this while fixing:

…

with this PR: #19867
I guess it's somehow a fallback for the fallback mechanism, and might not address the real root cause.
Anyway, with that, I have `llama-server` running:

…

and at the end:

…