vulkan: add mul_mat variant for embedded gpus #15800
rmatif wants to merge 4 commits into ggml-org:master from
Conversation
I got two failures with iq2_s and iq3_s. Does anyone know why these two in particular are failing?

```
[MUL_MAT] NMSE = 0.363474692 > 0.000500000 MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
[MUL_MAT] NMSE = 2.417624850 > 0.000500000 MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
```
I think there's a pre-existing bug in the dequant functions for these types. I ran into this once before when working on get_rows (I think?); the bug didn't seem to be happening in any code paths that were getting hit in practice, and it wasn't obvious, so I didn't pursue it.

Can you explain what it is about your matmul shader that makes it faster for mobile? At first glance it doesn't seem fundamentally different from what the scalar path is doing, except that you're dequantizing the matrix as a separate pass, and maybe some minor differences in tile size.

I'll leave some comments on the code in a little while.
The main difference I'd say is the very low register pressure and avoiding spilling. Mali provides only 64 registers per thread, so the entire design was built around that constraint. I tried fine-tuning the existing shaders some time ago but without success. I also believe that, due to the simplicity, even outdated drivers on low-end hardware should handle them more easily.
But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is you're using vec4s everywhere; I wonder if that's somehow related.
I know, but for some reason that wasn't enough (though I was testing on an older and less powerful device, so I should give it another try). On Adreno, vec4 is much faster, but on the latest Mali generation it shouldn't make much difference; according to Arm's documentation it should perform similarly to scalar, so I kept it in case another device requires it.
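For reference, here is a minimal sketch of the spec-constant tile sizing mentioned above, in the spirit of the existing ggml-vulkan shaders; the constant IDs, default values, and buffer names are illustrative, not the actual mul_mm interface:

```glsl
#version 450

// Tile sizes as specialization constants: the host can shrink these per
// device (e.g. for a 64-register-per-thread GPU) at pipeline creation time,
// without touching the shader source. IDs and defaults are illustrative.
layout(constant_id = 0) const uint BM = 64;
layout(constant_id = 1) const uint BN = 64;
layout(constant_id = 2) const uint BK = 16;

layout(local_size_x = 128) in;

// The shared-memory tiles scale with the spec constants, so a smaller tile
// also shrinks the per-thread working set that has to live in registers.
shared float bufA[BM * BK];
shared float bufB[BN * BK];

void main() {
    // body elided; only the sizing mechanism is shown here
    bufA[gl_LocalInvocationID.x] = 0.0;
    bufB[gl_LocalInvocationID.x] = 0.0;
}
```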
```glsl
for (uint t = 0; t < num_k_tiles; t++) {
    const uint k_tile_start = t * BK;

    #pragma unroll
```
[[unroll]] is preferred.
Isn't #pragma unroll better for old compilers?
We use [[unroll]] in a bunch of shaders and haven't had any problems.
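For anyone following along, the `[[unroll]]` spelling in a self-contained sketch (not code from this PR):

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

layout(local_size_x = 64) in;

layout(binding = 0) buffer Buf { float data[]; };

void main() {
    float sum = 0.0;
    float v = data[gl_GlobalInvocationID.x];
    // [[unroll]] comes from GL_EXT_control_flow_attributes and is emitted as
    // SPIR-V loop control, so every Vulkan driver sees the same hint;
    // #pragma unroll relies on the front-end compiler recognizing the pragma.
    [[unroll]] for (uint i = 0; i < 8; ++i) {
        sum = fma(v, float(i), sum);
    }
    data[gl_GlobalInvocationID.x] = sum;
}
```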
```cpp
    }
}

if (device->vendor_id == VK_VENDOR_ID_ARM) {
```
I would prefer if this used the same codepath as the main shader, since it's duplicating quite a bit. If the push constants are the same, is there a reason not to just add an ARM path to the shader selection function and leave the dequant to the existing logic?
If we go that route we'll need a better way of testing these shaders. As a start we can have an environment variable or compile flag to enable the embedded path, and while I don't see anything Arm-specific in the code, we should try it out on some desktop GPUs to see if it runs fine there.
Also, I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That's basically limiting everything to your memory speed, like inference. I don't think having fewer registers matters in this case, since there's not much difference in register count between writing the dequantized weights to regular memory versus shared memory. The mul mat shader can be treated as two sections, where one does the dequantization to shared memory and the other does the actual multiplication, and the register counts for each can be more or less independent. I'd be curious to see how much longer the dequantization takes compared to the multiplication if you run with the perf logger.

And as a side note, our dequant functions are basically triplicated across dequant_funcs.comp, mul_mm.comp, and dequant_q*.comp, and they really should be merged if possible.
I will try to take a look.
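(For illustration, a rough sketch of the two-section structure described above: dequantize a tile of A straight into shared memory, barrier, then multiply out of it, with no intermediate buffer in global memory. All names and the placeholder decode are assumptions, not this PR's code.)

```glsl
#version 450

layout(local_size_x = 128) in;

layout(constant_id = 0) const uint BM = 32;
layout(constant_id = 1) const uint BK = 8;

layout(push_constant) uniform PC { uint K; } p;

layout(binding = 0) writeonly buffer D { float d[]; };

shared float bufA[BM * BK];

// Placeholder for a per-quant-type decode (q4_0, q4_K, ...): the real shader
// would read the quantized block for (row, k) and return one element.
float dequantize_a(uint row, uint k) {
    return float(row + k);
}

void main() {
    const uint num_k_tiles = (p.K + BK - 1) / BK;
    float acc = 0.0;

    for (uint t = 0; t < num_k_tiles; ++t) {
        // Section 1: all threads cooperatively dequantize this k-tile of A
        // into shared memory. Register use here is independent of section 2.
        for (uint i = gl_LocalInvocationID.x; i < BM * BK; i += 128) {
            const uint k = t * BK + (i % BK);
            bufA[i] = (k < p.K) ? dequantize_a(i / BK, k) : 0.0;
        }
        barrier();

        // Section 2: multiply out of shared memory (a plain accumulation
        // stands in for the real tile multiplication).
        for (uint kk = 0; kk < BK; ++kk) {
            acc = fma(bufA[(gl_LocalInvocationID.x % BM) * BK + kk], 1.0, acc);
        }
        barrier();
    }

    d[gl_GlobalInvocationID.x] = acc;
}
```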
The perf logger does not show the dequant and matmul dispatches separately, you'd need a profiler for that.
I don't know if Arm offers that. I will reach out to some of their engineers and ask them.
I've done most of my work without profilers; just logically it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader works with smaller tiles already, then you don't need to implement that, luckily.
I spent quite a bit of time trying to do the dequant directly in shared memory, but it turned out to be too much work for me, so I dropped it and by "laziness" adopted this two-stage approach. That said, I do agree that the optimal solution is to do it directly.
I've unlocked a massive improvement that goes far beyond the fine-tuned existing shaders, which would justify the existence of this one. I've updated the performance numbers in the OP and will soon look into reports on older devices.

Turns out the compiler has some specific quirks. It generates much faster code from explicit manually unrolled mad, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability. I stopped here; I'm a bit short on time anyway.

@jeffbolznv If you have a chance, could you please review again? I believe I've addressed your comments, but let me know if I missed something.
I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.
I haven't touched the dequant part; it's still the same. I'm writing to an f16 buffer, and the f16xf32/f32xf32 path is now much faster, hence the speedup. As I mentioned, I started experimenting with dequant in shared memory, but it turned out to be too much work for a first step. We can leave that as a future plan.

If someone has a Raspberry Pi lying around, I'd be very curious to see the performance gains on that kind of device.
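(Since "explicit manually unrolled mad" is doing the heavy lifting above, here is what that flattening can look like for a BK = 8 inner product; purely illustrative, not the PR's actual code.)

```glsl
#version 450

layout(local_size_x = 64) in;

layout(binding = 0) readonly buffer A { float a[]; };
layout(binding = 1) readonly buffer B { float b[]; };
layout(binding = 2) writeonly buffer D { float d[]; };

void main() {
    const uint base = gl_GlobalInvocationID.x * 8;
    // One straight-line fma chain instead of a loop: nothing for the driver's
    // compiler to mis-unroll, and the register footprint stays explicit.
    float sum = 0.0;
    sum = fma(a[base + 0], b[base + 0], sum);
    sum = fma(a[base + 1], b[base + 1], sum);
    sum = fma(a[base + 2], b[base + 2], sum);
    sum = fma(a[base + 3], b[base + 3], sum);
    sum = fma(a[base + 4], b[base + 4], sum);
    sum = fma(a[base + 5], b[base + 5], sum);
    sum = fma(a[base + 6], b[base + 6], sum);
    sum = fma(a[base + 7], b[base + 7], sum);
    d[gl_GlobalInvocationID.x] = sum;
}
```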
```glsl
    }
}

const uint num_k_tiles = (p.K + BK - 1) / BK;
```
I think this is not robust enough and might be wrong for the Adreno case, but it passes the tests in test-backend-ops; it feels like it shouldn't.
Shouldn't be hard to add a case or two with odd K. I suggest having relatively small M,N to avoid the error being hidden.
I misquoted, I was thinking more about the Adreno case:

```
BM = 32, BK = 8 -> VEC_K = 2
WG_SIZE = 128
A_LOADS_PER_THREAD = (32 * 2) / 128 = 64 / 128 = 0
```

So theoretically it shouldn't be able to load matrix A regardless of the dimensions, but the tests are passing, so I'm a bit confused.
It's useful to be able to test it for correctness.
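(To make the two failure modes in this thread concrete, a small sketch with assumed names: the odd-K case needs a bounds check on the last partial tile, and the Adreno case shows how integer division can silently zero out a loads-per-thread loop bound.)

```glsl
#version 450

layout(local_size_x = 128) in;

layout(constant_id = 0) const uint BM = 32;
layout(constant_id = 1) const uint BK = 8;
layout(constant_id = 2) const uint VEC_K = 2; // unused here; part of the Adreno example

layout(push_constant) uniform PC { uint K; } p;

shared float bufA[BM * BK];

void main() {
    // Ceil division: the last tile is partial whenever K % BK != 0, so loads
    // from it must be clamped against p.K (and the remainder zero-padded).
    const uint num_k_tiles = (p.K + BK - 1) / BK;

    // The Adreno trap described above: (32 * 2) / 128 truncates to 0, so a
    // loop bounded by that A_LOADS_PER_THREAD would never execute. A loop
    // that strides by the workgroup size has no such failure mode:
    for (uint t = 0; t < num_k_tiles; ++t) {
        for (uint i = gl_LocalInvocationID.x; i < BM * BK; i += 128) {
            const uint k = t * BK + (i % BK);
            bufA[i] = (k < p.K) ? 1.0 : 0.0; // real code would load A here
        }
        barrier();
    }
}
```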
jeffbolznv left a comment
I'd still like to see this, at the least, using the same push constant and spec constant interface as the rest of the matmul shaders, and running through the same code paths in ggml_vk_mul_mat_q_f16. I think it would be nice if it could be folded into mul_mm, but I'm not sure we understand what specifically is causing the better perf.
```cpp
    return;
}

const std::vector<uint32_t> pc = { (uint32_t)M, (uint32_t)K, (uint32_t)K, (uint32_t)K, (uint32_t)(ggml_nelements(src0)) };
```
I think this path is not handling noncontiguous src0. Like @0cc4m said, it'll be better to let this run through the existing code paths rather than having this separate code path.
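(To illustrate the noncontiguous point: the pc vector above effectively hardcodes stride == K, since K is passed for every stride slot, which only holds for contiguous src0. A stride-aware layout, loosely modeled on the existing mul_mm push constants — the exact names here are assumptions — decouples the two.)

```glsl
#version 450

layout(local_size_x = 64) in;

// Illustrative push constants; the real mul_mm interface carries more fields
// (batch strides, broadcast ratios, ...).
layout(push_constant) uniform PC {
    uint M;
    uint K;
    uint stride_a; // row stride of A in elements; equals K only if contiguous
} p;

layout(binding = 0) readonly buffer A { float a[]; };
layout(binding = 1) readonly buffer B { float b[]; };
layout(binding = 2) writeonly buffer D { float d[]; };

void main() {
    const uint row = gl_GlobalInvocationID.x;
    if (row >= p.M) {
        return;
    }
    float sum = 0.0;
    for (uint k = 0; k < p.K; ++k) {
        // Indexing through the explicit stride (not row * p.K + k) is what
        // lets one shader handle permuted / noncontiguous src0.
        sum = fma(a[row * p.stride_a + k], b[k], sum);
    }
    d[row] = sum;
}
```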
I've made a PR to fix the failing dequant shaders.
At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉
I've tried reaching out to some Arm devs, so hopefully they'll take a look into it.

Speaking of compilers, the Adreno compilers crash when running the existing mul_mat shaders, although they work fine with this one, so I think we'll need this variant anyway (performance is better than OpenCL, but I suspect a bug so I won't jump too quickly).

Sorry, I don't have much time to address all the concerns right now. I'll try, when I have time, at least to implement an env var to run the shaders on the dGPU.

EDIT: Just got confirmation from an Arm dev: they don't have a public disassembler, only a profiler.
```cpp
const uint64_t ne12 = src1->ne[2];
const uint64_t ne13 = src1->ne[3];

if ((ctx->device->vendor_id == VK_VENDOR_ID_ARM || ctx->device->vendor_id == VK_VENDOR_ID_QUALCOMM) &&
```
It seems to have an impact also with Intel integrated GPUs in some cases:
before:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | Vulkan | 999 | pp512 | 60.12 ± 0.00 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | Vulkan | 999 | tg128 | 8.02 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | pp512 | 109.99 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 8.12 ± 0.00 |

after:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | Vulkan | 999 | pp512 | 86.34 ± 0.00 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | Vulkan | 999 | tg128 | 8.30 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | pp512 | 101.52 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 8.01 ± 0.00 |
I managed to compile it under Termux on a Mali-G52 MC2 with Vulkan 1.1 without errors. Before, I had complaints about the Vulkan version not being 1.3, or segfaults, so good work so far! Since I am relatively new to Vulkan coding, what would be the next steps to make it faster on my device? I am getting the same speed on Qwen3 4B as on CPU alone, offloading all 37 layers to the GPU, using no extra parameters.

EDIT:
Just tested this with an RK3588 machine using PanVK. Unfortunately it crashed with a simple "Hello" prompt:
@rmatif Any updates?
Sorry, I got super busy these days, but I’ll definitely come back to this at some point, just not sure when |
Ah, it's okay 😀 we all got our own lives to lead. I wish I was smart enough to continue your work; maybe someone else will pick it up and continue, you never know!

I've hurt my back and been bed-bound for a couple of years, and I can't use a laptop or PC since sitting and standing become painful after a few minutes, so I'm in bed trying to do what I normally would, like code and DevOps stuff, but only using a Pixel 8 Pro phone. So playing with on-device LLMs and watching them improve over the past 2 years has been great to watch! I love staying up to date with the Vulkan/Mali 720 speed improvements like this, as Snapdragon seems to get all the attention!

I was wondering about the status of the PR because I was curious about the TPS increase from this PR and PR #18493.

I'm curious, what's left to do to complete this PR? Was the only issue left the compiler's quirky behaviour?
Hello! I saw this PR is targeting embedded GPUs. I wanted to share some baseline benchmark data from an Adreno 840 (Snapdragon 8 Elite) using Gong-Mi/vkpeak. This fork includes

Device Info:
Key Metrics:
Full vkpeak Output

Is this kind of peak capability data useful for your embedded GPU optimization efforts? Just wanted to share in case it helps!
Quick note regarding the

These artifacts suggest that the Adreno 840 Vulkan driver (on the Snapdragon 8 Elite) is struggling with certain advanced shader features (likely cooperative matrix or specific extensions), despite claiming support. It would be advisable to avoid using these specific Vulkan extensions/paths on this hardware for now to ensure stability.

While the scalar and vector metrics are solid, the driver's handling of matrix operations seems immature and might lead to crashes or undefined behavior in a real production workload like
These are Tensor G4 (Pixel 9a) results:

```
device = Mali-G715
fp32-scalar  = 1436.36 GFLOPS
fp16-scalar  = 1419.75 GFLOPS
fp64-scalar  = 0.00 GFLOPS
int32-scalar = 207.53 GIOPS
int16-scalar = 207.61 GIOPS
int64-scalar = 49.36 GIOPS
int8-dotprod = 6300.69 GIOPS
bf16-dotprod = 0.00 GFLOPS
fp8-matrix   = 0.00 GFLOPS
copy-h2h     = 13.09 GBPS

---- gong-mi version ----
vkpeak 20240505
fp32-scalar  = 1425.40 GFLOPS
fp16-scalar  = 1417.65 GFLOPS
int32-scalar = 208.09 GIOPS
int16-scalar = 207.49 GIOPS
int64-scalar = 49.18 GIOPS
int8-scalar  = 207.11 GIOPS
copy-h2h     = 12.07 GBPS
```
Try enabling Android's zero-copy support; it can help reduce memory usage.
The build process properly utilized the Android-specific Vulkan extensions, including VK_ANDROID_external_memory_android_hardware_buffer, which enables zero-copy functionality on Android systems.
This PR adds a mul_mat variant designed for embedded GPUs, as the current shaders perform poorly there. Essentially I reused the approach from my OpenCL implementation #14535.
It has currently been tested only on a Mali GPU, but I believe it should suit others well too.
```
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat
```

Master:
PR:
I started messing with coopmat but didn't have enough time to include it in this PR. If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe FA in the future.