Metal: faster Q4_0 and Q4_1 matrix x vector kernels #2212
Conversation
Results on M1 Pro for
Btw, I don't see the suggested changes in #2188 - did you submit the changes for review, or are they still "pending"? Here is how the PR looks on my end:
@ggerganov This is interesting. Below is what I see in my browser. I now see it has a "Pending" label on the comments. I did start by using "Add a single comment", but it then somehow got converted to a Review that I did not submit?
Yup, most likely that's the case. I usually get confused the other way around - I click "Add single comment" when I actually intend to do "Start review". But either way - not a great UX :)
I just realized something and I am not sure how we haven't spotted this yet: In ...
[encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
[encoder setBuffer:id_src1 offset:offs_src1 atIndex:1];
[encoder setBuffer:id_dst offset:offs_dst atIndex:2];
[encoder setBytes:&ne00 length:sizeof(ne00) atIndex:3];
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
[encoder setBytes:&nb00 length:sizeof(nb00) atIndex:5];
[encoder setBytes:&nb01 length:sizeof(nb01) atIndex:6];
[encoder setBytes:&nb02 length:sizeof(nb02) atIndex:7];
[encoder setBytes:&ne10 length:sizeof(ne10) atIndex:8];
[encoder setBytes:&ne11 length:sizeof(ne11) atIndex:9];
[encoder setBytes:&nb10 length:sizeof(nb10) atIndex:10];
[encoder setBytes:&nb11 length:sizeof(nb11) atIndex:11];
[encoder setBytes:&nb12 length:sizeof(nb12) atIndex:12];
[encoder setBytes:&ne0 length:sizeof(ne0) atIndex:13];
[encoder setBytes:&ne1 length:sizeof(ne1) atIndex:14];
if (src0t == GGML_TYPE_Q4_0) {
[encoder dispatchThreadgroups:MTLSizeMake(ne01 / 8+((ne01 % 8) & 0x01), ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
}
else if (src0t == GGML_TYPE_Q4_1) {
... etc

These are the arguments we pass to the Metal shaders / kernels, and they should match the definitions in the kernel source. However, the number of arguments in the kernels does not match, and neither does their order:

kernel void kernel_mul_mat_q4_0_f32(
device const void * src0,
device const float * src1,
device float * dst,
constant int64_t & ne00,
constant int64_t & ne10,
constant int64_t & ne0,
constant int64_t & ne01[[buffer(4)]],
uint2 tgpig[[threadgroup_position_in_grid]],
uint tiisg[[thread_index_in_simdgroup]],
uint sgitg[[simdgroup_index_in_threadgroup]]) {
const int nb = ne00/QK4_0;
... etc

This would bind the host-side arguments to the wrong kernel parameters. Is my understanding wrong, or have we had a problem for quite some time that we somehow haven't noticed?
@ikawrakow I saw the PR the same as ggerganov, so I thought that by "suggested changes" you referred to removing the old Q4_0 kernel. Bad UX! @ggerganov I did notice that the arguments in the kernels don't match, but I was not sure if that was intended...
This was me throwing out unused arguments to the kernels (at some point I noticed that passing fewer arguments to the kernels slightly improved performance). But it looks like the corresponding change to the host code that sets the arguments did not make it in.
M1 Max 32c
That's a 7% improvement.
@lshzh-ww Sorry about the confusion. I fell victim to the brilliant UX. I thought I had posted the comments as they appeared in my browser, but I had not. It must have looked really dumb, me talking about additional possible speedups as per my comments in the tables I posted in your PR, while nobody but me could see those suggestions.
* 3-5% faster Q4_0 on Metal * 7-25% faster Q4_1 on Metal * Oops, forgot to delete the original Q4_1 kernel --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>


This PR continues along the lines of #2188 and adds two improvements:
* A 3-5% faster kernel for Q4_0.
* A new kernel for Q4_1 that leads to a 7-25% speedup compared to master on M2 Max with a 30-core GPU (see table).

Token generation time in milliseconds per token on M2 Max with 30-core GPU using
Some interesting observations:
* The extra bounds check if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01) vs the original if (tiisg == 0), needed to avoid having two separate kernels, leads to a measurable performance degradation.
* For Q4_0 at 7B I measure 19.9 ms/token without prefetch vs 19.8 ms/token with prefetch; at 65B it is 149 ms/token versus 147 ms/token without/with prefetch. Hence, most of the gain in speed comes from letting each thread in a SIMD group process a full block.