Metal: faster Q4_0 and Q4_1 matrix x vector kernels #2212
Conversation
Results on M1 Pro for
Btw, I don't see the suggested changes in #2188 - did you submit the changes for review, or are they still "pending"? Here is how the PR looks on my end:
@ggerganov This is interesting. Below is what I see in my browser. I now see it has a "Pending" label on the comments. I did start by using "Add a single comment", but it then somehow got converted to a Review that I did not submit?
Yup, most likely that's the case. I usually get confused the other way around - I click "Add single comment" when I actually intend to do "Start review". But either way - not a great UX :)
I just realized something and I am not sure how we haven't spotted this yet: In ...
[encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
[encoder setBuffer:id_src1 offset:offs_src1 atIndex:1];
[encoder setBuffer:id_dst offset:offs_dst atIndex:2];
[encoder setBytes:&ne00 length:sizeof(ne00) atIndex:3];
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
[encoder setBytes:&nb00 length:sizeof(nb00) atIndex:5];
[encoder setBytes:&nb01 length:sizeof(nb01) atIndex:6];
[encoder setBytes:&nb02 length:sizeof(nb02) atIndex:7];
[encoder setBytes:&ne10 length:sizeof(ne10) atIndex:8];
[encoder setBytes:&ne11 length:sizeof(ne11) atIndex:9];
[encoder setBytes:&nb10 length:sizeof(nb10) atIndex:10];
[encoder setBytes:&nb11 length:sizeof(nb11) atIndex:11];
[encoder setBytes:&nb12 length:sizeof(nb12) atIndex:12];
[encoder setBytes:&ne0 length:sizeof(ne0) atIndex:13];
[encoder setBytes:&ne1 length:sizeof(ne1) atIndex:14];
if (src0t == GGML_TYPE_Q4_0) {
[encoder dispatchThreadgroups:MTLSizeMake(ne01 / 8+((ne01 % 8) & 0x01), ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
}
else if (src0t == GGML_TYPE_Q4_1) {
... etc

These are the arguments we pass to the Metal shaders / kernels, and they should match the definitions in the kernel source. However, the number of arguments in the kernels does not match, and neither does their order:

kernel void kernel_mul_mat_q4_0_f32(
device const void * src0,
device const float * src1,
device float * dst,
constant int64_t & ne00,
constant int64_t & ne10,
constant int64_t & ne0,
constant int64_t & ne01[[buffer(4)]],
uint2 tgpig[[threadgroup_position_in_grid]],
uint tiisg[[thread_index_in_simdgroup]],
uint sgitg[[simdgroup_index_in_threadgroup]]) {
const int nb = ne00/QK4_0;
... etc

This would bind the host-side arguments to the wrong kernel parameters. Is my understanding wrong, or have we had a problem for quite some time that we somehow haven't noticed?
@ikawrakow I saw the PR the same as ggerganov, so I thought that by "suggested changes" you referred to removing the old Q4_0 kernel. Bad UX! @ggerganov I did notice that the arguments in the kernels don't match, but I was not sure if that was intended...
This was me throwing out unused arguments to the kernels (at some point I noticed that passing fewer arguments to the kernels slightly improved performance). But it looks like the corresponding change to the host code that sets the arguments did not make it in.
M1 Max 32c
That's a 7% improvement.
@lshzh-ww Sorry about the confusion. I fell victim to the brilliant UX. I thought I had posted the comments as they appeared in my browser, but I had not. It must have looked really dumb, me talking about additional possible speedups as per my comments in the tables I posted in your PR, while nobody but me could see those suggestions.
* 3-5% faster Q4_0 on Metal * 7-25% faster Q4_1 on Metal * Oops, forgot to delete the original Q4_1 kernel --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>


This PR continues along the lines of #2188 and adds two improvements:
* A 3-5% faster kernel for Q4_0.
* A new kernel for Q4_1 that leads to a 7-25% speedup compared to master on M2 Max with a 30-core GPU (see table).

Token generation time in milliseconds per token on M2 Max with 30-core GPU using
Some interesting observations:
* The extra bounds check if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01) vs the original if (tiisg == 0), needed to avoid having two separate kernels, leads to a measurable performance degradation.
* For Q4_0 at 7B I measure 19.9 ms/token without prefetch vs 19.8 ms/token with prefetch; at 65B it is 149 ms/token versus 147 ms/token without/with prefetch. Hence, most of the gain in speed comes from letting each thread in a SIMD group process a full block.