metal: new q4_0 mat-vec mul kernel #2188
Conversation
Please feel free to share results on M1/M2 Pro/Max/Ultra!
Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:
The above is for an empty prompt. If I use |
Looks promising! From what I've read before, there is still a lot of room for Metal code improvements, so this could be very welcome. I will try to test it out on my M1 Max machine as well.
Nice! I updated the measurements with the new prompt. It looks like this PR brings the M1 series up to the same speed as the M2 series. I guess the M2 series has better memory access prediction, so its speed is already good without this PR. Nonetheless, for the 65B model I guess we can see a larger performance improvement for the M2 series as well.
Would be nice to see results on M1 Ultra or M2 Ultra chips. Their two-die design may also suffer from cache misses.
Force-pushed from 0e3eeb6 to 38ec9a2 (Compare)
Here's my benchmark: MBP 14, Apple M2 Max, 32GB, 12C CPU, 30C GPU
Logs
M1 Pro 32GB
Another data point: M2 Max with 30-core GPU, 65B model,
Is it somehow possible to remove this restriction without affecting the performance?
If so, we can simply delete the old pipeline_mul_mat_q4_0_f32 kernel
This is for models like WizardLM. In these models, the last mat-vec multiplication in inference has a row count of 32001, while our new kernel consumes 8 rows at a time. Let me test later whether adding the capability to handle <8 rows in the kernel makes it run slower or faster.
Removed the old kernel. There might be a minor performance gain or loss depending on model size and n_vocab. Tested on llama-7B and wizardlm-30B; the generation results are the same.
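(Note for readers following along: the restriction came from the new kernel assigning a fixed block of 8 rows to each threadgroup. Below is a much-simplified, hypothetical sketch of the row guard being discussed, using plain float weights instead of q4_0 blocks; it is not the actual ggml-metal kernel.)

```
#include <metal_stdlib>
using namespace metal;

// Simplified sketch (hypothetical, not the actual ggml-metal kernel): each
// threadgroup handles N_ROWS consecutive rows, so when ne01 (e.g. n_vocab =
// 32001) is not a multiple of N_ROWS, the last threadgroup has to skip the
// rows that fall past the end of the matrix.
#define N_ROWS 8

kernel void mul_mat_vec_rows_sketch(
        device const float   * src0 [[buffer(0)]],  // weight matrix, ne01 x ne00
        device const float   * src1 [[buffer(1)]],  // input vector, length ne00
        device       float   * dst  [[buffer(2)]],  // output vector, length ne01
        constant     int64_t & ne00 [[buffer(3)]],
        constant     int64_t & ne01 [[buffer(4)]],
        uint tgpig [[threadgroup_position_in_grid]]) {
    const int64_t first_row = (int64_t) tgpig * N_ROWS;
    for (int i = 0; i < N_ROWS; ++i) {
        const int64_t row = first_row + i;
        if (row >= ne01) {
            return; // guard for a partial last block of rows
        }
        float sum = 0.0f;
        for (int64_t k = 0; k < ne00; ++k) {
            sum += src0[row*ne00 + k] * src1[k];
        }
        dst[row] = sum;
    }
}
```

Whether a guard like this costs anything measurable is exactly what the comment above was testing.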
And here is a full list of all Meta LLaMA models on a 30-core M2 Max with
MacBook Pro, M2 Pro, 32GB, Ventura 13.4.1
minor issue
With M2 Mac Studio Ultra, 128GB RAM
Force-pushed from 38ec9a2 to fa0e4cf (Compare)
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
Force-pushed from fa0e4cf to 5150582 (Compare)
@ggerganov How do I properly run the benchmarks? I would love to contribute!
constant int64_t & ne10,
constant int64_t & ne0,
threadgroup float * sum [[threadgroup(0)]],
constant int64_t & ne01 [[buffer(4)]],
What is the function of [[buffer(4)]] here?
This gets the value from the buffer at index 4, corresponding to the following line in ggml-metal.m:
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
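(Side note, shown with a hypothetical, simplified signature rather than the verbatim kernel: the index inside [[buffer(n)]] on a kernel argument has to match the atIndex:n used with setBuffer:/setBytes: on the compute encoder, which is what the line above does for ne01 at index 4.)

```
// Sketch only: [[buffer(n)]] in the kernel pairs with atIndex:n on the encoder.
#include <metal_stdlib>
using namespace metal;

kernel void kernel_mul_mat_q4_0_f32_sketch(
        device const void    * src0 [[buffer(0)]],  // setBuffer:... atIndex:0
        device const float   * src1 [[buffer(1)]],  // setBuffer:... atIndex:1
        device       float   * dst  [[buffer(2)]],  // setBuffer:... atIndex:2
        constant     int64_t & ne00 [[buffer(3)]],  // setBytes:...  atIndex:3
        constant     int64_t & ne01 [[buffer(4)]],  // setBytes:...  atIndex:4
        uint tgpig [[threadgroup_position_in_grid]]) {
    // body omitted for brevity
}
```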
@gauravpathak-infa I also have an M2 Ultra; I am getting less than half of your numbers on these models. Could you share your command line, please?
@amj Here are the commands: If you are getting half the numbers, the problem may be somewhere else. It may be useful to check if you have compiled llama.cpp with Metal. If not:
Yep, that did it; LLAMA_METAL had been dropped during some clean/build step. I'm getting the same numbers you were showing. Thanks!
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
Prefetch data to achieve better memory bandwidth utilization. With the new kernel, token generation is ~48% faster for the 33B model and ~14% faster for the 7B model. Tests for the 65B model are welcome.
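(To illustrate the general idea, here is a hypothetical sketch of a prefetch pattern in Metal; it is not the kernel from this PR. The point is that the threads of a threadgroup stage a tile of the input vector in threadgroup memory up front, so the device-memory loads are issued early and each loaded value is reused by the whole group.)

```
#include <metal_stdlib>
using namespace metal;

// Generic prefetch pattern (hypothetical, not this PR's kernel): cooperatively
// stage a tile of src1 in threadgroup memory, then let every thread reuse it.
// The host must reserve the shared storage with setThreadgroupMemoryLength:
// for threadgroup index 0.
#define TILE 256

kernel void prefetch_tile_sketch(
        device   const float   * src1 [[buffer(0)]],      // input vector
        device         float   * dst  [[buffer(1)]],      // per-thread partial sums (>= ntg entries)
        constant       int64_t & ne00 [[buffer(2)]],      // vector length
        threadgroup    float   * tile [[threadgroup(0)]], // TILE floats of shared storage
        uint tpitg [[thread_position_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]]) {
    float acc = 0.0f;
    for (int64_t base = 0; base < ne00; base += TILE) {
        // 1) prefetch: each thread loads a strided slice of the tile
        for (int64_t k = tpitg; k < TILE && base + k < ne00; k += ntg) {
            tile[k] = src1[base + k];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
        // 2) compute: every thread reads the shared tile (the real kernel would
        //    do its q4_0 dot products here instead of this placeholder sum)
        for (int64_t k = 0; k < TILE && base + k < ne00; ++k) {
            acc += tile[k];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    dst[tpitg] = acc;
}
```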
The GPUs of the M1 Max / M2 Max can sustain 340 GB/s; can we reach that in llama.cpp? 😳
* GPU read bandwidth measured using the Developer Tools that come with Xcode.

** Measured with the following command on an M1 Max 32GB. Generation results are the same between this PR and the master branch.
./main -m model_file -n 128 -c 512 -s 12 -ngl 1 --no-mmap
Prompt 1: “”
Prompt 2: “I believe the meaning of life is”
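As a rough sanity check on what 340 GB/s would buy, here is a back-of-the-envelope bound. It assumes every generated token has to read all of the q4_0 weights once (roughly 4.5 bits, i.e. ~0.56 bytes, per weight) and ignores the KV cache, activations, and everything outside the large mat-vec multiplications:

33B: ~33e9 weights × 0.5625 bytes ≈ 18.6 GB read per token → 340 / 18.6 ≈ 18 tokens/s upper bound
7B: ~7e9 weights × 0.5625 bytes ≈ 3.9 GB read per token → 340 / 3.9 ≈ 87 tokens/s upper bound

Measured generation speed should land somewhat below these bounds.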