Another speed gain for Q4_0 and Q4_1 on Metal #2375
Conversation
Test on M1 Max 32c
The performance improvement is solid. I am ashamed that I didn't test intensively last time; we would have lost quite a bit of performance if it weren't for you.
@lshzh-ww That doesn't make sense. Your contributions in #2188 and #2248 were absolutely essential for improving Metal performance. Before you came along with #2188 we were basically stuck at 21.5 ms/t on M2 Max for the 7B model. This kind of collaborative progress is the whole point of open source!
// Note: This is a template, but strictly speaking it only applies to
//       quantizations where the block size is 32. It also does not
//       guard against the number of rows not being divisible by
//       N_DST, so this is another explicit assumption of the implementation.
I think this template also works on matrices whose number of rows is not divisible by N_DST. At the very end of the template we have `if (tiisg == 0 && first_row + row < ne01)`, which guarantees that we don't write into another tensor's region when fewer than N_DST rows are left.
I checked by using a WizardLM model, which has 32001 rows. This template generated exactly the same results as llama.cpp before #2188 was merged.
Well, yes, there is a guard to not write outside of the result bounds. But there is no guard to not read outside the bounds of the quantized tensor. We don't get a segfault because we only step out slightly and the address being accessed is within the process address space, but there is no guarantee that this will always be the case.
I see. Thank you for answering my question.
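The distinction discussed above (guarded writes, unguarded reads) can be sketched on the CPU in plain C++. This is a hypothetical simplification, not the actual kernel: `N_DST`, `first_row`, and `ne01` mirror the kernel's names, while `process_rows`, `row_len`, and the row-sum body are invented for illustration.

```cpp
#include <vector>

// Hypothetical CPU sketch of the guard pattern discussed above.
constexpr int N_DST = 4; // rows handled per SIMD-group iteration

// Sum each of the N_DST rows starting at first_row. `src` holds ne01 logical
// rows of `row_len` floats; in this sketch it is padded so the unguarded
// reads stay inside the buffer.
void process_rows(const std::vector<float>& src, std::vector<float>& dst,
                  int first_row, int ne01, int row_len) {
    for (int row = 0; row < N_DST; ++row) {
        float sum = 0.0f;
        // The read is NOT guarded: when first_row + row >= ne01 this walks
        // past the logical end of the tensor (into padding here; into
        // whatever memory happens to follow in the real kernel).
        for (int i = 0; i < row_len; ++i) {
            sum += src[(first_row + row) * row_len + i];
        }
        // The write IS guarded, so the result tensor stays correct even
        // when ne01 is not divisible by N_DST.
        if (first_row + row < ne01) {
            dst[first_row + row] = sum;
        }
    }
}
```

With ne01 = 5 and N_DST = 4, the second group covers rows 4..7: rows 5..7 are read (harmlessly, thanks to the padding) but never written, matching the behavior described above.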
ggerganov left a comment
M1 Pro
| Quantization | ms/t Master | ms/t PR |
|---|---|---|
| Q4_0 - 7B | 30.2 | 28.7 |
| Q4_1 - 7B | 32.9 | 31.2 |
| Q4_0 - 13B | 53.5 | 50.7 |
| Q4_1 - 13B | 58.1 | 55.6 |
@ikawrakow and @lshzh-ww I assume your tables show
Yes.
* Another speed gain for Q4_0 and Q4_1 on Metal
* Have N_DST, etc., be template parameters

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
In #2279 there was some debate on whether it is better to have each thread in a Metal SIMD group process whole `Q4_0` or `Q4_1` blocks, or have them process half blocks. On my M2 Max with a 30-core GPU, having threads process half a block is definitely faster, see the table below. I'm curious to see whether this change also improves (or worsens) performance on M1 Pro/Max.
TG-128 in ms/token on 30-core M2 Max. The command used was