CUDA: faster k-quant mul_mat_q kernels#2525
Conversation
You're on fire lately, Johannes! Quick question: your quoted 1500 t/s on a 3090, etc. with cuBLAS are with which quantization, for fair comparison? On another note... As you mention, there seems to be an optimization targeting difference here: prompt processing vs. token generation. Something missing from the simple t/s metric with all these PRs is the impact on prompt processing. I imagine a lot of testing involved a small prompt, so t/s generation is the only important metric. I think prompt processing time (s/t) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.
Doesn't matter, it's essentially the same speed for each quantization type since the entire matrix is only dequantized once and then the computations are done entirely using 32 bit floating point arithmetic.
3090 Ti / WSL2
7b cuBLAS is ~1460 t/s. Btw there is quite a bit of noise between measurements. This is obtained with
```cpp
    return dm4f.x*sumf_d - dm4f.y*sumf_m;
#else
    return 0.0f; // only to satisfy the compiler
```
Unrelated, but maybe an assert(false) in here would be good to make sure that these functions aren't used with incompatible hardware.
@Loufe I forgot to say: for most kernels the distinction does not matter because they are a) I/O bound anyways and b) only take up a very small percentage of the total runtime.
This PR adds faster mul_mat_q kernels for k-quants. The new kernels are optimized for compute (the prompt processing bottleneck) rather than memory bandwidth (the token generation bottleneck). The approach is essentially the same as in #2483: change the order in which the data is iterated to reduce the number of operations, and move as much computation as possible to the data loading, which is executed only once per 32 computations. Unfortunately the latter didn't quite work out for assembling q3_K upon loading due to shared memory limits. This is the current performance:
For reference, the speed of cuBLAS is ~1500 t/s on an RTX 3090 and ~500 t/s on a P40.