
CUDA: faster k-quant mul_mat_q kernels#2525

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-faster-mmq-4 on Aug 5, 2023

Conversation

@JohannesGaessler
Contributor

This PR adds faster mul_mat_q kernels for k-quants. The new kernels are optimized for compute (prompt processing bottleneck) rather than memory bandwidth (token generation bottleneck). The approach is essentially the same as in #2483 : change the order in which data is being iterated to reduce the number of operations, and move as much computation as possible to the data loading which is executed only once per 32 computations. Unfortunately the latter didn't quite work out for assembling q3_K upon loading due to shared memory limits. This is the current performance:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q2_k | pp | 746 | 1445 | 1.94 |
| RTX 3090 | 7b q3_k_s | pp | 579 | 937 | 1.62 |
| RTX 3090 | 7b q4_k_s | pp | 960 | 1696 | 1.77 |
| RTX 3090 | 7b q5_k_s | pp | 573 | 1453 | 2.54 |
| RTX 3090 | 7b q6_k | pp | 694 | 1408 | 2.03 |
| P40 | 7b q2_k | pp | 240 | 626 | 2.61 |
| P40 | 7b q3_k_s | pp | 205 | 432 | 2.11 |
| P40 | 7b q4_k_s | pp | 240 | 772 | 3.22 |
| P40 | 7b q5_k_s | pp | 210 | 474 | 2.26 |
| P40 | 7b q6_k | pp | 249 | 680 | 2.73 |

For reference, the speed of cuBLAS is ~1500 t/s on an RTX 3090 and ~500 t/s on a P40.

@Loufe

Loufe commented Aug 5, 2023

You're on fire lately, Johannes!

Quick question: your quoted 1500 t/s on a 3090, etc. with cuBLAS are with which quantization, for fair comparison?

On another note... As you mention, there seems to be an optimization targeting difference here: prompt processing vs token generation. Something missing from the simple t/s metric in all these PRs is the impact on prompt processing. I imagine a lot of testing involves a small prompt, so generation t/s is the only important metric there. I think prompt processing time (s/T) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.

@JohannesGaessler
Contributor Author

> Quick question: your quoted 1500 t/s on a 3090, etc. with cuBLAS are with which quantization, for fair comparison?

It doesn't matter: it's essentially the same speed for every quantization type, since the entire matrix is dequantized only once and the computations are then done entirely in 32-bit floating point arithmetic.

@slaren
Member

slaren commented Aug 5, 2023

3090 Ti / WSL2

| Model | pp t/s |
|---|---|
| 7b q2_k | 1404 |
| 7b q3_k | 1473 |
| 7b q4_k_m | 1521 |
| 7b q5_k_m | 1372 |
| 7b q6_k_m | 1350 |

7b cuBLAS is ~1460 t/s

Btw, there is quite a bit of noise between measurements. This was obtained by running perplexity on wiki.test.103 (the first 103 lines / 6144 tokens). It would be good to have a standardized way to test performance.
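Pending a standardized benchmarking tool, a repeat-and-take-the-median harness already cuts down on run-to-run noise. A minimal sketch (all names here are made up, not llama.cpp code):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Run `work` `reps` times and report the median wall-clock time in seconds.
// The median is far less sensitive to one-off outliers than a single run
// or the mean.
double median_seconds(const std::function<void()>& work, int reps) {
    std::vector<double> t;
    for (int i = 0; i < reps; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto end = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double>(end - start).count());
    }
    std::sort(t.begin(), t.end());
    return t[t.size()/2];
}
```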

Comment thread on ggml-cuda.cu:

```cpp
    return dm4f.x*sumf_d - dm4f.y*sumf_m;
#else
    return 0.0f; // only to satisfy the compiler
```
Member


Unrelated, but maybe an assert(false) here would be good, to make sure that these functions aren't used with incompatible hardware.
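A host-side sketch of the fail-loudly idea (an assumption for illustration: in actual CUDA device code one would use `assert(false)` or `__trap()` rather than a C++ exception, and the names here are made up):

```cpp
#include <stdexcept>

// Guard the unsupported-architecture fallback so the function fails loudly
// instead of silently returning 0 and producing garbage results.
float vec_dot(bool arch_supported, float dm_x, float sumf_d, float dm_y, float sumf_m) {
    if (arch_supported) {
        return dm_x*sumf_d - dm_y*sumf_m;
    }
    // In device code: assert(false) / __trap(). On the host we throw so the
    // misuse is observable.
    throw std::logic_error("mul_mat_q kernel used on incompatible hardware");
}
```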

@JohannesGaessler JohannesGaessler merged commit f514d1b into ggml-org:master Aug 5, 2023
@JohannesGaessler
Contributor Author

> On another note... As you mention, there seems to be an optimization targeting difference here: prompt processing vs token generation. Something missing from the simple t/s metric in all these PRs is the impact on prompt processing. I imagine a lot of testing involves a small prompt, so generation t/s is the only important metric there. I think prompt processing time (s/T) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.

@Loufe I forgot to say: for most kernels the distinction does not matter because they are a) I/O bound anyways and b) only take up a very small percentage of the total runtime.
