CUDA: faster k-quant mul_mat_q kernels#2525
Conversation
You're on fire lately, Johannes! Quick question: your quoted 1500 t/s on a 3090, etc. with cuBLAS are with which quantization, for fair comparison? On another note... As you mention, there seems to be an optimization targeting difference here: prompt processing vs. token generation. Something missing from the simple t/s metric with all these PRs is the impact on prompt processing. I imagine a lot of testing involved a small prompt, so t/s generation is the only important metric. I think prompt processing time (s/t) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.
Doesn't matter, it's essentially the same speed for each quantization type since the entire matrix is only dequantized once and then the computations are done entirely using 32 bit floating point arithmetic.
3090 Ti / WSL2
7b cuBLAS is ~1460 t/s. Btw there is quite a bit of noise between measurements. This is obtained with
```cpp
    return dm4f.x*sumf_d - dm4f.y*sumf_m;
#else
    return 0.0f; // only to satisfy the compiler
```
Unrelated, but maybe an assert(false) in here would be good to make sure that these functions aren't used with incompatible hardware.
@Loufe I forgot to say: for most kernels the distinction does not matter because they are a) I/O bound anyways and b) only take up a very small percentage of the total runtime.
This PR adds faster mul_mat_q kernels for k-quants. The new kernels are optimized for compute (the prompt processing bottleneck) rather than memory bandwidth (the token generation bottleneck). The approach is essentially the same as in #2483: change the order in which the data is iterated to reduce the number of operations, and move as much computation as possible to the data loading, which is executed only once per 32 computations. Unfortunately the latter didn't quite work out for assembling q3_K upon loading due to shared memory limits. This is the current performance:
For reference, the speed of cuBLAS is ~1500 t/s on an RTX 3090 and ~500 t/s on a P40.