opencl: add q6_K gemm and gemv kernels for Adreno#20089

Merged

max-krasnyansky merged 18 commits into ggml-org:master from qualcomm:lh/q6_k-trans on Mar 23, 2026

Conversation

@lhez
Contributor

@lhez lhez commented Mar 3, 2026

This PR adds Q6_K gemm and gemv kernels for Adreno. This should improve performance for models containing Q6_K quantization.

For Q4_K_M, the same work still needs to be done for Q4_K. Q4_K_M therefore remains slow for now, though it should already improve somewhat, since Q4_K_M models also contain some Q6_K tensors.

On X Elite,

before,

Qwen3-0.6B-Q6_K,

common_perf_print: prompt eval time =     724.44 ms /   235 tokens (    3.08 ms per token,   324.39 tokens per second)
common_perf_print:        eval time =   13167.73 ms /   256 runs   (   51.44 ms per token,    19.44 tokens per second)

Qwen3-4B-Q6_K,

common_perf_print: prompt eval time =    4901.20 ms /   235 tokens (   20.86 ms per token,    47.95 tokens per second)
common_perf_print:        eval time =   51144.81 ms /   256 runs   (  199.78 ms per token,     5.01 tokens per second)

Qwen3-0.6B-Q4_K_M,

common_perf_print: prompt eval time =    1514.15 ms /   235 tokens (    6.44 ms per token,   155.20 tokens per second)
common_perf_print:        eval time =    8231.90 ms /   256 runs   (   32.16 ms per token,    31.10 tokens per second)

Qwen3-4B-Q4_K_M.gguf,

common_perf_print: prompt eval time =   11502.19 ms /   235 tokens (   48.95 ms per token,    20.43 tokens per second)
common_perf_print:        eval time =   28561.40 ms /   256 runs   (  111.57 ms per token,     8.96 tokens per second)

after,

Qwen3-0.6B-Q6_K,

common_perf_print: prompt eval time =     281.45 ms /   235 tokens (    1.20 ms per token,   834.95 tokens per second)
common_perf_print:        eval time =    4243.57 ms /   256 runs   (   16.58 ms per token,    60.33 tokens per second)

Qwen3-4B-Q6_K,

common_perf_print: prompt eval time =    1605.23 ms /   235 tokens (    6.83 ms per token,   146.40 tokens per second)
common_perf_print:        eval time =   23625.54 ms /   256 runs   (   92.29 ms per token,    10.84 tokens per second)

Qwen3-0.6B-Q4_K_M,

common_perf_print: prompt eval time =    1497.88 ms /   235 tokens (    6.37 ms per token,   156.89 tokens per second)
common_perf_print:        eval time =    5191.81 ms /   256 runs   (   20.28 ms per token,    49.31 tokens per second)

Qwen3-4B-Q4_K_M.gguf,

common_perf_print: prompt eval time =   10775.26 ms /   235 tokens (   45.85 ms per token,    21.81 tokens per second)
common_perf_print:        eval time =   25510.26 ms /   256 runs   (   99.65 ms per token,    10.04 tokens per second)

@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning OpenCL Issues specific to the OpenCL backend labels Mar 3, 2026
@lhez lhez marked this pull request as ready for review March 22, 2026 22:57
@lhez lhez requested a review from a team as a code owner March 22, 2026 22:57
@max-krasnyansky max-krasnyansky merged commit 1772701 into ggml-org:master Mar 23, 2026
48 checks passed
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026