
opencl: add optimized q4_1 mm kernel for adreno #19840

Merged
lhez merged 9 commits into ggml-org:master from qualcomm:sq/q4_1_mm_opencl_kernels on Mar 3, 2026

@shaofeiqi
Contributor

This PR adds optimized OpenCL kernels for Q4_1 GEMM and GEMV operations on Adreno GPUs.
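
For readers unfamiliar with the format, here is a minimal sketch of what Q4_1 stores and what a GEMV over Q4_1 weights computes. It follows the block layout of ggml's CPU reference code (32 values per block, an fp16 scale `d`, an fp16 minimum `m`, and 16 bytes of packed 4-bit quants, with each weight recovered as `w = d * q + m`). The function names are hypothetical, and the sketch says nothing about how the Adreno kernels in this PR lay out or unpack the data.

```python
import struct

QK4_1 = 32  # values per block in ggml's Q4_1 format


def to_fp16(v):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


def quantize_block_q4_1(x):
    """Quantize 32 floats into (d, m, qs): scale, minimum, 16 packed bytes."""
    assert len(x) == QK4_1
    lo, hi = min(x), max(x)
    d = (hi - lo) / 15.0              # 4 bits -> 16 levels, 15 steps
    inv_d = 1.0 / d if d else 0.0
    q = [min(15, int((v - lo) * inv_d + 0.5)) for v in x]
    # Byte j holds element j in its low nibble and element j + 16 in its high nibble.
    qs = bytes(q[j] | (q[j + 16] << 4) for j in range(16))
    return to_fp16(d), to_fp16(lo), qs


def dequantize_block_q4_1(d, m, qs):
    """Recover 32 approximate floats from one block: w = d * q + m."""
    low = [(b & 0x0F) * d + m for b in qs]
    high = [(b >> 4) * d + m for b in qs]
    return low + high


def gemv_q4_1(rows, x):
    """Reference GEMV. rows[r] is a list of (d, m, qs) blocks; x is the activation vector."""
    out = []
    for blocks in rows:
        acc = 0.0
        for i, (d, m, qs) in enumerate(blocks):
            w = dequantize_block_q4_1(d, m, qs)
            xs = x[i * QK4_1:(i + 1) * QK4_1]
            acc += sum(wi * xi for wi, xi in zip(w, xs))
        out.append(acc)
    return out
```

GEMM is the same computation applied to many activation vectors at once, which is the shape of work that dominates prompt processing; GEMV corresponds to single-token generation.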

github-actions bot added the ggml and OpenCL labels Feb 23, 2026
lhez force-pushed the sq/q4_1_mm_opencl_kernels branch from 9dbdb49 to 8ef5b83 on February 27, 2026 07:51
@lhez
Contributor

lhez commented Feb 27, 2026

On X Elite,

Before,

Qwen3-0.6B-Q4_1

common_perf_print: prompt eval time =     532.78 ms /   235 tokens (    2.27 ms per token,   441.08 tokens per second)
common_perf_print:        eval time =    5592.80 ms /   256 runs   (   21.85 ms per token,    45.77 tokens per second)

Qwen3-4B-Q4_1

common_perf_print: prompt eval time =    3652.33 ms /   235 tokens (   15.54 ms per token,    64.34 tokens per second)
common_perf_print:        eval time =   23390.10 ms /   256 runs   (   91.37 ms per token,    10.94 tokens per second)

Llama-3.2-3B-Instruct-Q4_1

common_perf_print: prompt eval time =    2826.35 ms /   236 tokens (   11.98 ms per token,    83.50 tokens per second)
common_perf_print:        eval time =   19356.50 ms /   256 runs   (   75.61 ms per token,    13.23 tokens per second)

After,

Qwen3-0.6B-Q4_1

common_perf_print: prompt eval time =     213.15 ms /   235 tokens (    0.91 ms per token,  1102.53 tokens per second)
common_perf_print:        eval time =    5573.82 ms /   256 runs   (   21.77 ms per token,    45.93 tokens per second)

Qwen3-4B-Q4_1

common_perf_print: prompt eval time =    1099.39 ms /   235 tokens (    4.68 ms per token,   213.75 tokens per second)
common_perf_print:        eval time =   21173.46 ms /   256 runs   (   82.71 ms per token,    12.09 tokens per second)

Llama-3.2-3B-Instruct-Q4_1

common_perf_print: prompt eval time =     776.87 ms /   236 tokens (    3.29 ms per token,   303.78 tokens per second)
common_perf_print:        eval time =   15153.24 ms /   256 runs   (   59.19 ms per token,    16.89 tokens per second)
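
To read the logs above as speedups, a short script can divide the after throughput by the before throughput for each model and phase. The tokens-per-second figures are copied from the `common_perf_print` lines; the dictionary layout and the `speedup` helper are only illustrative.

```python
# Tokens per second copied from the logs above, as (before, after) pairs.
results = {
    "Qwen3-0.6B-Q4_1": {"prompt": (441.08, 1102.53), "eval": (45.77, 45.93)},
    "Qwen3-4B-Q4_1": {"prompt": (64.34, 213.75), "eval": (10.94, 12.09)},
    "Llama-3.2-3B-Instruct-Q4_1": {"prompt": (83.50, 303.78), "eval": (13.23, 16.89)},
}


def speedup(before, after):
    """Throughput ratio: values above 1.0 mean the 'after' run is faster."""
    return after / before


for model, phases in results.items():
    prompt = speedup(*phases["prompt"])
    gen = speedup(*phases["eval"])
    print(f"{model}: prompt eval {prompt:.2f}x, eval {gen:.2f}x")
```

On these runs, prompt processing improves by roughly 2.5x to 3.6x, while token generation ranges from essentially unchanged on the 0.6B model to about 1.3x on Llama-3.2-3B.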

lhez marked this pull request as ready for review February 27, 2026 21:59
lhez merged commit 24350fd into ggml-org:master Mar 3, 2026
78 checks passed
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* Add Q4_1 OpenCL Kernels

* opencl: refactor transpose

* opencl: format

* opencl: refactor q4_1 unpack

* opencl: move `ggml_cl_mul_mat_q4_1_f32_adreno`

* opencl: refactor `ggml_cl_mul_mat_q4_1_f32_adreno` and kernels

* opencl: rename kernel files and kernels

* opencl: fix build for non adreno

* opencl: move code around and format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026