[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel by PMZFX · Pull Request #21700 · ggml-org/llama.cpp

PMZFX · 2026-04-09T23:43:43Z

Summary

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for all K-quant DMMV kernels (Q2_K through Q6_K), including the reorder variants added in #21638.

These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup size. On Intel targets, native subgroup size is 16. DPCT itself flagged all five original kernels with register pressure warnings recommending a smaller sub-group size. The non-K-quant DMMV path already uses WARP_SIZE (16).

Each thread now processes both halves of the QK_K=256 block via a for (int im = 0; im < 2; ++im) loop. The inner dot-product computation is unchanged.

Updated scope (rebase)

The original version of this PR covered only the five non-reorder K-quant kernels. Since then, #21638 added dequantize_mul_mat_vec_q4_k_reorder and dequantize_mul_mat_vec_q6_k_reorder, which were written against the same QK_WARP_SIZE=32 code. On Intel GPUs, these new reorder kernels cause a hard hang during warmup on dense Q4_K models (tested with EVA-Qwen2.5-72B Q4_K_M on Battlemage B70).

I wrote the original fix here before writing the reorder kernels in #21638. The reorder kernels were copied from the unfixed upstream code because this PR hadn't merged yet. Now that #21638 has landed, the hang is present on current master for dense Q4_K/Q6_K models. This rebased version covers the reorder variants with the same fix.

Changes

ggml/src/ggml-sycl/dmmv.cpp:

Each of the 7 affected kernels (5 original + 2 reorder) is restructured so that each thread processes both halves of the QK_K=256 block via a for (int im = 0; im < 2; ++im) loop, keeping total work identical. The im-dependent offsets move inside the loop. Dispatch functions and warp reductions switch from QK_WARP_SIZE to WARP_SIZE.

No other files changed.

Testing

Intel Arc Pro B70 (Xe2/Battlemage, 32 GB), dual GPU, oneAPI 2025.3.3, Ubuntu 26.04.

Hang fix:

EVA-Qwen2.5-72B Q4_K_M (dense, 80 layers): hangs on master during warmup, loads and serves correctly with this PR. 3 sequential prompts, all coherent.

Correctness:

test-backend-ops -o MUL_MAT -b SYCL0: 911/911 passed, 0 failures

Performance (single GPU, sequential, Release, JIT, GGML_SYCL_F16=ON):

Model	Metric	Master	PR	Delta
Qwen3.5-9B Q4_K_M	pp128	1040 t/s	1128 t/s	+8.5%
Qwen3.5-9B Q4_K_M	tg32	54.6 t/s	55.3 t/s	+1.3%
Qwen3.5-9B Q8_0	pp128	1133 t/s	1079 t/s	-4.8%
Qwen3.5-9B Q8_0	tg32	47.6 t/s	47.5 t/s	-0.2%
Gemma4-31B Q6_K	pp128	327 t/s	327 t/s	0%
Gemma4-31B Q6_K	tg32	13.4 t/s	13.4 t/s	0%

Q8_0 is unaffected by this PR (already uses WARP_SIZE). pp variance is normal run-to-run variation. No regressions.

Previous testing by @arthw across Arc770, BMG580, iGPU (UHD), and PVC confirmed the original non-reorder changes were acceptable. The reorder kernels follow the identical pattern.

maxious · 2026-04-10T03:35:21Z

Oddly I saw a TG improvement but not so much a PP with the B60 🤔
But PR looks good to merge 👍

Llama-2-7B Q2_K (dual GPU)

Test	Master (`d12cc3d`)	PR #21700 (`839c7e2`)	Improvement
pp512	1465.5 ± 23 t/s	1484.8 ± 30 t/s	+1.3%
tg128	15.6 t/s	21.3 t/s	+37.1%

Qwen3.5-27B Q2_K_XL (dual GPU)

Test	Master (`d12cc3d`)	PR #21700 (`839c7e2`)	Improvement
pp512	430.7 t/s	430.9 t/s	~0%
tg128	8.10 t/s	8.10 t/s	~0%

Qwen2.5-1.5B-Instruct Q2_K (single GPU)

Test	Master (`d12cc3d`)	PR #21700 (`839c7e2`)	Improvement
pp512	6931 t/s	6960 t/s	+0.4%
tg128	85.7 t/s	102.4 t/s	+19.5%

NeoZhangJianyu · 2026-04-10T07:48:10Z

It needs to be verified on more GPUs: iGPU, Arc7xx, BMG and Xe iGPU (meteor lake or newer).
I will feedback later.

Thank you!

PMZFX · 2026-04-10T09:31:04Z

Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which inflated the pp numbers significantly.
I Re-ran with matched builds and updated the description.

@maxious thanks for testing on the B60. Your clean A/B comparison is what made the mismatch obvious. The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed.

Updated title and description to reflect this.

The change is still architecturally correct; these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register pressure warnings confirm 32 is too wide for Intel (at least, our cards).

arthw · 2026-04-10T13:55:43Z

We use 32 as warp_size in some kernel for better performance by test.

arthw · 2026-04-11T06:44:02Z

Sorry, it's my mistake to close this PR.
I reopen it.

arthw · 2026-04-11T14:18:28Z

The Arc770,BMG580, iGPU (UHD) are not impacted obviously.
PVC has 0% on PP, +12.5% on tg on Q2_K_XL.
PVC has 0% on PP, -4% on tg on Q4_K.

I think it's acceptable.

Thank you!

arthw

It's good job.

The sub-group size 16 is more useful on Intel GPUs.
The legacy code use 32 is based on test result.

Based on the latest driver and compiler, change them to 16 will make the code be clear and easy maintained.

It also approves the value 16 is better value to all Intel existed GPUs.

There are increase impact on B70/B60/PVC.
There is no impact to most of Intel old dGPU and iGPU (Arc770, BMG580, iGPU).
Except PVC has -4% of TG on Q4_K, there are no bad impact of performance on other Intel GPUs.

Thank you!

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets. The original kernels were migrated from CUDA via DPCT and retained a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups, and the DPCT tool itself flagged these kernels with register pressure warnings recommending a smaller subgroup size. Each kernel thread now processes both halves of the QK_K=256 block via a loop, preserving identical total work and numerical results. Tested on Intel Arc Pro B70 (Xe2/Battlemage): - test-backend-ops: all K-quant types pass (debug + release) - perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2) - llama-bench: 2.3-2.7x prefill improvement, neutral tg Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PMZFX · 2026-04-16T15:54:39Z

Rebased onto current master and expanded scope to also cover the Q4_K and Q6_K reorder DMMV kernels added in #21638.

After #21638 merged, dense Q4_K_M models (e.g. EVA-Qwen2.5-72B) hang during warmup on Intel GPUs. The reorder DMMV kernels were written against the unfixed QK_WARP_SIZE=32 code because this PR hadn't merged yet. This rebase applies the same for (im = 0; im < 2) restructuring to those two new kernels.

Testing on B70:

EVA-Qwen2.5-72B Q4_K_M: hangs on master, loads and serves correctly with this PR (3 multi-turn prompts, all coherent)
test-backend-ops -o MUL_MAT: 911/911 passed
Benchmarks: no regressions on Q4_K_M, Q6_K, or Q8_0 (matched build flags, sequential, single GPU)

Updated the PR description with full details.

PMZFX requested a review from a team as a code owner April 9, 2026 23:43

github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Apr 9, 2026

PMZFX changed the title ~~[SYCL] Use subgroup size 16 for K-quant DMMV kernels on Intel (2.3x–2.7x pp on Arc B70)~~ [SYCL] Use native subgroup size for K-quant DMMV kernels on Intel Apr 10, 2026

arthw closed this Apr 10, 2026

arthw reopened this Apr 11, 2026

arthw approved these changes Apr 11, 2026

View reviewed changes

PMZFX force-pushed the opt/kquant-dmmv-subgroup16 branch from 839c7e2 to 6f28d8c Compare April 16, 2026 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel#21700

[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel#21700
PMZFX wants to merge 1 commit intoggml-org:masterfrom
PMZFX:opt/kquant-dmmv-subgroup16

PMZFX commented Apr 9, 2026 •

edited

Loading

Uh oh!

maxious commented Apr 10, 2026

Uh oh!

NeoZhangJianyu commented Apr 10, 2026

Uh oh!

PMZFX commented Apr 10, 2026

Uh oh!

arthw commented Apr 10, 2026

Uh oh!

arthw commented Apr 11, 2026

Uh oh!

arthw commented Apr 11, 2026

Uh oh!

arthw left a comment

Uh oh!

PMZFX commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

PMZFX commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Updated scope (rebase)

Changes

Testing

Uh oh!

maxious commented Apr 10, 2026

Llama-2-7B Q2_K (dual GPU)

Qwen3.5-27B Q2_K_XL (dual GPU)

Qwen2.5-1.5B-Instruct Q2_K (single GPU)

Uh oh!

NeoZhangJianyu commented Apr 10, 2026

Uh oh!

PMZFX commented Apr 10, 2026

Uh oh!

arthw commented Apr 10, 2026

Uh oh!

arthw commented Apr 11, 2026

Uh oh!

arthw commented Apr 11, 2026

Uh oh!

arthw left a comment

Choose a reason for hiding this comment

Uh oh!

PMZFX commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

PMZFX commented Apr 9, 2026 •

edited

Loading