Skip to content

[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel#21700

Open
PMZFX wants to merge 1 commit intoggml-org:masterfrom
PMZFX:opt/kquant-dmmv-subgroup16
Open

[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel#21700
PMZFX wants to merge 1 commit intoggml-org:masterfrom
PMZFX:opt/kquant-dmmv-subgroup16

Conversation

@PMZFX
Copy link
Copy Markdown
Contributor

@PMZFX PMZFX commented Apr 9, 2026

Summary

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for all K-quant DMMV kernels (Q2_K through Q6_K), including the reorder variants added in #21638.

These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup size. On Intel targets, native subgroup size is 16. DPCT itself flagged all five original kernels with register pressure warnings recommending a smaller sub-group size. The non-K-quant DMMV path already uses WARP_SIZE (16).

Each thread now processes both halves of the QK_K=256 block via a for (int im = 0; im < 2; ++im) loop. The inner dot-product computation is unchanged.

Updated scope (rebase)

The original version of this PR covered only the five non-reorder K-quant kernels. Since then, #21638 added dequantize_mul_mat_vec_q4_k_reorder and dequantize_mul_mat_vec_q6_k_reorder, which were written against the same QK_WARP_SIZE=32 code. On Intel GPUs, these new reorder kernels cause a hard hang during warmup on dense Q4_K models (tested with EVA-Qwen2.5-72B Q4_K_M on Battlemage B70).

I wrote the original fix here before writing the reorder kernels in #21638. The reorder kernels were copied from the unfixed upstream code because this PR hadn't merged yet. Now that #21638 has landed, the hang is present on current master for dense Q4_K/Q6_K models. This rebased version covers the reorder variants with the same fix.

Changes

ggml/src/ggml-sycl/dmmv.cpp:

Each of the 7 affected kernels (5 original + 2 reorder) is restructured so that each thread processes both halves of the QK_K=256 block via a for (int im = 0; im < 2; ++im) loop, keeping total work identical. The im-dependent offsets move inside the loop. Dispatch functions and warp reductions switch from QK_WARP_SIZE to WARP_SIZE.

No other files changed.

Testing

Intel Arc Pro B70 (Xe2/Battlemage, 32 GB), dual GPU, oneAPI 2025.3.3, Ubuntu 26.04.

Hang fix:

  • EVA-Qwen2.5-72B Q4_K_M (dense, 80 layers): hangs on master during warmup, loads and serves correctly with this PR. 3 sequential prompts, all coherent.

Correctness:

  • test-backend-ops -o MUL_MAT -b SYCL0: 911/911 passed, 0 failures

Performance (single GPU, sequential, Release, JIT, GGML_SYCL_F16=ON):

Model Metric Master PR Delta
Qwen3.5-9B Q4_K_M pp128 1040 t/s 1128 t/s +8.5%
Qwen3.5-9B Q4_K_M tg32 54.6 t/s 55.3 t/s +1.3%
Qwen3.5-9B Q8_0 pp128 1133 t/s 1079 t/s -4.8%
Qwen3.5-9B Q8_0 tg32 47.6 t/s 47.5 t/s -0.2%
Gemma4-31B Q6_K pp128 327 t/s 327 t/s 0%
Gemma4-31B Q6_K tg32 13.4 t/s 13.4 t/s 0%

Q8_0 is unaffected by this PR (already uses WARP_SIZE). pp variance is normal run-to-run variation. No regressions.

Previous testing by @arthw across Arc770, BMG580, iGPU (UHD), and PVC confirmed the original non-reorder changes were acceptable. The reorder kernels follow the identical pattern.

@PMZFX PMZFX requested a review from a team as a code owner April 9, 2026 23:43
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Apr 9, 2026
@maxious
Copy link
Copy Markdown

maxious commented Apr 10, 2026

Oddly I saw a TG improvement but not so much a PP with the B60 🤔
But PR looks good to merge 👍

Llama-2-7B Q2_K (dual GPU)

Test Master (d12cc3d) PR #21700 (839c7e2) Improvement
pp512 1465.5 ± 23 t/s 1484.8 ± 30 t/s +1.3%
tg128 15.6 t/s 21.3 t/s +37.1%

Qwen3.5-27B Q2_K_XL (dual GPU)

Test Master (d12cc3d) PR #21700 (839c7e2) Improvement
pp512 430.7 t/s 430.9 t/s ~0%
tg128 8.10 t/s 8.10 t/s ~0%

Qwen2.5-1.5B-Instruct Q2_K (single GPU)

Test Master (d12cc3d) PR #21700 (839c7e2) Improvement
pp512 6931 t/s 6960 t/s +0.4%
tg128 85.7 t/s 102.4 t/s +19.5%

@NeoZhangJianyu
Copy link
Copy Markdown
Contributor

It needs to be verified on more GPUs: iGPU, Arc7xx, BMG and Xe iGPU (meteor lake or newer).
I will feedback later.

Thank you!

@PMZFX PMZFX changed the title [SYCL] Use subgroup size 16 for K-quant DMMV kernels on Intel (2.3x–2.7x pp on Arc B70) [SYCL] Use native subgroup size for K-quant DMMV kernels on Intel Apr 10, 2026
@PMZFX
Copy link
Copy Markdown
Contributor Author

PMZFX commented Apr 10, 2026

Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which inflated the pp numbers significantly.
I Re-ran with matched builds and updated the description.

@maxious thanks for testing on the B60. Your clean A/B comparison is what made the mismatch obvious. The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed.

Updated title and description to reflect this.

The change is still architecturally correct; these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register pressure warnings confirm 32 is too wide for Intel (at least, our cards).

@arthw arthw closed this Apr 10, 2026
@arthw
Copy link
Copy Markdown
Contributor

arthw commented Apr 10, 2026

We use 32 as warp_size in some kernel for better performance by test.

@arthw
Copy link
Copy Markdown
Contributor

arthw commented Apr 11, 2026

Sorry, it's my mistake to close this PR.
I reopen it.

@arthw arthw reopened this Apr 11, 2026
@arthw
Copy link
Copy Markdown
Contributor

arthw commented Apr 11, 2026

The Arc770,BMG580, iGPU (UHD) are not impacted obviously.
PVC has 0% on PP, +12.5% on tg on Q2_K_XL.
PVC has 0% on PP, -4% on tg on Q4_K.

I think it's acceptable.

Thank you!

Copy link
Copy Markdown
Contributor

@arthw arthw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good job.

The sub-group size 16 is more useful on Intel GPUs.
The legacy code use 32 is based on test result.

Based on the latest driver and compiler, change them to 16 will make the code be clear and easy maintained.

It also approves the value 16 is better value to all Intel existed GPUs.

There are increase impact on B70/B60/PVC.
There is no impact to most of Intel old dGPU and iGPU (Arc770, BMG580, iGPU).
Except PVC has -4% of TG on Q4_K, there are no bad impact of performance on other Intel GPUs.

Thank you!

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV
kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets.

The original kernels were migrated from CUDA via DPCT and retained
a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups,
and the DPCT tool itself flagged these kernels with register pressure
warnings recommending a smaller subgroup size.

Each kernel thread now processes both halves of the QK_K=256 block
via a loop, preserving identical total work and numerical results.

Tested on Intel Arc Pro B70 (Xe2/Battlemage):
- test-backend-ops: all K-quant types pass (debug + release)
- perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2)
- llama-bench: 2.3-2.7x prefill improvement, neutral tg

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PMZFX PMZFX force-pushed the opt/kquant-dmmv-subgroup16 branch from 839c7e2 to 6f28d8c Compare April 16, 2026 15:54
@PMZFX
Copy link
Copy Markdown
Contributor Author

PMZFX commented Apr 16, 2026

Rebased onto current master and expanded scope to also cover the Q4_K and Q6_K reorder DMMV kernels added in #21638.

After #21638 merged, dense Q4_K_M models (e.g. EVA-Qwen2.5-72B) hang during warmup on Intel GPUs. The reorder DMMV kernels were written against the unfixed QK_WARP_SIZE=32 code because this PR hadn't merged yet. This rebase applies the same for (im = 0; im < 2) restructuring to those two new kernels.

Testing on B70:

  • EVA-Qwen2.5-72B Q4_K_M: hangs on master, loads and serves correctly with this PR (3 multi-turn prompts, all coherent)
  • test-backend-ops -o MUL_MAT: 911/911 passed
  • Benchmarks: no regressions on Q4_K_M, Q6_K, or Q8_0 (matched build flags, sequential, single GPU)

Updated the PR description with full details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants