[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel #21700
PMZFX wants to merge 1 commit into ggml-org:master
Conversation
Oddly I saw a TG improvement but not so much a PP improvement with the B60 🤔
- Llama-2-7B Q2_K (dual GPU)
- Qwen3.5-27B Q2_K_XL (dual GPU)
- Qwen2.5-1.5B-Instruct Q2_K (single GPU)
It needs to be verified on more GPUs: iGPU, Arc7xx, BMG, and Xe iGPU (Meteor Lake or newer). Thank you!
Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which inflated the pp numbers significantly. @maxious thanks for testing on the B60. Your clean A/B comparison is what made the mismatch obvious. The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed. Updated title and description to reflect this. The change is still architecturally correct; these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register pressure warnings confirm 32 is too wide for Intel (at least on our cards).
We use 32 as the warp_size in some kernels because testing showed better performance.
Sorry, it was my mistake to close this PR.
The Arc770, BMG580, and iGPU (UHD) are not noticeably impacted. I think it's acceptable. Thank you!
arthw left a comment
Good job!
Sub-group size 16 is more suitable on Intel GPUs.
The legacy code used 32 based on test results, but with the latest driver and compiler, changing it to 16 makes the code clearer and easier to maintain.
The results also confirm that 16 is the better value across all existing Intel GPUs:
There is a performance increase on B70/B60/PVC.
There is no impact on most older Intel dGPUs and iGPUs (Arc770, BMG580, UHD iGPU).
Except for a -4% TG regression on PVC with Q4_K, there is no negative performance impact on other Intel GPUs.
Thank you!
Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets.

The original kernels were migrated from CUDA via DPCT and retained a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups, and the DPCT tool itself flagged these kernels with register pressure warnings recommending a smaller subgroup size. Each kernel thread now processes both halves of the QK_K=256 block via a loop, preserving identical total work and numerical results.

Tested on Intel Arc Pro B70 (Xe2/Battlemage):
- test-backend-ops: all K-quant types pass (debug + release)
- perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2)
- llama-bench: 2.3-2.7x prefill improvement, neutral tg

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
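For reference (not part of the PR), the subgroup widths a device actually supports can be checked with a few lines of standard SYCL 2020; on Intel Xe parts this typically reports several sizes, of which 16 is the one the DMMV kernels now request:

```cpp
// Standalone probe (illustrative, not PR code): print the subgroup sizes
// the GPU driver reports for the default SYCL GPU device.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    auto dev = q.get_device();
    std::cout << "device: "
              << dev.get_info<sycl::info::device::name>() << "\n";
    for (size_t sg : dev.get_info<sycl::info::device::sub_group_sizes>()) {
        std::cout << "supported subgroup size: " << sg << "\n";
    }
    return 0;
}
```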
Force-pushed from 839c7e2 to 6f28d8c
Rebased onto current master and expanded scope to also cover the Q4_K and Q6_K reorder DMMV kernels added in #21638. After #21638 merged, dense Q4_K_M models (e.g. EVA-Qwen2.5-72B) hang during warmup on Intel GPUs. The reorder DMMV kernels were written against the unfixed QK_WARP_SIZE=32 code because this PR hadn't merged yet. This rebase applies the same fix to them. Testing on B70:
Updated the PR description with full details.
Summary
Use `WARP_SIZE` (16) instead of `QK_WARP_SIZE` (32) for all K-quant DMMV kernels (Q2_K through Q6_K), including the reorder variants added in #21638.

These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup size. On Intel targets, the native subgroup size is 16. DPCT itself flagged all five original kernels with register pressure warnings recommending a smaller sub-group size. The non-K-quant DMMV path already uses `WARP_SIZE` (16).

Each thread now processes both halves of the QK_K=256 block via a `for (int im = 0; im < 2; ++im)` loop. The inner dot-product computation is unchanged; a simplified sketch of the pattern follows.
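A minimal, self-contained sketch of that restructuring, assuming a plain fp32 dot product in place of the real Q2_K–Q6_K dequantize logic; `mul_vec_block` and `partial_dot` are illustrative names, not functions from dmmv.cpp, and `x`/`y`/`out` are assumed to be USM device-accessible pointers:

```cpp
#include <sycl/sycl.hpp>

constexpr int WARP_SIZE = 16;   // native Intel subgroup width
constexpr int QK_K      = 256;  // K-quant super-block size

// Illustrative helper: each of the 16 lanes covers 8 elements of one
// 128-element half of the block.
static inline float partial_dot(const float *x, const float *y,
                                int half, int lane) {
    float acc = 0.0f;
    for (int i = lane * 8; i < (lane + 1) * 8; ++i) {
        acc += x[half * (QK_K / 2) + i] * y[half * (QK_K / 2) + i];
    }
    return acc;
}

void mul_vec_block(sycl::queue &q, const float *x, const float *y, float *out) {
    q.parallel_for(
        sycl::nd_range<1>(WARP_SIZE, WARP_SIZE),
        [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
            const int lane = it.get_local_id(0);
            float tmp = 0.0f;
            // Each 16-wide lane now processes BOTH halves of the QK_K=256
            // block; the old 32-wide version gave each lane a single half.
            for (int im = 0; im < 2; ++im) {
                tmp += partial_dot(x, y, im, lane);
            }
            // Subgroup reduction now runs over 16 lanes instead of 32.
            tmp = sycl::reduce_over_group(it.get_sub_group(), tmp,
                                          sycl::plus<float>());
            if (lane == 0) {
                *out = tmp;
            }
        });
}
```

Looping over `im` keeps the per-block arithmetic identical to the 32-lane version, which is why the perplexity and test-backend-ops results are unchanged.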
Updated scope (rebase)

The original version of this PR covered only the five non-reorder K-quant kernels. Since then, #21638 added `dequantize_mul_mat_vec_q4_k_reorder` and `dequantize_mul_mat_vec_q6_k_reorder`, which were written against the same QK_WARP_SIZE=32 code. On Intel GPUs, these new reorder kernels cause a hard hang during warmup on dense Q4_K models (tested with EVA-Qwen2.5-72B Q4_K_M on Battlemage B70).

I wrote the original fix here before writing the reorder kernels in #21638. The reorder kernels were copied from the unfixed upstream code because this PR hadn't merged yet. Now that #21638 has landed, the hang is present on current master for dense Q4_K/Q6_K models. This rebased version covers the reorder variants with the same fix.
Changes
`ggml/src/ggml-sycl/dmmv.cpp`: Each of the 7 affected kernels (5 original + 2 reorder) is restructured so that each thread processes both halves of the QK_K=256 block via a `for (int im = 0; im < 2; ++im)` loop, keeping total work identical. The `im`-dependent offsets move inside the loop. Dispatch functions and warp reductions switch from `QK_WARP_SIZE` to `WARP_SIZE`; a sketch of the dispatch-side change follows.

No other files changed.
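A hedged sketch of what the dispatch-side switch looks like, assuming the usual ggml-sycl launch shape (a `block_nums * block_dims` nd_range); the function name `launch_dmmv_q2_k` and the stubbed kernel body are illustrative, not verbatim dmmv.cpp code:

```cpp
#include <sycl/sycl.hpp>

constexpr int WARP_SIZE = 16; // native Intel subgroup width

// Illustrative dispatch wrapper: only the launch width and the required
// subgroup size change relative to the old QK_WARP_SIZE=32 version.
static void launch_dmmv_q2_k(sycl::queue &stream, int nrows, int ny) {
    const sycl::range<3> block_nums(1, 1, (nrows + ny - 1) / ny);
    const sycl::range<3> block_dims(1, ny, WARP_SIZE); // was (1, ny, QK_WARP_SIZE)
    stream.parallel_for(
        sycl::nd_range<3>(block_nums * block_dims, block_dims),
        [=](sycl::nd_item<3> item)
            [[sycl::reqd_sub_group_size(WARP_SIZE)]] { // was QK_WARP_SIZE
            // ... per-thread Q2_K dot product over both block halves ...
            (void) item; // kernel body stubbed out in this sketch
        });
}
```

`reqd_sub_group_size` is the standard SYCL mechanism for pinning a kernel to one subgroup width; apart from the `im` loop, the kernel bodies stay as they were.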
Testing
Intel Arc Pro B70 (Xe2/Battlemage, 32 GB), dual GPU, oneAPI 2025.3.3, Ubuntu 26.04.
Hang fix:
Correctness: `test-backend-ops -o MUL_MAT -b SYCL0`: 911/911 passed, 0 failures

Performance (single GPU, sequential, Release, JIT, GGML_SYCL_F16=ON):
Q8_0 is unaffected by this PR (already uses WARP_SIZE). pp variance is normal run-to-run variation. No regressions.
Previous testing by @arthw across Arc770, BMG580, iGPU (UHD), and PVC confirmed the original non-reorder changes were acceptable. The reorder kernels follow the identical pattern.