[SYCL] Fix the sub-group size of Intel #8106
Conversation
airMeng
left a comment
need confirmation from our codeplay mates to verify on other HW
Force-pushed fcbd140 to 7d8c960
@AidanBeltonS @OuadiElfarouki @joeatodd if there is no regression on your side, I will merge it soon.
The quantize functions assume that WARP_SIZE equals the block size (32); there is remaining work for this.
Force-pushed ed267f2 to e2023a9
joeatodd
left a comment
Hey, the changes look good. Nice to see more work splitting ggml-sycl.cpp into smaller files! 👍
Tested on our side; all fine except for the two small things I've raised.
Hi @joeatodd, please review the change.
I've fixed the src1_ncols bug of mmvq, but there is a remaining accuracy bug when the prompt length is less than 9 (outputs for this PR branch and the master branch with prompt_length=9 were attached). The master branch is worse than this PR, and I can tell that this is not related to the sub-group size, so I will fix it in a follow-up PR instead of this one.
joeatodd
left a comment
Looks good, thanks for the changes 😃
shall there be some check of block_size? We don't want to support all block sizes, right?
This function is for Q8_1, so there is only one fixed block size.
yes, shall we add some check like
static_assert(QUANT_BLOCK_TILE == 32)
QUANT_BLOCK_TILE = QK8_1 / WARP_SIZE. It's not necessary to add this check; the kernel works for any block size that satisfies BlockSize % WARP_SIZE == 0.
@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago. You can check this comment. How about a longer prompt?
I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear when the context length is larger than some value.
@qnixsynapse The first one is confirmed as an existing issue of the master branch. I will look into the second one to see whether it is introduced by this PR.
It's a regression, since before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry that I couldn't test it earlier; I was busy testing Gemma models.
I didn't test Q4_K_S models. I will test it on A770.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=nvptx64-nvidia-cuda")
add_compile_definitions(GGML_SYCL_WARP_SIZE=32)
else()
add_compile_definitions(GGML_SYCL_WARP_SIZE=16)
@qnixsynapse Could you change this size to 32 and test your model?
I've confirmed this bug on Q4_K_S.
It's also broken on Q4_0.
@qnixsynapse BTW, which UI are you using? It looks quite cool.
@airMeng It's chainlit.
@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE to 32 for Intel GPUs in a new PR to revert this regression. Do you agree?
@luoyu-intel Sure. :)
Edit: BTW, I am getting about 30 tokens/sec with iQ4_XS using a warp_size of 32 and the other portions of this PR; the earlier generation speed was 20 tokens/sec. So please don't revert anything else. :)
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size







Changes:
-nan. It's an issue from dpcpp: [SYCL] subgroup shuffle bug when sg_size=32 (intel/llvm#14274). Debug can output the same tokens as release in this PR (master runs into exceptions):
Release output:
Performance benefit
Intel Arc A770
37 tokens/s to 39 tokens/s (Windows + 9600K)
38.9 tokens/s to 41.8 tokens/s (Linux + Xeon 4th)