[SYCL] Fix the sub-group size of Intel #8106
Conversation
airMeng
left a comment
need confirmation from our codeplay mates to verify on other HW
Force-pushed fcbd140 to 7d8c960
@AidanBeltonS @OuadiElfarouki @joeatodd if there is no regression on your side, I will merge it soon.
The quantize functions assume that WARP_SIZE equals the block size (32); there is remaining work for this.
Force-pushed ed267f2 to e2023a9
joeatodd
left a comment
Hey, the changes look good. Nice to see more work splitting ggml-sycl.cpp into smaller files! 👍
Tested on our side; all fine except for the two small things I've raised.
Hi @joeatodd, please review the change.
I've fixed the src1_ncols bug of mmvq, but there is a remaining accuracy bug when the prompt length is less than 9 (outputs for this PR branch and the master branch with prompt_length=9 were attached). The master branch is worse than this PR, and I can tell that this is not related to the sub-group size, so I will fix it in a follow-up PR instead of this one.
joeatodd
left a comment
Looks good, thanks for the changes 😃
shall there be some check of block_size? We don't want to support all block sizes, right?
This function is for Q8_1, so there is only one fixed block size.
yes, shall we add some check like
static_assert(QUANT_BLOCK_TILE == 32)
QUANT_BLOCK_TILE = QK8_1 / WARP_SIZE. It's not necessary to add this check; the kernel works for any block size that satisfies BlockSize % WARP_SIZE == 0.
@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago. You can check this comment. How about a longer prompt?
I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear when the context length is larger than some value.
@qnixsynapse The first one is confirmed as an existing issue of the master branch. I will look into the second one to see whether it is introduced by this PR.
It's a regression, since before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry that I couldn't test it earlier; I was busy testing Gemma models.
I didn't test Q4_K_S models. I will test it on A770.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=nvptx64-nvidia-cuda")
add_compile_definitions(GGML_SYCL_WARP_SIZE=32)
else()
add_compile_definitions(GGML_SYCL_WARP_SIZE=16)
@qnixsynapse Could you change this size to 32 and test your model?
I've confirmed this bug on Q4_K_S.
It's also broken on Q4_0.
@qnixsynapse BTW, which UI are you using? It looks quite cool.
@airMeng It's chainlit.
@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE to 32 for Intel GPUs in a new PR to revert this regression. Do you agree?
@luoyu-intel Sure. :)
Edit: BTW, I am getting about 30 tokens/sec with iQ4_XS using a warp_size of 32 and the other portions of this PR; the earlier generation speed was 20 tokens/sec. So please don't revert anything else. :)
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size







Changes:
-nan. It's an issue from dpcpp: [SYCL] subgroup shuffle bug when sg_size=32 (intel/llvm#14274). Debug can output the same tokens as release in this PR (master runs into exceptions):
Release output:
Performance benefit
Intel Arc A770
37 tokens/s to 39 tokens/s (Windows + 9600K)
38.9 tokens/s to 41.8 tokens/s (Linux + Xeon 4th)