[SYCL] Fix WARP_SIZE=16 bug of Intel GPU by luoyu-intel · Pull Request #8266 · ggml-org/llama.cpp

luoyu-intel · 2024-07-03T03:06:32Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

This PR fixes several bugs that occur with WARP_SIZE=16 on Intel GPUs. All warp-related unit tests pass.
With these fixes, WARP_SIZE=16 produces the same output as WARP_SIZE=32 on Intel GPUs.

NOTE: The QX_K kernels are specialized for WARP_SIZE=32, so they keep a fixed warp size of 32 regardless of the WARP_SIZE setting.
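To illustrate why a kernel written around a 32-lane layout cannot simply be run with 16 lanes, here is a minimal host-side sketch in plain C++ (not SYCL, and not code from this PR). The function name `row_sum` and its parameters `assumed_width` and `hw_width` are hypothetical. It simulates a strided partial sum followed by a butterfly (XOR) reduction. As a simplifying assumption, lanes outside the hardware sub-group are treated as contributing zero; real hardware may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation (illustrative only, not the PR's kernel code).
// assumed_width: the warp size the kernel's indexing was written for.
// hw_width:      the sub-group size the hardware actually runs with.
// Returns the value lane 0 holds after the reduction.
float row_sum(const std::vector<float>& data,
              std::size_t assumed_width,
              std::size_t hw_width) {
    // The butterfly pattern below requires a power-of-two width.
    assert(assumed_width != 0 && (assumed_width & (assumed_width - 1)) == 0);
    assert(hw_width != 0 && hw_width <= assumed_width);

    // Step 1: each lane accumulates a strided slice of the row.
    std::vector<float> partial(assumed_width, 0.0f);
    for (std::size_t lane = 0; lane < assumed_width; ++lane) {
        for (std::size_t i = lane; i < data.size(); i += assumed_width) {
            partial[lane] += data[i];
        }
    }

    // Step 2: butterfly reduction using XOR partners.
    // ASSUMPTION: a partner outside the hardware sub-group contributes 0.
    for (std::size_t mask = assumed_width / 2; mask > 0; mask /= 2) {
        std::vector<float> next(partial);
        for (std::size_t lane = 0; lane < hw_width; ++lane) {
            std::size_t partner = lane ^ mask;
            float other = partner < hw_width ? partial[partner] : 0.0f;
            next[lane] = partial[lane] + other;
        }
        partial = next;
    }
    return partial[0];
}
```

For the 64 values 1..64, `row_sum(data, 16, 16)` and `row_sum(data, 32, 32)` both return the full sum of 2080, while `row_sum(data, 32, 16)` returns 784 in this model because the partial sums held by lanes 16..31 never reach lane 0. Keeping the indexing width and the sub-group width in agreement, as the note above describes for the QX_K kernels, avoids that mismatch.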

Performance change

llama-2-7b-chat-hf-q4_0.gguf with 32 input and 32 output tokens on an Arc A770: token generation improves from about 40 tokens/s to about 44 tokens/s.

Master Branch

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, have  new experiences and learn new things. She was a curious child, always eager to explore and discover new things.

One day, she found a small door
llama_print_timings:        load time =   14573.13 ms
llama_print_timings:      sample time =       0.53 ms /    32 runs   (    0.02 ms per token, 60952.38 tokens per second)
llama_print_timings: prompt eval time =     217.69 ms /    32 tokens (    6.80 ms per token,   147.00 tokens per second)
llama_print_timings:        eval time =     771.81 ms /    31 runs   (   24.90 ms per token,    40.17 tokens per second)
llama_print_timings:       total time =     994.93 ms /    63 tokens

PR Branch

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, have  new experiences and learn new things. She was a curious child, always eager to explore and discover new things.

One day, she found a small door
llama_print_timings:        load time =   14716.53 ms
llama_print_timings:      sample time =       0.52 ms /    32 runs   (    0.02 ms per token, 61776.06 tokens per second)
llama_print_timings: prompt eval time =     216.68 ms /    32 tokens (    6.77 ms per token,   147.69 tokens per second)
llama_print_timings:        eval time =     698.20 ms /    31 runs   (   22.52 ms per token,    44.40 tokens per second)
llama_print_timings:       total time =     920.69 ms /    63 tokens
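
The eval-time lines in the two logs can be turned into the quoted throughput figures with simple arithmetic. The sketch below only reproduces that calculation; the helper names `tokens_per_second` and `speedup` are hypothetical and are not part of llama.cpp.

```cpp
#include <cassert>

// Throughput in tokens/s from an eval time in milliseconds and a run count.
double tokens_per_second(double eval_ms, int runs) {
    assert(eval_ms > 0.0 && runs > 0);
    return runs * 1000.0 / eval_ms;
}

// Ratio of baseline eval time to new eval time (values above 1 mean faster).
double speedup(double baseline_ms, double new_ms) {
    assert(baseline_ms > 0.0 && new_ms > 0.0);
    return baseline_ms / new_ms;
}
```

Plugging in the logged eval times (771.81 ms on master and 698.20 ms on the PR branch, each over 31 runs) gives roughly 40.17 and 44.40 tokens/s, a speedup of about 1.105, or roughly 10.5% faster token generation. Prompt eval time is essentially unchanged between the two runs.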

airMeng

Tested with Meta-Llama-3-8B-Instruct-Q4_K_S.gguf and llama-2-7b.Q4_0.gguf.

airMeng · 2024-07-03T03:53:46Z

@joeatodd @OuadiElfarouki

NeoZhangJianyu

It passes on MTL in my testing.

qnixsynapse · 2024-07-03T06:24:11Z

Tested iq4_XS, Q4_K_S.

LGTM

luoyu-intel · 2024-07-03T06:29:12Z

@qnixsynapse Thanks for testing! Q4_K models still use WARP_SIZE=32, so they won't benefit from this PR.

qnixsynapse · 2024-07-03T06:32:29Z

@luoyu-intel Yes, I am aware. I am testing IQ4 models currently.

Alcpz · 2024-07-03T11:51:33Z

@joeatodd @OuadiElfarouki The SYCL branch shows no performance regressions with Q4_K on an NVIDIA A100.

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_K - Medium	3.80 GiB	6.74 B	SYCL	78	none	pp512	2203.66 ± 15.26
llama 13B Q4_K - Medium	7.33 GiB	13.02 B	SYCL	78	none	pp512	1720.49 ± 23.80
llama 70B Q4_K - Medium	38.58 GiB	68.98 B	SYCL	78	none	pp512	606.95 ± 5.49

build: 4887fdce (3293)

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_K - Medium	3.80 GiB	6.74 B	SYCL	81	none	tg128	5.36 ± 0.00
llama 13B Q4_K - Medium	7.33 GiB	13.02 B	SYCL	81	none	tg128	4.27 ± 0.00
llama 70B Q4_K - Medium	38.58 GiB	68.98 B	SYCL	81	none	tg128	2.08 ± 0.00

build: 4887fdce (3293)

[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (ggml-org#8266) cherry-pick b549a1b

* fix group_norm ut
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp

Fix issues in the above PR:

* fix norm() nullptr that led to a crash on iGPU
* use WARP_32_SIZE to replace QK_WARP_SIZE
* optimize dmmv.cpp for iGPU
* add sycl_hw.cpp to detect hardware info

github-actions bot added the testing, ggml, and SYCL labels Jul 3, 2024

airMeng requested a review from AidanBeltonS July 3, 2024 03:38

luoyu-intel force-pushed the sycl-acc branch from c9045c1 to 3bf8c2c Compare July 3, 2024 03:45

airMeng approved these changes Jul 3, 2024

Comment thread ggml/src/ggml-sycl/dmmv.cpp Outdated

NeoZhangJianyu reviewed Jul 3, 2024

Comment thread ggml/src/CMakeLists.txt Outdated

NeoZhangJianyu approved these changes Jul 3, 2024

mofosyne added the Review Complexity : Medium label Jul 3, 2024

luoyu-intel added 7 commits July 5, 2024 10:47

fix group_norm ut

c675aaf

split softmax

e50517b

fix softmax

d70305b

revert qx_k

0012f2c

add concat support condition

870b607

revert debug code

d7cf5f5

move QK_WARP_SIZE to presets.hpp

ac8a4bd

luoyu-intel force-pushed the sycl-acc branch from 4887fdc to ac8a4bd Compare July 5, 2024 02:48

rebase work_space api

87098db

airMeng merged commit a9554e2 into ggml-org:master Jul 5, 2024

luoyu-intel deleted the sycl-acc branch July 8, 2024 02:44

arthw mentioned this pull request Jul 13, 2024

[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266) cherry-pick b549a1bbefb2f1fbb8b558bac1f2ae7967e60964 arthw/llama.cpp#1

Merged

4 tasks

arthw added a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024

Merge pull request #1 from arthw/update_warp

aeaed61

[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (ggml-org#8266) cherry-pick b549a1b

airMeng merged 8 commits into ggml-org:master from luoyu-intel:sycl-acc

6 participants
