[SYCL] Support Flash Attention for fp32/fp16/Q4/Q5/Q8 #20190

Merged

NeoZhangJianyu merged 3 commits into ggml-org:master from arthw:supprt_flash_attention on Mar 8, 2026

Conversation

@arthw (Contributor) commented Mar 7, 2026

- Support Flash Attention for fp32/fp16/Q4/Q5/Q8.
- All supported Flash Attention UT cases pass.
- Flash Attention can be enabled or disabled via the environment variable GGML_SYCL_ENABLE_FLASH_ATTN (see the sketch after this list).
- Update the guide and add a "Design Rule" chapter.
- Reduce memory usage in most cases.
- Performance impact depends on the LLM: PP is reduced in most cases, while TG is increased in most cases.
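
For reference, a minimal sketch of how such an environment-variable switch can be read on the C++ side; the helper name `sycl_flash_attn_enabled` is hypothetical and not taken from this PR:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (illustration only): reads GGML_SYCL_ENABLE_FLASH_ATTN
// once, caches the result, and treats any value other than "0" as enabled.
static bool sycl_flash_attn_enabled() {
    static const bool enabled = [] {
        const char * env = std::getenv("GGML_SYCL_ENABLE_FLASH_ATTN");
        return env != nullptr && std::strcmp(env, "0") != 0;
    }();
    return enabled;
}
```

Whether the actual backend treats an unset variable as on or off is documented in the guide updated by this PR; the convention above is only an assumption.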

Todo:
performance optimization.

Tested on an Arc A770 and the iGPU of an i7-13700K.
Test results:

| Model on Intel(R) Arc(TM) A770 Graphics | PP t/s | PP Delta (%) | TG t/s | TG Delta (%) | Total Mem (M) | Mem Delta (M) |
|---|---|---|---|---|---|---|
| deepseek-moe-16b-chat.Q4_K_M.gguf | 16.08 | -37.77% | 16.88 | 2.68% | 11363 | 0 |
| DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf | 24.42 | -62.71% | 25.84 | 21.83% | 4940 | -38 |
| gpt-oss-20b-mxfp4.gguf | 26.35 | 4.40% | 10.04 | 8.19% | 9917 | -463 |
| gpt-oss-20b-Q4_0.gguf | 15.62 | -41.17% | 15.44 | 11.00% | 11157 | -150 |
| gpt-oss-20b-Q8_0.gguf | 15.64 | -41.34% | 15.11 | 10.70% | 11461 | -150 |
| granite-3.0-3b-a800m-instruct-Q4_K_L.gguf | 16.27 | -39.27% | 13.91 | 4.04% | 2297 | -114 |
| granite-4.0-h-micro-Q4_K_M.gguf | 22.48 | -62.95% | 12.89 | -0.77% | 2162 | -97 |
| llama-2-7b.Q4_0.gguf | 28.57 | -67.29% | 28.77 | 17.57% | 5718 | -236 |
| Llama3-TAIDE-LX-8B-Chat-Alpha1-Q4_0.gguf | 21.72 | -64.97% | 26.09 | 21.75% | 4926 | -38 |
| Lumimaid-v0.2-8B-Q6_K-imat.gguf | 24.21 | -61.95% | 19.46 | 16.39% | 6642 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf | 24.49 | -62.33% | 23.41 | 19.99% | 5173 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf | 24.20 | -61.22% | 10.17 | -10.00% | 6007 | -38 |
| Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | 24.42 | -62.79% | 23.51 | 19.95% | 5173 | -38 |
| Meta-Llama-3-8B.Q4_0.gguf | 24.34 | -62.88% | 26.12 | 21.77% | 4926 | -38 |
| Meta-Llama-3-8B.Q8_0.gguf | 23.68 | -60.34% | 14.64 | 12.10% | 8375 | -38 |
| Ministral-3-14B-Instruct-2512-UD-Q4_K_XL.gguf | 20.97 | -50.27% | 12.66 | 7.38% | 8517 | -46 |
| pythia-1.4b-Q4_0.gguf | 26.73 | -77.72% | 51.52 | 19.34% | 1601 | -58 |
| qwen2-1.5b-instruct-q4_0.gguf | 29.04 | -74.70% | 38.55 | 10.17% | 1300 | 3 |
| Qwen2.5.1-Coder-7B-Instruct-Q8_0.gguf | 22.94 | -60.33% | 13.69 | 0.59% | 7693 | 0 |
| Qwen2.5-Coder-3B-Instruct-abliterated-Q4_K_M.gguf | 26.39 | -69.70% | 27.31 | 10.17% | 2279 | 0 |
| qwen2-7b-instruct-q4_k_m.gguf | 23.62 | -62.00% | 21.99 | 0.87% | 4696 | 0 |
| Qwen3-14B-Q4_K_M.gguf | 18.03 | -52.96% | 13.09 | 8.72% | 9108 | -62 |
| Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf | 81.39 | 9.66% | 18.64 | 17.68% | 5694 | 0 |
| Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf | 9.76 | -27.11% | 5.99 | -13.44% | 9587 | 0 |
| Qwen3-8B-Q6_K.gguf | 23.89 | -60.92% | 17.24 | 16.17% | 6802 | 0 |
| Qwen_Qwen3.5-27B-Q2_K.gguf | 9.22 | -24.98% | 3.46 | -28.22% | 10145 | 0 |
| stories15M_MOE-F16.gguf | 219.72 | -19.50% | 147.07 | -2.01% | 41 | 0 |
| unsloth-Qwen2.5-3B-Instruct_dtype-bfloat16_r-8_lr-0.0002.Q4_0.gguf | 27.02 | -68.70% | 28.84 | 11.44% | 2177 | 0 |
| Model on Intel(R) UHD Graphics 770 | PP t/s | PP Delta (%) | TG t/s | TG Delta (%) | Total Mem (M) | Mem Delta (M) |
|---|---|---|---|---|---|---|
| deepseek-moe-16b-chat.Q4_K_M.gguf | 7.77 | -18.55% | 4.54 | 6.32% | 11363 | 0 |
| DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf | 6.05 | -24.94% | 5.33 | 7.24% | 4940 | -38 |
| gpt-oss-20b-mxfp4.gguf | 3.27 | -0.30% | 1.41 | 4.44% | 9917 | -463 |
| gpt-oss-20b-Q4_0.gguf | 2.95 | -7.81% | 1.98 | 2.59% | 11157 | -150 |
| gpt-oss-20b-Q8_0.gguf | 3.00 | -8.26% | 1.94 | 3.19% | 11461 | -150 |
| granite-3.0-3b-a800m-instruct-Q4_K_L.gguf | 12.62 | -30.81% | 6.11 | 19.57% | 2297 | -114 |
| granite-4.0-h-micro-Q4_K_M.gguf | 10.55 | -26.79% | 2.50 | 3.73% | 2162 | -97 |
| llama-2-7b.Q4_0.gguf | 7.21 | -19.35% | 6.65 | 17.08% | 5718 | -236 |
| Llama3-TAIDE-LX-8B-Chat-Alpha1-Q4_0.gguf | 6.03 | -13.61% | 5.60 | 7.90% | 4926 | -38 |
| Lumimaid-v0.2-8B-Q6_K-imat.gguf | 6.26 | -19.64% | 2.80 | 4.09% | 6642 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf | 6.81 | -21.00% | 3.24 | 4.52% | 5173 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf | 6.63 | -17.43% | 1.09 | 0.93% | 6007 | -38 |
| Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | 6.62 | -26.12% | 3.25 | 4.84% | 5173 | -38 |
| Meta-Llama-3-8B.Q4_0.gguf | 6.37 | -15.96% | 5.60 | 7.69% | 4926 | -38 |
| Meta-Llama-3-8B.Q8_0.gguf | 6.46 | -19.45% | 2.96 | 4.23% | 8375 | -38 |
| Ministral-3-14B-Instruct-2512-UD-Q4_K_XL.gguf | 4.62 | -14.29% | 1.43 | 2.14% | 8517 | -46 |
| pythia-1.4b-Q4_0.gguf | 18.36 | -51.36% | 23.91 | 35.54% | 1601 | -58 |
| qwen2-1.5b-instruct-q4_0.gguf | 19.01 | -50.31% | 18.02 | 1.12% | 1300 | 3 |
| Qwen2.5.1-Coder-7B-Instruct-Q8_0.gguf | 6.66 | -23.89% | 3.03 | -2.26% | 7693 | 0 |
| Qwen2.5-Coder-3B-Instruct-abliterated-Q4_K_M.gguf | 12.76 | -38.95% | 6.59 | 2.17% | 2279 | 0 |
| qwen2-7b-instruct-q4_k_m.gguf | 7.43 | -21.87% | 3.31 | -0.60% | 4696 | 0 |
| Qwen3-14B-Q4_K_M.gguf | 3.92 | -11.71% | 1.70 | -1.16% | 9108 | -62 |
| Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf | 12.99 | -10.91% | 5.08 | 7.17% | 5694 | 0 |
| Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf | 4.78 | 1.92% | 1.58 | 1.28% | 9587 | 0 |
| Qwen3-8B-Q6_K.gguf | 6.29 | -18.94% | 2.76 | 5.75% | 6802 | 0 |
| Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf | 5.73 | -17.32% | 2.77 | 2.21% | 17363 | 4 |
| Qwen3-Coder-Next-IQ4_XS.gguf | 2.69 | -2.18% | 1.71 | -1.16% | 41180 | 0 |
| Qwen3-Coder-Next-Q3_K_M.gguf | 2.50 | -7.75% | 1.80 | 0.00% | 36894 | 0 |
| Qwen_Qwen3-30B-A3B-Q4_K_M.gguf | 5.62 | -17.11% | 2.71 | 5.45% | 18285 | 4 |
| Qwen_Qwen3.5-27B-Q2_K.gguf | 1.80 | -4.26% | 0.58 | 0.00% | 10145 | 0 |
| stories15M_MOE-F16.gguf | 347.56 | 2.60% | 31.12 | -0.26% | 41 | 0 |
| unsloth-Qwen2.5-3B-Instruct_dtype-bfloat16_r-8_lr-0.0002.Q4_0.gguf | 12.35 | -28.86% | 11.23 | 5.25% | 2177 | 0 |

@github-actions bot added the labels documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) on Mar 7, 2026
@savvadesogle

Our HERO!!! 💪
Thank you Jianyu ❤️

@NeoZhangJianyu merged commit 213c4a0 into ggml-org:master on Mar 8, 2026
148 of 150 checks passed
@JohannesGaessler (Contributor) commented:

I didn't go through the code in detail but for documentation purposes it would probably make sense to mention somewhere that this code has been adapted from the ggml CUDA backend. It is very likely that I will make further changes to that code and contributors may be interested in taking them over.

@NeoZhangJianyu (Contributor) commented:

> I didn't go through the code in detail but for documentation purposes it would probably make sense to mention somewhere that this code has been adapted from the ggml CUDA backend. It is very likely that I will make further changes to that code and contributors may be interested in taking them over.

Yes.
The SYCL code is migrated from the CUDA backend.
Using the dpct tool, most of the CUDA code can be migrated to SYCL automatically.
Getting correct results still required a lot of debugging work.

dpct is included in the oneAPI Base Toolkit.
It's easy to use.

Thank you!
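
For readers unfamiliar with dpct: it mechanically maps CUDA launch and indexing constructs onto SYCL ones. Below is a minimal sketch (illustration only, not code from this PR) of the shape of that mapping for a trivial elementwise kernel:

```cpp
#include <sycl/sycl.hpp>

// CUDA original (for comparison):
//   __global__ void scale(float *x, float s, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) x[i] *= s;
//   }
//
// dpct-style SYCL translation: the grid/block launch becomes an nd_range,
// and blockIdx/blockDim/threadIdx become nd_item queries.
// x is assumed to be USM device memory (e.g. from sycl::malloc_device).
void scale(sycl::queue &q, float *x, float s, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(grid * block), sycl::range<1>(block)),
        [=](sycl::nd_item<1> it) {
            const int i = static_cast<int>(it.get_group(0) * it.get_local_range(0) +
                                           it.get_local_id(0));
            if (i < n) x[i] *= s;
        });
}
```

Real dpct output is more verbose (3D nd_items and dpct helper headers), and, as the comment above notes, getting numerically correct results after migration still requires manual debugging.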

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 10, 2026:

* support flash-attention for fp32/fp16/Q4/Q5/Q8
* rm warining
* update for JIT

The same commits were later picked up by:

electimon (electimon/llama.cpp) on Mar 19, 2026
Ethan-a2 (Ethan-a2/llama.cpp) on Mar 20, 2026
electimon (electimon/llama.cpp) on Mar 24, 2026
Seunghhon (Seunghhon/llama.cpp) on Apr 26, 2026
rsenthilkumar6 (rsenthilkumar6/llama.cpp) on May 1, 2026