[SYCL] Support Flash Attention for fp32/fp16/Q4/Q5/Q8 #20190

Merged

NeoZhangJianyu merged 3 commits into ggml-org:master from arthw:supprt_flash_attention on Mar 8, 2026

Conversation

@arthw (Contributor) commented Mar 7, 2026

- Support Flash Attention for fp32/fp16/Q4/Q5/Q8.
- All supported Flash Attention UT cases pass.
- Flash Attention can be enabled or disabled via the environment variable GGML_SYCL_ENABLE_FLASH_ATTN (see the sketch after this list).
- Update the guide and add a "Design Rule" chapter.
- Reduce memory usage in most cases.
- Performance impact depends on the LLM: PP is reduced in most cases, while TG is increased in most cases.
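
For reference, a minimal sketch of how such an environment-variable switch can be read on the C++ side; the helper name `sycl_flash_attn_enabled` is hypothetical and not taken from this PR:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (illustration only): reads GGML_SYCL_ENABLE_FLASH_ATTN
// once, caches the result, and treats any value other than "0" as enabled.
static bool sycl_flash_attn_enabled() {
    static const bool enabled = [] {
        const char * env = std::getenv("GGML_SYCL_ENABLE_FLASH_ATTN");
        return env != nullptr && std::strcmp(env, "0") != 0;
    }();
    return enabled;
}
```

Whether the actual backend treats an unset variable as on or off is documented in the guide updated by this PR; the convention above is only an assumption.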

Todo:
performance optimization.

Tested on an Arc A770 and the iGPU of an i7-13700K.
Test results:

| Model on Intel(R) Arc(TM) A770 Graphics | PP t/s | PP Delta (%) | TG t/s | TG Delta (%) | Total Mem (M) | Mem Delta (M) |
|---|---|---|---|---|---|---|
| deepseek-moe-16b-chat.Q4_K_M.gguf | 16.08 | -37.77% | 16.88 | 2.68% | 11363 | 0 |
| DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf | 24.42 | -62.71% | 25.84 | 21.83% | 4940 | -38 |
| gpt-oss-20b-mxfp4.gguf | 26.35 | 4.40% | 10.04 | 8.19% | 9917 | -463 |
| gpt-oss-20b-Q4_0.gguf | 15.62 | -41.17% | 15.44 | 11.00% | 11157 | -150 |
| gpt-oss-20b-Q8_0.gguf | 15.64 | -41.34% | 15.11 | 10.70% | 11461 | -150 |
| granite-3.0-3b-a800m-instruct-Q4_K_L.gguf | 16.27 | -39.27% | 13.91 | 4.04% | 2297 | -114 |
| granite-4.0-h-micro-Q4_K_M.gguf | 22.48 | -62.95% | 12.89 | -0.77% | 2162 | -97 |
| llama-2-7b.Q4_0.gguf | 28.57 | -67.29% | 28.77 | 17.57% | 5718 | -236 |
| Llama3-TAIDE-LX-8B-Chat-Alpha1-Q4_0.gguf | 21.72 | -64.97% | 26.09 | 21.75% | 4926 | -38 |
| Lumimaid-v0.2-8B-Q6_K-imat.gguf | 24.21 | -61.95% | 19.46 | 16.39% | 6642 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf | 24.49 | -62.33% | 23.41 | 19.99% | 5173 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf | 24.20 | -61.22% | 10.17 | -10.00% | 6007 | -38 |
| Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | 24.42 | -62.79% | 23.51 | 19.95% | 5173 | -38 |
| Meta-Llama-3-8B.Q4_0.gguf | 24.34 | -62.88% | 26.12 | 21.77% | 4926 | -38 |
| Meta-Llama-3-8B.Q8_0.gguf | 23.68 | -60.34% | 14.64 | 12.10% | 8375 | -38 |
| Ministral-3-14B-Instruct-2512-UD-Q4_K_XL.gguf | 20.97 | -50.27% | 12.66 | 7.38% | 8517 | -46 |
| pythia-1.4b-Q4_0.gguf | 26.73 | -77.72% | 51.52 | 19.34% | 1601 | -58 |
| qwen2-1.5b-instruct-q4_0.gguf | 29.04 | -74.70% | 38.55 | 10.17% | 1300 | 3 |
| Qwen2.5.1-Coder-7B-Instruct-Q8_0.gguf | 22.94 | -60.33% | 13.69 | 0.59% | 7693 | 0 |
| Qwen2.5-Coder-3B-Instruct-abliterated-Q4_K_M.gguf | 26.39 | -69.70% | 27.31 | 10.17% | 2279 | 0 |
| qwen2-7b-instruct-q4_k_m.gguf | 23.62 | -62.00% | 21.99 | 0.87% | 4696 | 0 |
| Qwen3-14B-Q4_K_M.gguf | 18.03 | -52.96% | 13.09 | 8.72% | 9108 | -62 |
| Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf | 81.39 | 9.66% | 18.64 | 17.68% | 5694 | 0 |
| Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf | 9.76 | -27.11% | 5.99 | -13.44% | 9587 | 0 |
| Qwen3-8B-Q6_K.gguf | 23.89 | -60.92% | 17.24 | 16.17% | 6802 | 0 |
| Qwen_Qwen3.5-27B-Q2_K.gguf | 9.22 | -24.98% | 3.46 | -28.22% | 10145 | 0 |
| stories15M_MOE-F16.gguf | 219.72 | -19.50% | 147.07 | -2.01% | 41 | 0 |
| unsloth-Qwen2.5-3B-Instruct_dtype-bfloat16_r-8_lr-0.0002.Q4_0.gguf | 27.02 | -68.70% | 28.84 | 11.44% | 2177 | 0 |
| Model on Intel(R) UHD Graphics 770 | PP t/s | PP Delta (%) | TG t/s | TG Delta (%) | Total Mem (M) | Mem Delta (M) |
|---|---|---|---|---|---|---|
| deepseek-moe-16b-chat.Q4_K_M.gguf | 7.77 | -18.55% | 4.54 | 6.32% | 11363 | 0 |
| DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf | 6.05 | -24.94% | 5.33 | 7.24% | 4940 | -38 |
| gpt-oss-20b-mxfp4.gguf | 3.27 | -0.30% | 1.41 | 4.44% | 9917 | -463 |
| gpt-oss-20b-Q4_0.gguf | 2.95 | -7.81% | 1.98 | 2.59% | 11157 | -150 |
| gpt-oss-20b-Q8_0.gguf | 3.00 | -8.26% | 1.94 | 3.19% | 11461 | -150 |
| granite-3.0-3b-a800m-instruct-Q4_K_L.gguf | 12.62 | -30.81% | 6.11 | 19.57% | 2297 | -114 |
| granite-4.0-h-micro-Q4_K_M.gguf | 10.55 | -26.79% | 2.50 | 3.73% | 2162 | -97 |
| llama-2-7b.Q4_0.gguf | 7.21 | -19.35% | 6.65 | 17.08% | 5718 | -236 |
| Llama3-TAIDE-LX-8B-Chat-Alpha1-Q4_0.gguf | 6.03 | -13.61% | 5.60 | 7.90% | 4926 | -38 |
| Lumimaid-v0.2-8B-Q6_K-imat.gguf | 6.26 | -19.64% | 2.80 | 4.09% | 6642 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf | 6.81 | -21.00% | 3.24 | 4.52% | 5173 | -38 |
| Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf | 6.63 | -17.43% | 1.09 | 0.93% | 6007 | -38 |
| Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | 6.62 | -26.12% | 3.25 | 4.84% | 5173 | -38 |
| Meta-Llama-3-8B.Q4_0.gguf | 6.37 | -15.96% | 5.60 | 7.69% | 4926 | -38 |
| Meta-Llama-3-8B.Q8_0.gguf | 6.46 | -19.45% | 2.96 | 4.23% | 8375 | -38 |
| Ministral-3-14B-Instruct-2512-UD-Q4_K_XL.gguf | 4.62 | -14.29% | 1.43 | 2.14% | 8517 | -46 |
| pythia-1.4b-Q4_0.gguf | 18.36 | -51.36% | 23.91 | 35.54% | 1601 | -58 |
| qwen2-1.5b-instruct-q4_0.gguf | 19.01 | -50.31% | 18.02 | 1.12% | 1300 | 3 |
| Qwen2.5.1-Coder-7B-Instruct-Q8_0.gguf | 6.66 | -23.89% | 3.03 | -2.26% | 7693 | 0 |
| Qwen2.5-Coder-3B-Instruct-abliterated-Q4_K_M.gguf | 12.76 | -38.95% | 6.59 | 2.17% | 2279 | 0 |
| qwen2-7b-instruct-q4_k_m.gguf | 7.43 | -21.87% | 3.31 | -0.60% | 4696 | 0 |
| Qwen3-14B-Q4_K_M.gguf | 3.92 | -11.71% | 1.70 | -1.16% | 9108 | -62 |
| Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf | 12.99 | -10.91% | 5.08 | 7.17% | 5694 | 0 |
| Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf | 4.78 | 1.92% | 1.58 | 1.28% | 9587 | 0 |
| Qwen3-8B-Q6_K.gguf | 6.29 | -18.94% | 2.76 | 5.75% | 6802 | 0 |
| Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf | 5.73 | -17.32% | 2.77 | 2.21% | 17363 | 4 |
| Qwen3-Coder-Next-IQ4_XS.gguf | 2.69 | -2.18% | 1.71 | -1.16% | 41180 | 0 |
| Qwen3-Coder-Next-Q3_K_M.gguf | 2.50 | -7.75% | 1.80 | 0.00% | 36894 | 0 |
| Qwen_Qwen3-30B-A3B-Q4_K_M.gguf | 5.62 | -17.11% | 2.71 | 5.45% | 18285 | 4 |
| Qwen_Qwen3.5-27B-Q2_K.gguf | 1.80 | -4.26% | 0.58 | 0.00% | 10145 | 0 |
| stories15M_MOE-F16.gguf | 347.56 | 2.60% | 31.12 | -0.26% | 41 | 0 |
| unsloth-Qwen2.5-3B-Instruct_dtype-bfloat16_r-8_lr-0.0002.Q4_0.gguf | 12.35 | -28.86% | 11.23 | 5.25% | 2177 | 0 |

@github-actions bot added the labels documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) on Mar 7, 2026
@savvadesogle

Our HERO!!! 💪
Thank you Jianyu ❤️

@NeoZhangJianyu merged commit 213c4a0 into ggml-org:master on Mar 8, 2026
148 of 150 checks passed
@JohannesGaessler (Contributor) commented:

I didn't go through the code in detail but for documentation purposes it would probably make sense to mention somewhere that this code has been adapted from the ggml CUDA backend. It is very likely that I will make further changes to that code and contributors may be interested in taking them over.

@NeoZhangJianyu (Contributor) commented:

> I didn't go through the code in detail but for documentation purposes it would probably make sense to mention somewhere that this code has been adapted from the ggml CUDA backend. It is very likely that I will make further changes to that code and contributors may be interested in taking them over.

Yes.
The SYCL code is migrated from the CUDA backend.
Using the dpct tool, most of the CUDA code can be migrated to SYCL automatically.
Getting correct results still required a lot of debugging work.

dpct is included in the oneAPI Base Toolkit.
It's easy to use.

Thank you!
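
For readers unfamiliar with dpct: it mechanically maps CUDA launch and indexing constructs onto SYCL ones. Below is a minimal sketch (illustration only, not code from this PR) of the shape of that mapping for a trivial elementwise kernel:

```cpp
#include <sycl/sycl.hpp>

// CUDA original (for comparison):
//   __global__ void scale(float *x, float s, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) x[i] *= s;
//   }
//
// dpct-style SYCL translation: the grid/block launch becomes an nd_range,
// and blockIdx/blockDim/threadIdx become nd_item queries.
// x is assumed to be USM device memory (e.g. from sycl::malloc_device).
void scale(sycl::queue &q, float *x, float s, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(grid * block), sycl::range<1>(block)),
        [=](sycl::nd_item<1> it) {
            const int i = static_cast<int>(it.get_group(0) * it.get_local_range(0) +
                                           it.get_local_id(0));
            if (i < n) x[i] *= s;
        });
}
```

Real dpct output is more verbose (3D nd_items and dpct helper headers), and, as the comment above notes, getting numerically correct results after migration still requires manual debugging.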

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 10, 2026:

* support flash-attention for fp32/fp16/Q4/Q5/Q8
* rm warining
* update for JIT

The same commits were later picked up by:

electimon (electimon/llama.cpp) on Mar 19, 2026
Ethan-a2 (Ethan-a2/llama.cpp) on Mar 20, 2026
electimon (electimon/llama.cpp) on Mar 24, 2026
Seunghhon (Seunghhon/llama.cpp) on Apr 26, 2026
rsenthilkumar6 (rsenthilkumar6/llama.cpp) on May 1, 2026