[SYCL] Support Flash Attention for fp32/fp16/Q4/Q5/Q8 #20190
NeoZhangJianyu merged 3 commits into ggml-org:master
Conversation
Our HERO!!! 💪
I didn't go through the code in detail, but for documentation purposes it would probably make sense to mention somewhere that this code has been adapted from the ggml CUDA backend. It is very likely that I will make further changes to that code, and contributors may be interested in taking them over.
Yes. dpct is included in the oneAPI Base Toolkit. Thank you!
* support flash-attention for fp32/fp16/Q4/Q5/Q8
* rm warning
* update for JIT
Support Flash Attention for fp32/fp16/Q4/Q5/Q8.
All supported Flash Attention UT cases pass.
Support enabling/disabling Flash Attention via the environment variable GGML_SYCL_ENABLE_FLASH_ATTN (see the sketch below).
Update the guide. Add a "Design Rule" chapter.
Reduce memory usage in most cases.
Performance impact depends on the LLM: PP is reduced in most cases; TG is increased in most cases.
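A toggle like this would typically be read once when the backend initializes. The following is a minimal sketch, not this PR's actual implementation: the helper name and the "unset or 0 means disabled" convention are assumptions for illustration.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: gate the Flash Attention path on the
// GGML_SYCL_ENABLE_FLASH_ATTN environment variable.
static bool sycl_flash_attn_enabled() {
    const char *env = std::getenv("GGML_SYCL_ENABLE_FLASH_ATTN");
    // Treat an unset variable or the value "0" as disabled; any other
    // value selects the Flash Attention kernels instead of the regular
    // attention path.
    return env != nullptr && std::strcmp(env, "0") != 0;
}
```

With a check like this, a user could set the variable (for example, to 1) before running to opt in, or leave it unset to keep the previous behavior; the exact convention used by the PR may differ.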
Todo:
performance optimization.
Tested on an Arc A770 and the iGPU of an i7-13700K.
Test result: