ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
taronaeo merged 3 commits into ggml-org:master
Conversation
I was thinking of creating a backend with https://github.com/amd/blis (with FBGEMM), but ZenDNN works too.
Can you also include the benchmark results from #17684 in this PR?
@taronaeo Updated the PR description with the benchmark results.
@Djip007 Thanks! AMD BLIS is actually what ZenDNN uses under the hood.
taronaeo
left a comment
General implementation looks good. Just needs fixing of the unnecessary enum declarations.
You should also look into supporting GGML_OP_MUL_MAT_ID for MoE, but that can probably come in a follow-up PR building on this one.
For quantised model support, you can disable the following line:
/* .buffer_from_host_ptr = */ true, // set to false
and weight tensors will then go through .set_tensor(), where you can manually upscale them to either BF16 or FP32 before running the same matmul calculations. I'm quite interested to see if you'll still get a performance boost though :)
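To illustrate the upscaling idea being discussed, here is a minimal sketch of dequantizing a Q8_0-style block to F32. Note this uses a simplified stand-in struct: ggml's real block_q8_0 stores the per-block scale as fp16, while this sketch uses a plain float for brevity, and the type and function names are hypothetical, not ggml's.

```c
#include <stddef.h>
#include <assert.h>

/* Simplified Q8_0-style block: one scale per 32 int8 quants.
 * (ggml's real block_q8_0 stores the scale as fp16; a float is used
 * here for brevity -- this is an illustrative sketch, not ggml code.) */
#define QK8_0 32
typedef struct {
    float       d;          /* per-block scale                */
    signed char qs[QK8_0];  /* quantized values               */
} block_q8_0_f;

/* Upscale nblocks of Q8_0-style data to F32: y[i] = d * qs[i].
 * A backend's set_tensor() hook could run this once at load time
 * and then reuse its existing F32/BF16 matmul path. */
void dequantize_q8_0_f(const block_q8_0_f *x, float *y, size_t nblocks) {
    for (size_t b = 0; b < nblocks; ++b) {
        for (int i = 0; i < QK8_0; ++i) {
            y[b * QK8_0 + i] = x[b].d * (float) x[b].qs[i];
        }
    }
}
```

The trade-off hinted at in the review: this spends extra memory (F32 weights instead of int8) in exchange for reusing the fast ZenDNN matmul path.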
Thanks @taronaeo for the review. For MoE support, I will add it in a follow-up PR after this merges. Quantized model support via the upscaling approach may not be needed, since the ZenDNN team is also working on native quantized support.
taronaeo
left a comment
LGTM. Just minor changes to the docs and rebase your branch with upstream/master to fix the ops.md conflicts :)
Merge on green :)
I don't think set… But as I see, force call to… But if you have another way to do it, I'd be happy to know; it will help me on other backends/extras.
@taronaeo @ggerganov Resolved the conflicts (2nd time). Could we merge once CI is green, to avoid a third round (haha)?
Sorry, a little busy today. Just started the CI. Will check in after approx. an hour to push if green :)
@taronaeo Now CI is green, let's merge this! :)
Failing CI tests do not seem related to this PR, and the same failure(s) can be observed across other PRs as well. Merging PR.
* ggml-zennn: add ZenDNN backend support
* ggml-zendnn : address ZenDNN backend review fixes and suggestions
* docs : apply blockquote syntax to ZenDNN docs
---------
Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
Does this also give speedups with quantized models such as Q8_0, K-quants, and IQ-quants?
No, the current implementation in this PR only declares support for F32 and BF16. See: llama.cpp/ggml/src/ggml-zendnn/ggml-zendnn.cpp, lines 374 to 379 at 2257758.
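The referenced lines are not quoted here, but the kind of type gating a backend's supports-op hook performs can be sketched as follows. The enum and function below are stand-ins for illustration only, not ggml's actual definitions or the PR's actual code.

```c
#include <stdbool.h>

/* Stand-in tensor-type tags (NOT ggml's actual enum values); used only
 * to illustrate the type gating the referenced lines perform. */
typedef enum { T_F32, T_F16, T_BF16, T_Q8_0 } tensor_type_t;

/* A backend that only handles F32 and BF16 matmuls, as this PR does,
 * would reject every other tensor type in its supports-op hook, so
 * those ops fall back to the default CPU implementation. */
bool zendnn_supports_type(tensor_type_t t) {
    return t == T_F32 || t == T_BF16;
}
```

Ops rejected here are not errors; ggml simply routes them to another backend.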
This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.
Background
ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.
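For reference, the semantics a MatMul primitive like LowOHA accelerates are those of a plain dense matrix multiply. The sketch below is a naive F32 reference, not ZenDNN code: real implementations block, vectorize, and thread this loop nest.

```c
#include <stddef.h>

/* Reference F32 matmul: C[MxN] = A[MxK] * B[KxN], row-major.
 * This is the computation a backend MatMul primitive replaces;
 * the naive triple loop is shown only to pin down the semantics. */
void matmul_f32_ref(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
```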
Changes
Backend implementation:
- ggml/src/ggml-zendnn/ : GGML_OP_MUL_MAT acceleration using ZenDNN primitives
Build system:
- -DGGML_ZENDNN=ON
- -DGGML_ZENDNN_PATH=/path/to/zendnn
Documentation:
- docs/backend/ZenDNN.md
- docs/build.md
Hardware Support
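Putting the build flags above together, a build might look like the following (the ZenDNN install path is a placeholder; consult docs/backend/ZenDNN.md for the authoritative steps):

```shell
# Flags from this PR; replace the placeholder path with your ZenDNN install.
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn
cmake --build build --config Release -j
```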
Performance Notes
- export ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS backend)
Testing
Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).
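A run with the recommended algorithm setting might look like the following (the model path is a placeholder; ZENDNNL_MATMUL_ALGO=2 selects the Blocked AOCL BLIS path, per the Performance Notes above):

```shell
# Select the Blocked AOCL BLIS MatMul path, then run as usual.
export ZENDNNL_MATMUL_ALGO=2
./build/bin/llama-cli -m /path/to/model-bf16.gguf -p "Hello" -n 64
```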
Performance Results
Test Configuration
- ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS)
Benchmark Results
LLaMA 3.1 8B (BF16)
LLaMA 3.1 8B (F32)
Qwen2 7B (BF16)
Qwen2 7B (F32)
LLaMA 2 7B (BF16)
LLaMA 2 7B (F32)
LLaMA 2 13B (BF16)
LLaMA 2 13B (F32)
Mixtral 8x7B (BF16)
Key Observations:
Related
AI usage disclosure: AI assistance was used for documentation writing, formatting and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually. The core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.