ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
taronaeo merged 3 commits into ggml-org:master
Conversation
I was thinking of creating a backend with https://github.com/amd/blis (with FBGEMM), but ZenDNN works too.
Can you also include the benchmark results from #17684 in this PR?
@taronaeo Updated the PR description with the benchmark results.
@Djip007 Thanks! AMD BLIS is actually what ZenDNN uses under the hood.
taronaeo
left a comment
General implementation looks good. Just needs fixing of the unnecessary enum declarations.
You should also look into supporting GGML_OP_MUL_MAT_ID for MoE, but that can probably come in a follow-up PR building on this one.
For quantised model support, you can disable the following line:
/* .buffer_from_host_ptr = */ true, // set to false
and weight tensors will then go through .set_tensor(), where you can manually upscale them to either BF16 or FP32 before running the same matmul calculations. I'm quite interested to see if you'll still get a performance boost though :)
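To illustrate the upscaling idea being discussed, here is a minimal sketch of dequantizing a Q8_0-style block to F32. Note this uses a simplified stand-in struct: ggml's real block_q8_0 stores the per-block scale as fp16, while this sketch uses a plain float for brevity, and the type and function names are hypothetical, not ggml's.

```c
#include <stddef.h>
#include <assert.h>

/* Simplified Q8_0-style block: one scale per 32 int8 quants.
 * (ggml's real block_q8_0 stores the scale as fp16; a float is used
 * here for brevity -- this is an illustrative sketch, not ggml code.) */
#define QK8_0 32
typedef struct {
    float       d;          /* per-block scale                */
    signed char qs[QK8_0];  /* quantized values               */
} block_q8_0_f;

/* Upscale nblocks of Q8_0-style data to F32: y[i] = d * qs[i].
 * A backend's set_tensor() hook could run this once at load time
 * and then reuse its existing F32/BF16 matmul path. */
void dequantize_q8_0_f(const block_q8_0_f *x, float *y, size_t nblocks) {
    for (size_t b = 0; b < nblocks; ++b) {
        for (int i = 0; i < QK8_0; ++i) {
            y[b * QK8_0 + i] = x[b].d * (float) x[b].qs[i];
        }
    }
}
```

The trade-off hinted at in the review: this spends extra memory (F32 weights instead of int8) in exchange for reusing the fast ZenDNN matmul path.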
Thanks @taronaeo for the review. For MoE support, I will add it in a follow-up PR after this merges. Quantized model support via the upscaling approach may not be needed, since the ZenDNN team is also working on native quantized support.
taronaeo
left a comment
LGTM. Just minor changes to the docs and rebase your branch with upstream/master to fix the ops.md conflicts :)
Merge on green :)
I don't think set… But as I see, force call to… But if you have another way to do it, I'd be happy to know; it will help me on other backends/extras.
@taronaeo @ggerganov Resolved the conflicts (2nd time). Could we merge once CI is green, to avoid a third round (haha)?
Sorry, a little busy today. Just started the CI. Will check in after approx. an hour to push if green :)
@taronaeo Now CI is green, let's merge this! :)
Failing CI tests do not seem related to this PR, and the same failure(s) can be observed across other PRs as well. Merging PR.
* ggml-zennn: add ZenDNN backend support
* ggml-zendnn : address ZenDNN backend review fixes and suggestions
* docs : apply blockquote syntax to ZenDNN docs
---------
Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
Does this also give speedups with quantized models such as Q8_0, K-quants, and IQ-quants?
No, the current implementation in this PR only declares support for F32 and BF16. See: llama.cpp/ggml/src/ggml-zendnn/ggml-zendnn.cpp, lines 374 to 379 at 2257758.
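The referenced lines are not quoted here, but the kind of type gating a backend's supports-op hook performs can be sketched as follows. The enum and function below are stand-ins for illustration only, not ggml's actual definitions or the PR's actual code.

```c
#include <stdbool.h>

/* Stand-in tensor-type tags (NOT ggml's actual enum values); used only
 * to illustrate the type gating the referenced lines perform. */
typedef enum { T_F32, T_F16, T_BF16, T_Q8_0 } tensor_type_t;

/* A backend that only handles F32 and BF16 matmuls, as this PR does,
 * would reject every other tensor type in its supports-op hook, so
 * those ops fall back to the default CPU implementation. */
bool zendnn_supports_type(tensor_type_t t) {
    return t == T_F32 || t == T_BF16;
}
```

Ops rejected here are not errors; ggml simply routes them to another backend.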
This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.
Background
ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.
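For reference, the semantics a MatMul primitive like LowOHA accelerates are those of a plain dense matrix multiply. The sketch below is a naive F32 reference, not ZenDNN code: real implementations block, vectorize, and thread this loop nest.

```c
#include <stddef.h>

/* Reference F32 matmul: C[MxN] = A[MxK] * B[KxN], row-major.
 * This is the computation a backend MatMul primitive replaces;
 * the naive triple loop is shown only to pin down the semantics. */
void matmul_f32_ref(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
```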
Changes
Backend implementation:
- ggml/src/ggml-zendnn/ : GGML_OP_MUL_MAT acceleration using ZenDNN primitives
Build system:
- -DGGML_ZENDNN=ON
- -DGGML_ZENDNN_PATH=/path/to/zendnn
Documentation:
- docs/backend/ZenDNN.md
- docs/build.md
Hardware Support
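Putting the build flags above together, a build might look like the following (the ZenDNN install path is a placeholder; consult docs/backend/ZenDNN.md for the authoritative steps):

```shell
# Flags from this PR; replace the placeholder path with your ZenDNN install.
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn
cmake --build build --config Release -j
```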
Performance Notes
- export ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS backend)
Testing
Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).
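A run with the recommended algorithm setting might look like the following (the model path is a placeholder; ZENDNNL_MATMUL_ALGO=2 selects the Blocked AOCL BLIS path, per the Performance Notes above):

```shell
# Select the Blocked AOCL BLIS MatMul path, then run as usual.
export ZENDNNL_MATMUL_ALGO=2
./build/bin/llama-cli -m /path/to/model-bf16.gguf -p "Hello" -n 64
```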
Performance Results
Test Configuration
- ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS)
Benchmark Results
LLaMA 3.1 8B (BF16)
LLaMA 3.1 8B (F32)
Qwen2 7B (BF16)
Qwen2 7B (F32)
LLaMA 2 7B (BF16)
LLaMA 2 7B (F32)
LLaMA 2 13B (BF16)
LLaMA 2 13B (F32)
Mixtral 8x7B (BF16)
Key Observations:
Related
AI usage disclosure: AI assistance was used for documentation writing, formatting and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually. The core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.