ggml-zdnn: rm user mapped buffers by taronaeo · Pull Request #15965 · ggml-org/llama.cpp

taronaeo · 2025-09-13T14:18:00Z

This PR removes the user-mapped buffers so that weight tensors go through `.set_tensor`. This lets the zDNN backend create the zTensor objects in `tensor->extra` once during setup instead of on the fly, which would cost performance.

It also fixes a missing free for the extra buffers used for things like bias.
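As a rough illustration of the idea (not the actual ggml-zdnn code), the sketch below prepares an accelerator-side copy once when data is uploaded, reuses it on every compute call, and releases all such copies when the buffer is freed. Every name here (`FakeZTensor`, `FakeTensor`, `FakeBuffer`, `fake_set_tensor`, `fake_dot`, `fake_free_buffer`) is hypothetical, and the "transform" is a plain copy standing in for whatever layout conversion zDNN performs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a zDNN zTensor: the tensor data in an
// accelerator-friendly layout. The real zDNN descriptors are not modelled.
struct FakeZTensor {
    std::vector<float> transformed;
};

// Hypothetical stand-in for a ggml tensor with an `extra` pointer.
struct FakeTensor {
    std::vector<float> data;        // plain host copy
    FakeZTensor * extra = nullptr;  // plays the role of tensor->extra
};

// Hypothetical backend buffer that owns every extra it creates.
struct FakeBuffer {
    std::vector<std::unique_ptr<FakeZTensor>> extras;
    int transforms = 0;  // counts how often the (placeholder) transform ran
};

// Upload path: copy the data and build the accelerator-side copy once.
void fake_set_tensor(FakeBuffer & buf, FakeTensor & t, const float * src, std::size_t n) {
    t.data.assign(src, src + n);
    if (t.extra == nullptr) {
        buf.extras.push_back(std::make_unique<FakeZTensor>());
        t.extra = buf.extras.back().get();
    }
    t.extra->transformed = t.data;  // placeholder for a layout transform
    buf.transforms++;
}

// Compute path: reuses the prepared copy, so no transform happens here.
float fake_dot(const FakeTensor & w, const std::vector<float> & x) {
    assert(w.extra != nullptr && x.size() <= w.extra->transformed.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += w.extra->transformed[i] * x[i];
    }
    return acc;
}

// Free path: releases every extra the buffer owns. Tensors that still point
// at them must not be used afterwards.
void fake_free_buffer(FakeBuffer & buf) {
    buf.extras.clear();
}
```

Calling `fake_set_tensor` once and `fake_dot` many times leaves `transforms` at 1, which is the behaviour the PR is aiming for with weights. Having the buffer own the extras means a single free call covers all of them, including ones created for auxiliary data.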

Performance

| model | size | params | backend | threads | test | t/s master | t/s PR | speedup |
| --- | ---: | ---: | --- | ---: | --- | ---: | ---: | ---: |
| granite 3B all F32 | 9.44 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 215.82 | 214.32 | 0.99 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.95 | 4.95 | 1.00 |
| granite 3B F16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 213.83 | 218.44 | 1.02 |
| granite 3B F16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.90 | 4.81 | 0.98 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 213.85 | 216.12 | 1.01 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.79 | 4.66 | 0.97 |
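
The speedup column appears to be the PR throughput divided by the master throughput, rounded to two decimals; the listed values are consistent with that reading. A minimal sketch of that arithmetic, where the function name `speedup` is an assumption made for illustration:

```cpp
#include <cmath>

// Ratio of PR throughput to master throughput (both in tokens per second),
// rounded to two decimal places. Values above 1.00 mean the PR is faster.
double speedup(double master_tps, double pr_tps) {
    return std::round(pr_tps / master_tps * 100.0) / 100.0;
}
```

For the F16 pp512 row, 218.44 / 213.83 rounds to 1.02. All six ratios fall between 0.97 and 1.02.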

Note

Tests were conducted on an IBM z17 Mainframe with 40 IFLs (cores) and 128 GB Memory on a shared R&D LPAR.

`test-backend-ops`

```
$ build/bin/test-backend-ops -b zDNN | grep -v "not supported"
ggml_zdnn_init: allocating
ggml_zdnn_init: found 1 device
ggml_zdnn_init: picking default device: zDNN
ggml_zdnn_init: NNPA name: zDNN
ggml_zdnn_init: NNPA_PARMBLKFORMAT_0 = true
ggml_zdnn_init: NNPA_PARMBLKFORMAT_1 = true
Testing 3 devices

Backend 1/3: zDNN
  Device description: IBM Z Neural Network Processing Assist (NNPA)
  Device memory: 0 MB (0 MB free)

  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=1,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
ggml_zdnn_free: deallocating
  14471/14471 tests passed
  Backend zDNN: OK
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
```

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

taronaeo · 2025-09-13T19:08:43Z

The failing xcode and macOS CI jobs do not seem to be related to this PR. I'll merge in the morning if all is good.

* ggml-zdnn: rm user mapped buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm dead code

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix missing extra data buffer free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

taronaeo added 3 commits September 13, 2025 21:37

ggml-zdnn: rm user mapped buffers

7647063

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: rm dead code

d01c592

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: attempt to fix missing extra data buffer free

4d08a0b

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

github-actions bot added the labels "ggml" (changes relating to the ggml tensor library for machine learning) and "IBM zDNN" (issues specific to IBM zDNN Accelerator) on Sep 13, 2025

slaren approved these changes Sep 13, 2025

taronaeo merged commit 6380d6a into ggml-org:master Sep 14, 2025
85 of 89 checks passed

taronaeo merged 3 commits into ggml-org:master from taronaeo:refactor/rm-zdnn-user-mapped-buffers
