ggml : refactor forward_dup for cpu backend by ngxson · Pull Request #16062 · ggml-org/llama.cpp

ngxson · 2025-09-18T02:09:36Z

Refactor this code using C++ template functions.

I tested this code by running test-backend-ops against CPU <--> Metal/CUDA/Vulkan, but the tests may miss some cases. It would be nice if you could take a deeper look, thanks!
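To illustrate the general idea of replacing per-type copy routines with one template, here is a minimal sketch. It is not the code from this PR: `convert_elem` and `dup_2d` are hypothetical names, the conversion is a plain `static_cast`, and the byte strides `nb0`/`nb1` only mimic the way ggml describes tensor layouts. Quantized types and multi-threading are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: converts a single element from src_t to dst_t.
// Specializations could be added for types that need special handling.
template <typename src_t, typename dst_t>
static inline dst_t convert_elem(src_t x) {
    return static_cast<dst_t>(x);
}

// Hypothetical strided 2D copy: reads ne0 x ne1 elements of src_t using
// byte strides nb0 (within a row) and nb1 (between rows), and writes them
// contiguously to dst as dst_t.
template <typename src_t, typename dst_t>
void dup_2d(const void * src, void * dst,
            size_t ne0, size_t ne1, size_t nb0, size_t nb1) {
    assert(src != nullptr && dst != nullptr);
    const char * s = static_cast<const char *>(src);
    dst_t      * d = static_cast<dst_t *>(dst);
    for (size_t i1 = 0; i1 < ne1; ++i1) {
        for (size_t i0 = 0; i0 < ne0; ++i0) {
            src_t v;
            std::memcpy(&v, s + i1*nb1 + i0*nb0, sizeof(v));
            d[i1*ne0 + i0] = convert_elem<src_t, dst_t>(v);
        }
    }
}
```

With this shape, a single function body covers every (source type, destination type) pair that `convert_elem` supports, and a permuted source is handled by passing different strides instead of writing a separate loop.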

ggml : refactor forward_dup for cpu backend

ngxson · 2025-09-18T07:22:19Z

I did a perf test between master and the PR. While the results fluctuate quite a lot from one run to another, I can see that the peak performance stays the same:

master:
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                10021 runs -   101.79 us/run -     9216 kB/run -   86.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 1290 runs -   878.65 us/run -    65536 kB/run -   71.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 3762 runs -   298.64 us/run -    24576 kB/run -   78.60 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 900 runs -  1464.49 us/run -    37376 kB/run -   24.43 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                2025 runs -   527.50 us/run -    37376 kB/run -   67.61 GB/s


PR:
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                12754 runs -    93.92 us/run -     9216 kB/run -   93.65 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 1806 runs -   570.91 us/run -    65536 kB/run -  109.90 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 3420 runs -   303.12 us/run -    24576 kB/run -   77.43 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 675 runs -  1552.52 us/run -    37376 kB/run -   23.05 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                1350 runs -   743.66 us/run -    37376 kB/run -   47.96 GB/s
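For readers who want to compare the two listings case by case, the sketch below computes the master/PR time ratio from the `us/run` values quoted above. The struct and function names (`cpy_timing`, `speedup`, `quoted_timings`) and the short case labels are made up for this example. Each number comes from a single run that, as noted above, fluctuates considerably, so the ratios should be read as rough indications only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: one CPY case with its us/run on master and on the PR.
struct cpy_timing {
    const char * name;
    double us_master;
    double us_pr;
};

// Ratio > 1 means the PR run was faster than the master run for that case.
inline double speedup(const cpy_timing & t) {
    assert(t.us_pr > 0.0);
    return t.us_master / t.us_pr;
}

// us/run values copied from the two listings above.
inline std::vector<cpy_timing> quoted_timings() {
    return {
        { "f32->f16  [512,3072,1,1]",           101.79,   93.92 },
        { "f32->f32  [8192,512,2,1] permuted",  878.65,  570.91 },
        { "f32->f32  [3072,512,2,1] permuted",  298.64,  303.12 },
        { "f32->q4_0 [8192,512,2,1]",          1464.49, 1552.52 },
        { "q4_0->f32 [8192,512,2,1]",           527.50,  743.66 },
    };
}
```

Calling `speedup` on each entry gives a ratio of roughly 1.54 for the permuted `[8192,512,2,1]` f32 copy and roughly 0.71 for the q4_0 to f32 case in these particular runs.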

* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test

ngxson and others added 3 commits September 14, 2025 09:00

ggml : refactor forward_dup for cpu backend

06dda63

clean up a bit

a31ba15

Merge pull request #30 from xuanson2025/xsn/refactor_cpu_dup_op

59ca217

ggml : refactor forward_dup for cpu backend

ngxson requested review from ggerganov and slaren September 18, 2025 02:09

github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Sep 18, 2025

add quant/dequant perf test

1d36311

github-actions Bot added the testing Everything test related label Sep 18, 2025

ggerganov approved these changes Sep 18, 2025


ngxson merged commit 0dd58b6 into ggml-org:master Sep 19, 2025
54 of 55 checks passed

pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025

ggml : refactor forward_dup for cpu backend (ggml-org#16062)

76718de

* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

ggml : refactor forward_dup for cpu backend (#16062)

348c94b

* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

ggml : refactor forward_dup for cpu backend (ggml-org#16062)

af0dc80

* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test

ngxson merged 4 commits into ggml-org:master from ngxson:xsn/refactor_cpu_dup_op
