Multi-backend profiler#21138

Closed
pwilkin wants to merge 4 commits into ggml-org:master from pwilkin:cool-profiler-thingy

Conversation

@pwilkin
Member

@pwilkin pwilkin commented Mar 29, 2026

Overview

A picture says more than a thousand words, so here's a picture:
[image: interactive profiler timeline screenshot]

Additional information

This PR introduces a cross-backend profiler (currently supported: CPU, BLAS, CUDA) that allows low-overhead profiling of op executions over the course of the computation, including fused ops, by delegating to each backend the emission of fine-grained profiling events. For CUDA, this means CUDA graphs have to be disabled (which of course is a performance loss), but otherwise it measures real-life executions without artificially modifying the graph.

After the profiling run is done, there is also a Python script that can process the data and generate an HTML file with an interactive profiler timeline / stats table (like the one above).

Parallel requests are currently not supported.
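For illustration, here is the kind of aggregation such a post-processing script might do to produce a stats table. This is a minimal sketch; the event tuple layout, field names, and op names are assumptions for the example, not the PR's actual schema:

```python
from collections import defaultdict

# Hypothetical per-op profiling events as the backends might emit them:
# (op_name, backend, start_us, end_us). The layout is an assumption.
events = [
    ("MUL_MAT", "CUDA", 0, 120),
    ("MUL_MAT", "CUDA", 130, 240),
    ("SOFT_MAX", "CPU", 240, 260),
]

def op_stats(events):
    """Aggregate raw timing events into a per-(op, backend) stats table."""
    stats = defaultdict(lambda: {"calls": 0, "total_us": 0})
    for op, backend, start_us, end_us in events:
        s = stats[(op, backend)]
        s["calls"] += 1
        s["total_us"] += end_us - start_us
    return dict(stats)

table = op_stats(events)
print(table[("MUL_MAT", "CUDA")])  # {'calls': 2, 'total_us': 230}
```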

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, after a few failed attempts I finally got the assistant to write the profiler properly

@pwilkin pwilkin requested review from a team, danbev, ggerganov and ngxson as code owners March 29, 2026 00:21
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Mar 29, 2026
@pwilkin pwilkin marked this pull request as draft March 29, 2026 00:29
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language OpenCL Issues specific to the OpenCL backend labels Mar 29, 2026
@github-actions github-actions Bot added Ascend NPU issues specific to Ascend NPUs Hexagon OpenVINO labels Mar 29, 2026
@am17an
Contributor

am17an commented Mar 29, 2026

I don't see myself using this over the CUDA tools like Nsight Systems etc., or for CPU-specific things, perf and friends, since those come for "free" (i.e. outside the repo, well maintained, no instrumentation needed, no learning curve). Is there a specific use-case you had in mind?

@Green-Sky
Collaborator

After the profiling run is done, there is also a Python script that can process the data and generate an HTML file with an interactive profiler timeline / stats table (like the one above).

I think instead it should provide a json in chrome tracing format. That is a well established format with many tools.
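For reference, converting per-op events into the Chrome trace-event format is straightforward. A minimal sketch (the event tuple layout is an assumption from the example above; the JSON keys `ph`, `ts`, `dur`, `pid`, `tid` are the trace-event format's own, with `"ph": "X"` marking a complete event and timestamps in microseconds):

```python
import json

def to_chrome_trace(events):
    """Convert (op, backend, start_us, end_us) tuples into Chrome's
    trace-event JSON format."""
    tids = {}  # one integer thread id per backend -> one timeline track each
    trace_events = []
    for op, backend, start_us, end_us in events:
        tid = tids.setdefault(backend, len(tids))
        trace_events.append({
            "name": op,         # label shown on the timeline slice
            "cat": "ggml_op",
            "ph": "X",          # complete event: start timestamp + duration
            "ts": start_us,
            "dur": end_us - start_us,
            "pid": 0,
            "tid": tid,
        })
    return json.dumps({"traceEvents": trace_events})

out = to_chrome_trace([("MUL_MAT", "CUDA", 0, 120), ("ADD", "CPU", 120, 130)])
```

The resulting file can be opened directly in chrome://tracing or ui.perfetto.dev.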

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

@am17an honestly I had two main use-cases in mind:

  1. Optimizing offload scenarios - currently it's quite hard to profile scenarios with partial CPU offload, yet for most users those are the most prevalent (MoE models).
  2. User performance reports - this gives non-technical users an easy way to provide data that allows pinpointing the exact op / tensor parameters that exhibit a slowdown / performance regression.

@am17an
Contributor

am17an commented Mar 29, 2026

@pwilkin - I'm quite interested in optimizing the offload use cases, in fact that's one of my main areas of interest. In offload scenarios, compute is not the only factor; data transfers matter too, and Nsight Systems shows those quite well.

Secondly, user performance reports are much preferred in llama-bench terms rather than via a new tool. Even if the report also contains which op it is, that's unlikely to provide any extra information; we're going to have to reproduce it anyway.

@github-actions github-actions Bot added Apple Metal https://en.wikipedia.org/wiki/Metal_(API) IBM zDNN issues specific to IBM zDNN Accelerator WebGPU labels Mar 29, 2026
@pwilkin
Member Author

pwilkin commented Mar 29, 2026

All right:

  • added proper handling of copy events
  • integrated Vulkan profiler
  • fixed export to Chrome trace format
  • fixed compile bugs

Contributor

@JohannesGaessler JohannesGaessler left a comment


I'm very much against adding a profiler like this to the CUDA backend. Let me be frank: there is no situation where I would ever want to use a tool like this over NSight Systems. It will simply be a maintenance burden for no benefit.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

I'm very much against adding a profiler like this to the CUDA backend. Let me be frank: there is no situation where I would ever want to use a tool like this over NSight Systems. It will simply be a maintenance burden for no benefit.

On the other hand, the HIP backend would benefit. rocprofiler-compute and its predecessors rocprof1/2 are... less mature; in the past there have been long stretches where AMD's profilers plain didn't work on llama.cpp due to various bugs. And even when they do work, hardware support is quite limited and the full feature set is only available on CDNA.

@am17an
Contributor

am17an commented Mar 29, 2026

On the other hand the HIP backend would benefit. rocprofiler-compute and its predecessors rocprof1/2 are... less mature

Still, I don't think the renaissance of HIP software should begin with llama.cpp.

@JohannesGaessler
Contributor

In terms of opportunity cost there are still a lot of things that I would consider to be of higher priority for HIP performance than adding a profiler. And if something like this were to be added at all, the way it should be done is as an external tool without any ggml backend changes that simply evaluates a ggml graph for the operation in question - which is kind of what test-backend-ops -perf already does. The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

That would be very beneficial generally.

Still, I don't think the renaissance of HIP software should begin with llama.cpp.

I am merely listing instances where this can be useful. Some other cases I can think of: multi-backend offload, where you won't find a profiler that can handle scenarios involving multiple ggml backends. Another case is Vulkan profiling; there are a ton of Vulkan implementations with little viable tooling.

@0cc4m
Contributor

0cc4m commented Mar 29, 2026

There are more backends than CUDA and CPU, so I think this is a good idea. Vulkan has had a simpler profiler for a while because we have little other option. The CUDA side here doesn't look terribly complicated either.

@0cc4m
Contributor

0cc4m commented Mar 29, 2026

The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

That would be very beneficial generally.

I did build something like that in #19896.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

I did build something like that in #19896.

Nice, useful. Thank you for making me aware of this.

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

I'm very much willing to keep this as a separate branch / fork so as not to impose maintenance overhead on the CUDA maintainers (I understand your point that there's no use in maintaining an inferior tool which would require maintenance). I would be interested in hearing from other backend maintainers (right now I reckon Vulkan is a cautious maybe and HIP is a yes).

@ggerganov
Member

ggerganov commented Mar 29, 2026

Yes, keep it on a branch. There are significant API changes that just don't warrant merging the changes into master atm.

You can move the branch to this repo (if you prefer) so that it's easier to sync - whoever is using it can just rebase it for everyone else.

@jeffbolznv
Contributor

I won't be able to try this for a few days, but IMO for this to be a replacement for GGML_VK_PERF_LOGGER we would need to:

  • make sure it works for all apps (regardless of how they parse their command line parameters)
  • make it print the same meaningful names for fused ops
  • make it support printing plain text to stdout or stderr
  • make it print flops/bandwidth
  • make it handle the concurrent mode properly
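The flops/bandwidth point is straightforward to derive from per-op event durations plus tensor shapes. A minimal sketch (the helper name and the fp32 byte accounting are illustrative assumptions; 2*m*n*k is the standard FLOP count for an m x k by k x n matrix multiply):

```python
def mul_mat_rates(m, n, k, bytes_moved, dur_us):
    """Achieved GFLOP/s and GB/s for one matrix-multiply event,
    given its duration in microseconds."""
    dur_s = dur_us * 1e-6
    gflops = (2 * m * n * k) / dur_s / 1e9
    gbps = bytes_moved / dur_s / 1e9
    return gflops, gbps

# e.g. a 4096^3 fp32 GEMM event that took 1000 us:
bytes_moved = 3 * 4096 * 4096 * 4  # read A, read B, write C (fp32)
gflops, gbps = mul_mat_rates(4096, 4096, 4096, bytes_moved, 1000)
```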

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

Aight, created a new virtual PR for discussion of the main repo branch in #21160, closing this one.

@pwilkin pwilkin closed this Mar 29, 2026
