Multi-backend profiler#21138

Closed
pwilkin wants to merge 4 commits into ggml-org:master from pwilkin:cool-profiler-thingy

Conversation

@pwilkin
Member

@pwilkin pwilkin commented Mar 29, 2026

Overview

A picture says more than a thousand words, so here's a picture:
[image: interactive profiler timeline screenshot]

Additional information

This PR introduces a cross-backend profiler (currently supported: CPU, BLAS, CUDA) that allows low-overhead profiling of op executions over the course of the computation, including fused ops, by delegating to each backend the emission of fine-grained profiling events. For CUDA, this means CUDA graphs have to be disabled (which of course is a performance loss), but otherwise it measures real-life executions without artificially modifying the graph.

After the profiling run is done, there is also a Python script that can process the data and generate an HTML file with an interactive profiler timeline / stats table (like the one above).

Parallel requests are currently not supported.
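For illustration, here is the kind of aggregation such a post-processing script might do to produce a stats table. This is a minimal sketch; the event tuple layout, field names, and op names are assumptions for the example, not the PR's actual schema:

```python
from collections import defaultdict

# Hypothetical per-op profiling events as the backends might emit them:
# (op_name, backend, start_us, end_us). The layout is an assumption.
events = [
    ("MUL_MAT", "CUDA", 0, 120),
    ("MUL_MAT", "CUDA", 130, 240),
    ("SOFT_MAX", "CPU", 240, 260),
]

def op_stats(events):
    """Aggregate raw timing events into a per-(op, backend) stats table."""
    stats = defaultdict(lambda: {"calls": 0, "total_us": 0})
    for op, backend, start_us, end_us in events:
        s = stats[(op, backend)]
        s["calls"] += 1
        s["total_us"] += end_us - start_us
    return dict(stats)

table = op_stats(events)
print(table[("MUL_MAT", "CUDA")])  # {'calls': 2, 'total_us': 230}
```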

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, after a few failed attempts I finally got the assistant to write the profiler properly

@pwilkin pwilkin requested review from a team, danbev, ggerganov and ngxson as code owners March 29, 2026 00:21
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Mar 29, 2026
@pwilkin pwilkin marked this pull request as draft March 29, 2026 00:29
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language OpenCL Issues specific to the OpenCL backend labels Mar 29, 2026
@github-actions github-actions Bot added Ascend NPU issues specific to Ascend NPUs Hexagon OpenVINO labels Mar 29, 2026
@am17an
Contributor

am17an commented Mar 29, 2026

I don't see myself using this over the CUDA tools like Nsight Systems etc., or for CPU-specific things, perf and friends, since those come for "free" (i.e. outside the repo, well maintained, no instrumentation needed, no learning curve). Is there a specific use-case you had in mind?

@Green-Sky
Collaborator

After the profiling run is done, there is also a Python script that can process the data and generate an HTML file with an interactive profiler timeline / stats table (like the one above).

I think instead it should provide a json in chrome tracing format. That is a well established format with many tools.
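For reference, converting per-op events into the Chrome trace-event format is straightforward. A minimal sketch (the event tuple layout is an assumption from the example above; the JSON keys `ph`, `ts`, `dur`, `pid`, `tid` are the trace-event format's own, with `"ph": "X"` marking a complete event and timestamps in microseconds):

```python
import json

def to_chrome_trace(events):
    """Convert (op, backend, start_us, end_us) tuples into Chrome's
    trace-event JSON format."""
    tids = {}  # one integer thread id per backend -> one timeline track each
    trace_events = []
    for op, backend, start_us, end_us in events:
        tid = tids.setdefault(backend, len(tids))
        trace_events.append({
            "name": op,         # label shown on the timeline slice
            "cat": "ggml_op",
            "ph": "X",          # complete event: start timestamp + duration
            "ts": start_us,
            "dur": end_us - start_us,
            "pid": 0,
            "tid": tid,
        })
    return json.dumps({"traceEvents": trace_events})

out = to_chrome_trace([("MUL_MAT", "CUDA", 0, 120), ("ADD", "CPU", 120, 130)])
```

The resulting file can be opened directly in chrome://tracing or ui.perfetto.dev.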

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

@am17an honestly I had two main use-cases in mind:

  1. Optimizing offload scenarios - currently it's quite hard to profile scenarios with partial CPU offload, yet for most users those are the most prevalent (MoE models).
  2. User performance reports - this gives non-technical users an easy way to provide data that allows pinpointing the exact op / tensor parameters that exhibit a slowdown / performance regression.

@am17an
Contributor

am17an commented Mar 29, 2026

@pwilkin - I'm quite interested in optimizing the offload use cases, in fact that's one of my main areas of interest. In offload scenarios, compute is not the only factor; data transfers matter too, and Nsight Systems shows those quite well.

Secondly, user performance reports are much preferred in llama-bench terms rather than via a new tool. Even if the report also contains which op it is, that's unlikely to provide any extra information; we're going to have to reproduce it anyway.

@github-actions github-actions Bot added Apple Metal https://en.wikipedia.org/wiki/Metal_(API) IBM zDNN issues specific to IBM zDNN Accelerator WebGPU labels Mar 29, 2026
@pwilkin
Member Author

pwilkin commented Mar 29, 2026

All right:

  • added proper handling of copy events
  • integrated Vulkan profiler
  • fixed export to Chrome trace format
  • fixed compile bugs

Contributor

@JohannesGaessler JohannesGaessler left a comment


I'm very much against adding a profiler like this to the CUDA backend. Let me be frank: there is no situation where I would ever want to use a tool like this over NSight Systems. It will simply be a maintenance burden for no benefit.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

I'm very much against adding a profiler like this to the CUDA backend. Let me be frank: there is no situation where I would ever want to use a tool like this over NSight Systems. It will simply be a maintenance burden for no benefit.

On the other hand, the HIP backend would benefit. rocprofiler-compute and its predecessors rocprof1/2 are... less mature; in the past there have been long stretches where AMD's profilers plain didn't work on llama.cpp due to various bugs. And even when they do work, hardware support is quite limited and the full feature set is only available on CDNA.

@am17an
Contributor

am17an commented Mar 29, 2026

On the other hand the HIP backend would benefit. rocprofiler-compute and its predecessors rocprof1/2 are... less mature

Still, I don't think the renaissance of HIP software should begin with llama.cpp.

@JohannesGaessler
Contributor

In terms of opportunity cost there are still a lot of things that I would consider to be of higher priority for HIP performance than adding a profiler. And if something like this were to be added at all, the way it should be done is as an external tool without any ggml backend changes that simply evaluates a ggml graph for the operation in question - which is kind of what test-backend-ops -perf already does. The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

That would be very beneficial generally.

Still, I don't think the renaissance of HIP software should begin with llama.cpp.

I am merely listing instances where this can be useful. Some other cases I can think of: multi-backend offload, where you won't find a profiler that can handle scenarios involving multiple ggml backends. Another case is Vulkan profiling; there are a ton of Vulkan implementations with little viable tooling.

@0cc4m
Contributor

0cc4m commented Mar 29, 2026

There are more backends than CUDA and CPU, so I think this is a good idea. Vulkan has had a simpler profiler for a while because we have little other option. The CUDA side here doesn't look terribly complicated either.

@0cc4m
Contributor

0cc4m commented Mar 29, 2026

The only thing that would be missing is some way to dump the ggml graphs for an existing model so that the individual operations can be profiled with the exact tensor shapes as would be found in an actual model.

That would be very beneficial generally.

I did build something like that in #19896.

@IMbackK
Collaborator

IMbackK commented Mar 29, 2026

I did build something like that in #19896.

Nice, useful. Thank you for making me aware of this.

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

I'm very much willing to keep this as a separate branch / fork so as not to impose maintenance overhead on the CUDA maintainers (I understand your point that there's no use in maintaining an inferior tool which would require maintenance). I would be interested in hearing from other backend maintainers (right now I reckon Vulkan is a cautious maybe and HIP is a yes).

@ggerganov
Member

ggerganov commented Mar 29, 2026

Yes, keep it on a branch. There are significant API changes that just don't warrant merging the changes into master atm.

You can move the branch to this repo (if you prefer) so that it's easier to sync - whoever is using it can just rebase it for everyone else.

@jeffbolznv
Contributor

I won't be able to try this for a few days, but IMO for this to be a replacement for GGML_VK_PERF_LOGGER we would need to:

  • make sure it works for all apps (regardless of how they parse their command line parameters)
  • make it print the same meaningful names for fused ops
  • make it support printing plain text to stdout or stderr
  • make it print flops/bandwidth
  • make it handle the concurrent mode properly
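The flops/bandwidth point is straightforward to derive from per-op event durations plus tensor shapes. A minimal sketch (the helper name and the fp32 byte accounting are illustrative assumptions; 2*m*n*k is the standard FLOP count for an m x k by k x n matrix multiply):

```python
def mul_mat_rates(m, n, k, bytes_moved, dur_us):
    """Achieved GFLOP/s and GB/s for one matrix-multiply event,
    given its duration in microseconds."""
    dur_s = dur_us * 1e-6
    gflops = (2 * m * n * k) / dur_s / 1e9
    gbps = bytes_moved / dur_s / 1e9
    return gflops, gbps

# e.g. a 4096^3 fp32 GEMM event that took 1000 us:
bytes_moved = 3 * 4096 * 4096 * 4  # read A, read B, write C (fp32)
gflops, gbps = mul_mat_rates(4096, 4096, 4096, bytes_moved, 1000)
```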

@pwilkin
Member Author

pwilkin commented Mar 29, 2026

Aight, created a new virtual PR for discussion of the main repo branch in #21160, closing this one.

@pwilkin pwilkin closed this Mar 29, 2026
