Multi-backend profiler#21138
Conversation
…initializer to all backends
I don't see myself using this over the CUDA tools like Nsight Systems etc. or for CPU-specific things -
I think instead it should provide JSON in the Chrome tracing format. That is a well-established format with many tools.
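For context, the Chrome tracing format suggested here is plain JSON with a small set of required fields per event. A minimal sketch of emitting op timings in that format (the op names and durations below are invented for illustration, not from this PR) might look like:

```python
import json

def make_event(name, ts_us, dur_us, pid=0, tid=0):
    # "ph": "X" is a "complete" event: a begin timestamp plus a
    # duration, both in microseconds.
    return {"name": name, "ph": "X", "ts": ts_us, "dur": dur_us,
            "pid": pid, "tid": tid, "cat": "op"}

# Hypothetical per-op timings for one graph evaluation.
events = [
    make_event("MUL_MAT", 0, 120),
    make_event("SOFT_MAX", 120, 15),
]

trace = {"traceEvents": events, "displayTimeUnit": "ms"}
with open("trace.json", "w") as f:
    json.dump(trace, f)
```

The resulting `trace.json` can be opened directly in `chrome://tracing` or the Perfetto UI.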
@am17an honestly I had two main use-cases in mind:
@pwilkin - I'm quite interested in optimizing the offload use cases; in fact, that's one of my main areas of interest. In offload, compute is not the only factor - there are also the data transfers, which Nsight Systems shows quite well. Secondly, user performance reports are much preferred in
All right:
JohannesGaessler left a comment:
I'm very much against adding a profiler like this to the CUDA backend. Let me be frank: there is no situation where I would ever want to use a tool like this over NSight Systems. It will simply be a maintenance burden for no benefit.
On the other hand, the HIP backend would benefit. rocprofiler-compute and its predecessors rocprof v1/v2 are... less mature; in the past there have been long stretches where AMD's profilers plain didn't work on llama.cpp due to various bugs. And even when they do work, hardware support is quite limited and the full feature set is only available on CDNA.
Still, I don't think the renaissance of HIP software should begin with llama.cpp.
In terms of opportunity cost, there are still a lot of things that I would consider to be of higher priority for HIP performance than adding a profiler. And if something like this were to be added at all, the way it should be done is as an external tool, without any ggml backend changes, that simply evaluates a ggml graph for the operation in question - which is kind of what
That would be very beneficial generally.
I am merely listing instances where this can be useful. Some other cases I can think of are multi-backend offload - you won't find a profiler that can handle offload scenarios involving multiple ggml backends - and Vulkan profiling. There are a ton of Vulkan implementations with little viable tooling.
There are more backends than CUDA and CPU, so I think this is a good idea. Vulkan has had a simpler profiler for a while because we have little other option. The CUDA side here doesn't look terribly complicated either.
I did build something like that in #19896.
Nice, useful. Thank you for making me aware of this.
I'm very much willing to keep this as a separate branch / fork so as not to impose maintenance overhead on the CUDA maintainers (I understand your point that there's no use in maintaining an inferior tool which would still require maintenance). I'd be interested in hearing from other backend maintainers (right now I reckon Vulkan is a cautious maybe and HIP is a yes).
Yes, keep it on a branch. There are significant API changes that just don't warrant merging the changes in. You can move the branch to this repo (if you prefer) so that it's easier to sync - whoever is using it can just rebase it for everyone else.
I won't be able to try this for a few days, but IMO for this to be a replacement for GGML_VK_PERF_LOGGER we would need to:
- make sure it works for all apps (regardless of how they parse their command-line parameters),
- make it print the same meaningful names for fused ops,
- make it support printing plain text to stdout or stderr,
- make it print flops/bandwidth,
- make it handle the concurrent mode properly.
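On the flops/bandwidth point: these figures are derived, not measured, so the profiler only needs the op's dimensions and its duration. A rough sketch for a matrix multiply (the function name and the traffic model are my own simplification, not GGML_VK_PERF_LOGGER's actual code):

```python
def mul_mat_rates(m, n, k, dur_us, bytes_per_elem=4):
    """Derive throughput figures for an m*k by k*n matmul."""
    flops = 2.0 * m * n * k  # one multiply + one add per output term
    # Crude traffic estimate: read A (m*k), read B (k*n), write C (m*n);
    # real kernels reuse data via caches, so this is only a lower bound.
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    dur_s = dur_us * 1e-6
    return flops / dur_s / 1e9, bytes_moved / dur_s / 1e9  # GFLOPS, GB/s
```

For example, a 1024x1024x1024 f32 matmul taking 1000 us works out to roughly 2147 GFLOPS and 12.6 GB/s under this model.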
Aight, created a new virtual PR for discussion of the main-repo branch in #21160, closing this one.
Overview
A picture says more than a thousand words, so here's a picture:

Additional information
This PR introduces a cross-backend profiler (currently supported: CPU, BLAS, CUDA) that allows low-overhead profiling of op executions over the course of the computation, including fused ops, by delegating the emission of fine-grained profiling events to each backend. For CUDA, this means CUDA graphs have to be disabled (which is of course a performance loss), but otherwise it measures real-life executions without artificially modifying the graph.
After the profiling run is done, a Python script can process the data and generate an HTML file with an interactive profiler timeline / stats table (like the one above).
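The aggregation step of such a script is essentially a group-by over timed events. A minimal sketch, assuming a hypothetical event schema with `op`, `backend`, and `dur_us` fields (not the PR's actual format):

```python
from collections import defaultdict

def summarize(events):
    """Aggregate per-op timing events into stats rows, slowest first."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0.0})
    for ev in events:
        key = (ev["backend"], ev["op"])
        stats[key]["count"] += 1
        stats[key]["total_us"] += ev["dur_us"]
    rows = []
    for (backend, op), s in sorted(stats.items(),
                                   key=lambda kv: -kv[1]["total_us"]):
        rows.append({"backend": backend, "op": op,
                     "count": s["count"],
                     "total_us": s["total_us"],
                     "avg_us": s["total_us"] / s["count"]})
    return rows
```

Rendering `rows` into an HTML table or timeline is then a templating exercise on top of this.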
Parallel requests are currently not supported.
Requirements