This makes it convenient to use an IdModel as a class member without having to pass it through many functions.
!build
public:
  CancelSplitCat(Fusion* fusion)
      : fusion_(fusion),
        id_model_(fusion, /*build_graphs=*/true, /*allow_self_mapping=*/true) {}
We may want to consider not building all the graphs by default. If all we need is just the exact graph, we can skip generating the other graphs, which may be much more costly than just building the exact graph.
Totally agree -- https://github.com/NVIDIA/Fuser/pull/1799/files#diff-4e65ceb031d001c04bbef22544b699ffd74f1e4ed0dcea24230b18b77860d4f0R79-R80. I don't see an API for that yet. Do you want to work on that?
I think it's fine to leave that part in the follow-up PR.
naoyam
left a comment
LGTM.
Out of curiosity, have you noticed any significant latency increase? I haven't done any performance profiling, but some parts of the IdModel analyses could be expensive. Just building the exact graph would be no worse than the current ComputeAtMap, but the loop promotion analysis, which is automatically done if a loop graph is also built, might be very costly. Note that ToT doesn't have much yet, but I have pending PRs to expand that. (#1777, #1830)
Good point. No, I haven't measured the compile time impact either. How do we measure that? cc @rdspring1 That being said, I believe there is a lot of low-hanging fruit for making IdModel run faster. For example, our disjoint set algorithm can be made approximately O(1) per operation by using path compression and tracking the elements in a set using a linked list.
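To make the disjoint-set remark concrete, here is a minimal sketch of a union-find structure with path compression. It is a hypothetical illustration, not nvFuser's disjoint-set implementation: the class name `UnionFind` and its methods are invented here, elements are plain integer indices, and it pairs path compression with union by size rather than the linked-list element tracking mentioned above. With both techniques, each operation runs in amortized near-constant time.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical illustration only; not nvFuser's disjoint-set class.
class UnionFind {
 public:
  // Starts with n singleton sets: {0}, {1}, ..., {n-1}.
  explicit UnionFind(std::size_t n) : parent_(n), size_(n, 1) {
    std::iota(parent_.begin(), parent_.end(), std::size_t{0});
  }

  // Returns the representative of x's set and compresses the path to it.
  std::size_t find(std::size_t x) {
    assert(x < parent_.size());
    // First pass: walk up to the root.
    std::size_t root = x;
    while (parent_[root] != root) {
      root = parent_[root];
    }
    // Second pass: point every visited node directly at the root.
    while (parent_[x] != root) {
      std::size_t next = parent_[x];
      parent_[x] = root;
      x = next;
    }
    return root;
  }

  // Merges the sets containing a and b. Returns false if they were
  // already in the same set.
  bool unite(std::size_t a, std::size_t b) {
    a = find(a);
    b = find(b);
    if (a == b) {
      return false;
    }
    // Union by size: attach the smaller tree under the larger one.
    if (size_[a] < size_[b]) {
      std::swap(a, b);
    }
    parent_[b] = a;
    size_[a] += size_[b];
    return true;
  }

  bool same(std::size_t a, std::size_t b) { return find(a) == find(b); }

  // Number of elements in the set containing x.
  std::size_t setSize(std::size_t x) { return size_[find(x)]; }

 private:
  std::vector<std::size_t> parent_;
  std::vector<std::size_t> size_;
};
```

For example, after `UnionFind uf(5); uf.unite(0, 1); uf.unite(1, 2);`, the call `uf.same(0, 2)` returns true and `uf.setSize(2)` returns 3. Mapping real objects to indices would need an extra lookup table, which this sketch leaves out.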
The easiest way would be to use the nvtx markers we embed. Unless it's disabled by
Yes, nothing has been done for its efficiency yet, and I won't be surprised to see some slow results. I don't think we need to prioritize that at the moment, but it's been on my TODO list to at least make sure nothing gets intolerably slow.
I use the nvtx markers to measure compile time latency. For visualization, I use either Nsight Systems or the built-in tracing infrastructure in Chrome and the
Reference: https://github.com/NVIDIA/Fuser/blob/main/csrc/instrumentation.h#L23-L38
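The idea of scoped markers for measuring compile time can be sketched with a small RAII timer. This is a generic illustration and an assumption about the general technique, not the nvtx API or nvFuser's instrumentation: `ScopeTimer`, `Totals`, and `runPreSegmenterPasses` are hypothetical names, and the "pass" is a dummy loop. The timer records its start time on construction and adds the elapsed time to a named total when the scope exits.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a perf-scope marker; not nvFuser's macro.
class ScopeTimer {
 public:
  using Clock = std::chrono::steady_clock;
  using Totals = std::map<std::string, Clock::duration>;

  // Records the start time; `totals` must outlive the timer.
  ScopeTimer(std::string name, Totals& totals)
      : name_(std::move(name)), totals_(totals), start_(Clock::now()) {}

  // On scope exit, accumulates the elapsed time under the scope's name.
  ~ScopeTimer() { totals_[name_] += Clock::now() - start_; }

  ScopeTimer(const ScopeTimer&) = delete;
  ScopeTimer& operator=(const ScopeTimer&) = delete;

 private:
  std::string name_;
  Totals& totals_;
  Clock::time_point start_;
};

// Hypothetical pass driver: the timer covers the whole function body.
inline int runPreSegmenterPasses(ScopeTimer::Totals& totals) {
  ScopeTimer timer("preseg_passes", totals);
  int work = 0;
  for (int i = 0; i < 1000; ++i) {
    work += i;  // dummy work standing in for real passes
  }
  return work;
}
```

Calling `runPreSegmenterPasses(totals)` several times accumulates into the single `"preseg_passes"` entry, so comparing totals across scope names shows which phase dominates. A real profiler-backed marker would emit range events for a tool to visualize instead of summing durations in a map.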
I examined the NVFUSER_TRACE. FusionKernelRuntime::FusionKernelRuntime is bottlenecked by "Finding valid fusion segment solutions", not by the pre-segmenter passes. I added FUSER_PERF_SCOPE for the pre-segmenter passes anyway.
For #1768