Refactor so CancelSplitCat becomes a class. by wujingyue · Pull Request #1789 · NVIDIA/Fuser

wujingyue · 2024-02-19T06:26:42Z

This makes it convenient to use an IdModel as a class member without having to pass it through many functions.

I examined the NVFUSER_TRACE output. FusionKernelRuntime::FusionKernelRuntime is bottlenecked by "Finding valid fusion segment solutions", not by the pre-segmenter passes. I added a FUSER_PERF_SCOPE for the pre-segmenter passes anyway.

For #1768
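
To illustrate the motivation, here is a minimal sketch of the pattern, not the actual CancelSplitCat code. `ToyAnalysis` and `ToyPass` are hypothetical names: the analysis stands in for IdModel, is built once in the constructor, and is reached by private helpers through a member instead of being threaded through every function signature.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an expensive analysis such as IdModel.
// It is built once from the input (here just a list of names).
class ToyAnalysis {
 public:
  explicit ToyAnalysis(const std::vector<std::string>& names) {
    for (int i = 0; i < static_cast<int>(names.size()); ++i) {
      index_[names[i]] = i;
    }
  }

  // Returns the position of `name`, or -1 if it is unknown.
  int indexOf(const std::string& name) const {
    auto it = index_.find(name);
    return it == index_.end() ? -1 : it->second;
  }

 private:
  std::unordered_map<std::string, int> index_;
};

// Hypothetical pass written as a class: the analysis is a member, so the
// private helper below does not need it as an extra parameter.
class ToyPass {
 public:
  explicit ToyPass(const std::vector<std::string>& names)
      : names_(names), analysis_(names) {}

  // Counts the names that are positioned before `pivot`.
  int countBefore(const std::string& pivot) const {
    int count = 0;
    for (const auto& name : names_) {
      if (comesBefore(name, pivot)) {
        ++count;
      }
    }
    assert(count <= static_cast<int>(names_.size()));
    return count;
  }

 private:
  bool comesBefore(const std::string& a, const std::string& b) const {
    return analysis_.indexOf(a) < analysis_.indexOf(b);
  }

  std::vector<std::string> names_;
  ToyAnalysis analysis_;
};
```

With free functions, `comesBefore` and every caller above it would each take the analysis as an argument; the class form keeps that plumbing in one place.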

wujingyue · 2024-02-27T06:36:47Z

!build

naoyam · 2024-02-27T21:31:14Z

csrc/preseg_passes/move_split_cat.cpp

+ public:
+  CancelSplitCat(Fusion* fusion)
+      : fusion_(fusion),
+        id_model_(fusion, /*build_graphs=*/true, /*allow_self_mapping=*/true) {}


We may want to consider not building all the graphs by default. If all we need is the exact graph, we can skip generating the other graphs, which may be much more costly than building the exact graph alone.

wujingyue · 2024-02-27

Totally agree -- https://github.com/NVIDIA/Fuser/pull/1799/files#diff-4e65ceb031d001c04bbef22544b699ffd74f1e4ed0dcea24230b18b77860d4f0R79-R80. I don't see an API for that yet. Do you want to work on that?

naoyam · 2024-02-27

I think it's fine to leave that part to a follow-up PR.
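
As a rough illustration of the on-demand idea discussed in this thread, the sketch below builds each graph only when it is first requested. This is not an existing IdModel API: `ToyGraph`, `LazyGraphs`, and the cost numbers are all hypothetical and only model "the exact graph is cheap, the loop graph is expensive".

```cpp
#include <cassert>
#include <optional>

// Hypothetical graph type; in this toy it only records how much work
// was spent building it.
struct ToyGraph {
  int build_cost = 0;
};

// Hypothetical holder that builds each graph lazily and caches it.
class LazyGraphs {
 public:
  // Cheap graph: built on first use, reused afterwards.
  const ToyGraph& exact() {
    if (!exact_.has_value()) {
      exact_ = build(/*cost=*/1);
    }
    return *exact_;
  }

  // Expensive graph: callers that never ask for it never pay for it.
  const ToyGraph& loop() {
    if (!loop_.has_value()) {
      loop_ = build(/*cost=*/10);
    }
    return *loop_;
  }

  // Total work spent so far, in made-up units.
  int totalCost() const { return total_cost_; }

 private:
  ToyGraph build(int cost) {
    assert(cost > 0);
    total_cost_ += cost;
    ToyGraph graph;
    graph.build_cost = cost;
    return graph;
  }

  std::optional<ToyGraph> exact_;
  std::optional<ToyGraph> loop_;
  int total_cost_ = 0;
};
```

A pass that only calls `exact()` pays one unit in this model, however many times it queries the graph.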

naoyam

LGTM.

Out of curiosity, have you noticed any significant latency increase? I haven't done any performance profiling, but some parts of the IdModel analyses could be expensive. Just building the exact graph would be no worse than the current ComputeAtMap, but the loop promotion analysis, which is automatically done if a loop graph is also built, might be very costly. Note that ToT doesn't have much yet, but I have pending PRs to expand that. (#1777, #1830)

wujingyue · 2024-02-27T22:48:20Z


Good point. No, I haven't measured the compile time impact either. How do we measure that? cc @rdspring1

That being said, I believe there is plenty of low-hanging fruit for making IdModel run faster. For example, our disjoint-set algorithm can be made approximately O(1) amortized per operation by using path compression and by tracking the elements of each set in a linked list.
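
A minimal sketch of that suggestion, assuming elements are plain integer ids. This is the textbook union-find, not nvFuser's disjoint-set class; the `DisjointSets` name and its methods are hypothetical. It adds union by size, which the comment does not mention but which is what makes the near-constant amortized bound hold together with path compression.

```cpp
#include <cassert>
#include <list>
#include <numeric>
#include <utility>
#include <vector>

// Union-find over the ids 0..n-1 with path compression, union by size,
// and a per-set linked list of members.
class DisjointSets {
 public:
  explicit DisjointSets(int n) : parent_(n), members_(n) {
    std::iota(parent_.begin(), parent_.end(), 0);
    for (int i = 0; i < n; ++i) {
      members_[i].push_back(i);
    }
  }

  // Returns the representative of x's set and compresses the path to it.
  int find(int x) {
    assert(x >= 0 && x < static_cast<int>(parent_.size()));
    int root = x;
    while (parent_[root] != root) {
      root = parent_[root];
    }
    // Path compression: point every node on the walked path at the root.
    while (parent_[x] != root) {
      int next = parent_[x];
      parent_[x] = root;
      x = next;
    }
    return root;
  }

  // Merges the sets containing a and b.
  void unite(int a, int b) {
    int ra = find(a);
    int rb = find(b);
    if (ra == rb) {
      return;
    }
    // Union by size: attach the smaller set under the larger one.
    if (members_[ra].size() < members_[rb].size()) {
      std::swap(ra, rb);
    }
    parent_[rb] = ra;
    // Splicing a whole std::list is O(1); members_[rb] is left empty.
    members_[ra].splice(members_[ra].end(), members_[rb]);
  }

  // All elements in x's set, without scanning the other sets.
  const std::list<int>& membersOf(int x) { return members_[find(x)]; }

 private:
  std::vector<int> parent_;
  std::vector<std::list<int>> members_;
};
```

The member lists are what let a caller enumerate a set in time proportional to its size, and the O(1) splice keeps `unite` from copying elements.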

naoyam · 2024-02-28T20:06:10Z

> Good point. No, I haven't measured the compile time impact either. How do we measure that? cc @rdspring1

The easiest way would be to use the nvtx markers we embed. Unless it's disabled by NVFUSER_DISABLE=nvtx, the runtime automatically inserts nvtx markers so that nsys and other tools can understand and visualize the timeline. I haven't tried it myself, but running a test with nsys should generate a trace file that has timestamps for the regions annotated by FUSER_PERF_SCOPE. The generated trace file can be visualized, but there should also be a command-line tool to interpret it.

> That being said, I believe there is plenty of low-hanging fruit for making IdModel run faster. For example, our disjoint-set algorithm can be made approximately O(1) amortized per operation by using path compression and by tracking the elements of each set in a linked list.

Yes, nothing has been done for its efficiency yet, and I won't be surprised to see some slow results. I don't think we need to prioritize that at this moment, but it's on my TODO list to at least make sure nothing gets intolerably slow.

rdspring1 · 2024-02-29T17:14:30Z

I use the nvtx markers to measure compile-time latency. For visualization, I use either Nsight Systems or the built-in tracing infrastructure in Chrome together with the NVFUSER_TRACE flag.

Reference: https://github.com/NVIDIA/Fuser/blob/main/csrc/instrumentation.h#L23-L38
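
For readers unfamiliar with scope-based instrumentation, the sketch below shows the general RAII idea behind a perf-scope style annotation. It is not the implementation in instrumentation.h and emits no nvtx markers; `ScopeTimes`, `ScopedTimer`, and `runToyPasses` are hypothetical names, and timings are simply accumulated in memory.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory sink that accumulates time per scope name.
class ScopeTimes {
 public:
  void add(const std::string& name, std::chrono::nanoseconds elapsed) {
    assert(elapsed.count() >= 0);  // steady_clock never goes backwards
    Entry& entry = entries_[name];
    entry.calls += 1;
    entry.total += elapsed;
  }

  // Number of times a scope with this name has finished.
  int calls(const std::string& name) const {
    auto it = entries_.find(name);
    return it == entries_.end() ? 0 : it->second.calls;
  }

 private:
  struct Entry {
    int calls = 0;
    std::chrono::nanoseconds total{0};
  };
  std::map<std::string, Entry> entries_;
};

// RAII timer: starts on construction, reports to the sink on destruction.
class ScopedTimer {
 public:
  ScopedTimer(ScopeTimes& sink, std::string name)
      : sink_(sink),
        name_(std::move(name)),
        start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    auto elapsed = std::chrono::steady_clock::now() - start_;
    sink_.add(
        name_,
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed));
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  ScopeTimes& sink_;
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

// Usage: an outer scope around all passes, plus one scope per pass.
void runToyPasses(ScopeTimes& sink) {
  ScopedTimer all(sink, "pre-segmenter passes");
  { ScopedTimer timer(sink, "pass A"); /* pass A work would go here */ }
  { ScopedTimer timer(sink, "pass B"); /* pass B work would go here */ }
}
```

Because the report happens in the destructor, early returns and exceptions inside a pass still close the scope.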

wujingyue marked this pull request as draft February 19, 2024 07:04

wujingyue force-pushed the wjy/class branch from 4b36d91 to 91e0bf3 Compare February 20, 2024 05:43

wujingyue force-pushed the wjy/move branch from 4e53857 to 048e573 Compare February 20, 2024 23:26

Base automatically changed from wjy/move to main February 21, 2024 02:22

wujingyue added 2 commits February 27, 2024 06:33

Refactor so CancelSplitCat becomes a class.

fc14e0b

This makes it convenient to use an IdModel as a class member without having to pass it through many functions.

Create IdModel.

31a71ce

wujingyue force-pushed the wjy/class branch from d3c74b5 to 31a71ce Compare February 27, 2024 06:34

wujingyue marked this pull request as ready for review February 27, 2024 06:36

wujingyue requested a review from naoyam February 27, 2024 06:37

naoyam reviewed Feb 27, 2024

wujingyue requested a review from naoyam February 27, 2024 21:34

naoyam approved these changes Feb 27, 2024

wujingyue merged commit bf7be62 into main Mar 1, 2024

wujingyue deleted the wjy/class branch March 1, 2024 19:03

wujingyue added a commit that referenced this pull request Mar 2, 2024

Add a perf scope for pre-segmenter passes.

eac3296

As a follow up to #1789.

wujingyue mentioned this pull request Mar 2, 2024

Add a perf scope for pre-segmenter passes. #1868

Merged

wujingyue added the enhancement label Mar 3, 2024
