Skip to content

Refactor so CancelSplitCat becomes a class.#1789

Merged
wujingyue merged 2 commits intomainfrom
wjy/class
Mar 1, 2024
Merged

Refactor so CancelSplitCat becomes a class.#1789
wujingyue merged 2 commits intomainfrom
wjy/class

Conversation

@wujingyue
Copy link
Collaborator

@wujingyue wujingyue commented Feb 19, 2024

This makes it convenient to use an IdModel as a class member without having to pass it through many functions.

I examined the NVFUSER_TRACE. FusionKernelRuntime::FusionKernelRuntime is bottlenecked by
"Finding valid fusion segment solutions" not pre-segmenter passes. I added the FUSER_PERF_SCOPE for pre-segmenter passes anyway.

For #1768

@wujingyue wujingyue marked this pull request as draft February 19, 2024 07:04
Base automatically changed from wjy/move to main February 21, 2024 02:22
This makes it convenient to use an IdModel as a class member without
having to pass it through many functions.
@wujingyue wujingyue marked this pull request as ready for review February 27, 2024 06:36
@wujingyue
Copy link
Collaborator Author

!build

@wujingyue wujingyue requested a review from naoyam February 27, 2024 06:37
public:
CancelSplitCat(Fusion* fusion)
: fusion_(fusion),
id_model_(fusion, /*build_graphs=*/true, /*allow_self_mapping=*/true) {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to consider not building all the graphs by default. If all we need is just the exact graph, we can skip generating the other graphs, which may be much more costly than just building the exact graph.

Copy link
Collaborator Author

@wujingyue wujingyue Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to leave that part in the follow-up PR.

@wujingyue wujingyue requested a review from naoyam February 27, 2024 21:34
Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Out of curiosity, have you noticed any significant latency increase? I haven't done any performance profiling, but some parts of the IdModel analyses could by expensive. Just building the exact graph would be no worse than the current ComputeAtMap but the loop promotion analysis, which is automatically done if a loop graph is also built, might be very costly. Note that ToT doesn't have much yet, but I have pending RPs to expand that. (#1777, #1830)

@wujingyue
Copy link
Collaborator Author

LGTM.

Out of curiosity, have you noticed any significant latency increase? I haven't done any performance profiling, but some parts of the IdModel analyses could by expensive. Just building the exact graph would be no worse than the current ComputeAtMap but the loop promotion analysis, which is automatically done if a loop graph is also built, might be very costly. Note that ToT doesn't have much yet, but I have pending RPs to expand that. (#1777, #1830)

Good point. No, I haven't measured the compile time impact either. How do we measure that? cc @rdspring1

That being said, I believe there are lots of low hanging fruits to make IdModel run faster. For example, our disjoint set algorithm can be made to approximately O(1) per operation by using path compression and tracking elements in a set using a linked list.

@naoyam
Copy link
Collaborator

naoyam commented Feb 28, 2024

LGTM.
Out of curiosity, have you noticed any significant latency increase? I haven't done any performance profiling, but some parts of the IdModel analyses could by expensive. Just building the exact graph would be no worse than the current ComputeAtMap but the loop promotion analysis, which is automatically done if a loop graph is also built, might be very costly. Note that ToT doesn't have much yet, but I have pending RPs to expand that. (#1777, #1830)

Good point. No, I haven't measured the compile time impact either. How do we measure that? cc @rdspring1

The easiest way would be to use the nvtx makers we embed. Unless it's disabled by NVFUSER_DISABLE=nvtx, the runtime automatically inserts nvtx markers so that nsys and other tools can understand and visualize the timeline. I haven't tried it myself, but running a test with nsys should generate a trace file that has timestamps of annotated regions by FUSER_PERF_SCOPE. The generated trace file can be visualized, but there should also be a command line tool to interpret it.

That being said, I believe there are lots of low hanging fruits to make IdModel run faster. For example, our disjoint set algorithm can be made to approximately O(1) per operation by using path compression and tracking elements in a set using a linked list.

Yes, nothing has been done for its efficiency yet, and I won't be surprised to see some slow results. I don't think we would need to prioritize that at this moment, but it's been in my TODO list that at least we would make sure nothing gets intolerably slow.

@rdspring1
Copy link
Collaborator

I use the nvtx markers to measure compile time latency. For visualization, I use either Nsight systems or built-in tracing infrastructure in chrome and the NVFUSER_TRACE flag.

Reference: https://github.com/NVIDIA/Fuser/blob/main/csrc/instrumentation.h#L23-L38

@wujingyue wujingyue merged commit bf7be62 into main Mar 1, 2024
@wujingyue wujingyue deleted the wjy/class branch March 1, 2024 19:03
wujingyue added a commit that referenced this pull request Mar 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants