Conversation
Force-pushed 6fc6cb2 to 81e3fc1
- connect matmul scheduler in segmenter with implementation of matmul scheduling
- update matmul params to store key items needed for matmul scheduling (based on prototype param structure)
- add matmul compile/runtime checks in separate source file
- apply improvement in matmul instruction scheduling with loop rotation (changes from #2488)
- add initial heuristics for matmul scheduler in segmenter
- add implementation of helper functions for matmul heuristics
- add dedicated debug logger
- add test for checking matmul schedule integration with segmenter
Force-pushed 81e3fc1 to 4d0aa43
- redo compile-time checks for matmul scheduler
- redo matmul scheduler heuristic structure
Force-pushed 37a8e3b to 2ea779a
Force-pushed 2ea779a to 4970db2
- add flag for rasterization order in matmul heuristic params
- connect matmul scheduler in segmenter with implementation of matmul scheduling
- update matmul params to store key items needed for matmul scheduling (based on prototype param structure)
- add matmul compile/runtime checks in separate source file
- apply improvement in matmul instruction scheduling with loop rotation (changes from #2488)
- add initial heuristics for matmul scheduler in segmenter
- add implementation of helper functions for matmul heuristics
- add dedicated debug logger
- add test for checking matmul schedule integration with segmenter
- fix documentation
- add flag for rasterization order in matmul heuristic params
- add code for calculating index mode
- add code for calculating problem shape
- post-review changes
- fix clang-tidy warnings in modified source files
- add support for grid swizzle in matmul kernels
- add guards for MmaOps in schedulers other than matmul
- add minor updates
Force-pushed f023b77 to 6d7115c
Force-pushed 6d7115c to 41c5217
The scope of the latest changes in the PR:
The current status of verification: all tests have been restarted; I will update the PR description when they are done. @zasdfgbnm and @naoyam for visibility of the applied changes.
- connect matmul scheduler in segmenter with implementation of matmul scheduling structures
- update matmul params to store key items needed for matmul scheduling (based on prototype param structure)
- add matmul compile/runtime checks in separate source file
- apply improvement in matmul instruction scheduling with loop rotation (changes from #2488)
- add initial heuristics for matmul scheduler in segmenter
- add implementation of helper functions for matmul heuristics
- add dedicated debug logger
- add tests for checking matmul schedule integration with segmenter
- fix documentation
- add code for calculating index mode
- add code for calculating problem shape
- fix clang-tidy warnings in modified source files
- add guards for MmaOps in schedulers other than matmul
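"Calculating index mode" typically means deciding whether 32-bit indexing is safe for the generated kernel. A hedged sketch of that decision, with illustrative names (the PR's actual logic derives this from the fusion's tensor extents):

```cpp
#include <cstdint>
#include <vector>

enum class IndexMode { Int32, Int64 };

// Use 32-bit indexing only when every tensor's element count (and thus
// any linear index computed from it) fits in int32; otherwise fall back
// to 64-bit indexing.
IndexMode computeIndexMode(const std::vector<int64_t>& tensor_numels) {
  constexpr int64_t kInt32Max = INT32_MAX;
  for (int64_t n : tensor_numels) {
    if (n > kInt32Max) {
      return IndexMode::Int64;
    }
  }
  return IndexMode::Int32;
}
```

Picking the narrower mode when possible saves registers and instructions in the index arithmetic, which matters for matmul occupancy.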
Force-pushed 41c5217 to e4e0b2f
    return ss.str();
  }

  size_t hash() const {
Very naive, but why not go with std::hash<> specializations instead?
I agree. We should change that (but not in this PR), not only for matmul, but for all schedulers.
Such a specialization requires extending namespace std, which I'm a little hesitant to do (although that is the intended way of doing it, if I'm not wrong). Maybe it's not a bad idea, though: all matmul-related structs that require hashing could have their own specialization. This can be checked and done as a follow-up to this PR. Does that sound ok?
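For reference, a std::hash<> specialization for a program-defined type is one of the few sanctioned extensions of namespace std. A minimal sketch, using a hypothetical TileShape struct as a stand-in for the matmul params (not the actual Fuser definitions):

```cpp
#include <cstddef>
#include <functional>

// Hypothetical params struct; the real matmul params live in nvFuser.
struct TileShape {
  int m;
  int n;
  int k;
};

// Specializing std::hash for a program-defined type is explicitly
// permitted by the standard, unlike most other additions to std.
namespace std {
template <>
struct hash<TileShape> {
  size_t operator()(const TileShape& t) const noexcept {
    size_t h = hash<int>{}(t.m);
    // Boost-style hash_combine to mix in the remaining fields.
    h ^= hash<int>{}(t.n) + 0x9e3779b9u + (h << 6) + (h >> 2);
    h ^= hash<int>{}(t.k) + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
  }
};
} // namespace std
```

With this in place, the type works directly as a key in std::unordered_map / std::unordered_set, and a hand-written hash() method could simply delegate to it.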
Force-pushed e4e0b2f to fc6be09
zasdfgbnm left a comment:
Left a few more comments.
TensorView* a = inputs[0]->as<TensorView>();
TensorView* b = inputs[1]->as<TensorView>();
TensorView* c = outputs[0]->as<TensorView>();
It's fine to add it in a follow-up PR. The more important thing is that we will eventually have that test; as long as we make sure of that, it doesn't matter which PR it goes in.
  return (tvs.size() == 1) ? tvs[0] : nullptr;
};

const auto* tv_input_A =
A note for the future: I think we will have to refactor this to cache tv_input_A, which should be used for scheduleMatmul. https://github.com/NVIDIA/Fuser/pull/23/files#r1151864983
// #2
{
  for (const auto* mma_expr : mma_exprs) {
    const auto layout_data = getInputsLayout(mma_expr);
Note to future: the result of this should be cached and reused in the heuristics and scheduleMatmul.
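The caching suggested in these notes could look something like the sketch below. MmaOp and the layout enum are stand-ins here, and computeLayout is a placeholder for the PR's getInputsLayout analysis:

```cpp
#include <unordered_map>

// Hypothetical stand-ins; the real MmaOp and layout types live in nvFuser.
struct MmaOp {};
enum class MmaLayout { TT, TN, NT, NN };

// Compute the input layout once per MmaOp and reuse it in both the
// heuristics and scheduleMatmul.
class LayoutCache {
 public:
  MmaLayout get(const MmaOp* op) {
    auto it = cache_.find(op);
    if (it != cache_.end()) {
      return it->second; // reuse the previously computed layout
    }
    MmaLayout layout = computeLayout(op);
    cache_.emplace(op, layout);
    return layout;
  }

 private:
  // Placeholder for the real analysis (getInputsLayout in the PR).
  MmaLayout computeLayout(const MmaOp*) { return MmaLayout::TN; }

  std::unordered_map<const MmaOp*, MmaLayout> cache_;
};
```

Keying the cache on the MmaOp pointer works as long as the cache does not outlive the fusion that owns the expressions.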
if (fusion_outputs_tvs.size() != fusion_outputs.size()) {
  return "Fusion has output which is not a TensorView object";
}
Currently we only support tensors as fusion outputs. This is not just for matmul, but for all problems:
Lines 193 to 195 in 12f5c9d
Good to know. In the follow-up PR, along with code cleanup, I will remove this check from matmul_utils.cpp.
Thanks for letting me know about the executor's requirement.
if (!tv->hasReduction()) {
  return "Fusion output TV has no reduction domain";
}
Why can a fusion output not have a reduction? Isn't the fusion output also the output of the mma op, which does have a reduction?
nvm, I misread this code. You are checking that "there must be a reduction", not that "there can not be a reduction".
// 'SchedulerRuntimeInfo'
#include <executor_utils.h>

#include <sstream>
It looks like an artifact from one of the older revisions of the compile/run-time checks (used to generate the reason message for a check failure). I will remove it in the follow-up PR.
Thanks!
csrc/scheduler/registry.cpp (outdated)
// Check that inputs of all select/gather-like ops are fusion inputs
if (rejectScheduleForSelectLikeOps(fusion, ScheduleHeuristic::Matmul)) {
  return false;
}
Because we are not doing any fusion right now, there will automatically be no select/index_select/gather.
const auto isBroadcastIn = [](const Val* val) {
  if (val->getValType().value() == ValType::TensorView) {
    const auto* tv = val->as<TensorView>();
    return tv->hasBroadcast();
  }
  return true;
};

TORCH_INTERNAL_ASSERT(isBroadcastIn(in_a));
TORCH_INTERNAL_ASSERT(isBroadcastIn(in_b));
Note for the future:
I think we should not only check that the input has a broadcast, but also add many more consistency checks. For this PR, I think it is fine, but let's create a separate follow-up PR for it. I think we should check:
- each input has two concrete IDs and one broadcast ID
- the broadcast axes of the two inputs are different
- the output has two concrete IDs and one reduction ID
- the axis of the output's reduction ID corresponds to a concrete ID in both inputs
With these checks here, we can remove the corresponding checks in the scheduler.
//! A helper for checking the layout of an MMA op's inputs. It will return an
//! optional message if the check fails.
LayoutData getInputsLayout(const MmaOp* mma_expr) {
Note for a future PR:
Does it make sense to make this a method of MmaOp?
You are right, this can be moved to a method of MmaOp. I will prepare a PR with this change.
Thanks!
zasdfgbnm left a comment:
I don't think there are any blockers, and this PR is good enough to be merged. Considering that it conflicts with lots of other matmul schedule changes, such as NN support, split-K support, etc., I think the best strategy is to merge this PR early; we can always do further improvements in follow-up PRs.
This PR is a continuation of a previous PR (link).
The main goals of this PR are:
The side goals of this PR:
registry.cpp)
The compile-time check goals:
Verification:
Platform: Ampere (GA100, 80GB), built with public CTK 11.8,
Branch sources: e4e0b2f (03.31.2023, after squashing)
1/ C++ tests
2/ torch tests
main and the results are the same):
3/ benchmarks