
Enable matmul scheduler in segmenter#23

Merged
drzejan2 merged 2 commits intomainfrom
ab/matmul_scheduler_in_segmenter
Apr 3, 2023
Conversation

@drzejan2
Contributor

@drzejan2 drzejan2 commented Mar 16, 2023

This PR is a continuation of a previous PR (link)

The main goals of this PR are:

  • update the registry so the segmenter can schedule matmul kernels,
  • add an initial implementation of simple heuristics for kernel schedule details, partitioned reasonably so the functionality is easier to extend in the future,
  • add a basic set of tests that exercise the new execution path,

The side goals of this PR:

  • organize matmul-related compile/runtime checks into a separate source file to avoid long builds (instead of extending registry.cpp)

The compile-time check goals:

  • start with simple checks to reject fusions that are obviously not supported
  • when a fusion meets the minimal requirements for matmul (number of fusion inputs/outputs, type of these inputs), we proceed to more complex checks that verify dependencies in the graph:
    • MmaOp inputs and their relationship with fusion inputs:
      • each MmaOp input must be the output of a broadcast expression that takes a fusion input as its input
    • MmaOp output and its relationship with fusion outputs:
      • the MmaOp output TensorView must be a fusion output
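The dependency checks described above can be sketched roughly as follows. This is a standalone illustration with simplified stand-ins (ExprKind, Node, checkMmaPattern are hypothetical names, not the actual nvFuser IR types or API):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified stand-ins for nvFuser IR nodes (illustration only).
enum class ExprKind { Broadcast, Mma, Other };

struct Node {
  ExprKind definition = ExprKind::Other;  // expr that produced this tensor
  std::vector<Node*> definition_inputs;   // inputs of that producing expr
};

// Returns an empty string when the fusion matches the supported pattern,
// otherwise a reason for rejection (mirroring the "reject with reason"
// style of the compile-time checks).
std::string checkMmaPattern(const Node& mma_out,
                            const std::vector<Node*>& mma_inputs,
                            const std::vector<Node*>& fusion_inputs,
                            const std::vector<Node*>& fusion_outputs) {
  // 1) Every MmaOp input must be produced by a broadcast expression
  //    whose input is a fusion input.
  for (const Node* in : mma_inputs) {
    if (in->definition != ExprKind::Broadcast) {
      return "MmaOp input is not produced by a broadcast";
    }
    for (const Node* bcast_in : in->definition_inputs) {
      if (std::find(fusion_inputs.begin(), fusion_inputs.end(), bcast_in) ==
          fusion_inputs.end()) {
        return "broadcast input is not a fusion input";
      }
    }
  }
  // 2) The MmaOp output must be a fusion output.
  if (std::find(fusion_outputs.begin(), fusion_outputs.end(), &mma_out) ==
      fusion_outputs.end()) {
    return "MmaOp output is not a fusion output";
  }
  return "";
}
```

The real checks walk the fusion IR rather than a toy graph, but the accept/reject structure is the same.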

Verification:

Platform: Ampere (GA100, 80GB), built with public CTK 11.8,
Branch sources: e4e0b2f (03.31.2023, after squashing)

1/ C++ tests

  • command:
./build/bin/nvfuser_tests
  • status (failed tests on the prepared branch):
// all tests passed

2/ torch tests

  • command:
python third_party/nvfuser/python_tests/test_torchscript.py
  • status (failed tests on the prepared branch; checked against current main, the results are the same):
----------------------------------------------------------------------
Ran 8221 tests in 1035.377s

FAILED (errors=153, skipped=8064)

3/ benchmarks

  • command:
./build/bin/nvfuser_bench --benchmark_filter=NvFuserScheduler
  • status (failed tests on the prepared branch):
// no issues found

@drzejan2 drzejan2 requested review from naoyam and zasdfgbnm and removed request for naoyam and zasdfgbnm March 16, 2023 13:12
@drzejan2 drzejan2 marked this pull request as draft March 16, 2023 13:18
@drzejan2
Contributor Author

The following issues need to be addressed before this PR can leave draft state:
1/ update handling of mma op in scheduleMatmul (comment)
2/ update pattern matching in compile-time checks (comment)

@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 6fc6cb2 to 81e3fc1 Compare March 17, 2023 08:20
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add test for checking matmul schedule integration with segmenter,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 81e3fc1 to 4d0aa43 Compare March 21, 2023 13:57
drzejan2 added a commit that referenced this pull request Mar 21, 2023
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo matmul scheduler heuristic structure
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo matmul scheduler heuristic structure
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 37a8e3b to 2ea779a Compare March 21, 2023 16:55
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo matmul scheduler heuristic structure
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 2ea779a to 4970db2 Compare March 22, 2023 10:18
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- add flag for rasterization order in matmul heuristic params
@drzejan2 drzejan2 marked this pull request as ready for review March 22, 2023 19:09
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo compile-time checks for matmul scheduler
@drzejan2 drzejan2 mentioned this pull request Mar 29, 2023
@drzejan2 drzejan2 marked this pull request as draft March 30, 2023 09:42
drzejan2 added a commit that referenced this pull request Mar 30, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add test for checking matmul schedule integration with segmenter,
- fix documentation
- add flag for rasterization order in matmul heuristic params,
- add code for calculating index mode,
- add code for calculating problem shape,
- post review changes,
- fix clang-tidy warnings in modified source files,
drzejan2 added a commit that referenced this pull request Mar 30, 2023
- add support for grid swizzle in matmul kernels,
- add guards for MmaOps in schedulers other than matmul,
- add minor updates,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from f023b77 to 6d7115c Compare March 30, 2023 10:44
drzejan2 added a commit that referenced this pull request Mar 30, 2023
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 6d7115c to 41c5217 Compare March 30, 2023 11:05
@mmigdal-nv mmigdal-nv self-requested a review March 30, 2023 13:01
@drzejan2
Contributor Author

The scope of the latest changes in the PR:

  • only the matmul scheduler can process fusions that contain an MmaOp expr; other schedulers will reject such fusions,
  • simplify the hashing functions used in matmul heuristic params (flags are combined and then hashed)
  • the MmaOp expression constructor now checks whether the provided inputs contain broadcast domains (previously this was defined as a rule in the compile-time checks of the matmul scheduler)
  • remove the check for the memory location of fusion inputs/outputs and MmaOp inputs from the rules in the compile-time checks for the matmul scheduler (fusion inputs/outputs are, by definition, always in global memory; MmaOp inputs can be redefined based on scheduler decisions),
  • simplified dependency checks for MmaOp inputs and outputs:
    • inputs: must be the output of a Broadcast expression whose input is one of the fusion inputs
    • output: must be a fusion output
  • simplified rules for fusion inputs/outputs in the compile-time checks for the matmul scheduler
    • explicit expectation for the number of inputs (2) and outputs (1)
    • inputs/outputs may only be TensorViews
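The "combine flags, then hash once" simplification can be illustrated as follows. MatmulFlags and its fields are hypothetical names for illustration, not the actual nvFuser param struct:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical flag set of a heuristic params struct (illustration only).
struct MatmulFlags {
  bool use_loop_rotation = false;
  bool swizzle_grid = false;
  uint8_t rasterization_order = 0;  // e.g. 0 = row-major, 1 = column-major

  size_t hash() const {
    // Pack all flags into a single integer and hash once, instead of
    // hashing each field separately and combining the results.
    uint64_t packed = (uint64_t{use_loop_rotation} << 0) |
                      (uint64_t{swizzle_grid} << 1) |
                      (uint64_t{rasterization_order} << 2);
    return std::hash<uint64_t>{}(packed);
  }
};
```

Packing first keeps the hash cheap and makes it trivially consistent with equality over the flag set.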

The current status of verification: all tests have been restarted; I will update the PR description when they are done.

@zasdfgbnm and @naoyam for visibility of applied changes.

drzejan2 added a commit that referenced this pull request Mar 31, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling structures,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add tests for checking matmul schedule integration with segmenter,
- fix documentation
- add code for calculating index mode,
- add code for calculating problem shape,
- fix clang-tidy warnings in modified source files,
- add guards for MmaOps in schedulers other than matmul,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 41c5217 to e4e0b2f Compare March 31, 2023 13:48
@drzejan2 drzejan2 marked this pull request as ready for review March 31, 2023 14:31
return ss.str();
}

size_t hash() const {
Collaborator

Very naive, but why not go with std::hash<> specializations instead?

Collaborator

I agree. We should change that (but not in this PR), not only for matmul, but for all schedulers.

Contributor Author

Such a specialization will require extending namespace std, which I'm a little bit hesitant to do (although that is the way we are supposed to do it, if I'm not wrong). Maybe it's not a bad idea, though: all matmul-related structs that require hashing could have their own specializations. This can be checked and done as a follow-up to this PR. Does that sound ok?
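For reference, the std::hash<> specialization approach discussed here would look roughly like this sketch. MatmulParams and its fields are made-up stand-ins, and the mixing step is the common Boost-style hash_combine, not code from this repo:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical params struct (stand-in for illustration).
struct MatmulParams {
  int tile_m = 128;
  int tile_n = 128;
  bool double_buffer = true;
};

// Specializing std::hash for a program-defined type is one of the few
// sanctioned ways to extend namespace std.
template <>
struct std::hash<MatmulParams> {
  size_t operator()(const MatmulParams& p) const noexcept {
    size_t h = std::hash<int>{}(p.tile_m);
    // Boost-style hash_combine to mix in the remaining fields.
    h ^= std::hash<int>{}(p.tile_n) + 0x9e3779b9 + (h << 6) + (h >> 2);
    h ^= std::hash<bool>{}(p.double_buffer) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
  }
};
```

With this in place the params can be used directly as keys in std::unordered_map without a custom hasher argument.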

drzejan2 added a commit that referenced this pull request Apr 3, 2023
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from e4e0b2f to fc6be09 Compare April 3, 2023 15:54
Collaborator

@zasdfgbnm zasdfgbnm left a comment

Left a few more comments.

Comment on lines +77 to +79
TensorView* a = inputs[0]->as<TensorView>();
TensorView* b = inputs[1]->as<TensorView>();
TensorView* c = outputs[0]->as<TensorView>();
Collaborator

It's fine to add it in a follow-up PR. The more important thing is that we will eventually have that test; as long as we make sure of that, it doesn't matter which PR it goes in.

return (tvs.size() == 1) ? tvs[0] : nullptr;
};

const auto* tv_input_A =
Collaborator

A note for the future: I think we will have to refactor this to cache tv_input_A, which should be used for scheduleMatmul https://github.com/NVIDIA/Fuser/pull/23/files#r1151864983

// #2
{
for (const auto* mma_expr : mma_exprs) {
const auto layout_data = getInputsLayout(mma_expr);
Collaborator

note to future: the result of this should be cached and reused in heuristics and scheduleMatmul

Comment on lines +448 to +450
if (fusion_outputs_tvs.size() != fusion_outputs.size()) {
return "Fusion has output which is not a TensorView object";
}
Collaborator

Currently we only support tensors as fusion outputs. This is not just for matmul, but for all problems.

Fuser/csrc/executor.cpp

Lines 193 to 195 in 12f5c9d

TORCH_INTERNAL_ASSERT(
out->getValType() == ValType::TensorView,
"Output types from fusions that are not tensors are not supported at this point.");

Contributor Author

Good to know - in the follow-up PR, with code cleanup, I will remove this check in matmul_utils.cpp.

Thanks for letting me know about executor's requirement.

Comment on lines +479 to +481
if (!tv->hasReduction()) {
return "Fusion output TV has no reduction domain";
}
Collaborator

Why can't the fusion output have a reduction? Isn't the fusion output also the output of the mma op, which does have a reduction?

Collaborator

nvm, I misread this code. You are checking that "there must be a reduction", instead of "there can not be a reduction."

// 'SchedulerRuntimeInfo'
#include <executor_utils.h>

#include <sstream>
Collaborator

Is this used?

Contributor Author

It looks like an artifact from one of the older revisions of the compile/runtime checks (it was used to generate the reason message for a check failure). I will remove it in a follow-up PR.

Thanks!

Comment on lines +2074 to +2078
// Check that inputs of all select/gather-like ops are fusion inputs
if (rejectScheduleForSelectLikeOps(fusion, ScheduleHeuristic::Matmul)) {
return false;
}

Collaborator

Because we are not doing any fusion right now, there will automatically be no select/index_select/gather.

Comment on lines +1354 to +1363
const auto isBroadcastIn = [](const Val* val) {
if (val->getValType().value() == ValType::TensorView) {
const auto* tv = val->as<TensorView>();
return tv->hasBroadcast();
}
return true;
};

TORCH_INTERNAL_ASSERT(isBroadcastIn(in_a));
TORCH_INTERNAL_ASSERT(isBroadcastIn(in_b));
Collaborator

Note for the future:

I think we should not only check that the input has a broadcast, but also add many more consistency checks. For this PR, I think it is fine, but let's create a separate follow-up PR for it:

I think we should check:

  • Each input has two concrete IDs and one broadcast ID
  • The broadcast axes of the two inputs are different
  • The output has two concrete IDs and one reduction ID
  • The axis of the output reduction ID must correspond to a concrete ID in both inputs

And with this check here, we can remove corresponding checks in the scheduler.
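The proposed consistency checks could be sketched along these lines. IterType and Domain are simplified stand-ins for the real IterDomain machinery, and the sketch assumes same-rank domains whose axes correspond positionally:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for IterDomain iteration types (illustration only).
enum class IterType { Concrete, Broadcast, Reduction };
using Domain = std::vector<IterType>;

// Checks the four conditions listed above, assuming positional axis mapping.
bool checkMmaDomains(const Domain& in_a, const Domain& in_b,
                     const Domain& out) {
  auto count = [](const Domain& d, IterType t) {
    int n = 0;
    for (IterType it : d) n += (it == t);
    return n;
  };
  auto axisOf = [](const Domain& d, IterType t) {
    for (size_t i = 0; i < d.size(); ++i)
      if (d[i] == t) return static_cast<int>(i);
    return -1;
  };
  // Each input has two concrete IDs and one broadcast ID.
  if (count(in_a, IterType::Concrete) != 2 ||
      count(in_a, IterType::Broadcast) != 1) return false;
  if (count(in_b, IterType::Concrete) != 2 ||
      count(in_b, IterType::Broadcast) != 1) return false;
  // The broadcast axes of the two inputs must differ.
  if (axisOf(in_a, IterType::Broadcast) == axisOf(in_b, IterType::Broadcast))
    return false;
  // The output has two concrete IDs and one reduction ID.
  if (count(out, IterType::Concrete) != 2 ||
      count(out, IterType::Reduction) != 1) return false;
  // The reduction axis must be concrete in both inputs.
  int k = axisOf(out, IterType::Reduction);
  return in_a[k] == IterType::Concrete && in_b[k] == IterType::Concrete;
}
```

For example, a canonical [M, N, K] setup (A broadcast on N, B broadcast on M, output reducing K) passes, while two inputs broadcast on the same axis are rejected.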


//! A helper for checking the layout of an MMA op's inputs. It returns an
//! optional message if the check fails.
LayoutData getInputsLayout(const MmaOp* mma_expr) {
Collaborator

Note for future PR:
Does it make sense to make this a method of MmaOp?

Contributor Author

You are right, this can be moved to be a method of MmaOp. I will prepare a PR with this change.

Thanks!

Collaborator

@zasdfgbnm zasdfgbnm left a comment

I don't think there are any blockers, and this PR is good enough to get merged. Considering that this PR conflicts with lots of other matmul scheduler changes, such as NN support, split-K support, etc., I think the best strategy is to merge this PR early; we can always make further improvements in follow-up PRs.
