
Enable matmul scheduler in segmenter#23

Merged
drzejan2 merged 2 commits intomainfrom
ab/matmul_scheduler_in_segmenter
Apr 3, 2023
Conversation

@drzejan2
Contributor

@drzejan2 drzejan2 commented Mar 16, 2023

This PR is a continuation of a previous PR (link)

The main goals of this PR are:

  • update the registry so the segmenter can schedule matmul kernels,
  • add an initial implementation of simple heuristics for kernel schedule details, partitioned reasonably so the functionality is easier to extend in the future,
  • add a basic set of tests that exercise the new execution path,

The side goals of this PR:

  • organize matmul-related compile/runtime checks into a separate source file to avoid long builds (instead of extending registry.cpp)

The compile-time check goals:

  • start with simple checks to reject fusions that are obviously not supported
  • when a fusion meets the minimal requirements for matmul (number of fusion inputs/outputs, type of these inputs), we proceed to more complex checks that verify dependencies in the graph:
    • MmaOp inputs and their relationship with fusion inputs:
      • each MmaOp input must be the output of a broadcast expression that takes a fusion input as its input
    • MmaOp output and its relationship with fusion outputs:
      • the MmaOp output TensorView must be a fusion output
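The dependency checks described above can be sketched roughly as follows. This is a standalone illustration with simplified stand-ins (ExprKind, Node, checkMmaPattern are hypothetical names, not the actual nvFuser IR types or API):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified stand-ins for nvFuser IR nodes (illustration only).
enum class ExprKind { Broadcast, Mma, Other };

struct Node {
  ExprKind definition = ExprKind::Other;  // expr that produced this tensor
  std::vector<Node*> definition_inputs;   // inputs of that producing expr
};

// Returns an empty string when the fusion matches the supported pattern,
// otherwise a reason for rejection (mirroring the "reject with reason"
// style of the compile-time checks).
std::string checkMmaPattern(const Node& mma_out,
                            const std::vector<Node*>& mma_inputs,
                            const std::vector<Node*>& fusion_inputs,
                            const std::vector<Node*>& fusion_outputs) {
  // 1) Every MmaOp input must be produced by a broadcast expression
  //    whose input is a fusion input.
  for (const Node* in : mma_inputs) {
    if (in->definition != ExprKind::Broadcast) {
      return "MmaOp input is not produced by a broadcast";
    }
    for (const Node* bcast_in : in->definition_inputs) {
      if (std::find(fusion_inputs.begin(), fusion_inputs.end(), bcast_in) ==
          fusion_inputs.end()) {
        return "broadcast input is not a fusion input";
      }
    }
  }
  // 2) The MmaOp output must be a fusion output.
  if (std::find(fusion_outputs.begin(), fusion_outputs.end(), &mma_out) ==
      fusion_outputs.end()) {
    return "MmaOp output is not a fusion output";
  }
  return "";
}
```

The real checks walk the fusion IR rather than a toy graph, but the accept/reject structure is the same.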

Verification:

Platform: Ampere (GA100, 80GB), built with public CTK 11.8,
Branch sources: e4e0b2f (03.31.2023, after squashing)

1/ C++ tests

  • command:
./build/bin/nvfuser_tests
  • status (failed tests on the prepared branch):
// all tests passed

2/ torch tests

  • command:
python third_party/nvfuser/python_tests/test_torchscript.py
  • status (failed tests on the prepared branch; checked against current main, the results are the same):
----------------------------------------------------------------------
Ran 8221 tests in 1035.377s

FAILED (errors=153, skipped=8064)

3/ benchmarks

  • command:
./build/bin/nvfuser_bench --benchmark_filter=NvFuserScheduler
  • status (failed tests on the prepared branch):
// no issues found

@drzejan2 drzejan2 requested review from naoyam and zasdfgbnm and removed request for naoyam and zasdfgbnm March 16, 2023 13:12
@drzejan2 drzejan2 marked this pull request as draft March 16, 2023 13:18
@drzejan2
Contributor Author

The following issues need to be addressed before this PR can leave draft state:
1/ update handling of mma op in scheduleMatmul (comment)
2/ update pattern matching in compile-time checks (comment)

@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 6fc6cb2 to 81e3fc1 Compare March 17, 2023 08:20
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add test for checking matmul schedule integration with segmenter,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 81e3fc1 to 4d0aa43 Compare March 21, 2023 13:57
drzejan2 added a commit that referenced this pull request Mar 21, 2023
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo matmul scheduler heuristic structure
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 21, 2023
- redo matmul scheduler heuristic structure
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 37a8e3b to 2ea779a Compare March 21, 2023 16:55
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo compile-time checks for matmul scheduler
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo matmul scheduler heuristic structure
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 2ea779a to 4970db2 Compare March 22, 2023 10:18
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- add flag for rasterization order in matmul heuristic params
@drzejan2 drzejan2 marked this pull request as ready for review March 22, 2023 19:09
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
drzejan2 added a commit that referenced this pull request Mar 22, 2023
- redo compile-time checks for matmul scheduler
@drzejan2 drzejan2 mentioned this pull request Mar 29, 2023
@drzejan2 drzejan2 marked this pull request as draft March 30, 2023 09:42
drzejan2 added a commit that referenced this pull request Mar 30, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add test for checking matmul schedule integration with segmenter,
- fix documentation
- add flag for rasterization order in matmul heuristic params,
- add code for calculating index mode,
- add code for calculating problem shape,
- post review changes,
- fix clang-tidy warnings in modified source files,
drzejan2 added a commit that referenced this pull request Mar 30, 2023
- add support for grid swizzle in matmul kernels,
- add guards for MmaOps in schedulers other than matmul,
- add minor updates,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from f023b77 to 6d7115c Compare March 30, 2023 10:44
drzejan2 added a commit that referenced this pull request Mar 30, 2023
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 6d7115c to 41c5217 Compare March 30, 2023 11:05
@mmigdal-nv mmigdal-nv self-requested a review March 30, 2023 13:01
@drzejan2
Contributor Author

The scope of the latest changes in the PR:

  • only the matmul scheduler can process fusions that contain an MmaOp expr; other schedulers will reject such fusions,
  • simplify the hashing functions used in matmul heuristic params (flags are combined and then hashed)
  • the MmaOp expression constructor now checks whether the provided inputs contain broadcast domains (previously this was defined as a rule in the compile-time checks of the matmul scheduler)
  • remove the check for the memory location of fusion inputs/outputs and MmaOp inputs from the rules in the compile-time checks for the matmul scheduler (fusion inputs/outputs are, by definition, always in global memory; MmaOp inputs can be redefined based on scheduler decisions),
  • simplified dependency checks for MmaOp inputs and outputs:
    • inputs: must be the output of a Broadcast expression whose input is one of the fusion inputs
    • output: must be a fusion output
  • simplified rules for fusion inputs/outputs in the compile-time checks for the matmul scheduler
    • explicit expectation for the number of inputs (2) and outputs (1)
    • inputs/outputs may only be TensorViews
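The "combine flags, then hash once" simplification can be illustrated as follows. MatmulFlags and its fields are hypothetical names for illustration, not the actual nvFuser param struct:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical flag set of a heuristic params struct (illustration only).
struct MatmulFlags {
  bool use_loop_rotation = false;
  bool swizzle_grid = false;
  uint8_t rasterization_order = 0;  // e.g. 0 = row-major, 1 = column-major

  size_t hash() const {
    // Pack all flags into a single integer and hash once, instead of
    // hashing each field separately and combining the results.
    uint64_t packed = (uint64_t{use_loop_rotation} << 0) |
                      (uint64_t{swizzle_grid} << 1) |
                      (uint64_t{rasterization_order} << 2);
    return std::hash<uint64_t>{}(packed);
  }
};
```

Packing first keeps the hash cheap and makes it trivially consistent with equality over the flag set.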

The current status of verification: all tests have been restarted; I will update the PR description when they are done.

@zasdfgbnm and @naoyam for visibility of applied changes.

drzejan2 added a commit that referenced this pull request Mar 31, 2023
- connect matmul scheduler in segmenter with implementation of
  matmul scheduling structures,
- update matmul params to store key items needed for matmul
  scheduling (based on prototype param structure),
- add matmul compile/runtime checks in separate source file,
- apply improvement in matmul instruction scheduling with loop
  rotation (changes from #2488),
- add initial heuristics for matmul scheduler in segmenter,
- add implementation of helper functions for matmul heuristics,
- add dedicated debug logger,
- add tests for checking matmul schedule integration with segmenter,
- fix documentation
- add code for calculating index mode,
- add code for calculating problem shape,
- fix clang-tidy warnings in modified source files,
- add guards for MmaOps in schedulers other than matmul,
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from 41c5217 to e4e0b2f Compare March 31, 2023 13:48
@drzejan2 drzejan2 marked this pull request as ready for review March 31, 2023 14:31
return ss.str();
}

size_t hash() const {
Collaborator

Very naive, but why not go with std::hash<> specializations instead?

Collaborator

I agree. We should change that (but not in this PR), not only for matmul, but for all schedulers.

Contributor Author

Such a specialization will require extending namespace std, which I'm a little bit hesitant to do (although that is the way we are supposed to do it, if I'm not wrong). Maybe it's not a bad idea, though: all matmul-related structs that require hashing could have their own specializations. This can be checked and done as a follow-up to this PR. Does that sound ok?
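For reference, the std::hash<> specialization approach discussed here would look roughly like this sketch. MatmulParams and its fields are made-up stand-ins, and the mixing step is the common Boost-style hash_combine, not code from this repo:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical params struct (stand-in for illustration).
struct MatmulParams {
  int tile_m = 128;
  int tile_n = 128;
  bool double_buffer = true;
};

// Specializing std::hash for a program-defined type is one of the few
// sanctioned ways to extend namespace std.
template <>
struct std::hash<MatmulParams> {
  size_t operator()(const MatmulParams& p) const noexcept {
    size_t h = std::hash<int>{}(p.tile_m);
    // Boost-style hash_combine to mix in the remaining fields.
    h ^= std::hash<int>{}(p.tile_n) + 0x9e3779b9 + (h << 6) + (h >> 2);
    h ^= std::hash<bool>{}(p.double_buffer) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
  }
};
```

With this in place the params can be used directly as keys in std::unordered_map without a custom hasher argument.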

drzejan2 added a commit that referenced this pull request Apr 3, 2023
@drzejan2 drzejan2 force-pushed the ab/matmul_scheduler_in_segmenter branch from e4e0b2f to fc6be09 Compare April 3, 2023 15:54
Collaborator

@zasdfgbnm zasdfgbnm left a comment

Left a few more comments.

Comment on lines +77 to +79
TensorView* a = inputs[0]->as<TensorView>();
TensorView* b = inputs[1]->as<TensorView>();
TensorView* c = outputs[0]->as<TensorView>();
Collaborator

It's fine to add it in a follow-up PR. The more important thing is that we will eventually have that test; as long as we make sure of that, it doesn't matter which PR it goes in.

return (tvs.size() == 1) ? tvs[0] : nullptr;
};

const auto* tv_input_A =
Collaborator

A note for the future: I think we will have to refactor this to cache tv_input_A, which should be used for scheduleMatmul https://github.com/NVIDIA/Fuser/pull/23/files#r1151864983

// #2
{
for (const auto* mma_expr : mma_exprs) {
const auto layout_data = getInputsLayout(mma_expr);
Collaborator

note to future: the result of this should be cached and reused in heuristics and scheduleMatmul

Comment on lines +448 to +450
if (fusion_outputs_tvs.size() != fusion_outputs.size()) {
return "Fusion has output which is not a TensorView object";
}
Collaborator

Currently we only support tensors as fusion outputs. This is not just for matmul, but for all problems.

Fuser/csrc/executor.cpp

Lines 193 to 195 in 12f5c9d

TORCH_INTERNAL_ASSERT(
out->getValType() == ValType::TensorView,
"Output types from fusions that are not tensors are not supported at this point.");

Contributor Author

Good to know - in the follow-up PR, with code cleanup, I will remove this check in matmul_utils.cpp.

Thanks for letting me know about executor's requirement.

Comment on lines +479 to +481
if (!tv->hasReduction()) {
return "Fusion output TV has no reduction domain";
}
Collaborator

Why can't the fusion output have a reduction? Isn't the fusion output also the output of the mma op, which does have a reduction?

Collaborator

nvm, I misread this code. You are checking that "there must be a reduction", instead of "there can not be a reduction."

// 'SchedulerRuntimeInfo'
#include <executor_utils.h>

#include <sstream>
Collaborator

Is this used?

Contributor Author

It looks like an artifact from one of the older revisions of the compile/runtime checks (it was used to generate the reason message for a check failure). I will remove it in a follow-up PR.

Thanks!

Comment on lines +2074 to +2078
// Check that inputs of all select/gather-like ops are fusion inputs
if (rejectScheduleForSelectLikeOps(fusion, ScheduleHeuristic::Matmul)) {
return false;
}

Collaborator

Because we are not doing any fusion right now, there will automatically be no select/index_select/gather.

Comment on lines +1354 to +1363
const auto isBroadcastIn = [](const Val* val) {
if (val->getValType().value() == ValType::TensorView) {
const auto* tv = val->as<TensorView>();
return tv->hasBroadcast();
}
return true;
};

TORCH_INTERNAL_ASSERT(isBroadcastIn(in_a));
TORCH_INTERNAL_ASSERT(isBroadcastIn(in_b));
Collaborator

Note for the future:

I think we should not only check that the input has a broadcast, but also add many more consistency checks. For this PR, I think it is fine, but let's create a separate follow-up PR for it:

I think we should check:

  • Each input has two concrete IDs and one broadcast ID
  • The broadcast axes of the two inputs are different
  • The output has two concrete IDs and one reduction ID
  • The axis of the output reduction ID must correspond to a concrete ID in both inputs

And with this check here, we can remove corresponding checks in the scheduler.
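The proposed consistency checks could be sketched along these lines. IterType and Domain are simplified stand-ins for the real IterDomain machinery, and the sketch assumes same-rank domains whose axes correspond positionally:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for IterDomain iteration types (illustration only).
enum class IterType { Concrete, Broadcast, Reduction };
using Domain = std::vector<IterType>;

// Checks the four conditions listed above, assuming positional axis mapping.
bool checkMmaDomains(const Domain& in_a, const Domain& in_b,
                     const Domain& out) {
  auto count = [](const Domain& d, IterType t) {
    int n = 0;
    for (IterType it : d) n += (it == t);
    return n;
  };
  auto axisOf = [](const Domain& d, IterType t) {
    for (size_t i = 0; i < d.size(); ++i)
      if (d[i] == t) return static_cast<int>(i);
    return -1;
  };
  // Each input has two concrete IDs and one broadcast ID.
  if (count(in_a, IterType::Concrete) != 2 ||
      count(in_a, IterType::Broadcast) != 1) return false;
  if (count(in_b, IterType::Concrete) != 2 ||
      count(in_b, IterType::Broadcast) != 1) return false;
  // The broadcast axes of the two inputs must differ.
  if (axisOf(in_a, IterType::Broadcast) == axisOf(in_b, IterType::Broadcast))
    return false;
  // The output has two concrete IDs and one reduction ID.
  if (count(out, IterType::Concrete) != 2 ||
      count(out, IterType::Reduction) != 1) return false;
  // The reduction axis must be concrete in both inputs.
  int k = axisOf(out, IterType::Reduction);
  return in_a[k] == IterType::Concrete && in_b[k] == IterType::Concrete;
}
```

For example, a canonical [M, N, K] setup (A broadcast on N, B broadcast on M, output reducing K) passes, while two inputs broadcast on the same axis are rejected.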


//! A helper for checking the layout of an MMA op's inputs. It returns an
//! optional message if the check fails.
LayoutData getInputsLayout(const MmaOp* mma_expr) {
Collaborator

Note for future PR:
Does it make sense to make this a method of MmaOp?

Contributor Author

You are right, this can be moved to be a method of MmaOp. I will prepare a PR with this change.

Thanks!

Collaborator

@zasdfgbnm zasdfgbnm left a comment

I don't think there are any blockers, and this PR is good enough to get merged. Considering that this PR conflicts with lots of other matmul scheduler changes, such as NN support, split-K support, etc., I think the best strategy is to merge this PR early; we can always make further improvements in follow-up PRs.
