
Translate MatmulOp and LinearOp #2236

Merged

jacobhinkle merged 61 commits into main from translate_matmul_op on May 30, 2024

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented May 13, 2024

The purpose of this PR is to enable the NVFuser matmul scheduler to operate on the new MatmulOp and LinearOp nodes. This means the matmul scheduler can optionally schedule such segments, including ones that are not supported by ATen's matmul and linear functions.

Specifically, this PR:

  1. Adds MatmulOp and LinearOp to the set of MatmulPatterns detected. Previously only MmaOp and combined mul-sum patterns were detected.
  2. Enables translation of MatmulOp and LinearOp to fixed broadcast+MmaOp patterns.
  3. Introduces EnableOption::FuseMatmul and associated NVFUSER_ENABLE=fuse_matmul option to enable the automatic scheduler to accept matmul patterns consisting of a MatmulOp or LinearOp. By default, the matmul scheduler will not accept segments containing MatmulOp or LinearOp patterns, meaning all those nodes will be computed using the ExprEval scheduler (ATen).
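To make the translation in point 2 concrete, here is a hedged plain-Python sketch (not NVFuser API) of the broadcast+multiply+sum computation that the broadcast+MmaOp pattern expresses: for linear, out[m][n] = sum_k A[m][k] * B[n][k], plus an optional bias. The function name is my own, chosen for illustration.

```python
# Hedged sketch (not NVFuser's implementation): the broadcast+multiply+reduce
# pattern that a LinearOp translation targets, written with plain Python lists.
# A has shape [M, K], B has shape [N, K], bias (optional) has shape [N].

def linear_via_mul_sum(A, B, bias=None):
    M, K = len(A), len(A[0])
    N = len(B)
    out = []
    for m in range(M):
        row = []
        for n in range(N):
            # Conceptually: broadcast A over the N dimension and B over the M
            # dimension, multiply elementwise, then reduce over K (the MmaOp).
            acc = sum(A[m][k] * B[n][k] for k in range(K))
            if bias is not None:
                acc += bias[n]
            row.append(acc)
        out.append(row)
    return out
```

A LinearOp with weight B of shape [N, K] contracts over the last axis of both operands, which is why the translation only needs broadcasts plus a single reduction over K.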

Not all cases that are supported by these IR nodes can be translated to an MmaOp. In particular:

  1. gemv cases, where one operand is 1D.
  2. Cases with multiple batch dimensions.
  3. Cases where "batch" dimensions must be added by unsqueezing one dimension. Those new dimensions are indistinguishable from, e.g., multiple M dimensions, which we do not yet support.

These correspond to test cases in the two new tests. See the TODO comments for descriptions of cases we plan to support but cannot yet translate.

There is also a commented-out LinearOp test case with M=N=1, which should not be translated at all.

@jacobhinkle
Collaborator Author

jacobhinkle commented May 13, 2024

This test was failing (this was fixed in #2272):

// We are broadcasting to a tensor that will have too many dims
// to be valid for an MmaOp.
std::vector<bool> bcast_dims(tv0->nDims() + 2, false);
bcast_dims.at(bcast_dims.size() - 2) = true;
bcast_dims.at(bcast_dims.size() - 3) = true;
auto tv0b = broadcast(tv0t, bcast_dims);
bcast_dims.at(bcast_dims.size() - 2) = false;
bcast_dims.at(bcast_dims.size() - 3) = true;
bcast_dims.at(bcast_dims.size() - 4) = true;
auto tv1b = broadcast(tv1t, bcast_dims);
auto tv2 = mul(tv0b, tv1b);
auto tv3 = sum(tv2, {-1});
fusion.addOutput(tv3);

I haven't implemented this condition because I don't think we really want it. Rather, we would prefer to accept patterns with multiple M, N, K, or Batch dims and simply canonicalize them via reorder/merge at the beginning of scheduling. Still, if we do support that then we should update the test to actually check that they are properly scheduled; until then we should probably keep the check strict.
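To illustrate what "canonicalize them via reorder/merge" would mean in shape terms, here is a hedged plain-Python sketch (my own illustration of the dimension bookkeeping only, not NVFuser's scheduler code): an operand with two M dimensions, shape (M1, M2, K), can be flattened to a single M = M1*M2 dimension before scheduling.

```python
# Hedged illustration (assumed shapes, not NVFuser's scheduler API): merge two
# leading "M" dimensions of a nested-list tensor of shape (m1, m2, k) into one,
# producing shape (m1*m2, k), which the matmul scheduler could then handle as a
# single canonical M dimension.

def merge_leading_dims(t, m1, m2):
    assert len(t) == m1 and all(len(row) == m2 for row in t)
    # Row-major merge: index (i, j) maps to i * m2 + j, preserving order.
    return [t[i][j] for i in range(m1) for j in range(m2)]
```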

jacobhinkle added a commit that referenced this pull request May 23, 2024
This replaces the `CombineMulSum` class with `MatmulPattern` in the
Matmul scheduler. Additionally, we use these matmul patterns to
determine the problem layout, IterDomain roles, and TensorView roles.
The allocation domain is used to determine the problem layout. The
matmul scheduler is updated to reject segments whose input allocation
domains are non-trivial (until that is supported, e.g. by #2226).

Note that this does not add handling of `MatmulOp` and `LinearOp` in the
matmul scheduler. That will be done next in #2236 or similar.

---------

Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
@jacobhinkle jacobhinkle requested a review from Priya2698 May 29, 2024 23:32
@jacobhinkle
Collaborator Author

!build

Comment on lines +1636 to +1649
dim_roles[exact_graph.toGroup(A->axis(-1))] = MatmulDomain::K;
NVF_ERROR(A->nDims() > 0 && B->nDims() > 0);
size_t m_and_k_dims = 0;
if (A->nDims() == 1 && B->nDims() == 1) {
NVF_ERROR(
false, "MatmulOp node should not be created when both inputs are 1D");
} else if (A->nDims() == 1) {
// Missing M dimension
dim_roles[exact_graph.toGroup(B->axis(-1))] = MatmulDomain::N;
m_and_k_dims = 1;
} else if (B->nDims() == 1) {
// Missing N dimension
dim_roles[exact_graph.toGroup(A->axis(-2))] = MatmulDomain::M;
m_and_k_dims = 1;
Collaborator

I see. My suggestion was that the role information is embedded in the position of the iterdomain in the mapping output. For eg: out_size-3, out_size-2, out_size-1 are M, N, K respectively.

Could you clarify which approach is erroneous -- mappingMatmulOpIterDomain or this PR? Why should iS5 be M instead of batch?
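As a plain-Python restatement of the role-assignment logic in the C++ snippet above (hedged: the dict-of-positions representation is my own, not NVFuser's ExactGraph mapping), the convention for a MatmulOp is that A's last axis is K, B's last axis is N, and A's second-to-last axis is M, with a 1D operand dropping M or N respectively:

```python
# Hedged sketch of the MatmulOp dimension-role assignment discussed above.
# Positions are axis indices into A or B; a 1D operand omits its M or N role.

def matmul_dim_roles(a_ndims, b_ndims):
    assert a_ndims > 0 and b_ndims > 0
    assert not (a_ndims == 1 and b_ndims == 1), \
        "MatmulOp node should not be created when both inputs are 1D"
    # A's last axis is always the contraction (K) dimension.
    roles = {("A", a_ndims - 1): "K"}
    if a_ndims == 1:
        # gemv: missing M dimension
        roles[("B", b_ndims - 1)] = "N"
    elif b_ndims == 1:
        # gemv: missing N dimension
        roles[("A", a_ndims - 2)] = "M"
    else:
        roles[("A", a_ndims - 2)] = "M"
        roles[("B", b_ndims - 1)] = "N"
    return roles
```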

TEST_F(MatmulSchedulerTest, Llama2FFN) {
NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(7, 5, 9, 0);

for (bool enable_fusion : {false, true}) {
Collaborator

I would recommend parametrizing this; the same parametrization pattern can be used with the SegmentMatmulOpPrologue and SegmentLinearOpPrologue tests as well, to test all three combinations of schedulers listed in this comment --


// TODO: Once we can control the ExprEval and Matmul schedulers via options, run
// this test with all three combinations (with and without each scheduler, but
// at least one enabled).

Collaborator Author

Oh that's a good idea

Collaborator Author

I added a parametrization for the Llama2FFN test, but the prologue tests are currently failing when fusion is enabled, so I will leave that for a separate PR (it might be fixed by #2309).

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

Collaborator

@Priya2698 left a comment

LGTM. Thanks for addressing the comments.
