Generalize CombineMulSum as MatmulPatterns #2272

Merged
jacobhinkle merged 33 commits into main from matmul_patterns on May 23, 2024
Conversation

@jacobhinkle
Collaborator

This replaces the CombineMulSum class with MatmulPattern in the matmul scheduler. Additionally, we use these matmul patterns to determine the problem layout, IterDomain roles, and TensorView roles. The allocation domain is used to determine the problem layout. The matmul scheduler is updated to reject segments whose input allocation domains are non-trivial (until that is supported, e.g. by #2226).

Note that this does not add handling of MatmulOp and LinearOp in the matmul scheduler. That will be done next in #2236 or similar.

This also uses IdModel to find IterDomain and TensorView roles, and checks the
allocation domain to find the problem layout. We add a guard to reject
problems that have non-trivial input allocation domains.
This will go in another PR
return 1;
}

bool hasTrivialAllocationDomain(const TensorView* tv) {
jacobhinkle (author):

The intention of this utility is to generalize !tv->hasAllocation() to cases where an allocation domain is provided, but it actually corresponds to the no-reductions rfactor domain (ignoring broadcasts).

Collaborator:

Thanks! Would it be possible to add this as a comment in the header file?

const std::string& desc) {
// TODO: revise rules when add support for batch gemms
NVF_ERROR(details.bcasts.empty(), desc, ": has broadcast domains.");
// NVF_ERROR(details.bcasts.empty(), desc, ": has broadcast domains.");
jacobhinkle (author):

Fixes #2273. See the test AmpereMulSumToMatmul_MultipleBroadcasts.

@jacobhinkle jacobhinkle marked this pull request as ready for review May 20, 2024 19:39
@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

//! and LinearOp, the output is the same dtype as the inputs; so output does not
//! necessarily correspond to the output of a translated MmaOp and it might not
//! be a fusion output.
struct MatmulPattern {
Collaborator:

Can we rename this API to be more intuitive? MatmulPattern makes me think of one of MmaOp/MatmulOp/LinearOp, but this represents a description instead.

jacobhinkle (author):

In the next PR it will include MatmulOp and LinearOp and perform that translation, so it really is meant to generalize those.

MatmulRole role) -> InnerDomResult {
const auto role_it = roles_map.find(role);
if (role_it == roles_map.end()) {
return {MatmulDomain::M, "Could not find role in roles_map"};
Collaborator:

The error message here is confusing. Why is this MatmulDomain::M? Inner dimension can be M/N. Similarly for the other errors in this lambda.

jacobhinkle (author):

You're right; this InnerDomResult business is a hack to get around the design of DataWrapperOpt. I want to just return the error message in this case, but using DataWrapperOpt doesn't work properly in clang (haven't tried gcc) when the wrapped type is trivially copyable, since it balks at using std::move on such types.

Collaborator:

What about using `using InnerDomResult = std::pair<std::optional<MatmulDomain>, std::string>;` and returning std::nullopt instead of MatmulDomain::A?

jacobhinkle (author):

I just pushed a change that uses a bare `std::variant<std::string, UnitDim>`.

jacobhinkle and others added 4 commits May 21, 2024 14:01
We can still refuse to schedule, but these are valid patterns
Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
@zasdfgbnm
Collaborator

Could you rebase this PR? I see obsolete code removed by #2268 in this PR.

// (bit 2)
using ValGroupPresence = std::bitset<3>;

std::unordered_map<ValGroup, ValGroupPresence> present_flags;
Collaborator:

nit: is membership_flags a better name since your comment uses that terminology?

jacobhinkle (author):

Actually, I just renamed ValGroupPresence to DimPresence and updated the comment to no longer mention "membership", since I think that's a more opaque term than "presence". What we really care about is whether a dimension is present in each tensor, so that term is clearer.

if (has_m && has_n) {
storage.push_back(entry.first);
}
// NOTE: sort output roles in descending order by uses() size, and
Collaborator:

Why is sorting the tvs important -- is there a place where we rely on this ordering to be same everywhere?

Collaborator:

IIRC, the order is important because we want deterministic behavior. Otherwise there will be a slight change in the variable names in the generated code from run to run.

jacobhinkle (author):

Exactly. We sometimes iterate over role TVs. If we did not sort, then the ->name()s of introduced Vals could be ordered arbitrarily, for example. Maintaining deterministic compiled code is helpful for the codediff tool, and also for keeping our sanity when debugging problem fusions.

jacobhinkle added a commit that referenced this pull request May 22, 2024
Thanks to the suggestion by @zasdfgbnm while reviewing #2272, I found
some additional cases where we took a short-cut to updating bools using
the bitwise assignment ops. This is not ideal since its behavior is
undefined (there's no guarantee that the underlying representation of
`true` is `b1` and not `b10` or any other non-zero value). More
importantly, writing `a |= b` as `a = a || b` allows us to short-circuit
if `a == true`. Using bitwise `a |= b`, `b` will always be evaluated.
mma_utils::MatmulPattern& pattern = patterns.front();

// IdModel is used to analyze problem shape & layout
IdModel id_model(fusion);
Collaborator:

Is getMatmulHeuristics a hot path? We may want to cache the IdModel (in a separate PR), like:

auto domain_map_entry =
HeuristicSummaryEntry<HeuristicCompileTime::DomainMap>(
data_cache,
[fusion]() { return std::make_unique<DomainMap>(fusion); });
const auto& domain_map = dynamic_cast<DomainMap&>(domain_map_entry.get());

jacobhinkle (author):

That's a good idea. Currently I build an IdModel in getMatmulHeuristics then again in scheduleMatmul, and actually I also need to rebuild it after translating MatmulPatterns to MmaOps. I think this use case is very similar to the DomainMap in the pointwise scheduler that you linked to; we just use IdModel instead of ComputeAtMap.

@jacobhinkle jacobhinkle merged commit 9153612 into main May 23, 2024
@jacobhinkle jacobhinkle deleted the matmul_patterns branch May 23, 2024 00:14
jacobhinkle added a commit that referenced this pull request May 23, 2024
This was a change I made to handle casts that wound up breaking some
tests and benchmarks in #2272, leading to dynamic cast errors or
segfaults. The solution is to test the type of the left and right hand
sides before processing the pattern matching.
jacobhinkle added a commit that referenced this pull request May 24, 2024
This fixes a bug introduced by #2272 in `test_multidevice` where we
reject a matmul segment shaped like `[iDIDxMo, iMi, bN, iK]` for having
too many M dimensions. Locally this still has a single M dimension so it
is valid. This PR ignores device dims for the purposes of computing
tensor roles and problem shape.

Further issues we should look into:
1. As mentioned in #2272 we should proceed to handle multiple M, N, K,
   and Batch dimensions, although in this case the restriction was
   useful for surfacing this bug.
2. Even if the matmul scheduler is completely broken or disabled, the
   _reduction_ scheduler should have been able to schedule this fusion.
   However, it identified the reduction tensor as `isResharding` and
   removed it from the `reduction_tvs` list, causing a failure in
   `scheduleReduction`. We should clean up that check to be able to
   schedule this type of fusion as a reduction.
3. The rfactor domain is often used for scheduling utilities to inspect
   the logical size of tensors. However, because multidevice scheduling
   modifies the leaf domain before segmentation, we should probably
   audit our schedulers to ensure they use the leaf domain and ignore
   device dims where necessary.
4. I should also not forget to rerun `!build` before merging PRs :-).
jacobhinkle added a commit that referenced this pull request May 27, 2024
This fixes a bug introduced by #2272 in `test_multidevice` where we
reject a matmul segment shaped like `[iDIDxMo, iMi, bN, iK]` for having
too many M dimensions. Locally this still has a single M dimension so it
is valid. This PR ignores device dims for the purposes of computing
tensor roles and problem shape.

Further issues we should look into:
1. As mentioned in #2272 we should proceed to handle multiple M, N, K,
and Batch dimensions, although in this case the restriction was useful
for surfacing this bug.
2. Even if the matmul scheduler is completely broken or disabled, the
_reduction_ scheduler should have been able to schedule this fusion.
However, it identified the reduction tensor as `isResharding` and
removed it from the `reduction_tvs` list, causing a failure in
`scheduleReduction`. We should clean up that check to be able to
schedule this type of fusion as a reduction.
3. Inside the matmul scheduler we call `canonicalizeMmaTvOrdering` which
I believe still uses rfactor domain to determine domain ordering.
Instead this should be updated to use dim roles that are already
computed from the `MatmulPattern`.
4. The rfactor domain is often used for scheduling utilities to inspect
the logical size of tensors. However, because multidevice scheduling
modifies the leaf domain before segmentation, we should probably audit
our schedulers to ensure they use the leaf domain and ignore device dims
where necessary.
5. I should also not forget to rerun `!build` before merging PRs
:sweat_smile: