Enable shared memory reuse in matmul epilogue by jacobhinkle · Pull Request #770 · NVIDIA/Fuser

jacobhinkle · 2023-08-23T14:54:37Z

This uses promoteReuse from #739 and, when possible, inserts a __syncthreads() just before the epilogue loop if smem is used for the epilogue. The matmul heuristic attempts to predict when this will be possible in order to estimate shared memory usage, and hence occupancy, more accurately. If we cannot guarantee re-use, the heuristic must assume that memory will not be reclaimed, even though it might be when the fusion is lowered.

Shared memory reclamation can only occur if the smem buffers have non-overlapping lifetimes. This is difficult to guarantee before scheduling and lowering. We use cacheAfter to create the a and b smem tiles, but we use cacheBefore for the epilogue smem tile. This means that smem will be used for any downstream uses of a and b but the epilogue smem will have its lifetime restricted to the epilogue itself, regardless of downstream uses of the matrix product.
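As a rough illustration of the non-overlapping-lifetime requirement, the sketch below models each smem buffer's lifetime as a closed interval of positions in the lowered kernel. `Lifetime` and `canShareMemory` are hypothetical names invented for this example; they are not part of nvFuser, and the real analysis works on the lowered kernel IR rather than on plain integers.

```cpp
#include <cassert>

// Hypothetical model: positions of the first write and the last read of a
// buffer in the lowered kernel.
struct Lifetime {
  int first_write;
  int last_read;
};

// Two buffers can occupy the same shared memory only if one of them is
// completely dead before the other is first written.
bool canShareMemory(const Lifetime& x, const Lifetime& y) {
  assert(x.first_write <= x.last_read && y.first_write <= y.last_read);
  return x.last_read < y.first_write || y.last_read < x.first_write;
}
```

For example, an operand tile live over [0, 10] could share memory with an epilogue tile live over [11, 15], but not if a later use extends the operand's lifetime to position 13.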

The uses of a and b can complicate lifetime analysis. Consider a case where both matrices are square and we wish to compute a @ b + a where @ denotes matmul. Since we used a->cacheAfter() to create the smem tile, that smem may be used not only in the matmul but also in the addition in the epilogue. In that case we cannot re-use a for the epilogue smem. A conservative check that there are no other uses of a or b is currently implemented in order to guarantee re-use. A less conservative sufficient (but still not necessary) condition is that any other use of a or b is a producer of b or a respectively; this is not implemented yet.
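The conservative check described above might look roughly like the following sketch. It assumes a simplified model in which each tensor name maps to the list of ops that consume it; `UseMap`, `reuseGuaranteed`, and the op names `"mma"` and `"add"` are hypothetical and do not reflect nvFuser's actual data structures.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: each tensor name maps to the ops that consume it.
using UseMap = std::map<std::string, std::vector<std::string>>;

// Conservative check: re-use of an operand's smem tile is only assumed when
// the operand has exactly one use and that use is the matmul itself.
bool reuseGuaranteed(const UseMap& uses, const std::string& operand) {
  const auto it = uses.find(operand);
  assert(it != uses.end() && "operand must be registered in the use map");
  const std::vector<std::string>& consumers = it->second;
  return consumers.size() == 1 && consumers.front() == "mma";
}
```

In the a @ b + a example, a is consumed by both the matmul and the addition, so the check fails for a, while it still passes for b.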

Previously, we stacked every ForLoop regardless of parallelization. This meant that when the first few dimensions were left of the compute-at position in the whole fusion, all tensors would have the same outer live interval, even if those dimensions were parallelized. I noticed this in the AmpereMatmulSmemEpilogue_CUDA tests. The generated CUDA shows that this is clearly not true: the outer for loops do not appear, since they are parallelized. This commit fixes that; note that it can affect all reuse analysis, including aliasing, even of local memory.

csarofeen · 2023-08-26T13:55:54Z

The check reminds me a lot of how we check persistent buffer usage.

naoyam · 2023-08-28T18:58:04Z

@jacobhinkle Do we have tests?

jacobhinkle · 2023-08-28T19:18:46Z

> @jacobhinkle Do we have tests?

Not yet. I was thinking of modifying the Epilogue* tests to check that memory is reused in the event that params.use_smem_epilogue == true.

zasdfgbnm · 2023-08-28T23:44:40Z

csrc/scheduler/mma_utils.cpp

  const auto blocks_per_sm_with_smem_epilogue = std::min(
-      shared_memory_available / (smem_a + smem_b + smem_c),
+      shared_memory_available / total_with_smem_epilogue,
      (size_t)blocks_per_sm_by_register);


Should we add a case where, if reuse and no-reuse provide the same occupancy, we don't promote reuse, since that saves a sync?

That's a good idea. In that case I'll also need to guard promoteReuse with params.use_smem_epilogue.

Actually, now I understand your comment. Yes we'll need a separate parameter indicating whether to reuse memory or not, in addition to use_smem_epilogue. I'll push something in a moment.

zasdfgbnm

Generally LGTM, but will leave it to @drzejan2 for approval. Also, could you run /build/nvfuser_bench --benchmark_filter=Matmul for reporting the perf on these benchmarks, and also work with @mmigdal-nv for a more thorough perf evaluation?

jacobhinkle · 2023-08-29T00:32:53Z

> Generally LGTM, but will leave it to @drzejan2 for approval. Also, could you run /build/nvfuser_bench --benchmark_filter=Matmul for reporting the perf on these benchmarks, and also work with @mmigdal-nv for a more thorough perf evaluation?

Will do!

drzejan2

Changes look good to me.

I will approve this MR when the use case mentioned by @zasdfgbnm is supported (link).

jacobhinkle · 2023-08-29T15:15:41Z

> I will approve this MR when the use case mentioned by @zasdfgbnm is supported (link).

Supported now. The heuristic now holds bool promote_prologue_smem_reuse in addition to use_smem_epilogue. promote_prologue_smem_reuse is only true if re-using smem would increase occupancy, since otherwise we should avoid adding a __syncthreads().
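The decision described above can be sketched as follows. This is only an illustration under stated assumptions: `EpilogueDecision` and `decideReuse` are hypothetical names, and the formula for the footprint with re-use (operands that cannot be reclaimed stay allocated alongside the epilogue tile) is an assumption rather than the code in mma_utils.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch.
struct EpilogueDecision {
  std::size_t blocks_no_reuse;
  std::size_t blocks_with_reuse;
  bool promote_prologue_smem_reuse;
};

EpilogueDecision decideReuse(
    std::size_t smem_available,
    std::size_t smem_a,
    std::size_t smem_b,
    std::size_t smem_c,
    std::size_t blocks_by_register,
    bool a_reusable,
    bool b_reusable) {
  const std::size_t no_reuse_total = smem_a + smem_b + smem_c;
  assert(no_reuse_total > 0 && "tile sizes must not all be zero");
  // Assumption: an operand whose re-use is not guaranteed stays live
  // next to the epilogue tile.
  const std::size_t epilogue_phase =
      (a_reusable ? 0 : smem_a) + (b_reusable ? 0 : smem_b) + smem_c;
  const std::size_t reuse_total = std::max(smem_a + smem_b, epilogue_phase);
  const std::size_t blocks_no_reuse =
      std::min(smem_available / no_reuse_total, blocks_by_register);
  const std::size_t blocks_with_reuse =
      std::min(smem_available / reuse_total, blocks_by_register);
  // Only pay for the extra __syncthreads() if occupancy actually improves.
  return {blocks_no_reuse, blocks_with_reuse,
          blocks_with_reuse > blocks_no_reuse};
}
```

With 96 KiB of shared memory, 16 KiB tiles for a and b, and a 32 KiB epilogue tile, re-use raises the smem-limited block count from 1 to 3, so the flag is set. If registers already limit occupancy to one block, both estimates agree and the sync is skipped.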

jacobhinkle · 2023-08-29T20:00:54Z

> Generally LGTM, but will leave it to @drzejan2 for approval. Also, could you run /build/nvfuser_bench --benchmark_filter=Matmul for reporting the perf on these benchmarks, and also work with @mmigdal-nv for a more thorough perf evaluation?

I ran the benchmarks in the background while working on other things, then realized that they do not use smem for the epilogue: the heuristic is not run, and the matmul benchmarks set the params manually. As a result, I see no difference in perf compared to TOT. Changing that probably should not go into this PR, but I will hack it to see what effect it has on some of the benchmarks.

zasdfgbnm · 2023-08-29T20:18:16Z

test/test_gpu_tensorcore.cpp

+    if (params.promote_prologue_smem_reuse) {
+      // Check prologue shared memory re-use
+      TORCH_CHECK(smem_allocs.at(1)->address()->isZero());
+      TORCH_CHECK(smem_allocs.at(2)->address()->isZero());


Is smem_allocs.at(0) A, smem_allocs.at(1) B, and smem_allocs.at(2) C? So B is reusing A's memory, and C is reusing B's memory?

Yes that's right. I improved the comment here a bit.

I see. So we allocate B first, then A. This makes sense. Thanks for the explanation!

Wait, is it guaranteed which of A and B is allocated first? If A is allocated first, will smem_allocs.at(1) no longer be zero here? I think we should remove the check for smem_allocs.at(1).

Very good question since their lifetimes end at the same time point. We break ties like that by ordering by name(), so B will be ordered after A, leading to this order consistently. However, that's not exactly clear so maybe we could just check that C is placed at 0 and that either A or B is at 0.
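A sketch of the order-independent check suggested above, written against plain byte offsets rather than the test's `smem_allocs` objects. The function name `prologueSmemReused` is hypothetical, and the sketch assumes the last entry is the epilogue tile C, with A and B in either order before it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// addrs holds the byte offsets of the three smem allocations. Assumption:
// the last entry is the epilogue tile C; the first two are A and B.
bool prologueSmemReused(const std::vector<std::size_t>& addrs) {
  assert(addrs.size() == 3 && "expects the A, B and C allocations");
  const bool c_at_zero = addrs.back() == 0;
  const bool operand_at_zero = addrs[0] == 0 || addrs[1] == 0;
  // C must start at offset 0, on top of whichever operand was placed there.
  return c_at_zero && operand_at_zero;
}
```

This passes regardless of whether A or B ends up at offset 0, so it does not depend on the name() tie-break.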

drzejan2

Functionally everything is sound; I will approve when the bug in the hashing function is fixed.

jacobhinkle and others added 15 commits August 22, 2023 19:30

Ignore trivial loops in memory aliasing pass

9c06a18

Merge branch 'main' into alias_pass_ignore_trivial_loops

e5cc2b7

Add TensorView::requestReuse and failing test

8a5dafe

Fix up reuse tests

4d9cc35

Enable smem reuse in matmul epilogue

6a82eb4

Change to promoteReuse interface

fd5a7db

Remove prints and fix tv clone

df5476b

Clean up comment

2a1b51a

Switch to using promoteReuse

263f428

Almost guarantee reuse before assuming it

998e194

Remove old RequestReuse test

8ea47cc

Move reuse guarantee to getMatmulHeuristics

57a3bab

Merge remote-tracking branch 'origin/alias_pass_ignore_trivial_loops' into smem_epilogue_request_reuse

2d0219a

Remove blank space

d894c4b

mmigdal-nv requested a review from drzejan2 August 23, 2023 15:33

Merge branch 'main' into smem_epilogue_request_reuse

b5c953a

jacobhinkle and others added 2 commits August 27, 2023 20:14

Guarantee a and b reuse separately

d1e986d

Merge branch 'main' into smem_epilogue_request_reuse

7ed6447

jacobhinkle requested review from liqiangxl, mmigdal-nv, naoyam and zasdfgbnm August 28, 2023 00:20

jacobhinkle marked this pull request as ready for review August 28, 2023 00:20

zasdfgbnm reviewed Aug 28, 2023

drzejan2 reviewed Aug 29, 2023

csrc/scheduler/mma_utils.cpp (outdated, resolved)

csrc/scheduler/matmul_utils.cpp (outdated, resolved)

jacobhinkle added 5 commits August 29, 2023 08:22

Separate smem epilogue from promote_reuse param

b9db3ac

Move roles_map analysis to generateSharedMemoryEpilogueHeuristics

8160135

tests failing at the moment

Check for reuse in AmpereMatmulSmemEpilogue_CUDA

1d546e3

Merge remote-tracking branch 'origin/main' into smem_epilogue_request_reuse

fb222c7

Restore old signature as overload for generateSharedMemoryEpilogueHeuristics

0367777

Fix MatmulSASSTest.AmpereModifiersSharedMemoryEpilogue_CUDA

366a5e7

zasdfgbnm reviewed Aug 29, 2023

csrc/scheduler/matmul_heuristic.h (resolved)

Update sameAs and hash

6b54a5c

zasdfgbnm reviewed Aug 29, 2023

Better comment in reuse check in epilogue test

5cc7883

drzejan2 reviewed Aug 30, 2023

csrc/scheduler/matmul_heuristic.h (outdated, resolved)

csrc/scheduler/mma_utils.h (resolved)

csrc/scheduler/mma_utils.cpp (outdated, resolved)

csrc/scheduler/matmul_utils.cpp (outdated, resolved)

jacobhinkle and others added 8 commits August 30, 2023 07:43

State which scheduler is considered in mma_utils.cpp

7cf708f

Co-authored-by: Andrzej Bekas <118676880+drzejan2@users.noreply.github.com>

Add a new way to hold structs in PolymorphicValue (#791)

cae9683

See note `[Struct Support in PolymorphicValue]` for description, and the new test `PolymorphicValueTest.Struct` for examples.

Fixed a minor error in tools/compare_codegen.sh (#810)

2649b45

per title

Fix no-benchmark build (#809)

50919f1

Fix hash

2948ca7

Reformat comment

0a29cc6

Merge remote-tracking branch 'origin/main' into smem_epilogue_request_reuse

3492690

liqiangxl reviewed Aug 30, 2023

test/test_gpu_tensorcore.cpp (resolved)

drzejan2 approved these changes Aug 30, 2023

jacobhinkle added 3 commits August 30, 2023 12:36

Handle all three cases for smem epilogue in test

cdf7c24

Remove !use_smem_epilogue case.

9d0319a

Test is skipped in this case anyway

Merge remote-tracking branch 'origin/main' into smem_epilogue_request_reuse

ea236a8

jacobhinkle merged commit 761eea4 into main Aug 31, 2023

jacobhinkle deleted the smem_epilogue_request_reuse branch August 31, 2023 15:44
