
Add TensorView::promoteReuse#739

Merged
jacobhinkle merged 35 commits into main from request_smem_reuse
Aug 23, 2023

Conversation

@jacobhinkle (Collaborator) commented Aug 17, 2023

As of #703, nvfuser is able to reuse shared memory even when the index expressions don't match. However, this requires block synchronization at the appropriate places, and that PR did not provide a mechanism for inserting synchronization. This PR addresses that gap by adding the method TensorView::promoteReuse(), which sets the promote_reuse_ flag on the tensor. The reuseMemoryAllocations lowering pass recognizes that flag and ensures a synchronization exists after the tensor's last use but before the next smem allocation is written to, so that memory reuse can occur. That pass now looks like:

  1. Find shared or local memory tensors that can be "aliased". That is, their lifetimes don't overlap and they have the equivalent index expressions, so the later one can be replaced with a reference to the first.
  2. Find shared memory tensors which we should promote for re-use, i.e. those with the promote_reuse_ flag set. These determine intervals between their last read and the next first write of another smem tensor; we check for syncing expressions within those intervals. Currently we insert a sync at the end of the interval if we don't find a pre-existing one, but we could change these to arrive/wait barriers in future work.
  3. Do shared memory allocation as introduced in Stack-based shared memory allocator #703. The new syncs introduced in step 2 are now recognized and memory is reclaimed as requested.

Note that currently promoteReuse can be called on any TensorView, but it only has an effect on shared memory tensors.
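The interval check in step 2 above can be illustrated with a toy model. This is only a sketch of the idea, not nvfuser's implementation; the event representation and all names here are made up for illustration:

```cpp
#include <string>
#include <vector>

// Toy model of step 2 of the pass: a kernel is a flat list of events.
// Between the last read of a promoted buffer and the next write to a
// different smem buffer there must be a sync; if none exists, insert one
// at the end of the interval (just before that write).
struct Event {
  std::string kind;    // "read", "write", or "sync"
  std::string buffer;  // empty for "sync"
};

// Returns the index where a sync was inserted, or -1 if one already
// existed in the interval (or no interval was found).
int ensureSyncForReuse(std::vector<Event>& events, const std::string& promoted) {
  // Find the last read of the promoted buffer.
  int last_read = -1;
  for (int i = 0; i < (int)events.size(); ++i) {
    if (events[i].kind == "read" && events[i].buffer == promoted) {
      last_read = i;
    }
  }
  if (last_read < 0) {
    return -1;
  }
  // Scan forward for the next first write of another smem buffer.
  for (int j = last_read + 1; j < (int)events.size(); ++j) {
    if (events[j].kind == "sync") {
      return -1;  // a pre-existing sync already covers the interval
    }
    if (events[j].kind == "write" && events[j].buffer != promoted) {
      // No sync found in the interval: insert one before the write.
      events.insert(events.begin() + j, Event{"sync", ""});
      return j;
    }
  }
  return -1;
}
```

The real pass works on kernel IR positions rather than a flat event list, but the reclaim condition it checks is the same shape: a sync must separate the last use from the next overlapping allocation's first write.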

Previously, we stacked every ForLoop regardless of parallelization. This
meant that when the first few dimensions were left of the compute-at
position for the whole fusion, all tensors would have the same outer live
interval even if those dimensions were parallelized. I noticed this in the
AmpereMatmulSmemEpilogue_CUDA tests: if you look at the generated CUDA,
it's clearly not true, since the outer for loops do not appear when they
are parallelized. This commit fixes that; note that it can affect all
reuse analysis, including aliasing, even for local memory.
Comment on lines +768 to +772
// Parallelized loops do not result in for loops in the CUDA kernel, so
// they should not affect liveness analysis. This means that
// current_stack_ will differ from kir::IrVisitor::for_loops_, which will
// actually hold all ForLoops regardless of parallelization.
current_stack_.push_back(loop_info);
@jacobhinkle (Collaborator, Author) commented Aug 18, 2023
I noticed this when testing with matmuls; see FusionAmpereMatmulSmemEpilogue_CUDA for example. In that case the kernel IR starts with

T9_s[ ... ] ca_pos( 2 ) = ALLOCATE(buffer=T9_s[ ... ] ca_pos( 2 ), mem_type=shared, size=8192, zero_init=false)
T8_s[ ... ] ca_pos( 2 ) = ALLOCATE(buffer=T8_s[ ... ] ca_pos( 2 ), mem_type=shared, size=4096, zero_init=false)
T7_s[ ... ] ca_pos( 2 ) produce_pos( 2 ) = ALLOCATE(buffer=T7_s[ ... ] ca_pos( 2 ) produce_pos( 2 ), mem_type=shared, size=8192, zero_init=false)
FOR blockIdx.x in iblockIdx.x71{( ceilDiv(T0.logical_size[0], 64) )}:
  FOR blockIdx.y in iblockIdx.y73{( ceilDiv(T1.logical_size[1], 128) )}:
    ...
    FOR threadIdx.z in ithreadIdx.z84{( ceilDiv(( ( ceilDiv(64, 4) ) * 4 ), 32) )}:
      FOR threadIdx.y in ithreadIdx.y86{( ceilDiv(( ( ( ceilDiv(( ceilDiv(128, 8) ), 4) ) * 4 ) * 8 ), 32) )}:
        // All writes and reads of T7, T8, T9 are in this loop
        FOR i806 in iS88{( ceilDiv(32, 16) )}:
          ...

Here we see that only the i806 loop will be present in the actual CUDA code. The parallelized loops cover the entire kernel, so without this change, the outer live interval of every tensor is just the range of the outermost BIDx loop. With this change, we properly compute the live intervals relative to loops that should occur in the kernel.

This change is related to this PR but not strictly necessary for it. We could always remove it for now and reintroduce it when we make the change to the matmul scheduler.
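The effect of the fix can be sketched with a toy model of the loop stack. This is illustrative only (not nvfuser code): loops that are parallelized emit no `for` in the generated CUDA, so they are skipped when building the stack used for liveness:

```cpp
#include <string>
#include <vector>

// Toy illustration: a loop nest where some loops are parallelized
// (mapped to blockIdx/threadIdx) and therefore emit no `for` statement.
struct LoopInfo {
  std::string name;
  bool parallelized;
};

// Returns the stack of loop names that would actually appear as `for`
// loops in the generated kernel, i.e. the loops relevant to liveness.
std::vector<std::string> materializedStack(const std::vector<LoopInfo>& loops) {
  std::vector<std::string> stack;
  for (const auto& l : loops) {
    if (!l.parallelized) {
      stack.push_back(l.name);
    }
  }
  return stack;
}
```

With the nest from the kernel IR dump above (BIDx, BIDy, TIDz, TIDy all parallelized, only i806 serial), the materialized stack contains just i806, so live intervals are computed relative to that loop rather than spanning the whole kernel.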

Collaborator

Regarding the example at

//! Find the loop level of expr that appears in the same scope as
//! the reference allocate. Eg.
//!
//! For ...
//! For ...
//! Allocate <---- reference arg
//! For ..
//! For ...
//! For ... <---- this function returns `ScopeInfo` for this loop
//! For ...
//! expr <---- current expr (implied in current_stack_ and
//! current_pos_ )
//! Assumes that expr either writes to or reads from the reference allocate.

What happens if I have:

  //!  For ...
  //!    For ...
  //!      Allocate    <---- reference arg
  //!      For ..
  //!          For ...
  //!      For blockIdx.x in blockDim.x <---- Will this function return `ScopeInfo` for this loop?
  //!          For ...
  //!             expr  <---- current expr (implied in current_stack_ and current_pos_ )

Collaborator Author

Since the For blockIdx.x in blockDim.x loop is parallelized, it will not appear in the kernel. So in that case this should look like

  //!  For ...
  //!    For ...
  //!      Allocate    <---- reference arg
  //!      For ..
  //!          For ...
  //!      // For blockIdx.x in blockDim.x <---- This loop does not appear in the CUDA code, so it is ignored
  //!      For ...  <---- This function returns `ScopeInfo` for this loop
  //!         expr  <---- current expr (implied in current_stack_ and current_pos_ )

Collaborator

@jacobhinkle Could you remind me what the problem is with:

so without this change, the outer live interval of every tensor is just the range of the outermost BIDx loop.

@jacobhinkle (Collaborator, Author) commented Aug 22, 2023

The problem is that in that case we actually could do "outer aliasing", since in the CUDA kernel the live intervals do not overlap. However, because the trivial BIDx loop surrounds the whole kernel, without this change the outer live interval of any allocation is simply the span of that BIDx loop.
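The outer-aliasing condition being discussed reduces to a lifetime-overlap test. A minimal sketch, with hypothetical names (the real analysis works on kernel IR positions):

```cpp
// Outer live interval of an allocation, expressed as positions within
// the loop that would actually appear in the generated kernel.
struct Interval {
  int first_write;
  int last_read;
};

// Two allocations can share memory (outer alias) only when their outer
// live intervals do not overlap: one must end strictly before the other
// begins.
bool canOuterAlias(const Interval& a, const Interval& b) {
  return a.last_read < b.first_write || b.last_read < a.first_write;
}
```

When a trivial BIDx loop spans the whole kernel and every allocation's outer interval is measured against it, every pair of intervals overlaps and this test always fails, which is exactly the missed-reuse problem described above.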

Collaborator

It seems this could affect many more cases (positively). Can you split this change out from this PR, and also see how the aliasing would change with the benchmarks?

Collaborator

Also, there are a couple of places where we do current_stack_.back(), and I wonder whether they are all safe. If all loops are parallelized, wouldn't the stack just be empty?

Collaborator Author

Yes I will go ahead and split this off into another PR.

Collaborator Author

Moved to #766.

@jacobhinkle (Collaborator, Author) commented:

!build

@jacobhinkle (Collaborator, Author) commented Aug 18, 2023

Adding the following after the line https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/matmul.cpp#L970 results in shared memory reuse in matmuls with smem epilogues:

smem_epilogue->requestReuse(acw_smem);
// This next line is unnecessary since both tensors have same outer live interval. Both tensors
// will be reclaimed as long as either is requested for reuse.
//smem_epilogue->requestReuse(bcw_smem);

I'm leaving that for a follow-on PR.

@jacobhinkle jacobhinkle changed the title from [WIP] Add TensorView::requestReuse to Add TensorView::requestReuse on Aug 18, 2023
jacobhinkle and others added 4 commits August 18, 2023 11:47
NeedsReorderedPush actually had lifetimes that did not quite overlap. The
new version is simpler, I think.
@jacobhinkle jacobhinkle marked this pull request as ready for review August 18, 2023 17:21
@jacobhinkle jacobhinkle marked this pull request as draft August 18, 2023 18:54
@jacobhinkle jacobhinkle changed the title from Add TensorView::requestReuse to Add TensorView::promoteReuse on Aug 21, 2023
@jacobhinkle jacobhinkle marked this pull request as ready for review August 21, 2023 16:50
return all_allocations_;
}

std::optional<AllocationInfo*> getMaybeAllocInfoFromTV(TensorView* tv) const {
Collaborator

nit: return nullptr when not found, so we don't have to use std::optional.
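The reviewer's suggestion could look like the following sketch (hypothetical, simplified types; the real class holds richer allocation records):

```cpp
#include <unordered_map>

// Stand-ins for the real nvfuser types, for illustration only.
struct AllocationInfo {};
struct TensorView {};

class AllocationInfoMap {
 public:
  // Returns nullptr when no info is recorded for tv, avoiding the
  // std::optional<AllocationInfo*> double-wrapping of a pointer.
  AllocationInfo* getAllocInfoFromTV(TensorView* tv) const {
    auto it = map_.find(tv);
    return it == map_.end() ? nullptr : it->second;
  }

  void insert(TensorView* tv, AllocationInfo* info) {
    map_[tv] = info;
  }

 private:
  std::unordered_map<TensorView*, AllocationInfo*> map_;
};
```

Since the mapped value is already a pointer, nullptr is a natural "not found" sentinel and callers can test it directly in an `if`.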

void setAlias(AllocationInfo* from, AllocationInfo* to) {
  alias_map_[from] = to;
  from->alias_to = to->alloc_expr;
  to->outer_aliased_by.push_back(from);
Collaborator

What if B aliases A, and C aliases B; will A.outer_aliased_by have both B and C?

@jacobhinkle (Collaborator, Author) commented Aug 22, 2023

Good point. Currently, two-hop aliases are not assigned, though they possibly could be in the future.

Collaborator Author

I added an assertion here that to->alias_to is null.
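The resulting invariant can be sketched as follows. This is a simplified model (here alias_to points directly at the target AllocationInfo rather than at its allocate expression, and the alias map is omitted):

```cpp
#include <cassert>
#include <vector>

struct AllocationInfo {
  AllocationInfo* alias_to = nullptr;
  std::vector<AllocationInfo*> outer_aliased_by;
};

// Refuse two-hop alias chains: the target of an alias must itself be
// un-aliased, so outer_aliased_by always lists direct aliases only.
void setAlias(AllocationInfo* from, AllocationInfo* to) {
  assert(to->alias_to == nullptr && "cannot alias to an already-aliased allocation");
  from->alias_to = to;
  to->outer_aliased_by.push_back(from);
}
```

With this assertion, the "C aliases B which aliases A" chain from the question above is rejected outright instead of producing an inconsistent outer_aliased_by list.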

void handle(kir::ForLoop* for_loop) final {
  auto loop_info = scope_map_.getLoopScopeInfo(for_loop);
  current_stack_.push_back(loop_info);
  if (!for_loop->iter_domain()->isParallelized()) {
Collaborator

Should we instead use for_loop->isTrivial()? There are other trivial loops not generated in codegen, for example, vectorization loop, and we should handle all of them equivalently.

Collaborator Author

Much better. Thanks.
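The distinction the reviewer is drawing can be modeled with a toy loop descriptor (illustrative only; nvfuser's kir::ForLoop::isTrivial covers more cases than this sketch):

```cpp
// Toy model: a loop is "trivial" when it emits no `for` statement in the
// generated CUDA. Checking only parallelization misses other trivial
// loops, e.g. vectorization loops or loops of extent one.
struct ForLoopModel {
  bool parallelized = false;
  bool vectorized = false;
  bool extent_one = false;

  bool isTrivial() const {
    return parallelized || vectorized || extent_one;
  }
};

// Liveness should skip a loop exactly when it is trivial, not merely
// when it is parallelized.
bool affectsLiveness(const ForLoopModel& loop) {
  return !loop.isTrivial();
}
```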

//! is present in the kernel to reuse memory and inserts new block
//! synchronizations if necessary.
void promoteReuse(bool b = true) {
  promote_reuse_ = b;
Collaborator

Assert here that this is a shared memory tensor?
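A sketch of the suggested guard (hypothetical simplified class; the real promoteReuse returns void, and an NVF_CHECK-style assertion would hard-fail instead of returning a status). Since the PR description notes the flag currently only has an effect on shared memory tensors, another option is to silently ignore other memory types:

```cpp
enum class MemoryType { Local, Shared, Global };

// Simplified stand-in for TensorView, for illustration only.
struct TensorViewSketch {
  MemoryType mem_type = MemoryType::Local;
  bool promote_reuse = false;

  // Returns whether the flag was actually set; a hard assertion on
  // mem_type would be the stricter variant the reviewer suggests.
  bool promoteReuse(bool b = true) {
    if (mem_type != MemoryType::Shared) {
      return false;  // no effect on non-shared tensors
    }
    promote_reuse = b;
    return true;
  }
};
```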


//! Returns whether we should insert syncs if needed in order to reuse the
//! memory of this tensor.
bool getPromoteReuse() const {
Collaborator

nit: Is this a better name? shouldPromoteReuse

std::unordered_multimap<int, int> sync_intervals_;

// Position within the traversal
int position_ = -1;
Collaborator

Does this need to be a member?

Collaborator Author

It does not. This was leftover from a refactor. Fixing..

Collaborator Author

Done

Comment on lines +1898 to +1903
if (inserted_syncs_.find(expr) != inserted_syncs_.end()) {
  if (isDebugDumpEnabled(DebugDumpOption::BufferReuseInfo)) {
    debug() << "Skipping new sync expression " << expr->toString();
  }
  kir::ExprMutator::dispatch(expr);
}
Collaborator

IIUC, traverseAndInsert will only insert after the entire traverse is done, which means, inside here, we will never see an inserted sync? (If it did insert on the fly, then should we recompute AllocationInfoMap every time when we register an insertion?)

Collaborator Author

You're right, it should not insert until afterward, so this check is not needed. I had gotten a segfault at one point and placed this guard there to diagnose it, but I think that was a logic error that wound up calling kir::ExprMutator::dispatch. I'll verify that none of the tests hit it and remove it if not.

auto tv7 = neg(tv5); // pos = f
fusion->addOutput(tv7);

{ // This should not re-use memory
Collaborator

It seems there's something missing here. This fusion is not parallelized at all, so why does it need a syncthreads to reuse the memory?

Collaborator

Even when parallelized, no syncthreads should be necessary for fusions like:

__shared__ float X[N];
...
auto t0 = X[threadIdx.x]; // last read of X
...
X[threadIdx.x] = t1; // reuse X without syncthreads
...

@jacobhinkle (Collaborator, Author) commented Aug 22, 2023

That case would be an inner alias and is supported without syncing. An outer alias, where there are separate loops for the two allocations, is not supported and requires a sync. We could potentially handle that case without syncing too, but it might require more machinery for proving indices are equivalent.

Collaborator Author

In this test all three smem allocations have different size, so they cannot be aliased.

Collaborator

They don't need to be aliased. I think this is more about the stack-based reuse logic. Since no tensor is parallelized, we should be able to freely pop allocated tensors without a sync.

Collaborator Author

Oh sorry, I thought you meant this tensor is not parallelized. If no tensors are parallelized, you are right that we wouldn't need any syncs. Do we need to handle that case?

Collaborator

Completely serial cases would be fine to ignore as they are just a synthetic example I came up with.

But how about cases like this?

__shared__ float X[blockDim.x * 4];
...
for (int i = 0; i < 4; ++i) {
   auto t0 = X[threadIdx.x + blockDim.x * i]; // last read of X
}
...
for (int i = 0; i < 4; ++i) {
   X[threadIdx.x + blockDim.x * i] = t1; // reuse X without syncthreads
}

I actually don't remember all the details and differences of the inner and outer sharing, but isn't this case outer sharing? And if so, no reuse is allowed without a sync, right?

I think what's missing here is probably something like what we do for the RAW sync insertion. We use the CA maps to analyze if a read after a write requires a sync. See for example: https://github.com/NVIDIA/Fuser/blob/main/csrc/device_lower/analysis/sync_information.cpp#L234

It's quite complicated and also it's one of those that we would be able to simplify a lot with the new ID graph. So, I think it's fine to leave this as a limitation for now.

Collaborator Author

Ah I see. Yes in this case there are non-overlapping sets of elements written/read across threads, so it's safe. It's definitely very similar analysis to what's done in SyncMap; I had originally thought we might just augment SyncMap to insert these re-use syncs even. Sounds good on leaving it for now and revisiting it after ID graphs are complete.

Collaborator

Can you file it as a TODO issue?

Collaborator Author

Drafted an issue here: #769

Comment on lines +199 to +206
for (auto alloc : gpulw.kernel()->summary().dynamic_smem_allocations) {
  EXPECT_NE(alloc->address(), nullptr);
  auto addr = ee.evaluate(alloc->address()).as<int64_t>();
  auto size = ee.evaluate(alloc->size()).as<int64_t>() *
      dataTypeSize(alloc->buffer()->dtype());
  smem_usage = std::max(smem_usage, addr + size);
}
EXPECT_EQ(smem_usage, alignInt((H + 1) * 4) + (H + 1) * 4);
Collaborator

Can we just directly validate the alias relationship of each allocation? I understand checking the total size is also fine, but asserting the alias relationships would make the intention of the test and expected behavior much more clear.

Collaborator Author

Since they are not aliased (i.e. the kir::Allocates have null alias()), we can't compare those directly. We can probably compare the addresses instead though. I will give it a shot.
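The address-based check being proposed could be sketched like this (hypothetical plain structs standing in for evaluated kir::Allocate results; the real test would use EXPECT_* macros on the evaluated addresses):

```cpp
#include <cstdint>

// Evaluated placement of a dynamic smem allocation: base address (bytes
// into the smem segment) and size in bytes.
struct AllocModel {
  int64_t address;
  int64_t size;
};

// Memory was reclaimed if a later allocation is placed at the same base
// address as an earlier, no-longer-live one.
bool reusesAddress(const AllocModel& earlier, const AllocModel& later) {
  return earlier.address == later.address;
}
```

Asserting address equality states the expected reuse relationship directly, whereas the total-usage check only shows that the sum came out right.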

Collaborator

Oh, that's true. Then maybe not worth spending much time.

@naoyam (Collaborator) left a comment

LGTM

@jacobhinkle jacobhinkle merged commit 268a63f into main Aug 23, 2023
@jacobhinkle jacobhinkle deleted the request_smem_reuse branch August 23, 2023 02:07
jacobhinkle added a commit that referenced this pull request Aug 31, 2023
This uses `promoteReuse` from #739 and inserts a syncthreads just before
the epilogue loop when smem is used for the epilogue, when possible. The
matmul heuristic attempts to predict when this will be possible in order
to more accurately estimate shared memory usage, and hence occupancy. If
we cannot guarantee re-use, we must assume in the heuristic that memory
will not be reclaimed, even though it might be when the fusion is
lowered.

Shared memory reclamation can only occur if the smem buffers have
non-overlapping lifetimes. This is difficult to guarantee before
scheduling and lowering. We use `cacheAfter` to create the `a` and `b`
smem tiles, but we use `cacheBefore` for the epilogue smem tile. This
means that smem will be used for any downstream uses of `a` and `b` but
the epilogue smem will have its lifetime restricted to the epilogue
itself, regardless of downstream uses of the matrix product.

The uses of `a` and `b` can complicate lifetime analysis. Consider a
case where both matrices are square and we wish to compute `a @ b + a`
where `@` denotes matmul. Since we used `a->cacheAfter()` to create the
smem tile, that smem may be used not only in the matmul but also in the
addition in the epilogue. In that case we cannot re-use `a` for the
epilogue smem. A conservative check that there are no other uses of `a`
or `b` is currently implemented in order to guarantee re-use. A less
conservative sufficient (but still not necessary) condition is that any
other use of `a` or `b` is a producer of `b` or `a` respectively; this
is not implemented yet.
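The conservative check described above can be sketched with toy structures (hypothetical names; the real check inspects TensorView uses in the fusion):

```cpp
#include <string>
#include <vector>

// Toy stand-in for a tensor and the names of the expressions consuming it.
struct TVModel {
  std::string name;
  std::vector<std::string> uses;
};

// Conservative sufficient condition for guaranteed smem reuse: both smem
// inputs are consumed only by the matmul itself, so neither lifetime can
// extend into the epilogue.
bool canGuaranteeReuse(const TVModel& a, const TVModel& b,
                       const std::string& matmul) {
  auto only_matmul = [&](const TVModel& tv) {
    return tv.uses.size() == 1 && tv.uses[0] == matmul;
  };
  return only_matmul(a) && only_matmul(b);
}
```

The `a @ b + a` example fails this check because `a` has a second use in the epilogue addition; the relaxed producer-based condition mentioned above would accept some of the cases this one rejects.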

---------

Co-authored-by: Andrzej Bekas <118676880+drzejan2@users.noreply.github.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>