Stack-based shared memory allocator by jacobhinkle · Pull Request #703 · NVIDIA/Fuser

jacobhinkle · 2023-08-09T20:00:35Z

This PR replaces `NoReuseSharedMemAllocator` with a new shared memory allocator called `StackBasedSharedMemAllocator`. In this new approach, we reuse memory that is no longer in use, but only at points where the kernel already synchronizes. In particular, we use a stack of allocations to lay out memory, and we delay pushing onto the stack until it is time to reclaim memory. This lets us order the allocations by when they are last read rather than by when they are first written. That gives us more opportunities for reuse, since the first allocations to be freed will lie at the top of the stack.
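To make the stack ordering concrete, here is a toy model of the idea in Python. It is only a sketch, not the code in this PR: the `Alloc` record, `assign_addresses`, the integer expression positions, and the 16-byte alignment are all assumptions made for illustration. Allocations wait in a pending list after their first write; at a sync where something could be reclaimed, the pending allocations are pushed with the latest last read deepest, and dead allocations are then popped from the top.

```python
from dataclasses import dataclass


@dataclass
class Alloc:
    # Hypothetical record for one shared memory buffer.
    name: str
    size: int          # size in bytes
    first_write: int   # position of the first write in the kernel
    last_read: int     # position of the last read in the kernel
    address: int = -1  # byte offset, filled in by assign_addresses


def align_up(n, alignment=16):
    return (n + alignment - 1) // alignment * alignment


def assign_addresses(allocs, sync_positions, alignment=16):
    """Assign byte offsets and return the total shared memory needed."""
    stack = []    # allocations that already have an address, bottom to top
    waiting = []  # allocations seen but not yet pushed
    high_water = 0

    def push_waiting():
        nonlocal high_water
        # Latest last_read goes deepest, so the earliest-freed ends up on top.
        for a in sorted(waiting, key=lambda a: a.last_read, reverse=True):
            top = stack[-1].address + stack[-1].size if stack else 0
            a.address = align_up(top, alignment)
            stack.append(a)
            high_water = max(high_water, a.address + a.size)
        waiting.clear()

    # Event kind 1 is a first write, kind 0 is an existing sync.
    events = [(a.first_write, 1, a) for a in allocs]
    events += [(p, 0, None) for p in sync_positions]
    for pos, kind, a in sorted(events, key=lambda e: e[:2]):
        if kind == 1:
            waiting.append(a)
        elif any(x.last_read < pos for x in stack + waiting):
            # Only push when there is something to reclaim at this sync.
            push_waiting()
            while stack and stack[-1].last_read < pos:
                stack.pop()
    push_waiting()
    return high_water


# A is freed first even though it is written first, so it must sit on top.
a = Alloc("A", 2048, first_write=0, last_read=3)
b = Alloc("B", 1024, first_write=1, last_read=8)
c = Alloc("C", 2048, first_write=5, last_read=7)
total = assign_addresses([a, b, c], sync_positions=[4])
print(total, a.address, b.address, c.address)
```

In this example B is placed at the bottom and A above it, so the sync at position 4 frees A and C lands at A's old offset. Had the stack been ordered by first write, A would sit under the still-live B and could not be reclaimed.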

In a future PR, we can explore extensions of this approach which introduce new syncs. Since the current implementation does not introduce syncs and only reuses memory when it is safe to do so, it is activated as the only method for shared memory allocation, replacing the previous class that never reused space.

jacobhinkle · 2023-08-11T18:59:27Z

!build

csrc/device_lower/pass/alias_memory.cpp

As of #703, nvfuser is able to reuse shared memory even when the indexes don't match. However, this requires block synchronization to occur in the appropriate places, and that PR did not provide a mechanism for inserting synchronization. This PR addresses this by adding the method `TensorView::promoteReuse()`, which sets the `promote_reuse_` flag on the tensor. The `reuseMemoryAllocations` lowering pass recognizes that flag and ensures that a synchronization is available after the tensor's last use but before the next smem allocation is written to, so that memory reuse can occur. That pass now looks like:

1. Find shared or local memory tensors that can be "aliased". That is, their lifetimes don't overlap and they have equivalent index expressions, so the later one can be replaced with a reference to the first.
2. Find shared memory tensors which we should promote for reuse, i.e. those with the `promote_reuse_` flag set. Each of these determines an interval between its last read and the next first write of another smem tensor; we check for syncing expressions within those intervals. Currently we insert a sync at the end of the interval if we don't find a pre-existing one, but we could change these to arrive/wait barriers in future work.
3. Do shared memory allocation as introduced in #703. The new syncs introduced in step 2 are now recognized and memory is reclaimed as requested.

Note that currently `promoteReuse` can be called on any `TensorView`, but it only has an effect on shared memory tensors.
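The interval check in step 2 can be modeled with a short Python sketch. This is an illustration under stated assumptions, not the lowering pass itself: `SmemTensor`, `syncs_to_insert`, and the integer expression positions are hypothetical names, and the returned value lists positions `p` meaning "insert a sync immediately before expression `p`".

```python
from dataclasses import dataclass


@dataclass
class SmemTensor:
    # Hypothetical stand-in for a shared memory tensor's lifetime.
    name: str
    first_write: int
    last_read: int
    promote_reuse: bool = False  # models the promote_reuse_ flag


def syncs_to_insert(tensors, existing_syncs):
    """Return positions before which a new sync is needed."""
    inserted = []
    for t in sorted(tensors, key=lambda t: t.last_read):
        if not t.promote_reuse:
            continue
        # The interval ends at the next first write of another tensor.
        later_writes = [
            o.first_write
            for o in tensors
            if o is not t and o.first_write > t.last_read
        ]
        if not later_writes:
            continue
        next_write = min(later_writes)
        # A pre-existing sync strictly inside the interval is enough.
        has_existing = any(t.last_read < s < next_write for s in existing_syncs)
        # So is a sync this function has already decided to insert.
        has_inserted = any(t.last_read < p <= next_write for p in inserted)
        if not (has_existing or has_inserted):
            # Otherwise insert one at the end of the interval.
            inserted.append(next_write)
    return sorted(inserted)


tensors = [
    SmemTensor("T0", first_write=0, last_read=2, promote_reuse=True),
    SmemTensor("T1", first_write=1, last_read=6),
    SmemTensor("T2", first_write=4, last_read=7),
]
print(syncs_to_insert(tensors, existing_syncs=[]))
print(syncs_to_insert(tensors, existing_syncs=[3]))
```

Here T0's interval runs from its last read at position 2 to T2's first write at position 4. With no sync in between, one is requested just before position 4; with an existing sync at position 3, nothing is inserted.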

jacobhinkle added 19 commits August 9, 2023 15:59

First draft without sync insertion

f4b7ef2

Merge remote-tracking branch 'origin/main' into smem_reuse_stack

f5c61c4

Convert to class, more doc, still WIP

03185cc

Fix a couple bugs

24ca118

Use size in bytes, align.

4bc2eb3

Add notes on reordering stack pushes

124da93

Typo

f41cfe1

Comment update

1097ff6

Add warn_only mode and enable/disable options

5b540ad

Add draft of a couple tests

8eb41cd

Add passing tests

35e5e23

Update check in failing needreorder test

e0edb6e

Fix tests. Now just missing syncs

06267a6

Merge remote-tracking branch 'origin/main' into smem_reuse_stack

8d0d242

Refactor to use IrVisitor and only reclaim mem on syncs

efa3e60

Fix SimpleCase test and remove prints

7f54fe9

Clean up tests. Remove disable/enable options

07cfdfc

Remove NoReuseSharedMemAllocator

c19d3f1

Merge remote-tracking branch 'origin/main' into smem_reuse_stack

9eee05a

jacobhinkle changed the title ~~[WIP] Stack-based shared memory allocator~~ Stack-based shared memory allocator Aug 11, 2023

jacobhinkle and others added 2 commits August 11, 2023 19:00

Remove warn_only option and clean up test comments

37aa86e

Merge branch 'main' into smem_reuse_stack

2e8fc78

jacobhinkle marked this pull request as ready for review August 11, 2023 23:01

jacobhinkle requested a review from zasdfgbnm August 11, 2023 23:01