Stack-based shared memory allocator#703

Merged
jacobhinkle merged 33 commits into main from smem_reuse_stack
Aug 15, 2023
Conversation

@jacobhinkle (Collaborator) commented Aug 9, 2023

This PR replaces NoReuseSharedMemAllocator with a new shared memory allocator called StackBasedSharedMemAllocator. In this new approach, we reuse memory that is no longer live at existing synchronization points in the kernel. In particular, we lay out memory using a stack of allocations, and we delay pushing an allocation onto the stack until it is time to reclaim memory. This orders the allocations not by when they are first written but by when they are last read, so the first allocations to be freed lie at the top of the stack, giving us more opportunities for reuse.
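To illustrate the idea, here is a minimal Python simulation of stack-discipline reuse (the function, tuple encoding, and position numbering below are hypothetical, not nvfuser's actual implementation): an allocation's space can be popped, and thus reused, once a synchronization separates its last read from the next allocation's first write.

```python
def assign_offsets(allocs, syncs):
    """Assign stack offsets to shared memory allocations.

    allocs: list of (name, size, first_write, last_read) tuples,
            sorted by first_write position.
    syncs:  list of positions where a block sync occurs.

    An allocation on top of the stack is reclaimed once a sync lies
    strictly after its last read and at or before the next
    allocation's first write.
    """
    stack = []   # entries: (offset, size, last_read)
    offsets = {}
    top = 0      # current top of the shared memory stack
    for name, size, first_write, last_read in allocs:
        # Pop every dead allocation sitting on top of the stack,
        # reclaiming its space for the new allocation.
        while stack and any(
            stack[-1][2] < s <= first_write for s in syncs
        ):
            offset, _, _ = stack.pop()
            top = offset
        offsets[name] = top
        stack.append((top, size, last_read))
        top += size
    return offsets
```

For example, with `allocs = [("a", 8, 0, 1), ("b", 8, 2, 5)]` and a sync at position 2, `"b"` reuses `"a"`'s space at offset 0; with no sync, `"b"` is placed at offset 8.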

In a future PR, we can explore extensions of this approach which introduce new syncs. Since the current implementation does not introduce syncs and only reuses memory when it is safe to do so, it is activated as the only method for shared memory allocation, replacing the previous class that never reused space.

@jacobhinkle changed the title from "[WIP] Stack-based shared memory allocator" to "Stack-based shared memory allocator" Aug 11, 2023
@jacobhinkle (Collaborator, Author) commented:

!build

@jacobhinkle jacobhinkle marked this pull request as ready for review August 11, 2023 23:01
@jacobhinkle jacobhinkle requested a review from zasdfgbnm August 11, 2023 23:01
@jacobhinkle jacobhinkle merged commit 2e85798 into main Aug 15, 2023
@jacobhinkle jacobhinkle deleted the smem_reuse_stack branch August 15, 2023 13:17
jacobhinkle added a commit that referenced this pull request Aug 23, 2023
As of #703, nvfuser is able to reuse shared memory even when the indexes
don't match. However, this requires block synchronization to occur in
the appropriate places. That PR did not provide a mechanism for
inserting synchronization. This PR addresses that by adding the method
`TensorView::promoteReuse()`, which sets the `promote_reuse_` flag on
the tensor. The `reuseMemoryAllocations` lowering pass recognizes that
flag and ensures synchronizations are available after the tensor's last
use but before the next smem allocation is written to, so that memory
reuse can occur. That pass now looks like:

1. Find shared or local memory tensors that can be "aliased": their
lifetimes don't overlap and they have equivalent index expressions, so
the later tensor can be replaced with a reference to the first.
2. Find shared memory tensors that should be promoted for reuse, i.e.
those with the `promote_reuse_` flag set. Each determines an interval
between its last read and the first write of the next smem tensor; we
check for syncing expressions within those intervals. Currently we
insert a sync at the end of the interval if we don't find a
pre-existing one, but these could be changed to arrive/wait barriers in
future work.
3. Do shared memory allocation as introduced in #703. The new syncs
introduced in step 2 are now recognized and memory is reclaimed as
requested.
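Step 2's interval check can be sketched as follows (a simplified model, assuming integer positions for reads, writes, and syncs; the function name and the return convention are hypothetical, not the actual lowering pass API):

```python
def ensure_sync_in_interval(last_read, next_first_write, sync_positions):
    """Ensure a block sync exists in the interval (last_read, next_first_write].

    Returns the (possibly extended) sorted list of sync positions and
    the position of the sync covering the interval. If no pre-existing
    sync falls inside the interval, one is inserted at the end of the
    interval, matching the policy described in step 2 above.
    """
    for s in sync_positions:
        if last_read < s <= next_first_write:
            # A pre-existing sync already separates the last read from
            # the next first write; no insertion needed.
            return sorted(sync_positions), s
    # No sync found: insert one at the end of the interval.
    inserted = sorted(sync_positions + [next_first_write])
    return inserted, next_first_write
```

For instance, a sync already at position 5 covers the interval (3, 7], so nothing is inserted; if the only syncs are at 2 and 9, a new sync is inserted at position 7.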

Note that currently `promoteReuse` can be called on any `TensorView`,
but it only has an effect on shared memory tensors.