Stack-based shared memory allocator #703
Merged
jacobhinkle merged 33 commits into main on Aug 15, 2023
zasdfgbnm
reviewed
Aug 14, 2023
zasdfgbnm
approved these changes
Aug 14, 2023
jacobhinkle
added a commit
that referenced
this pull request
Aug 23, 2023
As of #703, nvfuser is able to reuse shared memory even when the indexes don't match. However, this requires block synchronization to occur in the appropriate places, and that PR did not provide a mechanism for inserting synchronization.

This PR addresses this by adding the method `TensorView::promoteReuse()`, which sets the `promote_reuse_` flag on the tensor. The `reuseMemoryAllocations` lowering pass recognizes that flag and ensures synchronizations are available after the tensor's last use but before the next smem allocation is written to, so that memory reuse can occur. That pass now looks like:

1. Find shared or local memory tensors that can be "aliased". That is, their lifetimes don't overlap and they have equivalent index expressions, so the later one can be replaced with a reference to the first.
2. Find shared memory tensors which we should promote for re-use, i.e. those with the `promote_reuse_` flag set. Each of these determines an interval between its last read and the next first write of another smem tensor; we check for syncing expressions within those intervals. Currently we insert a sync at the end of the interval if we don't find a pre-existing one, but we could change these to arrive/wait barriers in future work.
3. Do shared memory allocation as introduced in #703. The new syncs introduced in step 2 are now recognized and memory is reclaimed as requested.

Note that currently `promoteReuse` can be called on any `TensorView`, but it only has an effect on shared memory tensors.
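Step 2 of the pass can be illustrated with a small standalone sketch. This is not nvfuser's implementation: the `Lifetime` struct, the `planSyncInsertions` function, and the use of integer expression positions are all assumptions made for illustration. For each promoted tensor it finds the interval between the last read and the next first write of another tensor, looks for a sync inside that interval, and otherwise plans one at the end of the interval.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical lifetime record for one shared memory tensor. Positions are
// indices of expressions in the kernel (an assumption of this sketch).
struct Lifetime {
  std::string name;
  int first_write;
  int last_read;
  bool promote_reuse;  // stands in for the promote_reuse_ flag
};

// Returns the positions before which a new block sync would be inserted.
// existing_syncs holds the positions of syncing expressions already present.
std::vector<int> planSyncInsertions(
    const std::vector<Lifetime>& tensors,
    const std::set<int>& existing_syncs) {
  std::set<int> planned;
  for (const Lifetime& t : tensors) {
    if (!t.promote_reuse) {
      continue;
    }
    // End of the interval: the earliest first write of another tensor that
    // happens after this tensor's last read.
    int next_write = -1;
    for (const Lifetime& other : tensors) {
      if (&other == &t || other.first_write <= t.last_read) {
        continue;
      }
      if (next_write < 0 || other.first_write < next_write) {
        next_write = other.first_write;
      }
    }
    if (next_write < 0) {
      continue;  // nothing is written later, so there is nothing to reuse
    }
    // A pre-existing sync strictly inside (last_read, next_write) suffices.
    auto it = existing_syncs.upper_bound(t.last_read);
    bool has_existing = it != existing_syncs.end() && *it < next_write;
    // So does a sync already planned just before a write in the interval.
    auto pit = planned.upper_bound(t.last_read);
    bool has_planned = pit != planned.end() && *pit <= next_write;
    if (!has_existing && !has_planned) {
      planned.insert(next_write);  // sync at the end of the interval
    }
  }
  return std::vector<int>(planned.begin(), planned.end());
}
```

In this sketch, a promoted tensor last read at position 2 followed by another tensor first written at position 4 yields one planned sync before position 4, unless a sync already exists at position 3.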
This PR replaces `NoReuseSharedMemAllocator` with a new shared mem allocator called `StackBasedSharedMemAllocator`. In this new approach, we re-use memory that is no longer used where there are existing synchronizations in the kernel. In particular, we use a stack of allocations to lay out memory, and we delay pushing onto the stack until it is time to reclaim memory. This lets us order the allocations not by when they were first written, but by when they are last read, giving us more opportunities for re-use since the first allocations to be freed will lie at the top of the stack.

In a future PR, we can explore extensions of this approach which introduce new syncs. Since the current implementation does not introduce syncs and only reuses memory when it is safe to do so, it is activated as the only method for shared memory allocation, replacing the previous class that never reused space.
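The delayed-push idea can be shown with a simplified model. The sketch below is not the `StackBasedSharedMemAllocator` code: `SmemTensor`, `Layout`, and `layOutOnStack` are hypothetical names, and lifetimes and syncs are reduced to integer expression positions. Tensors wait in a pending list after their first write. Only when a sync offers a chance to reclaim memory are they pushed, ordered so that the tensor with the earliest last read ends up on top of the stack and can be popped.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical lifetime of one shared memory tensor, in expression positions.
struct SmemTensor {
  std::string name;
  int size;         // bytes
  int first_write;  // position of the first write
  int last_read;    // position of the last read
};

struct Layout {
  std::map<std::string, int> offset;  // assigned byte offset per tensor
  int peak = 0;                       // total shared memory required
};

Layout layOutOnStack(std::vector<SmemTensor> tensors, std::vector<int> syncs) {
  std::sort(tensors.begin(), tensors.end(),
            [](const SmemTensor& a, const SmemTensor& b) {
              return a.first_write < b.first_write;
            });
  std::sort(syncs.begin(), syncs.end());

  Layout layout;
  std::vector<const SmemTensor*> stack;    // allocations with an address
  std::vector<const SmemTensor*> waiting;  // written, but not yet pushed
  int top = 0;

  // Push pending tensors with the latest last read first, so the tensor
  // that dies earliest ends up on top of the stack.
  auto pushWaiting = [&]() {
    std::sort(waiting.begin(), waiting.end(),
              [](const SmemTensor* a, const SmemTensor* b) {
                return a->last_read > b->last_read;
              });
    for (const SmemTensor* t : waiting) {
      layout.offset[t->name] = top;
      top += t->size;
      layout.peak = std::max(layout.peak, top);
      stack.push_back(t);
    }
    waiting.clear();
  };

  std::size_t next = 0;
  for (int sync : syncs) {
    // Collect tensors whose first write precedes this sync.
    while (next < tensors.size() && tensors[next].first_write < sync) {
      waiting.push_back(&tensors[next++]);
    }
    auto dead = [sync](const SmemTensor* t) { return t->last_read < sync; };
    bool can_reclaim = (!stack.empty() && dead(stack.back())) ||
        std::any_of(waiting.begin(), waiting.end(), dead);
    if (!can_reclaim) {
      continue;  // keep delaying the push
    }
    pushWaiting();
    // Reclaim: pop every dead allocation from the top of the stack.
    while (!stack.empty() && dead(stack.back())) {
      top -= stack.back()->size;
      stack.pop_back();
    }
  }
  // Tensors written after the last sync still need an address.
  while (next < tensors.size()) {
    waiting.push_back(&tensors[next++]);
  }
  pushWaiting();
  return layout;
}
```

As an example under these assumptions, take tensor A (4 bytes, written at 0, last read at 2), B (2 bytes, written at 1, last read at 5), C (4 bytes, written at 4, last read at 6), and a sync at position 3. Ordering by last read places B at offset 0 and A at offset 2; A is popped at the sync and C reuses offset 2, for a peak of 6 bytes. Pushing in first-write order would leave the live tensor B above A, so C could not reuse A's space.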