Enable shared memory reuse in matmul epilogue #770
Previously, we stacked every ForLoop regardless of parallelization. This meant that when the first few dimensions were left of compute at in the whole fusion, even if they were parallelized all tensors would have the same outer live interval. I noticed this for the AmpereMatmulSmemEpilogue_CUDA tests. In that case if you look at the generated CUDA it's clearly not true; the outer for loops do not appear since they are parallelized. This commit fixes this; note that it can affect all reuse analysis including aliasing even of local memory.
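The idea behind this fix can be illustrated with a minimal sketch. The `Loop` struct and `materializedLoopStack` below are hypothetical names made up for this example and are not nvFuser's actual lowering code; the sketch only assumes that each loop in a nest carries a flag saying whether it is parallelized, and that parallelized loops do not appear as `for` loops in the generated CUDA.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one loop in a nest (illustrative only).
struct Loop {
  std::string name;
  bool parallelized;  // true if bound to blockIdx/threadIdx and so not emitted
};

// Build the loop stack used for live-interval analysis, keeping only loops
// that actually appear as for-loops in the generated kernel.
inline std::vector<std::string> materializedLoopStack(
    const std::vector<Loop>& nest) {
  std::vector<std::string> stack;
  for (const auto& loop : nest) {
    if (!loop.parallelized) {
      stack.push_back(loop.name);
    }
  }
  return stack;
}
```

Under this sketch, two tensors that share only parallelized outer loops no longer share an outer loop on the stack, so they are not forced into the same outer live interval.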
… into smem_epilogue_request_reuse
The check reminds me a lot of how we check persistent buffer usage.
@jacobhinkle Do we have tests?
Not yet. I was thinking of modifying the
const auto blocks_per_sm_with_smem_epilogue = std::min(
    shared_memory_available / (smem_a + smem_b + smem_c),
    shared_memory_available / total_with_smem_epilogue,
    (size_t)blocks_per_sm_by_register);
Should we add a case so that, if reuse and no-reuse provide the same occupancy, we don't promote reuse? That would save a sync.
That's a good idea. In that case I'll also need to guard promoteReuse with params.use_smem_epilogue.
Actually, now I understand your comment. Yes we'll need a separate parameter indicating whether to reuse memory or not, in addition to use_smem_epilogue. I'll push something in a moment.
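A rough sketch of the rule suggested in this thread: promote reuse only when it strictly improves occupancy. The struct, its field names, and `shouldPromoteReuse` are hypothetical and are not the heuristic's actual code. The sketch also assumes that, with reuse, the epilogue tile overlays the `a` and `b` tiles, so the per-block footprint becomes `max(smem_a + smem_b, smem_c)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical inputs to the occupancy estimate (all sizes in bytes).
struct SmemEstimate {
  size_t smem_a;                     // operand A tile
  size_t smem_b;                     // operand B tile
  size_t smem_c;                     // epilogue tile
  size_t shared_memory_available;    // shared memory per SM
  size_t blocks_per_sm_by_register;  // occupancy limit from registers
};

// Blocks per SM for a given per-block shared memory footprint.
inline size_t blocksPerSm(const SmemEstimate& e, size_t smem_per_block) {
  assert(smem_per_block > 0);
  return std::min(
      e.shared_memory_available / smem_per_block,
      e.blocks_per_sm_by_register);
}

// Promote reuse only if it raises occupancy; otherwise skip it and avoid
// the extra sync before the epilogue.
inline bool shouldPromoteReuse(const SmemEstimate& e) {
  const size_t no_reuse = e.smem_a + e.smem_b + e.smem_c;
  // Assumption: with reuse, the epilogue tile overlays the A and B tiles.
  const size_t with_reuse = std::max(e.smem_a + e.smem_b, e.smem_c);
  return blocksPerSm(e, with_reuse) > blocksPerSm(e, no_reuse);
}
```

When registers are the limiting factor, both footprints give the same block count and the function returns false, which matches the case where the sync would be wasted.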
zasdfgbnm left a comment
Generally LGTM, but I will leave it to @drzejan2 for approval. Also, could you run /build/nvfuser_bench --benchmark_filter=Matmul to report the perf on these benchmarks, and also work with @mmigdal-nv on a more thorough perf evaluation?
Will do!
drzejan2 left a comment
Changes look good to me.
I will approve this MR when the use case mentioned by @zasdfgbnm is supported (link).
tests failing at the moment
Supported now. The heuristic now holds
I ran the benchmarks in the background while working on some other stuff. Then I realized that the benchmarks do not use smem for the epilogue, since the heuristic is not run; instead, the matmul benchmarks set the params manually. So I see no difference in perf compared to TOT. Altering that should probably not go into this PR, but I will hack it to see what effect it has on some of the benchmarks.
test/test_gpu_tensorcore.cpp
if (params.promote_prologue_smem_reuse) {
  // Check prologue shared memory re-use
  TORCH_CHECK(smem_allocs.at(1)->address()->isZero());
  TORCH_CHECK(smem_allocs.at(2)->address()->isZero());
Is smem_allocs.at(0) A, smem_allocs.at(1) B, and smem_allocs.at(2) C? So B is reusing A's memory, and C is reusing B's memory?
Yes that's right. I improved the comment here a bit.
I see. So we allocate B first, then A. This makes sense. Thanks for the explanation!
Wait, is it guaranteed which of A and B is allocated first? If A is allocated first, will smem_allocs.at(1) no longer be zero here? I think we should remove the check for smem_allocs.at(1).
Very good question since their lifetimes end at the same time point. We break ties like that by ordering by name(), so B will be ordered after A, leading to this order consistently. However, that's not exactly clear so maybe we could just check that C is placed at 0 and that either A or B is at 0.
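The relaxed check proposed here could look roughly like the following. `SmemAlloc` is a hypothetical stand-in for the lowered allocation nodes, and the sketch assumes the list holds exactly three allocations with A and B (in either order) at indices 0 and 1 and the epilogue tile C at index 2.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a lowered shared memory allocation.
struct SmemAlloc {
  int64_t address;  // byte offset into dynamic shared memory
};

// Order-independent check: C must sit at offset 0, and at least one of A or B
// must also sit at offset 0, whichever of them happened to be allocated first.
inline bool epilogueReusesPrologueSmem(
    const std::vector<SmemAlloc>& smem_allocs) {
  assert(smem_allocs.size() == 3);  // assumption: {A, B, C}
  const bool a_or_b_at_zero =
      smem_allocs[0].address == 0 || smem_allocs[1].address == 0;
  const bool c_at_zero = smem_allocs[2].address == 0;
  return c_at_zero && a_or_b_at_zero;
}
```

This passes regardless of how the tie between A and B is broken, so the test would no longer depend on the ordering by name().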
drzejan2 left a comment
Functionally everything is sound; I will approve when the bug in the hashing function is fixed.
Co-authored-by: Andrzej Bekas <118676880+drzejan2@users.noreply.github.com>
See note `[Struct Support in PolymorphicValue]` for description, and the new test `PolymorphicValueTest.Struct` for examples.
This updates the root-to-rfactor propagation in IterType concretization of dynamic fusions. Previously, although we only overwrote Symbolic IterDomains in this step, we still asserted that we could infer an IterType for each of them. I moved that check so that it is only applied when we need to make a change. Additionally, we previously propagated Broadcast-only IterDomains as Symbolic, because we combined each input IterType with our previous estimate using promoteIterType. Instead, we now only fall back to promoteIterType when there are multiple input IterTypes to the IterDomain expression. Fixes #798
Test is skipped in this case anyway
This uses `promoteReuse` from #739 and inserts a syncthreads just before the epilogue loop when smem is used for the epilogue, when possible. The matmul heuristic attempts to predict when this will be possible in order to more accurately estimate shared memory usage, and hence occupancy. If we cannot guarantee re-use, we must assume in the heuristic that memory will not be reclaimed, even though it might be when the fusion is lowered.

Shared memory reclamation can only occur if the smem buffers have non-overlapping lifetimes. This is difficult to guarantee before scheduling and lowering. We use `cacheAfter` to create the `a` and `b` smem tiles, but we use `cacheBefore` for the epilogue smem tile. This means that smem will be used for any downstream uses of `a` and `b`, but the epilogue smem will have its lifetime restricted to the epilogue itself, regardless of downstream uses of the matrix product.

The uses of `a` and `b` can complicate lifetime analysis. Consider a case where both matrices are square and we wish to compute `a @ b + a`, where `@` denotes matmul. Since we used `a->cacheAfter()` to create the smem tile, that smem may be used not only in the matmul but also in the addition in the epilogue. In that case we cannot re-use `a` for the epilogue smem. A conservative check that there are no other uses of `a` or `b` is currently implemented in order to guarantee re-use. A less conservative sufficient (but still not necessary) condition is that any other use of `a` or `b` is a producer of `b` or `a`, respectively; this is not implemented yet.
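The non-overlapping lifetime condition can be illustrated with a small sketch. `Lifetime`, `lifetimesOverlap`, and `epilogueCanReuse` are hypothetical names for this example only; the sketch assumes a buffer's lifetime is a closed interval of positions in a linearized kernel, which is a simplification and not nvFuser's internal representation.

```cpp
#include <cassert>

// Hypothetical lifetime of a shared memory buffer: positions of its first
// write and last read in a linearized kernel (illustrative only).
struct Lifetime {
  int first_write;
  int last_read;
};

// Two closed intervals overlap when each begins no later than the other ends.
inline bool lifetimesOverlap(const Lifetime& x, const Lifetime& y) {
  assert(x.first_write <= x.last_read);
  assert(y.first_write <= y.last_read);
  return x.first_write <= y.last_read && y.first_write <= x.last_read;
}

// The epilogue tile may reclaim an operand tile only if their lifetimes are
// disjoint.
inline bool epilogueCanReuse(const Lifetime& operand, const Lifetime& epilogue) {
  return !lifetimesOverlap(operand, epilogue);
}
```

For example, with the epilogue tile live over positions 4 to 5, an `a` tile that is last read at position 3 (plain `a @ b`) can be reclaimed, whereas an `a` tile last read at position 5 (as in `a @ b + a`) cannot.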