Conversation
Note that CudaCodeGenerator currently always starts a loop with 0 even if ForLoop::start_ is non-zero. I think this change is safe since only use case of non-zero start should be the epilogue loop of circular buffering. The concern of degenerate loop should not be applicable.
This avoids extra memory accesses without adding extra predicates, which was prototyped in PR #2660.
|
!build |
|
Is this related to #2008? cc: @jacobhinkle |
Hmm, does that mean we don't want to have epilogue loops in some cases? |
|
I dont think it's necessarily bad to have an epilogue in cp.async cases, maybe other than code size. With an epilogue we would not need to drain the leftover jobs as was added in #2008. |
|
The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation? |
Ah, no, actually they are failing. Will look into them. |
I remember the issue with using an epilogue loop for these |
So, for |
Yes I think so, only because I could not figure out a way around that inline ptx limitation so that we could have a different call in each epilogue iteration. |
|
@jacobhinkle What about this for selecting Take a look at the PTX in https://ce.nvidia.com/z/oKxhxc. inline __device__ void cp_async_wait_group_read(int n)
{
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-wait-group
if (n == 0) { asm volatile("cp.async.wait_group.read 0; \n" ::: "memory"); }
if (n == 1) { asm volatile("cp.async.wait_group.read 1; \n" ::: "memory"); }
if (n == 2) { asm volatile("cp.async.wait_group.read 2; \n" ::: "memory"); }
if (n == 3) { asm volatile("cp.async.wait_group.read 3; \n" ::: "memory"); }
if (n == 4) { asm volatile("cp.async.wait_group.read 4; \n" ::: "memory"); }
if (n == 5) { asm volatile("cp.async.wait_group.read 5; \n" ::: "memory"); }
assert(n >= 0 && n <= 5);
} |
I think this is essentially the same as having a switch statement. That is indeed the alternative if we need to have an epilogue I believe.
That's good to know. I didn't know it would be able to prune dead branches like that. That means we could actually have a pretty high number of hard-coded cases and there would be no runtime penalty. |
Adding support of predicate indexing with circular buffering. Circular buffering itself doesn't need many changes, but circular buffering and unswitch/unroll is a bit more complicated. There's an existing [bug](#2159) as well, which is fixed here. #2663 could simplify this PR but we probably don't want to enforce epilogue generation. This PR doesn't rely on it. Fixes #2159
|
I'm closing this for now as it's still unclear if we could workaround the performance concerns. I wanted to do this to simplify predicate indexing but not strictly necessary. |
Stacked on top of #2661.
We currently only generate circular buffer epilogue loops when the producer is in global memory. This PR changes we always generate epilogue.
This is not strictly necessary for correctness but avoids extra memory accesses that won't be used.
Alternatively, we could add an extra predicate in the main loop. See #2660 as well.
Overall, I think this approach is simpler and I don't see any performance concern compared to #2660.