
Always generate epilogue #2663

Closed
naoyam wants to merge 5 commits into main from always_generate_epilogue

Conversation

@naoyam
Collaborator

@naoyam naoyam commented Jul 22, 2024

Stacked on top of #2661.

We currently generate circular buffer epilogue loops only when the producer is in global memory. This PR changes that so that an epilogue is always generated.

This is not strictly necessary for correctness, but it avoids extra memory accesses whose results would never be used.

Alternatively, we could add an extra predicate in the main loop. See #2660 as well.

Overall, I think this approach is simpler and I don't see any performance concern compared to #2660.
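For context on the loop structure being discussed, a circular-buffered loop is typically split into a prologue (fill the buffers), a main loop (prefetch ahead while consuming), and an epilogue (drain the buffers without prefetching). Below is a minimal, purely illustrative host-side sketch, not nvFuser's actual generated code; all names are hypothetical. It shows why the epilogue avoids the extra, never-consumed memory accesses mentioned above:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of a circular-buffered loop with kStages buffers in
// flight. The main loop prefetches iteration i + kStages - 1 while consuming
// iteration i; the epilogue consumes the last kStages - 1 buffered
// iterations without issuing any further loads.
constexpr int kStages = 2;

int sumWithCircularBuffer(const std::vector<int>& in) {
  const int n = static_cast<int>(in.size());
  int buf[kStages] = {};  // circular buffer of prefetched values

  // Prologue: prefetch the first kStages - 1 iterations.
  for (int i = 0; i < kStages - 1 && i < n; ++i) {
    buf[i % kStages] = in[i];
  }

  int sum = 0;
  // Main loop: prefetch ahead, then consume the oldest buffered value.
  for (int i = 0; i + kStages - 1 < n; ++i) {
    buf[(i + kStages - 1) % kStages] = in[i + kStages - 1];
    sum += buf[i % kStages];
  }

  // Epilogue: drain the remaining buffered values; no loads happen here.
  const int start = n - (kStages - 1);
  for (int i = (start > 0 ? start : 0); i < n; ++i) {
    sum += buf[i % kStages];
  }
  return sum;
}
```

Without the epilogue, the main loop would have to run to the end and keep prefetching past the valid range, which is the "extra memory accesses" this PR avoids.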

naoyam added 2 commits July 22, 2024 14:14
Note that CudaCodeGenerator currently always starts a loop at 0 even
if ForLoop::start_ is non-zero. I think this change is safe since the
only use case of a non-zero start should be the epilogue loop of circular
buffering. The concern about degenerate loops should not apply here.
This avoids extra memory accesses without adding extra predicates, which
was prototyped in PR #2660.
@naoyam naoyam changed the base branch from main to circular_buffer_fix_epilogue July 22, 2024 22:04
@naoyam
Copy link
Collaborator Author

naoyam commented Jul 22, 2024

!build

@zasdfgbnm
Collaborator

Is this related to #2008? cc: @jacobhinkle

@naoyam
Collaborator Author

naoyam commented Jul 22, 2024

> Is this related to #2008? cc: @jacobhinkle

Hmm, does that mean we don't want to have epilogue loops in some cases?

@jacobhinkle
Collaborator

I don't think it's necessarily bad to have an epilogue in cp.async cases, other than maybe code size. With an epilogue we would not need to drain the leftover jobs as was added in #2008.

@naoyam
Collaborator Author

naoyam commented Jul 22, 2024

The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?

@naoyam
Collaborator Author

naoyam commented Jul 22, 2024

> The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?

Ah, no, actually they are failing. Will look into them.

@jacobhinkle
Collaborator

> The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?
>
> Ah, no, actually they are failing. Will look into them.

I remember the issue with using an epilogue loop for these cp.async waits: see #2005 (comment). The problem is that you need to specify the number of groups left in the wait as a constant, and even an unrolled loop variable cannot be used there. Because of that I'm not sure there's any good way to do this without manually replicating (unrolling) the epilogue loop, which would probably introduce more bugs.
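To illustrate the constraint described above with a hypothetical sketch (not nvFuser code, and all names invented here): since the wait-group count must appear as a compile-time constant in the inline PTX string, an epilogue whose wait count decreases each iteration cannot use a runtime loop index and would have to be replicated at compile time, e.g. via template unrolling. The inline asm is stubbed out so the dispatch logic can be checked on the host:

```cpp
#include <cassert>
#include <vector>

// Records the constants used by the stubbed wait, for illustration/testing.
std::vector<int>& recordedWaits() {
  static std::vector<int> waits;
  return waits;
}

// Stand-in for the inline PTX with the immediate kCount. On device this
// would be something like (assuming the "n" immediate asm constraint):
//   asm volatile("cp.async.wait_group %0;\n" :: "n"(kCount) : "memory");
template <int kCount>
void waitGroup() {
  recordedWaits().push_back(kCount);
}

// Compile-time "unrolled" epilogue for kStages circular-buffer stages:
// iteration k waits until only kStages - 2 - k groups remain pending,
// each with a distinct compile-time constant.
template <int kStages, int kIter = 0>
void unrolledEpilogue() {
  if constexpr (kIter < kStages - 1) {
    waitGroup<kStages - 2 - kIter>();
    // ... consume the stage that just became ready ...
    unrolledEpilogue<kStages, kIter + 1>();
  }
}
```

For kStages == 4, this emits waits with the constants 2, 1, 0, one per epilogue iteration, which is the manual replication (and its attendant code-size and bug risk) referred to above.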

@naoyam
Collaborator Author

naoyam commented Jul 23, 2024

> The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?
>
> Ah, no, actually they are failing. Will look into them.
>
> I remember the issue with using an epilogue loop for these cp.async waits: see #2005 (comment). The problem is that you need to specify the number of groups left in the wait as a constant, and even an unrolled loop variable cannot be used there. Because of that I'm not sure there's any good way to do this without manually replicating (unrolling) the epilogue loop, which would probably introduce more bugs.

So, for cp.async, it seems the best practice is to not generate an epilogue and keep the current codegen as is. Is that what you think?

@jacobhinkle
Collaborator

> So, for cp.async, it seems the best practice is to not generate epilogue and keep the current codegen as is. Is that what you think?

Yes, I think so, only because I could not figure out a way around that inline PTX limitation so that we could have a different call in each epilogue iteration.

Base automatically changed from circular_buffer_fix_epilogue to main July 23, 2024 14:57
@rdspring1
Collaborator

rdspring1 commented Jul 24, 2024

@jacobhinkle What about this for selecting cp.async at runtime?

Take a look at the PTX in https://ce.nvidia.com/z/oKxhxc.
If you apply a constraint to the index for cp_async_wait_group_read, it only runs through a subset of instructions.
e.g., (i % 3)+2 => only [2, 5] are in the foo function.

inline __device__ void cp_async_wait_group_read(int n)
{
  // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-wait-group
  if (n == 0) { asm volatile("cp.async.wait_group.read 0; \n" ::: "memory"); }
  if (n == 1) { asm volatile("cp.async.wait_group.read 1; \n" ::: "memory"); }
  if (n == 2) { asm volatile("cp.async.wait_group.read 2; \n" ::: "memory"); }
  if (n == 3) { asm volatile("cp.async.wait_group.read 3; \n" ::: "memory"); }
  if (n == 4) { asm volatile("cp.async.wait_group.read 4; \n" ::: "memory"); }
  if (n == 5) { asm volatile("cp.async.wait_group.read 5; \n" ::: "memory"); }
  assert(n >= 0 && n <= 5);
}

@jacobhinkle
Collaborator

> @jacobhinkle What about this for selecting cp.async at runtime?

I think this is essentially the same as having a switch statement. That is indeed the alternative, I believe, if we need to have an epilogue.

> Take a look at the PTX in https://ce.nvidia.com/z/oKxhxc. If you apply a constraint to the index for cp_async_wait_group_read, it only runs through a subset of instructions. e.g., (i % 3)+2 => only [2, 5] are in the foo function.

That's good to know. I didn't know it would be able to prune dead branches like that. That means we could actually have a pretty high number of hard-coded cases and there would be no runtime penalty.

naoyam added a commit that referenced this pull request Jul 27, 2024
Adding support of predicate indexing with circular buffering.

Circular buffering itself doesn't need many changes, but circular
buffering and unswitch/unroll is a bit more complicated. There's an
existing [bug](#2159) as well, which is fixed here.

#2663 could simplify this PR but we probably don't want to enforce
epilogue generation. This PR doesn't rely on it.

Fixes #2159
@naoyam
Collaborator Author

naoyam commented Jul 30, 2024

I'm closing this for now as it's still unclear whether we could work around the performance concerns. I wanted this to simplify predicate indexing, but it's not strictly necessary.

@naoyam naoyam closed this Jul 30, 2024