Always generate epilogue by naoyam · Pull Request #2663 · NVIDIA/Fuser

naoyam · 2024-07-22T22:04:03Z

Stacked on top of #2661.

We currently only generate circular buffer epilogue loops when the producer is in global memory. This PR changes that so we always generate the epilogue.

This is not strictly necessary for correctness but avoids extra memory accesses that won't be used.

Alternatively, we could add an extra predicate in the main loop. See #2660 as well.

Overall, I think this approach is simpler and I don't see any performance concern compared to #2660.
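For readers comparing the three options, here is a minimal host-side C++ sketch of the trade-off. It is not nvFuser code: `LoopStats`, `Strategy`, and `runCircularBuffer` are hypothetical names, and a load past the end of the input is modeled as a counted no-op rather than a real memory access. It assumes the input has at least `stages - 1` elements.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model; none of these names come from nvFuser.
struct LoopStats {
  int64_t loads_issued = 0;  // every load the loop nest attempts
  int64_t unused_loads = 0;  // loads past the end whose data is never consumed
  int64_t predicates = 0;    // extra bounds checks evaluated in the main loop
  int64_t sum = 0;           // result of consuming every element exactly once
};

enum class Strategy { NoEpilogue, PredicatedMain, Epilogue };

// Assumes stages >= 1 and input.size() >= stages - 1.
LoopStats runCircularBuffer(const std::vector<int64_t>& input,
                            int64_t stages,
                            Strategy strategy) {
  const int64_t n = static_cast<int64_t>(input.size());
  const int64_t prefetch = stages - 1;  // how far ahead loads are issued
  std::vector<int64_t> buffer(stages, 0);
  LoopStats stats;

  auto load = [&](int64_t i) {
    ++stats.loads_issued;
    if (i < n) {
      buffer[i % stages] = input[i];
    } else {
      ++stats.unused_loads;  // modeled as a no-op; nothing reads this data
    }
  };
  auto consume = [&](int64_t i) { stats.sum += buffer[i % stages]; };

  // Prologue: fill the first stages - 1 slots.
  for (int64_t i = 0; i < prefetch; ++i) {
    load(i);
  }

  // Main loop: with an epilogue it stops early, otherwise it covers all of n.
  const int64_t main_stop = (strategy == Strategy::Epilogue) ? n - prefetch : n;
  for (int64_t i = 0; i < main_stop; ++i) {
    if (strategy == Strategy::PredicatedMain) {
      ++stats.predicates;  // the extra predicate, in the spirit of #2660
      if (i + prefetch < n) {
        load(i + prefetch);
      }
    } else {
      load(i + prefetch);
    }
    consume(i);
  }

  // Epilogue: starts at a non-zero index and only consumes.
  // It runs zero iterations unless strategy == Strategy::Epilogue.
  for (int64_t i = main_stop; i < n; ++i) {
    consume(i);
  }
  return stats;
}
```

In this model, with 10 elements and 3 stages, `NoEpilogue` issues 12 loads of which 2 are unused, `PredicatedMain` issues 10 loads but evaluates 10 extra predicates, and `Epilogue` issues 10 loads with no extra predicates. All three produce the same sum.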

Note that CudaCodeGenerator currently always starts a loop with 0 even if ForLoop::start_ is non-zero. I think this change is safe since the only use case of a non-zero start should be the epilogue loop of circular buffering. The concern about degenerate loops should not be applicable.

This avoids extra memory accesses without adding extra predicates, which was prototyped in PR #2660.

naoyam · 2024-07-22T22:05:10Z

!build

zasdfgbnm · 2024-07-22T22:12:29Z

Is this related to #2008? cc: @jacobhinkle

naoyam · 2024-07-22T22:19:53Z

> Is this related to #2008? cc: @jacobhinkle

Hmm, does that mean we don't want to have epilogue loops in some cases?

jacobhinkle · 2024-07-22T22:21:29Z

I don't think it's necessarily bad to have an epilogue in cp.async cases, other than perhaps code size. With an epilogue we would not need to drain the leftover jobs as was added in #2008.

naoyam · 2024-07-22T22:32:33Z

The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?

naoyam · 2024-07-22T22:34:26Z

> The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?

Ah, no, actually they are failing. Will look into them.

jacobhinkle · 2024-07-22T23:30:11Z

> > The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?
>
> Ah, no, actually they are failing. Will look into them.

I remember the issue with using an epilogue loop for these cp.async waits: see #2005 (comment). The problem is that you need to specify the number of groups left in the wait as a constant, and even an unrolled loop variable cannot be used there. Because of that I'm not sure there's any good way to do this without manually replicating (unrolling) the epilogue loop, which would probably introduce more bugs.

naoyam · 2024-07-23T01:02:09Z

> > > The test modified in #2008 doesn't seem to fail with this PR change. I suppose it's because the epilogue loop is actually empty. Am I understanding correctly? Do we have some simpler tests, preferably without loop rotation?
> >
> > Ah, no, actually they are failing. Will look into them.
>
> I remember the issue with using an epilogue loop for these cp.async waits: see #2005 (comment). The problem is that you need to specify the number of groups left in the wait as a constant, and even an unrolled loop variable cannot be used there. Because of that I'm not sure there's any good way to do this without manually replicating (unrolling) the epilogue loop, which would probably introduce more bugs.

So, for cp.async, it seems the best practice is to not generate epilogue and keep the current codegen as is. Is that what you think?

jacobhinkle · 2024-07-23T02:04:08Z

> So, for cp.async, it seems the best practice is to not generate epilogue and keep the current codegen as is. Is that what you think?

Yes, I think so, but only because I could not figure out a way around that inline PTX limitation so that we could have a different call in each epilogue iteration.
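As an illustration of what "manually replicating (unrolling) the epilogue loop" could look like, here is a host-side C++ sketch in which the wait count is a template parameter, so every call site sees a compile-time constant. Everything here is hypothetical: `waitLog`, `waitGroupModel`, `unrolledEpilogue`, and `runEpilogueModel` are illustrative names, the wait is modeled by logging its argument, and the assumption that epilogue iteration `i` waits on `Stages - 2 - i` groups is mine, not something stated in this PR.

```cpp
#include <utility>
#include <vector>

// Hypothetical log standing in for the wait instructions a kernel would run.
inline std::vector<int>& waitLog() {
  static std::vector<int> log;
  return log;
}

// N is a template parameter, so it is a constant at every call site.
// In device code the body might be something like
//   asm volatile("cp.async.wait_group %0;\n" ::"n"(N));
// which is an assumption for illustration, not taken from this PR.
template <int N>
void waitGroupModel() {
  waitLog().push_back(N);
}

// Expands to one waitGroupModel<...>() call per epilogue iteration.
// Braced initializer lists are evaluated left to right, so the calls
// happen in iteration order.
template <int... Iter>
void unrolledEpilogue(std::integer_sequence<int, Iter...>) {
  constexpr int count = static_cast<int>(sizeof...(Iter));
  int expand[] = {0, (waitGroupModel<count - 1 - Iter>(), 0)...};
  (void)expand;
}

// The epilogue has Stages - 1 iterations; iteration i is assumed to wait
// until at most Stages - 2 - i groups remain in flight.
template <int Stages>
void runEpilogueModel() {
  unrolledEpilogue(std::make_integer_sequence<int, Stages - 1>{});
}
```

Calling `runEpilogueModel<4>()` records the waits 2, 1, 0 in `waitLog()`. The cost is that the stage count must itself be a compile-time constant and the epilogue body is replicated once per iteration, which is the code-size and complexity concern raised above.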

rdspring1 · 2024-07-24T18:17:09Z

@jacobhinkle What about this for selecting cp.async at runtime?

Take a look at the PTX in https://ce.nvidia.com/z/oKxhxc.
If you apply a constraint to the index for cp_async_wait_group_read, it only runs through a subset of instructions.
e.g., (i % 3)+2 => only [2, 5] are in the foo function.

inline __device__ void cp_async_wait_group_read(int n)
{
  // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-wait-group
  if (n == 0) { asm volatile("cp.async.wait_group.read 0; \n" ::: "memory"); }
  if (n == 1) { asm volatile("cp.async.wait_group.read 1; \n" ::: "memory"); }
  if (n == 2) { asm volatile("cp.async.wait_group.read 2; \n" ::: "memory"); }
  if (n == 3) { asm volatile("cp.async.wait_group.read 3; \n" ::: "memory"); }
  if (n == 4) { asm volatile("cp.async.wait_group.read 4; \n" ::: "memory"); }
  if (n == 5) { asm volatile("cp.async.wait_group.read 5; \n" ::: "memory"); }
  assert(n >= 0 && n <= 5);
}

jacobhinkle · 2024-07-24T20:19:48Z

> @jacobhinkle What about this for selecting cp.async at runtime?

I think this is essentially the same as having a switch statement. That is indeed the alternative if we need to have an epilogue I believe.

> Take a look at the PTX in https://ce.nvidia.com/z/oKxhxc. If you apply a constraint to the index for cp_async_wait_group_read, it only runs through a subset of instructions. e.g., (i % 3)+2 => only [2, 5] are in the foo function.

That's good to know. I didn't know it would be able to prune dead branches like that. That means we could actually have a pretty high number of hard-coded cases and there would be no runtime penalty.
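A rough host-side C++ sketch of the switch-statement alternative follows. It is a model under stated assumptions, not the proposed implementation: `waitLog`, `waitGroupModel`, `waitGroupDispatch`, and `runEpilogueWithDispatch` are hypothetical names, the wait is modeled by logging its argument, the supported range 0 to 5 mirrors the snippet above, and the per-iteration wait count `stages - 2 - i` is my assumption.

```cpp
#include <vector>

// Hypothetical log standing in for the wait instructions a kernel would run.
inline std::vector<int>& waitLog() {
  static std::vector<int> log;
  return log;
}

// Each instantiation corresponds to one hard-coded constant operand.
template <int N>
void waitGroupModel() {
  waitLog().push_back(N);
}

// Maps a runtime value to a call with a constant argument.
// Returns false when n is outside the hard-coded range.
inline bool waitGroupDispatch(int n) {
  switch (n) {
    case 0: waitGroupModel<0>(); return true;
    case 1: waitGroupModel<1>(); return true;
    case 2: waitGroupModel<2>(); return true;
    case 3: waitGroupModel<3>(); return true;
    case 4: waitGroupModel<4>(); return true;
    case 5: waitGroupModel<5>(); return true;
    default: return false;
  }
}

// Epilogue written as an ordinary runtime loop; iteration i is assumed to
// wait until at most stages - 2 - i groups remain in flight.
inline bool runEpilogueWithDispatch(int stages) {
  for (int i = 0; i < stages - 1; ++i) {
    if (!waitGroupDispatch(stages - 2 - i)) {
      return false;
    }
  }
  return true;
}
```

With `stages == 4` the loop records the waits 2, 1, 0. When the compiler can bound the argument, as observed in the Compiler Explorer link above, the unreachable cases can be dropped, so a generous hard-coded range need not cost anything at runtime.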

Adding support of predicate indexing with circular buffering. Circular buffering itself doesn't need many changes, but circular buffering and unswitch/unroll is a bit more complicated. There's an existing [bug](#2159) as well, which is fixed here. #2663 could simplify this PR but we probably don't want to enforce epilogue generation. This PR doesn't rely on it. Fixes #2159

naoyam · 2024-07-30T15:43:43Z

I'm closing this for now as it's still unclear whether we can work around the performance concerns. I wanted to do this to simplify predicate indexing, but it is not strictly necessary.

naoyam added 2 commits July 22, 2024 14:14

Always generate epilogue in circular buffering

e4179a8

This avoids extra memory accesses without adding extra predicates, which was prototyped in PR #2660.

naoyam changed the base branch from main to circular_buffer_fix_epilogue July 22, 2024 22:04

naoyam requested a review from rdspring1 July 22, 2024 22:05

naoyam mentioned this pull request Jul 22, 2024

Add a predicate to main loop of circular buffering #2660

Closed

Use int64_t instead of unsigned

c793a2e

string match fix

63d1f53

Merge branch 'circular_buffer_fix_epilogue' into always_generate_epil…

186da00

…ogue

Base automatically changed from circular_buffer_fix_epilogue to main July 23, 2024 14:57

naoyam mentioned this pull request Jul 24, 2024

Predicate indexing for circular buffering #2677

Merged

naoyam closed this Jul 30, 2024

Always generate epilogue #2663
naoyam wants to merge 5 commits into main from always_generate_epilogue

4 participants
