
Ignore trivial loops in memory aliasing pass#766

Merged
jacobhinkle merged 4 commits into main from alias_pass_ignore_trivial_loops on Aug 28, 2023

Conversation

jacobhinkle (Collaborator) commented Aug 22, 2023

Trivial kir::ForLoops are ones that appear in the kernel IR but do not appear in the generated CUDA kernel. This can happen for several reasons: for example, if that dimension is vectorized, or if it is parallelized with a stop value equal to the extent of the dimension. We can test for this with kir::ForLoop::isTrivial(). Consider an example:

T9_s[ ... ] ca_pos( 2 ) = ALLOCATE(buffer=T9_s[ ... ] ca_pos( 2 ), mem_type=shared, size=8192, zero_init=false)
T8_s[ ... ] ca_pos( 2 ) = ALLOCATE(buffer=T8_s[ ... ] ca_pos( 2 ), mem_type=shared, size=4096, zero_init=false)
T7_s[ ... ] ca_pos( 2 ) produce_pos( 2 ) = ALLOCATE(buffer=T7_s[ ... ] ca_pos( 2 ) produce_pos( 2 ), mem_type=shared, size=8192, zero_init=false)
FOR blockIdx.x in iblockIdx.x71{( ceilDiv(T0.logical_size[0], 64) )}:
  FOR blockIdx.y in iblockIdx.y73{( ceilDiv(T1.logical_size[1], 128) )}:
    ...
    FOR threadIdx.z in ithreadIdx.z84{( ceilDiv(( ( ceilDiv(64, 4) ) * 4 ), 32) )}:
      FOR threadIdx.y in ithreadIdx.y86{( ceilDiv(( ( ( ceilDiv(( ceilDiv(128, 8) ), 4) ) * 4 ) * 8 ), 32) )}:
        // All writes and reads of T7, T8, T9 are in this loop
        FOR i806 in iS88{( ceilDiv(32, 16) )}:
          T7_s = ...;
        ENDFOR i806
        T8_s = T7_s;
        T9_s = T8_s;
      ENDFOR threadIdx.y
    ENDFOR threadIdx.z
  ENDFOR blockIdx.y
ENDFOR blockIdx.x

In this case, all of the parallelized for loops are trivial, and only the FOR i806 loop appears in the generated code. That means the actual lifetimes of T7 and T8 overlap and those of T8 and T9 overlap, but not those of T7 and T9.

In the aliasing pass, outer live intervals are defined at the scope of the allocation. In the above case, the pass sets the outer live interval of all three allocations equal to the start and end of the blockIdx.x loop.

This PR ignores trivial loops in this analysis, so that outer live intervals are defined at the innermost scope of the Allocate expression that is actually realized in the generated CUDA kernel. In the above example, the outer live intervals for T7 and T9 no longer overlap, so they become eligible for memory re-use.
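The effect of skipping trivial loops can be illustrated with a small sketch (this is not the nvFuser implementation; loop positions and the interval-expansion rule are simplified assumptions). Each buffer's first and last use positions are widened to cover any enclosing non-trivial loop, since a loop body re-executes; trivial loops, which vanish from the generated kernel, are skipped.

```python
# Hypothetical sketch: compute "outer live intervals" by expanding each
# buffer's first/last use positions over enclosing loops, optionally
# skipping trivial loops, then test which buffers may share memory.

def outer_interval(first_use, last_use, loops, ignore_trivial=True):
    """loops: (start, end, is_trivial) tuples, ordered innermost to outermost."""
    lo, hi = first_use, last_use
    for start, end, trivial in loops:
        if ignore_trivial and trivial:
            continue  # this loop is absent from the generated kernel
        # A use inside a loop keeps the buffer live for the whole loop,
        # because the loop body re-executes.
        if start <= lo <= end or start <= hi <= end:
            lo, hi = min(lo, start), max(hi, end)
    return lo, hi

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

# Positions loosely mimicking the IR above: the i806 loop spans [4, 6] and
# is real; the parallelized loops span [0, 10] and are all trivial.
loops = [(4, 6, False), (0, 10, True)]
t7 = outer_interval(5, 7, loops)   # written inside i806, last read at 7
t8 = outer_interval(7, 8, loops)
t9 = outer_interval(8, 9, loops)
assert overlaps(t7, t8) and overlaps(t8, t9) and not overlaps(t7, t9)

# Counting trivial loops inflates every interval to the blockIdx.x span,
# making all three buffers appear to conflict.
t7_old = outer_interval(5, 7, loops, ignore_trivial=False)
t9_old = outer_interval(8, 9, loops, ignore_trivial=False)
assert overlaps(t7_old, t9_old)
```

Under this model, T7 and T9 stop overlapping once trivial loops are ignored, which is exactly what makes them candidates for aliasing.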

naoyam (Collaborator) commented Aug 23, 2023

As I mentioned in the original PR, can you please make sure there's no invalid back() happening?

jacobhinkle (Collaborator, Author) replied:

> As I mentioned in the original PR, can you please make sure there's no invalid back() happening?

We should never get an invalid access from .back(), since we always push the top-level scope onto current_stack_. I went through the rest of the code and don't see any issues. I am currently running a codegen comparison to look for unintended consequences in the wild.
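The invariant being relied on can be sketched as follows (a hypothetical model, not the nvFuser code: the class and names are illustrative). If the top-level scope is seeded into the stack before traversal and is never popped, then back() can never be called on an empty stack.

```python
# Hypothetical sketch of the scope-stack invariant discussed above.

class ScopeStack:
    def __init__(self, top_level_scope):
        # Seed with the top-level scope so the stack is never empty.
        self._stack = [top_level_scope]

    def push(self, scope):
        self._stack.append(scope)

    def pop(self):
        # Only nested scopes may be popped; the top-level scope stays put.
        assert len(self._stack) > 1, "cannot pop the top-level scope"
        return self._stack.pop()

    def back(self):
        # Safe by construction: at least the top-level scope is present.
        return self._stack[-1]

stack = ScopeStack("kernel")
stack.push("for blockIdx.x")
assert stack.back() == "for blockIdx.x"
stack.pop()
assert stack.back() == "kernel"  # valid even with no nested scopes open
```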

@jacobhinkle jacobhinkle marked this pull request as ready for review August 25, 2023 14:19
@jacobhinkle jacobhinkle requested a review from naoyam August 25, 2023 14:20
naoyam (Collaborator) left a comment:


Feel free to merge once the check with the generated code is done

jacobhinkle (Collaborator, Author) commented Aug 28, 2023

Manually checked diffs in codegen. There is still quite a bit of non-determinism, which I believe is coming from allocateIndexVariables, so there is a lot of noise, but from what I can tell the only smem-related changes are expected and limited to a few examples:

  • FusionMatmulSoftmaxMatmulAmpere re-uses one buffer.
  • FusionHdiff re-orders a couple of allocations so they now match their actual last outer reads.
  • FusionPredicateParallelizedDomains: buffer T5 now re-uses both T1 and T2 instead of only T2; T1 and T2 are also swapped on the stack.

I think this is safe to merge.

@jacobhinkle jacobhinkle merged commit e0a22af into main Aug 28, 2023
@jacobhinkle jacobhinkle deleted the alias_pass_ignore_trivial_loops branch August 28, 2023 00:12
naoyam (Collaborator) commented Aug 28, 2023

> Manually checked diffs in codegen. [...] I think this is safe to merge.

Did you see diffs with the benchmarks or the tests, or both? Last time I checked I didn't see any significant diff with the benchmarks; I did see some minor diffs with a few benchmarks.

jacobhinkle (Collaborator, Author) replied:


> Did you see diffs with the benchmarks or the tests, or both? [...]

I see lots of diffs, but they are all unrelated. I didn't see any involving smem allocations on the benchmarks. As an example of what I'm calling unrelated:

(screenshot of an unrelated codegen diff omitted)

I think there may be non-determinism in either the allocation of index variables or in index hoisting.

naoyam (Collaborator) commented Aug 28, 2023

Hmm, could you please create an issue with a repro?
