
Strengthen index simplification for cast epilogue matmul#1827

Merged
jacobhinkle merged 28 commits into main from expr_simplify_lessthan
Mar 22, 2024
Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Feb 23, 2024

This came up while working on #1770. In a private conversation, @zasdfgbnm wisely pointed out that the problematic indexing is really a failure of expression simplification: if we could fully simplify the swizzling expression, it could be hoisted entirely, and we would be left with a nice clean linear index for the smem buffer in the epilogue loop.

This is NVFuserTest.FusionAmpereMatmulSmemEpilogueCast_CUDA on main:

    // main loop
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i123 = 0; i123 < 4; ++i123) {
    nvfuser_index_t i124;
    i124 = 32 * i123;
    nvfuser_index_t i125;
    i125 = i56 + (2048LL * i123);
    #pragma unroll
    for(nvfuser_index_t i126 = 0; i126 < 8; ++i126) {
      nvfuser_index_t i127;
      i127 = i124 + (4 * i126);
      nvfuser_index_t i128;
      i128 = i11 + i126;
      nvfuser_index_t i129;
      i129 = (i125 + (32LL * (i128 / 4))) + (8LL * (i57 ^ (i128 % 4)));
      #pragma unroll
      for(nvfuser_index_t i130 = 0; i130 < 2; ++i130) {
        loadGeneric<float, 2>( &T8[(i129 + (1024LL * i130))],  &T3[(i127 + (2LL * i130))]);
      }
    }
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i131 = 0; i131 < 16; ++i131) {
    nvfuser_index_t i132;
    i132 = i58 + (1024 * i131);
    Array<__half, 8, 8> T7;
    #pragma unroll
    for(nvfuser_index_t i133 = 0; i133 < 8; ++i133) {
      nvfuser_index_t i134;
      i134 = i59 + i133;
      nvfuser_index_t i135;
      i135 = i134 % 128;
      nvfuser_index_t i136;
      i136 = i135 / 8;
      nvfuser_index_t i137;
      i137 = i134 / 128;
      T7[i133]
         = __float2half(T8[((((i132 + (128LL * i137)) + (32LL * (i136 / 4))) + (i135 % 8)) + (8LL * ((i136 % 4) ^ ((i31 + i137) % 4))))]);
    }
    if ((b72 && (i73 < (-(8 * i131))))) {
      loadLocalToGlobal<__half, /*vec_size=*/8, /*is_volatile=*/false>( &T4[(i62 + (i63 * i131))], &T7[0]);
    }
  }
}

This PR:

    // main loop
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i114 = 0; i114 < 4; ++i114) {
    nvfuser_index_t i115;
    i115 = 32 * i114;
    nvfuser_index_t i116;
    i116 = i50 + (2048LL * i114);
    #pragma unroll
    for(nvfuser_index_t i117 = 0; i117 < 8; ++i117) {
      nvfuser_index_t i118;
      i118 = i115 + (4 * i117);
      nvfuser_index_t i119;
      i119 = i12 + i117;
      nvfuser_index_t i120;
      i120 = (i116 + (32LL * (i119 / 4))) + (8LL * (i51 ^ (i119 % 4)));
      #pragma unroll
      for(nvfuser_index_t i121 = 0; i121 < 2; ++i121) {
        loadGeneric<float, 2>( &T7[(i120 + (1024LL * i121))],  &T2[(i118 + (2LL * i121))]);
      }
    }
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i122 = 0; i122 < 16; ++i122) {
    nvfuser_index_t i123;
    i123 = i53 + (1024 * i122);
    Array<__half, 8, 8> T6;
    #pragma unroll
    for(nvfuser_index_t i124 = 0; i124 < 8; ++i124) {
      T6[i124]
         = __float2half(T7[(i123 + i124)]);
    }
    if ((b67 && (i68 < (-(8 * i122))))) {
      loadLocalToGlobal<__half, /*vec_size=*/8, /*is_volatile=*/false>( &T3[(i56 + (i57 * i122))], &T6[0]);
    }
  }
}

If we can also get i134 % 8 simplified to i134 and i134 / 8 simplified to 0, then this should give a nice, efficient last loop. This has since been done.

This PR was initially super slow (e.g. 101 s vs 8 s on main in debug mode) due to the added recursion; memoizing past results would be beneficial, but that's a topic for another PR. It is no longer slow, thanks to limited recursion depth and #1972.

Fixes #1828

Still need to add a test for the problem that's fixed by the slow
recursive approach.
@jacobhinkle
Collaborator Author

The identity x < y => x % y = x (for non-negative x) is not currently exploited in any of the simplification passes, as far as I can tell. I added a failing test to track that.

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle jacobhinkle changed the title from [WIP] More powerful prove::lessThan to [WIP] Strengthen prove::lessThan Feb 24, 2024
@jacobhinkle
Collaborator Author

jacobhinkle commented Feb 27, 2024

Changing the size in NVFuserTest.FusionAmpereMatmulSmemEpilogueCast_CUDA to 16384, 16384, 256 (i.e. a large memory-bound problem size), we go from 398 GB/s to 554 GB/s on an A100 80GB PCIe. That's a 39% speedup! It roughly matches our internal measurements showing that output bandwidth for fp16 outputs is about half that of fp32 outputs. It seems worth figuring out the proof slowdown here, since this optimization can have a large impact when output writes are a significant portion of runtime.

I also measured after manually performing the last optimization (i134 % 8 -> i134, i134 / 8 -> 0, hoisting) and saw no effect. It's possible these are already performed by the CUDA compiler.

@zasdfgbnm
Collaborator

That's a 39% speedup!

That's a lot! Thanks for measuring this!

I also measured after manually performing the last optimization (i134 % 8 -> i134, i134 / 8 -> 0, hoisting) and saw no effect. It's possible these are already performed by the CUDA compiler.

Even if there is no visible perf improvement, I still suggest going ahead and implementing it because:

  1. It should be easy to implement
  2. It improves code readability

@jacobhinkle
Collaborator Author

jacobhinkle commented Mar 1, 2024

Summary: As of now, this PR causes a 27% slowdown in total runtime for the test mentioned in the PR description compared to main. That is an improvement over my original method, which showed 2.2x runtime, thanks to a recently pushed change. In debug mode we currently see 3.8x runtime, down from 14x. These are compile-time differences; recent testing suggests kernel performance is improved by around 3% due to the simplification.

The latest pushed change switches from the unordered_set method to directly limiting the recursion depth to 2.

Release mode test timing:

recursion depth  time (sec)  Simplified?
      0              1.5         no
      1              1.5         no
      2              1.9         yes
      3              9.3         yes
      4            133           yes

Compare to:

 main                1.5         no
 unordered_set       3.3         yes
The Release compilation mode seems to make a big difference.

Debug build timings:

recursion depth  time (sec)  Simplified?
      0              7.4         no
      1              7.5         no
      2             28           yes
      3            422           yes
      4              ?           yes

 main                7.4         no
 unordered_set     104           yes

@jacobhinkle
Collaborator Author

I'm now trying to write a good test for this PR, then I will clean up and mark it ready.

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle jacobhinkle changed the title from [WIP] Strengthen prove::lessThan to [WIP] Strengthen index simplification for cast epilogue matmul Mar 12, 2024
@jacobhinkle jacobhinkle changed the title from [WIP] Strengthen index simplification for cast epilogue matmul to Strengthen index simplification for cast epilogue matmul Mar 12, 2024
@jacobhinkle
Collaborator Author

The recently pushed change strengthens eliminateTrivialComputation to simplify a % b and a / b when -|b| < a < |b|. This leads to a very nice kernel:

    // main loop
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i117 = 0; i117 < 4; ++i117) {
    nvfuser_index_t i118;
    i118 = 32 * i117;
    nvfuser_index_t i119;
    i119 = i53 + (2048LL * i117);
    #pragma unroll
    for(nvfuser_index_t i120 = 0; i120 < 8; ++i120) {
      nvfuser_index_t i121;
      i121 = i118 + (4 * i120);
      nvfuser_index_t i122;
      i122 = i30 + i120;
      nvfuser_index_t i123;
      i123 = (i119 + (32LL * (i122 / 4))) + (8LL * (i54 ^ (i122 % 4)));
      #pragma unroll
      for(nvfuser_index_t i124 = 0; i124 < 2; ++i124) {
        loadGeneric<float, 2>( &T8[(i123 + (1024LL * i124))],  &T3[(i121 + (2LL * i124))]);
      }
    }
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i125 = 0; i125 < 16; ++i125) {
    nvfuser_index_t i126;
    i126 = i55 + (1024 * i125);
    Array<__half, 8, 8> T7;
    #pragma unroll
    for(nvfuser_index_t i127 = 0; i127 < 8; ++i127) {
      T7[i127]
         = __float2half(T8[(i126 + i127)]);
    }
    if ((b58 && (i67 < (-(8 * i125))))) {
      loadLocalToGlobal<__half, /*vec_size=*/8, /*is_volatile=*/false>( &T4[(i57 + (i10 * i125))], &T7[0]);
    }
  }
}

However, it feels a bit slower. I'm going to evaluate the compile time with and without this change.

@jacobhinkle
Collaborator Author

it feels a bit slower.

On my machine this takes overall test time from 1.9 to 3.9 seconds in a release build and from 7.4 to 15 seconds in a debug build. I think it's best to leave the new optimization mentioned in the last comment for another PR so that I can experiment more with speeding it up. For now, I will revert it and make sure the timing makes sense; then hopefully we can merge.

@jacobhinkle
Collaborator Author

With the reverted commit, compile times are back down. If tests pass, I think this PR is ready.

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

!build --diff

I experimented and found that moving this to the beginning of the
function actually affects correctness due to the checks preceding this
loop. I can increase depth to work around that but then runtime
increases due to increased recursion. This change keeps functionality we
had previously instead.
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

After #1972 the compile time is down to 1.3 s. Even better, since it re-uses lots of proofs, re-enabling the reverted simplification doesn't appreciably change compile time!

@jacobhinkle jacobhinkle marked this pull request as ready for review March 21, 2024 16:16
@jacobhinkle
Collaborator Author

!build --diff-bench

@jacobhinkle jacobhinkle requested a review from zasdfgbnm March 21, 2024 16:53
A trivial modulus operation is simplified away now:

  float T2[1LL];
  T2[0LL]
-    = T1[(i15 % 2LL)];
+    = T1[i15];
Collaborator Author

😎

Comment on lines +1174 to +1178
// This doesn't simplify at all
// EXPECT_VALUE_TRUE(simplifyExpr("neg( 8 ) < neg( i0 )"_, {}, {"i0 < 8"_}));

// This doesn't simplify at all
// EXPECT_VALUE_TRUE(simplifyExpr("neg( i0 ) < 0"_, {}, {"0 < i0"_}));
Collaborator Author

@jacobhinkle jacobhinkle Mar 21, 2024

I think these commented out tests could be addressed by implementing a < b implies -b < -a. I don't think we need it urgently, but I plan to experiment with it.

@jacobhinkle jacobhinkle merged commit 5817da9 into main Mar 22, 2024
@jacobhinkle jacobhinkle deleted the expr_simplify_lessthan branch March 22, 2024 15:41
Successfully merging this pull request may close these issues:

Missing opportunities to remove trivial mod and div