Skip to content

Add naive cache for provers#1972

Merged
zasdfgbnm merged 3 commits into main from
naive-cache-provers
Mar 21, 2024
Merged

Add naive cache for provers#1972
zasdfgbnm merged 3 commits into main from
naive-cache-provers

Conversation

@zasdfgbnm
Copy link
Collaborator

Before:

[       OK ] GPUTTensorCoreTest.FusionAmpereMatmul_CUDA (4571 ms)
[       OK ] NVFuserTest.FusionMagicSchedulerBatchNormalization_CUDA (2089 ms)
[       OK ] GpuViewTest.FusionReshapeReductionShmoo (17323 ms)

After:

[       OK ] GPUTTensorCoreTest.FusionAmpereMatmul_CUDA (3151 ms)
[       OK ] NVFuserTest.FusionMagicSchedulerBatchNormalization_CUDA (2044 ms)
[       OK ] GpuViewTest.FusionReshapeReductionShmoo (16693 ms)

@zasdfgbnm zasdfgbnm requested a review from jacobhinkle March 20, 2024 21:27
Copy link
Collaborator

@jacobhinkle jacobhinkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! It's a much simpler alternative to #1974, which aims to reuse proofs across Contexts.

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

There is something wrong with Duo and I cannot start the CI. Let me wait for a few hours and retry.

@zasdfgbnm
Copy link
Collaborator Author

!build

6 similar comments
@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm zasdfgbnm merged commit 85c68d8 into main Mar 21, 2024
@zasdfgbnm zasdfgbnm deleted the naive-cache-provers branch March 21, 2024 06:59
jacobhinkle added a commit that referenced this pull request Mar 22, 2024
This came up when working on #1770. In a private conversation,
@zasdfgbnm noticed wisely that the problematic indexing is really a
failure of expression simplification; if we could fully simplify the
swizzling expression it could be entirely hoisted and we would be left
with a nice clean linear index for the smem buffer in the epilogue loop.

This is `NVFuserTest.FusionAmpereMatmulSmemEpilogueCast_CUDA` on `main`:
```c++
    // main loop
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i123 = 0; i123 < 4; ++i123) {
    nvfuser_index_t i124;
    i124 = 32 * i123;
    nvfuser_index_t i125;
    i125 = i56 + (2048LL * i123);
    #pragma unroll
    for(nvfuser_index_t i126 = 0; i126 < 8; ++i126) {
      nvfuser_index_t i127;
      i127 = i124 + (4 * i126);
      nvfuser_index_t i128;
      i128 = i11 + i126;
      nvfuser_index_t i129;
      i129 = (i125 + (32LL * (i128 / 4))) + (8LL * (i57 ^ (i128 % 4)));
      #pragma unroll
      for(nvfuser_index_t i130 = 0; i130 < 2; ++i130) {
        loadGeneric<float, 2>( &T8[(i129 + (1024LL * i130))],  &T3[(i127 + (2LL * i130))]);
      }
    }
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i131 = 0; i131 < 16; ++i131) {
    nvfuser_index_t i132;
    i132 = i58 + (1024 * i131);
    Array<__half, 8, 8> T7;
    #pragma unroll
    for(nvfuser_index_t i133 = 0; i133 < 8; ++i133) {
      nvfuser_index_t i134;
      i134 = i59 + i133;
      nvfuser_index_t i135;
      i135 = i134 % 128;
      nvfuser_index_t i136;
      i136 = i135 / 8;
      nvfuser_index_t i137;
      i137 = i134 / 128;
      T7[i133]
         = __float2half(T8[((((i132 + (128LL * i137)) + (32LL * (i136 / 4))) + (i135 % 8)) + (8LL * ((i136 % 4) ^ ((i31 + i137) % 4))))]);
    }
    if ((b72 && (i73 < (-(8 * i131))))) {
      loadLocalToGlobal<__half, /*vec_size=*/8, /*is_volatile=*/false>( &T4[(i62 + (i63 * i131))], &T7[0]);
    }
  }
}
```
This PR:
```c++
    // main loop
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i114 = 0; i114 < 4; ++i114) {
    nvfuser_index_t i115;
    i115 = 32 * i114;
    nvfuser_index_t i116;
    i116 = i50 + (2048LL * i114);
    #pragma unroll
    for(nvfuser_index_t i117 = 0; i117 < 8; ++i117) {
      nvfuser_index_t i118;
      i118 = i115 + (4 * i117);
      nvfuser_index_t i119;
      i119 = i12 + i117;
      nvfuser_index_t i120;
      i120 = (i116 + (32LL * (i119 / 4))) + (8LL * (i51 ^ (i119 % 4)));
      #pragma unroll
      for(nvfuser_index_t i121 = 0; i121 < 2; ++i121) {
        loadGeneric<float, 2>( &T7[(i120 + (1024LL * i121))],  &T2[(i118 + (2LL * i121))]);
      }
    }
  }
  __syncthreads();
  #pragma unroll
  for(nvfuser_index_t i122 = 0; i122 < 16; ++i122) {
    nvfuser_index_t i123;
    i123 = i53 + (1024 * i122);
    Array<__half, 8, 8> T6;
    #pragma unroll
    for(nvfuser_index_t i124 = 0; i124 < 8; ++i124) {
      T6[i124]
         = __float2half(T7[(i123 + i124)]);
    }
    if ((b67 && (i68 < (-(8 * i122))))) {
      loadLocalToGlobal<__half, /*vec_size=*/8, /*is_volatile=*/false>( &T3[(i56 + (i57 * i122))], &T6[0]);
    }
  }
}
```
~~If we can also get `i134 % 8` simplified to `i134` and `i134 / 8`
simplified to 0 then this should give a nice and efficient last loop.~~
This is done

~~Currently this PR is super slow (e.g. 101 s vs 8 s on main in debug
mode) due to the added recursion. Memoizing past results would be
beneficial, but that's a topic for another PR.~~ This PR is no longer
slow, thanks to limited recursion depth and #1972.

Fixes #1828
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants