Closed
Description
For the example in FusionAmpereSwizzle_CUDA, the generated code contains trivial predicates:
```CUDA
#pragma unroll
for(nvfuser_index_t i653 = 0; i653 < 4; ++i653) {
  int i10749;
  i10749 = 32 * i653;
  #pragma unroll
  for(nvfuser_index_t i654 = 0; i654 < 8; ++i654) {
    if (((nvfuser_index_t)blockIdx.x) < ((ceilDiv(T1.size[1], 128)) * 4)) {
      Ampere::M16N8K16TN<16>(
        reinterpret_cast<Array<float,4,4>*>(&T5[(i10749 + (2 * i654))]),
        &(reinterpret_cast<Array<__half,8,8>*>(&T2)[i653]),
        &(reinterpret_cast<Array<__half,4,4>*>(&T3)[i654]));
    }
  }
}
```
where `((nvfuser_index_t)blockIdx.x) < ((ceilDiv(T1.size[1], 128)) * 4)` is trivial because the rhs of `<` is identical to `gridDim.x`, and `blockIdx.x < gridDim.x` holds for every block. We should simplify this trivial predicate.
On an RTX 3090, the kernel takes 20.8374 ms with the trivial predicate and 16.1956 ms without it.
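To illustrate why the predicate is trivial, here is a minimal host-side C++ sketch (not nvFuser code). It assumes, as stated above, that the kernel is launched with `gridDim.x` equal to `ceilDiv(T1.size[1], 128) * 4`. The names `index_t`, `gridDimX`, `predicate`, and `predicateIsTrivial` are hypothetical and introduced only for this illustration; `ceilDiv` is modeled as plain integer ceiling division.

```cpp
#include <cstdint>

// Hypothetical stand-in for nvfuser_index_t (name and width are assumptions).
using index_t = std::int64_t;

// Integer ceiling division, modeling the ceilDiv call in the generated kernel.
constexpr index_t ceilDiv(index_t a, index_t b) {
  return (a + b - 1) / b;
}

// Assumed launch configuration: gridDim.x equals the predicate's right-hand side.
constexpr index_t gridDimX(index_t size1) {
  return ceilDiv(size1, 128) * 4;
}

// The predicate as it appears in the generated code, with blockIdx.x and
// T1.size[1] passed in as plain integers.
constexpr bool predicate(index_t blockIdxX, index_t size1) {
  return blockIdxX < ceilDiv(size1, 128) * 4;
}

// Checks the predicate for every block index the hardware can produce,
// i.e. 0 <= blockIdx.x < gridDim.x. Returns true if it never fails.
bool predicateIsTrivial(index_t size1) {
  const index_t grid = gridDimX(size1);
  for (index_t b = 0; b < grid; ++b) {
    if (!predicate(b, size1)) {
      return false;
    }
  }
  return true;
}
```

Under that launch assumption, `predicateIsTrivial` returns true for any positive `size1`, since the loop bound and the predicate's right-hand side are the same expression. The predicate can only fail for a block index at or beyond `gridDim.x`, which never occurs in a real launch.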
- [ ] https://github.com/NVIDIA/Fuser/pull/86
- [ ] https://github.com/NVIDIA/Fuser/pull/94
- [ ] https://github.com/NVIDIA/Fuser/pull/105
- [ ] https://github.com/NVIDIA/Fuser/pull/98
- [ ] https://github.com/NVIDIA/Fuser/pull/106