Trivial predicate is causing a 30% slowdown for matmul with grid swizzle

For the example in `FusionAmpereSwizzle_CUDA`, the generated code contains trivial predicates:
```C++
    #pragma unroll
    for(nvfuser_index_t i653 = 0; i653 < 4; ++i653) {
      int i10749;
      i10749 = 32 * i653;
      #pragma unroll
      for(nvfuser_index_t i654 = 0; i654 < 8; ++i654) {
        if (((nvfuser_index_t)blockIdx.x) < ((ceilDiv(T1.size[1], 128)) * 4)) {
          Ampere::M16N8K16TN<16>(
            reinterpret_cast<Array<float,4,4>*>(&T5[(i10749 + (2 * i654))]),
            &(reinterpret_cast<Array<__half,8,8>*>(&T2)[i653]),
            &(reinterpret_cast<Array<__half,4,4>*>(&T3)[i654]));
        }
      }
    }
```
where `((nvfuser_index_t)blockIdx.x) < ((ceilDiv(T1.size[1], 128)) * 4)` is trivial because the right-hand side of `<` is identical to `gridDim.x`, so the condition is true for every block in the launch. We should simplify this trivial predicate away.
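
To make the reasoning concrete, here is a minimal host-side sketch (plain C++, not nvFuser code) that checks the predicate for every block index in a launch. It assumes, as stated above, that `gridDim.x` equals `ceilDiv(T1.size[1], 128) * 4`. The helpers `gridDimX`, `predicate`, and `predicateIsTrivial` are hypothetical names introduced only for illustration, and `ceilDiv` is assumed to be the usual round-up division.

```cpp
#include <cstdint>

using nvfuser_index_t = int64_t;

// Round-up integer division (assumed to match the ceilDiv in the generated kernel).
constexpr nvfuser_index_t ceilDiv(nvfuser_index_t a, nvfuser_index_t b) {
  return (a + b - 1) / b;
}

// Hypothetical helper: the launch extent assumed for blockIdx.x,
// i.e. gridDim.x == ceilDiv(T1.size[1], 128) * 4.
constexpr nvfuser_index_t gridDimX(nvfuser_index_t t1_size1) {
  return ceilDiv(t1_size1, 128) * 4;
}

// The predicate from the generated kernel, with blockIdx.x passed explicitly.
constexpr bool predicate(nvfuser_index_t block_idx_x, nvfuser_index_t t1_size1) {
  return block_idx_x < ceilDiv(t1_size1, 128) * 4;
}

// Returns true if the predicate holds for every block index in [0, gridDim.x).
bool predicateIsTrivial(nvfuser_index_t t1_size1) {
  for (nvfuser_index_t b = 0; b < gridDimX(t1_size1); ++b) {
    if (!predicate(b, t1_size1)) {
      return false;
    }
  }
  return true;
}
```

Under that assumption, `predicateIsTrivial(n)` returns `true` for any positive `n`, since `blockIdx.x` only ranges over `[0, gridDim.x)`. This is why the `if` can be dropped from the inner loop.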

On an RTX 3090, the measured time is `20.8374 ms` with the trivial predicate vs `16.1956 ms` without it, i.e. about 29% slower with the predicate.
```[tasklist]
- [ ] https://github.com/NVIDIA/Fuser/pull/86
- [ ] https://github.com/NVIDIA/Fuser/pull/94
- [ ] https://github.com/NVIDIA/Fuser/pull/105
- [ ] https://github.com/NVIDIA/Fuser/pull/98
- [ ] https://github.com/NVIDIA/Fuser/pull/106
```
