Will revisit once the sync pass is done, when we have a TensorIndex.
Still missing allocation/indexing of the work buffer.
I need to replay leaf transforms, then get the index.
Codegen now looks like:
```c++
// Allocate global tensor T5
reduction::serialReductionStep(
    T3[0LL],
    T2[(i14 + i18)],
    0.000000000e+00f,
    T5[((((((((((((nvfuser_index_t)blockIdx.x) * 8LL) + ((nvfuser_index_t)blockIdx.y)) * 4LL) + i13) * 8LL) + (i18 + nvfuser_zero)) * 4LL) + ((nvfuser_index_t)threadIdx.y)) * 32LL) + ((nvfuser_index_t)threadIdx.x))],
    [](float &a, float b) { a = a + b; },
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
    true,
    true);
```
This looks OK, although it will get a little better with hoisting. This
compiles, but I get an error in `runFusion`:
```
C++ exception with description "Expected T5_g[ iblockIdx.x59{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(262144, 32) ), 4) ), 8) ), 4) ), 8) )}, iblockIdx.y60{8}, ithreadIdx.y54{4}, ithreadIdx.x52{32}, iS58{4}, iS56{8}, rblockIdx.z49{5} ] to be bound to a tensor of rank 1, but got a tensor of rank 6
Exception raised from validateValWithConcreteValue at /opt/pytorch/nvfuser/csrc/expr_evaluator.cpp:38 (most recent call first):
```
This is happening when binding inputs, I believe.
Fixes the execution error. Test passes!
The generated kernel now looks like:
```c++
// Allocate global tensor T4
grid_sync::blockSerializeWait<false, false, true>(
    &T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
#pragma unroll
for(nvfuser_index_t i13 = 0; i13 < 4LL; ++i13) {
  nvfuser_index_t i14;
  i14 = 8LL * i13;
  nvfuser_index_t i15;
  i15 = 2048LL * i13;
  nvfuser_index_t i16;
  i16 = i4 + i15;
  nvfuser_index_t i17;
  i17 = -i15;
  #pragma unroll
  for(nvfuser_index_t i18 = 0; i18 < 8LL; ++i18) {
    nvfuser_index_t i19;
    i19 = 256LL * (i18 + nvfuser_zero);
    nvfuser_index_t i20;
    i20 = i16 + i19;
    float T3[1LL];
    T3[0LL] = 0.000000000e+00f;
    // Allocate global tensor T5
    reduction::serialReductionStep(
        T3[0LL],
        T2[(i14 + i18)],
        0.000000000e+00f,
        T5[i20],
        [](float &a, float b) { a = a + b; },
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
        true,
        true);
    if ((b6 && (i5 < (i17 - i19)))) {
      T1[i20]
          = T3[0LL];
    }
  }
}
NVFUSER_UPDATE_MAGIC_ZERO;
grid_sync::blockSerializeRelease<false, false, true>(
    &T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
```
Note that the index `i20` matches the output `T1`. This is what we need
in order to reclaim `T1` in a later PR; the challenge in that work will be to
exact-map `T5` and `T3` so that `T1` and `T5` end up exact mapped...
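For reference, here is a rough host-side model of the serial chaining that `serialReductionStep` appears to implement, inferred purely from the call signature in the kernel above (the real device implementation in `runtime/grid_reduction.cu` may differ, and the predicate handling here is an assumption):

```cpp
#include <cassert>

// Hypothetical model of reduction::serialReductionStep, matching the argument
// order seen in the generated kernel: (out, in, init, work, op, first_step,
// last_step, read_pred, write_pred). Blocks in the serialized grid dimension
// take turns: each combines its partial value with the running value in the
// global work buffer; the last block keeps the full result in registers.
template <typename T, typename Func>
void serialReductionStepModel(
    T& out,          // register destination (T3[0LL] above)
    T in,            // this block's partial value (T2[...] above)
    T init,          // reduction identity
    T& work,         // global work buffer slot (T5[i20] above)
    Func reduction_op,
    bool first_step, // true for the first block in the serial chain
    bool last_step,  // true for the last block in the serial chain
    bool read_pred,
    bool write_pred) {
  if (!read_pred) {
    return;
  }
  // Start from the identity on the first step, otherwise from the running
  // value left in the work buffer by the previous block.
  T acc = first_step ? init : work;
  reduction_op(acc, in);
  if (last_step) {
    out = acc; // final block keeps the fully reduced value in registers
  } else if (write_pred) {
    work = acc; // intermediate blocks pass the running value along
  }
}
```

The `blockSerializeWait`/`blockSerializeRelease` pair around the loop is what makes it safe for each block to read the value the previous block wrote.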
Also sort expected output by line to give clearer error messages.
These were disabled in #1545 because of slow compilation with gridReduce.
Now that we have the scheduling test, we don't need this. The matmul scheduler exercises vectorization.
csrc/device_lower/pass/index.cpp (outdated):
```diff
 auto work_buffer_domain = IrBuilder::create<TensorDomain>(work_buffer_root);
 auto work_buffer_tv = IrBuilder::create<TensorView>(
-    work_buffer_domain, out_tv->dtype(), MemoryType::Global);
+    work_buffer_domain, DataType::Half, MemoryType::Global);
```
Placeholder. This will be removed once we update the interface for requesting serial grid reduction to also specify the precision.
runtime/grid_reduction.cu (outdated):
```c++
float* out,
float* in,
float init,
volatile Twork* work,
```
This template is redundant, since Twork = float might match it if we're not careful. Instead it might be best to just check is_same for the types of out and work and dispatch from there to separately-named helper functions.
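The suggested dispatch could look roughly like this (a host-side C++ sketch with illustrative names, not nvFuser's actual helpers; `double` stands in for the half types so it compiles on the host):

```cpp
#include <type_traits>

// Helper for the case where the work buffer already holds the compute type:
// no casts are needed.
template <typename T>
T reduceStepSameType(T a, T b) {
  return a + b;
}

// Helper for the reduced-precision work buffer: cast the work-buffer element
// up to the compute type before combining.
template <typename Tcompute, typename Twork>
Tcompute reduceStepCasting(Tcompute a, Twork b) {
  return a + static_cast<Tcompute>(b);
}

// Single entry point: decide once, at compile time, whether Twork matches the
// compute type, then call a separately-named helper. This way the casting
// template can never accidentally match the Twork == float case.
template <typename Tcompute, typename Twork>
Tcompute reduceStep(Tcompute a, Twork b) {
  if constexpr (std::is_same_v<Tcompute, Twork>) {
    return reduceStepSameType(a, b);
  } else {
    return reduceStepCasting<Tcompute, Twork>(a, b);
  }
}
```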
runtime/grid_reduction.cu (outdated):
```c++
} else if constexpr (std::is_same<Twork, __bfloat>::value) {
  work_float = __bfloat2float(work_reg[i]);
} else {
  // static_assert(false);
```
It shouldn't be needed since I assert at the start, but I don't know why this caused compilation to fail unless, as mentioned above, this template is also matching the Twork = float case...
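For what it's worth, a bare `static_assert(false)` failing to compile here is expected behavior before C++23: since its condition does not depend on a template parameter, it is ill-formed even inside a discarded `if constexpr` branch, regardless of which case is matched. The standard workaround is a dependent false, sketched below (illustrative names; `double` stands in for the half types):

```cpp
#include <type_traits>

// A value that is always false but formally depends on T, so the compiler can
// only evaluate the static_assert when this branch is actually instantiated.
template <typename T>
inline constexpr bool always_false_v = false;

template <typename T>
float toFloat(T x) {
  if constexpr (std::is_same_v<T, float>) {
    return x;
  } else if constexpr (std::is_same_v<T, double>) {
    return static_cast<float>(x);
  } else {
    // Fires only for unsupported types, not for every instantiation.
    static_assert(always_false_v<T>, "unsupported type for toFloat");
  }
}
```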
Update test to exercise both paths, with varying tolerance
This adds a templated castFloating() helper function with specializations.
Undo tolerance change
```c++
template <typename TO, typename FROM>
__device__ __inline__ TO castFloating(FROM x) {
```
TODO: we could support vectorized casts here by adding a `vec_size` template argument, then specializing for the usual set of vectorization widths and using `__half22float2` and friends.
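That TODO could be sketched as below (hypothetical names, shown element-wise in host C++ with `double` standing in for the half types; on device, a width-2 half specialization would use `__half22float2` instead of the scalar loop):

```cpp
#include <array>
#include <cstddef>

// Scalar cast between floating-point types; specializations for __half and
// __bfloat would use __half2float / __bfloat2float on device.
template <typename TO, typename FROM>
TO castFloating(FROM x) {
  return static_cast<TO>(x);
}

// Vectorized variant: casts vec_size contiguous elements at once. The generic
// version just loops over the scalar cast; width-specific specializations
// (e.g. vec_size == 2 for half) could use the paired conversion intrinsics.
template <typename TO, typename FROM, std::size_t vec_size>
std::array<TO, vec_size> castFloatingVec(const std::array<FROM, vec_size>& x) {
  std::array<TO, vec_size> out{};
  for (std::size_t i = 0; i < vec_size; ++i) {
    out[i] = castFloating<TO>(x[i]);
  }
  return out;
}
```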
A note about vectorization with half reduction: we currently schedule vectorized reduction by vectorizing the
This disables reduction in fp16 or bf16, which is enabled by default in PyTorch. There are two reasons to disable this for our benchmarks:
1. nvFuser does not support split-K in reduced precision (see #1719). Since half-precision reduction is much faster than single precision, this means eager mode will be faster but less precise than nvFuser by default. For a fair comparison, both can use single precision.
2. The accuracy of matmuls is degraded for split-K problems (small M and N, large K) by default in PyTorch. This can lead to validation errors where nvFuser actually performs an accurate computation but our baseline is inaccurate.
This changes the python matmul benchmark to run four times as many tests:
- We parametrize by reduction in float or in fp16/bf16, which is enabled by default in PyTorch.
- We parametrize by `eager`. If this is true we directly compute `torch.matmul` without involving nvFuser; otherwise we use nvFuser. This lets us compute baselines in the same run as the nvFuser result, instead of needing to re-run the benchmark with different environment variables as we previously had to.

nvFuser does not support split-K in reduced precision (see #1719), so we skip these cases for now.
This change implements the possibility of using a reduced-precision work buffer for the split-K grid reduction. Note that this does not mean the accumulator precision is reduced: register buffers are still float. However, for split-K we might have, say, 5 segments reduced in single precision that need to be grid-reduced. That grid reduction requires global writes and reads, and this change lets us reduce the precision just for that IO. Note that reduced-precision split-K reduction is the default behavior of cuBLAS and PyTorch/ATen.