
Half-precision reduction for split-K#1719

Draft
jacobhinkle wants to merge 50 commits into main from splitk_half_reduction

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Feb 3, 2024

This change makes it possible to use a reduced-precision work buffer for the split-K grid reduction. Note that this does not mean the accumulator precision is reduced: register buffers are still Float. However, for split-K we might have, say, 5 segments reduced in single precision that then need to be grid-reduced. That grid reduction requires global writes and reads, and this change lets us reduce the precision just for that IO.
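The scheme above can be modeled on the host with a small sketch (all names here are hypothetical, not nvFuser APIs): each step accumulates in full float precision, but the running total round-trips through a reduced-precision work buffer between steps, mimicking the global-memory IO. A bfloat16 truncation stands in for the device-side half/bfloat conversions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 and back (keep the top 16 bits).
// Stand-in for the device-side __half/__bfloat conversions.
float bf16_round_trip(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// One serial reduction step: read the running total from the
// reduced-precision work buffer, add this segment's full-precision
// partial sum in float, and write the total back in reduced precision.
float serial_reduction_step(float partial, float& work, bool is_first) {
  float total = is_first ? partial : bf16_round_trip(work) + partial;
  work = bf16_round_trip(total);  // only this global IO loses precision
  return total;
}
```

The point of the sketch is that the add itself happens in float; only the value that crosses global memory between split-K segments is stored in reduced precision.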

Note that reduced precision split-K reduction is the default behavior of cuBLAS and PyTorch/ATen.

- Will revisit once the sync pass is done, when we have a TensorIndex.
- Still missing allocation/indexing of the work buffer.
- I need to replay leaf transforms, then get the index.

Codegen is now like:
```c++
  // Allocate global tensor T5
  reduction::serialReductionStep(
    T3[0LL],
    T2[(i14 + i18)],
    0.000000000e+00f,
    T5[((((((((((((nvfuser_index_t)blockIdx.x) * 8LL) + ((nvfuser_index_t)blockIdx.y)) * 4LL) + i13) * 8LL) + (i18 + nvfuser_zero)) * 4LL) + ((nvfuser_index_t)threadIdx.y)) * 32LL) + ((nvfuser_index_t)threadIdx.x))],
    [](float &a, float b) { a = a + b; },
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
    true,
    true);
```
This looks OK, although it will get a little better with hoisting. This
compiles, but I get an error in `runFusion`:
```
C++ exception with description "Expected T5_g[ iblockIdx.x59{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(262144, 32) ), 4) ), 8) ), 4) ), 8) )}, iblockIdx.y60{8}, ithreadIdx.y54{4}, ithreadIdx.x52{32}, iS58{4}, iS56{8}, rblockIdx.z49{5} ] to be bound to a tensor of rank 1, but got a tensor of rank 6
Exception raised from validateValWithConcreteValue at /opt/pytorch/nvfuser/csrc/expr_evaluator.cpp:38 (most recent call first):
```
This is happening when binding inputs, I believe.

Fixes the execution error. Test passes! The generated kernel now looks like:
```c++
  // Allocate global tensor T4
  grid_sync::blockSerializeWait<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  #pragma unroll
  for(nvfuser_index_t i13 = 0; i13 < 4LL; ++i13) {
    nvfuser_index_t i14;
    i14 = 8LL * i13;
    nvfuser_index_t i15;
    i15 = 2048LL * i13;
    nvfuser_index_t i16;
    i16 = i4 + i15;
    nvfuser_index_t i17;
    i17 = -i15;
    #pragma unroll
    for(nvfuser_index_t i18 = 0; i18 < 8LL; ++i18) {
      nvfuser_index_t i19;
      i19 = 256LL * (i18 + nvfuser_zero);
      nvfuser_index_t i20;
      i20 = i16 + i19;
      float T3[1LL];
      T3[0LL] = 0.000000000e+00f;
      // Allocate global tensor T5
      reduction::serialReductionStep(
        T3[0LL],
        T2[(i14 + i18)],
        0.000000000e+00f,
        T5[i20],
        [](float &a, float b) { a = a + b; },
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
        true,
        true);
      if ((b6 && (i5 < (i17 - i19)))) {
        T1[i20]
           = T3[0LL];
      }
    }
  }
  NVFUSER_UPDATE_MAGIC_ZERO;
  grid_sync::blockSerializeRelease<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
```
Note that the index `i20` matches the output `T1`. This is what we need
in order to reclaim `T1` in a later PR; it will still be a challenge in
that work to exact-map `T5` and `T3` so that `T1` and `T5` end up exact
mapped...
Also sort expected output by line to give clearer error messages.
These were disabled in #1545 because of slow compilation with gridReduce
@jacobhinkle jacobhinkle changed the base branch from main to vectorized_serial_reduction February 3, 2024 16:50
```diff
 auto work_buffer_domain = IrBuilder::create<TensorDomain>(work_buffer_root);
 auto work_buffer_tv = IrBuilder::create<TensorView>(
-    work_buffer_domain, out_tv->dtype(), MemoryType::Global);
+    work_buffer_domain, DataType::Half, MemoryType::Global);
```
Collaborator Author

Placeholder. This will be removed once we update the interface for requesting serial grid reduction to also specify the precision.

```c++
    float* out,
    float* in,
    float init,
    volatile Twork* work,
```
Collaborator Author


This template is redundant, since Twork = float might match it if we're not careful. Instead, it might be best to just check `is_same` for the types of `out` and `work` and dispatch from there to separately-named helper functions.
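A minimal host-side sketch of the dispatch suggested here (the helper names are hypothetical, and `short` stands in for a reduced-precision type like `__half`): check the work-buffer type explicitly with `if constexpr`, rather than relying on overload resolution that Twork = float could also match.

```cpp
#include <cassert>
#include <type_traits>

// Records which path was chosen, for illustration only.
static bool g_took_reduced_path = false;

template <typename Twork>
void stepFullPrecision(float& out, float in, Twork* work) {
  g_took_reduced_path = false;
  out = in + static_cast<float>(*work);
}

template <typename Twork>
void stepReducedPrecision(float& out, float in, Twork* work) {
  g_took_reduced_path = true;
  // a real implementation would convert via __half2float here
  out = in + static_cast<float>(*work);
}

// Dispatch on whether the work buffer type matches the accumulator
// type, forwarding to separately-named helpers.
template <typename Twork>
void serialReductionStep(float& out, float in, Twork* work) {
  if constexpr (std::is_same_v<Twork, float>) {
    stepFullPrecision(out, in, work);
  } else {
    stepReducedPrecision(out, in, work);
  }
}
```

With this shape there is a single entry point, and the float case can never silently fall into the reduced-precision path.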

```c++
  } else if constexpr (std::is_same<Twork, __bfloat>::value) {
    work_float = __bfloat2float(work_reg[i]);
  } else {
    // static_assert(false);
```
Collaborator Author


This shouldn't be needed since I assert at the start, but I don't know why it caused compilation to fail unless, as mentioned above, this template is also matching the Twork = float case...
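For what it's worth, a bare `static_assert(false)` fails even in a discarded `if constexpr` branch, because the condition does not depend on the template parameter, so the compiler may reject it at template definition time regardless of which types are instantiated. The usual workaround is a dependent false. A sketch (names hypothetical, `short` standing in for `__half`/`__bfloat`):

```cpp
#include <cassert>
#include <type_traits>

// Dependent-false helper: only evaluated when the branch is actually
// instantiated, so the assert fires per-type instead of unconditionally.
template <typename T>
inline constexpr bool always_false_v = false;

template <typename Twork>
float loadWork(const Twork& w) {
  if constexpr (std::is_same_v<Twork, float>) {
    return w;
  } else if constexpr (std::is_same_v<Twork, short>) {
    // stand-in for the __half2float / __bfloat2float conversions
    return static_cast<float>(w);
  } else {
    static_assert(always_false_v<Twork>, "unsupported work buffer type");
  }
}
```

That would explain a compile failure here even when no unsupported type is used.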

Update test to exercise both paths, with varying tolerance
Base automatically changed from vectorized_serial_reduction to main February 6, 2024 13:13
Comment on lines +12 to +13

```c++
template <typename TO, typename FROM>
__device__ __inline__ TO castFloating(FROM x) {
```
Collaborator Author


TODO: we could support vectorized casts here by adding a `vec_size` template argument, then specializing to the usual set of vectorization widths and using `__half22float2` and friends.
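The shape of that extension might look like the following host-side sketch (hypothetical, not the PR's actual code): a `vec_size` parameter on a struct that can then be specialized per width, where the device version would swap the scalar loop for `__half22float2`-style intrinsics. Here `double`/`float` stand in for `float`/`__half`.

```cpp
#include <cassert>

// Generic fallback: element-wise cast of a vec_size-wide chunk.
// Device specializations (e.g. vec_size == 2 for __half -> float via
// __half22float2) would override this per width.
template <typename TO, typename FROM, int vec_size>
struct CastFloatingVec {
  static void apply(TO* to, const FROM* from) {
    for (int i = 0; i < vec_size; ++i) {
      to[i] = static_cast<TO>(from[i]);
    }
  }
};
```

Usage would then pick the width at the call site, e.g. `CastFloatingVec<float, __half, 8>::apply(dst, src)` on the device.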

@jacobhinkle
Collaborator Author

jacobhinkle commented Feb 12, 2024

A note about vectorization with half reduction:

We currently schedule vectorized reduction by vectorizing the ReductionOp output (see #1528). However, with this PR we might have a single-precision output and a half-precision reduction buffer. This means that while we would ideally use a vectorization width of 8 for the half-precision reduction, we will be bound by the single-precision output type. If we try to vectorize at width 8, we will hit an error in lowering, since vectorized TVs are validated in VectorizeValidator. We could either special-case this error, or introduce some other way to indicate vectorization of the temporary buffer. Currently we don't schedule that temporary buffer: we allocate it according to the leaf domain of the (single-precision) output at lowering (index.cpp). A more flexible approach might be to create a global TensorView at scheduling time and attach it as an attribute to the ReductionOp. That tensor's leaf domain would equal its root/allocation domain and could hold vectorization, grouping, dtype, etc.
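The width mismatch follows directly from the 16-byte (128-bit) maximum for vectorized global accesses on the GPU; a tiny hypothetical helper makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cstddef>

// Maximum bytes per vectorized global memory access on current GPUs.
constexpr std::size_t kMaxVectorBytes = 16;

// Maximal vectorization width for a given element size:
// 16 / 4 = 4 for float, 16 / 2 = 8 for half/bfloat16.
constexpr std::size_t maxVecWidth(std::size_t elem_size) {
  return kMaxVectorBytes / elem_size;
}
```

So the half-precision work buffer could use width 8, but the float output caps the schedule at width 4 as long as both share one vectorized domain.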

@jacobhinkle jacobhinkle changed the title [WIP] Half-precision reduction for split-K Half-precision reduction for split-K Feb 12, 2024
jacobhinkle added a commit that referenced this pull request Oct 17, 2024
This disables reduction in fp16 or bf16, which is enabled by default in
PyTorch. There are two reasons to disable this for our benchmarks:
1. nvFuser does not support split-K in reduced precision (see #1719).
   Since half precision reduction is much faster than single precision,
   this means eager mode will be faster but less precise than
   nvFuser by default. For fair comparison, we can both use single
   precision.
2. The accuracy of matmuls is degraded for split-K problems (small M&N,
   large K) by default in PyTorch. This can lead to validation errors
   where nvFuser actually performs an accurate computation but our
   baseline is inaccurate.
jacobhinkle added a commit that referenced this pull request Oct 25, 2024

This changes the python matmul benchmark to run four times as many
tests:
- We parametrize by reduction in float or in fp16/bf16, which is enabled
by default in PyTorch.
- We parametrize by `eager`. If this is true we directly compute
`torch.matmul` without involving nvFuser. Otherwise we use nvFuser. This
lets us compute baselines in the same run as we compute the nvFuser
result instead of needing to re-run the benchmark with different
environment variables as we previously had to.

nvFuser does not support split-K in reduced precision (see #1719), so we
skip these cases for now.