
Half-precision reduction for split-K#1719

Draft
jacobhinkle wants to merge 50 commits into main from splitk_half_reduction

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Feb 3, 2024

This change makes it possible to use a reduced-precision work buffer for the split-K grid reduction. Note that this does not mean the accumulator precision is reduced: register buffers are still Float. However, for split-K we might have, say, 5 segments reduced in single precision that then need to be grid-reduced. That grid reduction requires global writes and reads, and this change lets us reduce the precision just for that IO.
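The scheme above can be modeled on the host with a small sketch (all names here are hypothetical, not nvFuser APIs): each step accumulates in full float precision, but the running total round-trips through a reduced-precision work buffer between steps, mimicking the global-memory IO. A bfloat16 truncation stands in for the device-side half/bfloat conversions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 and back (keep the top 16 bits).
// Stand-in for the device-side __half/__bfloat conversions.
float bf16_round_trip(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// One serial reduction step: read the running total from the
// reduced-precision work buffer, add this segment's full-precision
// partial sum in float, and write the total back in reduced precision.
float serial_reduction_step(float partial, float& work, bool is_first) {
  float total = is_first ? partial : bf16_round_trip(work) + partial;
  work = bf16_round_trip(total);  // only this global IO loses precision
  return total;
}
```

The point of the sketch is that the add itself happens in float; only the value that crosses global memory between split-K segments is stored in reduced precision.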

Note that reduced precision split-K reduction is the default behavior of cuBLAS and PyTorch/ATen.

- Will revisit once the sync pass is done, when we have a TensorIndex.
- Still missing allocation/indexing of the work buffer.
- I need to replay leaf transforms, then get the index.

Codegen is now like:
```c++
  // Allocate global tensor T5
  reduction::serialReductionStep(
    T3[0LL],
    T2[(i14 + i18)],
    0.000000000e+00f,
    T5[((((((((((((nvfuser_index_t)blockIdx.x) * 8LL) + ((nvfuser_index_t)blockIdx.y)) * 4LL) + i13) * 8LL) + (i18 + nvfuser_zero)) * 4LL) + ((nvfuser_index_t)threadIdx.y)) * 32LL) + ((nvfuser_index_t)threadIdx.x))],
    [](float &a, float b) { a = a + b; },
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
    true,
    true);
```
This looks OK, although it will get a little better with hoisting. This
compiles, but I get an error in `runFusion`:
```
C++ exception with description "Expected T5_g[ iblockIdx.x59{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(262144, 32) ), 4) ), 8) ), 4) ), 8) )}, iblockIdx.y60{8}, ithreadIdx.y54{4}, ithreadIdx.x52{32}, iS58{4}, iS56{8}, rblockIdx.z49{5} ] to be bound to a tensor of rank 1, but got a tensor of rank 6
Exception raised from validateValWithConcreteValue at /opt/pytorch/nvfuser/csrc/expr_evaluator.cpp:38 (most recent call first):
```
This is happening when binding inputs, I believe.

Fixes the execution error. Test passes! The generated kernel now looks like:
```c++
  // Allocate global tensor T4
  grid_sync::blockSerializeWait<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  #pragma unroll
  for(nvfuser_index_t i13 = 0; i13 < 4LL; ++i13) {
    nvfuser_index_t i14;
    i14 = 8LL * i13;
    nvfuser_index_t i15;
    i15 = 2048LL * i13;
    nvfuser_index_t i16;
    i16 = i4 + i15;
    nvfuser_index_t i17;
    i17 = -i15;
    #pragma unroll
    for(nvfuser_index_t i18 = 0; i18 < 8LL; ++i18) {
      nvfuser_index_t i19;
      i19 = 256LL * (i18 + nvfuser_zero);
      nvfuser_index_t i20;
      i20 = i16 + i19;
      float T3[1LL];
      T3[0LL] = 0.000000000e+00f;
      // Allocate global tensor T5
      reduction::serialReductionStep(
        T3[0LL],
        T2[(i14 + i18)],
        0.000000000e+00f,
        T5[i20],
        [](float &a, float b) { a = a + b; },
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
        true,
        true);
      if ((b6 && (i5 < (i17 - i19)))) {
        T1[i20]
           = T3[0LL];
      }
    }
  }
  NVFUSER_UPDATE_MAGIC_ZERO;
  grid_sync::blockSerializeRelease<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
```
Note that the index `i20` matches the output `T1`. This is what we need
in order to reclaim `T1` in a later PR; it will still be a challenge in
that work to exact-map `T5` and `T3` so that `T1` and `T5` end up exact
mapped...
Also sort expected output by line to give clearer error messages.
These were disabled in #1545 because of slow compilation with gridReduce
@jacobhinkle jacobhinkle changed the base branch from main to vectorized_serial_reduction February 3, 2024 16:50
```diff
 auto work_buffer_domain = IrBuilder::create<TensorDomain>(work_buffer_root);
 auto work_buffer_tv = IrBuilder::create<TensorView>(
-    work_buffer_domain, out_tv->dtype(), MemoryType::Global);
+    work_buffer_domain, DataType::Half, MemoryType::Global);
```
Collaborator Author

Placeholder. This will be removed once we update the interface for requesting serial grid reduction to also specify the precision.

```c++
    float* out,
    float* in,
    float init,
    volatile Twork* work,
```
Collaborator Author


This template is redundant, since Twork = float might match it if we're not careful. Instead, it might be best to just check `is_same` for the types of `out` and `work` and dispatch from there to separately-named helper functions.
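A minimal host-side sketch of the dispatch suggested here (the helper names are hypothetical, and `short` stands in for a reduced-precision type like `__half`): check the work-buffer type explicitly with `if constexpr`, rather than relying on overload resolution that Twork = float could also match.

```cpp
#include <cassert>
#include <type_traits>

// Records which path was chosen, for illustration only.
static bool g_took_reduced_path = false;

template <typename Twork>
void stepFullPrecision(float& out, float in, Twork* work) {
  g_took_reduced_path = false;
  out = in + static_cast<float>(*work);
}

template <typename Twork>
void stepReducedPrecision(float& out, float in, Twork* work) {
  g_took_reduced_path = true;
  // a real implementation would convert via __half2float here
  out = in + static_cast<float>(*work);
}

// Dispatch on whether the work buffer type matches the accumulator
// type, forwarding to separately-named helpers.
template <typename Twork>
void serialReductionStep(float& out, float in, Twork* work) {
  if constexpr (std::is_same_v<Twork, float>) {
    stepFullPrecision(out, in, work);
  } else {
    stepReducedPrecision(out, in, work);
  }
}
```

With this shape there is a single entry point, and the float case can never silently fall into the reduced-precision path.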

```c++
  } else if constexpr (std::is_same<Twork, __bfloat>::value) {
    work_float = __bfloat2float(work_reg[i]);
  } else {
    // static_assert(false);
```
Collaborator Author


This shouldn't be needed since I assert at the start, but I don't know why it caused compilation to fail unless, as mentioned above, this template is also matching the Twork = float case...
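For what it's worth, a bare `static_assert(false)` fails even in a discarded `if constexpr` branch, because the condition does not depend on the template parameter, so the compiler may reject it at template definition time regardless of which types are instantiated. The usual workaround is a dependent false. A sketch (names hypothetical, `short` standing in for `__half`/`__bfloat`):

```cpp
#include <cassert>
#include <type_traits>

// Dependent-false helper: only evaluated when the branch is actually
// instantiated, so the assert fires per-type instead of unconditionally.
template <typename T>
inline constexpr bool always_false_v = false;

template <typename Twork>
float loadWork(const Twork& w) {
  if constexpr (std::is_same_v<Twork, float>) {
    return w;
  } else if constexpr (std::is_same_v<Twork, short>) {
    // stand-in for the __half2float / __bfloat2float conversions
    return static_cast<float>(w);
  } else {
    static_assert(always_false_v<Twork>, "unsupported work buffer type");
  }
}
```

That would explain a compile failure here even when no unsupported type is used.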

Update test to exercise both paths, with varying tolerance
Base automatically changed from vectorized_serial_reduction to main February 6, 2024 13:13
Comment on lines +12 to +13

```c++
template <typename TO, typename FROM>
__device__ __inline__ TO castFloating(FROM x) {
```
Collaborator Author


TODO: we could support vectorized casts here by adding a `vec_size` template argument, then specializing to the usual set of vectorization widths and using `__half22float2` and friends.
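The shape of that extension might look like the following host-side sketch (hypothetical, not the PR's actual code): a `vec_size` parameter on a struct that can then be specialized per width, where the device version would swap the scalar loop for `__half22float2`-style intrinsics. Here `double`/`float` stand in for `float`/`__half`.

```cpp
#include <cassert>

// Generic fallback: element-wise cast of a vec_size-wide chunk.
// Device specializations (e.g. vec_size == 2 for __half -> float via
// __half22float2) would override this per width.
template <typename TO, typename FROM, int vec_size>
struct CastFloatingVec {
  static void apply(TO* to, const FROM* from) {
    for (int i = 0; i < vec_size; ++i) {
      to[i] = static_cast<TO>(from[i]);
    }
  }
};
```

Usage would then pick the width at the call site, e.g. `CastFloatingVec<float, __half, 8>::apply(dst, src)` on the device.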

@jacobhinkle
Collaborator Author

jacobhinkle commented Feb 12, 2024

A note about vectorization with half reduction:

We currently schedule vectorized reduction by vectorizing the ReductionOp output (see #1528). However, with this PR we might have a single-precision output and a half-precision reduction buffer. This means that while we would ideally use a vectorization width of 8 for the half-precision reduction, we will be bound by the single-precision output type. If we try to vectorize at width 8, we will hit an error in lowering, since vectorized TVs are validated in VectorizeValidator. We could either special-case this error, or introduce some other way to indicate vectorization of the temporary buffer. Currently we don't schedule that temporary buffer: we allocate it according to the leaf domain of the (single-precision) output at lowering (index.cpp). A more flexible approach might be to create a global TensorView at scheduling time and attach it as an attribute to the ReductionOp. That tensor's leaf domain would equal its root/allocation domain and could hold vectorization, grouping, dtype, etc.
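The width mismatch follows directly from the 16-byte (128-bit) maximum for vectorized global accesses on the GPU; a tiny hypothetical helper makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cstddef>

// Maximum bytes per vectorized global memory access on current GPUs.
constexpr std::size_t kMaxVectorBytes = 16;

// Maximal vectorization width for a given element size:
// 16 / 4 = 4 for float, 16 / 2 = 8 for half/bfloat16.
constexpr std::size_t maxVecWidth(std::size_t elem_size) {
  return kMaxVectorBytes / elem_size;
}
```

So the half-precision work buffer could use width 8, but the float output caps the schedule at width 4 as long as both share one vectorized domain.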

@jacobhinkle jacobhinkle changed the title [WIP] Half-precision reduction for split-K Half-precision reduction for split-K Feb 12, 2024
jacobhinkle added a commit that referenced this pull request Oct 17, 2024
This disables reduction in fp16 or bf16, which is enabled by default in
PyTorch. There are two reasons to disable this for our benchmarks:
1. nvFuser does not support split-K in reduced precision (see #1719).
   Since half precision reduction is much faster than single precision,
   this means eager mode will be faster but less precise than
   nvFuser by default. For fair comparison, we can both use single
   precision.
2. The accuracy of matmuls is degraded for split-K problems (small M&N,
   large K) by default in PyTorch. This can lead to validation errors
   where nvFuser actually performs an accurate computation but our
   baseline is inaccurate.
jacobhinkle added a commit that referenced this pull request Oct 25, 2024

This changes the python matmul benchmark to run four times as many
tests:
- We parametrize by reduction in float or in fp16/bf16, which is enabled
by default in PyTorch.
- We parametrize by `eager`. If this is true we directly compute
`torch.matmul` without involving nvFuser. Otherwise we use nvFuser. This
lets us compute baselines in the same run as we compute the nvFuser
result instead of needing to re-run the benchmark with different
environment variables as we previously had to.

nvFuser does not support split-K in reduced precision (see #1719), so we
skip these cases for now.