Concretize resize #2835

Open
jacobhinkle wants to merge 19 commits into main from concretize_resize

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Aug 22, 2024

Currently we concretize Resize operations by determining their output IterType. This lets us properly handle cases like slices of size 1 that are later used in broadcast operations. In #2795 there is an example where static shapes are many times faster than dynamic shapes in a Fusion containing reshapes and resizes (slice). This PR is an attempt to recover static shape performance in that setting without requiring fully static shapes (i.e. recompilation for every new set of inputs).

In this PR, instead of only determining the output IterType of each Resize expression, we also evaluate the input extent and the left and right expansion values. This determines the output extent and hence its IterType since the output extent is just the sum of the input extent and the two expansion values. We use these three ints in place of the IterType in concretization info, which is used as a FusionKernelRuntime cache key. This means that we will trigger compilation whenever a pad, cat, or slice is performed in a dimension whose size changes.

This behavior can be turned off using NVFUSER_DISABLE=concretize_resize_extents, in which case we do not compile in static input extents and left/right expansion values, and we only concretize the output IterType. When that option is provided, the concretization info holds a standardized triple of ints: 2 1 0 for Iteration and 2 -1 0 for Broadcast. This way we still get cache hits even when the resize ops change, unless the output IterType also changes.

Note that #511 is a separate attempt to simplify slice expressions at concretization, without making them fully static. That PR in its current form is not enough to recover full performance on #2795. If we merge this PR, then #511 will be moot.

Example: NVFuserTest.ResizePadToBroadcastDynamic_CUDA

With NVFUSER_DISABLE=concretize_resize_extents:

Concretized Fusion:

%kernel {
i20 = (nvfuser_index_t)(i17);
i22 = (nvfuser_index_t)(i18);
i24 = (nvfuser_index_t)(i15);
i26 = (nvfuser_index_t)(i16);
i28 = (nvfuser_index_t)(i13);
i30 = (nvfuser_index_t)(i14);
i32 = (nvfuser_index_t)(i11);
i34 = (nvfuser_index_t)(i12);
T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ]
   = pad( T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ], {0, 0, i20, i22, i24, i26, i28, i30, i32, i34} )
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
   = T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
   * T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ];

TransformPrinter : 
T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
 logical domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
 contiguity: f f f f f
 loop domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
 logical domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5})
 contiguity: f f f f f
 loop domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5})
T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ]
 root domain : (iS10{i0}, iS11{i2}rf, iS13{i3}rf, iS15{i4}rf, iS17{i5}rf)
  Resize: iS11{i2}rf by ( (nvfuser_index_t)(i17) ) and ( (nvfuser_index_t)(i18) ) -> bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf
  Resize: iS13{i3}rf by ( (nvfuser_index_t)(i15) ) and ( (nvfuser_index_t)(i16) ) -> iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf
  Resize: iS15{i4}rf by ( (nvfuser_index_t)(i13) ) and ( (nvfuser_index_t)(i14) ) -> bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf
  Resize: iS17{i5}rf by ( (nvfuser_index_t)(i11) ) and ( (nvfuser_index_t)(i12) ) -> iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf
 logical domain : (iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf)
 contiguity: t n t n t
 loop domain : (iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf)
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
 logical domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
 contiguity: t t t t t
 loop domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
} // %kernel

Without that flag:

Concretized Fusion:

%kernel {
T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ]
   = pad( T0_g[ iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6} ], {0, 0, -1, -1, 1, 1, 0, -4, 0, -1} )
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
   = T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
   * T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ];

TransformPrinter : 
T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
 logical domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
 contiguity: f f f f f
 loop domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
T0_g[ iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6} ]
 logical domain : (iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6})
 contiguity: f f f f f
 loop domain : (iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6})
T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ]
 root domain : (iS10{i0}, iS24{3}rf, iS25{2}rf, iS26{5}rf, iS27{6}rf)
  Resize: iS24{3}rf by -1 and -1 -> bS28{1}rf
  Resize: iS25{2}rf by 1 and 1 -> iS29{4}rf
  Resize: iS26{5}rf by 0 and -4 -> bS30{1}rf
  Resize: iS27{6}rf by 0 and -1 -> iS31{5}rf
 logical domain : (iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf)
 contiguity: t n t n t
 loop domain : (iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf)
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
 logical domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
 contiguity: t t t t t
 loop domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
} // %kernel

Notice that the last four dimensions are static since they are involved in non-trivial pads, but the first dimension (i0 and i6) remains dynamic.

Performance on rope example

The example from #2795 originally took 35 us (356 GB/s) on my device (3090Ti). With this PR it runs in 8 us (1581 GB/s). In that example, all dimensions are actually made static during concretization since they are all involved in non-trivial slices or reshape ops.

Fixes #2795

@jacobhinkle
Collaborator Author

!build --diff-bench

@jjsjann123 jjsjann123 self-requested a review August 23, 2024 16:16
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

jacobhinkle commented Aug 27, 2024

I'm investigating the test failure in test_host_ir --gtest_filter=HostIrTest.ForLoops/useFusionExecutor. There is indeed an accuracy bug and the generated kernel differs in a single index:

--- main.cu     2024-08-27 19:36:33.086411645 -0400
+++ pr2835.cu   2024-08-27 19:36:44.867256256 -0400
@@ -10700,37 +10700,37 @@
 __global__ void nvfuser_none_f0_c0_r0_g0(nvfuser_index_t i0, Tensor<int64_t, 1, 1> T2, Tensor<int64_t, 1, 1> T3) {
   NVFUSER_DEFINE_MAGIC_ZERO;
   nvfuser_index_t i1;
   i1 = i0 + 7LL;
   bool b2;
   b2 = i0 < 0LL;
   nvfuser_index_t i3;
   i3 = b2 ? i1 : i0;
   nvfuser_index_t i4;
   i4 = max(0LL, i3);
   nvfuser_index_t i5;
   i5 = i0 + 1LL;
   bool b6;
   b6 = i5 < 0LL;
   nvfuser_index_t i7;
   i7 = i5 + 7LL;
   nvfuser_index_t i8;
   i8 = b6 ? i7 : i5;
   nvfuser_index_t i9;
   i9 = min(7LL, i8);
   nvfuser_index_t i10;
   i10 = max(i4, i9);
   int64_t T0[7LL];
   #pragma unroll
   for(nvfuser_index_t i11 = 0LL; i11 < 7LL; ++i11) {
     T0[i11] = (i11 + nvfuser_zero);
   }
   NVFUSER_UPDATE_MAGIC_ZERO;
   int64_t T1[1LL];
   T1[0LL]
-     = T0[i4];
+     = T0[1LL];
   T3[0LL]
     = T2[0LL]
     + T1[0LL];
 }

This difference is reflective of the following changes to the fusion IR:

--- maintransforms.txt  2024-08-27 19:43:20.162728311 -0400
+++ pr2835transforms.txt        2024-08-27 19:43:44.897347764 -0400
@@ -1,39 +1,39 @@
 %kernel {
 T0_l[ iS0{7} ]
    = iota(7, 0, 1, int64_t);
 b13 = i7 < 0;
 i16 = i7 + 7;
 i18 = where(b13, i16, i7);
 i20 = fmax(0, i18);
 i9 = i7 + 1;
 b22 = i9 < 0;
 i25 = i9 + 7;
 i27 = where(b22, i25, i9);
 i29 = fmin(7, i27);
 i31 = fmax(i20, i29);
 T1_l[ bS5{1}rf ]
    = slice( T0_l[ iS0{7} ], { {i20, i31, 1} } )
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
+T3_g[ bS7{1} ]
    = T2_g[ bS3{1} ]
    + T1_l[ bS5{1}rf ];
 
 TransformPrinter :
 T2_g[ bS3{1} ]
  logical domain : (bS3{1})
  contiguity: n
  loop domain : (bS3{1})
 T0_l[ iS0{7} ]
  logical domain : (iS0{7})
  contiguity: t
  loop domain : (iS0{7})
 T1_l[ bS5{1}rf ]
  root domain : (iS1{7}rf)
-  Resize: iS1{7}rf by ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) and ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) -> bS5{1}rf
+  Resize: iS1{7}rf by -1 and -5 -> bS5{1}rf
  logical domain : (bS5{1}rf)
  contiguity: n
  loop domain : (bS5{1}rf)
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
- logical domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+T3_g[ bS7{1} ]
+ logical domain : (bS7{1})
  contiguity: n
- loop domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+ loop domain : (bS7{1})
 } // %kernel

EDIT: I think what is happening is that the HostIrExecutor is concretizing the fusion with the first set of params and then not updating it for subsequent loop iterations. I think this is a bug in the host IR executor; we should probably detect whether the loop index is a "dynamic val" to determine whether we need to reconcretize for each evaluation. For now, I will just disable concretization of resize extents in that test and file an issue once this PR is merged.

@jacobhinkle
Collaborator Author

The latest pushed change fixes the broken test, but note that this exposes a bug in HostIrExecutor's concretization method when params.use_fusion_executor_cache = false. That bug is not specific to Resize ops: if we did a reshape based on the ForLoop index, for example, I think we would see a similar error. I will post that as a separate issue soon.

@jacobhinkle jacobhinkle marked this pull request as ready for review August 28, 2024 00:24
@jacobhinkle
Copy link
Collaborator Author

Note that at some point in the past week we have gone from 35 us on main to about 13 us for the repro in #2795. This PR still brings that down to 8 us. I haven't bisected yet to see where that speedup came from; I just noticed it this evening.

@naoyam
Collaborator

naoyam commented Aug 28, 2024

IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?

@jjsjann123 jjsjann123 left a comment

still looking at the actual transform.

```cpp
resize_extents_.emplace_back(
    id_index,
    ConcreteResize{
        2, input_extent + left_expand + right_expand == 1 ? -1 : 1, 0});
```
Collaborator

Naive question: why do we want a leading 2 and a trailing 0 here? This reads a bit strange...

And couldn't 2, 1, 0 or 2, -1, 0 conflict with an actual static resize, assuming we can switch the option on/off at runtime?

Collaborator Author

Good point. I think instead an input extent of -1 should be used, which will never conflict with an actual case. I can then use -1, 1, 1 to represent Broadcast and -1, 2, 2 to represent Iteration.

```diff
 void OptOutMutator::mutate(NamedScalar* ns) {}

-void OptOutMutator::mutate(IterDomain* id) {
+void OptOutMutator::mutate(IterDomain* orig_id) {
```
Collaborator

Sorry, I'm not following the code change here.

Reads like this is an orthogonal issue in the mutator that's patched in this PR? My usual ask: can we have an accompanying test, or a small segment in the PR description, to help me understand it? 🙇

Collaborator Author

Good point. I will add that.

@jacobhinkle
Collaborator Author

> IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?

Yes, this is correct: resized IterDomains are treated as if they were static, with static expands.

@jjsjann123
Collaborator

> IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?
>
> Yes this is correct, resized IterDomains are treated as if they were static with static expands.

Nitpick: here we are using sameAs to check whether a resize is needed in slice. Should/could we instead try to evaluate an equality between range.stop and inp_root_size?

Fuser/csrc/ops/alias.cpp, lines 786 to 787 in 13fba3b:

```cpp
if (range.start->isZeroInt() && range.stop->sameAs(inp_root_size) &&
    range.step->isOneInt()) {
```

@jjsjann123 jjsjann123 left a comment

Looks neat!

Please ping me again when the mutator piece is ready for another round of review.

```cpp
// Concretize each resize op.
for (const auto& [id_index, iter_type] : info_->getResizeIterTypes()) {
  auto id = info_->initialInfo()->getDynamicResizedIterDomains().at(id_index);
  if (isOptionDisabled(DisableOption::ConcretizeResizeExtents)) {
```
Collaborator

Nitpick: earlier there's a comment at the call site that's outdated at this point. We can move that one here into this shortcut.

```cpp
// Set output IterTypes for dynamic resize ops
concretizeResize();
```

```cpp
IterDomainBuilder builder(in_id);

if (!has_const_extent) {
  Val* new_extent = IrBuilder::create<Val>(
```
Collaborator

QQ: is it beneficial to keep the old extent untouched, when it can be evaluated as a constant?

My naive impression is that this update can be applied unconditionally.

Collaborator Author

It could probably be applied unconditionally, but there's no advantage to replacing the extent in this case.

```cpp
// Concretize the output shape which is constant
if (!orig_id->extent()->sameAs(new_id->extent())) {
  registerConcretization(orig_id->extent(), new_id->extent());
}
```
Collaborator

Why are we not handling the expanded extent here?

i.e., doesn't slicing an expanded dimension preserve the expanded field, which means we need to register that in concretization as well?

Collaborator Author

It's a good point that we should handle expanded extents on the inputs. Ideally we would also handle the output expanded extent, so I will update the PR. However, note that currently, I believe, if you resize an expanded ID the result is Iteration. This is because we do not check that the pad values are negative, so we might actually be padding that dimension. When the pad values are constant we could do that check and sometimes preserve expanded broadcasts in slice.

@jacobhinkle
Collaborator Author

@jjsjann123 re our conversation today, maybe we could also (in another PR) add some info in the default case to indicate that the resize is trivial. Currently, if you don't pass this EnableOption, we save the resize concretization info as (-1, 1, 1) or (-1, 2, 2) to indicate that this is a Broadcast or Iteration domain. We could additionally save (-1, -1, -1) to indicate that this is a trivial Resize (i.e. both expand values are zero), and during concretization we would remove the Resize op altogether.



Development

Successfully merging this pull request may close these issues.

performance issue on dynamic shaped tensor

3 participants