Conversation

!build --diff-bench

!build --diff
I'm investigating the test failure in

--- main.cu	2024-08-27 19:36:33.086411645 -0400
+++ pr2835.cu 2024-08-27 19:36:44.867256256 -0400
@@ -10700,37 +10700,37 @@
__global__ void nvfuser_none_f0_c0_r0_g0(nvfuser_index_t i0, Tensor<int64_t, 1, 1> T2, Tensor<int64_t, 1, 1> T3) {
NVFUSER_DEFINE_MAGIC_ZERO;
nvfuser_index_t i1;
i1 = i0 + 7LL;
bool b2;
b2 = i0 < 0LL;
nvfuser_index_t i3;
i3 = b2 ? i1 : i0;
nvfuser_index_t i4;
i4 = max(0LL, i3);
nvfuser_index_t i5;
i5 = i0 + 1LL;
bool b6;
b6 = i5 < 0LL;
nvfuser_index_t i7;
i7 = i5 + 7LL;
nvfuser_index_t i8;
i8 = b6 ? i7 : i5;
nvfuser_index_t i9;
i9 = min(7LL, i8);
nvfuser_index_t i10;
i10 = max(i4, i9);
int64_t T0[7LL];
#pragma unroll
for(nvfuser_index_t i11 = 0LL; i11 < 7LL; ++i11) {
T0[i11] = (i11 + nvfuser_zero);
}
NVFUSER_UPDATE_MAGIC_ZERO;
int64_t T1[1LL];
T1[0LL]
- = T0[i4];
+ = T0[1LL];
T3[0LL]
= T2[0LL]
+ T1[0LL];
}

This difference is reflective of the following changes to the fusion IR:

--- maintransforms.txt	2024-08-27 19:43:20.162728311 -0400
+++ pr2835transforms.txt 2024-08-27 19:43:44.897347764 -0400
@@ -1,39 +1,39 @@
%kernel {
T0_l[ iS0{7} ]
= iota(7, 0, 1, int64_t);
b13 = i7 < 0;
i16 = i7 + 7;
i18 = where(b13, i16, i7);
i20 = fmax(0, i18);
i9 = i7 + 1;
b22 = i9 < 0;
i25 = i9 + 7;
i27 = where(b22, i25, i9);
i29 = fmin(7, i27);
i31 = fmax(i20, i29);
T1_l[ bS5{1}rf ]
= slice( T0_l[ iS0{7} ], { {i20, i31, 1} } )
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
+T3_g[ bS7{1} ]
= T2_g[ bS3{1} ]
+ T1_l[ bS5{1}rf ];
TransformPrinter :
T2_g[ bS3{1} ]
logical domain : (bS3{1})
contiguity: n
loop domain : (bS3{1})
T0_l[ iS0{7} ]
logical domain : (iS0{7})
contiguity: t
loop domain : (iS0{7})
T1_l[ bS5{1}rf ]
root domain : (iS1{7}rf)
- Resize: iS1{7}rf by ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) and ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) -> bS5{1}rf
+ Resize: iS1{7}rf by -1 and -5 -> bS5{1}rf
logical domain : (bS5{1}rf)
contiguity: n
loop domain : (bS5{1}rf)
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
- logical domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+T3_g[ bS7{1} ]
+ logical domain : (bS7{1})
contiguity: n
- loop domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+ loop domain : (bS7{1})
} // %kernel

EDIT: I think what is happening is the
The latest pushed change fixes the broken test, but note that this exposes a bug in
Note that at some point in the past week we have gone from 35 us on main to about 13 us for the repro in #2795. This PR still brings this down to 8 us. I haven't bisected yet to see where that speedup came from; I just noticed it this evening.
IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?
jjsjann123
left a comment
still looking at the actual transform.
resize_extents_.emplace_back(
    id_index,
    ConcreteResize{
        2, input_extent + left_expand + right_expand == 1 ? -1 : 1, 0});
naive question, why do we want to have a leading 2 and a trailing 0 here? This reads a bit strange...
and 2, 1, 0 or 2, -1, 0 could potentially have conflicts with a static resize? Assuming we can switch the option on/off at runtime?
Good point. I think instead an input extent of -1 should be used, which will never conflict with an actual case. I can then use -1, 1, 1 to represent Broadcast and -1, 2, 2 to represent Iteration.
void OptOutMutator::mutate(NamedScalar* ns) {}

- void OptOutMutator::mutate(IterDomain* id) {
+ void OptOutMutator::mutate(IterDomain* orig_id) {
sorry I'm not following the code change here.
Reads like this is an orthogonal issue in mutator that's patched in this PR? My usual ask is, can we have an accompanying test / or a small segment in the PR description to help me understand it? 🙇
Good point. I will add that.
Yes, this is correct: resized IterDomains are treated as if they were static, with static expansion factors.
nitpick: Lines 786 to 787 in 13fba3b
jjsjann123
left a comment
Looks neat!
Please ping me again when mutator piece is good for another round of review.
csrc/dynamic_transform.cpp (outdated)
// Concretize each resize op.
for (const auto& [id_index, iter_type] : info_->getResizeIterTypes()) {
  auto id = info_->initialInfo()->getDynamicResizedIterDomains().at(id_index);
  if (isOptionDisabled(DisableOption::ConcretizeResizeExtents)) {
nitpick: earlier there's a comment at the call site that's outdated at this point. We can move that one here into this short-cut.
// Set output IterTypes for dynamic resize ops
concretizeResize();
IterDomainBuilder builder(in_id);

if (!has_const_extent) {
  Val* new_extent = IrBuilder::create<Val>(
QQ: is it beneficial to keep the old extent untouched, when it can be evaluated as a constant?
My naive impression is that this update can be applied unconditionally.
It could be applied unconditionally probably, but there's no advantage to replacing it in this case.
// Concretize the output shape which is constant
if (!orig_id->extent()->sameAs(new_id->extent())) {
  registerConcretization(orig_id->extent(), new_id->extent());
}
Why are we not handling expanded extent here?
i.e. doesn't slice an expanded dimension preserve the expanded field here, which means we need to register that in concretization as well?
It's a good point that we should handle expanded extents on the inputs. Ideally we would also handle the output expanded extent, so I will update the PR. However, note that currently, if you resize an expanded ID, I believe the result is Iteration. This is because we do not check that the pad values are negative, so we might be doing an actual padding of that dimension. In case the pad values are constant, we could do that check and sometimes preserve expanded broadcasts in slice.
Co-authored-by: jjsjann123 <jiej@nvidia.com>
@jjsjann123 re our conversation today, maybe we could also (in another PR) add some info in the default case to indicate that the resize is trivial. Currently, if you don't pass this EnableOption, we save the resize concretization info as (-1, 1, 1) or (-1, 2, 2) to indicate that this is a Broadcast or Iteration domain. We could additionally save (-1, -1, -1) to indicate that this is a trivial Resize (i.e. both the expand values are zero), and during concretization we would remove the Resize op altogether.
Currently we concretize `Resize` operations by determining their output `IterType`. This lets us properly handle cases like slices of size 1 that are later used in broadcast operations. In #2795 there is an example where static shapes are many times faster than dynamic shapes in a Fusion containing reshapes and resizes (slice). This PR is an attempt to recover static-shape performance in that setting without requiring fully static shapes (i.e. recompilation for every new set of inputs).

In this PR, instead of only determining the output `IterType` of each `Resize` expression, we also evaluate the input extent and the left and right expansion values. This determines the output extent and hence its `IterType`, since the output extent is just the sum of the input extent and the two expansion values. We use these three ints in place of the `IterType` in the concretization info, which is used as a `FusionKernelRuntime` cache key. This means that we will trigger compilation whenever a pad, cat, or slice is performed in a dimension whose size changes.

This behavior can be turned off using `NVFUSER_DISABLE=concretize_resize_extents`, in which case we do not compile in static input extents and left/right expansion values, and we only concretize the output `IterType`. When that option is provided, the concretization info holds a standardized triple of ints: `2 1 0` for `Iteration` and `2 -1 0` for `Broadcast`. This is so that we will still get cache hits even when the resize ops change, unless the output `IterType` also changes.

Note that #511 is a separate attempt to simplify slice expressions at concretization, without making them fully static. That PR in its current form is not enough to recover full performance on #2795. If we merge this PR, then #511 will be moot.
Example: `NVFuserTest.ResizePadToBroadcastDynamic_CUDA`

With `NVFUSER_DISABLE=concretize_resize_extents`:

Without that flag:
Notice that the last four dimensions are static since they are involved in non-trivial pads, but the first dimension (`i0` and `i6`) remains dynamic.

Performance on rope example
The example from #2795 originally took 35 us (356 GB/s) on my device (3090Ti). With this PR it runs in 8 us (1581 GB/s). In that example, all dimensions are actually made static during concretization since they are all involved in non-trivial slices or reshape ops.
Fixes #2795