Concretize resize #2835

Open
jacobhinkle wants to merge 19 commits into main from concretize_resize

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Aug 22, 2024

Currently we concretize Resize operations by determining their output IterType. This lets us properly handle cases like slices of size 1 that are later used in broadcast operations. In #2795 there is an example where static shapes are many times faster than dynamic shapes in a Fusion containing reshapes and resizes (slice). This PR is an attempt to recover static shape performance in that setting without requiring fully static shapes (i.e. recompilation for every new set of inputs).

In this PR, instead of only determining the output IterType of each Resize expression, we also evaluate the input extent and the left and right expansion values. This determines the output extent and hence its IterType since the output extent is just the sum of the input extent and the two expansion values. We use these three ints in place of the IterType in concretization info, which is used as a FusionKernelRuntime cache key. This means that we will trigger compilation whenever a pad, cat, or slice is performed in a dimension whose size changes.

This behavior can be turned off using NVFUSER_DISABLE=concretize_resize_extents, in which case we do not compile in static input extents and left/right expansion values, and we only concretize the output IterType. When that option is provided, the concretization info holds a standardized triple of ints: 2 1 0 for Iteration and 2 -1 0 for Broadcast. This way we still get cache hits even when the resize ops change, unless the output IterType also changes.

Note that #511 is a separate attempt to simplify slice expressions at concretization, without making them fully static. That PR in its current form is not enough to recover full performance on #2795. If we merge this PR, then #511 will be moot.

Example: NVFuserTest.ResizePadToBroadcastDynamic_CUDA

With NVFUSER_DISABLE=concretize_resize_extents:

Concretized Fusion:

%kernel {
i20 = (nvfuser_index_t)(i17);
i22 = (nvfuser_index_t)(i18);
i24 = (nvfuser_index_t)(i15);
i26 = (nvfuser_index_t)(i16);
i28 = (nvfuser_index_t)(i13);
i30 = (nvfuser_index_t)(i14);
i32 = (nvfuser_index_t)(i11);
i34 = (nvfuser_index_t)(i12);
T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ]
   = pad( T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ], {0, 0, i20, i22, i24, i26, i28, i30, i32, i34} )
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
   = T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
   * T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ];

TransformPrinter : 
T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
 logical domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
 contiguity: f f f f f
 loop domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
 logical domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5})
 contiguity: f f f f f
 loop domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5})
T2_l[ iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf ]
 root domain : (iS10{i0}, iS11{i2}rf, iS13{i3}rf, iS15{i4}rf, iS17{i5}rf)
  Resize: iS11{i2}rf by ( (nvfuser_index_t)(i17) ) and ( (nvfuser_index_t)(i18) ) -> bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf
  Resize: iS13{i3}rf by ( (nvfuser_index_t)(i15) ) and ( (nvfuser_index_t)(i16) ) -> iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf
  Resize: iS15{i4}rf by ( (nvfuser_index_t)(i13) ) and ( (nvfuser_index_t)(i14) ) -> bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf
  Resize: iS17{i5}rf by ( (nvfuser_index_t)(i11) ) and ( (nvfuser_index_t)(i12) ) -> iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf
 logical domain : (iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf)
 contiguity: t n t n t
 loop domain : (iS10{i0}, bS24{( ( i2 + ( (nvfuser_index_t)(i17) ) ) + ( (nvfuser_index_t)(i18) ) )}rf, iS25{( ( i3 + ( (nvfuser_index_t)(i15) ) ) + ( (nvfuser_index_t)(i16) ) )}rf, bS26{( ( i4 + ( (nvfuser_index_t)(i13) ) ) + ( (nvfuser_index_t)(i14) ) )}rf, iS27{( ( i5 + ( (nvfuser_index_t)(i11) ) ) + ( (nvfuser_index_t)(i12) ) )}rf)
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
 logical domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
 contiguity: t t t t t
 loop domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
} // %kernel

Without that flag:

Concretized Fusion:

%kernel {
T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ]
   = pad( T0_g[ iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6} ], {0, 0, -1, -1, 1, 1, 0, -4, 0, -1} )
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
   = T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
   * T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ];

TransformPrinter : 
T1_g[ iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10} ]
 logical domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
 contiguity: f f f f f
 loop domain : (iS5{i6}, iS6{i7}, iS7{i8}, iS8{i9}, iS9{i10})
T0_g[ iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6} ]
 logical domain : (iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6})
 contiguity: f f f f f
 loop domain : (iS0{i0}, iS32{3}, iS33{2}, iS34{5}, iS35{6})
T2_l[ iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf ]
 root domain : (iS10{i0}, iS24{3}rf, iS25{2}rf, iS26{5}rf, iS27{6}rf)
  Resize: iS24{3}rf by -1 and -1 -> bS28{1}rf
  Resize: iS25{2}rf by 1 and 1 -> iS29{4}rf
  Resize: iS26{5}rf by 0 and -4 -> bS30{1}rf
  Resize: iS27{6}rf by 0 and -1 -> iS31{5}rf
 logical domain : (iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf)
 contiguity: t n t n t
 loop domain : (iS10{i0}, bS28{1}rf, iS29{4}rf, bS30{1}rf, iS31{5}rf)
T3_g[ iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10} ]
 logical domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
 contiguity: t t t t t
 loop domain : (iS19{i6}, iS20{i7}, iS21{i8}, iS22{i9}, iS23{i10})
} // %kernel

Notice that the last four dimensions are static since they are involved in non-trivial pads, but the first dimension (i0 and i6) remains dynamic.

Performance on rope example

The example from #2795 originally took 35 us (356 GB/s) on my device (3090Ti). With this PR it runs in 8 us (1581 GB/s). In that example, all dimensions are actually made static during concretization since they are all involved in non-trivial slices or reshape ops.

Fixes #2795

@jacobhinkle
Collaborator Author

!build --diff-bench

@jjsjann123 jjsjann123 self-requested a review August 23, 2024 16:16
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

jacobhinkle commented Aug 27, 2024

I'm investigating the test failure in test_host_ir --gtest_filter=HostIrTest.ForLoops/useFusionExecutor. There is indeed an accuracy bug and the generated kernel differs in a single index:

--- main.cu     2024-08-27 19:36:33.086411645 -0400
+++ pr2835.cu   2024-08-27 19:36:44.867256256 -0400
@@ -10700,37 +10700,37 @@
 __global__ void nvfuser_none_f0_c0_r0_g0(nvfuser_index_t i0, Tensor<int64_t, 1, 1> T2, Tensor<int64_t, 1, 1> T3) {
   NVFUSER_DEFINE_MAGIC_ZERO;
   nvfuser_index_t i1;
   i1 = i0 + 7LL;
   bool b2;
   b2 = i0 < 0LL;
   nvfuser_index_t i3;
   i3 = b2 ? i1 : i0;
   nvfuser_index_t i4;
   i4 = max(0LL, i3);
   nvfuser_index_t i5;
   i5 = i0 + 1LL;
   bool b6;
   b6 = i5 < 0LL;
   nvfuser_index_t i7;
   i7 = i5 + 7LL;
   nvfuser_index_t i8;
   i8 = b6 ? i7 : i5;
   nvfuser_index_t i9;
   i9 = min(7LL, i8);
   nvfuser_index_t i10;
   i10 = max(i4, i9);
   int64_t T0[7LL];
   #pragma unroll
   for(nvfuser_index_t i11 = 0LL; i11 < 7LL; ++i11) {
     T0[i11] = (i11 + nvfuser_zero);
   }
   NVFUSER_UPDATE_MAGIC_ZERO;
   int64_t T1[1LL];
   T1[0LL]
-     = T0[i4];
+     = T0[1LL];
   T3[0LL]
     = T2[0LL]
     + T1[0LL];
 }

This difference is reflective of the following changes to the fusion IR:

--- maintransforms.txt  2024-08-27 19:43:20.162728311 -0400
+++ pr2835transforms.txt        2024-08-27 19:43:44.897347764 -0400
@@ -1,39 +1,39 @@
 %kernel {
 T0_l[ iS0{7} ]
    = iota(7, 0, 1, int64_t);
 b13 = i7 < 0;
 i16 = i7 + 7;
 i18 = where(b13, i16, i7);
 i20 = fmax(0, i18);
 i9 = i7 + 1;
 b22 = i9 < 0;
 i25 = i9 + 7;
 i27 = where(b22, i25, i9);
 i29 = fmin(7, i27);
 i31 = fmax(i20, i29);
 T1_l[ bS5{1}rf ]
    = slice( T0_l[ iS0{7} ], { {i20, i31, 1} } )
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
+T3_g[ bS7{1} ]
    = T2_g[ bS3{1} ]
    + T1_l[ bS5{1}rf ];
 
 TransformPrinter :
 T2_g[ bS3{1} ]
  logical domain : (bS3{1})
  contiguity: n
  loop domain : (bS3{1})
 T0_l[ iS0{7} ]
  logical domain : (iS0{7})
  contiguity: t
  loop domain : (iS0{7})
 T1_l[ bS5{1}rf ]
  root domain : (iS1{7}rf)
-  Resize: iS1{7}rf by ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) and ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) -> bS5{1}rf
+  Resize: iS1{7}rf by -1 and -5 -> bS5{1}rf
  logical domain : (bS5{1}rf)
  contiguity: n
  loop domain : (bS5{1}rf)
-T3_g[ bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )} ]
- logical domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+T3_g[ bS7{1} ]
+ logical domain : (bS7{1})
  contiguity: n
- loop domain : (bS6{( ( ( -( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ) ) + 7 ) + ( ( fmax(( fmax(0, ( where(( i7 < 0 ), ( i7 + 7 ), i7) )) ), ( fmin(7, ( where(( ( i7 + 1 ) < 0 ), ( ( i7 + 1 ) + 7 ), ( i7 + 1 )) )) )) ) - 7 ) )})
+ loop domain : (bS7{1})
 } // %kernel

EDIT: I think what is happening is that the HostIrExecutor is concretizing the fusion with the first set of params and then not updating it for subsequent loop iterations. I think this is a bug in the host IR executor; we should probably detect whether the loop index is a "dynamic val" to determine whether we need to reconcretize for each evaluation. For now, I will just disable concretization of resize extents in that test and file an issue once this PR is merged.

@jacobhinkle
Collaborator Author

The latest pushed change fixes the broken test, but note that this exposes a bug in HostIrExecutor's concretization method when params.use_fusion_executor_cache = false. That bug is not specific to Resize ops: if we did a reshape based on the ForLoop index, for example, I think we would see a similar error. I will post that as a separate issue soon.

@jacobhinkle jacobhinkle marked this pull request as ready for review August 28, 2024 00:24
@jacobhinkle
Copy link
Collaborator Author

Note that at some point in the past week we have gone from 35 us on main to about 13 us for the repro in #2795. This PR still brings that down to 8 us. I haven't bisected yet to see where that speedup came from; I just noticed it this evening.

@naoyam
Collaborator

naoyam commented Aug 28, 2024

IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?

@jjsjann123 jjsjann123 left a comment

still looking at the actual transform.

```cpp
resize_extents_.emplace_back(
    id_index,
    ConcreteResize{
        2, input_extent + left_expand + right_expand == 1 ? -1 : 1, 0});
```
Collaborator

Naive question: why do we want a leading 2 and a trailing 0 here? This reads a bit strange...

And couldn't 2, 1, 0 or 2, -1, 0 conflict with an actual static resize, assuming we can switch the option on/off at runtime?

Collaborator Author

Good point. I think instead an input extent of -1 should be used, which will never conflict with an actual case. I can then use -1, 1, 1 to represent Broadcast and -1, 2, 2 to represent Iteration.

```diff
 void OptOutMutator::mutate(NamedScalar* ns) {}

-void OptOutMutator::mutate(IterDomain* id) {
+void OptOutMutator::mutate(IterDomain* orig_id) {
```
Collaborator

Sorry, I'm not following the code change here.

Reads like this is an orthogonal issue in the mutator that's patched in this PR? My usual ask: can we have an accompanying test, or a small segment in the PR description, to help me understand it? 🙇

Collaborator Author

Good point. I will add that.

@jacobhinkle
Collaborator Author

> IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?

Yes, this is correct: resized IterDomains are treated as if they were static, with static expands.

@jjsjann123
Collaborator

> IIUC, this means that resized domains would be effectively treated as static? Most likely, the expansion factors of resize would depend on the actual extent of the resized domain, like in the RoPE case, so if the expansion factors were treated as static, the domain itself would be effectively static. Is my understanding correct?
>
> Yes this is correct, resized IterDomains are treated as if they were static with static expands.

Nitpick: here we are using sameAs to check whether a resize is needed in slice. Should/could we instead try to evaluate an equality between range.stop and inp_root_size?

Fuser/csrc/ops/alias.cpp, lines 786 to 787 in 13fba3b:

```cpp
if (range.start->isZeroInt() && range.stop->sameAs(inp_root_size) &&
    range.step->isOneInt()) {
```

@jjsjann123 jjsjann123 left a comment

Looks neat!

Please ping me again when the mutator piece is ready for another round of review.

```cpp
// Concretize each resize op.
for (const auto& [id_index, iter_type] : info_->getResizeIterTypes()) {
  auto id = info_->initialInfo()->getDynamicResizedIterDomains().at(id_index);
  if (isOptionDisabled(DisableOption::ConcretizeResizeExtents)) {
```
Collaborator

Nitpick: earlier there's a comment at the call site that's outdated at this point. We can move that one here into this shortcut.

```cpp
// Set output IterTypes for dynamic resize ops
concretizeResize();
```

```cpp
IterDomainBuilder builder(in_id);

if (!has_const_extent) {
  Val* new_extent = IrBuilder::create<Val>(
```
Collaborator

QQ: is it beneficial to keep the old extent untouched, when it can be evaluated as a constant?

My naive impression is that this update can be applied unconditionally.

Collaborator Author

It could probably be applied unconditionally, but there's no advantage to replacing the extent in this case.

```cpp
// Concretize the output shape which is constant
if (!orig_id->extent()->sameAs(new_id->extent())) {
  registerConcretization(orig_id->extent(), new_id->extent());
}
```
Collaborator

Why are we not handling the expanded extent here?

i.e., doesn't slicing an expanded dimension preserve the expanded field, which means we need to register that in concretization as well?

Collaborator Author

It's a good point that we should handle expanded extents on the inputs. Ideally we would also handle the output expanded extent, so I will update the PR. However, note that currently, I believe, if you resize an expanded ID the result is Iteration. This is because we do not check that the pad values are negative, so we might actually be padding that dimension. When the pad values are constant we could do that check and sometimes preserve expanded broadcasts in slice.

@jacobhinkle
Collaborator Author

@jjsjann123 re our conversation today, maybe we could also (in another PR) add some info in the default case to indicate that the resize is trivial. Currently, if you don't pass this EnableOption, we save the resize concretization info as (-1, 1, 1) or (-1, 2, 2) to indicate that this is a Broadcast or Iteration domain. We could additionally save (-1, -1, -1) to indicate that this is a trivial Resize (i.e. both expand values are zero), and during concretization we would remove the Resize op altogether.



Development

Successfully merging this pull request may close these issues.

performance issue on dynamic shaped tensor

3 participants