
Fix concretization and execution with padded expanded empty inputs#876

Merged
jacobhinkle merged 20 commits into main from fix_870 on Oct 4, 2023

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Sep 13, 2023

Stacked on #610; see #876 (comment).

This PR:

  • Changes the analyzeResizes pass in concretization to inspect expanded extents
  • Changes root->rfactor propagation in concretization to return early when IterType is already concretized. This avoids propagating Broadcast in this test when we have already marked the resized ID as Iteration.
  • Changes PrecomputedValues::bindInputs to bind not only metadata but also the actual TensorView arguments.

I noticed that the ExpressionEvaluator used during compilation contained more bound scalars than the one used at execution, where we fail to evaluate the extent. We had i0 and i2 bound at execution, but we did not have T0 bound, so we could not compute getMetaData(T0). At compilation, T0 was bound, so there was no problem until execution.

Note that at compilation we use `auto expr_eval = executor_utils::bindInputs(args, kernel);`, whereas at execution we use `evaluatorPrecomputedValues()->bindInputs(args);`. The difference is that `PrecomputedValues::bindInputs` will call `bindTensorMetaData` instead of binding the actual tensor. This PR also binds the actual tensor, in addition to its metadata, in that method.

Fixes #870

@jacobhinkle changed the title from "Fix 870" to "[WIP] Fix 870" on Sep 13, 2023
@jacobhinkle changed the title from "[WIP] Fix 870" to "Fix concretization and evaluation with padded expanded empty inputs" on Sep 13, 2023
@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle marked this pull request as ready for review September 13, 2023 16:51
@jacobhinkle
Collaborator Author

!build

@jacobhinkle changed the title from "Fix concretization and evaluation with padded expanded empty inputs" to "Fix concretization and execution with padded expanded empty inputs" on Sep 15, 2023
@zasdfgbnm
Collaborator

Could you check the benchmark result for the cpp and python benchmarks in #749

@jjsjann123
Collaborator

> Could you check the benchmark result for the cpp and python benchmarks in #749

Out of curiosity, are we expecting trivial changes in this PR to have big impact on overhead?!

@zasdfgbnm
Collaborator

> Could you check the benchmark result for the cpp and python benchmarks in #749

> Out of curiosity, are we expecting trivial changes in this PR to have big impact on overhead?!

No, I don't expect so. Just want to confirm.

@jacobhinkle
Collaborator Author

> Could you check the benchmark result for the cpp and python benchmarks in #749

The shape inference benchmarks look like this on main:

-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
LayerNormBackward_ShapeInference                       7898 us         7898 us           84
LayerNormForward_ShapeInference                        1994 us         1994 us          353
LayerNormBackward_NoShapeInferenceCachedBaseline       1339 us         1339 us          521
LayerNormForward_NoShapeInferenceCachedBaseline         467 us          467 us         1499 

and like this on this branch:

-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
LayerNormBackward_ShapeInference                       8194 us         8194 us           88
LayerNormForward_ShapeInference                        1948 us         1948 us          343
LayerNormBackward_NoShapeInferenceCachedBaseline       1263 us         1263 us          547
LayerNormForward_NoShapeInferenceCachedBaseline         461 us          461 us         1475

Where can I find the python benchmarks?

@zasdfgbnm
Collaborator

> Where can I find the python benchmarks?

The python one is in #749 (comment)

} else {
  bindValue(input->evaluatorIndex(), *args[i]);
}
Collaborator

Oops, I think I remembered why I felt worried about this. At some point, I tried to do the same thing, but eventually gave up. Because doing so will make PrecomputedValues hold ownership of tensors, and as a result, memory usage will go up a lot.

Collaborator Author

@jacobhinkle jacobhinkle Sep 19, 2023

Ah, that makes sense. So we don't want PrecomputedValues or ExpressionEvaluator to own any tensor buffers, but it might be nice to still hold a reference to them so that we can evaluate GetMetaData.

We could bind a Pointer instead, but I think we might still wind up owning the tensor in the PolymorphicValue that it points to. We could also introduce a TensorMeta class that doesn't own its data but holds sizes, strides, dtype, device, and data pointer. I am not sure whether that would slow down evaluation of CPU aten ops, though, since we'd then need to construct an at::Tensor when evaluating those ops.

Collaborator Author

OK, I'm trying to understand the existing TensorMetaData class now. I think maybe we could bind these values as the outputs of IrBuilder::metadataExpr(tv) instead of binding to tv directly?

Collaborator Author

@jacobhinkle jacobhinkle Sep 19, 2023

In 26a1f44 I changed this from binding TVs to binding extents and TensorMetaData. That way we can evaluate the extents directly and also the metadata expressions.

Collaborator

I think it is OK for ExpressionEvaluator to bind tensors; after all, we are evaluating tensor expressions. And we do not carry instances of ExpressionEvaluator for a long time.

Collaborator Author

But if we bind a tensor tens to tv in an ExpressionEvaluator ee, then later bind a PrecomputedValues pv to ee and evaluate an expression with tv as an input, won't that populate pv with tens?

Collaborator

No, I don't think so. PrecomputedValues -> ExpressionEvaluator is one-directional.

if (precomputed_values_ && precomputed_values_->ready()) {
  if (precomputed_values_->getMaybeValueFor(value).hasValue()) {
    return precomputed_values_->getMaybeValueFor(value);
  }
}

Collaborator Author

Got it. I'll revert the last commit then and add a note that we bind to ExpressionEvaluator but should avoid binding to PrecomputedValues.

@jacobhinkle
Collaborator Author

> The python one is in #749 (comment)

Sorry I missed that.

[---------------------  ---------------------]
           |  Before #876 PR  |  After #876 PR
1 threads: -----------------------------------
      2    |       17.7       |       18.7    
      4    |       22.6       |       23.8    
      8    |       33.0       |       34.5    
      16   |       52.8       |       53.4    
      32   |       98.6       |       98.4    
      64   |      192.4       |      194.2    
      128  |      411.1       |      406.6    

Times are in microseconds (us).

@jacobhinkle
Collaborator Author

!build

Comment on lines +184 to +189
//! This is a shallow comparison operator that just checks whether we point to
//! the same exact Struct
bool operator==(const StructHandle& other) const {
  return struct_ptr_ == other.struct_ptr_;
}

Collaborator Author

After binding a TensorMetaData handle in PolymorphicValues, without this I was hitting the following error:

what(): ret.has_value() INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/lib/dynamic_type/src/dynamic_type/dynamic_type.h":741, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Cannot compute N7nvfuser12StructHandleE == N7nvfuser12StructHandleE : incompatible type
Exception raised from operator== at /opt/pytorch/nvfuser/lib/dynamic_type/src/dynamic_type/dynamic_type.h:741 (most recent call first):

Adding this shallow pointer comparison fixes it, but @zasdfgbnm please let me know if this is safe to add.

Collaborator Author

Alternatively, I could add another special case in PolymorphicValue_functions::isSame?

Collaborator

I think this makes sense

};

- TORCH_CUDA_CU_API std::shared_ptr<ReductionParams> getInnerPersistentHeuristics(
+ std::shared_ptr<ReductionParams> getInnerPersistentHeuristics(
Collaborator Author

Lintrunner was failing...

@jacobhinkle
Collaborator Author

!build

Comment on lines +357 to +360
metadata->logical_size = tensor.sizes();
metadata->logical_stride = tensor.strides();
metadata->alloc_size = tensor.sizes();
metadata->alloc_stride = tensor.strides();
Collaborator Author

@jacobhinkle jacobhinkle Sep 21, 2023

How should we handle these lines? Currently this is failing in the test AllocationDomainTest.NHWC4d_To_NHWC4d_CUDA. In that test we have a memcpy fusion with channels-last inputs and outputs and we schedule vectorization of size 4. I get the error

terminate called after throwing an instance of 'nvfuser::nvfError'
what(): is_contiguous || size == 1 || is_expanded_broadcasting || (still_rightmost && stride == 1) || (!still_rightmost && stride % word_size == 0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":545, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Vectorization of T0_g[ iS15{( (( (( getMetaData(T0) )).logical_size ))[0] )}, iS16{( (( (( getMetaData(T0) )).logical_size ))[1] )}, iS17{( (( (( getMetaData(T0) )).logical_size ))[2] )}, iS18{(
(( (( getMetaData(T0) )).logical_size ))[3] )} ] with word size 4 not possible due to invalid stride. Domain: iS18{( (( (( getMetaData(T0) )).logical_size ))[3] )}, stride: 21, cur_contig_stride: 1, non contig due to slice: 0

Inspecting this a bit I see that logical_stride and alloc_stride are the same because of the line above, but they should differ for this channels-last example.

Collaborator

Sorry, missed this message. Is this still the case?

Collaborator Author

Oh, this comment is outdated now, since I realized we should just use ExpressionEvaluator to get the metadata object here instead of trying (and failing) to replicate it as I did previously.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

I think this fixes the original problem. However, in doing so it exposes #596 in the pytest_ops.py slice correctness tests. Those tests include a fusion like this:

Fusion before concretization:
Inputs:  T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], float
Outputs:
  T1_g[ ?S4{1}rf, ?S6{6}rf, ?S8{5}rf ], float
%kernel_math {
T1_g[ ?S4{1}rf, ?S6{6}rf, ?S8{5}rf ]
   = slice( T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], { {1, 2, 1} {0, 6, 1} {3, 8, 1} } )
}

The output IterDomains are currently symbolic, even though the extents are constant. These extents should actually depend on the inputs since if they are zero, even ?S4 might have zero extent. The problem at hand is that before this PR (e.g. on main commit b96d569) this is concretized as

Concretized Fusion:
Inputs:
  T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], float
Outputs:
  T1_g[ iS12{1}rf, iS13{6}rf, iS14{5}rf ], float

%kernel_math {
T1_g[ iS12{1}rf, iS13{6}rf, iS14{5}rf ]
   = slice( T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], { {1, 2, 1} {0, 6, 1} {3, 8, 1} } )
}

That is, the output IterDomain is concretized as Iteration since the input is marked as Iteration with size 1. This is somewhat related to #900, since in that proposal we would concretize ?S0 as Broadcast as well. Anyway, this is the reverse problem from what is fixed in this PR, but the erroneous concretization was helping avoid triggering #596 previously. I would like to wait until a solution is found for that issue before merging this. If instead the current fix is needed urgently, I can disable the slice tests temporarily; I think this is the only failure in our test suite.

// we do not want them to own large objects.
// To do this we create a temporary ExpressionEvaluator so that we can compute
// the metadata once, then save it
ExpressionEvaluator ee;
Collaborator

There are some sharp edges here. For example, if between the rfactor domain and the root domain there's a split whose factor is a fusion input, then this will fail, because we do not have it bound. But for now I think it's fine, because we do not have such a case. Let's leave this as is. In the future we might need to refactor if this becomes an issue.

Collaborator Author

@jacobhinkle jacobhinkle Sep 22, 2023

Great point. We should at minimum bind this here so that any precomputed values that are already bound will be available to ee. I'll also add a comment about this pitfall.

@jjsjann123
Collaborator

> ...
> That is, the output iterdomain is concretized as Iteration since the input is marked as Iteration with size 1. This is somewhat related to #900 since in that proposal we would concretize ?S0 as broadcast as well. Anyway, this is the reverse problem from what is fixed in this PR, but the erroneous concretization was helping avoid triggering #596 previously. I would like to wait until a solution is found for that issue before merging this. If instead the current fix is needed urgently, I can disable the slice tests temporarily; i think this is the only failure in our test suite.
>
> Fusion before concretization:
> Inputs:  T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], float
> Outputs:
>   T1_g[ ?S4{1}rf, ?S6{6}rf, ?S8{5}rf ], float
> %kernel_math {
> T1_g[ ?S4{1}rf, ?S6{6}rf, ?S8{5}rf ]
>    = slice( T0_g[ iS0{i0}, iS1{i1}, iS2{i2} ], { {1, 2, 1} {0, 6, 1} {3, 8, 1} } )
> }
>
> These extents should actually depend on the inputs since if they are zero, even ?S4 might have zero extent.

IIUC, all extents from a slice are dependent on the actual input sizes, i.e. if we slice out of range, the behavior is that it'll just crop at the boundary, so the output range is indeed dynamic.

So this is backing up the idea in #900 😄

@jacobhinkle
Collaborator Author

> all extents from a slice are dependent on the actual input sizes

This is true. It's incorrect now as we assume 0 <= start <= stop <= extent in slice, but #460 would replace these with the proper expressions.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle merged commit d70e19e into main Oct 4, 2023
@jacobhinkle jacobhinkle deleted the fix_870 branch October 4, 2023 19:14

Development

Successfully merging this pull request may close these issues.

Execution error with pad fusion on expanded input

3 participants