Fix concretization and execution with padded expanded empty inputs #876
jacobhinkle merged 20 commits into main
Conversation
!build

!build
Could you check the benchmark result for the cpp and python benchmarks in #749
Out of curiosity, are we expecting trivial changes in this PR to have a big impact on overhead?!
No, I don't expect so. Just want to confirm. |
The shape inference benchmarks look like this on main and like this on this branch: [benchmark screenshots elided]. Where can I find the python benchmarks?
The python one is in #749 (comment)
csrc/evaluator_common.cpp (outdated)
-} else {
-  bindValue(input->evaluatorIndex(), *args[i]);
-}
+bindValue(input->evaluatorIndex(), *args[i]);
Oops, I think I remembered why I felt worried about this. At some point I tried to do the same thing, but eventually gave up, because doing so makes PrecomputedValues hold ownership of tensors and, as a result, memory usage goes up a lot.
Ah that makes sense. So we don't want PrecomputedValues or ExpressionEvaluator to own any tensor buffers, but it might be nice to still hold a reference to them so that we can evaluate GetMetaData.
We could bind a Pointer instead but I think we might still wind up owning the tensor in the PolymorphicValue that points to. We could also introduce a TensorMeta class that doesn't own its data but holds sizes, strides, dtype, device, and data pointer. I am not sure whether that would slow down evaluation of CPU aten ops though, since we'd need to construct an at::Tensor when evaluating those ops then.
OK I'm trying to understand the existing TensorMetaData class now. I think maybe we could bind these values as the outputs of IrBuilder::metadataExpr(tv) instead of binding to tv directly?
In 26a1f44 I changed this from binding TVs to binding extents and TensorMetaData. That way we can evaluate the extents directly and also the metadata expressions.
I think it is OK for ExpressionEvaluator to bind tensors; after all, we are evaluating tensor expressions, and we do not carry instances of ExpressionEvaluator for a long time.
But if we bind a tensor tens to tv in an ExpressionEvaluator ee, then later bind a PrecomputedValues pv to ee and evaluate an expression with tv as an input, won't that populate pv with tens?
No, I don't think so. PrecomputedValues -> ExpressionEvaluator is one-directional.
Lines 162 to 166 in 5ff26b9
Got it. I'll revert the last PR then and add a note that we bind to ExpressionEvaluator but should avoid binding to PrecomputedValues.
Sorry I missed that.
!build
//! This is a shallow comparison operator that just checks whether we point to
//! the same exact Struct
bool operator==(const StructHandle& other) const {
  return struct_ptr_ == other.struct_ptr_;
}
After binding a TensorMetaData handle in PolymorphicValues, without this I was hitting the following error:
what(): ret.has_value() INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/lib/dynamic_type/src/dynamic_type/dynamic_type.h":741, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Cannot compute N7nvfuser12StructHandleE == N7nvfuser12StructHandleE : incompatible type
Exception raised from operator== at /opt/pytorch/nvfuser/lib/dynamic_type/src/dynamic_type/dynamic_type.h:741 (most recent call first):
Adding this shallow pointer comparison fixes it, but @zasdfgbnm please let me know if this is safe to add.
Alternatively, I could add another special case in PolymorphicValue_functions::isSame?
I think this makes sense
};

-TORCH_CUDA_CU_API std::shared_ptr<ReductionParams> getInnerPersistentHeuristics(
+std::shared_ptr<ReductionParams> getInnerPersistentHeuristics(
Lintrunner was failing...
!build
csrc/evaluator_common.cpp (outdated)
metadata->logical_size = tensor.sizes();
metadata->logical_stride = tensor.strides();
metadata->alloc_size = tensor.sizes();
metadata->alloc_stride = tensor.strides();
How should we handle these lines? Currently this is failing in the test AllocationDomainTest.NHWC4d_To_NHWC4d_CUDA. In that test we have a memcpy fusion with channels-last inputs and outputs and we schedule vectorization of size 4. I get the error
terminate called after throwing an instance of 'nvfuser::nvfError'
what(): is_contiguous || size == 1 || is_expanded_broadcasting || (still_rightmost && stride == 1) || (!still_rightmost && stride % word_size == 0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":545, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Vectorization of T0_g[ iS15{( (( (( getMetaData(T0) )).logical_size ))[0] )}, iS16{( (( (( getMetaData(T0) )).logical_size ))[1] )}, iS17{( (( (( getMetaData(T0) )).logical_size ))[2] )}, iS18{(
(( (( getMetaData(T0) )).logical_size ))[3] )} ] with word size 4 not possible due to invalid stride. Domain: iS18{( (( (( getMetaData(T0) )).logical_size ))[3] )}, stride: 21, cur_contig_stride: 1, non contig due to slice: 0
Inspecting this a bit I see that logical_stride and alloc_stride are the same because of the line above, but they should differ for this channels-last example.
Sorry, missed this message. Is this still the case?
Oh this comment is outdated now since I realize we should just use ExpressionEvaluator to get the metadata object here instead of trying (and failing) to replicate it as I did previously.
!build
I think this fixes the original problem. However, in doing so it exposes #596 in this test. That is, the output iterdomain is concretized as [code elided].
// we do not want them to own large objects.
// To do this we create a temporary ExpressionEvaluator so that we can compute
// the metadata once, then save it
ExpressionEvaluator ee;
There are some sharp edges here. For example, if between the rfactor domain and the root domain there's a split whose factor is a fusion input, then this will fail, because we do not have it bound. But for now, I think it's fine because we do not have such a case. Let's leave this as is. In the future we might need to refactor if this becomes an issue.
Great point. We should at minimum bind this here so that any precomputed values that are already bound will be available to ee. I'll also add a comment about this pitfall.
IIUC, all extents from a slice are dependent on the actual input sizes, i.e. if we slice out of range, the behavior is that it'll just crop at the boundary, so the output range is indeed dynamic. So this is backing up the idea in #900 😄
This is true. It's incorrect now as we assume [code elided].
!build

!build
Stacked on #610; see #876 (comment).
This PR:
- `analyzeResizes` pass in concretization to inspect expanded extents
- `Broadcast` in this test when we have already marked the resized ID as `Iteration`
- `PrecomputedValues::bindInputs` to bind not only metadata but also the actual `TensorView` arguments

I noticed that the `ExpressionEvaluator` used during compilation contained more bound scalars than the one used at execution, where we fail to evaluate the extent. We had `i0` and `i2` bound at execution, but we did not have `T0` bound, so we could not compute `getMetaData(T0)`. At compilation, `T0` was bound so there was no problem until execution.

Note that at compilation we use `auto expr_eval = executor_utils::bindInputs(args, kernel);` whereas at execution we use `evaluatorPrecomputedValues()->bindInputs(args);`. The difference is that `PrecomputedValues::bindInputs` will call `bindTensorMetaData` instead of binding the actual tensor. This PR also binds the actual tensor in addition to its metadata in that method.

Fixes #870