[WIP] Saving allocations in operator Einstein (extension to PR #5550 to fix issue #5538) #5571
xadupre wants to merge 9 commits into microsoft:master
Conversation
```cpp
// Pass in allocator as that will be used as an allocator deleter by the framework
// and it will de-allocate the memory for this intermediate tensor when it goes out of scope
std::unique_ptr<Tensor> output = onnxruntime::make_unique<Tensor>(input.DataType(), output_dims, (void*)input.DataRaw(), input.Location());
```
A new tensor is created, but it reuses the buffer of another one with a new shape of the same total size. The API of the Tensor class does not seem to have a proper way of doing that.
The Tensor class may need some enhancements to truly avoid this allocation (it may need to hang onto a shared_ptr). We are creating a Tensor that doesn't own the memory, but the Tensor that actually owns the memory may be an intermediate Tensor in this op, and if it gets destructed, it will cause issues, won't it?
Yes, it would cause an issue. That's why I highlighted this part of the PR: I know it could cause an issue. However, in this case, everything happens inside EinsteinOp, so the Tensor which owns the memory stays alive as long as it is needed. That's why I did it.
What guarantees that the input tensor stays alive? As this gets called in a loop by EinsumTypedComputeProcessor::Run, can't the input go out of scope on the next iteration if it was the result of the previous iteration?
E.g. the first iteration in Run that calls PairwiseOperandProcess creates a new tensor which is saved as 'result'. The next iteration passes this in as input. If we do the reshape on this iteration, the returned tensor replaces 'result'. At that point, doesn't the buffer get freed as part of the assignment of the new result?
By chance the memory may still be accessible, but I'm not sure the ownership semantics are correct.
Was it brought down to 3700 ms (from 6200 ms) by just the Transpose improvements, and does this Einsum enhancement (the allocation save) further reduce it by 300 ms?
```cpp
}  // namespace DeviceHelpers

// This helps decide if we need to reshape a tensor.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation);
```
The CUDA implementation will most likely require changes as well; if you notice, every auxiliary op here takes in a device_func that is meant for device-specific operations...
I noticed, but this function creates a Tensor without allocating new data. I took the Transpose function just below and removed the line doing the allocation. The line `auto status = device_transpose_func(permutation, input, *output, &overriden_shape, einsum_cuda_assets);` is not needed, so the function no longer needs the parameters device_transpose_func and einsum_cuda_assets. That's why I removed them.
```cpp
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
  ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");

  // No transpose required for scalars
```
Nit: should the comment say Transpose/Reshape?
```cpp
    return false;
  }

  // A transposition only moving single dimension is equivalent to a reshape.
```
Just a thought: can this actually be part of Transpose itself? I.e., once we recognize that the transpose request is just a reshape request, can we do a memcpy of the input buffer into the output buffer (without invoking the fancier components of Transpose), stamp the copied output buffer with the new shape, and return that Tensor from Transpose? Does this special-casing live in Transpose currently? If it does not, can we check how much gain that single enhancement gives?
My concern is with re-sharing a piece of memory across multiple Tensors and synchronizing the life-cycle of that memory so that it stays valid until the last Tensor using it is gone. This is going to need some work in the Tensor class...
I already did that. In the first PR, operator Transpose detects the case where a Transpose is just a Reshape; the data is then copied in one block. In this PR, I wanted to avoid even the allocation: the 300 ms gain I measured (out of 3700 ms) is the time spent in allocation + copy of the Reshaped/Transposed Tensor. I cannot remove the allocation in operator Transpose because the reshaped Tensor cannot exist without the Tensor owning the buffer, and there is no way to make sure of that. In operator Einstein, the transposed tensors are intermediate and their scope is known: the reshaped tensor disappears before, or at the same time as, the tensor owning the buffer.
A more general comment: operator Transpose or Reshape could probably reuse the buffer of the input Tensor if this operator is the only consumer of that tensor.
> A more general comment: operator Transpose or Reshape could probably reuse the buffer of the input Tensor if this operator is the only consumer of that tensor.

If it can be determined upfront that the buffer can be re-used, an optimizer could replace the Transpose with a Reshape. That requires the perms and the input shape to be known.
I don't think it's possible to re-use the buffer otherwise: with an allocation plan, an input buffer that was expected to become unused after running the Transpose would still be in use.
Reshape re-uses the buffer when possible because the input/output is marked as an Alias in the kernel registration.
From 3700 ms to 3400 ms. The gain from 6000 ms to 3700 ms was brought by the first PR improving the Transpose operator. But still, avoiding an allocation is always a good thing.
```cpp
               const uint8_t* source, uint8_t* target, size_t element_size) {
  size_t blocksize = num_elts_in_block * element_size;
  ORT_ENFORCE(num_axes > 0, "Transpose not implemented for empty tensors.");
  MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
```
Is this ORT_ENFORCE necessary? Transpose::Compute handles empty tensors with

```cpp
if (output_shape.Size() == 0)
  return Status::OK();
```

ORT_ENFORCE involves an 'if' and a throw, so we should avoid using it unless required.
```cpp
bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  // A transposition only moving single dimension is equivalent to a reshape.
  // Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
  size_t last_permuted_axis = 0;
```
This comment is a bit confusing: 'single dimension' sounds like only one dim should move, but I believe the logic here is that as long as the dims with values > 1 stay in the same order, it's a reshape.
Yes, that's what I intended to say. I'll rewrite.
```cpp
}

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  // A transposition only moving single dimension is equivalent to a reshape.
```
static bool IsReshape(...)?
```cpp
}

TEST(TransposeOpTest, TransposeReshape) {
  std::vector<int64_t> input_shape({1, 4, 2, 1, 3});
```
Add a comment explaining that the test is there to hit the path in the Transpose implementation where we can reshape the input. It may not be obvious to someone new to this code what 'transpose reshape' means.
```cpp
// This helps decide if we need to reshape instead of transposing.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
  ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");
```
Is anything different between the implementation of IsReshapeRequired here vs. IsReshape in transpose.cc? Can we just have a single version of the logic checking the dims vs. perms?
Just to be precise, this PR includes another one: #5550. This one is an addition to save one allocation in operator Einstein. I was hoping the first one would be reviewed first, and that we could then discuss the best way to save this allocation and measure the gain obtained with this change. I'll propagate the changes into #5550.
Description:
Operator Einstein transposes tensors with size-1 dimensions (e.g. (1, 1, 1024, 4096)) and performs transpositions equivalent to a reshape (permutation=(2, 0, 3, 1), for example). PR #5550 replaces the transposition with a simple copy in that case; this change additionally avoids copying the data. It makes the model linked in #5538 faster by about 10% (3400 ms instead of 3700 ms).
This PR includes #5550, which should be merged first.
Motivation and Context
Performance.