
[WIP] Saving allocations in operator Einstein (extension to PR #5550 to fix issue #5538)#5571

Closed
xadupre wants to merge 9 commits into microsoft:master from xadupre:transein

Conversation

@xadupre (Member) commented Oct 22, 2020

Description:
Operator Einstein transposes tensors whose shape contains size-1 dimensions, e.g. (1, 1, 1024, 4096), using permutations that are equivalent to a reshape (permutation=(2, 0, 3, 1) for example). PR #5550 replaces the transposition with a simple copy in that case; this change additionally avoids copying the data. It makes the model linked in #5538 about 10% faster (3400 ms instead of 3700 ms).

This PR includes #5550, which should be merged first.

Motivation and Context
Performance.

@xadupre xadupre requested a review from skottmckay October 22, 2020 17:32
@xadupre xadupre requested a review from a team as a code owner October 22, 2020 17:32

// Pass in allocator as that will be used as an allocator deleter by the framework
// and it will de-allocate the memory for this intermediate tensor when it goes out of scope
std::unique_ptr<Tensor> output = onnxruntime::make_unique<Tensor>(input.DataType(), output_dims, (void*)input.DataRaw(), input.Location());
@xadupre (Member Author):
A new tensor is created, but it reuses the buffer of another one with a new shape of equal size. The API of class Tensor does not seem to have a proper way of doing that.

@hariharans29 (Member) Oct 22, 2020:

The Tensor class may need some enhancements to truly avoid this allocation (it may need to hang onto a shared_ptr). We are creating a Tensor that doesn't own the memory, but the Tensor that actually owns the memory may be an intermediate Tensor in this op, and if it gets destructed, it will cause issues, won't it?

@xadupre (Member Author):

Yes, it would cause an issue; that's why I highlighted this part of the PR. However, in this case everything happens inside the Einstein op, so the Tensor that owns the memory stays alive as long as it is needed. That's why I did it.

Contributor:

What guarantees that the input tensor stays alive? As this gets called in a loop by EinsumTypedComputeProcessor::Run, can't the input go out of scope on the next iteration if it was the result from the previous iteration?

E.g. the first iteration in Run that calls PairwiseOperandProcess creates a new tensor which is saved as 'result'. The next iteration passes this in as input. If we do the reshape on this iteration, the returned tensor replaces 'result'. At that point, doesn't the buffer get freed as part of the assignment of the new result?

By chance the memory may still be accessible, but I'm not sure the ownership semantics are correct.

@hariharans29 (Member) commented Oct 22, 2020

> This change makes model linked in #5538 faster by 10% (3400 ms instead of 3700).

Was it brought down to 3700 ms (from 6200 ms) by just the Transpose improvements, and this Einsum enhancement (allocation save) further reduces it by 300 ms?

} // namespace DeviceHelpers

// This helps decide if we need to reshape a tensor.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation);
@hariharans29 (Member) Oct 22, 2020:

The CUDA implementation will most likely require changes as well: if you notice, every auxiliary op here takes in a device_func that is meant for device-specific operations...

@xadupre (Member Author):

I noticed, but this function creates a Tensor without any allocation to store new data. I took the function Transpose just below and removed the line doing the allocation. The line `auto status = device_transpose_func(permutation, input, *output, &overriden_shape, einsum_cuda_assets);` is not needed, and the function no longer needs the parameters device_transpose_func and einsum_cuda_assets. That's why I removed them.

bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");

// No transpose required for scalars
@hariharans29 (Member) Oct 22, 2020:

Nit: should the comment say Transpose/Reshape?

return false;
}

// A transposition only moving single dimension is equivalent to a reshape.
Member:

Just a thought: can this stub actually be part of Transpose itself? That is, once we recognize that the transpose request is just a reshape request within the Transpose op, can we simply memcpy the input buffer into the output buffer (without invoking the fancier components of Transpose), stamp the copied buffer with the new shape, and return that Tensor from Transpose? Does this special-casing live in Transpose currently? If not, can we check how much gain that single enhancement gives?

My concern is with re-sharing a piece of memory across multiple Tensors and synchronizing the life-cycle of that memory so that it stays valid until the last Tensor using it is destroyed. This is going to need some work in the Tensor class...

@xadupre (Member Author):

I already did that. In the first PR, operator Transpose detects the case where a Transpose is just a Reshape; the data is then copied in one block. In this PR, I wanted to avoid even the allocation: the gain I measured (300 ms out of 3700 ms) is the time spent in allocation plus copy of the reshaped/transposed Tensor. I cannot remove the allocation in operator Transpose because the reshaped Tensor cannot exist without the Tensor owning the buffer, and there is no way to ensure that there. In operator Einstein, the transposed tensors are intermediate tensors whose scope is known: the reshaped tensor disappears before, or at the same time as, the tensor owning the buffer.

@xadupre (Member Author):

A more general comment: operator Transpose or Reshape could probably reuse the buffer of the input Tensor if the operator is the only consumer of that tensor.

Contributor:

> A more general comment, operator Transpose or Reshape could probably reuse the buffer of the input Tensor if this operator is the only consumer of this tensor.

If it can be determined upfront that the buffer is reusable, an optimizer could replace the Transpose with a Reshape. That requires the perms and the input shape to be known.

I don't think reuse is possible otherwise: with an allocation plan, an input buffer that was expected to become unused after running the Transpose would still be in use.

Reshape already reuses the buffer when possible, as the input/output is marked as Alias in the kernel registration.

@xadupre (Member Author) commented Oct 23, 2020

> This change makes model linked in #5538 faster by 10% (3400 ms instead of 3700).
>
> Was it brought down to 3700 ms (from 6200 ms) by just the Transpose improvements, and this Einsum enhancement (allocation save) further reduces it by 300 ms?

From 3700 to 3400. The gain from 6000 ms to 3700 ms was brought by the first PR improving the Transpose operator. But still, avoiding an allocation is always a good thing.

const uint8_t* source, uint8_t* target, size_t element_size) {
size_t blocksize = num_elts_in_block * element_size;
ORT_ENFORCE(num_axes > 0, "Transpose not implemented for empty tensors.");
MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
Contributor:

Is this ORT_ENFORCE necessary? Transpose::Compute handles empty tensors with

  if (output_shape.Size() == 0)
    return Status::OK();

ORT_ENFORCE involves an 'if' and a throw, so we should avoid using it unless required.

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
// A transposition only moving single dimension is equivalent to a reshape.
// Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
size_t last_permuted_axis = 0;
Contributor:

This comment is a bit confusing, as 'single dimension' sounds like only one dim should move; I believe the actual logic is that as long as the dims with values > 1 stay in the same order, it's a reshape.

@xadupre (Member Author):

Yes, that's what I intended to say. I'll rewrite it.

}

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
// A transposition only moving single dimension is equivalent to a reshape.
Contributor:

static bool IsReshape(...)?

}

TEST(TransposeOpTest, TransposeReshape) {
std::vector<int64_t> input_shape({1, 4, 2, 1, 3});
Contributor:

Add a comment explaining that this test hits the path in the Transpose implementation where we can reshape the input. It may not be obvious to someone new to this code what 'transpose reshape' means.


// This helps decide if we need to reshape instead of transposing.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");
Contributor:

Is anything different between the implementation of IsReshapeRequired here vs. IsReshape in transpose.cc? Can we have a single version of the logic checking the dims vs. perms?

@xadupre (Member Author):

To be precise, this PR includes another one: #5550. This one adds the saving of one allocation in operator Einstein. I was hoping the first one would be reviewed first, and then we could discuss the best way to save this allocation and measure the gain obtained with this change. I'll propagate the changes into #5550.

@xadupre xadupre changed the title Saving allocations in operator Einstein (extension to PR #5550 to fix issue #5538) [WIP] Saving allocations in operator Einstein (extension to PR #5550 to fix issue #5538) Oct 28, 2020
@xadupre xadupre closed this Dec 14, 2020
@xadupre xadupre deleted the transein branch December 14, 2020 11:13

4 participants