[WIP] Saving allocations in operator Einstein (extension to PR #5550 to fix issue #5538) #5571
xadupre wants to merge 9 commits into microsoft:master
Conversation
```cpp
// Pass in allocator as that will be used as an allocator deleter by the framework
// and it will de-allocate the memory for this intermediate tensor when it goes out of scope
std::unique_ptr<Tensor> output = onnxruntime::make_unique<Tensor>(input.DataType(), output_dims, (void*)input.DataRaw(), input.Location());
```
A new tensor is created, but it reuses the buffer of another one with a new shape of the same total size. The API of the Tensor class does not seem to have a proper way of doing that.
The Tensor class may need some enhancements to truly avoid this allocation (it may need to hang onto a shared_ptr). We are creating a Tensor that doesn't own the memory, but the Tensor that actually owns the memory may be an intermediate Tensor in this op, and if it gets destructed, it will cause issues, won't it?
Yes, it would cause an issue. That's why I highlighted this part of the PR: I know it could cause an issue. However, in this case, everything happens inside EinsteinOp, so the Tensor which owns the memory stays alive as long as it is needed. That's why I did it.
What guarantees that the input tensor stays alive? As this gets called in a loop by EinsumTypedComputeProcessor::Run, can't the input go out of scope on the next iteration if it was the result of the previous iteration?
E.g. the first iteration in Run that calls PairwiseOperandProcess creates a new tensor which is saved as 'result'. The next iteration passes this in as input. If we do the reshape on this iteration, the returned tensor replaces 'result'. At that point, doesn't the buffer get freed as part of the assignment of the new result?
By chance the memory may still be accessible, but I'm not sure the ownership semantics are correct.
Was it brought down to 3700 ms (from 6200 ms) by just the Transpose improvements, and does this Einsum enhancement (the allocation save) further reduce it by 300 ms?
```cpp
}  // namespace DeviceHelpers

// This helps decide if we need to reshape a tensor.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation);
```
The CUDA implementation will most likely require changes as well; if you notice, every auxiliary op here takes in a device_func that is meant for device-specific operations...
I noticed, but this function creates a Tensor without allocating new data. I took the Transpose function just below and removed the line doing the allocation. The line `auto status = device_transpose_func(permutation, input, *output, &overriden_shape, einsum_cuda_assets);` is not needed, so the function no longer needs the parameters device_transpose_func and einsum_cuda_assets. That's why I removed them.
```cpp
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
  ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");

  // No transpose required for scalars
```
Nit: should the comment say Transpose/Reshape?
```cpp
    return false;
  }

  // A transposition only moving single dimension is equivalent to a reshape.
```
Just a thought: can this actually be part of Transpose itself? I.e., once we recognize that the transpose request is just a reshape request, can we do a memcpy of the input buffer into the output buffer (without invoking the fancier components of Transpose), stamp the copied output buffer with the new shape, and return that Tensor from Transpose? Does this special-casing live in Transpose currently? If it does not, can we check how much gain that single enhancement gives?
My concern is with re-sharing a piece of memory across multiple Tensors and synchronizing the life-cycle of that memory so that it stays valid until the last Tensor using it is gone. This is going to need some work in the Tensor class...
I already did that. In the first PR, operator Transpose detects the case where a Transpose is just a Reshape; the data is then copied in one block. In this PR, I wanted to avoid even the allocation: the 300 ms gain I measured (out of 3700 ms) is the time spent in allocation + copy of the Reshaped/Transposed Tensor. I cannot remove the allocation in operator Transpose because the reshaped Tensor cannot exist without the Tensor owning the buffer, and there is no way to make sure of that. In operator Einstein, the transposed tensors are intermediate and their scope is known: the reshaped tensor disappears before, or at the same time as, the tensor owning the buffer.
A more general comment: operator Transpose or Reshape could probably reuse the buffer of the input Tensor if this operator is the only consumer of that tensor.
> A more general comment: operator Transpose or Reshape could probably reuse the buffer of the input Tensor if this operator is the only consumer of that tensor.

If it can be determined upfront that the buffer can be re-used, an optimizer could replace the Transpose with a Reshape. That requires the perms and the input shape to be known.
I don't think it's possible to re-use the buffer otherwise: with an allocation plan, an input buffer that was expected to become unused after running the Transpose would still be in use.
Reshape re-uses the buffer when possible because the input/output is marked as an Alias in the kernel registration.
From 3700 ms to 3400 ms. The gain from 6000 ms to 3700 ms was brought by the first PR improving the Transpose operator. But still, avoiding an allocation is always a good thing.
```cpp
               const uint8_t* source, uint8_t* target, size_t element_size) {
  size_t blocksize = num_elts_in_block * element_size;
  ORT_ENFORCE(num_axes > 0, "Transpose not implemented for empty tensors.");
  MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
```
Is this ORT_ENFORCE necessary? Transpose::Compute handles empty tensors with

```cpp
if (output_shape.Size() == 0)
  return Status::OK();
```

ORT_ENFORCE involves an 'if' and a throw, so we should avoid using it unless required.
```cpp
bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  // A transposition only moving single dimension is equivalent to a reshape.
  // Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
  size_t last_permuted_axis = 0;
```
This comment is a bit confusing: 'single dimension' sounds like only one dim should move, but I believe the logic here is that as long as the dims with values > 1 stay in the same order, it's a reshape.
Yes, that's what I intended to say. I'll rewrite.
```cpp
}

bool IsReshape(const std::vector<size_t>& perm, const std::vector<int64_t>& input_dims) {
  // A transposition only moving single dimension is equivalent to a reshape.
```
static bool IsReshape(...)?
```cpp
}

TEST(TransposeOpTest, TransposeReshape) {
  std::vector<int64_t> input_shape({1, 4, 2, 1, 3});
```
Add a comment explaining that the test is there to hit the path in the Transpose implementation where we can reshape the input. It may not be obvious to someone new to this code what 'transpose reshape' means.
```cpp
// This helps decide if we need to reshape instead of transposing.
bool IsReshapeRequired(const std::vector<int64_t>& input_dims, const std::vector<size_t>& permutation) {
  ORT_ENFORCE(input_dims.size() == permutation.size(), "The rank of the input must match permutation size for Transpose/Reshape.");
```
Is anything different between the implementation of IsReshapeRequired here vs. IsReshape in transpose.cc? Can we just have a single version of the logic checking the dims vs. perms?
Just to be precise, this PR includes another one: #5550. This one is an addition to save one allocation in operator Einstein. I was hoping the first one would be reviewed first, and that we could then discuss the best way to save this allocation and measure the gain obtained with this change. I'll propagate the changes into #5550.
Description:
Operator Einstein transposes tensors with size-1 dimensions (e.g. (1, 1, 1024, 4096)) and performs transpositions equivalent to a reshape (permutation=(2, 0, 3, 1), for example). PR #5550 replaces the transposition with a simple copy in that case; this change additionally avoids copying the data. It makes the model linked in #5538 faster by about 10% (3400 ms instead of 3700 ms).
This PR includes #5550, which should be merged first.
Motivation and Context
Performance.