
Improvement in average inference latency for models running on OVEP NPU #441

Merged
saurabhkale17 merged 24 commits into ovep-develop-lnl-1.2 from npu_allocator_ on Sep 5, 2024

Conversation


saurabhkale17 commented Sep 3, 2024

Description:

This PR resolves an implicit memory-copying issue, delivering a significant improvement in average inference latency for models running on NPU devices.

Motivation and Context:

Issue Discovery: During performance testing with the onnxruntime_perf_test application, the model GT exhibited much lower performance compared to OpenVINO's benchmark application.

Root Cause Analysis: The performance drop was traced back to two instances of unnecessary memcpy operations in OVEP: one before inference (from the ORT buffer to the OV buffer) and one after inference (from the OV buffer back to the ORT buffer).

Impact: The copy overhead scales with the dimensionality of the model's inputs and outputs, and for model GT specifically these memcpy operations were contributing significantly to the increased average inference latency.

Solution: In OpenVINO 2024.4, the implementation of remote tensors for NPUs introduces an interface for working directly with device-specific memory. This feature eliminates the need for memcpy by using the remote NPU buffer directly, significantly reducing latency. OVEP leverages this capability to allocate buffers in the NPU's accessible memory region, optimizing the allocation of input and output tensors.
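
For illustration, a minimal sketch of the underlying OpenVINO API (assuming OV 2024.4+ and an available "NPU" device; the actual OVEP allocator wiring differs):

#include <openvino/openvino.hpp>

// Sketch: allocate a tensor whose backing memory the NPU can access directly,
// so inference can consume it without the ORT->OV and OV->ORT memcpy steps.
ov::Tensor make_npu_visible_tensor(ov::Core& core, const ov::Shape& shape) {
  ov::RemoteContext ctx = core.get_default_context("NPU");
  return ctx.create_host_tensor(ov::element::f32, shape);  // device-visible host memory
}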

Fixed Issue:

EISW-135604

To use the above feature we have introduced a new flag, "use_device_mem", which is set to false by default.
You can enable the feature by setting it to true on the command line when using the perf_test application; the application then allocates input and output buffers through the provider's preferred (NPU device-memory) allocator.
Ex: onnxruntime_perf_test.exe -e openvino -i "device_type|NPU enable_qdq_optimizer|true use_device_mem|true" -m times -r 100 -I "path_to_the_model"

return Status::OK();
}

std::vector<AllocatorPtr> OpenVINOExecutionProvider::CreatePreferredAllocators() {

Why is this function name not NPU-specific?

return nullptr;
}

std::vector<AllocatorPtr> CreatePreferredAllocators() override;

If this is only for NPU, the function name should suggest that

saurabhkale17 (Author):

Yes, this is for NPU.
CreatePreferredAllocators is a member function of the OpenVINOExecutionProvider class, which inherits it from the IExecutionProvider class, and it is a function used by all the execution providers.
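
A hedged sketch of how the NPU-specific behavior can live inside the shared override (core_ and the device-type check are assumptions; the OVRTAllocator constructor follows the declaration shown later in this diff):

// Sketch: the override name is fixed by IExecutionProvider, so the
// NPU-specific logic sits in the body rather than in the name.
std::vector<AllocatorPtr> OpenVINOExecutionProvider::CreatePreferredAllocators() {
  std::vector<AllocatorPtr> allocators;
  if (device_type_.find("NPU") != std::string::npos) {  // assumed member
    allocators.push_back(std::make_shared<OVRTAllocator>(
        core_, OrtDevice::NPU, /*device_id=*/0, "OpenVINO_RT_NPU"));
  }
  return allocators;
}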

// Copyright (C) Intel Corporation
// Licensed under the MIT License

#include "core/providers/openvino/ov_allocator.h"

Will adding NPU plugin include files work across OpenVINO versions?
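
One way this is typically handled (a sketch; the header path and guard macro usage are assumptions) is to fence the plugin-specific includes behind the same OV version check used elsewhere in this PR:

// Sketch: only pull in NPU remote-tensor headers on OV 2024.4+,
// so older OpenVINO releases still compile (header path assumed).
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
#include "openvino/runtime/intel_npu/level_zero/level_zero.hpp"
#endif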

}
} else {
OVTensorPtr graph_input_blob;
auto tensor = context.GetInput(subgraph_context_.input_names.at(input_name));
sfatimar commented Sep 4, 2024

Please add comments to explain the code here. Is this compatible across all OV versions?


ov_tensor_data_t ov_tensor_data;
ort_tensor_key_t ort_tensor_key{tensor.GetTensorRawData(), allocator_name};
if (const auto& it = ort_ov_tensor_map.find(ort_tensor_key); it != ort_ov_tensor_map.end()) {

Please try not to use auto, to avoid Coverity issues.

saurabhkale17 (Author):

We tried to avoid auto wherever possible in the new changes, but in situations where the data type is very complex, auto is the better choice.

For example, this auto (line 354) has the data type:
class std::_Tree_iterator<class std::_Tree_val<struct std::_Tree_simple_types<struct std::pair<struct std::pair<void const * __ptr64, class std::basic_string<char, struct std::char_traits<char>, class std::allocator<char>> const> const, struct onnxruntime::openvino_ep::ov_tensor_data_t>>>>
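
Where such a type recurs, a type alias (a sketch; the actual container type in this PR may differ) keeps explicit declarations manageable without auto:

// Sketch: alias the cache type once, then spell iterators explicitly.
using ort_ov_tensor_map_t = std::map<ort_tensor_key_t, ov_tensor_data_t>;

ort_ov_tensor_map_t::const_iterator it = ort_ov_tensor_map.find(ort_tensor_key);
if (it != ort_ov_tensor_map.end()) {
  ov_tensor_data = it->second;  // reuse the cached OV tensor for this ORT buffer
}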


I would recommend avoiding auto when the return type is short, for example in cases like the one below:

std::unique_ptr<ONNX_NAMESPACE::AttributeProto> sdk_version_attr = ONNX_NAMESPACE::AttributeProto::Create();

We should use auto only when the statement would exceed lintrunner's 120-character line limit and the type is complex, as mentioned in the chat above.

constexpr const char* HIP_PINNED = "HipPinned";
constexpr const char* OpenVINO_CPU = "OpenVINO_CPU";
constexpr const char* OpenVINO_GPU = "OpenVINO_GPU";
constexpr const char* OpenVINO_RT = "OpenVINO_RT";

How is OpenVINO_RT different from OpenVINO_CPU?


sfatimar commented Sep 4, 2024

@ankitm3k @preetha-intel can you please review and close this task on priority?

@@ -0,0 +1,55 @@
// Copyright (C) Intel Corporation

Check warning

Code scanning / lintrunner

CLANGFORMAT/format

See https://clang.llvm.org/docs/ClangFormat.html. Run `lintrunner -a` to apply this patch.
@@ -0,0 +1,25 @@
// Copyright (C) Intel Corporation

Check warning

Code scanning / lintrunner

CLANGFORMAT/format

See https://clang.llvm.org/docs/ClangFormat.html. Run `lintrunner -a` to apply this patch.
saurabhkale17 merged commit 236df14 into ovep-develop-lnl-1.2 on Sep 5, 2024


namespace onnxruntime {
namespace openvino_ep {

struct ov_tensor_data_t {

The class and struct naming convention here does not match the overall ORT convention; you could use a name like struct OVTensorData. The name above reads like a variable name.

@@ -0,0 +1,25 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4

This translation unit is forced to be OV version-specific; can we have the below instead?

Suggested change
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4

We can remove or deprecate it later when we move to higher versions like OV 2025.2.
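
A forward-compatible variant of the guard (a sketch, not part of the reviewer's suggestion) would also match future major releases:

// Enables the remote-tensor path on OV 2024.4+ and on any later
// major release (e.g. 2025.x) without further edits.
#if (OPENVINO_VERSION_MAJOR > 2024) || \
    (OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4)
// ... remote tensor implementation ...
#endif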

public:
OVRTAllocator(ov::Core &core, OrtDevice::DeviceType device_type, OrtDevice::DeviceId device_id, const char* name);
void* Alloc(size_t size) override;
void Free(void* p) override;

This can be called inside the destructor, e.g. ~OVRTAllocator() { Free(ptr); }, so the destruction is handled automatically once the allocator goes out of scope.
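
For context, a minimal sketch of what Alloc/Free might look like when backed by OV host tensors (remote_ctx_ and the allocated_ bookkeeping map are assumptions, not the PR's exact code):

// Sketch: Alloc() creates an OV tensor in NPU-accessible host memory and
// hands out its data pointer; Free() drops the tensor, releasing the buffer.
void* OVRTAllocator::Alloc(size_t size) {
  ov::Tensor t = remote_ctx_.create_host_tensor(ov::element::u8, {size});
  void* p = t.data();
  allocated_.emplace(p, std::move(t));  // keep the tensor alive while in use
  return p;
}

void OVRTAllocator::Free(void* p) {
  allocated_.erase(p);  // destroying the ov::Tensor frees the device-visible buffer
}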

Ort::TypeInfo type_info = session_.GetOutputTypeInfo(i);
auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

std::vector<int64_t> output_shape = tensor_info.GetShape();

We can reserve the vector capacity to avoid reallocations if the tensor_info size is known, e.g.:
output_shape.reserve(tensor_info.GetShape().size());
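
Note that reserve() pays off only when the vector is grown element by element, e.g. when converting an ORT shape to an ov::Shape (a sketch of the pattern; names are illustrative):

std::vector<int64_t> ort_shape = tensor_info.GetShape();
ov::Shape ov_shape;
ov_shape.reserve(ort_shape.size());  // avoid reallocations while appending
for (int64_t dim : ort_shape) {
  ov_shape.push_back(static_cast<size_t>(dim));
}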

name1, type, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, static_cast<OrtDevice::DeviceId>(id1)), id1,
mem_type1);
} else if (strcmp(name1, onnxruntime::OpenVINO_RT_NPU) == 0) {
*out = new OrtMemoryInfo(

Verify that the out variable's memory is deleted later on to avoid a memory leak; one can use a smart pointer to avoid the explicit delete call.

OVRemoteContextPtr remote_context_;
#endif

using ort_tensor_key_t = std::pair<const void *, const std::string>;

A typedef or namespace named in this style can be confused with a variable name; as suggested above:

Suggested change
using ort_tensor_key_t = std::pair<const void *, const std::string>;
using ORTTensorKey = std::pair<const void *, const std::string>;


return Status::OK();
}
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4

Suggested change
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4

@@ -0,0 +1,55 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4

Suggested change
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4

}
}

outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),

Use move semantics / in-place construction for the tensors if possible:

Suggested change
outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),
outputs_.emplace_back(Ort::Value::CreateTensor(*custom_allocator_, output_shape.data(), output_shape.size(), tensor_info.GetElementType()));

sfatimar pushed a commit that referenced this pull request Sep 6, 2024
…PU (#441)

* Prototype shared memory allocator on Windows using OV-EP

* Partially working allocator.

Crashing on tensor destruction. Might have UMD exceptions. Needs further
debug. Unknown if values are correct.

* Hard code onnx perf to use RT NPU allocator for inputs

* Fix allocation lookups coming from different level zero contexts

* Page align OV allocation

* Allocate input as WC

* Only set tensors when they have changed.

* Revert "Allocate input as WC"

This reverts commit d43219f.

* Hard code onnx perf to use RT NPU for outputs

* Revert "Hard code onnx perf to use RT NPU for outputs"

This reverts commit c1f3b3e.

* Hard code onnx perf to use RT NPU for outputs fixed

* Fix onnx_perf_test app crash on tensor destroy

* refactor: remove redundant ort_shape_to_ovshape lambda function

* allocate buffer in NPU visible region from perf test application

* remove redundant code

* add command line parameter in perf test for using remote tensors

* remove redundant code

* remove redundant statements

* fix crash during inference

* remove redundant code

* enable backward compatibility of remote tensor feature

* Revert "enable backward compatibility of remote tensor feature"

This reverts commit 1791b90.

* enable backward compatibility of remote tensor feature in OVEP

---------

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
saurabhkale17 added a commit that referenced this pull request Sep 25, 2024
…PU (#441)