Improvement in average inference latency for models running on OVEP NPU #441
saurabhkale17 merged 24 commits into ovep-develop-lnl-1.2
Conversation
```cpp
return Status::OK();
}

std::vector<AllocatorPtr> OpenVINOExecutionProvider::CreatePreferredAllocators() {
```
Why is this function name not NPU-specific?
```cpp
return nullptr;
}

std::vector<AllocatorPtr> CreatePreferredAllocators() override;
```
If this is only for NPU, the function name should suggest that
Yes, this is for NPU. However, `CreatePreferredAllocators` is a member of the `OpenVINOExecutionProvider` class, which inherits it from the `IExecutionProvider` base class. It is an interface function used by all execution providers, so the name cannot be NPU-specific.
```cpp
// Copyright (C) Intel Corporation
// Licensed under the MIT License

#include "core/providers/openvino/ov_allocator.h"
```
Will adding the NPU plugin include files work across OpenVINO versions?
```cpp
  }
} else {
  OVTensorPtr graph_input_blob;
  auto tensor = context.GetInput(subgraph_context_.input_names.at(input_name));
```
Please add comments explaining the code here. Is this compatible across all OV versions?
```cpp
ov_tensor_data_t ov_tensor_data;
ort_tensor_key_t ort_tensor_key{tensor.GetTensorRawData(), allocator_name};
if (const auto& it = ort_ov_tensor_map.find(ort_tensor_key); it != ort_ov_tensor_map.end()) {
```
Please try to avoid `auto`, to prevent Coverity issues.
We tried to avoid `auto` wherever possible in the new changes, but where the data type is very complex it is better with `auto`. For example, this `auto` (line 354) has the type:
`class std::_Tree_iterator<class std::_Tree_val<struct std::_Tree_simple_types<struct std::pair<struct std::pair<void const * __ptr64,class std::basic_string<char,struct std::char_traits,class std::allocator > const > const ,struct onnxruntime::openvino_ep::ov_tensor_data_t> > > >`
I would recommend avoiding `auto` when the explicit type is short, for example in cases like:
`std::unique_ptr<ONNX_NAMESPACE::AttributeProto> sdk_version_attr = ONNX_NAMESPACE::AttributeProto::Create();`
Use `auto` only when the statement would exceed the 120-character lintrunner limit or the type is as complex as in the example above.
```cpp
constexpr const char* HIP_PINNED = "HipPinned";
constexpr const char* OpenVINO_CPU = "OpenVINO_CPU";
constexpr const char* OpenVINO_GPU = "OpenVINO_GPU";
constexpr const char* OpenVINO_RT = "OpenVINO_RT";
```
How is OpenVINO_RT different from OpenVINO_CPU?
@ankitm3k @preetha-intel, can you please review and close this task on priority?
```cpp
ov_tensor_data_t ov_tensor_data;
ort_tensor_key_t ort_tensor_key{tensor.GetTensorRawData(), allocator_name};
if (const auto& it = ort_ov_tensor_map.find(ort_tensor_key); it != ort_ov_tensor_map.end()) {
```
I would recommend avoiding `auto` when the explicit type is short, for example in cases like:
`std::unique_ptr<ONNX_NAMESPACE::AttributeProto> sdk_version_attr = ONNX_NAMESPACE::AttributeProto::Create();`
Use `auto` only when the statement would exceed the 120-character lintrunner limit or the type is as complex as in the example above.
```cpp
namespace onnxruntime {
namespace openvino_ep {

struct ov_tensor_data_t {
```
The class and struct naming convention here does not match the overall ORT convention; `ov_tensor_data_t` looks like a variable name. You can use `struct OVTensorData` instead.
```
@@ -0,0 +1,25 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
This translation unit is forced to be OV version-specific; can we have the below instead?

```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```

We can remove or deprecate it later when we move to higher versions like OV 2025.2.
```cpp
public:
  OVRTAllocator(ov::Core &core, OrtDevice::DeviceType device_type, OrtDevice::DeviceId device_id, const char* name);
  void* Alloc(size_t size) override;
  void Free(void* p) override;
```
`Free` can be called inside the destructor, e.g. `~OVRTAllocator() { Free(ptr); }`. This releases the memory automatically once the allocator goes out of its local scope.
```cpp
Ort::TypeInfo type_info = session_.GetOutputTypeInfo(i);
auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

std::vector<int64_t> output_shape = tensor_info.GetShape();
```
We can reserve the vector's capacity to avoid reallocations when the shape size is known, using:
`output_shape.reserve(tensor_info.GetShape().size());`
```cpp
    name1, type, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, static_cast<OrtDevice::DeviceId>(id1)), id1,
    mem_type1);
} else if (strcmp(name1, onnxruntime::OpenVINO_RT_NPU) == 0) {
  *out = new OrtMemoryInfo(
```
Verify that the memory assigned to `out` is deleted later on, to avoid memory leaks; one can use a smart pointer to avoid the explicit `delete` call.
```cpp
OVRemoteContextPtr remote_context_;
#endif

using ort_tensor_key_t = std::pair<const void *, const std::string>;
```
Naming a typedef or alias like a variable can be confusing, as suggested above:

```diff
-using ort_tensor_key_t = std::pair<const void *, const std::string>;
+using ORTTensorKey = std::pair<const void *, const std::string>;
```
```cpp
return Status::OK();
}
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```
```
@@ -0,0 +1,55 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```
```cpp
  }
}

outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),
```
Use move semantics to construct the tensors in place if possible:

```diff
-outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),
+outputs_.emplace_back(Ort::Value::CreateTensor(*custom_allocator_, output_shape.data(),
+                                               output_shape.size(), tensor_info.GetElementType()));
```
…PU (#441)

* Prototype shared memory allocator on Windows using OV-EP
* Partially working allocator. Crashing on tensor destruction. Might have UMD exceptions. Needs further debug. Unknown if values are correct.
* Hard code onnx perf to use RT NPU allocator for inputs
* Fix allocation lookups coming from different level zero contexts
* Page align OV allocation
* Allocate input as WC
* Only set tensors when they have changed.
* Revert "Allocate input as WC" (this reverts commit d43219f)
* Hard code onnx perf to use RT NPU for outputs
* Revert "Hard code onnx perf to use RT NPU for outputs" (this reverts commit c1f3b3e)
* Hard code onnx perf to use RT NPU for outputs fixed
* Fix onnx_perf_test app crash on tensor destroy
* refactor: remove redundant ort_shape_to_ovshape lambda function
* alocate buffer in NPU visible region from perf test application
* remove redundant code
* add command line parameter in perf test for using remote tensors
* remove redundant code
* remove redundant statements
* fix crash during inference
* remove redundant code
* enable backward compatibility of remote tensor feature
* Revert "enable backward compatibility of remote tensor feature" (this reverts commit 1791b90)
* enable backward compatibility of remote tensor feature in OVEP

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Description:
This PR addresses and resolves the implicit memory copying issue, leading to a significant improvement in average latency for models running on NPU devices.
Motivation and Context:
Issue Discovery: During performance testing with the onnxruntime_perf_test application, the model GT exhibited much lower performance compared to OpenVINO's benchmark application.
Root Cause Analysis: The performance drop was traced back to two instances of unnecessary memcpy operations in OVEP: one before inference (from the ORT buffer to the OV buffer) and one after inference (from the OV buffer back to the ORT buffer).
Impact: The copy overhead grows with the dimensionality of the model's inputs and outputs; for model GT specifically, these memcpy operations contributed significantly to the increased inference latency.
Solution: In OpenVINO 2024.4, the implementation of remote tensors for NPUs introduces an interface for working directly with device-specific memory. This feature eliminates the need for memcpy by using the remote NPU buffer directly, significantly reducing latency. OVEP leverages this capability to allocate buffers in the NPU's accessible memory region, optimizing the allocation of input and output tensors.
Fixed Issue:
EISW-135604
To use this feature we have introduced a new flag, "use_device_mem", which is set to false by default.
You can enable the feature by setting it to true from the command line when using the perf_test application, e.g.:

```
onnxruntime_perf_test.exe -e openvino -i "device_type|NPU enable_qdq_optimizer|true use_device_mem|true" -m times -r 100 -I "path_to_the_model"
```