Improvement in average inference latency for models running on OVEP NPU #441
saurabhkale17 merged 24 commits into ovep-develop-lnl-1.2
Conversation
```cpp
return Status::OK();
}

std::vector<AllocatorPtr> OpenVINOExecutionProvider::CreatePreferredAllocators() {
```
Why is this function name not NPU-specific?
```cpp
return nullptr;
}

std::vector<AllocatorPtr> CreatePreferredAllocators() override;
```
If this is only for NPU, the function name should suggest that
Yes, this is for NPU. However, `CreatePreferredAllocators` is a member of the `OpenVINOExecutionProvider` class, which inherits it from the `IExecutionProvider` base class. It is an interface function used by all execution providers, so the name cannot be NPU-specific.
```cpp
// Copyright (C) Intel Corporation
// Licensed under the MIT License

#include "core/providers/openvino/ov_allocator.h"
```
Will adding the NPU plugin include files work across OpenVINO versions?
```cpp
  }
} else {
  OVTensorPtr graph_input_blob;
  auto tensor = context.GetInput(subgraph_context_.input_names.at(input_name));
```
Please add comments explaining the code here. Is this compatible across all OV versions?
```cpp
ov_tensor_data_t ov_tensor_data;
ort_tensor_key_t ort_tensor_key{tensor.GetTensorRawData(), allocator_name};
if (const auto& it = ort_ov_tensor_map.find(ort_tensor_key); it != ort_ov_tensor_map.end()) {
```
Please try to avoid `auto`, to prevent Coverity issues.
We tried to avoid `auto` wherever possible in the new changes, but where the data type is very complex it is better with `auto`. For example, this `auto` (line 354) has the type:
`class std::_Tree_iterator<class std::_Tree_val<struct std::_Tree_simple_types<struct std::pair<struct std::pair<void const * __ptr64,class std::basic_string<char,struct std::char_traits,class std::allocator > const > const ,struct onnxruntime::openvino_ep::ov_tensor_data_t> > > >`
I would recommend avoiding `auto` when the explicit type is short, for example in cases like:
`std::unique_ptr<ONNX_NAMESPACE::AttributeProto> sdk_version_attr = ONNX_NAMESPACE::AttributeProto::Create();`
Use `auto` only when the statement would exceed the 120-character lintrunner limit or the type is as complex as in the example above.
```cpp
constexpr const char* HIP_PINNED = "HipPinned";
constexpr const char* OpenVINO_CPU = "OpenVINO_CPU";
constexpr const char* OpenVINO_GPU = "OpenVINO_GPU";
constexpr const char* OpenVINO_RT = "OpenVINO_RT";
```
How is OpenVINO_RT different from OpenVINO_CPU?
@ankitm3k @preetha-intel, can you please review and close this task on priority?
```cpp
ov_tensor_data_t ov_tensor_data;
ort_tensor_key_t ort_tensor_key{tensor.GetTensorRawData(), allocator_name};
if (const auto& it = ort_ov_tensor_map.find(ort_tensor_key); it != ort_ov_tensor_map.end()) {
```
I would recommend avoiding `auto` when the explicit type is short, for example in cases like:
`std::unique_ptr<ONNX_NAMESPACE::AttributeProto> sdk_version_attr = ONNX_NAMESPACE::AttributeProto::Create();`
Use `auto` only when the statement would exceed the 120-character lintrunner limit or the type is as complex as in the example above.
```cpp
namespace onnxruntime {
namespace openvino_ep {

struct ov_tensor_data_t {
```
The class and struct naming convention here does not match the overall ORT convention; `ov_tensor_data_t` looks like a variable name. You can use `struct OVTensorData` instead.
```
@@ -0,0 +1,25 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
This translation unit is forced to be OV version-specific; can we have the below instead?

```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```

We can remove or deprecate it later when we move to higher versions like OV 2025.2.
```cpp
public:
  OVRTAllocator(ov::Core &core, OrtDevice::DeviceType device_type, OrtDevice::DeviceId device_id, const char* name);
  void* Alloc(size_t size) override;
  void Free(void* p) override;
```
`Free` can be called inside the destructor, e.g. `~OVRTAllocator() { Free(ptr); }`. This releases the memory automatically once the allocator goes out of its local scope.
```cpp
Ort::TypeInfo type_info = session_.GetOutputTypeInfo(i);
auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

std::vector<int64_t> output_shape = tensor_info.GetShape();
```
We can reserve the vector's capacity to avoid reallocations when the shape size is known, using:
`output_shape.reserve(tensor_info.GetShape().size());`
```cpp
    name1, type, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, static_cast<OrtDevice::DeviceId>(id1)), id1,
    mem_type1);
} else if (strcmp(name1, onnxruntime::OpenVINO_RT_NPU) == 0) {
  *out = new OrtMemoryInfo(
```
Verify that the memory assigned to `out` is deleted later on, to avoid memory leaks; one can use a smart pointer to avoid the explicit `delete` call.
```cpp
OVRemoteContextPtr remote_context_;
#endif

using ort_tensor_key_t = std::pair<const void *, const std::string>;
```
Naming a typedef or alias like a variable can be confusing, as suggested above:

```diff
-using ort_tensor_key_t = std::pair<const void *, const std::string>;
+using ORTTensorKey = std::pair<const void *, const std::string>;
```
```cpp
return Status::OK();
}
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```
```
@@ -0,0 +1,55 @@
// Copyright (C) Intel Corporation
// Licensed under the MIT License
#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
```
```diff
-#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR == 4
+#if OPENVINO_VERSION_MAJOR == 2024 && OPENVINO_VERSION_MINOR >= 4
```
```cpp
  }
}

outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),
```
Use move semantics to construct the tensors in place if possible:

```diff
-outputs_.push_back(Ort::Value::CreateTensor(*custom_allocator_, (const int64_t*)output_shape.data(),
+outputs_.emplace_back(Ort::Value::CreateTensor(*custom_allocator_, output_shape.data(),
+                                               output_shape.size(), tensor_info.GetElementType()));
```
…PU (#441)

* Prototype shared memory allocator on Windows using OV-EP
* Partially working allocator. Crashing on tensor destruction. Might have UMD exceptions. Needs further debug. Unknown if values are correct.
* Hard code onnx perf to use RT NPU allocator for inputs
* Fix allocation lookups coming from different level zero contexts
* Page align OV allocation
* Allocate input as WC
* Only set tensors when they have changed.
* Revert "Allocate input as WC" (this reverts commit d43219f)
* Hard code onnx perf to use RT NPU for outputs
* Revert "Hard code onnx perf to use RT NPU for outputs" (this reverts commit c1f3b3e)
* Hard code onnx perf to use RT NPU for outputs fixed
* Fix onnx_perf_test app crash on tensor destroy
* refactor: remove redundant ort_shape_to_ovshape lambda function
* alocate buffer in NPU visible region from perf test application
* remove redundant code
* add command line parameter in perf test for using remote tensors
* remove redundant code
* remove redundant statements
* fix crash during inference
* remove redundant code
* enable backward compatibility of remote tensor feature
* Revert "enable backward compatibility of remote tensor feature" (this reverts commit 1791b90)
* enable backward compatibility of remote tensor feature in OVEP

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Description:
This PR addresses and resolves the implicit memory copying issue, leading to a significant improvement in average latency for models running on NPU devices.
Motivation and Context:
Issue Discovery: During performance testing with the onnxruntime_perf_test application, the model GT exhibited much lower performance compared to OpenVINO's benchmark application.
Root Cause Analysis: The performance drop was traced back to two instances of unnecessary memcpy operations in OVEP: one before inference (from the ORT buffer to the OV buffer) and one after inference (from the OV buffer back to the ORT buffer).
Impact: The copy overhead grows with the dimensionality of the model's inputs and outputs; for model GT specifically, these memcpy operations contributed significantly to the increased inference latency.
Solution: In OpenVINO 2024.4, the implementation of remote tensors for NPUs introduces an interface for working directly with device-specific memory. This feature eliminates the need for memcpy by using the remote NPU buffer directly, significantly reducing latency. OVEP leverages this capability to allocate buffers in the NPU's accessible memory region, optimizing the allocation of input and output tensors.
Fixed Issue:
EISW-135604
To use this feature we have introduced a new flag, "use_device_mem", which is set to false by default.
You can enable the feature by setting it to true from the command line when using the perf_test application, e.g.:

```
onnxruntime_perf_test.exe -e openvino -i "device_type|NPU enable_qdq_optimizer|true use_device_mem|true" -m times -r 100 -I "path_to_the_model"
```