Implement experimental intermediate cross CPU EP allocation #24371

Merged
yuslepukhin merged 29 commits into main from yuslepukhin/qnn_copy_fix
Apr 25, 2025

Conversation

@yuslepukhin
Member

@yuslepukhin yuslepukhin commented Apr 9, 2025

Description

Onnxruntime manages a number of CPU-based accelerators, i.e. those that can operate on CPU-based inputs.
However, several of them, such as Qnn, Openvino and Vitis, may require CPU-based inputs to be 4K-aligned so they can be memory mapped.

To mitigate that, we introduce a new CPU-based allocator that produces 4K-aligned memory.
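The core of such an allocator can be sketched as follows. This is a minimal illustration only, assuming `std::aligned_alloc` semantics; the constant and function names are hypothetical, not the actual onnxruntime `OrtAllocator` implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Illustrative constant name (assumption, not from the PR).
constexpr std::size_t kAlloc4KAlignment = 4096;

// Minimal sketch of a 4K-aligned CPU allocation.
void* AlignedAlloc4K(std::size_t size) {
  // std::aligned_alloc requires the size to be a multiple of the
  // alignment, so round it up first.
  std::size_t rounded =
      (size + kAlloc4KAlignment - 1) & ~(kAlloc4KAlignment - 1);
  return std::aligned_alloc(kAlloc4KAlignment, rounded);
}
```

Any pointer returned this way sits on a 4K page boundary, which is what makes the downstream memory mapping possible.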

We also adjust the allocation planner to override the plain CPU device. When we detect a compiled CPU-based EP, we adjust the device accordingly by requesting the device the EP reports for OrtMemType::OrtMemTypeCPUInput. This gives the EP an opportunity to return either a GPU/NPU device or a CPU device, depending on the mode it is operating in.
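The planner-side decision can be modeled roughly like this. All enum, struct, and function names here are assumptions made for illustration, not the real onnxruntime API:

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the planner override described above (names assumed).
enum class MemType { Default, CPUInput };
enum class DeviceKind { CPU, NPU };

struct Device {
  DeviceKind kind;
  std::size_t alignment;  // 0 means no special alignment requirement
};

// Stand-in for the EP's device query: a compiled EP asked for its
// OrtMemTypeCPUInput device may answer with a 4K-aligned CPU device
// instead of the plain CPU default, depending on its operating mode.
Device QueryEpDevice(bool ep_needs_aligned_cpu_input, MemType mem_type) {
  if (mem_type == MemType::CPUInput && ep_needs_aligned_cpu_input) {
    return {DeviceKind::CPU, 4096};  // aligned CPU input device
  }
  return {DeviceKind::CPU, 0};  // plain CPU default
}
```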

We also override Qnn's GetOrtDeviceByMemType() to make sure the appropriate allocator is requested.

We also adjust memory patterns to make sure 4K alignment is respected in the contiguous buffers when appropriate.
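The memory-pattern adjustment amounts to rounding each tensor's offset up inside the shared arena buffer. A minimal sketch, with helper names that are illustrative rather than the real planner API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round an offset up to the next multiple of the alignment.
std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Pack tensors of the given sizes into one contiguous buffer so that
// every tensor start keeps the requested (e.g. 4K) alignment.
std::vector<std::size_t> PlanOffsets(const std::vector<std::size_t>& sizes,
                                     std::size_t alignment) {
  std::vector<std::size_t> offsets;
  std::size_t next = 0;
  for (std::size_t s : sizes) {
    next = AlignUp(next, alignment);  // keep each tensor aligned
    offsets.push_back(next);
    next += s;
  }
  return offsets;
}
```

The trade-off is some internal padding between tensors in exchange for every intermediate buffer being mappable by the EP without a copy.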

Motivation and Context

CPU-based providers notably accept CPU-based inputs, but they require 4K-aligned allocations; otherwise the input incurs an extra copy. This is especially noticeable with intermediate values produced by upstream CPU-based nodes. Qnn has its own allocator when it is enabled; we make sure it is correctly advertised to the allocation planner.

Cc: @quic-ashigarg

@yuslepukhin yuslepukhin force-pushed the yuslepukhin/qnn_copy_fix branch from faf10b6 to e23ac42 on April 10, 2025 23:55
@yuslepukhin yuslepukhin reopened this Apr 11, 2025
edgchen1 and others added 3 commits April 24, 2025 10:51
- if MemType differs, prefer the non-default one (e.g. QNN uses OrtDevice::MemType::QNN_HTP_SHARED)
- if both are non-default, no preference
- prefer the allocator with the higher alignment requirement
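The preference rules from these commits can be summarized in a small comparison function. The types and names below are illustrative stand-ins, not the actual onnxruntime code:

```cpp
#include <cassert>
#include <cstddef>

struct Candidate {
  bool default_mem_type;  // false for e.g. OrtDevice::MemType::QNN_HTP_SHARED
  std::size_t alignment;
};

// Returns 0 to prefer a, 1 to prefer b, -1 for no preference.
int Prefer(const Candidate& a, const Candidate& b) {
  if (a.default_mem_type != b.default_mem_type) {
    return a.default_mem_type ? 1 : 0;  // prefer the non-default MemType
  }
  if (!a.default_mem_type) {
    return -1;  // both non-default: no preference
  }
  if (a.alignment == b.alignment) {
    return -1;
  }
  return a.alignment > b.alignment ? 0 : 1;  // prefer higher alignment
}
```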
@yuslepukhin yuslepukhin merged commit 8bb3b07 into main Apr 25, 2025
85 of 89 checks passed
@yuslepukhin yuslepukhin deleted the yuslepukhin/qnn_copy_fix branch April 25, 2025 18:00
yuslepukhin pushed a commit that referenced this pull request May 8, 2025
### Description
Fix OrtDevice comparison in Debug mode
Related #24371 



### Motivation and Context
Add device alignment comparison to the OrtDevice compare function
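The shape of that follow-up fix can be sketched as follows. The struct and field names are assumptions for illustration; the point is that alignment must participate in the comparison, otherwise a 4K-aligned CPU device would compare equal to the plain CPU device in Debug-mode checks:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Illustrative stand-in for OrtDevice with an alignment field.
struct OrtDeviceLike {
  int device_type = 0;
  int mem_type = 0;
  int device_id = 0;
  std::size_t alignment = 0;

  bool operator==(const OrtDeviceLike& o) const {
    // std::tie gives lexicographic comparison over all fields,
    // including alignment.
    return std::tie(device_type, mem_type, device_id, alignment) ==
           std::tie(o.device_type, o.mem_type, o.device_id, o.alignment);
  }
};
```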
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request May 12, 2025
…t#24371)

### Description
Onnxruntime manages a number of CPU-based accelerators, i.e. those that
can operate on CPU-based inputs.
However, several of them, such as `Qnn`, `Openvino` and `Vitis`, may require
CPU-based inputs to be 4K-aligned so they can be memory mapped, or may
prefer to override the device with their own CPU-accessible allocator.

To mitigate that, we introduce a new CPU-based allocator that produces
4K-aligned memory.

We also adjust the allocation planner to override the plain CPU device.
When we detect a compiled CPU-based EP, we adjust the device accordingly
by requesting the device the EP reports for
`OrtMemType::OrtMemTypeCPUInput`. This gives the EP an opportunity to
return either a GPU/NPU device or a CPU device, depending on the mode it
is operating in.

Between default CPU devices, we select the one with the larger alignment.

We also adjust memory patterns to make sure 4K alignment is respected in
the contiguous buffers when appropriate.

### Motivation and Context
CPU-based providers notably accept CPU-based inputs, but they require
4K-aligned allocations; otherwise the input incurs an extra copy.
This is especially noticeable with intermediate values produced by
upstream CPU-based nodes.

Qnn has its own allocator when it is enabled; we make sure it is
correctly advertised to the allocation planner. This PR excludes Qnn
allocator usage for intermediate values due to the overhead contributed
by memhandle management.


Cc: @quic-ashigarg

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
