Implement experimental intermediate cross CPU EP allocation #24371
Merged
yuslepukhin merged 29 commits into main, Apr 25, 2025
yuslepukhin
commented
Apr 10, 2025
faf10b6 to
e23ac42
Compare
skottmckay
reviewed
Apr 15, 2025
skottmckay
reviewed
Apr 18, 2025
skottmckay
reviewed
Apr 23, 2025
yuslepukhin
commented
Apr 23, 2025
yuslepukhin
commented
Apr 23, 2025
… allocation tracking table.
…s instead of allocation base address
skottmckay
reviewed
Apr 24, 2025
edgchen1
reviewed
Apr 24, 2025
edgchen1
reviewed
Apr 24, 2025
if MemType differs, prefer the non-default one
e.g. QNN uses OrtDevice::MemType::QNN_HTP_SHARED
if both are not default, no preference.
prefer allocator with higher alignment requirement
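The preference rules in the commit notes above can be sketched roughly as below. This is only an illustration: `DeviceInfo` and `PreferDevice` are hypothetical names, not the real `OrtDevice` API, and the sketch assumes that the alignment comparison is the tie-breaker whenever the memory type gives no preference.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device descriptor; NOT the real OrtDevice.
struct DeviceInfo {
  int mem_type = 0;           // assumption: 0 stands for the default memory type
  std::size_t alignment = 0;  // required alignment in bytes
};

// Pick one of two candidate devices:
//  1. if the memory types differ and exactly one is non-default, take that one;
//  2. otherwise (same type, or both non-default) take the one with the
//     larger alignment requirement, keeping `a` on a tie.
inline DeviceInfo PreferDevice(const DeviceInfo& a, const DeviceInfo& b) {
  const bool a_default = (a.mem_type == 0);
  const bool b_default = (b.mem_type == 0);
  if (a.mem_type != b.mem_type) {
    if (a_default) return b;
    if (b_default) return a;
    // Both are non-default: no preference by memory type, fall through.
  }
  return (b.alignment > a.alignment) ? b : a;
}
```

For example, a default CPU device compared against a device with a non-default memory type yields the latter, while two default CPU devices with alignments of 64 and 4096 bytes yield the 4096-byte one.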
skottmckay
reviewed
Apr 24, 2025
skottmckay
approved these changes
Apr 25, 2025
yuslepukhin
pushed a commit
that referenced
this pull request
May 8, 2025
### Description
Fix the OrtDevice comparison in Debug mode. Related: #24371
### Motivation and Context
Add device alignment to the OrtDevice compare function.
ankitm3k
pushed a commit
to intel/onnxruntime
that referenced
this pull request
May 12, 2025
…t#24371)

### Description

ONNX Runtime manages a number of CPU-based accelerators, i.e. those that can operate on CPU-based inputs. However, several of them, such as `Qnn`, `Openvino` and `Vitis`, may either require CPU-based inputs to be aligned to 4K so they can be memory mapped, or prefer to override the device with their own CPU-accessible allocator.

To mitigate that, we introduce a new CPU-based allocator that produces 4K-aligned memory.

We also adjust the allocation planner to override the plain CPU device. When we detect a compiled CPU-based EP, we adjust the device accordingly by asking the EP for its device for `OrtMemType::OrtMemTypeCPUInput`. This gives the EP an opportunity to return either a GPU/NPU device or a CPU device, depending on the mode it is operating in. We select the device with the larger alignment between CPU default devices.

We also adjust memory patterns to make sure 4K alignment is respected in the contiguous buffers when appropriate.

### Motivation and Context

CPU-based providers accept CPU-based inputs, but they require 4K-aligned allocations; otherwise the input incurs an extra copy. This is especially noticeable with intermediate values that are produced by upstream CPU-based nodes.

Qnn has its own allocator when it is enabled; we make sure it is correctly advertised to the allocation planner. This PR excludes Qnn allocator usage for intermediate values due to the overhead contributed by memhandle management.

Cc: @quic-ashigarg

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Description
ONNX Runtime manages a number of CPU-based accelerators, i.e. those that can operate on CPU-based inputs.
However, several of them, such as `Qnn`, `Openvino` and `Vitis`, may require CPU-based inputs to be aligned to 4K so they can be memory mapped.

To mitigate that, we introduce a new CPU-based allocator that produces 4K-aligned memory.
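For illustration, a 4K-aligned CPU allocation can be sketched with the C++17 aligned forms of `operator new` and `operator delete`. This is a minimal sketch, not the allocator added by this PR: the names `kAlign4K`, `AllocAligned4K`, `FreeAligned4K` and `IsAligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical constant for the 4K alignment discussed above.
constexpr std::size_t kAlign4K = 4096;

// Allocate `size` bytes at an address that is a multiple of 4096.
// Throws std::bad_alloc on failure, like the plain operator new.
inline void* AllocAligned4K(std::size_t size) {
  return ::operator new(size, std::align_val_t{kAlign4K});
}

// Memory from AllocAligned4K must be released with the matching
// aligned operator delete.
inline void FreeAligned4K(void* p) {
  ::operator delete(p, std::align_val_t{kAlign4K});
}

// True if `p` is a multiple of `alignment`, which must be a power of two.
inline bool IsAligned(const void* p, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A buffer obtained from `AllocAligned4K` satisfies `IsAligned(p, kAlign4K)`, which is the property an EP would need before memory mapping the input without an extra copy.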
We also adjust the allocation planner to override the plain CPU device. When we detect a compiled CPU-based EP, we adjust the device accordingly by asking the EP for its device for `OrtMemType::OrtMemTypeCPUInput`. This gives the EP an opportunity to return either a GPU/NPU device or a CPU device, depending on the mode it is operating in.

We also override Qnn `GetOrtDeviceByMemType()` to make sure the appropriate allocator is requested.

We also adjust memory patterns to make sure 4K alignment is respected in the contiguous buffers when appropriate.
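Respecting 4K alignment inside one contiguous buffer amounts to rounding each block's starting offset up to the next multiple of 4096. The sketch below shows that idea only; `AlignUp`, `Layout` and `PlanContiguous` are hypothetical names and do not reflect the actual memory pattern planner code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `value` up to the next multiple of `alignment` (a power of two).
inline std::size_t AlignUp(std::size_t value, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical result of planning: one offset per block plus the total
// number of bytes the contiguous buffer needs.
struct Layout {
  std::vector<std::size_t> offsets;
  std::size_t total_size = 0;
};

// Place blocks one after another, starting each at an aligned offset.
inline Layout PlanContiguous(const std::vector<std::size_t>& sizes,
                             std::size_t alignment) {
  Layout layout;
  std::size_t cursor = 0;
  for (std::size_t size : sizes) {
    cursor = AlignUp(cursor, alignment);  // pad up to the next boundary
    layout.offsets.push_back(cursor);
    cursor += size;
  }
  layout.total_size = cursor;
  return layout;
}
```

With block sizes of 100, 5000 and 1 bytes and an alignment of 4096, the blocks land at offsets 0, 4096 and 12288. If the buffer's base address is itself 4K aligned, every block then starts on a 4K boundary.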
Motivation and Context
CPU-based providers accept CPU-based inputs, but they require 4K-aligned allocations; otherwise the input incurs an extra copy. This is especially noticeable with intermediate values that are produced by upstream CPU-based nodes. Qnn has its own allocator when it is enabled; we make sure it is correctly advertised to the allocation planner.
Cc: @quic-ashigarg