Implement experimental intermediate cross CPU EP allocation by yuslepukhin · Pull Request #24371 · microsoft/onnxruntime

yuslepukhin · 2025-04-09T23:13:47Z

Description

Onnxruntime manages a number of CPU-based accelerators, i.e. execution providers (EPs) that can operate on CPU-based inputs.
However, several of them, such as Qnn, Openvino and Vitis, may require CPU-based inputs to be aligned to 4K so that they can be memory mapped.

To mitigate that, we introduce a new CPU based allocator that produces 4K aligned memory.
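As a rough illustration of what "4K aligned memory" means in practice, the sketch below allocates a buffer whose address is a multiple of 4096 using the C++17 aligned `operator new`. The names `kAlign4K`, `Alloc4K`, `Free4K` and `IsAligned4K` are hypothetical and are not taken from the PR; the allocator added by the PR may be implemented differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical constant: the PR text only says "4K" alignment.
constexpr std::size_t kAlign4K = 4096;

// Allocate `size` bytes at an address that is a multiple of 4096.
// Uses the standard C++17 aligned allocation function.
inline void* Alloc4K(std::size_t size) {
  return ::operator new(size, std::align_val_t{kAlign4K});
}

// Memory from Alloc4K must be released with the matching aligned delete.
inline void Free4K(void* p) noexcept {
  ::operator delete(p, std::align_val_t{kAlign4K});
}

// True when the pointer satisfies the 4K alignment requirement.
inline bool IsAligned4K(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kAlign4K == 0;
}
```

A consumer that wants to memory map its input could check `IsAligned4K(ptr)` first and fall back to a copy only when the check fails.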

We also adjust the allocation planner to override the plain CPU device. When we detect a compiled CPU-based EP, we adjust the device accordingly by asking the EP for the device that corresponds to OrtMemType::OrtMemTypeCPUInput. This gives the EP an opportunity to return either a GPU/NPU device or a CPU device, depending on the mode it is operating in.

We also override Qnn GetOrtDeviceByMemType() to make sure the appropriate allocator is requested.

We also adjust memory patterns to make sure 4K alignment is respected in the contiguous buffers when appropriate.
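To make the memory-pattern point concrete, here is a minimal sketch of how block offsets inside one contiguous buffer can be rounded up so that each block honors its own alignment. `BlockRequest`, `AlignUp` and `PlanOffsets` are hypothetical names for illustration; this is not the planner code from the PR. The offsets are only meaningful if the base of the buffer is itself aligned to the largest requested alignment.

```cpp
#include <cstddef>
#include <vector>

// Round `value` up to the next multiple of `alignment`.
// `alignment` is assumed to be a power of two.
constexpr std::size_t AlignUp(std::size_t value, std::size_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical description of one block placed in the shared buffer.
struct BlockRequest {
  std::size_t size;
  std::size_t alignment;
};

// Assign each block an offset that is a multiple of its alignment.
// `total_size` receives the size the contiguous buffer must have.
inline std::vector<std::size_t> PlanOffsets(const std::vector<BlockRequest>& blocks,
                                            std::size_t& total_size) {
  std::vector<std::size_t> offsets;
  offsets.reserve(blocks.size());
  std::size_t cursor = 0;
  for (const auto& block : blocks) {
    cursor = AlignUp(cursor, block.alignment);  // insert padding if needed
    offsets.push_back(cursor);
    cursor += block.size;
  }
  total_size = cursor;
  return offsets;
}
```

For example, a 100-byte block with 64-byte alignment followed by a 10-byte block with 4096-byte alignment places the second block at offset 4096, leaving padding between the two.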

Motivation and Context

CPU-based providers accept CPU-based inputs, but they require 4K-aligned allocations; otherwise the input incurs an extra copy. This is especially noticeable with intermediate values that are produced by upstream CPU-based nodes. Qnn has its own allocator when it is enabled, and we make sure it is correctly advertised to the allocation planner.

Cc: @quic-ashigarg

…t#24371)

### Description

Onnxruntime manages a number of CPU-based accelerators, i.e. those that can operate on CPU-based inputs. However, several of them, such as `Qnn`, `Openvino` and `Vitis`, may either require CPU-based inputs to be aligned to 4K so they can be memory mapped, or prefer to override the device with their own CPU-accessible allocator.

To mitigate that, we introduce a new CPU-based allocator that produces 4K aligned memory.

We also adjust the allocation planner to override the plain CPU device. When we detect a compiled CPU-based EP, we adjust the device accordingly by asking the EP for the device that corresponds to `OrtMemType::OrtMemTypeCPUInput`. This gives the EP an opportunity to return either a GPU/NPU device or a CPU device, depending on the mode it is operating in. We select the device with the larger alignment between CPU default devices.

We also adjust memory patterns to make sure 4K alignment is respected in the contiguous buffers when appropriate.

### Motivation and Context

CPU-based providers accept CPU-based inputs, but they require 4K-aligned allocations; otherwise the input incurs an extra copy. This is especially noticeable with intermediate values that are produced by upstream CPU-based nodes. Qnn has its own allocator when it is enabled, and we make sure it is correctly advertised to the allocation planner.

This PR excludes Qnn allocator usage for intermediate values due to the overhead contributed by memhandle management.

Cc: @quic-ashigarg

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

Implement experimental intermediate cross CPU EP allocation

faf10b6

yuslepukhin requested review from adrianlizarraga, edgchen1 and skottmckay April 9, 2025 23:13

yuslepukhin commented Apr 10, 2025

onnxruntime/core/framework/allocation_planner.cc (outdated, resolved)

Implement experimental intermediate cross CPU EP allocation

e23ac42

yuslepukhin force-pushed the yuslepukhin/qnn_copy_fix branch from faf10b6 to e23ac42 Compare April 10, 2025 23:55

yuslepukhin added 3 commits April 10, 2025 16:56

Address build failures

3d4f853

Merge

fe84316

Address build issues

97c1b53

yuslepukhin closed this Apr 11, 2025

yuslepukhin added 2 commits April 11, 2025 11:56

Re-work CPU output device override

396b0e7

Adjust QNN EP for 4K CPU Allocator

fb9d3cb

yuslepukhin reopened this Apr 11, 2025

yuslepukhin added 3 commits April 11, 2025 15:40

Make sure CPU 4K allocator is not created with arena

5d8c724

Make QNN EP return QTP when rpcmem is present, otherwise just plain CPU. Make VITIS and NPU return 4K

a13fccd

Query memory from EPs

c7b8096

yuslepukhin requested review from HectorSVC and jywu-msft April 15, 2025 17:33

yuslepukhin marked this pull request as ready for review April 15, 2025 17:33

yuslepukhin mentioned this pull request Apr 15, 2025

Set shared memory type based on options during the compilation phase #24196

Merged

skottmckay reviewed Apr 15, 2025

yuslepukhin added 3 commits April 16, 2025 10:36

Merge branch 'main' into yuslepukhin/qnn_copy_fix

67fa917

Apply code review changes

36886fa

Make device comparision more specific

b15b24e

skottmckay reviewed Apr 18, 2025

yuslepukhin added 4 commits April 18, 2025 10:20

Merge branch 'main' into yuslepukhin/qnn_copy_fix

96fd557

Address review comments

3c7f21b

Remove extra quailifier

fc48a45

Merge branch 'main' into yuslepukhin/qnn_copy_fix

1e781e7

Resolve mlas deps

1ff6086

skottmckay reviewed Apr 23, 2025

yuslepukhin added 2 commits April 23, 2025 13:57

Address review comments

35e455c

Fix linking. Enforce default alignment only for CPU devices as before.

3934682

yuslepukhin commented Apr 23, 2025

onnxruntime/core/providers/qnn/qnn_allocator.cc (outdated, resolved)

yuslepukhin commented Apr 23, 2025

onnxruntime/core/providers/cpu/cpu_execution_provider.cc (outdated, resolved)

yuslepukhin and others added 4 commits April 23, 2025 16:22

Undo some changes

b691e7b

remove header from HTP shared memory allocations. replace with global allocation tracking table.

fb8b668

call QnnContextMemHandleManager::Unregister with shared memory address instead of allocation base address

5c15fdb
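The two commit messages above describe replacing a per-allocation header with a global tracking table, and unregistering by the shared memory address rather than the allocation base address. The sketch below shows one way such a table could look. It is an assumption-laden illustration: `AllocationRecord` and `AllocationTable` are hypothetical names, and this is not the QNN EP's actual `AllocationTracker` or `QnnContextMemHandleManager` code.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <unordered_map>

// Hypothetical bookkeeping that a header in front of the buffer would
// otherwise hold.
struct AllocationRecord {
  void* base_address;  // address originally returned by the underlying allocator
  std::size_t size;    // usable size handed to the caller
};

// Hypothetical global table keyed by the address handed to the caller,
// which may differ from the allocation base address.
class AllocationTable {
 public:
  void Register(const void* user_address, AllocationRecord record) {
    std::lock_guard<std::mutex> lock(mutex_);
    records_[user_address] = record;
  }

  std::optional<AllocationRecord> Lookup(const void* user_address) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = records_.find(user_address);
    if (it == records_.end()) return std::nullopt;
    return it->second;
  }

  // Returns true if an entry keyed by `user_address` existed and was removed.
  bool Unregister(const void* user_address) {
    std::lock_guard<std::mutex> lock(mutex_);
    return records_.erase(user_address) > 0;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<const void*, AllocationRecord> records_;
};
```

Because the table is keyed by the address the caller sees, looking up or unregistering with the base address finds nothing, which is the kind of mismatch the second commit message appears to address.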

Fix shared allocation detection

c0d1de4

skottmckay reviewed Apr 24, 2025

onnxruntime/core/framework/allocation_planner.cc (outdated, resolved)

onnxruntime/core/providers/qnn/qnn_allocator.cc (resolved)

edgchen1 reviewed Apr 24, 2025

onnxruntime/core/providers/qnn/builder/qnn_model.cc (outdated, resolved)

edgchen1 reviewed Apr 24, 2025

onnxruntime/core/providers/qnn/qnn_allocator.cc (resolved)

edgchen1 and others added 3 commits April 24, 2025 10:51

add comments about AllocationTracker. add gsl include.

2e4b46c

Remove alignment detection in qnn model

c2e3f99

For CPU based memory:

b2b949b

if MemType differs, prefer the non-default one, e.g. QNN uses OrtDevice::MemType::QNN_HTP_SHARED
if both are not default, no preference
prefer allocator with higher alignment requirement
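One possible reading of the preference rules in this commit message is sketched below. `MemType`, `CpuDevice` and `PreferCpuDevice` are hypothetical stand-ins, not the `OrtDevice` types or the planner code from the PR. The sketch also assumes that "no preference" means keeping the first candidate, and that the alignment rule applies when the memory types match.

```cpp
#include <cstddef>

// Hypothetical stand-in for the memory type of a CPU-accessible device.
enum class MemType { kDefault, kHtpShared };

// Hypothetical, simplified device description.
struct CpuDevice {
  MemType mem_type;
  std::size_t alignment;
};

// Pick between two CPU-based devices:
//  - if the memory types differ, prefer the non-default one;
//  - if both are non-default, there is no preference (keep `a`);
//  - otherwise prefer the device with the higher alignment requirement.
inline const CpuDevice& PreferCpuDevice(const CpuDevice& a, const CpuDevice& b) {
  if (a.mem_type != b.mem_type) {
    if (a.mem_type == MemType::kDefault) return b;
    if (b.mem_type == MemType::kDefault) return a;
    return a;  // both non-default: no preference
  }
  return b.alignment > a.alignment ? b : a;
}
```

With these rules, a plain CPU device with default alignment loses to both a shared-memory device and a 4K-aligned CPU device.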

skottmckay reviewed Apr 24, 2025

onnxruntime/core/framework/allocation_planner.cc (outdated, resolved)

yuslepukhin added 2 commits April 24, 2025 19:02

Extract common logic for input/output device deduction

6da6994

Disable Qnn Qtp shared allocator for intermediate values

f8635f7

skottmckay approved these changes Apr 25, 2025

yuslepukhin merged commit 8bb3b07 into main Apr 25, 2025
85 of 89 checks passed

yuslepukhin deleted the yuslepukhin/qnn_copy_fix branch April 25, 2025 18:00

mingyueliuh mentioned this pull request May 7, 2025

[Fix] compare OrtDevice error #24677

Merged

yuslepukhin pushed a commit that referenced this pull request May 8, 2025

[Fix] compare OrtDevice error (#24677)

0e0002b

### Description

Fix OrtDevice comparison in Debug mode. Related: #24371

### Motivation and Context

Add device alignment to the OrtDevice compare function.
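The follow-up fix says that alignment was added to the device comparison. A minimal sketch of that idea, using a hypothetical `DeviceKey` struct rather than the real `OrtDevice` class, compares all fields including alignment through `std::tie` so that equality and ordering stay consistent with each other:

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>

// Hypothetical, simplified device identity; field names are illustrative.
struct DeviceKey {
  std::int8_t type;
  std::int8_t mem_type;
  std::int16_t id;
  std::size_t alignment;

  // Single source of truth for which fields take part in comparisons.
  auto Tie() const { return std::tie(type, mem_type, id, alignment); }
};

inline bool operator==(const DeviceKey& a, const DeviceKey& b) {
  return a.Tie() == b.Tie();
}

inline bool operator!=(const DeviceKey& a, const DeviceKey& b) {
  return !(a == b);
}

// Strict weak ordering that also distinguishes devices by alignment.
inline bool operator<(const DeviceKey& a, const DeviceKey& b) {
  return a.Tie() < b.Tie();
}
```

Routing both operators through one `Tie()` helper avoids the situation where two devices compare unequal but neither orders before the other.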

3 participants
