[TensorRT EP] Switch to enqueueV3 with support DDS output by chilo-ms · Pull Request #17751 · microsoft/onnxruntime

chilo-ms · 2023-09-30T18:40:14Z

The switch to enqueueV3 happens in 2 phases. This PR is the 2nd phase. (The 1st phase PR is here)

One of the ways TRT handles data-dependent shape (DDS) output is by relying on the user to provide an allocator as a callback. TRT calls this allocator at runtime, once it knows the shape of the tensor, to allocate the output memory. So we need a way to bind the allocated output to the kernel context output.

"If the output tensor has data-dependent shape, TRT EP will provide an IOutputAllocator for enqueueV3 to dynamically allocate memory buffer.
Once enqueueV3 returns, TRT EP will then bind the output allocation to the ORT kernel context output.
(Please note that we take strategy A mentioned in https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dynamic-shaped-output,
in which we defer allocation until the size is known and don't call IExecutionContext::setTensorAddress.)
Otherwise, if the shape of the output tensor is known prior to runtime, ORT will pre-allocate the memory buffer for the output tensor for enqueueV3."
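
The deferred-allocation flow in the comment above can be sketched roughly as follows. This is a self-contained mock, not TensorRT or ORT code: `OutputAllocatorSketch` and `MockEnqueue` are hypothetical names, modeled only loosely on the callback idea behind TensorRT's `IOutputAllocator`, and a host `std::vector` stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an output allocator that the runtime calls back
// once it knows how large a data-dependent output is.
class OutputAllocatorSketch {
 public:
  // Called by the runtime when the required byte size is known.
  // Grows the buffer only if the current one is too small.
  void* Reallocate(std::size_t size_in_bytes) {
    if (size_in_bytes > buffer_.size()) {
      buffer_.resize(size_in_bytes);
    }
    return buffer_.data();
  }

  // Called by the runtime to report the final output shape.
  void NotifyShape(const std::vector<std::int64_t>& shape) { shape_ = shape; }

  const std::vector<std::int64_t>& shape() const { return shape_; }
  const void* data() const { return buffer_.data(); }

 private:
  std::vector<unsigned char> buffer_;  // host memory here; device memory in a real EP
  std::vector<std::int64_t> shape_;
};

// Hypothetical engine with a NonZero-like output: the number of results
// depends on the input values, so the output shape is unknown before running.
inline void MockEnqueue(const std::vector<float>& input, OutputAllocatorSketch& allocator) {
  std::vector<std::int64_t> indices;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i] != 0.0f) indices.push_back(static_cast<std::int64_t>(i));
  }
  // Only now is the size known, so only now is memory requested.
  const std::size_t bytes = indices.size() * sizeof(std::int64_t);
  void* out = allocator.Reallocate(bytes);
  assert(out != nullptr || bytes == 0);
  if (bytes > 0) std::memcpy(out, indices.data(), bytes);
  allocator.NotifyShape({static_cast<std::int64_t>(indices.size())});
}
```

After `MockEnqueue` returns, the caller reads `shape()` and `data()` from the allocator. That is the point at which an EP would hand the result over to the kernel context output, either by binding it or by copying it.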

Add a new ORT KernelContext_SetOutput() API that calls the existing SetOutputMLValue(), which was previously used only by training.
This is because compile-based EPs can only use the public OrtKernelContext APIs and not the internal OpKernelContext API.

souptc · 2023-12-04T17:45:45Z

include/onnxruntime/core/session/onnxruntime_c_api.h

+   *
+   * \since Version 1.17.
+   */
+  ORT_API2_STATUS(KernelContext_SetOutput, _Inout_ OrtKernelContext* context, _In_ size_t index,


KernelContext_SetOutput

Why do we want to expose it in the C API? Do we have a TRT-based custom op?

Compile-API-based EPs need to implement compute_func, which only has access to the public OrtKernelContext API, not the internal OpKernelContext API.
We need to use SetOutputMLValue(), so that's why it's plumbed through to the public API in this PR.

jywu-msft · 2023-12-08T18:45:34Z

This will be replaced by a version that copies the output rather than binding it to the kernel context, since we don't want to expose that API publicly. #18714

…on) (#18714) It's branched off from #17751 but removes KernelContext_SetOutput() API. It copies output allocation buffer to kernel context. --------- Co-authored-by: George Wu <jywu@microsoft.com>

When the TRT engine cache (precompiled engine) is present, it doesn't make sense to go through model verification, model optimization, TRT EP's GetCapability(), TRT EP's model proto reconstruction, calling the TRT parser, and engine compilation. This PR makes TRT EP skip those steps and directly load the engine to perform inference.

The feature request: #18072

Features:
- Replace the original model with a TRT-engine-wrapped ONNX model. It can save a lot of time, as mentioned above.
- How to get a TRT-engine-wrapped ONNX model?
  1. Set the `trt_dump_ep_context_model` provider option to "true" and run inference. You will find "xxx_wrapper.onnx" at the engine cache path. (The same logic as generating the engine cache.)
  2. Use gen_trt_engine_wrapper_onnx_model.py
- Three provider options are added:
  - `trt_dump_ep_context_model`: Enable dumping the wrapped ONNX model by TRT EP.
  - `trt_ep_context_embed_mode`: Add embed_mode as an attribute. 0 means engine cache path, 1 means engine binary data.
  - `trt_ep_context_compute_capability_enable`: Add hardware_arch as an attribute. When running the model, TRT EP will check consistency between the model's hardware_arch and the GPU's compute capability.
- When the engine cache path is given in the wrapped model, TRT EP will first search for the engine file using the path relative to the model path. If it can't find the file, it will use the path as is (depending on the user, this could be relative to the working directory or an absolute path).

Note:
1. This PR includes the change of #17751

Constraints:
1. The whole model should be fully supported by TRT.
4. Users need to make sure the engine is built with min/max/opt optimization profiles that are large enough to cover the range of all inputs. TRT EP will simply fail and won't rebuild the engine if the input shape is out of range at runtime.
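
The engine-file lookup order described in the features above (model-relative first, then the path as given) could look roughly like this. It is a minimal sketch under stated assumptions: `ResolveEnginePath` is a hypothetical helper, not a function from TRT EP, and it uses C++17 `std::filesystem` rather than whatever path utilities the EP actually uses.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: resolve the engine cache path stored in the wrapped model.
inline fs::path ResolveEnginePath(const fs::path& model_path,
                                  const std::string& engine_path_attr) {
  assert(!engine_path_attr.empty());  // precondition for this sketch
  // First, try the path relative to the directory that contains the model.
  // (If engine_path_attr is absolute, operator/ yields it unchanged.)
  const fs::path candidate = model_path.parent_path() / engine_path_attr;
  std::error_code ec;  // non-throwing overload; errors count as "not found"
  if (fs::exists(candidate, ec)) {
    return candidate;
  }
  // Otherwise, use the path as given: it is then interpreted relative to the
  // working directory, or as an absolute path, depending on what the user wrote.
  return fs::path(engine_path_attr);
}
```

For example, with a wrapped model at `/models/xxx_wrapper.onnx` and an attribute value of `xxx.engine`, this sketch returns `/models/xxx.engine` when that file exists and plain `xxx.engine` otherwise.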

chilo-ms added 3 commits September 28, 2023 21:50

update

de170d5

update

3affd2e

update

886bcce

yf711 self-requested a review October 2, 2023 23:25

chilo-ms and others added 7 commits October 10, 2023 21:39

update

ff9d3d2

merge

ae69ca7

update

09f2401

Merge branch 'main' into chi/trt_enqueue_v3

1e77cd1

Merge branch 'main' into chi/trt_enqueue_v3

5f82028

fix bug

8f847ec

fix bugs

35d54b8

chilo-ms changed the title ~~[TensorRT EP] Switch to use new TRT APIs from deprecated ones~~ [TensorRT EP] Switch to enqueueV3 with support DDS output Oct 18, 2023

chilo-ms mentioned this pull request Oct 18, 2023

[TensorRT EP] Switch to enqueueV3 #18008

Closed

chilo-ms requested a review from jywu-msft October 18, 2023 00:36

chilo-ms marked this pull request as ready for review October 18, 2023 00:36

update

5330dff

chilo-ms requested review from jslhcl and souptc October 20, 2023 17:33

chilo-ms and others added 5 commits October 23, 2023 17:13

Merge branch 'main' into chi/trt_enqueue_v3

ed4849f

refactor

98b35ed

Merge branch 'main' into chi/trt_enqueue_v3

018a6b4

fix format

0e79992

fix minor bug

bc7e206

chilo-ms mentioned this pull request Nov 1, 2023

[TensorRT EP] Load precompiled TRT engine file directly #18217

Merged

chilo-ms and others added 6 commits November 1, 2023 22:48

remove redundant code

57ba208

code refacotr

b1ec7cd

fix format

9653143

Merge branch 'main' into chi/trt_enqueue_v3

4e40cd3

update

ecc2566

update

7ee0ee0

Merge branch 'main' into chi/trt_enqueue_v3

98c8374

chilo-ms mentioned this pull request Nov 7, 2023

TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance] #17434

Closed

chilo-ms and others added 3 commits November 15, 2023 00:07

Merge branch 'main' into chi/trt_enqueue_v3

8567688

Merge branch 'main' into chi/trt_enqueue_v3

dc88e6b

Merge branch 'main' into chi/trt_enqueue_v3

2413e1d

jywu-msft previously approved these changes Dec 4, 2023

souptc reviewed Dec 4, 2023

souptc previously approved these changes Dec 4, 2023

Add INT32/INT64 and float/double conversion for DDS outputs

99e79dd

chilo-ms dismissed souptc’s stale review via 99e79dd December 5, 2023 04:07

chilo-ms added 4 commits December 5, 2023 04:11

update for adding INT32/INT64 and float/double conversion for DDS outputs

e15a8bd

fix typo

8de13db

fix bug for using local buffer

27ea00e

code refactor and add cleanup for dds_output_allocator_map

4ff9a85

chilo-ms mentioned this pull request Dec 6, 2023

[TensorRT EP] Switch to enqueueV3 with support DDS output (copy version) #18714

Merged

jywu-msft closed this Dec 8, 2023

chilo-ms wants to merge 31 commits into main from chi/trt_enqueue_v3
