[ORT 1.17.0 Release] Cherry-pick Final Round #19327
Merged
YUNQIUGUO merged 7 commits into rel-1.17.0 on Jan 31, 2024
### Description
Adds the ability to specify general session configuration entries via the `-C` command-line option. Example: `-C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"`

Some session config entries can already be set via dedicated command-line options. If the user uses multiple command-line options to set the same session config entry, we'll print a warning. Note that the dedicated command-line options will take precedence.

### Motivation and Context
Allows setting session configurations when testing EPs. QNN EP, for example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*` options.
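As a rough illustration of the `-C` format described above, the sketch below splits a space-separated list of `key|value` entries into a map and lets a dedicated option override an entry. The function names `ParseSessionConfigs` and `ApplyDedicatedOption` are hypothetical and are not taken from the actual test tool; the real parsing and warning logic may differ.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

using SessionConfigs = std::unordered_map<std::string, std::string>;

// Hypothetical helper: parses "key1|value1 key2|value2" into a map.
// Entries are separated by whitespace; key and value by the first '|'.
SessionConfigs ParseSessionConfigs(const std::string& arg) {
  SessionConfigs entries;
  std::istringstream tokens(arg);
  std::string token;
  while (tokens >> token) {
    const auto sep = token.find('|');
    if (sep == std::string::npos || sep == 0) {
      throw std::invalid_argument("expected key|value, got: " + token);
    }
    entries[token.substr(0, sep)] = token.substr(sep + 1);
  }
  return entries;
}

// Hypothetical helper: a dedicated command-line option takes precedence.
// Returns true if it replaced an entry already set via -C, in which case
// the caller would print a warning.
bool ApplyDedicatedOption(SessionConfigs& entries, const std::string& key,
                          const std::string& value) {
  return !entries.insert_or_assign(key, value).second;
}
```

For example, parsing `"session.disable_cpu_ep_fallback|1 ep.context_enable|1"` yields two entries, and a later `ApplyDedicatedOption(cfg, "ep.context_enable", "0")` returns `true` to signal that a warning is due.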
…lines (#19293) To fix a pipeline issue.
Given that InferenceSession::Run() is guaranteed to be thread-safe,
meaning multiple threads can call this function concurrently,
TRT EP needs to handle concurrency carefully here. If it does not,
the following concurrency issue might happen:
- To perform inference concurrently in multiple streams, it is
recommended to use one TRT execution context per stream.
In the design of TRT EP (which does not use a per-thread context implementation),
if multiple threads call InferenceSession::Run()
concurrently, the TRT execution context instance is shared by all the
threads, and each thread acquires a different stream from ORT.
So TRT EP will end up with one TRT execution context using multiple
streams, which is not recommended.
However, since the whole compute_func() is protected by the lock, if
cudaStreamSynchronize() is enforced here, one TRT execution context per
stream is guaranteed.
Therefore, TRT EP needs to call cudaStreamSynchronize() in
compute_func(), which waits until the stream has completed all
operations, to prevent this concurrency issue.
GitHub issue: #19275
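The locking argument above can be sketched without CUDA. The following is a simulation only, not TRT EP code: `SharedContext`, `EnqueueAsync`, `ComputeFunc`, and `MaxConcurrentStreams` are hypothetical names, a `std::future` stands in for a stream, and `future::wait()` stands in for `cudaStreamSynchronize()`. It assumes the work enqueued on a stream keeps using the shared context until the stream is synchronized.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for one execution context shared by all threads (hypothetical).
struct SharedContext {
  std::atomic<int> active_streams{0};  // streams currently using the context
  std::atomic<int> max_active{0};      // highest concurrency observed
};

// Stand-in for enqueueing asynchronous work on a stream. The returned future
// plays the role of the stream that must later be synchronized.
std::future<void> EnqueueAsync(SharedContext& ctx) {
  return std::async(std::launch::async, [&ctx] {
    const int now = ++ctx.active_streams;
    int prev = ctx.max_active.load();
    // Record the maximum number of streams seen on the context at once.
    while (now > prev && !ctx.max_active.compare_exchange_weak(prev, now)) {
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    --ctx.active_streams;
  });
}

// Mirrors the described compute_func(): take the lock, enqueue, and
// synchronize the stream before the lock is released.
void ComputeFunc(SharedContext& ctx, std::mutex& mu, bool sync_under_lock) {
  std::future<void> stream_work;
  {
    std::lock_guard<std::mutex> lock(mu);
    stream_work = EnqueueAsync(ctx);
    if (sync_under_lock) stream_work.wait();  // analogue of cudaStreamSynchronize()
  }
  if (!sync_under_lock) stream_work.wait();  // work may overlap with other threads
}

// Runs ComputeFunc from several threads and reports the peak number of
// streams that used the shared context at the same time.
int MaxConcurrentStreams(int num_threads, bool sync_under_lock) {
  SharedContext ctx;
  std::mutex mu;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&] { ComputeFunc(ctx, mu, sync_under_lock); });
  }
  for (auto& t : threads) t.join();
  return ctx.max_active.load();
}
```

With `sync_under_lock` set to `true`, the peak stays at one stream per context because each thread's work finishes before the lock is released. With `false`, the lock only covers the enqueue step, so work from several streams may overlap on the shared context.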
…9311)

### Description
Updates to only include iOS archs framework in artifacts included in the NuGet package.

### Motivation and Context
Related issue: #19295 (comment)

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
adrianlizarraga previously approved these changes on Jan 30, 2024
chilo-ms previously approved these changes on Jan 30, 2024
snnn previously approved these changes on Jan 30, 2024
### Description
This PR updates the Whisper export with beam search by adding the following.
- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` to allow existing attention fusions to continue working
- Re-uses the cache directory when loading the PyTorch model to reduce memory used on disk
- Adds `--disable_auto_mixed_precision` to the example FP16 export command

### Motivation and Context
- [This PR](#19112) added the `is_unidirectional` parameter to `CheckInputs`, but it was not provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](#19200) explains the reasoning behind why `eager` is used to load the `WhisperAttention` class.
- By re-using the cache directory for loading the PyTorch model, only one copy of the PyTorch model is saved on disk instead of two copies.
- By providing this flag, there will be fewer Cast nodes in the Whisper with beam search model to switch between FP16 and FP32 precision.
Add Intel neural-speed to ThirdPartyNotices.txt because it will be shipped in the default build in most of our packages.
d101450
snnn previously approved these changes on Jan 30, 2024
Contributor
This one was missed in the cherry-pick: #18906
Contributor, Author
OK, it looks like the label was only added last Friday. To confirm, though: this seems like a large change. Would it impact the RC, or carry any risk of breaks, revalidation, etc.?
### Description
These changes add rotary embedding and packed qkv input to gqa. As of now, the changes are only supported with Flash-Attention (SM >= 80) but should soon be supported with Memory Efficient Attention as well.

### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. The packed QKV should also provide some perf gain in the context of certain models, like Llama2, that would benefit from running ops on the fused QKV matrix, rather than the separate Q, K, and V.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
snnn approved these changes on Jan 30, 2024
kunal-vaishnavi approved these changes on Jan 30, 2024
tianleiwu approved these changes on Jan 31, 2024
YUNQIUGUO added a commit that referenced this pull request on Feb 1, 2024
### Description
Cherry-pick Final Round

### Motivation and Context

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: aciddelgado <139922440+aciddelgado@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
This was referenced Sep 5, 2025
Description
Cherry-pick Final Round