Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash-on-exit issue in the WebGPU execution provider caused by DLL unload ordering. When WebGpuContextFactory::Cleanup() runs, dependent DLLs like dxcompiler.dll may have already been unloaded, leading to crashes during resource destruction.
Changes:
- Explicitly loads and holds references to
dxil.dllanddxcompiler.dllto prevent premature unloading. - Changes
default_instance_fromwgpu::Instance(C++ RAII wrapper) to rawWGPUInstancefor explicit lifetime control. - Reorders cleanup in
WebGpuContextFactory::Cleanup()to ensure resources are released before DLL handles.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/webgpu_context.h | Reorders static members, changes default_instance_ to raw WGPUInstance, adds modules_ and modules_dxc_loaded_ fields. |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Moves context map allocation inside instance creation block, adds DXC DLL loading logic, implements explicit cleanup with correct ordering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
240d6fd to
ddd0a35
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tianleiwu
left a comment
There was a problem hiding this comment.
PR 27569 Review
Findings & Architectural Feedback (C/C++ Core & Execution Providers)
1. Node.js Bindings & N-API Lifetime Control
The PR attempts to solve a static destruction order fiasco—specifically where dxcompiler.dll or other libraries unload before Ort::Env destructs—by using manual reference counting, is_main_thread checks, and N-API environment cleanup hooks (OrtSingletonData::OrtObjects).
-
Correctness Bug (
InferenceSessionWrapLeak Logic is Incomplete):
The new shutdown guard atjs/node/src/inference_session_wrap.cc:60introduces a protective leak (if (!OrtSingletonData::GetOrtObjects()) { ... release() }) to avoid hitting a finalizedOrtEnvduring GC. However, it only leakssession_andioBinding_afterOrtSingletonDatahas already been torn down.InferenceSessionWrapstill ownsinputTypes_andoutputTypes_(js/node/src/inference_session_wrap.h:89-91). Those members areOrt::TypeInfoRAII wrappers, and their base destructor still callsOrtApi::ReleaseTypeInfo(include/onnxruntime/core/session/onnxruntime_cxx_api.h:706). That means a late N-API finalizer can still re-enter ORT after the singleton/env was deliberately destroyed, which leaves the same crash window this PR is trying to close. The destructor needs to stop releasing thoseOrt::TypeInfomembers as well, or the metadata needs to be cached in plain C++/JS data instead of owning ORT handles past shutdown. -
Architectural Feedback (
std::shared_ptrover manual ref-counting):
There is a dramatically better, standard C++ approach to handle this lifetime requirement that avoids the mutable global state (ref_count,ort_singleton_mutex), removes the need for cleanup hooks, removes the worker thread edge cases, and natively solves theInferenceSessionWrapfinalizer ordering bug (including theOrt::TypeInfobug) without manualis_main_threadlogic:- Wrap the
OrtObjectssingleton in astd::shared_ptr<OrtObjects>when instantiated. - Store a copy of this
std::shared_ptrinside the N-API environment context (OrtInstanceData). - Pass a copy of this
std::shared_ptrto everyInferenceSessionWrapupon creation.
Why this is better: A
std::shared_ptrguarantees that theOrt::Envsingleton stays alive exactly as long as either the Node.js environment is active or anyInferenceSessionWrapobject is still awaiting JS garbage collection. When the last session is garbage collected and the N-API environment tears down, theshared_ptrnaturally drops to zero and destroysOrt::Envcleanly. If the process terminates abruptly (e.g.,process.exit()), theshared_ptrinstances purposefully leak, completely preventing the WindowsExitProcessDLL unloading crash without manual intervention. You can safely remove the N-API cleanup hooks and the intentional bypass leak inside~InferenceSessionWrap. - Wrap the
2. WebGPU Context Factory (WebGpuContextFactory)
The PR alters wgpu::Instance default_instance_ to a raw handle WGPUInstance and dynamically allocates contexts_ to intentionally defer or leak their destruction until WebGpuContextFactory::Cleanup() is manually called by OrtEnv::~OrtEnv().
- Architectural Feedback (Leaky Singletons):
Changing the C++ RAII wrapperwgpu::Instanceback to the rawWGPUInstancehandle is functionally correct here. Since the static destructor forwgpu::Instanceruns indiscriminately duringExitProcess(), relying on the raw handle manually cleaned up byOrtEnv::~OrtEnv()safely bypasses the DLL unloading crash.
However, using a rawnewforcontexts_slightly violates the strict guideline against raw memory allocation. Since ONNX Runtime does not natively offer anabsl::NoDestructor<T>equivalent to safely construct lock-free leaky singletons without heap allocations, documenting this raw pointer as a necessary "construct on first use, leak at exit" workaround is acceptable.
Notes
- I did not find an obvious correctness issue in the WebGPU changes themselves. The
BucketCacheManagerpending-buffer fix and theHasDefaultLogger()guards both look directionally right. - I did not run the Node/WebGPU test matrix locally.
Conclusion and Recommendations
While the PR effectively targets the crash-on-exit semantics, its Node.js bindings rely heavily on extremely manual layout management that introduces a new finalizer crash bug (Ort::TypeInfo). It is highly recommended to refactor the Node.js bindings to use std::shared_ptr<OrtObjects> instead of manual ref_count and N-API lifecycle hooks. This will drastically simplify the code, guarantee robust safety against N-API GC finalizer ordering complexities, and elegantly solve the underlying ExitProcess DLL unloading crash.
|
Updated according to @tianleiwu's comments: 1.(a): modified to deal with inputTypes_ and outputTypes_. 1.(b): I cannot use
|
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
9f54c7d to
86112ba
Compare
Description
Fixes multiple issues that related to crash and memory leak.
BucketCacheManagermay hold pending buffers while cleaning up the WebGPU context, which causes memory leak.logging::LoggingManager::HasDefaultLogger().Also includes a few fixes related to Node.js binding.
OrtEnvwas used as a function local variable. This is problematic because the destruction of OrtEnv may be too late, where some DLLs are already unloaded. (The order of DLL unloading at process exit is not totally controllable). Change it to:All of the changes above should have covered different scenarios but ensures: