Fix WebGPU EP crash on exit by fs-eire · Pull Request #27569 · microsoft/onnxruntime

fs-eire · 2026-03-05T22:40:30Z

Description

Fixes multiple issues related to crashes and memory leaks.

Fix an uncommon situation in which BucketCacheManager may still hold pending buffers while the WebGPU context is being cleaned up, which causes a memory leak.
Change the WebGPU default instance from a RAII wrapper (wgpu::Instance) to a raw pointer (WGPUInstance) so that it is not destructed automatically at process exit. Automatic destruction may crash because it accesses DXC code after dxcompiler.dll has already been unloaded.
Fix a crash that occurs when the default ORT logger is destructed before a WebGPU device: the device callbacks are now guarded by the condition logging::LoggingManager::HasDefaultLogger().
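
The second and third items can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Handle`, `Factory`, `g_logger_alive`, and `OnDeviceEvent` are hypothetical stand-ins for `WGPUInstance`, `WebGpuContextFactory`, `logging::LoggingManager::HasDefaultLogger()`, and a device callback.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a C-style handle such as WGPUInstance.
struct Handle {
  static int live;  // number of handles that have not been released
  Handle() { ++live; }
  ~Handle() { --live; }
};
int Handle::live = 0;

// Hypothetical stand-in for logging::LoggingManager::HasDefaultLogger().
bool g_logger_alive = true;
std::vector<std::string> g_log;

class Factory {
 public:
  // Creates the default instance on first use.
  static Handle* GetOrCreate() {
    if (default_instance_ == nullptr) default_instance_ = new Handle();
    assert(default_instance_ != nullptr);
    return default_instance_;
  }

  // Explicit release, meant to be called while dependent libraries are
  // still loaded. If it is never called, the handle leaks at process exit.
  static void Cleanup() {
    delete default_instance_;
    default_instance_ = nullptr;
  }

 private:
  // A raw pointer has no destructor, so nothing runs during static destruction.
  static Handle* default_instance_;
};
Handle* Factory::default_instance_ = nullptr;

// Device callback that only logs while the logger still exists.
void OnDeviceEvent(const std::string& message) {
  if (!g_logger_alive) return;  // logger already destroyed: do nothing
  g_log.push_back(message);
}
```

In this sketch the only way the handle is released is an explicit `Factory::Cleanup()` call, which mirrors the intent of releasing the instance at a controlled point rather than during static destruction.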

Also includes a few fixes related to the Node.js binding.

Previously, the OrtEnv was a function-local variable. This is problematic because the OrtEnv may be destructed too late, after some DLLs have already been unloaded. (The order of DLL unloading at process exit is not fully controllable.) This is changed as follows:
- If OrtEnv is constructed on the main thread, a cleanup hook is registered that runs when Node.js starts to exit. If the hook is not called (e.g. an uncaught exception is thrown), the OrtEnv is not released.
- If OrtEnv is constructed on a worker thread, it is left alone and allowed to leak at exit.
Because of the change above, if the OrtEnv has already been released, active sessions are not released (they are object wraps that are destructed later than the registered hooks).
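
The lifetime rules above can be modeled without Node.js. The following is a simplified sketch under stated assumptions, not the binding's actual code: `Env`, `Session`, `SessionWrap`, `CreateEnv`, and `RunCleanupHooks` are hypothetical names, and the vector of callbacks only imitates Node-API environment cleanup hooks.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-ins for OrtEnv and an ORT session.
struct Env {
  static int live;
  Env() { ++live; }
  ~Env() { --live; }
};
int Env::live = 0;

struct Session {
  static int destroyed;
  ~Session() { ++destroyed; }
};
int Session::destroyed = 0;

Env* g_env = nullptr;  // process-wide environment

// Imitates environment cleanup hooks that run when Node.js starts to exit.
std::vector<std::function<void()>> g_cleanup_hooks;

void CreateEnv(bool is_main_thread) {
  if (g_env != nullptr) return;
  g_env = new Env();
  if (is_main_thread) {
    // Main thread: release the environment from a cleanup hook.
    g_cleanup_hooks.push_back([] {
      delete g_env;
      g_env = nullptr;
    });
  }
  // Worker thread: no hook is registered, so the environment leaks at exit.
}

void RunCleanupHooks() {
  for (auto& hook : g_cleanup_hooks) hook();
  g_cleanup_hooks.clear();
}

class SessionWrap {
 public:
  SessionWrap() : session_(std::make_unique<Session>()) {
    assert(g_env != nullptr);  // sessions are only created while the env exists
  }
  ~SessionWrap() {
    if (g_env == nullptr) {
      // The environment is already gone: leak the session on purpose
      // instead of calling back into a released runtime.
      (void)session_.release();
    }
  }

 private:
  std::unique_ptr<Session> session_;
};
```

A wrapper destroyed before the hooks run releases its session normally; a wrapper finalized afterwards leaks it, which is the intended trade-off at process exit.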

Together, the changes above cover different scenarios and ensure the following:

If any resource is intentionally leaked, it must be at process exit.
If it is not at process exit, resource lifecycles should be managed correctly.
Resources are released safely on a best-effort basis (without guarantee), to stay friendly to memory leak detectors.

Copilot

Pull request overview

This PR fixes a crash-on-exit issue in the WebGPU execution provider caused by DLL unload ordering. When WebGpuContextFactory::Cleanup() runs, dependent DLLs like dxcompiler.dll may have already been unloaded, leading to crashes during resource destruction.

Changes:

Explicitly loads and holds references to dxil.dll and dxcompiler.dll to prevent premature unloading.
Changes default_instance_ from wgpu::Instance (C++ RAII wrapper) to raw WGPUInstance for explicit lifetime control.
Reorders cleanup in WebGpuContextFactory::Cleanup() to ensure resources are released before DLL handles.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Reorders static members, changes `default_instance_` to raw `WGPUInstance`, adds `modules_` and `modules_dxc_loaded_` fields. |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Moves context map allocation inside instance creation block, adds DXC DLL loading logic, implements explicit cleanup with correct ordering. |

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

tianleiwu

PR 27569 Review

Findings & Architectural Feedback (C/C++ Core & Execution Providers)

1. Node.js Bindings & N-API Lifetime Control

The PR attempts to solve a static destruction order fiasco—specifically where dxcompiler.dll or other libraries unload before Ort::Env destructs—by using manual reference counting, is_main_thread checks, and N-API environment cleanup hooks (OrtSingletonData::OrtObjects).

Correctness Bug (InferenceSessionWrap Leak Logic is Incomplete):
The new shutdown guard at js/node/src/inference_session_wrap.cc:60 introduces a protective leak (if (!OrtSingletonData::GetOrtObjects()) { ... release() }) to avoid hitting a finalized OrtEnv during GC. However, it only leaks session_ and ioBinding_ after OrtSingletonData has already been torn down. InferenceSessionWrap still owns inputTypes_ and outputTypes_ (js/node/src/inference_session_wrap.h:89-91). Those members are Ort::TypeInfo RAII wrappers, and their base destructor still calls OrtApi::ReleaseTypeInfo (include/onnxruntime/core/session/onnxruntime_cxx_api.h:706). That means a late N-API finalizer can still re-enter ORT after the singleton/env was deliberately destroyed, which leaves the same crash window this PR is trying to close. The destructor needs to stop releasing those Ort::TypeInfo members as well, or the metadata needs to be cached in plain C++/JS data instead of owning ORT handles past shutdown.
Architectural Feedback (std::shared_ptr over manual ref-counting):
There is a dramatically better, standard C++ approach to handle this lifetime requirement that avoids the mutable global state (ref_count, ort_singleton_mutex), removes the need for cleanup hooks, removes the worker thread edge cases, and natively solves the InferenceSessionWrap finalizer ordering bug (including the Ort::TypeInfo bug) without manual is_main_thread logic:
1. Wrap the OrtObjects singleton in a std::shared_ptr<OrtObjects> when instantiated.
2. Store a copy of this std::shared_ptr inside the N-API environment context (OrtInstanceData).
3. Pass a copy of this std::shared_ptr to every InferenceSessionWrap upon creation.
Why this is better: A std::shared_ptr guarantees that the Ort::Env singleton stays alive exactly as long as either the Node.js environment is active or any InferenceSessionWrap object is still awaiting JS garbage collection. When the last session is garbage collected and the N-API environment tears down, the shared_ptr naturally drops to zero and destroys Ort::Env cleanly. If the process terminates abruptly (e.g., process.exit()), the shared_ptr instances purposefully leak, completely preventing the Windows ExitProcess DLL unloading crash without manual intervention. You can safely remove the N-API cleanup hooks and the intentional bypass leak inside ~InferenceSessionWrap.
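
The three steps of this suggestion can be sketched as follows. This is an illustration of the reviewer's proposal under simplifying assumptions, not code from the PR: `OrtObjects` is reduced to a counter, and `InstanceData` and `SessionWrap` are hypothetical stand-ins for `OrtInstanceData` and `InferenceSessionWrap`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the singleton that owns Ort::Env.
struct OrtObjects {
  static int live;
  OrtObjects() { ++live; }
  ~OrtObjects() { --live; }
};
int OrtObjects::live = 0;

// Stand-in for per-environment data (step 2): holds one owner.
struct InstanceData {
  std::shared_ptr<OrtObjects> ort;
};

// Stand-in for a session wrapper (step 3): holds another owner.
class SessionWrap {
 public:
  explicit SessionWrap(std::shared_ptr<OrtObjects> ort) : ort_(std::move(ort)) {
    assert(ort_ != nullptr);  // a session always keeps the singleton alive
  }

 private:
  std::shared_ptr<OrtObjects> ort_;
};
```

With this layout the singleton is destroyed only when both the environment data and the last wrapper have dropped their references, regardless of which goes first.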

2. WebGPU Context Factory (`WebGpuContextFactory`)

The PR alters wgpu::Instance default_instance_ to a raw handle WGPUInstance and dynamically allocates contexts_ to intentionally defer or leak their destruction until WebGpuContextFactory::Cleanup() is manually called by OrtEnv::~OrtEnv().

Architectural Feedback (Leaky Singletons):
Changing the C++ RAII wrapper wgpu::Instance back to the raw WGPUInstance handle is functionally correct here. Since the static destructor for wgpu::Instance runs indiscriminately during ExitProcess(), relying on the raw handle manually cleaned up by OrtEnv::~OrtEnv() safely bypasses the DLL unloading crash.
However, using a raw new for contexts_ slightly violates the strict guideline against raw memory allocation. Since ONNX Runtime does not natively offer an absl::NoDestructor<T> equivalent to safely construct lock-free leaky singletons without heap allocations, documenting this raw pointer as a necessary "construct on first use, leak at exit" workaround is acceptable.
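
For context, the "construct on first use, leak at exit" idea behind a `NoDestructor`-style wrapper can be sketched as below. This is a simplified illustration in the spirit of `absl::NoDestructor<T>`, not Abseil's implementation and not something ONNX Runtime provides; `Tracker`, `GetTracker`, and `DemoValue` are hypothetical names.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Constructs a T in inline storage and never runs its destructor.
template <typename T>
class NoDestructor {
 public:
  template <typename... Args>
  explicit NoDestructor(Args&&... args)
      : ptr_(new (storage_) T(std::forward<Args>(args)...)) {}

  NoDestructor(const NoDestructor&) = delete;
  NoDestructor& operator=(const NoDestructor&) = delete;

  // No destructor is declared, so ~T() is never called.
  T& operator*() { return *ptr_; }
  T* operator->() { return ptr_; }

 private:
  alignas(T) unsigned char storage_[sizeof(T)];  // no heap allocation
  T* ptr_;
};

// Example payload that counts destructor calls.
struct Tracker {
  static int destroyed;
  int value;
  explicit Tracker(int v) : value(v) {}
  ~Tracker() { ++destroyed; }
};
int Tracker::destroyed = 0;

// Leaky singleton: constructed on first use, never destroyed.
Tracker& GetTracker() {
  static NoDestructor<Tracker> instance(42);
  return *instance;
}

int DemoValue() {
  Tracker& t = GetTracker();
  assert(&t == &GetTracker());  // every call returns the same object
  return t.value;
}
```

The trade-off is that such a wrapper offers no way to release the object later, whereas a raw pointer with an explicit cleanup function still can.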

Notes

I did not find an obvious correctness issue in the WebGPU changes themselves. The BucketCacheManager pending-buffer fix and the HasDefaultLogger() guards both look directionally right.
I did not run the Node/WebGPU test matrix locally.

Conclusion and Recommendations

While the PR effectively targets the crash-on-exit semantics, its Node.js bindings rely heavily on extremely manual layout management that introduces a new finalizer crash bug (Ort::TypeInfo). It is highly recommended to refactor the Node.js bindings to use std::shared_ptr<OrtObjects> instead of manual ref_count and N-API lifecycle hooks. This will drastically simplify the code, guarantee robust safety against N-API GC finalizer ordering complexities, and elegantly solve the underlying ExitProcess DLL unloading crash.

fs-eire · 2026-03-10T23:43:39Z

Updated according to @tianleiwu's comments:

1.(a): modified to deal with inputTypes_ and outputTypes_.

1.(b): I cannot use shared_ptr because it cannot fulfill one requirement: if only a worker imports 'onnxruntime-node' and that worker exits, the OrtEnv should not be destructed, in case another worker is spawned and uses 'onnxruntime-node' too.

Using NoDestructor does not work in this scenario because my change still tries to support memory leak detectors on a best-effort basis. With NoDestructor, there is no way to actually release the underlying object.

fs-eire requested a review from Copilot March 5, 2026 22:57

Copilot AI reviewed Mar 5, 2026

Copilot started reviewing on behalf of fs-eire March 5, 2026 23:29

Copilot started reviewing on behalf of fs-eire March 5, 2026 23:30

Copilot started reviewing on behalf of fs-eire March 5, 2026 23:31

fs-eire requested a review from Copilot March 6, 2026 00:12

Copilot started reviewing on behalf of fs-eire March 6, 2026 00:14

Copilot AI reviewed Mar 6, 2026

fs-eire marked this pull request as ready for review March 6, 2026 00:50

guschmue previously approved these changes Mar 6, 2026

fs-eire dismissed guschmue’s stale review via ddd0a35 March 7, 2026 01:07

fs-eire force-pushed the fs-eire/fix-webgpu-crash-at-exit branch from 240d6fd to ddd0a35 March 7, 2026 01:07

fs-eire requested a review from Copilot March 7, 2026 01:52

Copilot started reviewing on behalf of fs-eire March 7, 2026 01:53

Copilot AI reviewed Mar 7, 2026

fs-eire requested review from Copilot and guschmue March 7, 2026 02:48

Copilot started reviewing on behalf of fs-eire March 7, 2026 02:49

Copilot AI reviewed Mar 7, 2026

fs-eire requested a review from Copilot March 8, 2026 01:57

Copilot started reviewing on behalf of fs-eire March 8, 2026 01:58

Copilot AI reviewed Mar 8, 2026

guschmue previously approved these changes Mar 10, 2026

tianleiwu reviewed Mar 10, 2026

fs-eire dismissed guschmue’s stale review via ec33395 March 10, 2026 23:12

fs-eire mentioned this pull request Mar 11, 2026 in "Make WebGPU EP compatible with EP API" #26907 (open)

fs-eire added 2 commits March 11, 2026 16:04

Update .NET action version

d24f6a2

Fix WebGPU EP crash on exit

f21086a

fs-eire and others added 9 commits March 11, 2026 16:04

node leak

4ba644d

Apply suggestion from @Copilot

cb7616b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

use a best-effort pattern to work with safe release

5c1b220

fix one memory leak in buffer manager

7c77e4c

device callback: only log when default logger exists

1bfb87e

no release session if later than ortenv destruct

ae7a378

add type_info

22b65df

remove unnecessary headers

dc8070f

explicitly dispose type info in InferenceSessionWrap::Dispose

86112ba

fs-eire force-pushed the fs-eire/fix-webgpu-crash-at-exit branch from 9f54c7d to 86112ba March 11, 2026 23:05

Merge remote-tracking branch 'origin/main' into fs-eire/fix-webgpu-crash-at-exit

189b700

guschmue approved these changes Mar 12, 2026

fs-eire merged commit 2b8176c into main Mar 12, 2026
94 of 108 checks passed

fs-eire deleted the fs-eire/fix-webgpu-crash-at-exit branch March 12, 2026 18:34

fs-eire merged 12 commits into main from fs-eire/fix-webgpu-crash-at-exit

4 participants
