Fix WebGPU EP crash on exit #27569

Merged
fs-eire merged 12 commits into main from
fs-eire/fix-webgpu-crash-at-exit
Mar 12, 2026
@fs-eire fs-eire commented Mar 5, 2026

Description

Fixes multiple issues related to crashes and memory leaks.

  1. Fix an uncommon case in which BucketCacheManager may still hold pending buffers while the WebGPU context is being cleaned up, which causes a memory leak.
  2. Change the WebGPU default instance from an RAII wrapper (wgpu::Instance) to a raw pointer (WGPUInstance) so that it is not destructed automatically at process exit. Automatic destruction may crash because it accesses DXC code after dxcompiler.dll has already been unloaded.
  3. Fix a crash that occurs when the default ORT logger is destructed before a WebGPU device. The device callbacks are now guarded by the condition logging::LoggingManager::HasDefaultLogger().
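
Items 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `FakeInstance`, `ContextFactory`, `g_has_default_logger`, and `OnDeviceError` are hypothetical stand-ins for WGPUInstance, WebGpuContextFactory, logging::LoggingManager::HasDefaultLogger(), and a device callback.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a native handle type such as WGPUInstance.
struct FakeInstance {
  static int live_count;
  FakeInstance() { ++live_count; }
  ~FakeInstance() { --live_count; }
};
int FakeInstance::live_count = 0;

// Hypothetical stand-in for logging::LoggingManager::HasDefaultLogger().
bool g_has_default_logger = true;
std::vector<std::string> g_log_lines;

// Device callback: logs only while the default logger still exists.
void OnDeviceError(const std::string& message) {
  if (!g_has_default_logger) return;  // logger already destructed: skip logging
  g_log_lines.push_back(message);
}

class ContextFactory {
 public:
  static FakeInstance* GetOrCreate() {
    if (instance_ == nullptr) instance_ = new FakeInstance();
    assert(instance_ != nullptr);
    return instance_;
  }

  // Must be called explicitly. If it is never called, the instance is
  // leaked at process exit rather than destroyed by a static destructor.
  static void Cleanup() {
    delete instance_;
    instance_ = nullptr;
  }

 private:
  static FakeInstance* instance_;  // raw pointer: no automatic destruction
};
FakeInstance* ContextFactory::instance_ = nullptr;
```

Because the static member is a plain pointer, nothing runs for it during static destruction; release happens only if `Cleanup()` is called while the dependent code is still loaded.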

Also includes a few fixes related to the Node.js binding.

  1. OrtEnv was used as a function-local variable. This is problematic because OrtEnv may be destructed too late, after some DLLs have already been unloaded. (The order of DLL unloading at process exit is not fully controllable.) Changed to:
    • If OrtEnv is constructed on the main thread, a cleanup hook is registered that runs when Node.js starts to exit. If the hook is not called (e.g. an uncaught exception is thrown), OrtEnv is not released.
    • If OrtEnv is constructed on a worker thread, it is left alone and allowed to leak at exit.
  2. Because of (1), if OrtEnv has already been released, active sessions are not released (they are object wraps that are destructed later than the registered hooks).
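
The main-thread versus worker-thread policy in item 1 can be sketched as below. This is a simplified simulation, not the binding's code: `FakeOrtEnv`, `FakeNodeEnv`, and `GetOrCreateOrtEnv` are hypothetical names, and the hook list stands in for whatever cleanup-hook mechanism the Node.js runtime provides.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for OrtEnv; counts live instances.
struct FakeOrtEnv {
  static int live_count;
  FakeOrtEnv() { ++live_count; }
  ~FakeOrtEnv() { --live_count; }
};
int FakeOrtEnv::live_count = 0;

// Hypothetical stand-in for a Node.js environment that runs its
// registered cleanup hooks on a graceful exit.
struct FakeNodeEnv {
  bool is_main_thread;
  std::vector<std::function<void()>> cleanup_hooks;

  void RunCleanupHooks() {
    for (auto& hook : cleanup_hooks) hook();
    cleanup_hooks.clear();
  }
};

FakeOrtEnv* g_ort_env = nullptr;  // raw pointer: never destroyed automatically

// Returns the process-wide env, creating it on first use.
FakeOrtEnv* GetOrCreateOrtEnv(FakeNodeEnv& node_env) {
  if (g_ort_env == nullptr) {
    g_ort_env = new FakeOrtEnv();
    if (node_env.is_main_thread) {
      // Main thread: release the env when the environment shuts down.
      node_env.cleanup_hooks.push_back([] {
        delete g_ort_env;
        g_ort_env = nullptr;
      });
    }
    // Worker thread: no hook is registered, so the env leaks at exit.
  }
  assert(g_ort_env != nullptr);
  return g_ort_env;
}
```

If the hooks never run, for example after an uncaught exception, the pointer is simply never deleted, which matches the intentional leak described above.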

Together, the changes above cover different scenarios and ensure that:

  • if any resource is intentionally leaked, the leak happens only at process exit.
  • if the process is not exiting, resource lifecycles are managed correctly.
  • resources are released safely on a best-effort basis (not guaranteed), to stay friendly to memory leak detectors.

@fs-eire fs-eire requested a review from Copilot March 5, 2026 22:57

Copilot AI left a comment

Pull request overview

This PR fixes a crash-on-exit issue in the WebGPU execution provider caused by DLL unload ordering. When WebGpuContextFactory::Cleanup() runs, dependent DLLs like dxcompiler.dll may have already been unloaded, leading to crashes during resource destruction.

Changes:

  • Explicitly loads and holds references to dxil.dll and dxcompiler.dll to prevent premature unloading.
  • Changes default_instance_ from wgpu::Instance (C++ RAII wrapper) to raw WGPUInstance for explicit lifetime control.
  • Reorders cleanup in WebGpuContextFactory::Cleanup() to ensure resources are released before DLL handles.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
onnxruntime/core/providers/webgpu/webgpu_context.h | Reorders static members, changes default_instance_ to raw WGPUInstance, adds modules_ and modules_dxc_loaded_ fields.
onnxruntime/core/providers/webgpu/webgpu_context.cc | Moves context map allocation inside instance creation block, adds DXC DLL loading logic, implements explicit cleanup with correct ordering.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


@fs-eire fs-eire marked this pull request as ready for review March 6, 2026 00:50
guschmue previously approved these changes Mar 6, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


guschmue previously approved these changes Mar 10, 2026

@tianleiwu tianleiwu left a comment

PR 27569 Review

Findings & Architectural Feedback (C/C++ Core & Execution Providers)

1. Node.js Bindings & N-API Lifetime Control

The PR attempts to solve a static destruction order fiasco—specifically where dxcompiler.dll or other libraries unload before Ort::Env destructs—by using manual reference counting, is_main_thread checks, and N-API environment cleanup hooks (OrtSingletonData::OrtObjects).

  • Correctness Bug (InferenceSessionWrap Leak Logic is Incomplete):
    The new shutdown guard at js/node/src/inference_session_wrap.cc:60 introduces a protective leak (if (!OrtSingletonData::GetOrtObjects()) { ... release() }) to avoid hitting a finalized OrtEnv during GC. However, it only leaks session_ and ioBinding_ after OrtSingletonData has already been torn down. InferenceSessionWrap still owns inputTypes_ and outputTypes_ (js/node/src/inference_session_wrap.h:89-91). Those members are Ort::TypeInfo RAII wrappers, and their base destructor still calls OrtApi::ReleaseTypeInfo (include/onnxruntime/core/session/onnxruntime_cxx_api.h:706). That means a late N-API finalizer can still re-enter ORT after the singleton/env was deliberately destroyed, which leaves the same crash window this PR is trying to close. The destructor needs to stop releasing those Ort::TypeInfo members as well, or the metadata needs to be cached in plain C++/JS data instead of owning ORT handles past shutdown.

  • Architectural Feedback (std::shared_ptr over manual ref-counting):
    There is a dramatically better, standard C++ approach to handle this lifetime requirement that avoids the mutable global state (ref_count, ort_singleton_mutex), removes the need for cleanup hooks, removes the worker thread edge cases, and natively solves the InferenceSessionWrap finalizer ordering bug (including the Ort::TypeInfo bug) without manual is_main_thread logic:

    1. Wrap the OrtObjects singleton in a std::shared_ptr<OrtObjects> when instantiated.
    2. Store a copy of this std::shared_ptr inside the N-API environment context (OrtInstanceData).
    3. Pass a copy of this std::shared_ptr to every InferenceSessionWrap upon creation.

    Why this is better: A std::shared_ptr guarantees that the Ort::Env singleton stays alive exactly as long as either the Node.js environment is active or any InferenceSessionWrap object is still awaiting JS garbage collection. When the last session is garbage collected and the N-API environment tears down, the shared_ptr naturally drops to zero and destroys Ort::Env cleanly. If the process terminates abruptly (e.g., process.exit()), the shared_ptr instances purposefully leak, completely preventing the Windows ExitProcess DLL unloading crash without manual intervention. You can safely remove the N-API cleanup hooks and the intentional bypass leak inside ~InferenceSessionWrap.
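
The three steps suggested above can be sketched roughly as follows. All names (`FakeOrtObjects`, `InstanceData`, `SessionWrap`) are hypothetical stand-ins for illustration, not the binding's actual types.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the OrtObjects singleton; counts live instances.
struct FakeOrtObjects {
  static int live_count;
  FakeOrtObjects() { ++live_count; }
  ~FakeOrtObjects() { --live_count; }
};
int FakeOrtObjects::live_count = 0;

// Steps 1 and 2: the per-environment data owns one shared reference.
struct InstanceData {
  std::shared_ptr<FakeOrtObjects> ort = std::make_shared<FakeOrtObjects>();
};

// Step 3: every session wrapper holds its own copy of the shared_ptr.
class SessionWrap {
 public:
  explicit SessionWrap(std::shared_ptr<FakeOrtObjects> ort)
      : ort_(std::move(ort)) {
    assert(ort_ != nullptr);
  }

 private:
  std::shared_ptr<FakeOrtObjects> ort_;  // keeps the singleton alive
};
```

In this sketch the singleton is destroyed only when both the environment data and the last session wrapper are gone, regardless of which is finalized first.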

2. WebGPU Context Factory (WebGpuContextFactory)

The PR alters wgpu::Instance default_instance_ to a raw handle WGPUInstance and dynamically allocates contexts_ to intentionally defer or leak their destruction until WebGpuContextFactory::Cleanup() is manually called by OrtEnv::~OrtEnv().

  • Architectural Feedback (Leaky Singletons):
    Changing the C++ RAII wrapper wgpu::Instance back to the raw WGPUInstance handle is functionally correct here. Since the static destructor for wgpu::Instance runs indiscriminately during ExitProcess(), relying on the raw handle manually cleaned up by OrtEnv::~OrtEnv() safely bypasses the DLL unloading crash.
    However, using a raw new for contexts_ slightly violates the strict guideline against raw memory allocation. Since ONNX Runtime does not natively offer an absl::NoDestructor<T> equivalent to safely construct lock-free leaky singletons without heap allocations, documenting this raw pointer as a necessary "construct on first use, leak at exit" workaround is acceptable.
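
For reference, the "construct on first use, leak at exit" idea behind a NoDestructor-style wrapper can be sketched as below. `LeakyStatic`, `Registry`, and `GetRegistry` are hypothetical names for illustration; this is neither an ONNX Runtime type nor Abseil's implementation.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Minimal NoDestructor-style wrapper: constructs T in inline storage
// and never runs T's destructor.
template <typename T>
class LeakyStatic {
 public:
  template <typename... Args>
  explicit LeakyStatic(Args&&... args) {
    ptr_ = new (storage_) T(std::forward<Args>(args)...);  // placement new, no heap
  }

  T& get() {
    assert(ptr_ != nullptr);
    return *ptr_;
  }

 private:
  alignas(T) unsigned char storage_[sizeof(T)];
  T* ptr_ = nullptr;
};

// Example payload that records whether its destructor ever ran.
struct Registry {
  static int destroyed;
  int value = 42;
  ~Registry() { ++destroyed; }
};
int Registry::destroyed = 0;

Registry& GetRegistry() {
  // Function-local static: initialized on first call, thread-safe since C++11.
  static LeakyStatic<Registry> instance;
  return instance.get();
}
```

The trade-off of this design is that the wrapped object can never be released explicitly, which matters when an explicit cleanup path is also wanted.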

Notes

  • I did not find an obvious correctness issue in the WebGPU changes themselves. The BucketCacheManager pending-buffer fix and the HasDefaultLogger() guards both look directionally right.
  • I did not run the Node/WebGPU test matrix locally.

Conclusion and Recommendations

While the PR effectively targets the crash-on-exit semantics, its Node.js bindings rely heavily on extremely manual layout management that introduces a new finalizer crash bug (Ort::TypeInfo). It is highly recommended to refactor the Node.js bindings to use std::shared_ptr<OrtObjects> instead of manual ref_count and N-API lifecycle hooks. This will drastically simplify the code, guarantee robust safety against N-API GC finalizer ordering complexities, and elegantly solve the underlying ExitProcess DLL unloading crash.

fs-eire commented Mar 10, 2026

Updated according to @tianleiwu's comments:

1.(a): Modified to handle inputTypes_ and outputTypes_ as well.

1.(b): I cannot use shared_ptr because it cannot fulfill one requirement: if only a worker imports 'onnxruntime-node' and that worker exits, OrtEnv should not be destructed, in case another worker is spawned and uses 'onnxruntime-node' too.

  1. Using NoDestructor does not work in this scenario because my change still tries to support memory leak detectors on a best-effort basis. There is no way to actually release the underlying object when using NoDestructor.
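
The handling of inputTypes_ and outputTypes_ mentioned in 1.(a) could look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `FakeHandle`, `RuntimeDeleter`, `g_runtime_alive`, and this `SessionWrap` are hypothetical, with `g_runtime_alive` standing in for a check such as OrtSingletonData::GetOrtObjects() returning non-null.

```cpp
#include <cassert>
#include <memory>
#include <vector>

int g_release_calls = 0;      // counts calls into the (hypothetical) runtime
bool g_runtime_alive = true;  // false once the runtime has been torn down

struct FakeHandle {};

// Deleter that calls back into the runtime, as an RAII wrapper's
// destructor would.
struct RuntimeDeleter {
  void operator()(FakeHandle* handle) const {
    assert(g_runtime_alive);  // releasing after teardown would be the bug
    ++g_release_calls;
    delete handle;
  }
};
using HandlePtr = std::unique_ptr<FakeHandle, RuntimeDeleter>;

class SessionWrap {
 public:
  SessionWrap() : session_(new FakeHandle()) {
    input_types_.push_back(HandlePtr(new FakeHandle()));
    output_types_.push_back(HandlePtr(new FakeHandle()));
  }

  ~SessionWrap() {
    if (!g_runtime_alive) {
      // Runtime already gone: deliberately leak every runtime-owned
      // handle, including the type-info members, instead of releasing.
      (void)session_.release();
      for (auto& type : input_types_) (void)type.release();
      for (auto& type : output_types_) (void)type.release();
    }
    // Otherwise the members' deleters run normally after this body.
  }

 private:
  HandlePtr session_;
  std::vector<HandlePtr> input_types_;
  std::vector<HandlePtr> output_types_;
};
```

The point of the sketch is that every member whose destructor would call into the runtime needs the same guard, not only the session handle.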

@fs-eire fs-eire force-pushed the fs-eire/fix-webgpu-crash-at-exit branch from 9f54c7d to 86112ba Compare March 11, 2026 23:05
@fs-eire fs-eire merged commit 2b8176c into main Mar 12, 2026
94 of 108 checks passed
@fs-eire fs-eire deleted the fs-eire/fix-webgpu-crash-at-exit branch March 12, 2026 18:34

4 participants