Description
NVIDIA Open GPU Kernel Modules Version
575.64.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Arch Linux
Kernel Release
Linux GEM12 6.15.6-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 10 Jul 2025 17:10:03 +0000 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-f48cb438-105c-843a-d146-fc2e3e282f1b)
Describe the bug
I am experiencing a critical stability issue when using an NVIDIA GeForce RTX 5090 connected via an OCuLink external PCIe interface.
The system boots correctly, and the NVIDIA driver loads. The GPU is visible via nvidia-smi and appears functional at idle. However, as soon as any significant computational load is applied (e.g., starting a deep learning training workload), the GPU immediately disconnects from the PCIe bus.
The kernel log confirms this with an Xid 79 error, which indicates that the GPU has "fallen off the bus".
Notably, even during the initial driver loading process at boot, there are GSP firmware bootstrap errors, which may be a precursor to the main issue.
After a reboot, the GPU is detected normally again, but it fails in the same way as soon as a load is applied.
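For anyone trying to confirm the same symptom, the Xid 79 event mentioned above can be picked out of saved kernel log text with a short script. This is only a sketch: the function name `find_xid_events` is hypothetical, the regular expression assumes the usual `NVRM: Xid (PCI:<address>): <code>, ...` layout, and the sample line is illustrative rather than copied from the attached nvidia-bug-report.log.gz.

```python
import re

# Assumed layout of an NVRM Xid line in the kernel log:
#   NVRM: Xid (PCI:<bus address>): <code>, <details>
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<code>\d+),"
)


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("bdf"), int(match.group("code"))))
    return events


# Illustrative sample line (not taken from the reporter's log).
sample = (
    "[  123.456789] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', "
    "name=<unknown>, GPU has fallen off the bus."
)
print(find_xid_events(sample))
```

The text to scan could come from `dmesg` or `journalctl -k` output saved to a file. An entry with code 79 corresponds to the "fallen off the bus" condition described here.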
To Reproduce
- Boot the system with the GPU connected via the OCuLink adapter.
- Wait for the desktop environment or command line to become available. Confirm the GPU is detected using nvidia-smi.
- Initiate a heavy, sustained GPU workload (e.g., torch.ones(10000, 10000).cuda() @ torch.ones(10000, 10000).cuda() in a loop, or running a CUDA sample).
- Within seconds, the GPU becomes unresponsive, and the application using it crashes.
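The load from the third step can be written out as a small standalone script. This is a minimal sketch of that matrix-multiply loop, assuming PyTorch with CUDA support is installed; the function name `run_matmul_loop`, the 30-second default duration, and the explicit synchronization are choices made for illustration and are not part of the original reproduction.

```python
import time


def run_matmul_loop(size=10000, seconds=30.0, device="cuda"):
    """Multiply two size x size matrices of ones on `device` until time runs out.

    Returns the number of completed multiplications.
    """
    # Imported lazily so this file can be loaded on a machine without PyTorch.
    import torch

    a = torch.ones(size, size, device=device)
    b = torch.ones(size, size, device=device)

    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        _ = a @ b
        if device.startswith("cuda"):
            # Wait for the kernel to finish so the GPU stays under sustained load
            # and a failure surfaces in this loop rather than later.
            torch.cuda.synchronize()
        iterations += 1
    return iterations
```

Calling `run_matmul_loop()` with the defaults keeps the GPU busy for about 30 seconds. On the affected setup the call would be expected to fail with a CUDA error within the first few seconds instead of returning an iteration count.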
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response