5090 GPU (via OCuLink PCIe4x4) Falls Off Bus with Xid 79 Under Load #900

@valleyUp

Description

NVIDIA Open GPU Kernel Modules Version

575.64.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

Linux GEM12 6.15.6-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 10 Jul 2025 17:10:03 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-f48cb438-105c-843a-d146-fc2e3e282f1b)

Describe the bug

I am experiencing a critical stability issue when using a GB202-based NVIDIA GPU (GeForce RTX 5090) connected via an OCuLink external PCIe interface.

The system boots correctly, and the NVIDIA driver loads. The GPU is visible via nvidia-smi and appears functional at idle. However, as soon as any significant computational load is applied (e.g., starting a deep learning training workload), the GPU immediately disconnects from the PCIe bus.
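Because the card sits behind an OCuLink PCIe 4.0 x4 link, it may help to record the negotiated link speed and width while the GPU is still idle and visible. The following is a minimal sketch that reads the standard Linux sysfs PCI attributes; the helper name `read_link_status` and the device address in the usage note are assumptions for illustration, not values taken from this report.

```python
from pathlib import Path

# Standard sysfs attributes exposed for PCIe devices on Linux.
LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_status(device_dir):
    """Return the negotiated and maximum PCIe link speed/width for one device.

    `device_dir` is a sysfs device directory. Attributes that are missing
    (for example after the device has dropped off the bus) are reported as None.
    """
    device_dir = Path(device_dir)
    status = {}
    for name in LINK_ATTRS:
        attr = device_dir / name
        status[name] = attr.read_text().strip() if attr.exists() else None
    return status
```

A call such as `read_link_status("/sys/bus/pci/devices/0000:01:00.0")` (hypothetical address; the real one can be found with `lspci`) would show whether the link trained at the expected speed and width before any load is applied.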

The kernel log confirms this with an Xid 79 error, indicating the GPU has "fallen off the bus".
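To pull Xid events out of a saved kernel log (for example the output of `dmesg`), a small parser along these lines could be used. This is a sketch that assumes the usual `NVRM: Xid (<device>): <code>, ...` line layout; the function name `find_xid_events` is hypothetical, and the sample line is illustrative rather than copied from the attached log.

```python
import re

# Matches the device and numeric code in an "NVRM: Xid (...): NN," line.
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+),")


def find_xid_events(log_text):
    """Return a list of (device, xid_code) tuples found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("dev"), int(match.group("code"))))
    return events


# Illustrative line only; the PCI address is a placeholder.
sample = (
    "NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, "
    "GPU has fallen off the bus."
)
print(find_xid_events(sample))
```

Filtering the result for code 79 gives the timestamps to line up against the moment the workload started.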

Notably, even during the initial driver loading process at boot, there are GSP firmware bootstrap errors, which may be a precursor to the main issue.

After a reboot, the GPU is detected normally again, but it still cannot be used: it fails in the same way as soon as load is applied.

To Reproduce

  1. Boot the system with the GPU connected via the OCuLink adapter.
  2. Wait for the desktop environment or command line to become available. Confirm the GPU is detected using nvidia-smi.
  3. Initiate a heavy, sustained GPU workload (e.g., torch.ones(10000, 10000).cuda() @ torch.ones(10000, 10000).cuda() in a loop, or running a CUDA sample).
  4. Within seconds, the GPU becomes unresponsive, and the application using it crashes.
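Step 3 can be written out as a small script. This is a sketch under assumptions: it requires PyTorch with CUDA support, the 60-second duration is arbitrary, and the helper names `sustained_load` and `run_matmul_load` are made up for illustration. The explicit `torch.cuda.synchronize()` is there so that a failure surfaces at the iteration where it happens instead of at some later call.

```python
import time


def sustained_load(step, seconds, clock=time.monotonic):
    """Call step() repeatedly for roughly `seconds`; return the iteration count."""
    deadline = clock() + seconds
    iterations = 0
    while clock() < deadline:
        step()
        iterations += 1
    return iterations


def run_matmul_load(seconds=60.0, size=10000):
    """Run the matrix multiply from step 3 in a loop on the default CUDA device."""
    import torch  # imported lazily so sustained_load() works without PyTorch

    a = torch.ones(size, size).cuda()
    b = torch.ones(size, size).cuda()

    def step():
        a @ b                      # queue the matrix multiply on the GPU
        torch.cuda.synchronize()   # wait for it, so errors are raised here

    count = sustained_load(step, seconds)
    print(f"completed {count} matmul iterations without error")
    return count
```

Calling `run_matmul_load()` on the affected machine should trigger the failure within seconds if the issue reproduces; on a healthy setup it simply prints the iteration count.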

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response


Labels

bug (Something isn't working)
