
[Bug Report] torch.distributed.run fails with CUDA_VISIBLE_DEVICES due to "CUDA being in bad state" in Isaac Sim #2756

@deyuf

Description


Hi Isaac Lab Team,

I'm encountering a critical error when running a distributed training script with torch.distributed.run and CUDA_VISIBLE_DEVICES set to a subset of the available GPUs. Although the environment variable is set, Isaac Sim fails to initialize the specified GPUs: it skips every GPU with a "CUDA being in bad state" warning and then reports that no device could be created.

This issue prevents multi-GPU distributed training on a specific selection of GPUs.

Steps to Reproduce:

System Configuration:

OS: Ubuntu 22.04.5 LTS
GPU: 8 x NVIDIA H100 80GB HBM3
NVIDIA Driver Version: 550.163.01
Isaac Sim Version: 4.5 (based on log paths)
Environment: micromamba
Command to run:
The following command is executed from the root of the Isaac Lab workspace. It attempts to launch a 2-process distributed training job on GPUs with physical IDs 5 and 6.

CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
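When the launch is scripted, the same command can be assembled programmatically. The sketch below uses a hypothetical helper, `build_torchrun_command` (not part of Isaac Lab or PyTorch), to make two points explicit: `CUDA_VISIBLE_DEVICES` is set once, in the parent environment, and `--nproc_per_node` should equal the number of listed GPUs.

```python
import shlex


def build_torchrun_command(gpus, script, script_args, nnodes=1):
    """Return (env_overrides, argv) for a single-node torchrun launch.

    Hypothetical helper that mirrors the command in this report: one worker
    process is requested per listed GPU.
    """
    env_overrides = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    argv = [
        "python", "-m", "torch.distributed.run",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={len(gpus)}",  # one process per visible GPU
        script,
        *script_args,
    ]
    return env_overrides, argv


env, argv = build_torchrun_command(
    gpus=[5, 6],
    script="scripts/reinforcement_learning/skrl/train.py",
    script_args=["--task=Isaac-Cartpole-v0", "--headless", "--distributed"],
)
# Print it as a shell line for comparison with the command above.
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))
```

To actually launch, the returned `env` would be merged into a copy of `os.environ` and passed with `argv` to `subprocess.run(argv, env=...)`. torchrun forwards that environment to every worker it spawns without narrowing `CUDA_VISIBLE_DEVICES`.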

Expected Behavior:

The script should launch two processes, each assigned to one of the specified GPUs (5 and 6). Isaac Sim should initialize correctly in each process, with each process's CUDA device mapped to the corresponding physical GPU (process 0 -> physical GPU 5, process 1 -> physical GPU 6).
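With this command, torchrun does not give each worker its own `CUDA_VISIBLE_DEVICES`: both processes inherit `5,6`, both see two CUDA devices, and each typically selects `cuda:<LOCAL_RANK>`. The `Environment device : cuda:0` and `cuda:1` lines in the full log are consistent with this. A minimal sketch of the mapping, where `device_for_worker` is a hypothetical helper and the worker is assumed to choose its device from `LOCAL_RANK`:

```python
def device_for_worker(env):
    """Return (logical CUDA device, physical GPU index) for one torchrun worker.

    Assumes CUDA_VISIBLE_DEVICES holds integer GPU indices (not UUIDs) and is
    inherited unchanged from the parent, and that the worker picks its device
    from LOCAL_RANK, which torchrun sets for each process.
    """
    local_rank = int(env["LOCAL_RANK"])
    visible = [int(x) for x in env["CUDA_VISIBLE_DEVICES"].split(",") if x.strip()]
    # CUDA renumbers the visible GPUs from 0, so logical ordinal i is visible[i].
    return f"cuda:{local_rank}", visible[local_rank]


# Simulate the two workers started by the command in this report.
for rank in range(2):
    worker_env = {"CUDA_VISIBLE_DEVICES": "5,6", "LOCAL_RANK": str(rank)}
    print(rank, device_for_worker(worker_env))
```

Running it prints `0 ('cuda:0', 5)` and `1 ('cuda:1', 6)`. Getting `cuda:0` in every process would instead require each worker to see a single GPU.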

Actual Behavior:

Isaac Sim fails to initialize a GPU device for the simulation. The logs indicate that although torch.distributed appears to start correctly, the Isaac Sim/Omniverse GPU foundation (gpu.foundation.plugin) cannot create a GPU device in either process.
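Because the output from both ranks is interleaved and most warnings are unrelated, it can help to filter the log down to the GPU-initialization plugins before reading it. A minimal sketch, assuming the `date time [elapsed] [Level] [source] message` layout of the Kit lines quoted in this report; the regular expression and the plugin list are assumptions, not an Omniverse API:

```python
import re
from collections import Counter

# Layout of the Kit log lines quoted in this report, e.g.
# 2025-06-23 14:15:00 [13,738ms] [Error] [gpu.foundation.plugin] No device could be created. ...
KIT_LINE = re.compile(
    r"^\S+ \S+ \[[\d,]+ms\] \[(?P<level>\w+)\] \[(?P<source>[^\]]+)\] (?P<message>.*)$"
)

# Sources involved in GPU initialization, taken from the messages in this report.
GPU_INIT_SOURCES = {
    "carb.cudainterop.plugin",
    "gpu.foundation.plugin",
    "omni.physx.foundation.plugin",
}


def key_messages(lines):
    """Yield (level, source, message) for Warning/Error lines from GPU_INIT_SOURCES."""
    for line in lines:
        match = KIT_LINE.match(line.strip())
        if match and match["level"] in ("Warning", "Error") and match["source"] in GPU_INIT_SOURCES:
            yield match["level"], match["source"], match["message"]


def summarize(lines):
    """Count identical key messages; timestamps are dropped, so repeats collapse."""
    return Counter(key_messages(lines))
```

For a saved copy of the output at a hypothetical `log_path`, `with open(log_path) as f: print(summarize(f))` would show how often each "Skipping NVIDIA GPU" warning and "No device could be created" error appeared.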

Key error messages from the log include:

A warning about CUDA_VISIBLE_DEVICES being set:

2025-06-23 13:59:03 [16,707ms] [Warning] [carb.cudainterop.plugin] CUDA_VISIBLE_DEVICES environment variable is set.
2025-06-23 13:59:03 [16,707ms] [Warning] [carb.cudainterop.plugin] Note CUDA device enumeration and Omniverse device enumeration are different.
2025-06-23 13:59:03 [16,707ms] [Warning] [carb.cudainterop.plugin] Setting CUDA_VISIBLE_DEVICES can lead to undesired behavior or crashes.
A repeated warning that the GPUs are being skipped because CUDA is in a "bad state":

2025-06-23 13:59:04 [16,983ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 13:59:04 [16,983ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
A final critical error stating no device could be created:

2025-06-23 13:59:04 [17,106ms] [Error] [gpu.foundation.plugin] No device could be created.
This suggests a conflict between how CUDA (and therefore PyTorch and torch.distributed) and Isaac Sim's gpu.foundation.plugin interpret CUDA_VISIBLE_DEVICES, which ultimately causes GPU initialization within the simulation to fail.
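One way to read the full log below is consistent with this: the Omniverse GPU table lists all eight H100s (indices 0 to 7), so it is not filtered by `CUDA_VISIBLE_DEVICES`, while the failing processes report `activeGpu 0` and `activeGpu 1`. If those indices refer to the system-wide list, they would name GPUs that CUDA has hidden in this run. This is an interpretation, not a confirmed cause. The sketch below shows the two numberings side by side; `omniverse_to_cuda_ordinal` is a hypothetical helper, and it assumes that both enumerations use the same base order (for CUDA, `CUDA_DEVICE_ORDER=PCI_BUS_ID` requests PCI bus order) and that the variable holds integer indices.

```python
def omniverse_to_cuda_ordinal(cuda_visible_devices, num_system_gpus):
    """Map each system-wide GPU index to its CUDA ordinal, or None if hidden.

    Assumes integer indices in CUDA_VISIBLE_DEVICES and that the system-wide
    list and CUDA share the same base ordering.
    """
    visible = [int(x) for x in cuda_visible_devices.split(",") if x.strip()]
    mapping = {index: None for index in range(num_system_gpus)}
    for ordinal, system_index in enumerate(visible):
        mapping[system_index] = ordinal  # CUDA renumbers visible GPUs from 0
    return mapping


mapping = omniverse_to_cuda_ordinal("5,6", num_system_gpus=8)
for system_index, ordinal in mapping.items():
    label = "hidden from CUDA" if ordinal is None else f"cuda:{ordinal}"
    print(f"Omniverse GPU {system_index}: {label}")
```

Under these assumptions only Omniverse GPUs 5 and 6 have CUDA ordinals (`cuda:0` and `cuda:1`); GPUs 0 and 1 are hidden from CUDA.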

Full Printout

''' Bash
$ CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO][AppLauncher]: Using device: cuda:0
[INFO][AppLauncher]: Using device: cuda:0

[INFO][AppLauncher]: Loading experience file: /raid/ws_df/ws/IsaacLab/apps/isaaclab.python.headless.kit
[INFO][AppLauncher]: Loading experience file: /raid/ws_df/ws/IsaacLab/apps/isaaclab.python.headless.kit
[Warning] [simulation_app.simulation_app] Modules: ['omni.kit_app'] were loaded before SimulationApp was started and might not be loaded correctly.
[Warning] [simulation_app.simulation_app] Modules: ['omni.kit_app'] were loaded before SimulationApp was started and might not be loaded correctly.
[Warning] [simulation_app.simulation_app] Please check to make sure no extra omniverse or pxr modules are imported before the call to SimulationApp(...)
[Warning] [simulation_app.simulation_app] Please check to make sure no extra omniverse or pxr modules are imported before the call to SimulationApp(...)
Loading user config located at: '/home/deyu.fu/micromamba/envs/dex/lib/python3.10/site-packages/omni/data/Kit/Isaac-Sim/4.5/user.config.json'
[Info] [carb] Logging to file: /home/deyu.fu/micromamba/envs/dex/lib/python3.10/site-packages/omni/logs/Kit/Isaac-Sim/4.5/kit_20250623_141446.log
Loading user config located at: '/home/deyu.fu/micromamba/envs/dex/lib/python3.10/site-packages/omni/data/Kit/Isaac-Sim/4.5/user.config.json'
2025-06-23 14:14:46 [0ms] [Warning] [omni.kit.app.plugin] No crash reporter present, dumps uploading isn't available.
[Info] [carb] Logging to file: /home/deyu.fu/micromamba/envs/dex/lib/python3.10/site-packages/omni/logs/Kit/Isaac-Sim/4.5/kit_20250623_141446.log
2025-06-23 14:14:46 [0ms] [Warning] [omni.kit.app.plugin] No crash reporter present, dumps uploading isn't available.
2025-06-23 14:14:46 [2ms] [Warning] [omni.ext.plugin] [ext: rendering_modes] Extensions config 'extension.toml' doesn't exist '/raid/ws_df/ws/IsaacLab/apps/rendering_modes' or '/raid/ws_df/ws/IsaacLab/apps/rendering_modes/config'
2025-06-23 14:14:46 [3ms] [Warning] [omni.ext.plugin] [ext: rendering_modes] Extensions config 'extension.toml' doesn't exist '/raid/ws_df/ws/IsaacLab/apps/rendering_modes' or '/raid/ws_df/ws/IsaacLab/apps/rendering_modes/config'
2025-06-23 14:14:46 [140ms] [Warning] [omni.usd_config.extension] Enable omni.materialx.libs extension to use MaterialX
2025-06-23 14:14:46 [143ms] [Warning] [omni.usd_config.extension] Enable omni.materialx.libs extension to use MaterialX
2025-06-23 14:14:46 [352ms] [Warning] [omni.platforminfo.plugin] failed to open the default display. Can't verify X Server version.
2025-06-23 14:14:46 [352ms] [Warning] [omni.platforminfo.plugin] failed to open the default display. Can't verify X Server version.
2025-06-23 14:14:46 [352ms] [Warning] [omni.platforminfo.plugin] failed to load XRAndR.
2025-06-23 14:14:46 [352ms] [Warning] [omni.platforminfo.plugin] failed to load XRAndR.
2025-06-23 14:14:46 [471ms] [Warning] [omni.datastore] OmniHub is inaccessible
2025-06-23 14:14:46 [486ms] [Warning] [omni.datastore] OmniHub is inaccessible
2025-06-23 14:14:47 [563ms] [Warning] [omni.isaac.dynamic_control] omni.isaac.dynamic_control is deprecated as of Isaac Sim 4.5. No action is needed from end-users.
2025-06-23 14:14:47 [587ms] [Warning] [omni.isaac.dynamic_control] omni.isaac.dynamic_control is deprecated as of Isaac Sim 4.5. No action is needed from end-users.
2025-06-23 14:15:00 [13,687ms] [Warning] [carb.cudainterop.plugin] CUDA_VISIBLE_DEVICES environment variable is set.
2025-06-23 14:15:00 [13,687ms] [Warning] [carb.cudainterop.plugin] Note CUDA device enumeration and Omniverse device enumeration are different.
2025-06-23 14:15:00 [13,687ms] [Warning] [carb.cudainterop.plugin] Setting CUDA_VISIBLE_DEVICES can lead to undesired behavior or crashes.
2025-06-23 14:15:00 [13,702ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,702ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,706ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,706ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,711ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,711ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,716ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,716ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,724ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,725ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,736ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,738ms] [Warning] [gpu.foundation.plugin] activeGpu 0 is not compatible with current foundation settings.

|----------------------------------------------------------------------------------|
| Driver Version: 550.163.01 | Graphics API: Vulkan
|==================================================================================|
| GPU | Name                  | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                       |        |     |            | Device-ID | UUID       |
|     |                       |        |     |            | Bus-ID    |            |
|----------------------------------------------------------------------------------|
| 0   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b0cfac99.. |
|     |                       |        |     |            | 1b        |            |
|----------------------------------------------------------------------------------|
| 1   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 19ffc695.. |
|     |                       |        |     |            | 43        |            |
|----------------------------------------------------------------------------------|
| 2   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 943f111d.. |
|     |                       |        |     |            | 52        |            |
|----------------------------------------------------------------------------------|
| 3   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 32e877ce.. |
|     |                       |        |     |            | 61        |            |
|----------------------------------------------------------------------------------|
| 4   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 5b4f3983.. |
|     |                       |        |     |            | 9d        |            |
|----------------------------------------------------------------------------------|
| 5   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 834d67a2.. |
|     |                       |        |     |            | c3        |            |
|----------------------------------------------------------------------------------|
| 6   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | fa8479ed.. |
|     |                       |        |     |            | d1        |            |
|----------------------------------------------------------------------------------|
| 7   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b58259d5.. |
|     |                       |        |     |            | df        |            |
|==================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 5.15.0-1078-nvidia
| Processor: Intel(R) Xeon(R) Platinum 8480C
| Cores: 112 | Logical Cores: 224
|----------------------------------------------------------------------------------|
| Total Memory (MB): 2063936 | Free Memory: 1994771
| Total Page/Swap (MB): 0 | Free Page/Swap: 0
|----------------------------------------------------------------------------------|
2025-06-23 14:15:00 [13,738ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:

  • The driver is not installed properly and requires a clean re-install.
  • Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
  • The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
  • For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
  • For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.

2025-06-23 14:15:00 [13,842ms] [Warning] [carb.cudainterop.plugin] CUDA_VISIBLE_DEVICES environment variable is set.
2025-06-23 14:15:00 [13,842ms] [Warning] [carb.cudainterop.plugin] Note CUDA device enumeration and Omniverse device enumeration are different.
2025-06-23 14:15:00 [13,842ms] [Warning] [carb.cudainterop.plugin] Setting CUDA_VISIBLE_DEVICES can lead to undesired behavior or crashes.
2025-06-23 14:15:00 [13,923ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,923ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,950ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,951ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,959ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,959ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,968ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,968ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [13,981ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [13,981ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:15:00 [14,024ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:15:00 [14,025ms] [Warning] [gpu.foundation.plugin] activeGpu 1 is not compatible with current foundation settings.

|----------------------------------------------------------------------------------|
| Driver Version: 550.163.01 | Graphics API: Vulkan
|==================================================================================|
| GPU | Name                  | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                       |        |     |            | Device-ID | UUID       |
|     |                       |        |     |            | Bus-ID    |            |
|----------------------------------------------------------------------------------|
| 0   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b0cfac99.. |
|     |                       |        |     |            | 1b        |            |
|----------------------------------------------------------------------------------|
| 1   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 19ffc695.. |
|     |                       |        |     |            | 43        |            |
|----------------------------------------------------------------------------------|
| 2   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 943f111d.. |
|     |                       |        |     |            | 52        |            |
|----------------------------------------------------------------------------------|
| 3   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 32e877ce.. |
|     |                       |        |     |            | 61        |            |
|----------------------------------------------------------------------------------|
| 4   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 5b4f3983.. |
|     |                       |        |     |            | 9d        |            |
|----------------------------------------------------------------------------------|
| 5   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 834d67a2.. |
|     |                       |        |     |            | c3        |            |
|----------------------------------------------------------------------------------|
| 6   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | fa8479ed.. |
|     |                       |        |     |            | d1        |            |
|----------------------------------------------------------------------------------|
| 7   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b58259d5.. |
|     |                       |        |     |            | df        |            |
|==================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 5.15.0-1078-nvidia
| Processor: Intel(R) Xeon(R) Platinum 8480C
| Cores: 112 | Logical Cores: 224
|----------------------------------------------------------------------------------|
| Total Memory (MB): 2063936 | Free Memory: 1994769
| Total Page/Swap (MB): 0 | Free Page/Swap: 0
|----------------------------------------------------------------------------------|
2025-06-23 14:15:00 [14,025ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:

  • The driver is not installed properly and requires a clean re-install.
  • Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
  • The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
  • For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
  • For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.

2025-06-23 14:15:00 [14,436ms] [Error] [gpu.foundation.plugin] Invalid getDeviceInfo parameters.
2025-06-23 14:15:00 [14,436ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Couldn't get driver version, failed to getDeviceInfo
2025-06-23 14:15:00 [14,436ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!
2025-06-23 14:15:01 [14,563ms] [Error] [gpu.foundation.plugin] Invalid getDeviceInfo parameters.
2025-06-23 14:15:01 [14,563ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Couldn't get driver version, failed to getDeviceInfo
2025-06-23 14:15:01 [14,563ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!
2025-06-23 14:15:01 [14,627ms] [Warning] [omni.kvdb.plugin] Disabling key-value database because another kit process is locking it
[skrl:INFO] Distributed (rank: 0, local rank: 0, world size: 2)
[skrl:INFO] Distributed (rank: 1, local rank: 1, world size: 2)
[INFO]: Parsing configuration from: isaaclab_tasks.manager_based.classic.cartpole.cartpole_env_cfg:CartpoleEnvCfg
[INFO]: Parsing configuration from: /raid/ws_df/ws/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/manager_based/classic/cartpole/agents/skrl_ppo_cfg.yaml
[INFO]: Parsing configuration from: isaaclab_tasks.manager_based.classic.cartpole.cartpole_env_cfg:CartpoleEnvCfg
[INFO]: Parsing configuration from: /raid/ws_df/ws/IsaacLab/source/isaaclab_tasks/isaaclab_tasks/manager_based/classic/cartpole/agents/skrl_ppo_cfg.yaml
[INFO] Logging experiment in directory: /raid/ws_df/ws/IsaacLab/logs/skrl/cartpole
Exact experiment name requested from command line: 2025-06-23_14-15-02_ppo_torch
[INFO] Logging experiment in directory: /raid/ws_df/ws/IsaacLab/logs/skrl/cartpole
Exact experiment name requested from command line: 2025-06-23_14-15-02_ppo_torch
Setting seed: 42
[INFO]: Base environment:
Environment device : cuda:0
Environment seed : 42
Physics step-size : 0.008333333333333333
Rendering step-size : 0.016666666666666666
Environment step-size : 0.016666666666666666
Setting seed: 42
[INFO]: Base environment:
Environment device : cuda:1
Environment seed : 42
Physics step-size : 0.008333333333333333
Rendering step-size : 0.016666666666666666
Environment step-size : 0.016666666666666666
[INFO]: Time taken for scene creation : 1.919798 seconds
[INFO]: Scene manager:
Number of environments: 4096
Environment spacing : 4.0
Source prim name : /World/envs/env_0
Global prim paths : []
Replicate physics : True
[INFO]: Starting the simulation. This may take a few seconds. Please wait...
[INFO]: Time taken for scene creation : 2.052524 seconds
[INFO]: Scene manager:
Number of environments: 4096
Environment spacing : 4.0
Source prim name : /World/envs/env_0
Global prim paths : []
Replicate physics : True
[INFO]: Starting the simulation. This may take a few seconds. Please wait...
2025-06-23 14:15:05 [19,163ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-06-23 14:15:05 [19,163ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
2025-06-23 14:15:05 [19,176ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-06-23 14:15:05 [19,176ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
2025-06-23 14:15:05 [19,294ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-06-23 14:15:05 [19,294ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
2025-06-23 14:15:05 [19,306ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-06-23 14:15:05 [19,306ms] [Warning] [isaaclab.actuators.actuator_pd] The object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
[INFO]: Time taken for simulation start : 1.655693 seconds
[INFO] Command Manager: contains 0 active terms.
+------------------------+
| Active Command Terms |
+--------+-------+-------+
| Index | Name | Type |
+--------+-------+-------+
+--------+-------+-------+

[INFO] Event Manager: contains 1 active terms.
+-------------------------------------+
| Active Event Terms in Mode: 'reset' |
+---------+---------------------------+
| Index | Name |
+---------+---------------------------+
| 0 | reset_cart_position |
| 1 | reset_pole_position |
+---------+---------------------------+

[INFO] Recorder Manager: contains 0 active terms.
+---------------------+
| Active Recorder Terms |
+-----------+---------+
| Index | Name |
+-----------+---------+
+-----------+---------+

[INFO] Action Manager: contains 1 active terms.
+----------------------------------+
| Active Action Terms (shape: 1) |
+-------+--------------+-----------+
| Index | Name | Dimension |
+-------+--------------+-----------+
| 0 | joint_effort | 1 |
+-------+--------------+-----------+

[INFO] Observation Manager: contains 1 groups.
+------------------------------------------------------+
| Active Observation Terms in Group: 'policy' (shape: (4,)) |
+------------+----------------------------+------------+
| Index | Name | Shape |
+------------+----------------------------+------------+
| 0 | joint_pos_rel | (2,) |
| 1 | joint_vel_rel | (2,) |
+------------+----------------------------+------------+

[INFO] Termination Manager: contains 2 active terms.
+---------------------------------------+
| Active Termination Terms |
+-------+--------------------+----------+
| Index | Name | Time Out |
+-------+--------------------+----------+
| 0 | time_out | True |
| 1 | cart_out_of_bounds | False |
+-------+--------------------+----------+

[INFO] Reward Manager: contains 5 active terms.
+------------------------------+
| Active Reward Terms |
+-------+-------------+--------+
| Index | Name | Weight |
+-------+-------------+--------+
| 0 | alive | 1.0 |
| 1 | terminating | -2.0 |
| 2 | pole_pos | -1.0 |
| 3 | cart_vel | -0.01 |
| 4 | pole_vel | -0.005 |
+-------+-------------+--------+

[INFO] Curriculum Manager: contains 0 active terms.
+----------------------+
| Active Curriculum Terms |
+-----------+----------+
| Index | Name |
+-----------+----------+
+-----------+----------+

[INFO]: Completed setting up the environment...
[skrl:INFO] Environment wrapper: Isaac Lab (single-agent)
[2025-06-23 14:15:05,988][skrl][INFO] - Environment wrapper: Isaac Lab (single-agent)
[skrl:INFO] Seed: 43
[2025-06-23 14:15:05,988][skrl][INFO] - Seed: 43

Shared model (roles): ['policy', 'value']

class SharedModel(GaussianMixin,DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(
            self,
            clip_actions=False,
            clip_log_std=True,
            min_log_std=-20.0,
            max_log_std=2.0,
            reduction="sum",
            role="policy",
        )
        DeterministicMixin.__init__(self, clip_actions=False, role="value")

        self.net_container = nn.Sequential(
            nn.LazyLinear(out_features=32),
            nn.ELU(),
            nn.LazyLinear(out_features=32),
            nn.ELU(),
        )
        self.policy_layer = nn.LazyLinear(out_features=self.num_actions)
        self.log_std_parameter = nn.Parameter(torch.full(size=(self.num_actions,), fill_value=0.0), requires_grad=True)
        self.value_layer = nn.LazyLinear(out_features=1)

    def act(self, inputs, role):
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role=""):
        if role == "policy":
            states = unflatten_tensorized_space(self.observation_space, inputs.get("states"))
            taken_actions = unflatten_tensorized_space(self.action_space, inputs.get("taken_actions"))
            net = self.net_container(states)
            self._shared_output = net
            output = self.policy_layer(net)
            return output, self.log_std_parameter, {}
        elif role == "value":
            if self._shared_output is None:
                states = unflatten_tensorized_space(self.observation_space, inputs.get("states"))
                taken_actions = unflatten_tensorized_space(self.action_space, inputs.get("taken_actions"))
                net = self.net_container(states)
                shared_output = net
            else:
                shared_output = self._shared_output
            self._shared_output = None
            output = self.value_layer(shared_output)
            return output, {}

[skrl:INFO] Broadcasting models' parameters
[2025-06-23 14:15:06,025][skrl][INFO] - Broadcasting models' parameters
[INFO]: Time taken for simulation start : 1.702107 seconds
[INFO] Command Manager: contains 0 active terms.
+------------------------+
| Active Command Terms |
+--------+-------+-------+
| Index | Name | Type |
+--------+-------+-------+
+--------+-------+-------+

[INFO] Event Manager: contains 1 active terms.
+-------------------------------------+
| Active Event Terms in Mode: 'reset' |
+---------+---------------------------+
| Index | Name |
+---------+---------------------------+
| 0 | reset_cart_position |
| 1 | reset_pole_position |
+---------+---------------------------+

[INFO] Recorder Manager: contains 0 active terms.
+---------------------+
| Active Recorder Terms |
+-----------+---------+
| Index | Name |
+-----------+---------+
+-----------+---------+

[INFO] Action Manager: contains 1 active terms.
+----------------------------------+
| Active Action Terms (shape: 1) |
+-------+--------------+-----------+
| Index | Name | Dimension |
+-------+--------------+-----------+
| 0 | joint_effort | 1 |
+-------+--------------+-----------+

[INFO] Observation Manager: contains 1 groups.
+-----------------------------------------------------------+
| Active Observation Terms in Group: 'policy' (shape: (4,)) |
+-------+---------------------------------+-----------------+
| Index | Name                            | Shape           |
+-------+---------------------------------+-----------------+
| 0     | joint_pos_rel                   | (2,)            |
| 1     | joint_vel_rel                   | (2,)            |
+-------+---------------------------------+-----------------+

[INFO] Termination Manager: contains 2 active terms.
+---------------------------------------+
|       Active Termination Terms        |
+-------+--------------------+----------+
| Index | Name               | Time Out |
+-------+--------------------+----------+
| 0     | time_out           | True     |
| 1     | cart_out_of_bounds | False    |
+-------+--------------------+----------+

[INFO] Reward Manager: contains 5 active terms.
+------------------------------+
|     Active Reward Terms      |
+-------+-------------+--------+
| Index | Name        | Weight |
+-------+-------------+--------+
| 0     | alive       | 1.0    |
| 1     | terminating | -2.0   |
| 2     | pole_pos    | -1.0   |
| 3     | cart_vel    | -0.01  |
| 4     | pole_vel    | -0.005 |
+-------+-------------+--------+

[INFO] Curriculum Manager: contains 0 active terms.
+-------------------------+
| Active Curriculum Terms |
+-------+-----------------+
| Index | Name            |
+-------+-----------------+
+-------+-----------------+

[INFO]: Completed setting up the environment...
[skrl:INFO] Environment wrapper: Isaac Lab (single-agent)
[2025-06-23 14:15:06,143][skrl][INFO] - Environment wrapper: Isaac Lab (single-agent)
[skrl:INFO] Seed: 42
[2025-06-23 14:15:06,143][skrl][INFO] - Seed: 42

Shared model (roles): ['policy', 'value']

class SharedModel(GaussianMixin,DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(
            self,
            clip_actions=False,
            clip_log_std=True,
            min_log_std=-20.0,
            max_log_std=2.0,
            reduction="sum",
            role="policy",
        )
        DeterministicMixin.__init__(self, clip_actions=False, role="value")

        self.net_container = nn.Sequential(
            nn.LazyLinear(out_features=32),
            nn.ELU(),
            nn.LazyLinear(out_features=32),
            nn.ELU(),
        )
        self.policy_layer = nn.LazyLinear(out_features=self.num_actions)
        self.log_std_parameter = nn.Parameter(torch.full(size=(self.num_actions,), fill_value=0.0), requires_grad=True)
        self.value_layer = nn.LazyLinear(out_features=1)

    def act(self, inputs, role):
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role=""):
        if role == "policy":
            states = unflatten_tensorized_space(self.observation_space, inputs.get("states"))
            taken_actions = unflatten_tensorized_space(self.action_space, inputs.get("taken_actions"))
            net = self.net_container(states)
            self._shared_output = net
            output = self.policy_layer(net)
            return output, self.log_std_parameter, {}
        elif role == "value":
            if self._shared_output is None:
                states = unflatten_tensorized_space(self.observation_space, inputs.get("states"))
                taken_actions = unflatten_tensorized_space(self.action_space, inputs.get("taken_actions"))
                net = self.net_container(states)
                shared_output = net
            else:
                shared_output = self._shared_output
            self._shared_output = None
            output = self.value_layer(shared_output)
            return output, {}

[skrl:INFO] Broadcasting models' parameters
[2025-06-23 14:15:06,170][skrl][INFO] - Broadcasting models' parameters
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2400/2400 [00:51<00:00, 47.03it/s]
2025-06-23 14:16:01 [74,902ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Calling createGpuFoundation without first releasing already acquired instances with releaseGpuFoundation!
2025-06-23 14:16:01 [75,054ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Calling createGpuFoundation without first releasing already acquired instances with releaseGpuFoundation!
2025-06-23 14:16:06 [79,858ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,858ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,874ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,874ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,883ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,883ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,891ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,891ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,909ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,909ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [79,992ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:06 [79,993ms] [Warning] [gpu.foundation.plugin] activeGpu 1 is not compatible with current foundation settings.

|----------------------------------------------------------------------------------|
| Driver Version: 550.163.01 | Graphics API: Vulkan
|==================================================================================|
| GPU | Name                  | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                       |        |     |            | Device-ID | UUID       |
|     |                       |        |     |            | Bus-ID    |            |
|----------------------------------------------------------------------------------|
| 0   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b0cfac99.. |
|     |                       |        |     |            | 1b        |            |
|----------------------------------------------------------------------------------|
| 1   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 19ffc695.. |
|     |                       |        |     |            | 43        |            |
|----------------------------------------------------------------------------------|
| 2   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 943f111d.. |
|     |                       |        |     |            | 52        |            |
|----------------------------------------------------------------------------------|
| 3   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 32e877ce.. |
|     |                       |        |     |            | 61        |            |
|----------------------------------------------------------------------------------|
| 4   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 5b4f3983.. |
|     |                       |        |     |            | 9d        |            |
|----------------------------------------------------------------------------------|
| 5   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 834d67a2.. |
|     |                       |        |     |            | c3        |            |
|----------------------------------------------------------------------------------|
| 6   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | fa8479ed.. |
|     |                       |        |     |            | d1        |            |
|----------------------------------------------------------------------------------|
| 7   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b58259d5.. |
|     |                       |        |     |            | df        |            |
|==================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 5.15.0-1078-nvidia
| Processor: Intel(R) Xeon(R) Platinum 8480C
| Cores: 112 | Logical Cores: 224
|----------------------------------------------------------------------------------|
| Total Memory (MB): 2063936 | Free Memory: 1991712
| Total Page/Swap (MB): 0 | Free Page/Swap: 0
|----------------------------------------------------------------------------------|
2025-06-23 14:16:06 [79,993ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:

  • The driver is not installed properly and requires a clean re-install.
  • Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
  • The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
  • For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
  • For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.

2025-06-23 14:16:06 [80,419ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:06 [80,419ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,551ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,551ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,589ms] [Error] [gpu.foundation.plugin] Invalid getDeviceInfo parameters.
2025-06-23 14:16:07 [80,589ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Couldn't get driver version, failed to getDeviceInfo
2025-06-23 14:16:07 [80,589ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!
2025-06-23 14:16:07 [80,591ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,591ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,621ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,621ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,644ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,644ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,659ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,659ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Skipping NVIDIA GPU due CUDA being in bad state: NVIDIA H100 80GB HBM3
2025-06-23 14:16:07 [80,660ms] [Warning] [gpu.foundation.plugin] Please restart your system if CUDA is known to work in your system.
2025-06-23 14:16:07 [80,661ms] [Warning] [gpu.foundation.plugin] activeGpu 0 is not compatible with current foundation settings.

|----------------------------------------------------------------------------------|
| Driver Version: 550.163.01 | Graphics API: Vulkan
|==================================================================================|
| GPU | Name                  | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                       |        |     |            | Device-ID | UUID       |
|     |                       |        |     |            | Bus-ID    |            |
|----------------------------------------------------------------------------------|
| 0   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b0cfac99.. |
|     |                       |        |     |            | 1b        |            |
|----------------------------------------------------------------------------------|
| 1   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 19ffc695.. |
|     |                       |        |     |            | 43        |            |
|----------------------------------------------------------------------------------|
| 2   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 943f111d.. |
|     |                       |        |     |            | 52        |            |
|----------------------------------------------------------------------------------|
| 3   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 32e877ce.. |
|     |                       |        |     |            | 61        |            |
|----------------------------------------------------------------------------------|
| 4   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 5b4f3983.. |
|     |                       |        |     |            | 9d        |            |
|----------------------------------------------------------------------------------|
| 5   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | 834d67a2.. |
|     |                       |        |     |            | c3        |            |
|----------------------------------------------------------------------------------|
| 6   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | fa8479ed.. |
|     |                       |        |     |            | d1        |            |
|----------------------------------------------------------------------------------|
| 7   | NVIDIA H100 80GB HBM3 |        |     | 81559 MB   | 10de      | 0          |
|     |                       |        |     |            | 2330      | b58259d5.. |
|     |                       |        |     |            | df        |            |
|==================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 5.15.0-1078-nvidia
| Processor: Intel(R) Xeon(R) Platinum 8480C
| Cores: 112 | Logical Cores: 224
|----------------------------------------------------------------------------------|
| Total Memory (MB): 2063936 | Free Memory: 1991726
| Total Page/Swap (MB): 0 | Free Page/Swap: 0
|----------------------------------------------------------------------------------|
2025-06-23 14:16:07 [80,661ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:

  • The driver is not installed properly and requires a clean re-install.
  • Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
  • The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
  • For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
  • For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.

2025-06-23 14:16:07 [81,301ms] [Error] [gpu.foundation.plugin] Invalid getDeviceInfo parameters.
2025-06-23 14:16:07 [81,301ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Couldn't get driver version, failed to getDeviceInfo
2025-06-23 14:16:07 [81,301ms] [Warning] [omni.physx.foundation.plugin] PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!
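For reference, here is a minimal sketch of the rank-to-GPU mapping I expected from the command above. It is my own illustration and not Isaac Sim, Isaac Lab, or torch code. `physical_gpu_for_rank` and `OMNI_BUS_IDS` are hypothetical names. The sketch only handles integer entries in `CUDA_VISIBLE_DEVICES`. It also assumes two things: that CUDA orders devices by PCI bus ID (as it does with `CUDA_DEVICE_ORDER=PCI_BUS_ID`), and that the Omniverse table lists the GPUs in that same order. The `carb.cudainterop` warning says the second assumption may not hold.

Under those assumptions, and if `activeGpu` indexes into that table, the `activeGpu 0` and `activeGpu 1` values in the log would point at Bus-IDs 1b and 43 (GPUs 0 and 1). The GPUs I selected are at Bus-IDs c3 and d1 (GPUs 5 and 6).

```python
# Bus-IDs copied from the Omniverse GPU table in the log above,
# indexed by its "GPU" column (0-7).
OMNI_BUS_IDS = ["1b", "43", "52", "61", "9d", "c3", "d1", "df"]


def physical_gpu_for_rank(local_rank: int, visible: str) -> int:
    """Return the physical GPU index expected for a torchrun worker.

    Hypothetical helper: local rank i takes the i-th entry of
    CUDA_VISIBLE_DEVICES, as described under "Expected Behavior".
    Only integer entries are handled (no GPU UUIDs or MIG IDs).
    """
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    if not 0 <= local_rank < len(ids):
        raise ValueError(f"local rank {local_rank} has no GPU in {ids}")
    return ids[local_rank]


if __name__ == "__main__":
    visible = "5,6"  # CUDA_VISIBLE_DEVICES from the command in this report
    for rank in range(2):  # matches --nproc_per_node=2
        expected = physical_gpu_for_rank(rank, visible)
        # Assumption: CUDA and Omniverse both enumerate in PCI bus order,
        # so the physical index also indexes OMNI_BUS_IDS.
        print(
            f"rank {rank}: expected GPU {expected} (Bus-ID {OMNI_BUS_IDS[expected]}); "
            f"an unremapped index {rank} would be Bus-ID {OMNI_BUS_IDS[rank]}"
        )
```

For each of the two ranks, the script prints the expected physical GPU and its Bus-ID. Next to it, it prints the Bus-ID that the raw rank index would select if no remapping happened.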
