
Fix OVRTX renderer device mismatch on multi-GPU #5594

Merged
kellyguo11 merged 8 commits into isaac-sim:develop from fatimaanes:fatima/fix-ovrtx-device-mismatch
May 15, 2026

@fatimaanes (Collaborator) commented May 12, 2026

Description

Fixes OVRTXRenderer crash on multi-GPU systems when sim.device is not cuda:0.

Root cause: A hardcoded DEVICE = "cuda:0" constant in ovrtx_renderer_kernels.py was imported and used for all Warp kernel launches and buffer allocations. Additionally, AttributeBinding.map() calls used device_id=0, pinning attribute mapping to GPU 0 regardless of the simulation device.

Fix:

  • Remove the DEVICE constant and use self._device (set from CameraRenderSpec.device) for all Warp operations (11 locations)
  • Add _device_id property to extract the CUDA device index from the device string
  • Pass device_id=self._device_id to AttributeBinding.map() calls (2 locations: object binding and camera binding)
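
The `_device_id` extraction described above can be sketched roughly as follows. This is an illustration rather than the code from this PR: `cuda_device_index` is a hypothetical standalone helper, and treating a bare "cuda" string (no index suffix) as device 0 is an assumption.

```python
def cuda_device_index(device: str) -> int:
    """Return the CUDA ordinal encoded in a device string such as "cuda:1".

    Hypothetical helper for illustration only. A bare "cuda" string
    (no ":N" suffix) is assumed to refer to device 0.
    """
    # partition() splits on the first ":" and returns an empty separator
    # when the string contains none.
    _, sep, index = device.partition(":")
    return int(index) if sep else 0


index_for_second_gpu = cuda_device_index("cuda:1")
index_for_bare_cuda = cuda_device_index("cuda")
```

Using `str.partition` rather than `split(":")[1]` avoids an `IndexError` when the device string has no explicit index.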

Note on RenderVarOutput.map() calls: These remain unchanged (device=Device.CUDA only) because the OVRTX C API for render output mapping (ovrtx_map_output_description_t) does not accept a device_id parameter — the output is inherently mapped on whichever GPU OVRTX rendered on.
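
To make the difference between the two `map()` call types concrete, here is a small sketch using stand-in classes, not the real OVRTX bindings. The class bodies, the default values, and the tuple return values are assumptions for illustration (the real calls are used as context managers); only the presence or absence of a `device_id` parameter reflects what this PR describes.

```python
from enum import Enum


class Device(Enum):
    """Stand-in for the OVRTX device enum; only CUDA is modeled here."""

    CUDA = "cuda"


class AttributeBinding:
    """Stand-in whose map() accepts a device_id."""

    def map(self, device, device_id=0):
        return (device, device_id)


class RenderVarOutput:
    """Stand-in whose map() has no device_id parameter."""

    def map(self, device, sync_stream=None):
        return (device, sync_stream)


device_id = 1  # e.g. the index parsed from "cuda:1"

# Attribute bindings can be pinned to the simulation GPU explicitly.
binding_result = AttributeBinding().map(device=Device.CUDA, device_id=device_id)

# Render outputs are mapped without a device_id.
output_result = RenderVarOutput().map(device=Device.CUDA)

# Passing device_id to the render-output stand-in is rejected by Python itself.
try:
    RenderVarOutput().map(device=Device.CUDA, device_id=device_id)
    rejected = False
except TypeError:
    rejected = True
```

In this sketch the unsupported keyword fails at call time with a `TypeError`, which is why the render-output calls are left in their `device=Device.CUDA` form.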

Total: 13 hardcoded GPU-0 references fixed (11 Warp + 2 AttributeBinding).

This is the same bug class fixed for NewtonRenderer in #5019 — OVRTX was not updated at that time.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and added my name to the CONTRIBUTORS.md or my organization to the CONTRIBUTORS.md list

Replace hardcoded DEVICE = "cuda:0" in ovrtx_renderer with the
actual sim device from CameraRenderSpec.device. All Warp kernel
launches and buffer allocations now target the correct GPU when
sim.device is not cuda:0 (e.g. distributed training).
@github-actions github-actions Bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels May 12, 2026
@fatimaanes fatimaanes requested a review from kellyguo11 May 12, 2026 18:03
@greptile-apps Bot (Contributor) commented May 12, 2026

Greptile Summary

This PR fixes OVRTXRenderer crashes on multi-GPU systems by replacing a hardcoded DEVICE = "cuda:0" constant with self._device (derived from CameraRenderSpec.device) across all Warp kernel launches and OVRTX binding.map() calls.

  • Removes the DEVICE constant from ovrtx_renderer_kernels.py and adds a _device_id property to OVRTXRenderer that parses the integer CUDA index from the device string; all 11 kernel launches and 6 binding.map() calls are updated to use the correct device.
  • create_render_data now sets self._device = spec.device before any initialization, ensuring _setup_object_bindings and the returned OVRTXRenderData both target the right GPU.
  • The test file retains a local DEVICE = "cuda:0" constant for unit-test purposes, keeping existing tests unchanged.
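
The ordering point in the second bullet (adopt `spec.device` before any initialization runs) can be sketched with a toy renderer. Everything here is a stand-in: `SketchRenderer`, its `allocations` list, and the simplified `CameraRenderSpec` are hypothetical and only mirror the names used in this PR; no Warp or OVRTX calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class CameraRenderSpec:
    """Stand-in for the real spec; only the device field is modeled."""

    device: str


@dataclass
class SketchRenderer:
    """Toy renderer that records which device each allocation targets."""

    _device: str = "cuda:0"  # default used before any spec is seen
    _initialized_scene: bool = False
    allocations: list = field(default_factory=list)

    def _setup_object_bindings(self) -> None:
        # The real renderer would allocate a Warp array on self._device here.
        self.allocations.append(self._device)

    def create_render_data(self, spec: CameraRenderSpec) -> str:
        # Adopt the spec's device first so every later step sees it.
        self._device = spec.device
        if not self._initialized_scene:
            self._setup_object_bindings()
            self._initialized_scene = True
        # Stands in for constructing the render data on self._device.
        return self._device


renderer = SketchRenderer()
render_data_device = renderer.create_render_data(CameraRenderSpec(device="cuda:1"))
```

If the assignment came after `_setup_object_bindings()`, the first allocation in this sketch would still land on the default `cuda:0`, which is the mismatch this PR removes.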

Confidence Score: 5/5

Safe to merge — all hardcoded GPU-0 references are systematically replaced and the two issues flagged in the previous review round are fully addressed.

The change is a focused, mechanical substitution: every DEVICE / device_id=0 occurrence in the renderer is now driven by self._device / self._device_id. The _device_id property correctly handles both "cuda:N" and bare "cuda" forms, self._device is set from spec.device before any initialization runs, and the test suite retains its own local DEVICE = "cuda:0" so existing tests are unaffected. No new logic paths or data mutations are introduced.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py | Core fix: adds _device_id property, sets self._device from spec in create_render_data, and replaces all 17 hardcoded GPU-0 references with self._device/self._device_id. |
| source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer_kernels.py | Removes the DEVICE = "cuda:0" module-level constant; all kernel functions are unchanged and device-agnostic by design. |
| source/isaaclab_ov/test/test_ovrtx_renderer_kernels.py | Stops importing DEVICE from kernels; defines its own local DEVICE = "cuda:0" so unit tests remain pinned to cuda:0 without affecting production code. |
| source/isaaclab_ov/changelog.d/fix-ovrtx-device-mismatch.rst | New changelog entry accurately describing the multi-GPU device mismatch fix. |
| CONTRIBUTORS.md | Adds Fatima Anes in alphabetical order. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[OVRTXRenderer.__init__] -->|self._device = 'cuda:0'| B[Default device set]
    C[create_render_data spec] -->|self._device = spec.device| D[Device updated from spec]
    D --> E{_initialized_scene?}
    E -->|No| F[_initialize_from_spec]
    F --> G[_setup_object_bindings\nwp.array device=self._device]
    E -->|Yes| H[OVRTXRenderData spec, self._device]
    G --> H

    I[_device_id property] -->|self._device.split ':'| J[Returns int CUDA index]

    K[update_transforms] -->|object_binding.map device_id=self._device_id| L[wp.from_dlpack]
    L -->|wp.launch device=self._device| M[sync_newton_transforms_kernel]

    N[update_camera] -->|wp.zeros device=self._device| O[camera_transforms]
    O -->|wp.launch device=self._device| P[create_camera_transforms_kernel]
    P -->|camera_binding.map device_id=self._device_id| Q[wp.copy to binding]

    R[_process_render_frame] -->|LdrColor.map device_id=self._device_id| S[extract_rgba_tiles\ndevice=self._device]
    R -->|DepthSD.map device_id=self._device_id| T[extract_depth_tiles\ndevice=self._device]
    R -->|DiffuseAlbedoSD.map device_id=self._device_id| U[extract_rgba_tiles albedo\ndevice=self._device]
    R -->|SemanticSegmentation.map device_id=self._device_id| V[extract_rgba_tiles semantic\ndevice=self._device]

Reviews (2): Last reviewed commit: "Update kernel tests to use local device ..."

Route all OVRTX binding.map() calls through _device_id property
so DLPack tensors are mapped on the correct CUDA device. Without
this, update_transforms, update_camera, and render output mapping
pin buffers to cuda:0 regardless of sim.device.

The DEVICE constant was removed from ovrtx_renderer_kernels in the
previous commit. Define it locally in the test module instead.
@fatimaanes (Collaborator, Author)

@greptile retrigger

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot - Incremental Review

Reviewed commits: 044d6b5 → 980cd22

Summary of Changes

This update simplifies the multi-GPU device handling fix in _process_render_frame():

Change Details

Removed explicit device_id from .map() calls:

  • frame.render_vars["LdrColor"].map() — removed device_id=self._device_id
  • frame.render_vars[depth_var].map() — removed device_id=self._device_id
  • frame.render_vars["DiffuseAlbedoSD"].map() — removed device_id=self._device_id
  • frame.render_vars["SemanticSegmentation"].map() — removed device_id=self._device_id

Analysis

This is a refinement of the original multi-GPU fix. The initial fix added device_id=self._device_id to the frame.render_vars[...].map() calls inside _process_render_frame(); this update removes the parameter again, restoring the original device=Device.CUDA form.

This simplification is likely intentional because:

  1. The frame render variables already exist on the correct device from the renderer initialization
  2. Omitting device_id lets OVRTX use its default device inference
  3. The explicit device_id is still retained for _camera_binding.map() and _object_binding.map() in update_transforms() and update_camera() where it matters for data flow

Code Quality Assessment

The simplification is appropriate:

  • Removes unnecessary parameter where device context is already established
  • Keeps explicit device control where it matters (camera/object bindings)
  • Cleaner code with no functional regression expected

Verdict

The changes look good. This is a sensible simplification of the multi-GPU fix. No blocking concerns.


Last reviewed SHA: 980cd22469e7043e7f261fa314774b78098983c5

@kellyguo11 kellyguo11 moved this to Ready to merge in Isaac Lab May 15, 2026
@kellyguo11 (Contributor)

@fatimaanes seeing some failures in CI tests - https://github.com/isaac-sim/IsaacLab/actions/runs/25935794523/job/76246703492?pr=5594

Traceback (most recent call last):
  File "/workspace/isaaclab/source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py", line 632, in render
    self._process_render_frame(
  File "/workspace/isaaclab/source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py", line 572, in _process_render_frame
    with frame.render_vars["LdrColor"].map(device=Device.CUDA, device_id=self._device_id) as mapping:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: RenderVarOutput.map() got an unexpected keyword argument 'device_id'

RenderVarOutput.map() only accepts (device, sync_stream) per the OVRTX
API (ovrtx_map_output_description_t has device_type + sync_stream).
AttributeBinding.map() accepts (device, device_id) per ovrtx_mapping_desc_t.

The previous commit incorrectly added device_id to both call types.
This reverts the 4 RenderVarOutput.map() calls to their original form
while keeping device_id on the 2 AttributeBinding.map() calls.

Co-Authored-By: Fatima Anes <fanes@nvidia.com>
@kellyguo11 kellyguo11 merged commit 4737154 into isaac-sim:develop May 15, 2026
33 of 34 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to merge to Done in Isaac Lab May 15, 2026
@isaaclab-review-bot isaaclab-review-bot Bot mentioned this pull request May 16, 2026
7 tasks