Skip to content

pref: Enable mgpu in FrameView#5514

Open
pv-nvidia wants to merge 1 commit into
isaac-sim:developfrom
pv-nvidia:feat/frame-view-enable-mgpu
Open

pref: Enable mgpu in FrameView#5514
pv-nvidia wants to merge 1 commit into
isaac-sim:developfrom
pv-nvidia:feat/frame-view-enable-mgpu

Conversation

@pv-nvidia
Copy link
Copy Markdown

@pv-nvidia pv-nvidia commented May 6, 2026

Description

Removes the cuda:0-only restriction in FabricFrameView. USDRT SelectPrims now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., cuda:1) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.

The user-facing surface change is small (drops the device guard in __init__, the assertion in _initialize_fabric, and the _fabric_supported_devices allowlist). The remainder of the diff comprises:

  • Write-path refactor. set_world_poses, set_scales, and the initial USD→Fabric sync now share a single _compose_fabric_transform helper. The sync collapses to one combined kernel launch with one PrepareForReuse, eliminating the previous "double PrepareForReuse" code smell where a non-idempotent second call could mask a topology-change signal.
  • Robustness. The topology-change invariant in _rebuild_fabric_arrays now raises RuntimeError instead of using assert, so the check survives python -O. Previously, an -O run with a real prim-count mismatch would silently dispatch with a stale _view_to_fabric mapping and produce wrong poses or out-of-bounds kernel indices.
  • Multi-GPU test coverage. Three cuda:1-parameterized tests guarded by a new multi_gpu pytest marker (registered in pyproject.toml), plus a dedicated CI job on the existing multi-GPU runner so regressions show up automatically on PRs that touch FabricFrameView or its test file.

The skip-vs-fail logic in _skip_if_unavailable is intentional and works in concert with the CI workflow:

  • On a developer workstation with a single GPU, cuda:1 tests pytest.skip so local runs stay green.
  • In CI, the runner is guaranteed to have ≥2 GPUs by a pre-flight nvidia-smi + torch.cuda.device_count() check at the top of test-fabric-multi-gpu. If the runner regresses to a single GPU, the pre-flight emits ::error:: and exits non-zero before pytest is even invoked, so a misconfigured runner cannot silently green-light a PR.

The test helper used to special-case GITHUB_ACTIONS=true and call pytest.fail, but that branch was removed: the workflow pre-flight catches the same failure mode without coupling the test file to os.environ.

Type of change

  • New feature (non-breaking change which adds functionality)

cuda:0 continues to work exactly as before; cuda:1+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Note on the changelog item: this PR uses a fragment file at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst per the new fragment-based changelog system (#5434). CHANGELOG.rst and config/extension.toml are intentionally not edited directly — the nightly CI workflow compiles them from fragments. The fragment covers the cuda:N enablement (Fixed), the combined-sync refactor (Changed), and the assertRuntimeError fix (Fixed).

Test plan

Three new tests, all marked @pytest.mark.multi_gpu and parameterized with ["cuda:1"]:

  • test_fabric_cuda1_world_pose_roundtripset_world_posesget_world_poses returns the same values on a non-primary CUDA device.
  • test_fabric_cuda1_no_usd_writeback — Fabric writes on cuda:1 do not write back to USD (atol=0.0 — equality, not approximate).
  • test_fabric_cuda1_scales_roundtrip — covers the set_scales write path on cuda:1, since both Fabric write paths now run on self._device.

A new CI job, test-fabric-multi-gpu, runs in .github/workflows/test-multi-gpu.yaml on the existing [self-hosted, linux, x64, gpu, multi-gpu] runner. The job pre-flights with nvidia-smi and ./isaaclab.sh -p -c "import torch; print(torch.cuda.device_count())", and fails loudly with ::error:: if the runner regresses to a single GPU. Triggered automatically on PRs that touch source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py or its test file.

To verify locally on a multi-GPU machine:

./isaaclab.sh -p -m pytest -m multi_gpu \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

To verify the cuda:0 path is unchanged:

./isaaclab.sh -p -m pytest -m "not multi_gpu" \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 6, 2026
@pv-nvidia pv-nvidia marked this pull request as draft May 6, 2026 12:32
@pv-nvidia pv-nvidia self-assigned this May 6, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR lifts the cuda:0-only restriction in FabricFrameView, allowing Fabric GPU acceleration on any CUDA device index. The primary code change is dropping the _fabric_supported_devices allowlist and replacing it with a direct self._device pass-through to SelectPrims.

  • Write-path refactor: set_world_poses, set_scales, and the initial USD→Fabric sync are unified into a single _compose_fabric_transform helper, guaranteeing PrepareForReuse is called exactly once per logical write instead of twice during the initial sync.
  • Robustness fix: The topology-change invariant in _rebuild_fabric_arrays now raises RuntimeError instead of assert, so it survives python -O rather than silently corrupting pose data.
  • Test & CI: Three new @pytest.mark.multi_gpu tests cover the cuda:1 write/read/no-writeback paths; a new test-fabric-multi-gpu CI job with a GPU pre-flight check ensures these tests cannot silently green-light on a regressed single-GPU runner.

Confidence Score: 5/5

Safe to merge; the device-guard removal is well-scoped, the write-path refactor correctly eliminates the double PrepareForReuse, and the CI pre-flight ensures multi-GPU coverage cannot silently regress.

The core change (passing self._device directly to SelectPrims) is minimal and the surrounding refactor is correctness-improving. The only outstanding note is a stale one-line docstring that still advertises the old device restriction; everything else in the diff is either a genuine fix or well-tested new coverage.

The init docstring in fabric_frame_view.py still lists the old 'cpu' or 'cuda:0' restriction and should be updated to match the new behavior.

Important Files Changed

Filename Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py Removes cuda:0-only restriction, adds _compose_fabric_transform helper, replaces assert with RuntimeError, and drops CPU device guard. Stale docstring still lists "cpu" or "cuda:0" only.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Adds sys.argv stripping for Kit startup safety, improves _skip_if_unavailable for cuda:N devices, and adds three new multi-GPU test cases for cuda:1.
.github/workflows/test-multi-gpu.yaml Adds test-fabric-multi-gpu CI job with a GPU pre-flight check that fails loudly if the runner regresses to a single GPU, preventing silent green runs with zero coverage.
pyproject.toml Registers new multi_gpu pytest marker with a clear description of its skip-on-single-GPU semantics.
source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst New fragment-based changelog entry covering the multi-GPU enablement, combined-sync refactor, and assert→RuntimeError fix.

Reviews (4): Last reviewed commit: "Address review feedback on the multi-GPU..." | Re-trigger Greptile

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from a6cd73e to 2c619fe Compare May 7, 2026 08:44
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Clean, well-motivated PR that removes an artificial device restriction. The refactoring of set_world_poses and set_scales into a single _compose_fabric_transform helper is a nice improvement — it ensures PrepareForReuse is only called once per logical update regardless of which components are being set.

Strengths

  • Single kernel launch for combined updates_sync_fabric_from_usd_once now does pos+orient+scale in one pass instead of two separate calls
  • assert → RuntimeError — Topology guard won't be optimized away with python -O
  • Thorough multi-GPU test coverage with proper skip/fail semantics depending on CI vs local
  • CI workflow integration — dedicated test-fabric-multi-gpu job triggered by relevant file changes
  • sys.argv stripping — prevents Kit segfault from pytest flags

Minor Notes

See inline comments.

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py Outdated
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
@pv-nvidia pv-nvidia marked this pull request as ready for review May 11, 2026 11:29
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 Code Review for PR #5514 (Updated Review)

Incremental review: changes from c1254cd1c2e02d

✅ Updated Assessment

The PR has grown significantly since the last review, now including:

  • OVPhysX RigidObject data layer (new rigid_object_data.py)
  • PhysX Jacobian/mass matrix/gravity compensation properties
  • OvPhysxManager device locking and reuse improvements
  • IK/OSC task-space controller migration

🎯 Key Changes Since Last Review

1. OVPhysX RigidObject Implementation

A comprehensive BaseRigidObjectData subclass for OVPhysX with:

  • Root/body pose and velocity properties using TimestampedBuffer
  • Proper COM→link frame transformations via Warp kernels
  • ProxyArray wrappers for consistent Torch/Warp interop
  • Deprecated state-concat properties preserved for backward compatibility
@property
def root_com_ang_vel_b(self) -> ProxyArray:
    if self._root_com_ang_vel_b.timestamp < self._sim_timestamp:
        wp.launch(shared_kernels.quat_apply_inverse_1D_kernel, ...)
        self._root_com_ang_vel_b.timestamp = self._sim_timestamp
    # ... lazy ProxyArray init

2. OvPhysxManager Device Locking

Process-global device lock prevents mid-session device switches:

_locked_device: ClassVar[str | None] = None

if cls._locked_device is not None and ovphysx_device != cls._locked_device:
    raise RuntimeError(
        f"OvPhysxManager is locked to device {cls._locked_device!r}..."
    )

The physx.reset() approach for soft-reset (vs dropping reference) avoids the Carbonite static destructor race.

3. PhysX Task-Space Properties

New ArticulationData properties with proper refresh gating:

  • body_link_jacobian_w — COM→origin shift kernel applied
  • body_com_jacobian_w — passthrough to PhysX
  • mass_matrix — passthrough to PhysX
  • gravity_compensation_forces — passthrough to PhysX

The shift_jacobian_com_to_origin kernel correctly handles the frame mismatch.

4. IK/OSC Test Coverage

Comprehensive tests pinning cross-backend parity:

  • test_get_jacobians_link_origin_contract — J·q̇ ≈ body velocity
  • test_franka_ik_tracking_accuracy — IK converges to mm precision
  • test_franka_osc_tracking_accuracy — OSC tracking with impedance
  • test_get_gravity_compensation_forces_static_equilibrium — τ_gc holds arm

💡 Observations

  1. CPU Scene Config Fix: The _configure_physx_scene_prim now correctly gates GPU-specific attributes on device == "gpu". Previously, CPU mode would set GPU broadphase/capacity attributes incorrectly.

  2. Tensor Types Extension: New RIGID_BODY_* tensor types with proper try/except guards for optional wheel bindings (RIGID_BODY_ACCELERATION, RIGID_BODY_INV_MASS, RIGID_BODY_INV_INERTIA).

  3. Test Timeout Adjustment: TIMEOUT_RETRIES = 0 in conftest.py and new timeout for test_contact_sensor.py — reasonable CI tuning.


🔒 Safety Checklist

  • Device lock prevents silent corruption from device mismatch
  • Soft-reset avoids Carbonite destructor race
  • New properties gated by timestamp for lazy refresh
  • Write paths invalidate dependent caches
  • Tests cover both backends where applicable

📋 Summary

Aspect Status
OVPhysX RigidObject ✅ Complete
PhysX Task-Space ✅ Complete
Device Locking ✅ Correct
Test Coverage ✅ Thorough
Breaking Changes ✅ None

Recommendation: PR is ready for merge. Changes since last review are substantial but well-tested.


Update (8de9a39d): Significant additional changes since last review:

🆕 New Features

  1. URDF Converter Refactor — New UrdfConverterCfg with robot type support, asset transformation, multi-physics conversion, and debug mode. Old urdf_utils.py (fixed joint merging) removed.

  2. RT2 Path Tracing SettingsRenderCfg gains max_bounces, split_glass, split_clearcoat, split_rough_reflection, ambient_light_intensity, ambient_occlusion_denoiser_mode, view_tile_limit for advanced rendering control.

  3. Warp Math Kernels — New warp_math.py with GPU-native camera orientation conversion and depth clamping kernels (avoids torch round-trips on hot path).

  4. ProxyArray Attribute Forwarding__getattr__ bridge allows tensor methods (float(), clone(), cpu()) on ProxyArray with deprecation warning.

  5. Stacked Image Observations — New stacked_image MDP term with ring-buffer frame stacking for visual RL (tested via mock camera).

  6. Install CLI Refactorsetup.py extras restructured; new entry points (isaaclab, play, train); comprehensive install CI tests for uv/conda workflows.

🔧 Refactoring

  • SceneDataProvider moved from isaaclab.physics to isaaclab.scene.scene_data_provider
  • SimulationContext.update_scene_data_provider() removed; physics manager now handles forward calls
  • Import cleanup: from isaaclab.utils.configclass import configclass (explicit module path)
  • Math safety: quat_from_matrix and create_rotation_matrix_from_view return NaN for invalid/singular inputs

✅ Test Coverage

  • Heterogeneous cloned scene test for multi-mesh ray caster camera
  • Install workflow tests (uv, conda, none/newton/rl/visualizer/mimic extras)
  • Stacked image observation unit tests
  • CI startup time limit adjusted for GitHub Actions

No blocking issues found. PR continues to be well-tested and ready for merge.

Comment thread .github/workflows/test-multi-gpu.yaml Outdated
@pv-nvidia pv-nvidia changed the title Feat/frame view enable mgpu Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia changed the title Enable mgpu in FrameView pref: Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia added the enhancement New feature or request label May 12, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 2 times, most recently from ada2608 to c1254cd Compare May 13, 2026 10:54
@isaaclab-review-bot
Copy link
Copy Markdown

🤖 Isaac Lab Review Bot — Follow-up Review (c1254cd)

All previously raised concerns have been addressed

Resolved Issues from Previous Review:

Issue Status
P1 - assertRuntimeError for topology-change invariant ✅ Fixed
P2 - PrepareForReuse called twice during USD→Fabric sync ✅ Fixed (consolidated into single _compose_fabric_transform call)
sys.argv stripping documentation ✅ Added explicit comment
Missing GPU pre-flight in test-fabric-multi-gpu workflow ✅ Added "Verify multi-GPU availability" step

Summary of Additional Changes:

Version bumps:

  • isaaclab 5.1.0 → 5.1.1
  • isaaclab_newton 0.8.0 → 0.8.1
  • isaaclab_ov 0.1.7 → 0.1.8
  • isaaclab_physx 0.6.3 → 0.6.4
  • isaaclab_tasks 1.5.37 → 1.5.38

Dependency updates:

  • newton v1.2.0rc2 → v1.2.0rc3 (now packaged, not git+)
  • mujoco-warp 3.8.0.1 → 3.8.0.2
  • warp-lang pinned 1.13.0 → >=1.13.0

Bug fixes:

  • Carbonite null-client error in _fabric_notices.py and _cubric.py
  • ContactSensor metadata extraction for Newton 1.1+ (scalar types, per-row indices)
  • Dexsuite: mesh-based spawning for Newton heterogeneous object support

New features:

  • test-fabric-multi-gpu.yaml CI workflow
  • multi_gpu pytest marker

Reviewed commits: ada2608c1254cd

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from c1254cd to 1c2e02d Compare May 13, 2026 19:58
Remove _fabric_supported_devices allowlist and the device fallback check.
USDRT SelectPrims supports any CUDA device index, so Fabric acceleration
now runs on cuda:0, cuda:1, etc. This unblocks distributed training where
each rank is pinned to a non-primary GPU.

- Remove _fabric_supported_devices constant and device check in __init__
- Remove assertion in _initialize_fabric
- Update docstrings for multi-GPU support
- Add cuda:1 test cases (skipped on single-GPU runners)
- Add sys.argv strip to prevent Kit segfault from pytest flags
- Add dedicated multi-GPU CI workflow
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 1c2e02d to 8de9a39 Compare May 17, 2026 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant