diff --git a/docs/source/experimental-features/newton-physics-integration/index.rst b/docs/source/experimental-features/newton-physics-integration/index.rst index afe783cc8716..b93c5a3fd2c0 100644 --- a/docs/source/experimental-features/newton-physics-integration/index.rst +++ b/docs/source/experimental-features/newton-physics-integration/index.rst @@ -38,6 +38,8 @@ For an overview of how the multi-backend architecture works, including how to ad :titlesonly: installation + warp-environments + warp-env-migration limitations-and-known-bugs solver-transitioning using-kamino diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst new file mode 100644 index 000000000000..468ced739b4a --- /dev/null +++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst @@ -0,0 +1,280 @@ +.. _warp-env-migration: + +Warp Environment Migration Guide +================================ + +This guide covers the key conventions and patterns used by the warp-first environment +infrastructure, useful for migrating existing torch environments or creating new ones +natively. For an overview of the warp env path itself (workflows, available envs, +performance, limitations, benchmarking), see :doc:`warp-environments`. + + +Design Rationale +~~~~~~~~~~~~~~~~ + +The warp environment path is built around `CUDA graph capture +`_. +A CUDA graph records a sequence of GPU operations (kernel launches, memory copies) during a +capture phase, then replays the entire sequence with a single launch. This eliminates per-kernel +CPU overhead — the parameter validation, kernel selection, and buffer setup that normally costs +20–200 μs per operation is performed once during graph instantiation and reused on every replay +(~10 μs total). All CPU-side code (Python logic, torch dispatching) executed during capture is +completely bypassed during replay. 
See the `Warp concurrency documentation +`_ for Warp's graph capture API +(``wp.ScopedCapture``). + +All design decisions in the warp infrastructure follow from this constraint: every operation in the +step loop must be a GPU kernel launch with stable memory pointers so that the captured graph can +be replayed without modification. + +Key consequences: + +- All buffers are **pre-allocated** — no dynamic allocation inside the step loop +- Data flows through **persistent ``wp.array`` pointers** — never replaced, only overwritten +- MDP terms are **pure ``@wp.kernel`` functions** — no Python branching on GPU data +- Reset uses **boolean masks** (``env_mask``) instead of index lists (``env_ids``) to avoid + variable-length indexing that changes graph topology + + +Project Structure +~~~~~~~~~~~~~~~~~ + +Warp-specific implementations that diverge from the torch-based managers and env classes live in the ``_experimental`` packages: + +- ``isaaclab_experimental`` — warp managers, base env classes, warp MDP terms +- ``isaaclab_tasks_experimental`` — warp task configs and task-specific MDP terms + +Any new warp implementation that differs from the torch-based managers or env classes belongs in these packages. +Warp task configs reference Newton physics directly (no ``PresetCfg``) since the warp path +is Newton-only. + + +Writing Warp MDP Terms +~~~~~~~~~~~~~~~~~~~~~~ + +Imports +^^^^^^^ + +Warp task configs import from the experimental packages: + +.. code-block:: python + + # Warp + from isaaclab_experimental.managers import ObservationTermCfg, RewardTermCfg, SceneEntityCfg + import isaaclab_experimental.envs.mdp as mdp + +The term config classes have the same interface — only the import path changes. + + +Common Pattern +^^^^^^^^^^^^^^ + +All warp MDP terms (observations, rewards, terminations, events, actions) follow the same +**kernel + launch** pattern. 
Torch terms use torch tensors and return results; warp terms +write into pre-allocated ``wp.array`` output buffers via ``@wp.kernel`` functions: + +.. code-block:: python + + # Torch — returns a tensor + def lin_vel_z_l2(env, asset_cfg) -> torch.Tensor: + return torch.square(asset.data.root_lin_vel_b[:, 2]) + + # Warp — writes into pre-allocated output + @wp.kernel + def _lin_vel_z_l2_kernel(vel: wp.array(...), out: wp.array(dtype=wp.float32)): + i = wp.tid() + out[i] = vel[i][2] * vel[i][2] + + def lin_vel_z_l2(env, out, asset_cfg) -> None: + wp.launch(_lin_vel_z_l2_kernel, dim=env.num_envs, inputs=[..., out]) + +The output buffer shapes differ by term type: + +- **Observations**: ``(num_envs, D)`` where D is the observation dimension +- **Rewards**: ``(num_envs,)`` +- **Terminations**: ``(num_envs,)`` with dtype ``bool`` +- **Events**: ``(num_envs,)`` mask — events don't produce output, they modify sim state + + +Observation Terms +^^^^^^^^^^^^^^^^^ + +Since warp terms write into pre-allocated buffers, the observation manager must know each +term's output dimension at initialization to allocate the correct ``(num_envs, D)`` output +array. This is resolved via a fallback chain (see +``ObservationManager._infer_term_dim_scalar`` in +``isaaclab_experimental/managers/observation_manager.py``): + +1. **Explicit ``out_dim`` in decorator** (preferred): + + .. code-block:: python + + @generic_io_descriptor_warp(out_dim=3, observation_type="RootState") + def base_lin_vel(env, out, asset_cfg) -> None: ... + + ``out_dim`` can be an integer, or a string that resolves at initialization: + + - ``"joint"`` — number of selected joints from ``asset_cfg`` + - ``"body:N"`` — N components per selected body from ``asset_cfg`` + - ``"command"`` — dimension from command manager + - ``"action"`` — dimension from action manager + +2. **``axes`` metadata**: Dimension equals the number of axes listed: + + .. 
code-block:: python + + @generic_io_descriptor_warp(axes=["X", "Y", "Z"], observation_type="RootState") + def projected_gravity(env, out, asset_cfg) -> None: ... + # → dimension = 3 + +3. **Legacy params**: ``term_dim``, ``out_dim``, or ``obs_dim`` keys in ``term_cfg.params``. + +4. **Asset config fallback**: Count of ``asset_cfg.joint_ids`` (or ``joint_ids_wp``) for + joint-level terms. + + +Event Terms +^^^^^^^^^^^ + +Events use ``env_mask`` (boolean ``wp.array``) instead of ``env_ids``, and each kernel +checks the mask to skip non-selected environments: + +.. code-block:: python + + def reset_joints_by_offset(env, env_mask, ...): + wp.launch(_kernel, dim=env.num_envs, inputs=[env_mask, ...]) + + @wp.kernel + def _kernel(env_mask: wp.array(dtype=wp.bool), ...): + i = wp.tid() + if not env_mask[i]: + return + # ... modify state for selected envs only + +- RNG uses per-env ``env.rng_state_wp`` (``wp.uint32``) instead of ``torch.rand`` +- **Startup/prestartup** events use the torch convention ``(env, env_ids, **params)`` +- **Reset/interval** events use the warp convention ``(env, env_mask, **params)`` + + +Action Terms +^^^^^^^^^^^^ + +Actions follow a **two-stage execution**: ``process_actions`` (called once per env step) scales +and clips raw actions, and ``apply_actions`` (called once per sim step) writes targets to the +asset. Both stages use warp kernels with pre-allocated ``_raw_actions`` and ``_processed_actions`` +buffers. + + +Capture Safety +^^^^^^^^^^^^^^ + +When writing terms that run inside the captured step loop, keep in mind: + +- **No ``wp.to_torch``** or torch arithmetic — stay in warp throughout +- **No lazy-evaluated properties** — use sim-bound (Tier 1) data directly; if a derived + quantity is needed, compute it inline in the kernel +- **No dynamic allocation** — all buffers must be pre-allocated in ``__init__`` + + +Parity Testing +~~~~~~~~~~~~~~ + +Two levels of parity testing are used to validate warp terms: + +**1. 
Implementation parity (torch vs warp)** — verifies that the warp kernel produces the +same result as the torch implementation. This is optional for terms that have no torch +counterpart (e.g. new terms written directly in warp). + +.. code-block:: python + + import isaaclab.envs.mdp.observations as torch_obs + import isaaclab_experimental.envs.mdp.observations as warp_obs + + # Torch baseline + expected = torch_obs.joint_pos(torch_env, asset_cfg=cfg) + + # Warp (uncaptured) + out = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device) + warp_obs.joint_pos(warp_env, out, asset_cfg=cfg) + actual = wp.to_torch(out) + + torch.testing.assert_close(actual, expected) + +**2. Capture parity (warp vs warp-captured)** — verifies that the term produces identical +results when replayed from a CUDA graph vs launched directly. A mismatch here indicates capture-unsafe +code (e.g. stale pointers, dynamic allocation, or lazy property access that doesn't replay). +This test should always be run, even for terms without a torch counterpart. + +.. code-block:: python + + # Warp uncaptured + out_uncaptured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device) + warp_obs.joint_pos(warp_env, out_uncaptured, asset_cfg=cfg) + + # Warp captured (graph replay) + out_captured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device) + with wp.ScopedCapture() as cap: + warp_obs.joint_pos(warp_env, out_captured, asset_cfg=cfg) + wp.capture_launch(cap.graph) + + torch.testing.assert_close(wp.to_torch(out_captured), wp.to_torch(out_uncaptured)) + +See ``source/isaaclab_experimental/test/envs/mdp/`` for complete parity test examples. + + +Available Warp MDP Terms +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
list-table:: + :header-rows: 1 + :widths: 20 80 + + * - Category + - Available Terms + * - Observations (11) + - | ``base_pos_z`` + | ``base_lin_vel`` + | ``base_ang_vel`` + | ``projected_gravity`` + | ``joint_pos`` + | ``joint_pos_rel`` + | ``joint_pos_limit_normalized`` + | ``joint_vel`` + | ``joint_vel_rel`` + | ``last_action`` + | ``generated_commands`` + * - Rewards (16) + - | ``is_alive`` + | ``is_terminated`` + | ``lin_vel_z_l2`` + | ``ang_vel_xy_l2`` + | ``flat_orientation_l2`` + | ``joint_torques_l2`` + | ``joint_vel_l1`` + | ``joint_vel_l2`` + | ``joint_acc_l2`` + | ``joint_deviation_l1`` + | ``joint_pos_limits`` + | ``action_rate_l2`` + | ``action_l2`` + | ``undesired_contacts`` + | ``track_lin_vel_xy_exp`` + | ``track_ang_vel_z_exp`` + * - Events (6) + - | ``reset_joints_by_offset`` + | ``reset_joints_by_scale`` + | ``reset_root_state_uniform`` + | ``push_by_setting_velocity`` + | ``apply_external_force_torque`` + | ``randomize_rigid_body_com`` + * - Terminations (4) + - | ``time_out`` + | ``root_height_below_minimum`` + | ``joint_pos_out_of_manual_limit`` + | ``illegal_contact`` + * - Actions (2) + - | ``JointPositionAction`` + | ``JointEffortAction`` + +Terms not listed here remain in torch only. When using an env that requires unlisted terms, +those terms must be implemented in warp first. diff --git a/docs/source/experimental-features/newton-physics-integration/warp-environments.rst b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst new file mode 100644 index 000000000000..c1107741239b --- /dev/null +++ b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst @@ -0,0 +1,331 @@ +.. _warp-environments: + +Warp Experimental Environments +============================== + +.. note:: + + The warp environment infrastructure lives in ``isaaclab_experimental`` and + ``isaaclab_tasks_experimental``. It's an experimental feature. 
+ +The experimental extensions introduce **warp-first** environment infrastructure with CUDA graph capture +support. All environment-side computation (observations, rewards, resets, actions) runs as pure Warp +kernels, eliminating Python overhead and enabling CUDA graph capture for maximum throughput. + + +Workflows +~~~~~~~~~ + +Two environment workflows are supported: + +**Direct workflow** — ``DirectRLEnvWarp`` base class. You implement the step loop, observations, +rewards, and resets directly in your env class using Warp kernels. + +**Manager-based workflow** — ``ManagerBasedRLEnvWarp`` base class. You define MDP terms as +standalone Warp-kernel functions and compose them via configuration. + + +Available Environments +~~~~~~~~~~~~~~~~~~~~~~ + +Direct Warp Environments +^^^^^^^^^^^^^^^^^^^^^^^^ + +- ``Isaac-Cartpole-Direct-Warp-v0`` — Cartpole balance +- ``Isaac-Ant-Direct-Warp-v0`` — Ant locomotion +- ``Isaac-Humanoid-Direct-Warp-v0`` — Humanoid locomotion +- ``Isaac-Repose-Cube-Allegro-Direct-Warp-v0`` — Allegro hand cube repose + + +Manager-Based Warp Environments +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Classic** + +- ``Isaac-Cartpole-Warp-v0`` +- ``Isaac-Ant-Warp-v0`` +- ``Isaac-Humanoid-Warp-v0`` + +**Locomotion (Flat)** + +- ``Isaac-Velocity-Flat-Anymal-B-Warp-v0`` +- ``Isaac-Velocity-Flat-Anymal-C-Warp-v0`` +- ``Isaac-Velocity-Flat-Anymal-D-Warp-v0`` +- ``Isaac-Velocity-Flat-Cassie-Warp-v0`` +- ``Isaac-Velocity-Flat-G1-Warp-v0`` +- ``Isaac-Velocity-Flat-G1-Warp-v1`` +- ``Isaac-Velocity-Flat-H1-Warp-v0`` +- ``Isaac-Velocity-Flat-Unitree-A1-Warp-v0`` +- ``Isaac-Velocity-Flat-Unitree-Go1-Warp-v0`` +- ``Isaac-Velocity-Flat-Unitree-Go2-Warp-v0`` + +**Manipulation** + +- ``Isaac-Reach-Franka-Warp-v0`` +- ``Isaac-Reach-UR10-Warp-v0`` + + +Quick Start +~~~~~~~~~~~ + +.. 
code-block:: bash + + # Direct workflow + ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \ + --task Isaac-Cartpole-Direct-Warp-v0 --num_envs 4096 --headless + + # Manager-based workflow + ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \ + --task Isaac-Velocity-Flat-Anymal-C-Warp-v0 --num_envs 4096 --headless + +All RL libraries with warp-compatible wrappers are supported: RSL-RL, RL Games, SKRL, and +Stable-Baselines3. + + +Performance Comparison +~~~~~~~~~~~~~~~~~~~~~~ + +Step time comparison between the stable (torch/manager) and warp (CUDA graph captured) variants, +both running on the Newton physics backend. Measured over 300 iterations with 4096 environments. + +.. note:: + + The warp migration is an ongoing effort. Several components (e.g. scene write, actuator models) + have not yet been migrated to Warp kernels and still run through torch. Further performance + improvements are expected as these components are migrated. + +.. list-table:: + :header-rows: 1 + :widths: 30 12 15 15 12 + + * - Env + - Type + - Stable Step (us) + - Warp Step (us) + - Change + * - Cartpole-Direct + - Direct + - 5,274 + - 4,331 + - -17.88% + * - Ant-Direct + - Direct + - 6,368 + - 3,128 + - -50.88% + * - Humanoid-Direct + - Direct + - 13,937 + - 10,783 + - -22.63% + * - Allegro-Direct + - Direct + - 82,950 + - 74,570 + - -10.10% + * - Cartpole + - Manager + - 7,971 + - 3,642 + - -54.31% + * - Ant + - Manager + - 9,781 + - 4,672 + - -52.23% + * - Humanoid + - Manager + - 17,653 + - 12,505 + - -29.16% + * - Reach-Franka + - Manager + - 11,458 + - 7,813 + - -31.83% + * - Anymal-B + - Manager + - 29,188 + - 21,781 + - -25.38% + * - Anymal-C + - Manager + - 30,938 + - 22,228 + - -28.15% + * - Anymal-D + - Manager + - 32,294 + - 23,977 + - -25.75% + * - Cassie + - Manager + - 17,320 + - 10,706 + - -38.19% + * - G1 + - Manager + - 34,487 + - 27,300 + - -20.84% + * - H1 + - Manager + - 22,202 + - 15,864 + - -28.55% + * - A1 + - Manager + - 15,257 + - 
9,907 + - -35.07% + * - Go1 + - Manager + - 16,515 + - 11,869 + - -28.13% + * - Go2 + - Manager + - 15,221 + - 9,966 + - -34.52% + + +Which Workflows Benefit Most +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The savings come from eliminating Python / torch overhead in the env's step loop, so envs +gain in proportion to how much of their step time was previously dominated by per-kernel CPU +overhead. Reading the table above: + +- **Manager-based classic RL** (Cartpole, Ant): biggest gains (-52% to -54%). Many small + reward / observation terms with low compute per term, so per-launch CPU overhead dominated + the stable baseline. +- **Manager-based locomotion** (Anymal, G1, H1, Cassie, Unitree): gains of -21% to -38%, + with G1 at the low end. The MDP has more terms but the underlying physics step is heavier, + so the relative Python savings shrink. +- **Direct workflow**: gains scale with how much of the env's step body was Python (Ant -51%, + Humanoid -23%, Cartpole -18%, Allegro hand -10%). Direct envs that already wrote most of + their work as GPU kernels see modest gains; ones with substantial Python state machinery + see large ones. +- **Compute-heavy / scene-write-heavy envs** (Allegro hand, large humanoids): smaller + relative gains because the warp-side savings are amortised over a heavier step. Components + that still go through torch (scene write, actuator models) currently bound the floor; this + is expected to improve as remaining components migrate to warp. + +If your env's step time is dominated by physics or scene I/O, expect modest gains. If it has +many small MDP terms or a lot of Python in the step loop, expect large ones. Use the +benchmarking workflow below to measure on your task before committing to a migration. + + +Limitations +~~~~~~~~~~~ + +The warp env path is experimental and has the following known constraints. These are +specific to warp envs; for Newton physics limitations see :doc:`limitations-and-known-bugs`. 
+ +**Physics backend** + +- **Newton only.** PhysX is not supported under the warp env path. Asset and sensor + ``class_type`` fields resolve to ``isaaclab_physx.*`` classes that depend on + ``omni.physics.tensors`` (a Kit module the warp runtime does not initialise), and several + warp APIs (env-mask reset, CUDA graph capture) require the Newton articulation. Configure + the cfg with a Newton physics block (or ``presets=newton``). + +**MDP coverage** + +- Only the terms listed under :ref:`Available Warp MDP Terms ` are + implemented. Stable envs that depend on un-migrated terms cannot be run on the warp path + until those terms are ported. +- Some scene-side operations (asset write, actuator models, certain sensor types) still go + through torch. They participate in the step but are not yet captured into the graph; they + set the lower bound on observed step time. +- Sensors that depend on the Kit RTX renderer (camera-based observations) cannot be combined + with the warp env path — they need Kit, which the warp runtime does not initialise. + +**API differences vs stable** + +- Reset events use a boolean ``env_mask`` (``wp.array(dtype=wp.bool)``) instead of an + ``env_ids`` list. This is required for capture safety: variable-length indexing changes + graph topology and breaks replay. +- All buffers must be pre-allocated in ``__init__``. There is no dynamic allocation inside + the captured step loop, so observation / reward / termination output dimensions must be + known at env init. +- Term functions write into a pre-allocated ``out`` buffer rather than returning a tensor. + See :doc:`warp-env-migration` for the kernel + launch pattern. +- Code inside the captured step loop must follow capture-safety rules (no + ``wp.to_torch``, no torch arithmetic, no lazy-evaluated properties, no Python branching + on GPU data). See the *Capture Safety* section in :doc:`warp-env-migration` for the + full set of rules. 
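The mask-based reset and in-place buffer rules listed above can be illustrated with a minimal pure-Python sketch. This is not the ``isaaclab_experimental`` API: plain lists stand in for ``wp.array`` buffers, a fixed-length loop stands in for a kernel launch with ``dim=num_envs``, and ``ids_to_mask`` and ``masked_reset`` are hypothetical helper names introduced only for this illustration.

```python
# Illustrative stand-in only: lists play the role of pre-allocated wp.array buffers.


def ids_to_mask(env_ids, num_envs):
    """Convert a variable-length id list into a fixed-length boolean mask."""
    mask = [False] * num_envs
    for i in env_ids:
        mask[i] = True
    return mask


def masked_reset(joint_pos, default_pos, env_mask):
    """Overwrite selected entries in place; the loop always visits every env."""
    for i in range(len(env_mask)):  # fixed trip count, like dim=num_envs
        if not env_mask[i]:
            continue  # mirrors the early return in a masked kernel
        joint_pos[i] = default_pos[i]


num_envs = 4
joint_pos = [0.5, -0.2, 0.9, 0.1]  # allocated once, only overwritten afterwards
default_pos = [0.0, 0.0, 0.0, 0.0]
buffer_id = id(joint_pos)  # identity of the buffer before the reset

env_mask = ids_to_mask([1, 3], num_envs)
masked_reset(joint_pos, default_pos, env_mask)
```

Whether one environment or all of them reset, the amount of work issued and the buffer identity stay the same, which is the property a captured graph relies on. An ``env_ids`` list, by contrast, changes length from step to step.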
+ + +Benchmarking Your Environment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The performance table above was produced with ``scripts/benchmarks/benchmark_rsl_rl.py``, +which runs a fixed iteration count and reports step-time statistics. Use the same script +to estimate the gain for your own task before committing to a migration. + +**Single-task A/B** + +.. code-block:: bash + + # Stable variant + ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \ + --task -v0 \ + --num_envs 4096 \ + --max_iterations 500 \ + --headless \ + --benchmark_backend summary \ + --output_path benchmarks/stable + + # Warp variant — same task with -Warp- suffix + ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \ + --task -Warp-v0 \ + --num_envs 4096 \ + --max_iterations 500 \ + --headless \ + --benchmark_backend summary \ + --output_path benchmarks/warp + +The ``summary`` backend prints step time (mean / p50 / p99) and total throughput. Compare +"step time" between the two runs to estimate the gain per env step. + +**Sweep across all available tasks** + +``scripts/benchmarks/run_training_benchmarks.sh`` runs the full set of stable tasks listed +in the script (cartpole, ant, humanoid, locomotion, manipulation). Pair it with a +warp-tasks variant (substitute the ``-Warp-`` suffixed task ids) and diff the two outputs. + +**What to look at in the output** + +- *Step time (mean / p99)*: the headline number — what each env step costs. +- *Iteration time*: includes policy update; useful for end-to-end training throughput. +- *Capture overhead*: for warp runs, the first few iterations include CUDA graph capture + cost; exclude those when comparing steady-state numbers. + +**Estimating before you migrate** + +If you can't run the warp variant yet (e.g. the task isn't ported), measure the stable +step time and look at where it's spent: + +- ``num_envs * step_time`` dominated by physics → expect modest warp gains. 
+- ``step_time`` dominated by ``manager.compute_*`` calls → expect large gains, since those + are exactly what the warp managers replace with captured kernel launches. + +Use ``--num_frames`` on ``benchmark_non_rl.py`` for a no-policy step-time microbenchmark +when you want to isolate env overhead from policy compute. + + +Migrating Existing Environments +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For step-by-step instructions on porting an existing stable env (or writing a new warp +env from scratch) — covering project layout, the kernel + launch pattern shared by +observations / rewards / events / terminations / actions, capture-safety rules, and +parity testing — see :doc:`warp-env-migration` below. + + +.. toctree:: + :maxdepth: 2 + :hidden: + + warp-env-migration