From fc2c3215ee1811d43e2ba08118c0911254c30d4d Mon Sep 17 00:00:00 2001
From: jichuanh <jichuanh@nvidia.com>
Date: Fri, 8 May 2026 21:48:58 +0000
Subject: [PATCH 1/4] Add warp environment docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two new pages under experimental-features/newton-physics-integration/:

- warp-environments.rst: overview of the experimental warp env path —
  the two workflows (direct, manager-based), task inventory, quick
  start, performance comparison vs the stable variants, which workflows
  benefit most, limitations (Newton-only physics, MDP coverage, kit
  sensor restrictions, capture-safety constraints, env_mask vs env_ids
  API delta), benchmarking how-to, and a checklist for adding new warp
  environments.
- warp-env-migration.rst: pytorch -> warp migration guide. Covers the
  CUDA graph capture rationale, project layout, the kernel + launch
  pattern shared by all term types, observation dim resolution, the
  env_mask / env_ids switch for events, capture-safety rules, two-level
  parity testing (stable vs warp, and warp vs warp-captured), and the
  inventory of currently-implemented warp MDP terms.

Both pages register in the section toctree.
---
 .../newton-physics-integration/index.rst      |   3 +
 .../warp-env-migration.rst                    | 280 +++++++++++++++
 .../warp-environments.rst                     | 331 ++++++++++++++++++
 3 files changed, 614 insertions(+)
 create mode 100644 docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
 create mode 100644 docs/source/experimental-features/newton-physics-integration/warp-environments.rst

diff --git a/docs/source/experimental-features/newton-physics-integration/index.rst b/docs/source/experimental-features/newton-physics-integration/index.rst
index afe783cc8716..52df7a68f512 100644
--- a/docs/source/experimental-features/newton-physics-integration/index.rst
+++ b/docs/source/experimental-features/newton-physics-integration/index.rst
@@ -38,6 +38,9 @@ For an overview of how the multi-backend architecture works, including how to ad
   :titlesonly:
 
   installation
+  warp-environments
+  training-environments
+  visualization
   limitations-and-known-bugs
   solver-transitioning
   using-kamino
diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
new file mode 100644
index 000000000000..b47c2c50282d
--- /dev/null
+++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
@@ -0,0 +1,280 @@
+.. _warp-env-migration:
+
+Warp Environment Migration Guide
+================================
+
+This guide covers the key conventions and patterns used by the warp-first environment
+infrastructure, useful for migrating existing stable environments or creating new ones
+natively. For an overview of the warp env path itself (workflows, available envs,
+performance, limitations, benchmarking), see :doc:`warp-environments`.
+
+
+Design Rationale
+~~~~~~~~~~~~~~~~
+
+The warp environment path is built around `CUDA graph capture
+<https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html>`_.
+A CUDA graph records a sequence of GPU operations (kernel launches, memory copies) during a
+capture phase, then replays the entire sequence with a single launch. This eliminates per-kernel
+CPU overhead — the parameter validation, kernel selection, and buffer setup that normally costs
+20–200 μs per operation is performed once during graph instantiation and reused on every replay
+(~10 μs total). All CPU-side code (Python logic, torch dispatching) executed during capture is
+completely bypassed during replay. See the `Warp concurrency documentation
+<https://nvidia.github.io/warp/deep_dive/concurrency.html>`_ for Warp's graph capture API
+(``wp.ScopedCapture``).
+
+All design decisions in the warp infrastructure follow from this constraint: every operation in the
+step loop must be a GPU kernel launch with stable memory pointers so that the captured graph can
+be replayed without modification.
+
+Key consequences:
+
+- All buffers are **pre-allocated** — no dynamic allocation inside the step loop
+- Data flows through **persistent** ``wp.array`` **pointers** — never replaced, only overwritten
+- MDP terms are **pure** ``@wp.kernel`` **functions** — no Python branching on GPU data
+- Reset uses **boolean masks** (``env_mask``) instead of index lists (``env_ids``) to avoid
+  variable-length indexing that changes graph topology
+
+
+Project Structure
+~~~~~~~~~~~~~~~~~
+
+Warp-specific implementations that deviate from stable live in the ``_experimental`` packages:
+
+- ``isaaclab_experimental`` — warp managers, base env classes, warp MDP terms
+- ``isaaclab_tasks_experimental`` — warp task configs and task-specific MDP terms
+
+Any new warp implementation that differs from the stable API belongs in these packages.
+Warp task configs reference Newton physics directly (no ``PresetCfg``) since the warp path
+is Newton-only.
+
+
+Writing Warp MDP Terms
+~~~~~~~~~~~~~~~~~~~~~~
+
+Imports
+^^^^^^^
+
+Warp task configs import from the experimental packages:
+
+.. code-block:: python
+
+   # Warp
+   from isaaclab_experimental.managers import ObservationTermCfg, RewardTermCfg, SceneEntityCfg
+   import isaaclab_experimental.envs.mdp as mdp
+
+The term config classes have the same interface — only the import path changes.
+
+
+Common Pattern
+^^^^^^^^^^^^^^
+
+All warp MDP terms (observations, rewards, terminations, events, actions) follow the same
+**kernel + launch** pattern. Stable terms use torch tensors and return results; warp terms
+write into pre-allocated ``wp.array`` output buffers via ``@wp.kernel`` functions:
+
+.. code-block:: python
+
+   # Stable — returns a tensor
+   def lin_vel_z_l2(env, asset_cfg) -> torch.Tensor:
+       return torch.square(asset.data.root_lin_vel_b[:, 2])
+
+   # Warp — writes into pre-allocated output
+   @wp.kernel
+   def _lin_vel_z_l2_kernel(vel: wp.array(...), out: wp.array(dtype=wp.float32)):
+       i = wp.tid()
+       out[i] = vel[i][2] * vel[i][2]
+
+   def lin_vel_z_l2(env, out, asset_cfg) -> None:
+       wp.launch(_lin_vel_z_l2_kernel, dim=env.num_envs, inputs=[..., out])
+
+The output buffer shapes differ by term type:
+
+- **Observations**: ``(num_envs, D)`` where D is the observation dimension
+- **Rewards**: ``(num_envs,)``
+- **Terminations**: ``(num_envs,)`` with dtype ``bool``
+- **Events**: no output buffer; they take a ``(num_envs,)`` boolean mask and modify sim state
+
+
+Observation Terms
+^^^^^^^^^^^^^^^^^
+
+Since warp terms write into pre-allocated buffers, the observation manager must know each
+term's output dimension at initialization to allocate the correct ``(num_envs, D)`` output
+array. This is resolved via a fallback chain (see
+``ObservationManager._infer_term_dim_scalar`` in
+``isaaclab_experimental/managers/observation_manager.py``):
+
+1. **Explicit** ``out_dim`` **in decorator** (preferred):
+
+   .. code-block:: python
+
+      @generic_io_descriptor_warp(out_dim=3, observation_type="RootState")
+      def base_lin_vel(env, out, asset_cfg) -> None: ...
+
+   ``out_dim`` can be an integer, or a string that resolves at initialization:
+
+   - ``"joint"`` — number of selected joints from ``asset_cfg``
+   - ``"body:N"`` — N components per selected body from ``asset_cfg``
+   - ``"command"`` — dimension from command manager
+   - ``"action"`` — dimension from action manager
+
+2. ``axes`` **metadata**: Dimension equals the number of axes listed:
+
+   .. code-block:: python
+
+      @generic_io_descriptor_warp(axes=["X", "Y", "Z"], observation_type="RootState")
+      def projected_gravity(env, out, asset_cfg) -> None: ...
+      # → dimension = 3
+
+3. **Legacy params**: ``term_dim``, ``out_dim``, or ``obs_dim`` keys in ``term_cfg.params``.
+
+4. **Asset config fallback**: Count of ``asset_cfg.joint_ids`` (or ``joint_ids_wp``) for
+   joint-level terms.
+
+
+Event Terms
+^^^^^^^^^^^
+
+Events use ``env_mask`` (boolean ``wp.array``) instead of ``env_ids``, and each kernel
+checks the mask to skip non-selected environments:
+
+.. code-block:: python
+
+   def reset_joints_by_offset(env, env_mask, ...):
+       wp.launch(_kernel, dim=env.num_envs, inputs=[env_mask, ...])
+
+   @wp.kernel
+   def _kernel(env_mask: wp.array(dtype=wp.bool), ...):
+       i = wp.tid()
+       if not env_mask[i]:
+           return
+       # ... modify state for selected envs only
+
+- RNG uses per-env ``env.rng_state_wp`` (``wp.uint32``) instead of ``torch.rand``
+- **Startup/prestartup** events use the stable convention ``(env, env_ids, **params)``
+- **Reset/interval** events use the warp convention ``(env, env_mask, **params)``
+
+
+Action Terms
+^^^^^^^^^^^^
+
+Actions follow a **two-stage execution**: ``process_actions`` (called once per env step) scales
+and clips raw actions, and ``apply_actions`` (called once per sim step) writes targets to the
+asset. Both stages use warp kernels with pre-allocated ``_raw_actions`` and ``_processed_actions``
+buffers.
+
+
+Capture Safety
+^^^^^^^^^^^^^^
+
+When writing terms that run inside the captured step loop, keep in mind:
+
+- **No** ``wp.to_torch`` or torch arithmetic — stay in warp throughout
+- **No lazy-evaluated properties** — use sim-bound (Tier 1) data directly; if a derived
+  quantity is needed, compute it inline in the kernel
+- **No dynamic allocation** — all buffers must be pre-allocated in ``__init__``
+
+
+Parity Testing
+~~~~~~~~~~~~~~
+
+Two levels of parity testing are used to validate warp terms:
+
+**1. Implementation parity (stable vs warp)** — verifies that the warp kernel produces the
+same result as the stable torch implementation. This is optional for terms that have no stable
+counterpart (e.g. new terms written directly in warp).
+
+.. code-block:: python
+
+   import isaaclab.envs.mdp.observations as stable_obs
+   import isaaclab_experimental.envs.mdp.observations as warp_obs
+
+   # Stable baseline
+   expected = stable_obs.joint_pos(stable_env, asset_cfg=cfg)
+
+   # Warp (uncaptured)
+   out = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   warp_obs.joint_pos(warp_env, out, asset_cfg=cfg)
+   actual = wp.to_torch(out)
+
+   torch.testing.assert_close(actual, expected)
+
+**2. Capture parity (warp vs warp-captured)** — verifies that the term produces identical
+results when replayed from a CUDA graph vs launched directly. A mismatch here indicates capture-unsafe
+code (e.g. stale pointers, dynamic allocation, or lazy property access that doesn't replay).
+This test should always be run, even for terms without a stable counterpart.
+
+.. code-block:: python
+
+   # Warp uncaptured
+   out_uncaptured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   warp_obs.joint_pos(warp_env, out_uncaptured, asset_cfg=cfg)
+
+   # Warp captured (graph replay)
+   out_captured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   with wp.ScopedCapture() as cap:
+       warp_obs.joint_pos(warp_env, out_captured, asset_cfg=cfg)
+   wp.capture_launch(cap.graph)
+
+   torch.testing.assert_close(wp.to_torch(out_captured), wp.to_torch(out_uncaptured))
+
+See ``source/isaaclab_experimental/test/envs/mdp/`` for complete parity test examples.
+
+
+Available Warp MDP Terms
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Category
+     - Available Terms
+   * - Observations (11)
+     - | ``base_pos_z``
+       | ``base_lin_vel``
+       | ``base_ang_vel``
+       | ``projected_gravity``
+       | ``joint_pos``
+       | ``joint_pos_rel``
+       | ``joint_pos_limit_normalized``
+       | ``joint_vel``
+       | ``joint_vel_rel``
+       | ``last_action``
+       | ``generated_commands``
+   * - Rewards (16)
+     - | ``is_alive``
+       | ``is_terminated``
+       | ``lin_vel_z_l2``
+       | ``ang_vel_xy_l2``
+       | ``flat_orientation_l2``
+       | ``joint_torques_l2``
+       | ``joint_vel_l1``
+       | ``joint_vel_l2``
+       | ``joint_acc_l2``
+       | ``joint_deviation_l1``
+       | ``joint_pos_limits``
+       | ``action_rate_l2``
+       | ``action_l2``
+       | ``undesired_contacts``
+       | ``track_lin_vel_xy_exp``
+       | ``track_ang_vel_z_exp``
+   * - Events (6)
+     - | ``reset_joints_by_offset``
+       | ``reset_joints_by_scale``
+       | ``reset_root_state_uniform``
+       | ``push_by_setting_velocity``
+       | ``apply_external_force_torque``
+       | ``randomize_rigid_body_com``
+   * - Terminations (4)
+     - | ``time_out``
+       | ``root_height_below_minimum``
+       | ``joint_pos_out_of_manual_limit``
+       | ``illegal_contact``
+   * - Actions (2)
+     - | ``JointPositionAction``
+       | ``JointEffortAction``
+
+Terms not listed here remain in stable only. When using an env that requires unlisted terms,
+those terms must be implemented in warp first.
diff --git a/docs/source/experimental-features/newton-physics-integration/warp-environments.rst b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst
new file mode 100644
index 000000000000..c1107741239b
--- /dev/null
+++ b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst
@@ -0,0 +1,331 @@
+.. _warp-environments:
+
+Warp Experimental Environments
+==============================
+
+.. note::
+
+   The warp environment infrastructure lives in ``isaaclab_experimental`` and
+   ``isaaclab_tasks_experimental``. It's an experimental feature.
+
+The experimental extensions introduce **warp-first** environment infrastructure with CUDA graph capture
+support. All environment-side computation (observations, rewards, resets, actions) runs as pure Warp
+kernels, eliminating Python overhead and enabling CUDA graph capture for maximum throughput.
+
+
+Workflows
+~~~~~~~~~
+
+Two environment workflows are supported:
+
+**Direct workflow** — ``DirectRLEnvWarp`` base class. You implement the step loop, observations,
+rewards, and resets directly in your env class using Warp kernels.
+
+**Manager-based workflow** — ``ManagerBasedRLEnvWarp`` base class. You define MDP terms as
+standalone Warp-kernel functions and compose them via configuration.
+
+
+Available Environments
+~~~~~~~~~~~~~~~~~~~~~~
+
+Direct Warp Environments
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+- ``Isaac-Cartpole-Direct-Warp-v0`` — Cartpole balance
+- ``Isaac-Ant-Direct-Warp-v0`` — Ant locomotion
+- ``Isaac-Humanoid-Direct-Warp-v0`` — Humanoid locomotion
+- ``Isaac-Repose-Cube-Allegro-Direct-Warp-v0`` — Allegro hand cube repose
+
+
+Manager-Based Warp Environments
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**Classic**
+
+- ``Isaac-Cartpole-Warp-v0``
+- ``Isaac-Ant-Warp-v0``
+- ``Isaac-Humanoid-Warp-v0``
+
+**Locomotion (Flat)**
+
+- ``Isaac-Velocity-Flat-Anymal-B-Warp-v0``
+- ``Isaac-Velocity-Flat-Anymal-C-Warp-v0``
+- ``Isaac-Velocity-Flat-Anymal-D-Warp-v0``
+- ``Isaac-Velocity-Flat-Cassie-Warp-v0``
+- ``Isaac-Velocity-Flat-G1-Warp-v0``
+- ``Isaac-Velocity-Flat-G1-Warp-v1``
+- ``Isaac-Velocity-Flat-H1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-A1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-Go1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-Go2-Warp-v0``
+
+**Manipulation**
+
+- ``Isaac-Reach-Franka-Warp-v0``
+- ``Isaac-Reach-UR10-Warp-v0``
+
+
+Quick Start
+~~~~~~~~~~~
+
+.. code-block:: bash
+
+    # Direct workflow
+    ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
+        --task Isaac-Cartpole-Direct-Warp-v0 --num_envs 4096 --headless
+
+    # Manager-based workflow
+    ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
+        --task Isaac-Velocity-Flat-Anymal-C-Warp-v0 --num_envs 4096 --headless
+
+All RL libraries with warp-compatible wrappers are supported: RSL-RL, RL Games, SKRL, and
+Stable-Baselines3.
+
+
+Performance Comparison
+~~~~~~~~~~~~~~~~~~~~~~
+
+The table below compares step time for the stable (torch/manager) and warp (CUDA graph captured)
+variants, both on the Newton physics backend, measured over 300 iterations with 4096 environments.
+
+.. note::
+
+   The warp migration is an ongoing effort. Several components (e.g. scene write, actuator models)
+   have not yet been migrated to Warp kernels and still run through torch. Further performance
+   improvements are expected as these components are migrated.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 12 15 15 12
+
+   * - Env
+     - Type
+     - Stable Step (μs)
+     - Warp Step (μs)
+     - Change
+   * - Cartpole-Direct
+     - Direct
+     - 5,274
+     - 4,331
+     - -17.88%
+   * - Ant-Direct
+     - Direct
+     - 6,368
+     - 3,128
+     - -50.88%
+   * - Humanoid-Direct
+     - Direct
+     - 13,937
+     - 10,783
+     - -22.63%
+   * - Allegro-Direct
+     - Direct
+     - 82,950
+     - 74,570
+     - -10.10%
+   * - Cartpole
+     - Manager
+     - 7,971
+     - 3,642
+     - -54.31%
+   * - Ant
+     - Manager
+     - 9,781
+     - 4,672
+     - -52.23%
+   * - Humanoid
+     - Manager
+     - 17,653
+     - 12,505
+     - -29.16%
+   * - Reach-Franka
+     - Manager
+     - 11,458
+     - 7,813
+     - -31.83%
+   * - Anymal-B
+     - Manager
+     - 29,188
+     - 21,781
+     - -25.38%
+   * - Anymal-C
+     - Manager
+     - 30,938
+     - 22,228
+     - -28.15%
+   * - Anymal-D
+     - Manager
+     - 32,294
+     - 23,977
+     - -25.75%
+   * - Cassie
+     - Manager
+     - 17,320
+     - 10,706
+     - -38.19%
+   * - G1
+     - Manager
+     - 34,487
+     - 27,300
+     - -20.84%
+   * - H1
+     - Manager
+     - 22,202
+     - 15,864
+     - -28.55%
+   * - A1
+     - Manager
+     - 15,257
+     - 9,907
+     - -35.07%
+   * - Go1
+     - Manager
+     - 16,515
+     - 11,869
+     - -28.13%
+   * - Go2
+     - Manager
+     - 15,221
+     - 9,966
+     - -34.52%
+
+
+Which Workflows Benefit Most
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The savings come from eliminating Python / torch overhead in the env's step loop, so an env
+gains in proportion to the share of its step time that per-kernel CPU overhead used to take.
+Reading the table above:
+
+- **Manager-based classic RL** (Cartpole, Ant) — biggest gains (-52% to -54%). Many small
+  reward / observation terms with low compute per term, so per-launch CPU overhead dominated
+  the stable baseline.
+- **Manager-based locomotion** (Anymal, G1, H1, Cassie, Unitree) — consistent -21% to -38%
+  range. The MDP has more terms but the underlying physics step is heavier, so the relative
+  Python savings shrink.
+- **Direct workflow** — gains scale with how much the env's step body was Python (Ant -51%,
+  Cartpole -18%, Allegro hand -10%). Direct envs that already wrote most of their work as
+  GPU kernels see modest gains; ones with substantial Python state machinery see large ones.
+- **Compute-heavy / scene-write-heavy envs** (Allegro hand, large humanoids) — see smaller
+  relative gains because the warp-side savings are amortised over a heavier step. Components
+  that still go through torch (scene write, actuator models) currently bound the floor; this
+  is expected to improve as remaining components migrate to warp.
+
+If your env's step time is dominated by physics or scene I/O, expect modest gains. If it has
+many small MDP terms or a lot of Python in the step loop, expect large ones. Use the
+benchmarking workflow below to measure on your task before committing to a migration.
+
+
+Limitations
+~~~~~~~~~~~
+
+The warp env path is experimental and has the following known constraints. These are
+specific to warp envs; for Newton physics limitations see :doc:`limitations-and-known-bugs`.
+
+**Physics backend**
+
+- **Newton only.** PhysX is not supported under the warp env path. Under PhysX, asset and
+  sensor ``class_type`` fields resolve to ``isaaclab_physx.*`` classes that depend on
+  ``omni.physics.tensors`` (a Kit module the warp runtime does not initialise), and several
+  warp APIs (env-mask reset, CUDA graph capture) require the Newton articulation. Configure
+  the cfg with a Newton physics block (or ``presets=newton``).
+
+**MDP coverage**
+
+- Only the terms listed under :ref:`Available Warp MDP Terms <warp-env-migration>` are
+  implemented. Stable envs that depend on un-migrated terms cannot be run on the warp path
+  until those terms are ported.
+- Some scene-side operations (asset write, actuator models, certain sensor types) still go
+  through torch. They participate in the step but are not yet captured into the graph; they
+  set the lower bound on observed step time.
+- Sensors that depend on the Kit RTX renderer (camera-based observations) cannot be combined
+  with the warp env path — they need Kit, which the warp runtime does not initialise.
+
+**API differences vs stable**
+
+- Reset events use a boolean ``env_mask`` (``wp.array(dtype=wp.bool)``) instead of an
+  ``env_ids`` list. This is required for capture safety: variable-length indexing changes
+  graph topology and breaks replay.
+- All buffers must be pre-allocated in ``__init__``. There is no dynamic allocation inside
+  the captured step loop, so observation / reward / termination output dimensions must be
+  known at env init.
+- Term functions write into a pre-allocated ``out`` buffer rather than returning a tensor.
+  See :doc:`warp-env-migration` for the kernel + launch pattern.
+- Code inside the captured step loop must follow capture-safety rules (no
+  ``wp.to_torch``, no torch arithmetic, no lazy-evaluated properties, no Python branching
+  on GPU data). See the *Capture Safety* section in :doc:`warp-env-migration` for the
+  full set of rules.
+
+
+Benchmarking Your Environment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The performance table above was produced with ``scripts/benchmarks/benchmark_rsl_rl.py``,
+which runs a fixed iteration count and reports step-time statistics. Use the same script
+to estimate the gain for your own task before committing to a migration.
+
+**Single-task A/B**
+
+.. code-block:: bash
+
+    # Stable variant
+    ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \
+        --task <Task-Name>-v0 \
+        --num_envs 4096 \
+        --max_iterations 500 \
+        --headless \
+        --benchmark_backend summary \
+        --output_path benchmarks/stable
+
+    # Warp variant — same task with -Warp- suffix
+    ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \
+        --task <Task-Name>-Warp-v0 \
+        --num_envs 4096 \
+        --max_iterations 500 \
+        --headless \
+        --benchmark_backend summary \
+        --output_path benchmarks/warp
+
+The ``summary`` backend prints step time (mean / p50 / p99) and total throughput. Compare
+"step time" between the two runs to estimate the gain per env step.
+
+**Sweep across all available tasks**
+
+``scripts/benchmarks/run_training_benchmarks.sh`` runs the full set of stable tasks listed
+in the script (cartpole, ant, humanoid, locomotion, manipulation). Pair it with a
+warp-tasks variant (substitute the ``-Warp-`` suffixed task ids) and diff the two outputs.
+
+**What to look at in the output**
+
+- *Step time (mean / p99)*: the headline number — what each env step costs.
+- *Iteration time*: includes policy update; useful for end-to-end training throughput.
+- *Capture overhead*: for warp runs, the first few iterations include CUDA graph capture
+  cost; exclude those when comparing steady-state numbers.
+
+**Estimating before you migrate**
+
+If you can't run the warp variant yet (e.g. the task isn't ported), measure the stable
+step time and look at where it's spent:
+
+- ``step_time`` dominated by the physics step → expect modest warp gains.
+- ``step_time`` dominated by ``manager.compute_*`` calls → expect large gains, since those
+  are exactly what the warp managers replace with captured kernel launches.
+
+Use ``--num_frames`` on ``benchmark_non_rl.py`` for a no-policy step-time microbenchmark
+when you want to isolate env overhead from policy compute.
+
+
+Migrating Existing Environments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For step-by-step instructions on porting an existing stable env (or writing a new warp
+env from scratch), see :doc:`warp-env-migration`. It covers project layout, the kernel +
+launch pattern shared by observations / rewards / events / terminations / actions,
+capture-safety rules, and parity testing.
+
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+
+   warp-env-migration

From 7283d24388157a63931512a9fcd804cddc70f870 Mon Sep 17 00:00:00 2001
From: jichuanh <jichuanh@nvidia.com>
Date: Mon, 18 May 2026 03:22:56 +0000
Subject: [PATCH 2/4] [Docs] Fix newton toctree and clarify warp-env-migration
 wording

- Drop unwritten 'training-environments' and 'visualization' entries from
  the newton-physics-integration toctree; add 'warp-env-migration' which is
  included in this PR.
- In warp-env-migration.rst, refer to the non-experimental implementation as
  'torch' rather than 'stable' so the warp/torch contrast is explicit.
---
 .../newton-physics-integration/index.rst      |  3 +--
 .../warp-env-migration.rst                    | 26 +++++++++----------
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/docs/source/experimental-features/newton-physics-integration/index.rst b/docs/source/experimental-features/newton-physics-integration/index.rst
index 52df7a68f512..b93c5a3fd2c0 100644
--- a/docs/source/experimental-features/newton-physics-integration/index.rst
+++ b/docs/source/experimental-features/newton-physics-integration/index.rst
@@ -39,8 +39,7 @@ For an overview of how the multi-backend architecture works, including how to ad
 
   installation
   warp-environments
-  training-environments
-  visualization
+  warp-env-migration
   limitations-and-known-bugs
   solver-transitioning
   using-kamino
diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
index b47c2c50282d..a88b0fae5e5c 100644
--- a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
+++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
@@ -4,7 +4,7 @@ Warp Environment Migration Guide
 ================================
 
 This guide covers the key conventions and patterns used by the warp-first environment
-infrastructure, useful for migrating existing stable environments or creating new ones
+infrastructure, useful for migrating existing torch environments or creating new ones
 natively. For an overview of the warp env path itself (workflows, available envs,
 performance, limitations, benchmarking), see :doc:`warp-environments`.
 
@@ -39,12 +39,12 @@ Key consequences:
 Project Structure
 ~~~~~~~~~~~~~~~~~
 
-Warp-specific implementations that deviate from stable live in the ``_experimental`` packages:
+Warp-specific implementations that diverge from the torch API live in the ``_experimental`` packages:
 
 - ``isaaclab_experimental`` — warp managers, base env classes, warp MDP terms
 - ``isaaclab_tasks_experimental`` — warp task configs and task-specific MDP terms
 
-Any new warp implementation that differs from the stable API belongs in these packages.
+Any new warp implementation that differs from the torch API belongs in these packages.
 Warp task configs reference Newton physics directly (no ``PresetCfg``) since the warp path
 is Newton-only.
 
@@ -70,12 +70,12 @@ Common Pattern
 ^^^^^^^^^^^^^^
 
 All warp MDP terms (observations, rewards, terminations, events, actions) follow the same
-**kernel + launch** pattern. Stable terms use torch tensors and return results; warp terms
+**kernel + launch** pattern. Torch terms use torch tensors and return results; warp terms
 write into pre-allocated ``wp.array`` output buffers via ``@wp.kernel`` functions:
 
 .. code-block:: python
 
-   # Stable — returns a tensor
+   # Torch — returns a tensor
    def lin_vel_z_l2(env, asset_cfg) -> torch.Tensor:
        return torch.square(asset.data.root_lin_vel_b[:, 2])
 
@@ -152,7 +152,7 @@ checks the mask to skip non-selected environments:
        # ... modify state for selected envs only
 
 - RNG uses per-env ``env.rng_state_wp`` (``wp.uint32``) instead of ``torch.rand``
-- **Startup/prestartup** events use the stable convention ``(env, env_ids, **params)``
+- **Startup/prestartup** events use the torch convention ``(env, env_ids, **params)``
 - **Reset/interval** events use the warp convention ``(env, env_mask, **params)``
 
 
@@ -181,17 +181,17 @@ Parity Testing
 
 Two levels of parity testing are used to validate warp terms:
 
-**1. Implementation parity (stable vs warp)** — verifies that the warp kernel produces the
-same result as the stable torch implementation. This is optional for terms that have no stable
+**1. Implementation parity (torch vs warp)** — verifies that the warp kernel produces the
+same result as the torch implementation. This is optional for terms that have no torch
 counterpart (e.g. new terms written directly in warp).
 
 .. code-block:: python
 
-   import isaaclab.envs.mdp.observations as stable_obs
+   import isaaclab.envs.mdp.observations as torch_obs
    import isaaclab_experimental.envs.mdp.observations as warp_obs
 
-   # Stable baseline
-   expected = stable_obs.joint_pos(stable_env, asset_cfg=cfg)
+   # Torch baseline
+   expected = torch_obs.joint_pos(torch_env, asset_cfg=cfg)
 
    # Warp (uncaptured)
    out = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
@@ -203,7 +203,7 @@ counterpart (e.g. new terms written directly in warp).
 **2. Capture parity (warp vs warp-captured)** — verifies that the term produces identical
 results when replayed from a CUDA graph vs launched directly. A mismatch here indicates capture-unsafe
 code (e.g. stale pointers, dynamic allocation, or lazy property access that doesn't replay).
-This test should always be run, even for terms without a stable counterpart.
+This test should always be run, even for terms without a torch counterpart.
 
 .. code-block:: python
 
@@ -276,5 +276,5 @@ Available Warp MDP Terms
      - | ``JointPositionAction``
        | ``JointEffortAction``
 
-Terms not listed here remain in stable only. When using an env that requires unlisted terms,
+Terms not listed here remain in torch only. When using an env that requires unlisted terms,
 those terms must be implemented in warp first.

From 8424278a9db3e65f663f42a6326a6db32ae9f10f Mon Sep 17 00:00:00 2001
From: jichuanh <jichuanh@nvidia.com>
Date: Mon, 18 May 2026 03:29:51 +0000
Subject: [PATCH 3/4] [Docs] Fix Warp concurrency doc URL

Use https://nvidia.github.io/warp/stable/deep_dive/concurrency.html
(404 on the old non-stable path).
---
 .../newton-physics-integration/warp-env-migration.rst           | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
index a88b0fae5e5c..14097581d373 100644
--- a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
+++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
@@ -20,7 +20,7 @@ CPU overhead — the parameter validation, kernel selection, and buffer setup th
 20–200 μs per operation is performed once during graph instantiation and reused on every replay
 (~10 μs total). All CPU-side code (Python logic, torch dispatching) executed during capture is
 completely bypassed during replay. See the `Warp concurrency documentation
-<https://nvidia.github.io/warp/deep_dive/concurrency.html>`_ for Warp's graph capture API
+<https://nvidia.github.io/warp/stable/deep_dive/concurrency.html>`_ for Warp's graph capture API
 (``wp.ScopedCapture``).
 
 All design decisions in the warp infrastructure follow from this constraint: every operation in the

From 977808724cc31349f86c0a6b8129810de49e565d Mon Sep 17 00:00:00 2001
From: jichuanh <jichuanh@nvidia.com>
Date: Mon, 18 May 2026 03:31:33 +0000
Subject: [PATCH 4/4] [Docs] Reword torch-API contrast to torch-based managers
 / env classes

---
 .../newton-physics-integration/warp-env-migration.rst         | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
index 14097581d373..468ced739b4a 100644
--- a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
+++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
@@ -39,12 +39,12 @@ Key consequences:
 Project Structure
 ~~~~~~~~~~~~~~~~~
 
-Warp-specific implementations that diverge from the torch API live in the ``_experimental`` packages:
+Warp-specific implementations that diverge from the torch-based managers and env classes live in the ``_experimental`` packages:
 
 - ``isaaclab_experimental`` — warp managers, base env classes, warp MDP terms
 - ``isaaclab_tasks_experimental`` — warp task configs and task-specific MDP terms
 
-Any new warp implementation that differs from the torch API belongs in these packages.
+Any new warp implementation that differs from the torch-based managers or env classes belongs in these packages.
 Warp task configs reference Newton physics directly (no ``PresetCfg``) since the warp path
 is Newton-only.