diff --git a/docs/source/experimental-features/newton-physics-integration/index.rst b/docs/source/experimental-features/newton-physics-integration/index.rst
index afe783cc8716..b93c5a3fd2c0 100644
--- a/docs/source/experimental-features/newton-physics-integration/index.rst
+++ b/docs/source/experimental-features/newton-physics-integration/index.rst
@@ -38,6 +38,8 @@ For an overview of how the multi-backend architecture works, including how to ad
   :titlesonly:
 
   installation
+  warp-environments
+  warp-env-migration
   limitations-and-known-bugs
   solver-transitioning
   using-kamino
diff --git a/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
new file mode 100644
index 000000000000..468ced739b4a
--- /dev/null
+++ b/docs/source/experimental-features/newton-physics-integration/warp-env-migration.rst
@@ -0,0 +1,280 @@
+.. _warp-env-migration:
+
+Warp Environment Migration Guide
+================================
+
+This guide covers the key conventions and patterns used by the warp-first environment
+infrastructure, useful for migrating existing torch environments or creating new ones
+natively. For an overview of the warp env path itself (workflows, available envs,
+performance, limitations, benchmarking), see :doc:`warp-environments`.
+
+
+Design Rationale
+~~~~~~~~~~~~~~~~
+
+The warp environment path is built around `CUDA graph capture
+<https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html>`_.
+A CUDA graph records a sequence of GPU operations (kernel launches, memory copies) during a
+capture phase, then replays the entire sequence with a single launch. This eliminates per-kernel
+CPU overhead: the parameter validation, kernel selection, and buffer setup that normally cost
+20–200 μs per operation are performed once during graph instantiation and reused on every replay
+(~10 μs total). All CPU-side code (Python logic, torch dispatching) executed during capture is
+completely bypassed during replay. See the `Warp concurrency documentation
+<https://nvidia.github.io/warp/stable/deep_dive/concurrency.html>`_ for Warp's graph capture API
+(``wp.ScopedCapture``).
+
+All design decisions in the warp infrastructure follow from this constraint: every operation in the
+step loop must be a GPU kernel launch with stable memory pointers so that the captured graph can
+be replayed without modification.
+
+Key consequences:
+
+- All buffers are **pre-allocated**: no dynamic allocation inside the step loop
+- Data flows through **persistent** ``wp.array`` **pointers**: never replaced, only overwritten
+- MDP terms are **pure** ``@wp.kernel`` **functions**: no Python branching on GPU data
+- Reset uses **boolean masks** (``env_mask``) instead of index lists (``env_ids``) to avoid
+  variable-length indexing that changes graph topology
+
+
+Project Structure
+~~~~~~~~~~~~~~~~~
+
+Warp-specific implementations that diverge from the torch-based managers and env classes live in the ``_experimental`` packages:
+
+- ``isaaclab_experimental`` — warp managers, base env classes, warp MDP terms
+- ``isaaclab_tasks_experimental`` — warp task configs and task-specific MDP terms
+
+Any new warp implementation that differs from the torch-based managers or env classes belongs in these packages.
+Warp task configs reference Newton physics directly (no ``PresetCfg``) since the warp path
+is Newton-only.
+
+
+Writing Warp MDP Terms
+~~~~~~~~~~~~~~~~~~~~~~
+
+Imports
+^^^^^^^
+
+Warp task configs import from the experimental packages:
+
+.. code-block:: python
+
+   # Warp
+   from isaaclab_experimental.managers import ObservationTermCfg, RewardTermCfg, SceneEntityCfg
+   import isaaclab_experimental.envs.mdp as mdp
+
+The term config classes have the same interface — only the import path changes.
+
+
+Common Pattern
+^^^^^^^^^^^^^^
+
+All warp MDP terms (observations, rewards, terminations, events, actions) follow the same
+**kernel + launch** pattern. Torch terms use torch tensors and return results; warp terms
+write into pre-allocated ``wp.array`` output buffers via ``@wp.kernel`` functions:
+
+.. code-block:: python
+
+   # Torch — returns a tensor
+   def lin_vel_z_l2(env, asset_cfg) -> torch.Tensor:
+       return torch.square(env.scene[asset_cfg.name].data.root_lin_vel_b[:, 2])
+
+   # Warp — writes into pre-allocated output
+   @wp.kernel
+   def _lin_vel_z_l2_kernel(vel: wp.array(...), out: wp.array(dtype=wp.float32)):
+       i = wp.tid()
+       out[i] = vel[i][2] * vel[i][2]
+
+   def lin_vel_z_l2(env, out, asset_cfg) -> None:
+       wp.launch(_lin_vel_z_l2_kernel, dim=env.num_envs, inputs=[..., out])
+
+The output buffer shapes differ by term type:
+
+- **Observations**: ``(num_envs, D)`` where D is the observation dimension
+- **Rewards**: ``(num_envs,)``
+- **Terminations**: ``(num_envs,)`` with dtype ``bool``
+- **Events**: ``(num_envs,)`` input mask; events produce no output and instead modify sim state
+
+
+Observation Terms
+^^^^^^^^^^^^^^^^^
+
+Since warp terms write into pre-allocated buffers, the observation manager must know each
+term's output dimension at initialization to allocate the correct ``(num_envs, D)`` output
+array. This is resolved via a fallback chain (see
+``ObservationManager._infer_term_dim_scalar`` in
+``isaaclab_experimental/managers/observation_manager.py``):
+
+1. **Explicit** ``out_dim`` **in decorator** (preferred):
+
+   .. code-block:: python
+
+      @generic_io_descriptor_warp(out_dim=3, observation_type="RootState")
+      def base_lin_vel(env, out, asset_cfg) -> None: ...
+
+   ``out_dim`` can be an integer, or a string that resolves at initialization:
+
+   - ``"joint"`` — number of selected joints from ``asset_cfg``
+   - ``"body:N"`` — N components per selected body from ``asset_cfg``
+   - ``"command"`` — dimension from command manager
+   - ``"action"`` — dimension from action manager
+
+2. ``axes`` **metadata**: Dimension equals the number of axes listed:
+
+   .. code-block:: python
+
+      @generic_io_descriptor_warp(axes=["X", "Y", "Z"], observation_type="RootState")
+      def projected_gravity(env, out, asset_cfg) -> None: ...
+      # → dimension = 3
+
+3. **Legacy params**: ``term_dim``, ``out_dim``, or ``obs_dim`` keys in ``term_cfg.params``.
+
+4. **Asset config fallback**: Count of ``asset_cfg.joint_ids`` (or ``joint_ids_wp``) for
+   joint-level terms.
+
+
+Event Terms
+^^^^^^^^^^^
+
+Events use ``env_mask`` (boolean ``wp.array``) instead of ``env_ids``, and each kernel
+checks the mask to skip non-selected environments:
+
+.. code-block:: python
+
+   def reset_joints_by_offset(env, env_mask, ...):
+       wp.launch(_kernel, dim=env.num_envs, inputs=[env_mask, ...])
+
+   @wp.kernel
+   def _kernel(env_mask: wp.array(dtype=wp.bool), ...):
+       i = wp.tid()
+       if not env_mask[i]:
+           return
+       # ... modify state for selected envs only
+
+- RNG uses per-env ``env.rng_state_wp`` (``wp.uint32``) instead of ``torch.rand``
+- **Startup/prestartup** events use the torch convention ``(env, env_ids, **params)``
+- **Reset/interval** events use the warp convention ``(env, env_mask, **params)``
+
+
+Action Terms
+^^^^^^^^^^^^
+
+Actions follow a **two-stage execution**: ``process_actions`` (called once per env step) scales
+and clips raw actions, and ``apply_actions`` (called once per sim step) writes targets to the
+asset. Both stages use warp kernels with pre-allocated ``_raw_actions`` and ``_processed_actions``
+buffers.
+
+
+Capture Safety
+^^^^^^^^^^^^^^
+
+When writing terms that run inside the captured step loop, keep in mind:
+
+- **No** ``wp.to_torch`` or torch arithmetic: stay in warp throughout
+- **No lazy-evaluated properties**: use sim-bound (Tier 1) data directly; if a derived
+  quantity is needed, compute it inline in the kernel
+- **No dynamic allocation**: all buffers must be pre-allocated in ``__init__``
+
+
+Parity Testing
+~~~~~~~~~~~~~~
+
+Two levels of parity testing are used to validate warp terms:
+
+**1. Implementation parity (torch vs warp)** — verifies that the warp kernel produces the
+same result as the torch implementation. This is optional for terms that have no torch
+counterpart (e.g. new terms written directly in warp).
+
+.. code-block:: python
+
+   import isaaclab.envs.mdp.observations as torch_obs
+   import isaaclab_experimental.envs.mdp.observations as warp_obs
+
+   # Torch baseline
+   expected = torch_obs.joint_pos(torch_env, asset_cfg=cfg)
+
+   # Warp (uncaptured)
+   out = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   warp_obs.joint_pos(warp_env, out, asset_cfg=cfg)
+   actual = wp.to_torch(out)
+
+   torch.testing.assert_close(actual, expected)
+
+**2. Capture parity (warp vs warp-captured)** — verifies that the term produces identical
+results when replayed from a CUDA graph vs launched directly. A mismatch here indicates capture-unsafe
+code (e.g. stale pointers, dynamic allocation, or lazy property access that doesn't replay).
+This test should always be run, even for terms without a torch counterpart.
+
+.. code-block:: python
+
+   # Warp uncaptured
+   out_uncaptured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   warp_obs.joint_pos(warp_env, out_uncaptured, asset_cfg=cfg)
+
+   # Warp captured (graph replay)
+   out_captured = wp.zeros((num_envs, num_joints), dtype=wp.float32, device=device)
+   with wp.ScopedCapture() as cap:
+       warp_obs.joint_pos(warp_env, out_captured, asset_cfg=cfg)
+   wp.capture_launch(cap.graph)
+
+   torch.testing.assert_close(wp.to_torch(out_captured), wp.to_torch(out_uncaptured))
+
+See ``source/isaaclab_experimental/test/envs/mdp/`` for complete parity test examples.
+
+
+Available Warp MDP Terms
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Category
+     - Available Terms
+   * - Observations (11)
+     - | ``base_pos_z``
+       | ``base_lin_vel``
+       | ``base_ang_vel``
+       | ``projected_gravity``
+       | ``joint_pos``
+       | ``joint_pos_rel``
+       | ``joint_pos_limit_normalized``
+       | ``joint_vel``
+       | ``joint_vel_rel``
+       | ``last_action``
+       | ``generated_commands``
+   * - Rewards (16)
+     - | ``is_alive``
+       | ``is_terminated``
+       | ``lin_vel_z_l2``
+       | ``ang_vel_xy_l2``
+       | ``flat_orientation_l2``
+       | ``joint_torques_l2``
+       | ``joint_vel_l1``
+       | ``joint_vel_l2``
+       | ``joint_acc_l2``
+       | ``joint_deviation_l1``
+       | ``joint_pos_limits``
+       | ``action_rate_l2``
+       | ``action_l2``
+       | ``undesired_contacts``
+       | ``track_lin_vel_xy_exp``
+       | ``track_ang_vel_z_exp``
+   * - Events (6)
+     - | ``reset_joints_by_offset``
+       | ``reset_joints_by_scale``
+       | ``reset_root_state_uniform``
+       | ``push_by_setting_velocity``
+       | ``apply_external_force_torque``
+       | ``randomize_rigid_body_com``
+   * - Terminations (4)
+     - | ``time_out``
+       | ``root_height_below_minimum``
+       | ``joint_pos_out_of_manual_limit``
+       | ``illegal_contact``
+   * - Actions (2)
+     - | ``JointPositionAction``
+       | ``JointEffortAction``
+
+Terms not listed here remain in torch only. When using an env that requires unlisted terms,
+those terms must be implemented in warp first.
diff --git a/docs/source/experimental-features/newton-physics-integration/warp-environments.rst b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst
new file mode 100644
index 000000000000..c1107741239b
--- /dev/null
+++ b/docs/source/experimental-features/newton-physics-integration/warp-environments.rst
@@ -0,0 +1,331 @@
+.. _warp-environments:
+
+Warp Experimental Environments
+==============================
+
+.. note::
+
+   The warp environment infrastructure lives in ``isaaclab_experimental`` and
+   ``isaaclab_tasks_experimental``. It's an experimental feature.
+
+The experimental extensions introduce **warp-first** environment infrastructure with CUDA graph capture
+support. All environment-side computation (observations, rewards, resets, actions) runs as pure Warp
+kernels, eliminating Python overhead and enabling CUDA graph capture for maximum throughput.
+
+
+Workflows
+~~~~~~~~~
+
+Two environment workflows are supported:
+
+**Direct workflow** — ``DirectRLEnvWarp`` base class. You implement the step loop, observations,
+rewards, and resets directly in your env class using Warp kernels.
+
+**Manager-based workflow** — ``ManagerBasedRLEnvWarp`` base class. You define MDP terms as
+standalone Warp-kernel functions and compose them via configuration.
+
+
+Available Environments
+~~~~~~~~~~~~~~~~~~~~~~
+
+Direct Warp Environments
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+- ``Isaac-Cartpole-Direct-Warp-v0`` — Cartpole balance
+- ``Isaac-Ant-Direct-Warp-v0`` — Ant locomotion
+- ``Isaac-Humanoid-Direct-Warp-v0`` — Humanoid locomotion
+- ``Isaac-Repose-Cube-Allegro-Direct-Warp-v0`` — Allegro hand cube repose
+
+
+Manager-Based Warp Environments
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**Classic**
+
+- ``Isaac-Cartpole-Warp-v0``
+- ``Isaac-Ant-Warp-v0``
+- ``Isaac-Humanoid-Warp-v0``
+
+**Locomotion (Flat)**
+
+- ``Isaac-Velocity-Flat-Anymal-B-Warp-v0``
+- ``Isaac-Velocity-Flat-Anymal-C-Warp-v0``
+- ``Isaac-Velocity-Flat-Anymal-D-Warp-v0``
+- ``Isaac-Velocity-Flat-Cassie-Warp-v0``
+- ``Isaac-Velocity-Flat-G1-Warp-v0``
+- ``Isaac-Velocity-Flat-G1-Warp-v1``
+- ``Isaac-Velocity-Flat-H1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-A1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-Go1-Warp-v0``
+- ``Isaac-Velocity-Flat-Unitree-Go2-Warp-v0``
+
+**Manipulation**
+
+- ``Isaac-Reach-Franka-Warp-v0``
+- ``Isaac-Reach-UR10-Warp-v0``
+
+
+Quick Start
+~~~~~~~~~~~
+
+.. code-block:: bash
+
+    # Direct workflow
+    ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
+        --task Isaac-Cartpole-Direct-Warp-v0 --num_envs 4096 --headless
+
+    # Manager-based workflow
+    ./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
+        --task Isaac-Velocity-Flat-Anymal-C-Warp-v0 --num_envs 4096 --headless
+
+All RL libraries with warp-compatible wrappers are supported: RSL-RL, RL Games, SKRL, and
+Stable-Baselines3.
+
+
+Performance Comparison
+~~~~~~~~~~~~~~~~~~~~~~
+
+The table below compares step time between the stable (torch/manager) and warp (CUDA graph
+captured) variants, both on the Newton physics backend, over 300 iterations with 4096 environments.
+
+.. note::
+
+   The warp migration is an ongoing effort. Several components (e.g. scene write, actuator models)
+   have not yet been migrated to Warp kernels and still run through torch. Further performance
+   improvements are expected as these components are migrated.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 12 15 15 12
+
+   * - Env
+     - Type
+     - Stable Step (μs)
+     - Warp Step (μs)
+     - Change
+   * - Cartpole-Direct
+     - Direct
+     - 5,274
+     - 4,331
+     - -17.88%
+   * - Ant-Direct
+     - Direct
+     - 6,368
+     - 3,128
+     - -50.88%
+   * - Humanoid-Direct
+     - Direct
+     - 13,937
+     - 10,783
+     - -22.63%
+   * - Allegro-Direct
+     - Direct
+     - 82,950
+     - 74,570
+     - -10.10%
+   * - Cartpole
+     - Manager
+     - 7,971
+     - 3,642
+     - -54.31%
+   * - Ant
+     - Manager
+     - 9,781
+     - 4,672
+     - -52.23%
+   * - Humanoid
+     - Manager
+     - 17,653
+     - 12,505
+     - -29.16%
+   * - Reach-Franka
+     - Manager
+     - 11,458
+     - 7,813
+     - -31.83%
+   * - Anymal-B
+     - Manager
+     - 29,188
+     - 21,781
+     - -25.38%
+   * - Anymal-C
+     - Manager
+     - 30,938
+     - 22,228
+     - -28.15%
+   * - Anymal-D
+     - Manager
+     - 32,294
+     - 23,977
+     - -25.75%
+   * - Cassie
+     - Manager
+     - 17,320
+     - 10,706
+     - -38.19%
+   * - G1
+     - Manager
+     - 34,487
+     - 27,300
+     - -20.84%
+   * - H1
+     - Manager
+     - 22,202
+     - 15,864
+     - -28.55%
+   * - A1
+     - Manager
+     - 15,257
+     - 9,907
+     - -35.07%
+   * - Go1
+     - Manager
+     - 16,515
+     - 11,869
+     - -28.13%
+   * - Go2
+     - Manager
+     - 15,221
+     - 9,966
+     - -34.52%
+
+
+Which Workflows Benefit Most
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The savings come from eliminating Python / torch overhead in the env's step loop, so an env
+gains in proportion to the share of its step time previously spent on per-kernel CPU
+overhead. Reading the table above:
+
+- **Manager-based classic RL** (Cartpole, Ant) — biggest gains (-52% to -54%). Many small
+  reward / observation terms with low compute per term, so per-launch CPU overhead dominated
+  the stable baseline.
+- **Manager-based locomotion** (Anymal, G1, H1, Cassie, Unitree): gains of -21% to -38%.
+  The MDP has more terms but the underlying physics step is heavier, so the relative
+  Python savings shrink.
+- **Direct workflow** — gains scale with how much the env's step body was Python (Ant -51%,
+  Cartpole -18%, Allegro hand -10%). Direct envs that already wrote most of their work as
+  GPU kernels see modest gains; ones with substantial Python state machinery see large ones.
+- **Compute-heavy / scene-write-heavy envs** (Allegro hand, large humanoids) — see smaller
+  relative gains because the warp-side savings are amortised over a heavier step. Components
+  that still go through torch (scene write, actuator models) currently bound the floor; this
+  is expected to improve as remaining components migrate to warp.
+
+If your env's step time is dominated by physics or scene I/O, expect modest gains. If it has
+many small MDP terms or a lot of Python in the step loop, expect large ones. Use the
+benchmarking workflow below to measure on your task before committing to a migration.
+
+
+Limitations
+~~~~~~~~~~~
+
+The warp env path is experimental and has the following known constraints. These are
+specific to warp envs; for Newton physics limitations see :doc:`limitations-and-known-bugs`.
+
+**Physics backend**
+
+- **Newton only.** PhysX is not supported under the warp env path. Asset and sensor
+  ``class_type`` fields resolve to ``isaaclab_physx.*`` classes that depend on
+  ``omni.physics.tensors`` (a Kit module the warp runtime does not initialise), and several
+  warp APIs (env-mask reset, CUDA graph capture) require the Newton articulation. Configure
+  the cfg with a Newton physics block (or ``presets=newton``).
+
+**MDP coverage**
+
+- Only the terms listed under :ref:`Available Warp MDP Terms <warp-env-migration>` are
+  implemented. Stable envs that depend on un-migrated terms cannot be run on the warp path
+  until those terms are ported.
+- Some scene-side operations (asset write, actuator models, certain sensor types) still go
+  through torch. They participate in the step but are not yet captured into the graph; they
+  set the lower bound on observed step time.
+- Sensors that depend on the Kit RTX renderer (camera-based observations) cannot be combined
+  with the warp env path — they need Kit, which the warp runtime does not initialise.
+
+**API differences vs stable**
+
+- Reset events use a boolean ``env_mask`` (``wp.array(dtype=wp.bool)``) instead of an
+  ``env_ids`` list. This is required for capture safety: variable-length indexing changes
+  graph topology and breaks replay.
+- All buffers must be pre-allocated in ``__init__``. There is no dynamic allocation inside
+  the captured step loop, so observation / reward / termination output dimensions must be
+  known at env init.
+- Term functions write into a pre-allocated ``out`` buffer rather than returning a tensor.
+  See :doc:`warp-env-migration` for the kernel + launch pattern.
+- Code inside the captured step loop must follow capture-safety rules (no
+  ``wp.to_torch``, no torch arithmetic, no lazy-evaluated properties, no Python branching
+  on GPU data). See the *Capture Safety* section in :doc:`warp-env-migration` for the
+  full set of rules.
+
+
+Benchmarking Your Environment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The performance table above was produced with ``scripts/benchmarks/benchmark_rsl_rl.py``,
+which runs a fixed iteration count and reports step-time statistics. Use the same script
+to estimate the gain for your own task before committing to a migration.
+
+**Single-task A/B**
+
+.. code-block:: bash
+
+    # Stable variant
+    ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \
+        --task <Task-Name>-v0 \
+        --num_envs 4096 \
+        --max_iterations 500 \
+        --headless \
+        --benchmark_backend summary \
+        --output_path benchmarks/stable
+
+    # Warp variant: same task id with -Warp inserted before the version suffix
+    ./isaaclab.sh -p scripts/benchmarks/benchmark_rsl_rl.py \
+        --task <Task-Name>-Warp-v0 \
+        --num_envs 4096 \
+        --max_iterations 500 \
+        --headless \
+        --benchmark_backend summary \
+        --output_path benchmarks/warp
+
+The ``summary`` backend prints step time (mean / p50 / p99) and total throughput. Compare
+"step time" between the two runs to estimate the gain per env step.
+
+**Sweep across all available tasks**
+
+``scripts/benchmarks/run_training_benchmarks.sh`` runs the full set of stable tasks listed
+in the script (cartpole, ant, humanoid, locomotion, manipulation). Pair it with a
+warp-tasks variant (substitute the ``-Warp-`` suffixed task ids) and diff the two outputs.
+
+**What to look at in the output**
+
+- *Step time (mean / p99)*: the headline number — what each env step costs.
+- *Iteration time*: includes policy update; useful for end-to-end training throughput.
+- *Capture overhead*: for warp runs, the first few iterations include CUDA graph capture
+  cost; exclude those when comparing steady-state numbers.
+
+**Estimating before you migrate**
+
+If you can't run the warp variant yet (e.g. the task isn't ported), measure the stable
+step time and look at where it's spent:
+
+- ``step_time`` dominated by the physics step → expect modest warp gains.
+- ``step_time`` dominated by ``manager.compute_*`` calls → expect large gains, since those
+  are exactly what the warp managers replace with captured kernel launches.
+
+Use ``--num_frames`` on ``benchmark_non_rl.py`` for a no-policy step-time microbenchmark
+when you want to isolate env overhead from policy compute.
+
+
+Migrating Existing Environments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For step-by-step instructions on porting an existing stable env (or writing a new warp
+env from scratch), see :doc:`warp-env-migration`. It covers project layout, the kernel +
+launch pattern shared by observations / rewards / events / terminations / actions,
+capture-safety rules, and parity testing.
+
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+
+   warp-env-migration