Liuw/release ckpt by nv-liuw · Pull Request #20 · NVlabs/COMPASS

nv-liuw · 2026-05-13T23:48:23Z

No description provided.

Cherry-picked bucket-A changes from origin/samc/support_nurec_assets_isaaclab_3.0, covering only the API surface required for the existing mobility_es extension to run on Isaac Lab 3.0. NuRec assets, occupancy-map precomputation, multi-camera video recording, and debug-image logging are deferred to follow-up PRs.

API migration (mobility_es):
- isaaclab.utils.noise: AdditiveUniformNoiseCfg -> UniformNoiseCfg
- Restructure physics config via isaaclab_physx.physics.PhysxCfg
- rerender_on_reset -> num_rerenders_on_reset
- ActionTerm import: isaaclab.envs.mdp.actions -> isaaclab.managers
- Wrap asset.data.{root_pos_w, root_quat_w, root_lin_vel_w, root_ang_vel_w, default_root_pose, default_root_vel} with wp.to_torch()
- Replace write_root_state_to_sim() with write_root_pose_to_sim_index() + write_root_velocity_to_sim_index()
- Flip quaternion convention wxyz -> xyzw across EnvSceneAssetCfg rotations and NonHolonomicPerfectControlAction yaw extraction
- Rename velocity_limit -> velocity_limit_sim on carter caster joints
- Use commands.UniformPose2dCommand directly (module reorg)
- Add pyproject.toml to mobility_es extension

Pins / docs:
- Bump IsaacLab badge to v3.0.0-beta1 in README.md
- Bump install instructions in compass/rl_env/README.md to v3.0.0-beta1
- Add CLAUDE.md (project guide for Claude Code agents)
- Add release_tracker.md (next-release work tracker)

Dockerfile.rl:
- Bump base image to nvcr.io/nvidia/isaac-lab:3.0.0-beta1
- Install requirements.txt, the X-Mobility wheel, and the mobility_es editable extension at build time so the resulting image can run `python run.py` end-to-end.

Smoke-tested end-to-end on a fresh Isaac Lab 3.0-beta1 image with num_envs=1: Isaac Sim launches, the mobility_es env constructs, USDs load with the new xyzw quaternion convention, and ResidualPPOTrainer enters its first rollout at ~7 it/s for 256 steps. Verified both --headless and --viz kit modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
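
For reviewers, a minimal sketch of what the wxyz -> xyzw flip means for yaw extraction. The function names are hypothetical and this is plain Python rather than the actual mobility_es code; it only illustrates the component reordering and the standard yaw formula for a unit quaternion.

```python
import math


def wxyz_to_xyzw(q):
    """Reorder a (w, x, y, z) quaternion into (x, y, z, w)."""
    w, x, y, z = q
    return (x, y, z, w)


def yaw_from_xyzw(q):
    """Yaw (rotation about +Z, in radians) of a unit quaternion in (x, y, z, w) order."""
    x, y, z, w = q
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


# A 90-degree rotation about Z, written in the old (w, x, y, z) order.
half_angle = math.pi / 4.0
old_quat = (math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))
new_quat = wxyz_to_xyzw(old_quat)
yaw = yaw_from_xyzw(new_quat)
```

Reading a wxyz value with xyzw indexing (or the reverse) silently yields a different rotation, which is why the asset configs and the yaw extraction have to flip together.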

Migrate the OSMO training/eval/record/distillation workflow YAMLs and their submitter from the internal repo (gitlab-master.nvidia.com:12051/ml_nav/compass) to the public NVlabs/COMPASS layout. New layout: - osmo/workflows/{rl_es_train,rl_es_eval,rl_es_record,distillation_train}_workflow.yaml - osmo/run_osmo.py - Python CLI replacing the internal interactive run.sh - osmo/README.md - usage docs - README.md - new "OSMO Cloud Submission" section linking to osmo/README.md Sanitization vs the internal originals: - Replace NVIDIA proprietary copyright header with SPDX-Apache-2.0 - Drop OMNI_SERVER env var and omni-auth credentials block (RL workflows pull USDs from the OSMO groot_mobility_rl_es_usds dataset, no internal Nucleus needed) - Drop the three pip install lines that re-installed requirements.txt, the X-Mobility wheel, and the mobility_es editable extension at OSMO startup; these are now baked into Dockerfile.rl by PR-1 - Empty out internal-only defaults (image, base_policy_ckpt_artifact); run_osmo.py errors fast if the user doesn't supply them - Rename default workflow_name afm_rl_es -> compass_rl_es to drop the internal "afm" branding huggingface-cli invocation: - Use ${ISAACLAB_PATH}/isaaclab.sh -p -m huggingface_hub.commands.huggingface_cli rather than the bare /isaac-sim/kit/python/bin/huggingface-cli script. The bare binary's shebang invokes the bundled Python directly, bypassing python.sh and missing the omni.pip.cloud/pip_prebundle path where the pre-installed `requests` package lives. The wrapper-based invocation gets the right sys.path and login succeeds. 
run_osmo.py: - argparse subparsers for train / eval / record / distill - Reads WANDB_API_KEY / HF_TOKEN from env, with --prompt fallback - Resolves --image (pre-built) or builds + pushes the right Dockerfile using --registry-prefix / $COMPASS_OSMO_REGISTRY - --dry-run prints the computed `osmo workflow submit` invocation - Distillation does not require HF_TOKEN - All four subcommands verified end-to-end with --dry-run rendering syntactically correct submit invocations Smoke-tested: - Local docker run reproducing the OSMO entry script with num_envs=1: hf-login + base-policy ckpt load + ResidualPPOTrainer rollout (256 steps) - Real OSMO submit (compass_rl_es_g1_smoke_v2-1) on pool groot-l40-01 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
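
A rough sketch of the launcher shape described above, not the actual osmo/run_osmo.py. The `--experiment-name`, `--image`, and `--dry-run` flags and the four subcommands come from this PR; the `--set key=value` syntax for passing template variables and the helper names are assumptions made for illustration.

```python
import argparse
import os
import shlex


def build_parser():
    """One subparser per workflow, each with a shared set of flags."""
    parser = argparse.ArgumentParser(prog="run_osmo.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("train", "eval", "record", "distill"):
        p = sub.add_parser(name)
        p.add_argument("--experiment-name", required=True)
        p.add_argument("--image", default=None)
        p.add_argument("--dry-run", action="store_true")
    return parser


def build_submit_cmd(args, workflow_path, env=None):
    """Return the submit command as a list; fail fast on missing inputs."""
    env = os.environ if env is None else env
    if not args.image:
        raise SystemExit("--image is required in this sketch (no build + push path)")
    # Distillation does not need HF_TOKEN; the RL workflows do.
    if args.command != "distill" and not env.get("HF_TOKEN"):
        raise SystemExit("HF_TOKEN must be set for RL workflows")
    # Assumption: template variables are passed as '--set key=value' pairs.
    return [
        "osmo", "workflow", "submit", workflow_path, "--set",
        f"image={args.image}", f"experiment_name={args.experiment_name}",
    ]


args = build_parser().parse_args(
    ["train", "--experiment-name", "probe", "--image", "foo", "--dry-run"])
cmd = build_submit_cmd(args, "osmo/workflows/rl_es_train_workflow.yaml",
                       env={"HF_TOKEN": "hf_xxx"})
if args.dry_run:
    print(shlex.join(cmd))  # print instead of executing
```

Keeping command construction separate from execution is what makes a `--dry-run` mode cheap: the same list is either printed or handed to a subprocess call.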

Migrate the agentic-skills tooling that automates the COMPASS training pipeline (SAGE-10k scene search/download → USD conversion → scene registration → preview → train → eval) from the internal repo (gitlab-master.nvidia.com:12051/ml_nav/compass:.claude/skills/compass/) to the public NVlabs/COMPASS layout. New layout: - .claude/skills/compass/SKILL.md - Claude Code skill definition - .claude/skills/compass/scripts/sage10k_search.py - SAGE-10k search by text query - .claude/skills/compass/scripts/sage10k_to_usd.py - SAGE-10k -> USD converter Sanitization vs the internal originals: - Bump Isaac Lab version reference 2.3.2 -> 3.0.0-beta1 to match PR-1 - Update OSMO submission section to recommend the Python launcher (osmo/run_osmo.py) introduced by PR-3, with osmo/workflows/ paths - Replace internal wandb artifact example (nvidia-isaac/afm_train/model-u5f67ich:v3) with a generic placeholder - Add SPDX-Apache-2.0 headers to both Python scripts - Skip .claude/settings.local.json (user-specific local config) The skill itself is unchanged in structure; only references to the public OSMO launcher and Isaac Lab pin moved forward. Conda-based execution wrapper (conda run -n <ENV_NAME> ...) stays as the documented host-side pattern; the Docker dev environment from item #7 (when implemented) becomes a parallel option. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>

Replace the manual GUI-based occupancy-map authoring workflow with a one-shot CLI that wraps Isaac Sim's isaacsim.asset.gen.omap APIs and emits ROS-style PNG + YAML at <usd_dir>/omap/. New: scripts/generate_omap_from_usd.py - Wraps isaacsim.asset.gen.omap.bindings._omap.Generator with a small argparse front-end (--out-dir, --cell-size, --z-min/--z-max, --bounds, --padding, --map-name) - Boots a headless SimulationApp, enables the omap extension, opens the input USD, ensures a UsdPhysics.Scene, computes world bounds from the stage bbox (with default 1m padding), generates the 2D occupancy buffer, writes RGBA PNG + ROS-compatible YAML - Output convention matches the existing bundled maps (top-left origin: YAML.origin = (xmin, ymax), image row 0 = ymax) Loader: compass/rl_env/exts/mobility_es/mobility_es/utils/occupancy_map.py - New auto-discovery fallback: when OMAP_PATHS has no entry for the scene, look at <dirname(usd_path)>/omap/occupancy_map.yaml (where the generator drops files by default). Skipped for scenes using MultiUsdFileCfg (no single USD path). - New scenes can now skip the OMAP_PATHS registration entirely. Docs: compass/rl_env/README.md - "Add Occupancy Map" step now points at the script as the primary flow; the manual Isaac Sim UI flow stays as a fallback. Verification on the office scene: - Generator produced 245x242 RGBA PNG + YAML under /tmp/omap_test/office/ (vs the existing 205x202; difference is the default 1m padding). - Ran a 300-sample collision-check (loader-side mirror in scratch /tmp/verify_omap.py): 200/300 came back free, 100/300 collision. Visual annotation confirms every free sample lands on an unoccupied (white) cell — the script's output is semantically correct under OccupancyMapCollisionChecker. Two larger scenes were attempted but blocked by Isaac Sim 3.0-beta1 flakiness unrelated to this script: - combined_single_rack (3.1 GB, deep refs): open_stage hung; will need more update ticks or a heavier stage-loading wait. 
Out of scope for this PR. - sample_small_footprint (428 MB): repeatedly crashed kit at 2s with std::out_of_range in omni.graph.core during init, even with a clean container/GPU. Environmental. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
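
To make the output convention concrete, here is a small sketch of the top-left-origin mapping (YAML.origin = (xmin, ymax), image row 0 = ymax) and of the auto-discovery path. Both helper names are hypothetical and the code is independent of the real loader in occupancy_map.py; it assumes a square cell size in metres.

```python
import math
import os


def world_to_pixel(x, y, origin, resolution):
    """Map a world-frame (x, y) to an image (row, col) for a top-left origin.

    origin is (xmin, ymax): columns grow with +x, rows grow with -y.
    """
    xmin, ymax = origin
    col = int(math.floor((x - xmin) / resolution))
    row = int(math.floor((ymax - y) / resolution))
    return row, col


def default_omap_yaml(usd_path):
    """Path the loader would fall back to when a scene has no OMAP_PATHS entry."""
    return os.path.join(os.path.dirname(usd_path), "omap", "occupancy_map.yaml")


row, col = world_to_pixel(0.2, 1.3, origin=(-1.0, 2.0), resolution=0.5)
yaml_path = default_omap_yaml("/data/scenes/office/office.usd")
```

A collision check then reduces to looking up the pixel at (row, col) after confirming the indices fall inside the image.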

Cut COMPASS first-run UX from "30-60 min, 6 manual steps" to "3 commands, ~3 min after the image build" and make the steady-state dev loop feel like a Python venv: host-side editor, host-side shell, but every python / pip / tensorboard invocation transparently routed through the container.

Three-layer dev model:

    Host shell (editor, terminal, prompt)
      → shim PATH (set by `source docker/activate`)
      → docker exec (translates host CWD to container path)
      → daemon container (compass-rl image, repo bind-mounted at /workspace/COMPASS)

Quick start:

    export HF_TOKEN=hf_xxx
    ./docker/run.sh assets     # USDs + X-Mobility ckpt → ./assets/
    ./docker/run.sh build      # build the dev image
    source ./docker/activate   # venv-like activation
    python run.py -c configs/train_config.gin -o /tmp/out \
        -b ./assets/x_mobility.ckpt --enable_camera

Elegance: a single bind-mount ($(pwd) → /workspace/COMPASS) covers the entire dev experience. The only extras are X11 socket forwarding (for --viz kit) and a writable kit shader cache. Compare with the robotic_grounding/workflow/run.sh reference (~10 individual mounts because it isn't structured around a single repo root).

New files:
- docker/run.sh - subcommand wrapper: build / assets / up / down / exec / shell / status. Single source of truth for mount + env args via _compass_run_args(). Container name hashes the absolute repo path (compass-<user>-<sha1[1:8]>) so multiple checkouts coexist.
- docker/activate - sourceable; brings up the container if needed, generates a tmp dir of shim scripts (python, pip, isaaclab.sh, tensorboard, pytest, yapf, pylint, pre-commit) on PATH. Each shim docker-exec's into the container with the host CWD translated to the container path. Defines deactivate() to revert PATH/PS1 and clean up.
- docker/prepare_assets.sh - HF downloader for compass_usds.zip + x_mobility-nav2-semantic_action_path.ckpt → ./assets/. Cache-aware, no-op on second run.
- docker/README.md - subcommand reference, multi-checkout / multi-GPU notes, git workflow notes, troubleshooting.

Modified:
- docker/Dockerfile.rl - install COMPASS at /workspace/COMPASS (so /workspace/isaaclab from the base image survives the bind-mount); add /usr/local/bin/python wrapper that exec's Isaac Sim's bundled python.sh directly (so `python run.py` inside the container does not need ${ISAACLAB_PATH}/isaaclab.sh -p boilerplate).
- README.md - Quick Start now leads with the Docker path; bare-metal install moved under "Manual install".
- .dockerignore - exclude ./assets, ./.cache, ./.git, ./.nv, ./.nvidia-omniverse from the build context.
- .gitignore - exclude /assets/, /.cache/, /.nv/, /.nvidia-omniverse/.

Verification on this checkout:
- `./docker/run.sh build && ./docker/run.sh up` brings up the container.
- From an activated shell:
  * `python --version` → Python 3.12.12 (Isaac Sim's bundled python).
  * `python -c "import mobility_es; print(mobility_es.__file__)"` resolves to /workspace/COMPASS/compass/rl_env/exts/mobility_es/... (the bind-mount path → host edits hot-reload via the editable install).
  * `python -c "import isaaclab; print(isaaclab.__file__)"` resolves to /workspace/isaaclab/source/... (base image, untouched by bind-mount).
  * `python -c "import torch; print(torch.cuda.is_available())"` → True.
  * `cd compass/rl_env/exts/mobility_es && python -c "import os; print(os.getcwd())"` prints /workspace/COMPASS/compass/rl_env/exts/mobility_es (CWD translation working under repo subdirs).
- End-to-end smoke: `python run.py -c configs/train_config.gin -o /tmp/out -b ./assets/x_mobility.ckpt --num_envs 1 --enable_camera --headless` reaches scene-creation and simulation-start (162s, then PPO setup) via the activate shim. Container exits cleanly on `./docker/run.sh down`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
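
The two mechanisms that make multiple checkouts and subdirectory invocations work can be sketched in a few lines. This is an illustration in Python, not the shell code in docker/run.sh or docker/activate; taking the first eight hex characters of the SHA-1 digest and falling back to the repo root for paths outside the checkout are both assumptions.

```python
import getpass
import hashlib
import os

CONTAINER_ROOT = "/workspace/COMPASS"  # bind-mount target named in this PR


def container_name(repo_root, user=None):
    """Derive a per-checkout container name from the absolute repo path."""
    user = user or getpass.getuser()
    digest = hashlib.sha1(os.path.abspath(repo_root).encode()).hexdigest()
    # Assumption: the name uses the first 8 hex characters of the digest.
    return f"compass-{user}-{digest[:8]}"


def translate_cwd(host_cwd, repo_root):
    """Translate a host working directory into the matching container path."""
    rel = os.path.relpath(os.path.abspath(host_cwd), os.path.abspath(repo_root))
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        return CONTAINER_ROOT  # outside the checkout: fall back to the root
    if rel == os.curdir:
        return CONTAINER_ROOT
    return CONTAINER_ROOT + "/" + rel.replace(os.sep, "/")


name = container_name("/home/dev/COMPASS", user="dev")
cwd = translate_cwd("/home/dev/COMPASS/compass/rl_env/exts/mobility_es",
                    "/home/dev/COMPASS")
```

Because the name depends only on the user and the absolute path, two checkouts in different directories get distinct containers without any registry of running instances.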

Replace the hand-served `gh_page` branch with a two-tier publication that auto-deploys via GitHub Actions on every push to main: - nvlabs.github.io/COMPASS/ - academic landing (Bulma static site, migrated from gh_page into docs/project_page/) - nvlabs.github.io/COMPASS/docs/ - Sphinx handbook with the NVIDIA theme (matches agentic_model_training, RAPIDS, Isaac Lab look) Stack: - Sphinx 7.x + myst-parser (markdown stays markdown, no .rst) - nvidia-sphinx-theme (NVIDIA OSS house style) - sphinx_design (grid cards on the landing), sphinx_copybutton, sphinxcontrib-mermaid Layout under docs/: project_page/ academic landing (Bulma; mp4/png LFS-tracked, rest as Git blobs; index.html grew a "Docs" CTA next to Code/arXiv pointing at ./docs/, plus copy fixes: "Nvidia,"->"NVIDIA,", "can achieves"->"can achieve", "GROOT"->"GR00T", missing googletagmanager loader added so analytics actually fires) handbook/ Sphinx project (conf.py, Makefile, requirements.txt, _static/custom.css for minor brand polish, plus the markdown sources) README.md contributor guide (build + serve recipes, deploy description, editing tips) Handbook nav (4 captioned toctrees, all in docs/handbook/index.md): Installation: Quick start / Docker-as-venv install / Agentic skills / Adding a new embodiment or scene Workflows: Training / Recording / Distillation / Export / GR00T post-training (VLA fine-tuning) ROS2 Deployment: Overview / Isaac Sim setup Reference: OSMO cloud submission / Auto OMap from USDs / Contributing Each handbook page owns its content directly — no `{include}` of source READMEs. 
As part of this change the five top-level READMEs that the handbook used to transclude have been deleted and their content folded into the corresponding handbook pages: - compass/rl_env/README.md -> docs/handbook/extending.md - docker/README.md -> docs/handbook/installation/docker.md - osmo/README.md -> docs/handbook/osmo.md - ros2_deployment/README.md -> docs/handbook/deployment/ros2.md - ros2_deployment/ISAACSIM_README.md -> docs/handbook/deployment/isaac_sim.md Three READMEs are kept (per "outside docs/"): - README.md (root) - slimmed to overview + 5-line quick start + pointer to the handbook for everything else (was ~320 lines, now ~85) - docs/README.md - contributor docs setup notes (build/serve recipes) - docs/project_page/README.md - academic page archive CI workflow at .github/workflows/docs.yml: - python -m pip install -r docs/handbook/requirements.txt - make -C docs/handbook html (sphinx-build -W; warnings -> errors, including broken internal links - no `myst.xref_missing` / `image.not_readable` suppressions) - copy docs/project_page/* into _site/, docs/handbook/_build/html into _site/docs/, deploy via actions/deploy-pages@v4 (no intermediate gh-pages branch). - LFS pull enabled so academic-landing mp4/png assets reach the deploy. One-time owner action (called out in the PR description, not done by this commit): Settings -> Pages -> Source: "GitHub Actions" (replaces "Deploy from a branch: gh_page"). The gh_page branch becomes a frozen archive — not deleted, not rebuilt. 
Cross-references repointed at handbook URLs (Sphinx outputs .html files, not directory-style URLs): - README.md (root) - 4 deleted-README links -> handbook .html URLs - CLAUDE.md - 2 deleted-README references repointed - osmo/run_osmo.py docstring - "See osmo/README.md" -> handbook URL - .claude/skills/compass/SKILL.md - OSMO link -> handbook URL - release_tracker.md pending items - retargeted at handbook URLs; already-completed checkboxes left as historical record Camera flag: standardize on `--enable_cameras` (plural) everywhere. Verified canonical name in /workspace/isaaclab/source/isaaclab/isaaclab/app/app_launcher.py:276 and that record.py:66 already uses the plural. Singular wrongly appeared in the root README quick-start, quickstart.md, training.md (x2), distillation.md, gr00t_finetuning.md - all fixed. Side fix in source: drop the trailing `---` from ros2_deployment/README.md before deletion (it was confusing docutils when transcluded — moot now that the README is gone). .gitignore additions: - /assets/ (Docker-asset downloads from PR-7) - /.cache/ (pip / pre-commit / huggingface caches) - /.nv/, /.nvidia-omniverse/ (Isaac Sim runtime cruft) - /docs/handbook/_build/ (Sphinx local-build output) Local verification: python3 -m venv /tmp/sphinx_venv /tmp/sphinx_venv/bin/pip install -r docs/handbook/requirements.txt make -C docs/handbook clean html -> "build succeeded" with -W enabled (zero warnings). -> 16 handbook pages present. -> Combined preview at /tmp/_site/ via python -m http.server hits both / (academic) and /docs/ (handbook) cleanly. -> Regression test: a deliberate `[bogus](does_not_exist.md)` link in a page now fails -W as expected, confirming the strict posture is load-bearing. release_tracker.md item #6 status flipped 🟢 with the full checklist ticked (Pages settings switch + first deploy still pending — owner action). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
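
For anyone reproducing the combined preview locally, the site-assembly step of the workflow (project page at the root, handbook HTML under docs/) can be sketched in Python. The function name is hypothetical and the CI job itself may use plain shell copies; the `_site` layout is the one described above.

```python
import shutil
from pathlib import Path


def assemble_site(project_page: Path, handbook_html: Path, site: Path) -> None:
    """Lay out the two-tier site: landing page at /, handbook at /docs/."""
    if site.exists():
        shutil.rmtree(site)  # start from a clean output directory
    shutil.copytree(project_page, site)
    # dirs_exist_ok tolerates a pre-existing docs/ directory in the landing page.
    shutil.copytree(handbook_html, site / "docs", dirs_exist_ok=True)
```

For example, `assemble_site(Path("docs/project_page"), Path("docs/handbook/_build/html"), Path("/tmp/_site"))` followed by `python -m http.server` in /tmp/_site should serve both tiers, assuming the handbook was built first.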

…rnal defaults Pre-release leak audit found three categories of internal-only references in the OSMO entry path. Fixed in this commit: 1. USDs from internal OSMO dataset -> HuggingFace The three RL workflows (train / eval / record) used to mount an internal OSMO dataset `groot_mobility_rl_es_usds` and `cp` USDs out of `{{input:0}}/...`. Public users have no access to that dataset. Replaced the `cp` step with `huggingface_cli download nvidia/COMPASS compass_usds.zip --repo-type dataset`, then unzip into compass/rl_env/exts/mobility_es/mobility_es/. Mirrors the host-side docker/prepare_assets.sh recipe. The `inputs:` block (and its trailing `name: groot_mobility_rl_es_usds` line) is gone. 2. X-Mobility ckpt from internal wandb artifact -> HuggingFace Same three workflows used `wandb artifact get {{base_policy_ckpt_artifact}} --root ./[base_policy/]` to pull the public X-Mobility ckpt from a private wandb mirror. Replaced with `huggingface_cli download nvidia/X-Mobility x_mobility-nav2-semantic_action_path.ckpt`, then rename to `model.ckpt` so downstream `-b ./model.ckpt` / `-b ./base_policy/model.ckpt` invocations stay byte-identical. The `base_policy_ckpt_artifact` template var is gone from defaults. `osmo/run_osmo.py` no longer accepts `--base-policy-ckpt`. 3. Internal wandb-project defaults `osmo/run_osmo.py` previously baked in `compass_rl_enhance` / `afm_train` as `nvidia-isaac`-entity defaults; "afm" is internal branding. Dropped the DEFAULT_WANDB_PROJECT dict entirely; made `--wandb-project` `required=True` on every subparser that hits a wandb-enabled workflow. Also: - Updated docs/handbook/osmo.md: prerequisites no longer mention the OSMO dataset upload, the wandb base-policy mirror, or the --base-policy-ckpt flag. Quick-start example uses --wandb-project with a generic value. Subcommand stanzas updated. Troubleshooting bullet on dataset-not-found now explicitly distillation-only; added a new bullet for HF download failures. 
- release_tracker.md: new workstream #8 (Pre-release leak audit + sanitization) added to the summary table and as a section, with status 🟡 (most boxes ticked, OSMO smoke + maintainer review pending). - Platform names (`ovx-l40` for RL, `dgx-h100` for distillation) kept as defaults per user direction — these are recommended pool names rather than internal-only references. Verification: - Handbook still passes -W: `make -C docs/handbook clean html`. - grep -rnE 'groot_mobility_rl_es_usds|nvidia-isaac/|afm_train' excluding _build/.git/release_tracker.md/dev_env_plan.md returns no live-source hits. - `python osmo/run_osmo.py train --help` shows --wandb-project as required, no --base-policy-ckpt flag. - `python osmo/run_osmo.py train --experiment-name probe --image foo --dry-run` errors with "the following arguments are required: --wandb-project", as expected. Out of scope here: - ros2_deployment/compass_navigator/setup.py maintainer attribution (flagged in tracker; defer to user). - release_tracker.md and dev_env_plan.md gitlab-master references (handled at ship time per the existing CHANGELOG distillation gate). - A live OSMO smoke test against the new workflows (image rebuild + resubmit pending). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
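
The grep gate in the verification list can also be expressed as a small Python check, which is easier to wire into a test suite. This is a sketch under the same exclusions as the grep above (_build, .git, release_tracker.md, dev_env_plan.md); the function name is hypothetical and nothing like it is added by this commit.

```python
import os
import re

LEAK_PATTERN = re.compile(r"groot_mobility_rl_es_usds|nvidia-isaac/|afm_train")
EXCLUDED_DIRS = {"_build", ".git"}
EXCLUDED_FILES = {"release_tracker.md", "dev_env_plan.md"}


def find_leaks(root):
    """Return (path, line number, line) for every internal-only reference."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if name in EXCLUDED_FILES:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, 1):
                        if LEAK_PATTERN.search(line):
                            hits.append((path, lineno, line.strip()))
            except OSError:
                continue  # unreadable file: skip rather than fail the audit
    return hits
```

An empty return value corresponds to the "no live-source hits" result reported above.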

Two release-readiness changes:

1. .github/workflows/pre-commit.yml - runs the existing pre-commit hooks (yapf, pylint, nbstripout, clang-format, end-of-file-fixer, trailing-whitespace, requirements-txt-fixer, check-added-large-files) on every PR + push to main. Uses the pre-commit/action GitHub Action for hook-env caching. Python 3.11 (yapf v0.31.0 still depends on lib2to3, removed in Python 3.13). pylint runs without project deps because .pylintrc has `import-error` disabled.

2. requirements.txt - replace 17 unpinned lines with `==<version>` pins based on the package set verified inside compass-rl:latest (the image used for the just-passed OSMO smoke): einops==0.8.2, gin-config==0.5.0, h5py==3.16.0, matplotlib==3.10.8, moviepy==2.2.1, msgpack==1.1.2, numpy==1.26.4, onnx==1.20.1, onnxruntime-gpu==1.25.1, pandas==3.0.1, pytorch-lightning==2.6.1, timm==1.0.26, torcheval==0.0.7, transformers==4.57.6, wandb==0.25.1, wheel==0.47.0, zmq==0.0.0. diffusers==0.29.2 was already pinned. zmq==0.0.0 stays as the stub it currently is (functional swap to pyzmq is out of scope).

Surfaced legacy violations on the first `pre-commit run --all-files`, fixed in the same commit so the workflow lands green:
- Trailing whitespace + missing EOF newline cleanup across 22 files (mostly empty __init__.py files, project-page CSS/JS, mobility_es pyproject.toml).
- yapf reformatted argparse blocks in osmo/run_osmo.py to its 2-space multi-line style.
- requirements-txt-fixer alphabetized docs/handbook/requirements.txt.
- scripts/generate_omap_from_usd.py: drop unused `import os`; add a scoped `pylint: disable=ungrouped-imports` on the `isaacsim.asset.gen.omap.bindings` import (intentionally late so the extension can be enabled before its bindings load).

Verification:
- `pre-commit run --all-files` is green locally (Python 3.10 venv, same hook set as the workflow).
- `requirements.txt` parses; the only line without `==` is the (intentional) blank EOF.
- Workflow YAML is syntactically valid; pre-commit/action is the canonical action.

release_tracker.md: new workstream #9 (CI/CD setup + dep pinning) added to the summary table and as a section, status 🟡 (the file changes are done; the live first-CI-run is the only remaining checkbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
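
The "every non-blank line carries `==`" check from the verification list is simple enough to automate. A minimal sketch, with a hypothetical function name; it treats `#` as a comment marker and does not attempt to parse extras, markers, or URLs.

```python
def unpinned_lines(requirements_text):
    """Return (line number, content) for each requirement without an '==' pin."""
    offenders = []
    for lineno, raw in enumerate(requirements_text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            offenders.append((lineno, line))
    return offenders


sample = "einops==0.8.2\nnumpy\n\n# tooling\nwheel==0.47.0\n"
offenders = unpinned_lines(sample)
```

Here `numpy` on line 2 is the only entry reported; blank lines and comment-only lines are ignored.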

…nt skills Hybrid front-door pattern: /compass stays as the umbrella (training, SAGE workflows, eval, OSMO), three new specialty siblings handle the high-friction onboarding moments. - /compass: drop conda + ISAACLAB_PATH wrappers throughout, replace Setup with the 3-command docker-as-venv flow, add Specialty skills routing section, soften MUST-style rules with the why. OSMO example aligned with #8 sanitization (no --base-policy-ckpt; HF download inside the workflow). - /compass-deploy: ckpt -> ONNX (-r/-g+-e branch) -> TRT engine -> ROS2 launch scaffold. Skill prints the launch command but doesn't run it. - /compass-debug: 8-check diagnostic with bundled scripts/compass_status.sh (parallel checks; --deep adds Isaac Sim init; --ckpt loads via torch). Reports root cause and routes to the right specialty for the fix. - /compass-newembodiment: interactive robot onboarding. Parses pre-supplied input; AskUserQuestion only for missing fields; shows diff before writing; smoke-tests with --num_envs 1. - Progressive-disclosure split: Setup SAGE Local extracted to compass/references/setup-sage-local.md. - docs/handbook/agentic.md: opens with a "Pick the right skill" matrix and per-skill sections. - release_tracker.md: flipped #4's migration boxes ✅; added #10 row and section for this work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>

The eval subcommand of osmo/run_osmo.py already exposes --embodiment and --environment overrides; train didn't, so multi-embodiment training sweeps had to either hand-edit the workflow YAML or run with the gin-config default (g1) for every job. - osmo/run_osmo.py: add --embodiment / --environment flags to the train subparser; thread them into cmd_train's set_args dict. - osmo/workflows/rl_es_train_workflow.yaml: accept embodiment / environment template vars (default empty), conditionally append them to TRAIN_CMD and to the post-training EVAL_CMD that runs in the same workflow. Pattern matches what rl_es_eval_workflow.yaml already does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>

The internal repo's release/internal.yaml strips a 113-line benchmark.py that hardcodes nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance and afm_rl_enhance defaults. The public side currently has no benchmark runner, and the No-regression benchmark gate (P0 release blocker) has no concrete tooling. Track the sanitization-and-land work as a sub-bullet under that gate so it has an owner-able shape; suggested landing path osmo/run_benchmark.py mirrors osmo/run_osmo.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>

Adds 8-GPU distributed residual-RL training (torchrun + manual all-reduce), per-stage timing instrumentation, supporting OSMO workflow, and a perf-analysis report. Multi-GPU (run.py, compass/residual_rl/ppo.py, residual_ppo_trainer.py) - --distributed flag, dist.init_process_group(nccl), per-rank device (cuda:local_rank), per-rank env seed offset, env_cfg.sim.device pinned per rank, torch.cuda.set_device(local_rank) before init so NCCL object collectives route through per-rank GPUs, conditional nn.DataParallel(device_ids=[local_rank]), rank-0-only Logger / RecordVideo (no-op logger on other ranks). - ResidualPPOTrainer: world_size / global_rank state, initial param broadcast from rank 0, rank-0 gating on _save_ckpt / _save_episode_logs / _upload_video, weighted per-rank episode-log aggregation via dist.all_gather_object on CPU-converted dicts, torch.load(map_location=self.device) on resume, os.makedirs(exist_ok=True). - PPO.update: manual gradient all-reduce (AVG) BEFORE clip (canonical DDP order — clipping the averaged gradient, not the per-rank pre-avg one), kl_mean all-reduce before LR adaptation so all ranks pick the same LR, metric all-reduce on (value_loss, surrogate_loss, entropy). Returns a diagnostics dict so the trainer logs ppo/learning_rate, ppo/kl_mean, ppo/entropy, ppo/action_std_mean each iter. OSMO 8-GPU workflow (osmo/workflows/rl_es_train_8gpu_workflow.yaml, osmo/run_osmo.py) - 8-GPU / 80-CPU / 800-GiB resource block on ovx-l40; train phase via torch.distributed.run --nproc_per_node=8 run.py --distributed with --num_envs 32 per rank (256 total envs); eval phase as a single-process call. - run_osmo.py: --num-gpus flag (choices 2, 8) routes to the matching workflow YAML. 
Per-stage timing instrumentation (always on, ~2-3% overhead) - ResidualPPOTrainer.learn(): CUDA-synced _timer context manager around each top-level stage (env_reset_and_init, rollout, compute_returns, update, logging, checkpoint) plus rollout sub-steps (act, env_step, process_env_step, base_policy_process). - _install_env_timers monkey-patches Isaac Lab managers + sim.step / sim.render + per-ObsTerm .func so logs include time/rollout/env_step/{obs, sim_step, ...} and time/rollout/env_step/obs_term/<group>/<name>. Once-per-iter log_dict via the existing logger. Perf-analysis report (docs/PERF_ANALYSIS.{md,pdf}) - Methodology, baseline 32-env 256-step iter breakdown, A/B experiments (depth-drop, DLAA + denoiser, BF16 autocast, per-stage env_step breakdown, per-ObsTerm breakdown of obs.compute), ranked recommendations. Signed-off-by: Wei Liu <liuw@nvidia.com>
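
Why the all-reduce has to come before the clip is easiest to see with numbers. The sketch below simulates two ranks in plain Python lists; it is not the PPO.update code, and the helper names are hypothetical. In the real trainer the averaging would be a `torch.distributed.all_reduce` with `ReduceOp.AVG` and the clip a call such as `torch.nn.utils.clip_grad_norm_`.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def clip_by_norm(vec, max_norm):
    """Scale vec down so its L2 norm does not exceed max_norm."""
    norm = l2_norm(vec)
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [x * scale for x in vec]


def average(vectors):
    """Element-wise mean across ranks (what an AVG all-reduce computes)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Two hypothetical ranks: one saw a large gradient, the other saw none.
rank_grads = [[3.0, 0.0], [0.0, 0.0]]
max_norm = 1.0

# DDP-style order: average first, then clip the averaged gradient.
avg_then_clip = clip_by_norm(average(rank_grads), max_norm)

# Wrong order: clip per rank, then average the already-clipped gradients.
clip_then_avg = average([clip_by_norm(g, max_norm) for g in rank_grads])
```

With these inputs the averaged gradient has norm 1.5 and is clipped to norm 1.0, while clipping per rank first yields an update of norm about 0.5, so the effective step size would depend on how gradients are spread across ranks.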

Sanitizes the 113-line benchmark.py from the internal repo and lands it as osmo/run_benchmark.py next to run_osmo.py. Closes the "sanitize and land internal benchmark.py" subtask under the No-regression benchmark gate. Behavior: fires one rl_es_eval_workflow.yaml submission per --environments entry, all using the same --embodiment. Each run writes the usual eval/* metrics (goal_reached_rate / fall_down_rate / total_travel_time / weighted_travel_time) to W&B at bm_<embodiment>_<env>_<experiment>; user pulls those out for regression assessment. Sanitization changes vs internal: - Apache-2.0 SPDX header replacing NVIDIA proprietary copyright. - Hardcoded registry nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance -> --registry-prefix flag with $COMPASS_OSMO_REGISTRY fallback (mirrors the run_osmo.py:80-83 pattern; errors fast if unset and --image-name not given). - --wandb-project-name: drop afm_rl_enhance_benchmark default, mark required=True (mirrors run_osmo.py:93-95). - Workflow path: ./workflows/rl_es_eval_workflow.yaml -> Path(__file__).resolve().parent / "workflows" / ... so the script works from any CWD and lives under osmo/. - Adds --dry-run and --prompt for parity with run_osmo.py. Reuses existing osmo/workflows/rl_es_eval_workflow.yaml unchanged. Default sweep stays at 5 scenes (simple_office, warehouse_single_rack, warehouse_multi_rack, combined_single_rack, combined_multi_rack), default --embodiment stays h1; per user direction the matrix lives in the script rather than a separate YAML. docs/handbook/osmo.md: adds a Benchmark section + cross-reference under the subcommand table. release_tracker.md: ticks the sanitize-and-land subtask; widens the section-8 grep gate to include groot_mobility_rl_enhance and afm_rl_enhance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>
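
The fan-out behaviour can be summarised in a short sketch: one planned submission per environment, a W&B run name of the form bm_<embodiment>_<env>_<experiment>, and a workflow path resolved relative to the script. The function names and the dictionary layout are hypothetical; the scene list and the h1 default are the ones stated above.

```python
from pathlib import Path

DEFAULT_ENVIRONMENTS = [
    "simple_office",
    "warehouse_single_rack",
    "warehouse_multi_rack",
    "combined_single_rack",
    "combined_multi_rack",
]


def workflow_path(script_file):
    """Resolve the eval workflow next to the script, independent of the CWD."""
    return Path(script_file).resolve().parent / "workflows" / "rl_es_eval_workflow.yaml"


def plan_runs(experiment, embodiment="h1", environments=None):
    """Return one submission descriptor per environment."""
    environments = DEFAULT_ENVIRONMENTS if environments is None else environments
    return [
        {
            "embodiment": embodiment,
            "environment": env,
            "run_name": f"bm_{embodiment}_{env}_{experiment}",
        }
        for env in environments
    ]


runs = plan_runs("iter500")
wf = workflow_path("/repo/osmo/run_benchmark.py")
```

Each descriptor would then be turned into a single `osmo workflow submit` call, or printed when a dry run is requested.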

Closes the §8 sanitization tail in release_tracker.md and pastes the iter-500 multi-GPU benchmark numbers into the No-regression gate.

Argparse defaults: drop the internal `afm_rl_enhance*` wandb-project defaults and mark --wandb-project-name required in all three top-level entry points. Mirrors the run_osmo.py:93-95 pattern we already applied to osmo/run_benchmark.py:
- run.py:58 - was 'afm_rl_enhance'
- record.py:135 - was 'afm_rl_enhance_record'
- distillation_train.py:54 - was 'afm_rl_enhance_distillation'

After the fix, the extended §8 grep gate

    grep -rnE "nvidia-isaac/|afm_train|groot_mobility_rl_enhance|afm_rl_enhance" .

returns zero live-source hits.

release_tracker.md hygiene:
- Title and version refs: "COMPASS 2.0" -> "COMPASS 1.6", target version 2.0.0 -> 1.6.0, integration branch updated to liuw/benchmark_port.
- §2&3 NuRec PR-2 (Buckets B+E+H) marked deferred to post-1.6: only Bucket A (22b25ef) ships in 1.6.
- §11 multi-GPU PPO release-scope decision marked settled (ships in 1.6 by squash-strategy default).
- §8 grep gate tightened: dropped `groot_mobility_rl_es_usds` from the pattern. It is the directory name inside the public HF compass_usds.zip and is correctly referenced by osmo/workflows/*.yaml after unzipping. Gate now returns zero live-source hits.
- §Pre-release gates → No-regression benchmark: pasted 4-embodiment × 5-scene iter-500 multi-GPU results table (goal_reached / fall_down, per-embodiment averages, headlines). Eval ran with the relaxed-heading image compass_release_1_6_relaxed:c87052af (heading_threshold=π); source ships default 0.1, and the release notes will document this.
- Marked "Define the regression matrix" complete (matrix is the script's --environments default; documented). v1.5.0 baseline capture deferred; the 1.6 numbers become the new published baseline.
- iter-1000 single-GPU baseline section stubbed for the in-flight cells.

ros2_deployment/compass_navigator/setup.py: removed a duplicated SPDX Apache-2.0 license header (lines 16-28 were a verbatim copy of 1-14). Maintainer fields kept as-is for 1.6 per user direction; team-alias swap is a post-tag follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wei Liu <liuw@nvidia.com>

yapf v0.31.0 imports lib2to3, which was removed from the standard library in Python 3.13 (and broken on some 3.12 installations). The pre-commit yapf hook fails to install on py3.12 with `ModuleNotFoundError: No module named 'lib2to3'`, which blocks `pre-commit run --all-files` at tag time and any local pre-commit run.

yapf v0.40.0+ bundles its own lib2to3 fork and no longer depends on the stdlib copy. Bumping to v0.43.0 (latest stable as of this commit) fixes the install on py3.12 with no formatting policy change.

Side effect: yapf added one PEP8 blank line in run.py between class `_NoOpLogger` and the top-level `EmbodimentEnvCfgMap`. Pure formatting, no behavior change.

Verification: `pre-commit clean && pre-commit run yapf --files run.py record.py distillation_train.py ros2_deployment/compass_navigator/setup.py osmo/run_benchmark.py` -> all Passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
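
A quick way to see whether a given interpreter is affected is to check whether `lib2to3` can be located at all. The helper name below is hypothetical and the check is only a rough diagnostic, not part of the hook itself.

```python
import importlib.util
import sys


def stdlib_lib2to3_available() -> bool:
    """Return True if `import lib2to3` would find a module in this interpreter."""
    # find_spec locates a top-level module without importing it.
    return importlib.util.find_spec("lib2to3") is not None


print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"lib2to3 importable = {stdlib_lib2to3_available()}")
```

If this reports `False`, any tool that imports the stdlib `lib2to3` directly, such as the older yapf pin, will fail in that environment.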

Distilled from release_tracker.md and the integration-branch commits. Date kept as TBD; will be set to the tag-day date in the v1.6.0 commit.

The entry intentionally does NOT list full NuRec real2sim asset support under "Added"; only Bucket A (Isaac Lab 3.0 API migration, commit 22b25ef) ships in 1.6. NuRec PR-2 (Buckets B+E+H) is deferred to a clean post-1.6 PR off main.

Release notes (separate from CHANGELOG, drafted at tag time) will additionally call out:
- Benchmark eval ran against compass_release_1_6_relaxed:c87052af, an image with heading_threshold flipped from 0.1 to math.pi in termination.py:33. Source ships the default 0.1; the relaxation is image-only and is a known-test-config for the published v1.6.0 benchmark numbers.
- Pre-existing argparse defaults `afm_rl_enhance*` in run.py / record.py / distillation_train.py were dropped; --wandb-project-name is now required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
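
To illustrate why a heading threshold of π amounts to "any final heading counts", here is a hedged sketch of a heading check. The functions `wrap_to_pi` and `heading_ok` are hypothetical and are not the contents of termination.py; the sketch only assumes that the threshold bounds the absolute yaw error in radians.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def heading_ok(robot_yaw: float, goal_yaw: float, heading_threshold: float) -> bool:
    """Hypothetical check: is the absolute heading error within the threshold?"""
    return abs(wrap_to_pi(robot_yaw - goal_yaw)) <= heading_threshold


# A robot facing 90 degrees away from the goal heading:
print(heading_ok(math.pi / 2, 0.0, heading_threshold=0.1))      # strict default
print(heading_ok(math.pi / 2, 0.0, heading_threshold=math.pi))  # relaxed image
```

Because a wrapped error never exceeds π in magnitude, the relaxed setting accepts every orientation under this assumption, so the published success rates effectively measure position only.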

The Sphinx config carries the only in-source COMPASS version string; its `version` and `release` fields feed the handbook footer and the title bar at nvlabs.github.io/COMPASS/docs/. The previous "2.0.0" was left over from the early planning phase when this release was assumed to be a major bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
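
For context, the two Sphinx settings involved look roughly like the fragment below. The exact values and the `project` name are assumptions based on the 1.6.0 target version mentioned in the release tracker; the real conf.py may format them differently.

```python
# Sphinx conf.py fragment (illustrative values, not copied from the repo).
project = "COMPASS"  # assumed project name

# `version` and `release` are standard Sphinx options; themes typically
# render them in page titles and footers.
version = "1.6.0"  # assumed: replaces the stale "2.0.0"
release = "1.6.0"  # assumed: full release string, kept equal to `version`
```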

All 20 cells of the iter-1000 single-GPU baseline sweep (run on pool groot-l40-04 against the *_baseline wandb runs) completed. Pasted the 4×5 results table alongside the existing iter-500 multi-GPU table and added a per-embodiment averages comparison with deltas (Δ).

Key findings:
- carter (wheeled) improves on all axes at iter-1000 (+1.8 pp goal, -1.6 pp fall, faster wtt): converges fast, benefits from more training.
- Bipeds (g1, h1) are roughly on par or slightly regress on average at iter-1000; the drift concentrates in `warehouse_multi_rack` (g1: -25.9 pp goal, +20.8 pp fall; h1: -20.8 pp goal). Other scenes are within seed-noise.
- spot is essentially flat.
- Validates the §11 multi-GPU PPO path in spirit: 8 GPUs × 500 iter ≈ 1 GPU × 1000 iter in samples-seen, and resulting policies are within seed-noise on most cells. The `warehouse_multi_rack` divergence for bipeds is worth flagging in release notes but does not block tag.
- `simple_office` remains the universal weakness across configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
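
The samples-seen comparison depends on per-GPU environment counts, which the commit message does not state. The sketch below shows the arithmetic under explicitly hypothetical numbers: the environment counts (16 and 64) are made up, and the 256-step rollout length is borrowed from the PR description's smoke test as an assumption.

```python
def samples_seen(num_gpus: int, iterations: int, envs_per_gpu: int,
                 steps_per_iter: int) -> int:
    """Total environment steps collected over a training run."""
    return num_gpus * iterations * envs_per_gpu * steps_per_iter


STEPS_PER_ITER = 256  # assumed rollout length

# Hypothetical configs: the multi-GPU run uses a quarter of the per-GPU envs.
multi_gpu = samples_seen(num_gpus=8, iterations=500, envs_per_gpu=16,
                         steps_per_iter=STEPS_PER_ITER)
single_gpu = samples_seen(num_gpus=1, iterations=1000, envs_per_gpu=64,
                          steps_per_iter=STEPS_PER_ITER)
print(multi_gpu, single_gpu)

# With identical per-GPU settings, the multi-GPU run would see 4x the samples.
same_envs = samples_seen(8, 500, 64, STEPS_PER_ITER)
print(same_envs / single_gpu)
```

Under these assumed numbers the two runs match exactly; with equal per-GPU environment counts they would differ by a factor of four, so the actual ratio is worth stating alongside the benchmark tables.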

Surfaces and fixes issues that the older yapf v0.31.0 hook missed:
- yapf v0.43.0 wants slightly different formatting for a few existing lines: line-break style in termination.py:43, dict comprehension in residual_ppo_trainer.py:497, blank lines between nested defs in residual_ppo_trainer.py (PEP8 nits), and a long print expression in sage10k_search.py. Pure formatting, no behavior change.
- pylint flagged two real issues in residual_ppo_trainer.py:
  1. C0301 line-too-long at line 328: the tuple unpacking of `self.base_policy_process(...)`. Wrapped with parentheses and reformatted across three lines to fit the 100-char limit.
  2. W0212 protected-access on obs_mgr._group_obs_term_cfgs / _group_obs_term_names at lines 206-207. The access is intentional and load-bearing; we need to reach into the obs manager's internal term tables to wrap each obs term's `func` for the rollout/obs perf-measurement breakdown. Added a scoped `# pylint: disable=protected-access` comment with explanation.

Verified `pre-commit run --all-files` passes clean on liuw/release_ckpt post-fix (all hooks: nbstripout, large-files, EOF, requirements-txt, trailing-whitespace, yapf, clang-format, pylint).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
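
The protected-access pattern described above can be sketched as follows. `FakeTermCfg`, `FakeObsManager`, `timed`, and `instrument` are hypothetical stand-ins written for illustration; only the two private attribute names come from the commit message, and the real trainer code may differ.

```python
import time
from collections import defaultdict
from typing import Callable, Dict


class FakeTermCfg:
    """Stand-in for an observation term config holding a callable `func`."""

    def __init__(self, func: Callable):
        self.func = func


class FakeObsManager:
    """Stand-in for an observation manager with private term tables."""

    def __init__(self):
        self._group_obs_term_names = {"policy": ["position"]}
        self._group_obs_term_cfgs = {"policy": [FakeTermCfg(lambda: [0.0, 0.0])]}


def timed(name: str, func: Callable, totals: Dict[str, float]) -> Callable:
    """Wrap `func` so its wall-clock time accumulates under `name`."""

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            totals[name] += time.perf_counter() - start

    return wrapper


def instrument(obs_mgr, totals: Dict[str, float]) -> None:
    """Replace each observation term's `func` with a timed wrapper."""
    # Intentional: the term tables are private, but wrapping each term is
    # the only way to get a per-term timing breakdown in this sketch.
    # pylint: disable=protected-access
    names = obs_mgr._group_obs_term_names
    cfgs = obs_mgr._group_obs_term_cfgs
    # pylint: enable=protected-access
    for group, term_names in names.items():
        for term_name, cfg in zip(term_names, cfgs[group]):
            cfg.func = timed(f"{group}/{term_name}", cfg.func, totals)


totals: Dict[str, float] = defaultdict(float)
manager = FakeObsManager()
instrument(manager, totals)
```

Keeping the disable/enable pair tight around the two attribute reads limits the suppression to the lines that need it, so other accidental protected accesses in the same function are still reported.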

nv-liuw and others added 19 commits May 7, 2026 15:40

nv-liuw closed this May 13, 2026

nv-liuw wants to merge 19 commits into main from liuw/release_ckpt

nv-liuw commented May 13, 2026
