Liuw/release ckpt#20

Closed
nv-liuw wants to merge 19 commits into main from liuw/release_ckpt

@nv-liuw nv-liuw commented May 13, 2026

No description provided.

nv-liuw and others added 19 commits May 7, 2026 15:40
Cherry-picked bucket-A changes from origin/samc/support_nurec_assets_isaaclab_3.0
covering only the API surface required for the existing mobility_es extension
to run on Isaac Lab 3.0. NuRec assets, occupancy-map precomputation,
multi-camera video recording, and debug-image logging are deferred to
follow-up PRs.

API migration (mobility_es):
- isaaclab.utils.noise: AdditiveUniformNoiseCfg -> UniformNoiseCfg
- Restructure physics config via isaaclab_physx.physics.PhysxCfg
- rerender_on_reset -> num_rerenders_on_reset
- ActionTerm import: isaaclab.envs.mdp.actions -> isaaclab.managers
- Wrap asset.data.{root_pos_w, root_quat_w, root_lin_vel_w, root_ang_vel_w,
  default_root_pose, default_root_vel} with wp.to_torch()
- Replace write_root_state_to_sim() with write_root_pose_to_sim_index() +
  write_root_velocity_to_sim_index()
- Flip quaternion convention wxyz -> xyzw across EnvSceneAssetCfg rotations
  and NonHolonomicPerfectControlAction yaw extraction
- Rename velocity_limit -> velocity_limit_sim on carter caster joints
- Use commands.UniformPose2dCommand directly (module reorg)
- Add pyproject.toml to mobility_es extension
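
A minimal sketch of the wxyz -> xyzw flip and a yaw extraction under the new ordering. The function names are hypothetical and not taken from mobility_es; the sketch only illustrates the component reordering that the migration implies.

```python
import math


def wxyz_to_xyzw(q):
    """Reorder a (w, x, y, z) quaternion into (x, y, z, w)."""
    w, x, y, z = q
    return (x, y, z, w)


def yaw_from_xyzw(q):
    """Yaw (rotation about +Z, radians) from an (x, y, z, w) unit quaternion."""
    x, y, z, w = q
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


# A 90-degree yaw written in the old wxyz order, then converted.
half_angle = math.pi / 4
q_wxyz = (math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))
q_xyzw = wxyz_to_xyzw(q_wxyz)
yaw = yaw_from_xyzw(q_xyzw)
```

Feeding a wxyz tuple straight into an xyzw consumer silently yields a different rotation, which is why both the scene rotations and the yaw extraction have to change together.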

Pins / docs:
- Bump IsaacLab badge to v3.0.0-beta1 in README.md
- Bump install instructions in compass/rl_env/README.md to v3.0.0-beta1
- Add CLAUDE.md (project guide for Claude Code agents)
- Add release_tracker.md (next-release work tracker)

Dockerfile.rl:
- Bump base image to nvcr.io/nvidia/isaac-lab:3.0.0-beta1
- Install requirements.txt, the X-Mobility wheel, and the mobility_es
  editable extension at build time so the resulting image can run
  `python run.py` end-to-end.

Smoke-tested end-to-end on a fresh Isaac Lab 3.0-beta1 image with
num_envs=1: Isaac Sim launches, mobility_es env constructs, USDs load
with the new xyzw quaternion convention, ResidualPPOTrainer enters its
first rollout at ~7 it/s for 256 steps. Verified both --headless and
--viz kit modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Migrate the OSMO training/eval/record/distillation workflow YAMLs and
their submitter from the internal repo (gitlab-master.nvidia.com:12051/ml_nav/compass)
to the public NVlabs/COMPASS layout.

New layout:
- osmo/workflows/{rl_es_train,rl_es_eval,rl_es_record,distillation_train}_workflow.yaml
- osmo/run_osmo.py - Python CLI replacing the internal interactive run.sh
- osmo/README.md - usage docs
- README.md - new "OSMO Cloud Submission" section linking to osmo/README.md

Sanitization vs the internal originals:
- Replace NVIDIA proprietary copyright header with SPDX-Apache-2.0
- Drop OMNI_SERVER env var and omni-auth credentials block (RL workflows
  pull USDs from the OSMO groot_mobility_rl_es_usds dataset, no internal
  Nucleus needed)
- Drop the three pip install lines that re-installed requirements.txt,
  the X-Mobility wheel, and the mobility_es editable extension at OSMO
  startup; these are now baked into Dockerfile.rl by PR-1
- Empty out internal-only defaults (image, base_policy_ckpt_artifact);
  run_osmo.py errors fast if the user doesn't supply them
- Rename default workflow_name afm_rl_es -> compass_rl_es to drop the
  internal "afm" branding

huggingface-cli invocation:
- Use ${ISAACLAB_PATH}/isaaclab.sh -p -m huggingface_hub.commands.huggingface_cli
  rather than the bare /isaac-sim/kit/python/bin/huggingface-cli script. The
  bare binary's shebang invokes the bundled Python directly, bypassing
  python.sh and missing the omni.pip.cloud/pip_prebundle path where the
  pre-installed `requests` package lives. The wrapper-based invocation gets
  the right sys.path and login succeeds.

run_osmo.py:
- argparse subparsers for train / eval / record / distill
- Reads WANDB_API_KEY / HF_TOKEN from env, with --prompt fallback
- Resolves --image (pre-built) or builds + pushes the right Dockerfile
  using --registry-prefix / $COMPASS_OSMO_REGISTRY
- --dry-run prints the computed `osmo workflow submit` invocation
- Distillation does not require HF_TOKEN
- All four subcommands verified end-to-end with --dry-run rendering
  syntactically correct submit invocations
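
The subparser plus --dry-run shape described above could look roughly like the sketch below. It is an illustration, not the contents of osmo/run_osmo.py: the workflow file naming is hypothetical, --image is made mandatory for brevity (the real script can also build and push), and the key/value overrides the real launcher passes to `osmo workflow submit` are omitted because their syntax is not shown in this PR.

```python
import argparse
import shlex
import subprocess

SUBCOMMANDS = ("train", "eval", "record", "distill")


def build_parser():
    parser = argparse.ArgumentParser(prog="run_osmo_sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in SUBCOMMANDS:
        p = sub.add_parser(name)
        p.add_argument("--image", required=True)
        p.add_argument("--dry-run", action="store_true")
    return parser


def build_submit_cmd(args):
    # Hypothetical file naming; the real workflow YAML names differ.
    workflow = f"osmo/workflows/{args.command}_workflow.yaml"
    return ["osmo", "workflow", "submit", workflow]


def main(argv=None):
    args = build_parser().parse_args(argv)
    cmd = build_submit_cmd(args)
    if args.dry_run:
        # Print the command instead of executing it.
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `main(["eval", "--image", "example:latest", "--dry-run"])` prints the submit command without contacting OSMO.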

Smoke-tested:
- Local docker run reproducing the OSMO entry script with num_envs=1:
  hf-login + base-policy ckpt load + ResidualPPOTrainer rollout (256 steps)
- Real OSMO submit (compass_rl_es_g1_smoke_v2-1) on pool groot-l40-01

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Migrate the agentic-skills tooling that automates the COMPASS training
pipeline (SAGE-10k scene search/download → USD conversion → scene
registration → preview → train → eval) from the internal repo
(gitlab-master.nvidia.com:12051/ml_nav/compass:.claude/skills/compass/)
to the public NVlabs/COMPASS layout.

New layout:
- .claude/skills/compass/SKILL.md - Claude Code skill definition
- .claude/skills/compass/scripts/sage10k_search.py - SAGE-10k search by text query
- .claude/skills/compass/scripts/sage10k_to_usd.py - SAGE-10k -> USD converter

Sanitization vs the internal originals:
- Bump Isaac Lab version reference 2.3.2 -> 3.0.0-beta1 to match PR-1
- Update OSMO submission section to recommend the Python launcher
  (osmo/run_osmo.py) introduced by PR-3, with osmo/workflows/ paths
- Replace internal wandb artifact example
  (nvidia-isaac/afm_train/model-u5f67ich:v3) with a generic placeholder
- Add SPDX-Apache-2.0 headers to both Python scripts
- Skip .claude/settings.local.json (user-specific local config)

The skill itself is unchanged in structure; only the references to the
public OSMO launcher and the Isaac Lab pin were updated. The Conda-based
execution wrapper (conda run -n <ENV_NAME> ...) stays as the documented
host-side pattern; the Docker dev environment from item #7 (when
implemented) becomes a parallel option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Replace the manual GUI-based occupancy-map authoring workflow with a
one-shot CLI that wraps Isaac Sim's isaacsim.asset.gen.omap APIs and
emits ROS-style PNG + YAML at <usd_dir>/omap/.

New: scripts/generate_omap_from_usd.py
- Wraps isaacsim.asset.gen.omap.bindings._omap.Generator with a small
  argparse front-end (--out-dir, --cell-size, --z-min/--z-max, --bounds,
  --padding, --map-name)
- Boots a headless SimulationApp, enables the omap extension, opens
  the input USD, ensures a UsdPhysics.Scene, computes world bounds
  from the stage bbox (with default 1m padding), generates the 2D
  occupancy buffer, writes RGBA PNG + ROS-compatible YAML
- Output convention matches the existing bundled maps (top-left
  origin: YAML.origin = (xmin, ymax), image row 0 = ymax)
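
As an illustration of the top-left convention, a world point can be mapped to an image cell as sketched below. The function name is hypothetical, and `resolution` is assumed to be the cell size in metres per pixel as in ROS map YAML files; the loader's actual code is not shown in this PR.

```python
import math


def world_to_pixel(x, y, origin, resolution):
    """Map world (x, y) to (row, col) for a top-left-origin occupancy image.

    origin is (xmin, ymax): image row 0 corresponds to y = ymax and
    column 0 corresponds to x = xmin.
    """
    xmin, ymax = origin
    col = math.floor((x - xmin) / resolution)
    row = math.floor((ymax - y) / resolution)  # rows grow as y decreases
    return row, col


# Example map whose top-left corner sits at (-2.0, 3.0) with 0.25 m cells.
top_left = world_to_pixel(-2.0, 3.0, origin=(-2.0, 3.0), resolution=0.25)
center = world_to_pixel(0.0, 0.0, origin=(-2.0, 3.0), resolution=0.25)
```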

Loader: compass/rl_env/exts/mobility_es/mobility_es/utils/occupancy_map.py
- New auto-discovery fallback: when OMAP_PATHS has no entry for the
  scene, look at <dirname(usd_path)>/omap/occupancy_map.yaml (where
  the generator drops files by default). Skipped for scenes using
  MultiUsdFileCfg (no single USD path).
- New scenes can now skip the OMAP_PATHS registration entirely.
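
The fallback order can be sketched as follows. `resolve_omap_yaml` and its arguments are hypothetical names; only the lookup order and the `<dirname(usd_path)>/omap/occupancy_map.yaml` location come from the description above.

```python
import os


def resolve_omap_yaml(scene_name, usd_path, omap_paths):
    """Return the occupancy-map YAML path for a scene, or None if not found.

    omap_paths stands in for the explicit OMAP_PATHS registry; usd_path is
    None for scenes without a single USD file (e.g. multi-USD scenes).
    """
    # 1. An explicit registration always wins.
    if scene_name in omap_paths:
        return omap_paths[scene_name]
    # 2. No single USD path means there is nothing to auto-discover.
    if usd_path is None:
        return None
    # 3. Look where the generator writes its output by default.
    candidate = os.path.join(os.path.dirname(usd_path), "omap", "occupancy_map.yaml")
    return candidate if os.path.isfile(candidate) else None
```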

Docs: compass/rl_env/README.md
- "Add Occupancy Map" step now points at the script as the primary
  flow; the manual Isaac Sim UI flow stays as a fallback.

Verification on the office scene:
- Generator produced 245x242 RGBA PNG + YAML under /tmp/omap_test/office/
  (vs the existing 205x202; difference is the default 1m padding).
- Ran a 300-sample collision-check (loader-side mirror in scratch
  /tmp/verify_omap.py): 200/300 came back free, 100/300 collision.
  Visual annotation confirms every free sample lands on an unoccupied
  (white) cell — the script's output is semantically correct under
  OccupancyMapCollisionChecker.

Two larger scenes were attempted but blocked by Isaac Sim 3.0-beta1
flakiness unrelated to this script:
- combined_single_rack (3.1 GB, deep refs): open_stage hung; will
  need more update ticks or a heavier stage-loading wait. Out of
  scope for this PR.
- sample_small_footprint (428 MB): repeatedly crashed kit at 2s
  with std::out_of_range in omni.graph.core during init, even
  with a clean container/GPU. Environmental.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Cut COMPASS first-run UX from "30-60 min, 6 manual steps" to "3 commands,
~3 min after the image build" and make the steady-state dev loop feel like
a Python venv: host-side editor, host-side shell, but every python / pip /
tensorboard invocation transparently routed through the container.

Three-layer dev model:
    Host shell → shim PATH → docker exec → daemon container
    (editor,     (set by      (translates    (compass-rl image,
     terminal,    `source       host CWD     repo bind-mounted
     prompt)      docker/      to container   at /workspace/COMPASS)
                  activate`)    path)

Quick start:
    export HF_TOKEN=hf_xxx
    ./docker/run.sh assets        # USDs + X-Mobility ckpt → ./assets/
    ./docker/run.sh build         # build the dev image
    source ./docker/activate      # venv-like activation
    python run.py -c configs/train_config.gin -o /tmp/out \
        -b ./assets/x_mobility.ckpt --enable_camera

Elegance: a single bind-mount ($(pwd) → /workspace/COMPASS) covers the
entire dev experience. The only extras are X11 socket forwarding (for
--viz kit) and a writable kit shader cache. Compare with the
robotic_grounding/workflow/run.sh reference (~10 individual mounts because
it isn't structured around a single repo root).

New files:
- docker/run.sh — subcommand wrapper: build / assets / up / down /
  exec / shell / status. Single source of truth for mount + env args
  via _compass_run_args(). Container name hashes the absolute repo path
  (compass-<user>-<sha1[1:8]>) so multiple checkouts coexist.
- docker/activate — sourceable; brings up the container if needed,
  generates a tmp dir of shim scripts (python, pip, isaaclab.sh,
  tensorboard, pytest, yapf, pylint, pre-commit) on PATH. Each shim
  docker-exec's into the container with the host CWD translated to the
  container path. Defines deactivate() to revert PATH/PS1 and clean up.
- docker/prepare_assets.sh — HF downloader for compass_usds.zip +
  x_mobility-nav2-semantic_action_path.ckpt → ./assets/. Cache-aware,
  no-op on second run.
- docker/README.md — subcommand reference, multi-checkout / multi-GPU
  notes, git workflow notes, troubleshooting.
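
The CWD translation each shim performs can be illustrated in Python, although the shims themselves are shell scripts. The function name and the fallback to the repo root for directories outside the checkout are assumptions; only the /workspace/COMPASS mount point comes from this commit.

```python
from pathlib import PurePosixPath

CONTAINER_ROOT = PurePosixPath("/workspace/COMPASS")


def to_container_path(host_cwd, host_repo_root):
    """Translate a host working directory into its bind-mounted container path."""
    try:
        rel = PurePosixPath(host_cwd).relative_to(PurePosixPath(host_repo_root))
    except ValueError:
        # Outside the checkout: assumed fallback to the container repo root.
        return str(CONTAINER_ROOT)
    return str(CONTAINER_ROOT / rel)
```

Because the whole repo is a single bind-mount, one prefix swap is enough; no per-directory mapping table is needed.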

Modified:
- docker/Dockerfile.rl — install COMPASS at /workspace/COMPASS (so
  /workspace/isaaclab from the base image survives the bind-mount); add
  /usr/local/bin/python wrapper that exec's Isaac Sim's bundled python.sh
  directly (so `python run.py` inside the container does not need
  ${ISAACLAB_PATH}/isaaclab.sh -p boilerplate).
- README.md — Quick Start now leads with the Docker path; bare-metal
  install moved under "Manual install".
- .dockerignore — exclude ./assets, ./.cache, ./.git, ./.nv,
  ./.nvidia-omniverse from the build context.
- .gitignore — exclude /assets/, /.cache/, /.nv/, /.nvidia-omniverse/.

Verification on this checkout:
- `./docker/run.sh build && ./docker/run.sh up` brings up the container.
- From an activated shell:
  * `python --version` → Python 3.12.12 (Isaac Sim's bundled python).
  * `python -c "import mobility_es; print(mobility_es.__file__)"` resolves
    to /workspace/COMPASS/compass/rl_env/exts/mobility_es/... (the
    bind-mount path → host edits hot-reload via the editable install).
  * `python -c "import isaaclab; print(isaaclab.__file__)"` resolves to
    /workspace/isaaclab/source/... (base image, untouched by bind-mount).
  * `python -c "import torch; print(torch.cuda.is_available())"` → True.
  * `cd compass/rl_env/exts/mobility_es && python -c "import os; print(os.getcwd())"`
    prints /workspace/COMPASS/compass/rl_env/exts/mobility_es (CWD
    translation working under repo subdirs).
- End-to-end smoke: `python run.py -c configs/train_config.gin -o /tmp/out
  -b ./assets/x_mobility.ckpt --num_envs 1 --enable_camera --headless`
  reaches scene-creation and simulation-start (162s, then PPO setup) via
  the activate shim. Container exits cleanly on `./docker/run.sh down`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Replace the hand-served `gh_page` branch with a two-tier publication that
auto-deploys via GitHub Actions on every push to main:

- nvlabs.github.io/COMPASS/        - academic landing (Bulma static site,
                                      migrated from gh_page into
                                      docs/project_page/)
- nvlabs.github.io/COMPASS/docs/   - Sphinx handbook with the NVIDIA
                                      theme (matches agentic_model_training,
                                      RAPIDS, Isaac Lab look)

Stack:
- Sphinx 7.x + myst-parser (markdown stays markdown, no .rst)
- nvidia-sphinx-theme (NVIDIA OSS house style)
- sphinx_design (grid cards on the landing), sphinx_copybutton,
  sphinxcontrib-mermaid

Layout under docs/:
  project_page/   academic landing (Bulma; mp4/png LFS-tracked, rest as
                  Git blobs; index.html grew a "Docs" CTA next to
                  Code/arXiv pointing at ./docs/, plus copy fixes:
                  "Nvidia,"->"NVIDIA,", "can achieves"->"can achieve",
                  "GROOT"->"GR00T", missing googletagmanager loader
                  added so analytics actually fires)
  handbook/       Sphinx project (conf.py, Makefile, requirements.txt,
                  _static/custom.css for minor brand polish, plus the
                  markdown sources)
  README.md       contributor guide (build + serve recipes, deploy
                  description, editing tips)

Handbook nav (4 captioned toctrees, all in docs/handbook/index.md):
  Installation:  Quick start / Docker-as-venv install / Agentic skills /
                 Adding a new embodiment or scene
  Workflows:     Training / Recording / Distillation / Export /
                 GR00T post-training (VLA fine-tuning)
  ROS2 Deployment: Overview / Isaac Sim setup
  Reference:     OSMO cloud submission / Auto OMap from USDs / Contributing

Each handbook page owns its content directly — no `{include}` of source
READMEs. As part of this change the five top-level READMEs that the
handbook used to transclude have been deleted and their content folded
into the corresponding handbook pages:
- compass/rl_env/README.md         -> docs/handbook/extending.md
- docker/README.md                 -> docs/handbook/installation/docker.md
- osmo/README.md                   -> docs/handbook/osmo.md
- ros2_deployment/README.md        -> docs/handbook/deployment/ros2.md
- ros2_deployment/ISAACSIM_README.md -> docs/handbook/deployment/isaac_sim.md

Three READMEs are kept (per "outside docs/"):
- README.md (root) - slimmed to overview + 5-line quick start + pointer
                    to the handbook for everything else (was ~320 lines,
                    now ~85)
- docs/README.md - contributor docs setup notes (build/serve recipes)
- docs/project_page/README.md - academic page archive

CI workflow at .github/workflows/docs.yml:
- python -m pip install -r docs/handbook/requirements.txt
- make -C docs/handbook html  (sphinx-build -W; warnings -> errors,
  including broken internal links - no `myst.xref_missing` /
  `image.not_readable` suppressions)
- copy docs/project_page/* into _site/, docs/handbook/_build/html into
  _site/docs/, deploy via actions/deploy-pages@v4 (no intermediate
  gh-pages branch).
- LFS pull enabled so academic-landing mp4/png assets reach the deploy.

One-time owner action (called out in the PR description, not done by
this commit): Settings -> Pages -> Source: "GitHub Actions"  (replaces
"Deploy from a branch: gh_page"). The gh_page branch becomes a frozen
archive — not deleted, not rebuilt.

Cross-references repointed at handbook URLs (Sphinx outputs .html files,
not directory-style URLs):
- README.md (root) - 4 deleted-README links -> handbook .html URLs
- CLAUDE.md - 2 deleted-README references repointed
- osmo/run_osmo.py docstring - "See osmo/README.md" -> handbook URL
- .claude/skills/compass/SKILL.md - OSMO link -> handbook URL
- release_tracker.md pending items - retargeted at handbook URLs;
  already-completed checkboxes left as historical record

Camera flag: standardize on `--enable_cameras` (plural) everywhere.
Verified the canonical name in
/workspace/isaaclab/source/isaaclab/isaaclab/app/app_launcher.py:276
and that record.py:66 already uses the plural. The singular form had
crept into the root README quick-start, quickstart.md, training.md (x2),
distillation.md, and gr00t_finetuning.md; all are fixed.

Side fix in source: drop the trailing `---` from
ros2_deployment/README.md before deletion (it was confusing docutils
when transcluded — moot now that the README is gone).

.gitignore additions:
- /assets/  (Docker-asset downloads from PR-7)
- /.cache/  (pip / pre-commit / huggingface caches)
- /.nv/, /.nvidia-omniverse/  (Isaac Sim runtime cruft)
- /docs/handbook/_build/  (Sphinx local-build output)

Local verification:
  python3 -m venv /tmp/sphinx_venv
  /tmp/sphinx_venv/bin/pip install -r docs/handbook/requirements.txt
  make -C docs/handbook clean html
  -> "build succeeded" with -W enabled (zero warnings).
  -> 16 handbook pages present.
  -> Combined preview at /tmp/_site/ via python -m http.server hits
     both / (academic) and /docs/ (handbook) cleanly.
  -> Regression test: a deliberate `[bogus](does_not_exist.md)` link in
     a page now fails -W as expected, confirming the strict posture is
     load-bearing.

release_tracker.md item #6 status flipped 🟢 with the full checklist
ticked (Pages settings switch + first deploy still pending — owner
action).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

…rnal defaults

Pre-release leak audit found three categories of internal-only references
in the OSMO entry path. Fixed in this commit:

1. USDs from internal OSMO dataset -> HuggingFace
   The three RL workflows (train / eval / record) used to mount an
   internal OSMO dataset `groot_mobility_rl_es_usds` and `cp` USDs out
   of `{{input:0}}/...`. Public users have no access to that dataset.
   Replaced the `cp` step with `huggingface_cli download nvidia/COMPASS
   compass_usds.zip --repo-type dataset`, then unzip into
   compass/rl_env/exts/mobility_es/mobility_es/. Mirrors the host-side
   docker/prepare_assets.sh recipe. The `inputs:` block (and its
   trailing `name: groot_mobility_rl_es_usds` line) is gone.

2. X-Mobility ckpt from internal wandb artifact -> HuggingFace
   Same three workflows used `wandb artifact get
   {{base_policy_ckpt_artifact}} --root ./[base_policy/]` to pull the
   public X-Mobility ckpt from a private wandb mirror. Replaced with
   `huggingface_cli download nvidia/X-Mobility
   x_mobility-nav2-semantic_action_path.ckpt`, then rename to
   `model.ckpt` so downstream `-b ./model.ckpt` / `-b
   ./base_policy/model.ckpt` invocations stay byte-identical. The
   `base_policy_ckpt_artifact` template var is gone from defaults.
   `osmo/run_osmo.py` no longer accepts `--base-policy-ckpt`.

3. Internal wandb-project defaults
   `osmo/run_osmo.py` previously baked in `compass_rl_enhance` /
   `afm_train` as `nvidia-isaac`-entity defaults; "afm" is internal
   branding. Dropped the DEFAULT_WANDB_PROJECT dict entirely; made
   `--wandb-project` `required=True` on every subparser that hits a
   wandb-enabled workflow.
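
Dropping a baked-in default in favour of `required=True` makes argparse fail fast. A small sketch with a hypothetical parser, mirroring only the flag names mentioned in this commit:

```python
import argparse


def build_train_parser():
    parser = argparse.ArgumentParser(prog="run_osmo_sketch train")
    parser.add_argument("--experiment-name")
    parser.add_argument("--image")
    # No default: the user must name their own W&B project.
    parser.add_argument("--wandb-project", required=True)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def parses_ok(argv):
    """Return True if argv parses, False if argparse exits with an error."""
    try:
        build_train_parser().parse_args(argv)
        return True
    except SystemExit:
        return False
```

With this shape, omitting --wandb-project stops the run at argument parsing, before any image build or submission is attempted.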

Also:
- Updated docs/handbook/osmo.md: prerequisites no longer mention the
  OSMO dataset upload, the wandb base-policy mirror, or the
  --base-policy-ckpt flag. Quick-start example uses --wandb-project
  with a generic value. Subcommand stanzas updated. Troubleshooting
  bullet on dataset-not-found now explicitly distillation-only;
  added a new bullet for HF download failures.
- release_tracker.md: new workstream #8 (Pre-release leak audit +
  sanitization) added to the summary table and as a section, with
  status 🟡 (most boxes ticked, OSMO smoke + maintainer review
  pending).
- Platform names (`ovx-l40` for RL, `dgx-h100` for distillation) kept
  as defaults per user direction — these are recommended pool names
  rather than internal-only references.

Verification:
- Handbook still passes -W: `make -C docs/handbook clean html`.
- grep -rnE 'groot_mobility_rl_es_usds|nvidia-isaac/|afm_train'
  excluding _build/.git/release_tracker.md/dev_env_plan.md returns
  no live-source hits.
- `python osmo/run_osmo.py train --help` shows --wandb-project
  as required, no --base-policy-ckpt flag.
- `python osmo/run_osmo.py train --experiment-name probe --image foo
  --dry-run` errors with "the following arguments are required:
  --wandb-project", as expected.

Out of scope here:
- ros2_deployment/compass_navigator/setup.py maintainer attribution
  (flagged in tracker; defer to user).
- release_tracker.md and dev_env_plan.md gitlab-master references
  (handled at ship time per the existing CHANGELOG distillation gate).
- A live OSMO smoke test against the new workflows (image rebuild +
  resubmit pending).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Two release-readiness changes:

1. .github/workflows/pre-commit.yml — runs the existing pre-commit hooks
   (yapf, pylint, nbstripout, clang-format, end-of-file-fixer, trailing-
   whitespace, requirements-txt-fixer, check-added-large-files) on every
   PR + push to main. Uses pre-commit/action@<version> for hook-env
   caching. Python 3.11 (yapf v0.31.0 still depends on lib2to3, removed
   in Python 3.13). pylint runs without project deps because .pylintrc
   has `import-error` disabled.

2. requirements.txt — replace 17 unpinned lines with `==<version>` pins
   based on the package set verified inside compass-rl:latest (the
   image used for the just-passed OSMO smoke):
     einops==0.8.2, gin-config==0.5.0, h5py==3.16.0, matplotlib==3.10.8,
     moviepy==2.2.1, msgpack==1.1.2, numpy==1.26.4, onnx==1.20.1,
     onnxruntime-gpu==1.25.1, pandas==3.0.1, pytorch-lightning==2.6.1,
     timm==1.0.26, torcheval==0.0.7, transformers==4.57.6, wandb==0.25.1,
     wheel==0.47.0, zmq==0.0.0
   diffusers==0.29.2 was already pinned. zmq==0.0.0 stays as the stub
   it currently is (functional swap to pyzmq is out of scope).
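
A pin check along these lines could confirm that every requirement carries an exact version. The helper is a hypothetical illustration, not part of the repo or of the pre-commit hooks:

```python
def unpinned_lines(requirements_text):
    """Return requirement lines that lack an exact `==` pin."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            unpinned.append(line)
    return unpinned


sample = "einops==0.8.2\ngin-config==0.5.0\nnumpy\n\n"
missing = unpinned_lines(sample)
```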

Surfaced legacy violations on the first `pre-commit run --all-files` —
fixed in the same commit so the workflow lands green:

- Trailing whitespace + missing EOF newline cleanup across 22 files
  (mostly empty __init__.py files, project-page CSS/JS, mobility_es
  pyproject.toml).
- yapf reformatted argparse blocks in osmo/run_osmo.py to its 2-space
  multi-line style.
- requirements-txt-fixer alphabetized docs/handbook/requirements.txt.
- scripts/generate_omap_from_usd.py: drop unused `import os`; add a
  scoped `pylint: disable=ungrouped-imports` on the
  `isaacsim.asset.gen.omap.bindings` import (intentionally late so the
  extension can be enabled before its bindings load).

Verification:
- `pre-commit run --all-files` is green locally (Python 3.10 venv, same
  hook set as the workflow).
- `requirements.txt` parses; the only line without `==` is the
  (intentional) blank EOF.
- Workflow YAML is syntactically valid; pre-commit/action@<version> is
  the canonical action.

release_tracker.md: new workstream #9 (CI/CD setup + dep pinning) added
to the summary table and as a section, status 🟡 (the file changes are
done; the live first-CI-run is the only remaining checkbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

…nt skills

Hybrid front-door pattern: /compass stays as the umbrella (training,
SAGE workflows, eval, OSMO), three new specialty siblings handle the
high-friction onboarding moments.

- /compass: drop conda + ISAACLAB_PATH wrappers throughout, replace
  Setup with the 3-command docker-as-venv flow, add Specialty skills
  routing section, soften MUST-style rules with the why. OSMO example
  aligned with #8 sanitization (no --base-policy-ckpt; HF download
  inside the workflow).
- /compass-deploy: ckpt -> ONNX (-r/-g+-e branch) -> TRT engine -> ROS2
  launch scaffold. Skill prints the launch command but doesn't run it.
- /compass-debug: 8-check diagnostic with bundled scripts/compass_status.sh
  (parallel checks; --deep adds Isaac Sim init; --ckpt loads via torch).
  Reports root cause and routes to the right specialty for the fix.
- /compass-newembodiment: interactive robot onboarding. Parses
  pre-supplied input; AskUserQuestion only for missing fields; shows
  diff before writing; smoke-tests with --num_envs 1.
- Progressive-disclosure split: Setup SAGE Local extracted to
  compass/references/setup-sage-local.md.
- docs/handbook/agentic.md: opens with a "Pick the right skill" matrix
  and per-skill sections.
- release_tracker.md: flipped #4's migration boxes ✅; added #10 row
  and section for this work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

The eval subcommand of osmo/run_osmo.py already exposes --embodiment
and --environment overrides; train didn't, so multi-embodiment training
sweeps had to either hand-edit the workflow YAML or run with the
gin-config default (g1) for every job.

- osmo/run_osmo.py: add --embodiment / --environment flags to the
  train subparser; thread them into cmd_train's set_args dict.
- osmo/workflows/rl_es_train_workflow.yaml: accept embodiment /
  environment template vars (default empty), conditionally append
  them to TRAIN_CMD and to the post-training EVAL_CMD that runs in
  the same workflow. Pattern matches what rl_es_eval_workflow.yaml
  already does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

The internal repo's release/internal.yaml strips a 113-line benchmark.py
that hardcodes nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance and
afm_rl_enhance defaults. The public side currently has no benchmark
runner, and the No-regression benchmark gate (P0 release blocker) has no
concrete tooling. Track the sanitization-and-land work as a sub-bullet
under that gate so it has an owner-able shape; suggested landing path
osmo/run_benchmark.py mirrors osmo/run_osmo.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Adds 8-GPU distributed residual-RL training (torchrun + manual
all-reduce), per-stage timing instrumentation, supporting OSMO
workflow, and a perf-analysis report.

Multi-GPU (run.py, compass/residual_rl/ppo.py, residual_ppo_trainer.py)
- --distributed flag, dist.init_process_group(nccl), per-rank device
  (cuda:local_rank), per-rank env seed offset, env_cfg.sim.device
  pinned per rank, torch.cuda.set_device(local_rank) before init so
  NCCL object collectives route through per-rank GPUs, conditional
  nn.DataParallel(device_ids=[local_rank]), rank-0-only Logger /
  RecordVideo (no-op logger on other ranks).
- ResidualPPOTrainer: world_size / global_rank state, initial param
  broadcast from rank 0, rank-0 gating on _save_ckpt /
  _save_episode_logs / _upload_video, weighted per-rank episode-log
  aggregation via dist.all_gather_object on CPU-converted dicts,
  torch.load(map_location=self.device) on resume,
  os.makedirs(exist_ok=True).
- PPO.update: manual gradient all-reduce (AVG) BEFORE clip
  (canonical DDP order — clipping the averaged gradient, not the
  per-rank pre-avg one), kl_mean all-reduce before LR adaptation so
  all ranks pick the same LR, metric all-reduce on (value_loss,
  surrogate_loss, entropy). Returns a diagnostics dict so the trainer
  logs ppo/learning_rate, ppo/kl_mean, ppo/entropy,
  ppo/action_std_mean each iter.

OSMO 8-GPU workflow (osmo/workflows/rl_es_train_8gpu_workflow.yaml,
osmo/run_osmo.py)
- 8-GPU / 80-CPU / 800-GiB resource block on ovx-l40; train phase
  via torch.distributed.run --nproc_per_node=8 run.py --distributed
  with --num_envs 32 per rank (256 total envs); eval phase as a
  single-process call.
- run_osmo.py: --num-gpus flag (choices 2, 8) routes to the matching
  workflow YAML.

Per-stage timing instrumentation (always on, ~2-3% overhead)
- ResidualPPOTrainer.learn(): CUDA-synced _timer context manager
  around each top-level stage (env_reset_and_init, rollout,
  compute_returns, update, logging, checkpoint) plus rollout
  sub-steps (act, env_step, process_env_step, base_policy_process).
- _install_env_timers monkey-patches Isaac Lab managers + sim.step /
  sim.render + per-ObsTerm .func so logs include
  time/rollout/env_step/{obs, sim_step, ...} and
  time/rollout/env_step/obs_term/<group>/<name>. Once-per-iter
  log_dict via the existing logger.
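
A synced stage timer of the kind described could be sketched like this. It is an illustration rather than the trainer's `_timer`: the name `stage_timer` and the `sync` parameter are assumptions, and `sync` would be something like `torch.cuda.synchronize` on a GPU run so that queued kernels are counted in the stage that launched them.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, timings, sync=None):
    """Accumulate wall-clock seconds for a stage into timings[name]."""
    if sync is not None:
        sync()  # flush pending async work before starting the clock
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for this stage's async work before stopping
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


timings = {}
with stage_timer("time/rollout", timings):
    total = sum(range(1000))  # stand-in for the real stage body
```

The two synchronisation points per stage are the source of the overhead; without them, asynchronous GPU work would be attributed to whichever later stage happens to block first.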

Perf-analysis report (docs/PERF_ANALYSIS.{md,pdf})
- Methodology, baseline 32-env 256-step iter breakdown, A/B
  experiments (depth-drop, DLAA + denoiser, BF16 autocast,
  per-stage env_step breakdown, per-ObsTerm breakdown of
  obs.compute), ranked recommendations.

Signed-off-by: Wei Liu <liuw@nvidia.com>

Sanitizes the 113-line benchmark.py from the internal repo and lands it as
osmo/run_benchmark.py next to run_osmo.py. Closes the "sanitize and land
internal benchmark.py" subtask under the No-regression benchmark gate.

Behavior: fires one rl_es_eval_workflow.yaml submission per --environments
entry, all using the same --embodiment. Each run writes the usual eval/*
metrics (goal_reached_rate / fall_down_rate / total_travel_time /
weighted_travel_time) to W&B at bm_<embodiment>_<env>_<experiment>; user
pulls those out for regression assessment.
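
The one-submission-per-environment fan-out and the W&B run naming can be sketched as below. The helper name is hypothetical, and the experiment name is a placeholder; the scene list and the h1 embodiment are the defaults stated in this commit.

```python
DEFAULT_ENVIRONMENTS = (
    "simple_office",
    "warehouse_single_rack",
    "warehouse_multi_rack",
    "combined_single_rack",
    "combined_multi_rack",
)


def benchmark_run_names(embodiment, experiment, environments=DEFAULT_ENVIRONMENTS):
    """One W&B run name per environment, all sharing the same embodiment."""
    return [f"bm_{embodiment}_{env}_{experiment}" for env in environments]


names = benchmark_run_names("h1", "example_experiment")
```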

Sanitization changes vs internal:
- Apache-2.0 SPDX header replacing NVIDIA proprietary copyright.
- Hardcoded registry nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance
  -> --registry-prefix flag with $COMPASS_OSMO_REGISTRY fallback (mirrors
  the run_osmo.py:80-83 pattern; errors fast if unset and --image-name
  not given).
- --wandb-project-name: drop afm_rl_enhance_benchmark default, mark
  required=True (mirrors run_osmo.py:93-95).
- Workflow path: ./workflows/rl_es_eval_workflow.yaml ->
  Path(__file__).resolve().parent / "workflows" / ... so the script works
  from any CWD and lives under osmo/.
- Adds --dry-run and --prompt for parity with run_osmo.py.

Reuses existing osmo/workflows/rl_es_eval_workflow.yaml unchanged. Default
sweep stays at 5 scenes (simple_office, warehouse_single_rack,
warehouse_multi_rack, combined_single_rack, combined_multi_rack), default
--embodiment stays h1; per user direction the matrix lives in the script
rather than a separate YAML.

docs/handbook/osmo.md: adds a Benchmark section + cross-reference under
the subcommand table.

release_tracker.md: ticks the sanitize-and-land subtask; widens the
section-8 grep gate to include groot_mobility_rl_enhance and
afm_rl_enhance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Closes the §8 sanitization tail in release_tracker.md and pastes the
iter-500 multi-GPU benchmark numbers into the No-regression gate.

Argparse defaults — drop the internal `afm_rl_enhance*` wandb-project
defaults and mark --wandb-project-name required in all three top-level
entry points. Mirrors the run_osmo.py:93-95 pattern we already applied
to osmo/run_benchmark.py:

- run.py:58 — was 'afm_rl_enhance'
- record.py:135 — was 'afm_rl_enhance_record'
- distillation_train.py:54 — was 'afm_rl_enhance_distillation'

After the fix, the extended §8 grep gate
  grep -rnE "nvidia-isaac/|afm_train|groot_mobility_rl_enhance|afm_rl_enhance" .
returns zero live-source hits.
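A Python equivalent of the gate, useful where grep is unavailable or the check should run as a test, might look like this sketch. The regular expression is the one from the command above; the `skip` set, and therefore what counts as "live source", is an assumption of this sketch.

```python
import re
from pathlib import Path

# Pattern copied from the extended §8 gate.
GATE = re.compile(r"nvidia-isaac/|afm_train|groot_mobility_rl_enhance|afm_rl_enhance")


def grep_gate(root: Path, skip=(".git",)) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every gate match under root."""
    hits = []
    for path in sorted(root.rglob("*")):
        rel_parts = path.relative_to(root).parts
        # Skip directories and anything under (or named as) a skipped entry.
        if not path.is_file() or any(part in skip for part in rel_parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file; this sketch ignores it
        for lineno, line in enumerate(text.splitlines(), start=1):
            if GATE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty return value corresponds to the "zero hits" outcome; files that legitimately mention the patterns (the tracker itself, for instance) would need to be added to `skip`.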

release_tracker.md hygiene:
- Title and version refs: "COMPASS 2.0" -> "COMPASS 1.6", target version
  2.0.0 -> 1.6.0, integration branch updated to liuw/benchmark_port.
- §2&3 NuRec PR-2 (Buckets B+E+H) marked deferred to post-1.6: only
  Bucket A (22b25ef) ships in 1.6.
- §11 multi-GPU PPO release-scope decision marked settled (ships in 1.6
  by squash-strategy default).
- §8 grep gate tightened: dropped `groot_mobility_rl_es_usds` from the
  pattern — it's the directory name inside the public HF compass_usds.zip
  and is correctly referenced by osmo/workflows/*.yaml after unzipping.
  Gate now returns zero live-source hits.
- §Pre-release gates → No-regression benchmark: pasted 4-embodiment ×
  5-scene iter-500 multi-GPU results table (goal_reached / fall_down,
  per-embodiment averages, headlines). Eval ran with the relaxed-heading
  image compass_release_1_6_relaxed:c87052af (heading_threshold=π);
  source ships default 0.1 — release notes will document this.
- Marked "Define the regression matrix" complete (matrix is the script's
  --environments default; documented). v1.5.0 baseline capture deferred;
  the 1.6 numbers become the new published baseline.
- iter-1000 single-GPU baseline section stubbed for the in-flight cells.

ros2_deployment/compass_navigator/setup.py: removed a duplicated SPDX
Apache-2.0 license header (lines 16-28 were a verbatim copy of 1-14).
Maintainer fields kept as-is for 1.6 per user direction; team-alias swap
is a post-tag follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

yapf v0.31.0 imports lib2to3, which was removed from the standard
library in Python 3.13 (and broken on some 3.12 installations). The
pre-commit yapf hook fails to install on py3.12 with
`ModuleNotFoundError: No module named 'lib2to3'`, which blocks
`pre-commit run --all-files` at tag time and any local pre-commit run.

yapf v0.40.0+ bundles its own lib2to3 fork and no longer depends on the
stdlib copy. Bumping to v0.43.0 (latest stable as of this commit) fixes
the install on py3.12 with no formatting policy change.
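To check whether a given interpreter is affected, something like the following sketch can help. Both helper names are hypothetical, and the 0.40.0 cutoff is taken from the description above rather than from inspecting yapf itself.

```python
import importlib.util


def stdlib_lib2to3_available() -> bool:
    """True when the running interpreter can still locate a lib2to3 package."""
    return importlib.util.find_spec("lib2to3") is not None


def yapf_needs_stdlib_lib2to3(version: str) -> bool:
    """Per the commit text, yapf releases before 0.40.0 import the stdlib copy."""
    parts = tuple(int(p) for p in version.lstrip("v").split(".")[:3])
    return parts < (0, 40, 0)


if __name__ == "__main__":
    for rev in ("v0.31.0", "v0.43.0"):
        at_risk = yapf_needs_stdlib_lib2to3(rev) and not stdlib_lib2to3_available()
        print(f"yapf {rev}: {'will fail to import' if at_risk else 'ok'}")
```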

Side effect: yapf added one PEP8 blank line in run.py between
class `_NoOpLogger` and the top-level `EmbodimentEnvCfgMap`. Pure
formatting, no behavior change.

Verification: `pre-commit clean && pre-commit run yapf --files run.py
record.py distillation_train.py ros2_deployment/compass_navigator/setup.py
osmo/run_benchmark.py` -> all Passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Distilled from release_tracker.md and the integration-branch commits.
Date kept as TBD; will be set to the tag-day date in the v1.6.0 commit.

The entry intentionally does NOT list full NuRec real2sim asset support
under "Added"; only Bucket A (Isaac Lab 3.0 API migration, commit
22b25ef) ships in 1.6. NuRec PR-2 (Buckets B+E+H) is deferred to a clean
post-1.6 PR off main.

Release notes (separate from CHANGELOG, drafted at tag time) will
additionally call out:
- Benchmark eval ran against compass_release_1_6_relaxed:c87052af, an
  image with heading_threshold flipped from 0.1 to math.pi in
  termination.py:33. Source ships the default 0.1 — the relaxation is
  image-only and is the known test configuration behind the published
  v1.6.0 benchmark numbers.
- Pre-existing argparse defaults `afm_rl_enhance*` in run.py / record.py
  / distillation_train.py were dropped; --wandb-project-name is now
  required.
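Why the relaxed image matters can be seen in a small sketch of a heading check. This is not the code in termination.py: the function names and the `<=` comparison are assumptions. The point is only that an absolute heading error wrapped into [0, π] can never exceed π, so a threshold of math.pi effectively disables the heading requirement.

```python
import math


def wrapped_heading_error(yaw: float, goal_yaw: float) -> float:
    """Absolute heading error in radians, wrapped into [0, pi]."""
    diff = (yaw - goal_yaw + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff)


def heading_ok(yaw: float, goal_yaw: float, heading_threshold: float = 0.1) -> bool:
    """Hypothetical check: heading counts as reached within the threshold."""
    return wrapped_heading_error(yaw, goal_yaw) <= heading_threshold


# Robot faces almost backwards relative to the goal heading.
strict = heading_ok(3.0, 0.0)                               # source default (0.1)
relaxed = heading_ok(3.0, 0.0, heading_threshold=math.pi)   # benchmark image
```

Under these assumptions `strict` is False and `relaxed` is True, which is why numbers measured with the relaxed image are not directly comparable to runs using the source default.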

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

The Sphinx config carries the only in-source COMPASS version string;
its `version` and `release` fields feed the handbook footer and the
title bar at nvlabs.github.io/COMPASS/docs/. The previous "2.0.0" was
left over from the early planning phase when this release was assumed
to be a major bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

All 20 cells of the iter-1000 single-GPU baseline sweep (run on pool
groot-l40-04 against the *_baseline wandb runs) completed. Pasted the
4×5 results table alongside the existing iter-500 multi-GPU table and
added a per-embodiment comparison of averages with deltas (Δ).

Key findings:
- carter (wheeled) improves on all axes at iter-1000 (+1.8 pp goal,
  -1.6 pp fall, faster wtt) — converges fast, benefits from more
  training.
- Bipeds (g1, h1) are roughly on par or regress slightly on average at
  iter-1000; the drift concentrates in `warehouse_multi_rack`
  (g1: -25.9 pp goal, +20.8 pp fall; h1: -20.8 pp goal). Other scenes
  are within seed-noise.
- spot is essentially flat.
- Validates the §11 multi-GPU PPO path in spirit: 8 GPUs × 500 iter ≈
  1 GPU × 1000 iter in samples-seen, and resulting policies are within
  seed-noise on most cells. The `warehouse_multi_rack` divergence for
  bipeds is worth flagging in release notes but does not block tag.
- `simple_office` remains the universal weakness across configs.
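The samples-seen comparison depends on per-GPU batch sizes that are not stated here. As a rough sketch of the bookkeeping: the two runs match only if the single-GPU run collects four times as many transitions per iteration as each GPU in the 8-GPU run does. The environment counts and rollout length below are hypothetical, chosen only to illustrate that condition.

```python
def samples_seen(num_gpus: int, envs_per_gpu: int, steps_per_iter: int,
                 iterations: int) -> int:
    """Total environment transitions collected over a training run."""
    return num_gpus * envs_per_gpu * steps_per_iter * iterations


# Hypothetical sizes: the single-GPU run uses 4x the environments per GPU.
multi_gpu = samples_seen(num_gpus=8, envs_per_gpu=16, steps_per_iter=256,
                         iterations=500)
single_gpu = samples_seen(num_gpus=1, envs_per_gpu=64, steps_per_iter=256,
                          iterations=1000)
print(multi_gpu, single_gpu, multi_gpu == single_gpu)
```

With equal per-GPU settings instead, the 8-GPU × 500-iteration run would see four times the samples of the 1-GPU × 1000-iteration run, so the actual configs are worth stating next to the table.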

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

Surfaces and fixes issues that the older yapf v0.31.0 hook missed:

- yapf v0.43.0 wants slightly different formatting for a few existing
  lines: line-break style in termination.py:43, dict comprehension in
  residual_ppo_trainer.py:497, blank lines between nested defs in
  residual_ppo_trainer.py (PEP8 nits), and a long print expression in
  sage10k_search.py. Pure formatting, no behavior change.

- pylint flagged two real issues in residual_ppo_trainer.py:

  1. C0301 line-too-long at line 328 — the tuple unpacking of
     `self.base_policy_process(...)`. Wrapped with parentheses and
     reformatted across three lines to fit the 100-char limit.

  2. W0212 protected-access on obs_mgr._group_obs_term_cfgs /
     _group_obs_term_names at lines 206-207. The access is intentional
     and load-bearing — we need to reach into the obs manager's
     internal term tables to wrap each obs term's `func` for the
     rollout/obs perf-measurement breakdown. Added a scoped
     `# pylint: disable=protected-access` comment with explanation.
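The pattern behind item 2 can be sketched as follows. This is an illustration, not the code in residual_ppo_trainer.py: the helper names are hypothetical, and the attribute layout (a dict from group name to a list of term names or term configs, each config exposing `func`) is an assumption. The stand-in manager at the bottom exists only so the sketch runs without Isaac Lab.

```python
import time
from collections import defaultdict
from functools import wraps
from types import SimpleNamespace


def _timed(func, key, timings):
    """Wrap func so its wall-clock time accumulates under timings[key]."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[key] += time.perf_counter() - start
    return wrapper


def wrap_obs_terms_for_timing(obs_mgr, timings):
    """Replace each observation term's `func` with a timed version."""
    # Intentional: the term tables are internal to the manager, so the
    # disable is scoped to just these two reads.
    # pylint: disable=protected-access
    term_names = obs_mgr._group_obs_term_names
    term_cfgs = obs_mgr._group_obs_term_cfgs
    # pylint: enable=protected-access
    for group, cfgs in term_cfgs.items():
        for name, cfg in zip(term_names[group], cfgs):
            cfg.func = _timed(cfg.func, f"{group}/{name}", timings)


# Stand-in objects (hypothetical) to exercise the wrapper.
term_cfg = SimpleNamespace(func=lambda env: 42)
fake_mgr = SimpleNamespace(_group_obs_term_names={"policy": ["dummy"]},
                           _group_obs_term_cfgs={"policy": [term_cfg]})
timings = defaultdict(float)
wrap_obs_terms_for_timing(fake_mgr, timings)
result = term_cfg.func(None)
```

Pairing the disable with an explicit enable keeps the suppression from leaking to the rest of the function, so any new protected access elsewhere is still reported.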

Verified `pre-commit run --all-files` passes clean on liuw/release_ckpt
post-fix (all hooks: nbstripout, large-files, EOF, requirements-txt,
trailing-whitespace, yapf, clang-format, pylint).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>

@nv-liuw closed this May 13, 2026