Liuw/release ckpt#20
Closed
nv-liuw wants to merge 19 commits into
Cherry-picked bucket-A changes from origin/samc/support_nurec_assets_isaaclab_3.0
covering only the API surface required for the existing mobility_es extension
to run on Isaac Lab 3.0. NuRec assets, occupancy-map precomputation,
multi-camera video recording, and debug-image logging are deferred to
follow-up PRs.
API migration (mobility_es):
- isaaclab.utils.noise: AdditiveUniformNoiseCfg -> UniformNoiseCfg
- Restructure physics config via isaaclab_physx.physics.PhysxCfg
- rerender_on_reset -> num_rerenders_on_reset
- ActionTerm import: isaaclab.envs.mdp.actions -> isaaclab.managers
- Wrap asset.data.{root_pos_w, root_quat_w, root_lin_vel_w, root_ang_vel_w,
default_root_pose, default_root_vel} with wp.to_torch()
- Replace write_root_state_to_sim() with write_root_pose_to_sim_index() +
write_root_velocity_to_sim_index()
- Flip quaternion convention wxyz -> xyzw across EnvSceneAssetCfg rotations
and NonHolonomicPerfectControlAction yaw extraction
- Rename velocity_limit -> velocity_limit_sim on carter caster joints
- Use commands.UniformPose2dCommand directly (module reorg)
- Add pyproject.toml to mobility_es extension
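The quaternion flip listed above is easy to get subtly wrong, so here is a minimal sketch of the reordering and of a yaw extraction that reads the new layout. The helper names `wxyz_to_xyzw` and `yaw_from_xyzw` are hypothetical and are not taken from mobility_es; the real extension applies the convention inside EnvSceneAssetCfg and NonHolonomicPerfectControlAction.

```python
import math


def wxyz_to_xyzw(q):
    """Reorder a quaternion from (w, x, y, z) to (x, y, z, w)."""
    w, x, y, z = q
    return (x, y, z, w)


def yaw_from_xyzw(q):
    """Return yaw (rotation about +Z, in radians) from an (x, y, z, w) quaternion."""
    x, y, z, w = q
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


# A 90-degree rotation about Z, written in the old wxyz layout.
old_rotation = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
new_rotation = wxyz_to_xyzw(old_rotation)
yaw = yaw_from_xyzw(new_rotation)
```

Feeding a wxyz tuple into a reader that expects xyzw silently yields a wrong yaw rather than an error, which is why both the asset rotations and the yaw extraction have to move together.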
Pins / docs:
- Bump IsaacLab badge to v3.0.0-beta1 in README.md
- Bump install instructions in compass/rl_env/README.md to v3.0.0-beta1
- Add CLAUDE.md (project guide for Claude Code agents)
- Add release_tracker.md (next-release work tracker)
Dockerfile.rl:
- Bump base image to nvcr.io/nvidia/isaac-lab:3.0.0-beta1
- Install requirements.txt, the X-Mobility wheel, and the mobility_es
editable extension at build time so the resulting image can run
`python run.py` end-to-end.
Smoke-tested end-to-end on a fresh Isaac Lab 3.0-beta1 image with
num_envs=1: Isaac Sim launches, mobility_es env constructs, USDs load
with the new xyzw quaternion convention, ResidualPPOTrainer enters its
first rollout at ~7 it/s for 256 steps. Verified both --headless and
--viz kit modes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Migrate the OSMO training/eval/record/distillation workflow YAMLs and
their submitter from the internal repo (gitlab-master.nvidia.com:12051/ml_nav/compass)
to the public NVlabs/COMPASS layout.
New layout:
- osmo/workflows/{rl_es_train,rl_es_eval,rl_es_record,distillation_train}_workflow.yaml
- osmo/run_osmo.py - Python CLI replacing the internal interactive run.sh
- osmo/README.md - usage docs
- README.md - new "OSMO Cloud Submission" section linking to osmo/README.md
Sanitization vs the internal originals:
- Replace NVIDIA proprietary copyright header with SPDX-Apache-2.0
- Drop OMNI_SERVER env var and omni-auth credentials block (RL workflows
pull USDs from the OSMO groot_mobility_rl_es_usds dataset, no internal
Nucleus needed)
- Drop the three pip install lines that re-installed requirements.txt,
the X-Mobility wheel, and the mobility_es editable extension at OSMO
startup; these are now baked into Dockerfile.rl by PR-1
- Empty out internal-only defaults (image, base_policy_ckpt_artifact);
run_osmo.py errors fast if the user doesn't supply them
- Rename default workflow_name afm_rl_es -> compass_rl_es to drop the
internal "afm" branding
huggingface-cli invocation:
- Use ${ISAACLAB_PATH}/isaaclab.sh -p -m huggingface_hub.commands.huggingface_cli
rather than the bare /isaac-sim/kit/python/bin/huggingface-cli script. The
bare binary's shebang invokes the bundled Python directly, bypassing
python.sh and missing the omni.pip.cloud/pip_prebundle path where the
pre-installed `requests` package lives. The wrapper-based invocation gets
the right sys.path and login succeeds.
run_osmo.py:
- argparse subparsers for train / eval / record / distill
- Reads WANDB_API_KEY / HF_TOKEN from env, with --prompt fallback
- Resolves --image (pre-built) or builds + pushes the right Dockerfile
using --registry-prefix / $COMPASS_OSMO_REGISTRY
- --dry-run prints the computed `osmo workflow submit` invocation
- Distillation does not require HF_TOKEN
- All four subcommands verified end-to-end with --dry-run rendering
syntactically correct submit invocations
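A rough sketch of the CLI shape described above: one subparser per subcommand, secrets read from the environment, and a dry-run mode that only renders the submit command. This is not the actual osmo/run_osmo.py. The `--set key=value` syntax is an assumption about how template variables are passed to `osmo workflow submit`, and `build_parser`, `resolve_secret`, and `render_submit` are hypothetical names.

```python
import argparse
import os
import shlex


def build_parser():
    """Build a parser with one subparser per workflow kind."""
    parser = argparse.ArgumentParser(prog="run_osmo.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("train", "eval", "record", "distill"):
        p = sub.add_parser(name)
        p.add_argument("--image", default=None, help="pre-built image to use")
        p.add_argument("--dry-run", action="store_true",
                       help="print the submit command instead of running it")
    return parser


def resolve_secret(env_name, env=None):
    """Read a secret such as HF_TOKEN from the environment, failing fast if unset."""
    env = os.environ if env is None else env
    value = env.get(env_name)
    if not value:
        raise SystemExit(f"{env_name} is not set")
    return value


def render_submit(workflow_yaml, set_args):
    """Render the submit command; the --set syntax is an assumption."""
    cmd = ["osmo", "workflow", "submit", workflow_yaml]
    for key, value in sorted(set_args.items()):
        cmd += ["--set", f"{key}={value}"]
    return shlex.join(cmd)


args = build_parser().parse_args(["train", "--image", "foo", "--dry-run"])
preview = render_submit("osmo/workflows/rl_es_train_workflow.yaml",
                        {"image": args.image})
```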
Smoke-tested:
- Local docker run reproducing the OSMO entry script with num_envs=1:
hf-login + base-policy ckpt load + ResidualPPOTrainer rollout (256 steps)
- Real OSMO submit (compass_rl_es_g1_smoke_v2-1) on pool groot-l40-01
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Migrate the agentic-skills tooling that automates the COMPASS training
pipeline (SAGE-10k scene search/download → USD conversion → scene
registration → preview → train → eval) from the internal repo
(gitlab-master.nvidia.com:12051/ml_nav/compass:.claude/skills/compass/)
to the public NVlabs/COMPASS layout.
New layout:
- .claude/skills/compass/SKILL.md - Claude Code skill definition
- .claude/skills/compass/scripts/sage10k_search.py - SAGE-10k search by
text query
- .claude/skills/compass/scripts/sage10k_to_usd.py - SAGE-10k -> USD
converter
Sanitization vs the internal originals:
- Bump Isaac Lab version reference 2.3.2 -> 3.0.0-beta1 to match PR-1
- Update OSMO submission section to recommend the Python launcher
(osmo/run_osmo.py) introduced by PR-3, with osmo/workflows/ paths
- Replace internal wandb artifact example
(nvidia-isaac/afm_train/model-u5f67ich:v3) with a generic placeholder
- Add SPDX-Apache-2.0 headers to both Python scripts
- Skip .claude/settings.local.json (user-specific local config)
The skill itself is unchanged in structure; only references to the public
OSMO launcher and Isaac Lab pin moved forward. Conda-based execution
wrapper (conda run -n <ENV_NAME> ...) stays as the documented host-side
pattern; the Docker dev environment from item #7 (when implemented)
becomes a parallel option.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Replace the manual GUI-based occupancy-map authoring workflow with a
one-shot CLI that wraps Isaac Sim's isaacsim.asset.gen.omap APIs and emits
ROS-style PNG + YAML at <usd_dir>/omap/.
New: scripts/generate_omap_from_usd.py
- Wraps isaacsim.asset.gen.omap.bindings._omap.Generator with a small
argparse front-end (--out-dir, --cell-size, --z-min/--z-max, --bounds,
--padding, --map-name)
- Boots a headless SimulationApp, enables the omap extension, opens the
input USD, ensures a UsdPhysics.Scene, computes world bounds from the
stage bbox (with default 1m padding), generates the 2D occupancy buffer,
writes RGBA PNG + ROS-compatible YAML
- Output convention matches the existing bundled maps (top-left origin:
YAML.origin = (xmin, ymax), image row 0 = ymax)
Loader: compass/rl_env/exts/mobility_es/mobility_es/utils/occupancy_map.py
- New auto-discovery fallback: when OMAP_PATHS has no entry for the scene,
look at <dirname(usd_path)>/omap/occupancy_map.yaml (where the generator
drops files by default). Skipped for scenes using MultiUsdFileCfg (no
single USD path).
- New scenes can now skip the OMAP_PATHS registration entirely.
Docs: compass/rl_env/README.md
- "Add Occupancy Map" step now points at the script as the primary flow;
the manual Isaac Sim UI flow stays as a fallback.
Verification on the office scene:
- Generator produced 245x242 RGBA PNG + YAML under /tmp/omap_test/office/
(vs the existing 205x202; difference is the default 1m padding).
- Ran a 300-sample collision-check (loader-side mirror in scratch
/tmp/verify_omap.py): 200/300 came back free, 100/300 collision. Visual
annotation confirms every free sample lands on an unoccupied (white)
cell, so the script's output is semantically correct under
OccupancyMapCollisionChecker.
Two larger scenes were attempted but blocked by Isaac Sim 3.0-beta1
flakiness unrelated to this script:
- combined_single_rack (3.1 GB, deep refs): open_stage hung; will need
more update ticks or a heavier stage-loading wait. Out of scope for this
PR.
- sample_small_footprint (428 MB): repeatedly crashed kit at 2s with
std::out_of_range in omni.graph.core during init, even with a clean
container/GPU. Environmental.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
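To make the occupancy-map output convention and the auto-discovery fallback concrete, here is a small sketch under stated assumptions. The function names are hypothetical and do not come from occupancy_map.py, and the 0.05 m cell size in the padding check is an assumption (the commit message does not state the cell size); it is simply the value at which 1 m of padding on each side accounts for the 40-cell growth from 205x202 to 245x242.

```python
import math
import os


def world_to_pixel(x, y, origin, cell_size):
    """Map a world point to (row, col) with a top-left origin.

    origin is (xmin, ymax), so image row 0 corresponds to y == ymax.
    """
    xmin, ymax = origin
    col = int(math.floor((x - xmin) / cell_size))
    row = int(math.floor((ymax - y) / cell_size))
    return row, col


def default_omap_yaml(usd_path):
    """Fallback location checked when no explicit map is registered."""
    return os.path.join(os.path.dirname(usd_path), "omap", "occupancy_map.yaml")


def padded_cells(cells, padding_m, cell_size):
    """Cell count along one axis after padding both sides by padding_m."""
    return cells + int(round(2 * padding_m / cell_size))


row, col = world_to_pixel(-4.0, 2.0, origin=(-5.0, 3.0), cell_size=0.25)
fallback = default_omap_yaml("/data/scenes/office/office.usd")
```

Because row 0 sits at ymax, the row index grows as y decreases; a loader that assumes a bottom-left origin would read the map vertically flipped.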
Cut COMPASS first-run UX from "30-60 min, 6 manual steps" to "3 commands,
~3 min after the image build" and make the steady-state dev loop feel like
a Python venv: host-side editor, host-side shell, but every python / pip /
tensorboard invocation transparently routed through the container.
Three-layer dev model:
Host shell (editor, terminal, prompt)
→ shim PATH (set by `source docker/activate`)
→ docker exec (translates host CWD to container path)
→ daemon container (compass-rl image, repo bind-mounted at
/workspace/COMPASS)
Quick start:
export HF_TOKEN=hf_xxx
./docker/run.sh assets # USDs + X-Mobility ckpt → ./assets/
./docker/run.sh build # build the dev image
source ./docker/activate # venv-like activation
python run.py -c configs/train_config.gin -o /tmp/out \
-b ./assets/x_mobility.ckpt --enable_camera
Elegance: a single bind-mount ($(pwd) → /workspace/COMPASS) covers the
entire dev experience. The only extras are X11 socket forwarding (for
--viz kit) and a writable kit shader cache. Compare with the
robotic_grounding/workflow/run.sh reference (~10 individual mounts because
it isn't structured around a single repo root).
New files:
- docker/run.sh — subcommand wrapper: build / assets / up / down /
exec / shell / status. Single source of truth for mount + env args
via _compass_run_args(). Container name hashes the absolute repo path
(compass-<user>-<sha1[1:8]>) so multiple checkouts coexist.
- docker/activate — sourceable; brings up the container if needed,
generates a tmp dir of shim scripts (python, pip, isaaclab.sh,
tensorboard, pytest, yapf, pylint, pre-commit) on PATH. Each shim
docker-exec's into the container with the host CWD translated to the
container path. Defines deactivate() to revert PATH/PS1 and clean up.
- docker/prepare_assets.sh — HF downloader for compass_usds.zip +
x_mobility-nav2-semantic_action_path.ckpt → ./assets/. Cache-aware,
no-op on second run.
- docker/README.md — subcommand reference, multi-checkout / multi-GPU
notes, git workflow notes, troubleshooting.
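The two mechanisms that make the shims work, per-checkout container naming and host-to-container CWD translation, can be sketched as follows. The real logic lives in the shell scripts docker/run.sh and docker/activate; this Python version is only an illustration, the function names are hypothetical, "first eight hex characters" is an assumed reading of `sha1[1:8]`, and falling back to the repo root for directories outside the checkout is also an assumption.

```python
import hashlib
import os

CONTAINER_ROOT = "/workspace/COMPASS"


def container_name(repo_root, user):
    """Derive a per-checkout container name from the absolute repo path."""
    digest = hashlib.sha1(os.path.abspath(repo_root).encode("utf-8")).hexdigest()
    return f"compass-{user}-{digest[:8]}"


def translate_cwd(host_cwd, repo_root):
    """Translate a host working directory to its bind-mounted container path."""
    rel = os.path.relpath(host_cwd, repo_root)
    if rel == os.curdir or rel.split(os.sep)[0] == os.pardir:
        # At the repo root, or outside the checkout: use the mount root.
        return CONTAINER_ROOT
    return f"{CONTAINER_ROOT}/{rel.replace(os.sep, '/')}"


name_a = container_name("/home/u/COMPASS", "u")
name_b = container_name("/home/u/COMPASS_2", "u")
cwd = translate_cwd("/home/u/COMPASS/compass/rl_env/exts/mobility_es",
                    "/home/u/COMPASS")
```

Hashing the absolute path is what lets two checkouts by the same user run side by side without their containers colliding.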
Modified:
- docker/Dockerfile.rl — install COMPASS at /workspace/COMPASS (so
/workspace/isaaclab from the base image survives the bind-mount); add
/usr/local/bin/python wrapper that exec's Isaac Sim's bundled python.sh
directly (so `python run.py` inside the container does not need
${ISAACLAB_PATH}/isaaclab.sh -p boilerplate).
- README.md — Quick Start now leads with the Docker path; bare-metal
install moved under "Manual install".
- .dockerignore — exclude ./assets, ./.cache, ./.git, ./.nv,
./.nvidia-omniverse from the build context.
- .gitignore — exclude /assets/, /.cache/, /.nv/, /.nvidia-omniverse/.
Verification on this checkout:
- `./docker/run.sh build && ./docker/run.sh up` brings up the container.
- From an activated shell:
* `python --version` → Python 3.12.12 (Isaac Sim's bundled python).
* `python -c "import mobility_es; print(mobility_es.__file__)"` resolves
to /workspace/COMPASS/compass/rl_env/exts/mobility_es/... (the
bind-mount path → host edits hot-reload via the editable install).
* `python -c "import isaaclab; print(isaaclab.__file__)"` resolves to
/workspace/isaaclab/source/... (base image, untouched by bind-mount).
* `python -c "import torch; print(torch.cuda.is_available())"` → True.
* `cd compass/rl_env/exts/mobility_es && python -c "import os; print(os.getcwd())"`
prints /workspace/COMPASS/compass/rl_env/exts/mobility_es (CWD
translation working under repo subdirs).
- End-to-end smoke: `python run.py -c configs/train_config.gin -o /tmp/out
-b ./assets/x_mobility.ckpt --num_envs 1 --enable_camera --headless`
reaches scene-creation and simulation-start (162s, then PPO setup) via
the activate shim. Container exits cleanly on `./docker/run.sh down`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Replace the hand-served `gh_page` branch with a two-tier publication that
auto-deploys via GitHub Actions on every push to main:
- nvlabs.github.io/COMPASS/ - academic landing (Bulma static site,
migrated from gh_page into docs/project_page/)
- nvlabs.github.io/COMPASS/docs/ - Sphinx handbook with the NVIDIA theme
(matches agentic_model_training, RAPIDS, Isaac Lab look)
Stack:
- Sphinx 7.x + myst-parser (markdown stays markdown, no .rst)
- nvidia-sphinx-theme (NVIDIA OSS house style)
- sphinx_design (grid cards on the landing), sphinx_copybutton,
sphinxcontrib-mermaid
Layout under docs/:
- project_page/ - academic landing (Bulma; mp4/png LFS-tracked, rest as
Git blobs; index.html grew a "Docs" CTA next to Code/arXiv pointing at
./docs/, plus copy fixes: "Nvidia,"->"NVIDIA,", "can achieves"->"can
achieve", "GROOT"->"GR00T", missing googletagmanager loader added so
analytics actually fires)
- handbook/ - Sphinx project (conf.py, Makefile, requirements.txt,
_static/custom.css for minor brand polish, plus the markdown sources)
- README.md - contributor guide (build + serve recipes, deploy
description, editing tips)
Handbook nav (4 captioned toctrees, all in docs/handbook/index.md):
Installation: Quick start / Docker-as-venv install / Agentic skills /
Adding a new embodiment or scene
Workflows: Training / Recording / Distillation / Export /
GR00T post-training (VLA fine-tuning)
ROS2 Deployment: Overview / Isaac Sim setup
Reference: OSMO cloud submission / Auto OMap from USDs / Contributing
Each handbook page owns its content directly — no `{include}` of source
READMEs. As part of this change the five top-level READMEs that the
handbook used to transclude have been deleted and their content folded
into the corresponding handbook pages:
- compass/rl_env/README.md -> docs/handbook/extending.md
- docker/README.md -> docs/handbook/installation/docker.md
- osmo/README.md -> docs/handbook/osmo.md
- ros2_deployment/README.md -> docs/handbook/deployment/ros2.md
- ros2_deployment/ISAACSIM_README.md -> docs/handbook/deployment/isaac_sim.md
Three READMEs are kept (per "outside docs/"):
- README.md (root) - slimmed to overview + 5-line quick start + pointer
to the handbook for everything else (was ~320 lines,
now ~85)
- docs/README.md - contributor docs setup notes (build/serve recipes)
- docs/project_page/README.md - academic page archive
CI workflow at .github/workflows/docs.yml:
- python -m pip install -r docs/handbook/requirements.txt
- make -C docs/handbook html (sphinx-build -W; warnings -> errors,
including broken internal links - no `myst.xref_missing` /
`image.not_readable` suppressions)
- copy docs/project_page/* into _site/, docs/handbook/_build/html into
_site/docs/, deploy via actions/deploy-pages@v4 (no intermediate
gh-pages branch).
- LFS pull enabled so academic-landing mp4/png assets reach the deploy.
One-time owner action (called out in the PR description, not done by
this commit): Settings -> Pages -> Source: "GitHub Actions" (replaces
"Deploy from a branch: gh_page"). The gh_page branch becomes a frozen
archive — not deleted, not rebuilt.
Cross-references repointed at handbook URLs (Sphinx outputs .html files,
not directory-style URLs):
- README.md (root) - 4 deleted-README links -> handbook .html URLs
- CLAUDE.md - 2 deleted-README references repointed
- osmo/run_osmo.py docstring - "See osmo/README.md" -> handbook URL
- .claude/skills/compass/SKILL.md - OSMO link -> handbook URL
- release_tracker.md pending items - retargeted at handbook URLs;
already-completed checkboxes left as historical record
Camera flag: standardize on `--enable_cameras` (plural) everywhere.
Verified canonical name in
/workspace/isaaclab/source/isaaclab/isaaclab/app/app_launcher.py:276
and that record.py:66 already uses the plural. Singular wrongly appeared
in the root README quick-start, quickstart.md, training.md (x2),
distillation.md, gr00t_finetuning.md - all fixed.
Side fix in source: drop the trailing `---` from
ros2_deployment/README.md before deletion (it was confusing docutils
when transcluded — moot now that the README is gone).
.gitignore additions:
- /assets/ (Docker-asset downloads from PR-7)
- /.cache/ (pip / pre-commit / huggingface caches)
- /.nv/, /.nvidia-omniverse/ (Isaac Sim runtime cruft)
- /docs/handbook/_build/ (Sphinx local-build output)
Local verification:
python3 -m venv /tmp/sphinx_venv
/tmp/sphinx_venv/bin/pip install -r docs/handbook/requirements.txt
make -C docs/handbook clean html
-> "build succeeded" with -W enabled (zero warnings).
-> 16 handbook pages present.
-> Combined preview at /tmp/_site/ via python -m http.server hits
both / (academic) and /docs/ (handbook) cleanly.
-> Regression test: a deliberate `[bogus](does_not_exist.md)` link in
a page now fails -W as expected, confirming the strict posture is
load-bearing.
release_tracker.md item #6 status flipped 🟢 with the full checklist
ticked (Pages settings switch + first deploy still pending — owner
action).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
…rnal defaults
Pre-release leak audit found three categories of internal-only references
in the OSMO entry path. Fixed in this commit:
1. USDs from internal OSMO dataset -> HuggingFace
The three RL workflows (train / eval / record) used to mount an
internal OSMO dataset `groot_mobility_rl_es_usds` and `cp` USDs out
of `{{input:0}}/...`. Public users have no access to that dataset.
Replaced the `cp` step with `huggingface_cli download nvidia/COMPASS
compass_usds.zip --repo-type dataset`, then unzip into
compass/rl_env/exts/mobility_es/mobility_es/. Mirrors the host-side
docker/prepare_assets.sh recipe. The `inputs:` block (and its
trailing `name: groot_mobility_rl_es_usds` line) is gone.
2. X-Mobility ckpt from internal wandb artifact -> HuggingFace
Same three workflows used `wandb artifact get
{{base_policy_ckpt_artifact}} --root ./[base_policy/]` to pull the
public X-Mobility ckpt from a private wandb mirror. Replaced with
`huggingface_cli download nvidia/X-Mobility
x_mobility-nav2-semantic_action_path.ckpt`, then rename to
`model.ckpt` so downstream `-b ./model.ckpt` / `-b
./base_policy/model.ckpt` invocations stay byte-identical. The
`base_policy_ckpt_artifact` template var is gone from defaults.
`osmo/run_osmo.py` no longer accepts `--base-policy-ckpt`.
3. Internal wandb-project defaults
`osmo/run_osmo.py` previously baked in `compass_rl_enhance` /
`afm_train` as `nvidia-isaac`-entity defaults; "afm" is internal
branding. Dropped the DEFAULT_WANDB_PROJECT dict entirely; made
`--wandb-project` `required=True` on every subparser that hits a
wandb-enabled workflow.
Also:
- Updated docs/handbook/osmo.md: prerequisites no longer mention the
OSMO dataset upload, the wandb base-policy mirror, or the
--base-policy-ckpt flag. Quick-start example uses --wandb-project
with a generic value. Subcommand stanzas updated. Troubleshooting
bullet on dataset-not-found now explicitly distillation-only;
added a new bullet for HF download failures.
- release_tracker.md: new workstream #8 (Pre-release leak audit +
sanitization) added to the summary table and as a section, with
status 🟡 (most boxes ticked, OSMO smoke + maintainer review
pending).
- Platform names (`ovx-l40` for RL, `dgx-h100` for distillation) kept
as defaults per user direction — these are recommended pool names
rather than internal-only references.
Verification:
- Handbook still passes -W: `make -C docs/handbook clean html`.
- grep -rnE 'groot_mobility_rl_es_usds|nvidia-isaac/|afm_train'
excluding _build/.git/release_tracker.md/dev_env_plan.md returns
no live-source hits.
- `python osmo/run_osmo.py train --help` shows --wandb-project
as required, no --base-policy-ckpt flag.
- `python osmo/run_osmo.py train --experiment-name probe --image foo
--dry-run` errors with "the following arguments are required:
--wandb-project", as expected.
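The grep gate in the verification list can be mirrored in a few lines of Python, which is handy if the check is ever moved into a test. This is a sketch only: the pattern is copied from the grep command above, while `find_leaks` and the in-memory `files` mapping are hypothetical stand-ins for walking the repository with the same exclusions.

```python
import re

# Same alternation as the grep -rnE pattern used in the verification step.
LEAK_PATTERN = re.compile(r"groot_mobility_rl_es_usds|nvidia-isaac/|afm_train")


def find_leaks(files):
    """Return (path, line_number, line) for every line matching the pattern.

    files maps a path to its text content; a real gate would read from disk
    and skip build output and tracker files.
    """
    hits = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEAK_PATTERN.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


sample = {
    "osmo/run_osmo.py": "project = args.wandb_project\n",
    "old/workflow.yaml": "inputs:\n  name: groot_mobility_rl_es_usds\n",
}
leaks = find_leaks(sample)
```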
Out of scope here:
- ros2_deployment/compass_navigator/setup.py maintainer attribution
(flagged in tracker; defer to user).
- release_tracker.md and dev_env_plan.md gitlab-master references
(handled at ship time per the existing CHANGELOG distillation gate).
- A live OSMO smoke test against the new workflows (image rebuild +
resubmit pending).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Two release-readiness changes:
1. .github/workflows/pre-commit.yml — runs the existing pre-commit hooks
(yapf, pylint, nbstripout, clang-format, end-of-file-fixer, trailing-
whitespace, requirements-txt-fixer, check-added-large-files) on every
PR + push to main. Uses pre-commit/[email protected] for hook-env
caching. Python 3.11 (yapf v0.31.0 still depends on lib2to3, removed
in Python 3.13). pylint runs without project deps because .pylintrc
has `import-error` disabled.
2. requirements.txt — replace 17 unpinned lines with `==<version>` pins
based on the package set verified inside compass-rl:latest (the
image used for the just-passed OSMO smoke):
einops==0.8.2, gin-config==0.5.0, h5py==3.16.0, matplotlib==3.10.8,
moviepy==2.2.1, msgpack==1.1.2, numpy==1.26.4, onnx==1.20.1,
onnxruntime-gpu==1.25.1, pandas==3.0.1, pytorch-lightning==2.6.1,
timm==1.0.26, torcheval==0.0.7, transformers==4.57.6, wandb==0.25.1,
wheel==0.47.0, zmq==0.0.0
diffusers==0.29.2 was already pinned. zmq==0.0.0 stays as the stub
it currently is (functional swap to pyzmq is out of scope).
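A quick way to confirm that every requirement carries an exact pin is sketched below. The helper name `unpinned_lines` is hypothetical and the check is deliberately simple: it treats any non-blank, non-comment line without `==` as unpinned.

```python
def unpinned_lines(requirements_text):
    """Return (line_number, requirement) for every line lacking an == pin."""
    bad = []
    for lineno, raw in enumerate(requirements_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line and "==" not in line:
            bad.append((lineno, line))
    return bad


example = "einops==0.8.2\nnumpy\n\nwandb==0.25.1  # logging\n"
problems = unpinned_lines(example)
```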
Surfaced legacy violations on the first `pre-commit run --all-files` —
fixed in the same commit so the workflow lands green:
- Trailing whitespace + missing EOF newline cleanup across 22 files
(mostly empty __init__.py files, project-page CSS/JS, mobility_es
pyproject.toml).
- yapf reformatted argparse blocks in osmo/run_osmo.py to its 2-space
multi-line style.
- requirements-txt-fixer alphabetized docs/handbook/requirements.txt.
- scripts/generate_omap_from_usd.py: drop unused `import os`; add a
scoped `pylint: disable=ungrouped-imports` on the
`isaacsim.asset.gen.omap.bindings` import (intentionally late so the
extension can be enabled before its bindings load).
Verification:
- `pre-commit run --all-files` is green locally (Python 3.10 venv, same
hook set as the workflow).
- `requirements.txt` parses; the only line without `==` is the
(intentional) blank EOF.
- Workflow YAML is syntactically valid; pre-commit/[email protected] is
the canonical action.
release_tracker.md: new workstream #9 (CI/CD setup + dep pinning) added
to the summary table and as a section, status 🟡 (the file changes are
done; the live first-CI-run is the only remaining checkbox).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
…nt skills
Hybrid front-door pattern: /compass stays as the umbrella (training, SAGE
workflows, eval, OSMO), three new specialty siblings handle the
high-friction onboarding moments.
- /compass: drop conda + ISAACLAB_PATH wrappers throughout, replace Setup
with the 3-command docker-as-venv flow, add Specialty skills routing
section, soften MUST-style rules with the why. OSMO example aligned with
#8 sanitization (no --base-policy-ckpt; HF download inside the workflow).
- /compass-deploy: ckpt -> ONNX (-r/-g+-e branch) -> TRT engine -> ROS2
launch scaffold. Skill prints the launch command but doesn't run it.
- /compass-debug: 8-check diagnostic with bundled
scripts/compass_status.sh (parallel checks; --deep adds Isaac Sim init;
--ckpt loads via torch). Reports root cause and routes to the right
specialty for the fix.
- /compass-newembodiment: interactive robot onboarding. Parses
pre-supplied input; AskUserQuestion only for missing fields; shows diff
before writing; smoke-tests with --num_envs 1.
- Progressive-disclosure split: Setup SAGE Local extracted to
compass/references/setup-sage-local.md.
- docs/handbook/agentic.md: opens with a "Pick the right skill" matrix and
per-skill sections.
- release_tracker.md: flipped #4's migration boxes ✅; added #10 row and
section for this work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
The eval subcommand of osmo/run_osmo.py already exposes --embodiment and
--environment overrides; train didn't, so multi-embodiment training sweeps
had to either hand-edit the workflow YAML or run with the gin-config
default (g1) for every job.
- osmo/run_osmo.py: add --embodiment / --environment flags to the train
subparser; thread them into cmd_train's set_args dict.
- osmo/workflows/rl_es_train_workflow.yaml: accept embodiment /
environment template vars (default empty), conditionally append them to
TRAIN_CMD and to the post-training EVAL_CMD that runs in the same
workflow. Pattern matches what rl_es_eval_workflow.yaml already does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
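The "default empty, append only when set" pattern described for TRAIN_CMD can be sketched in Python as below. The workflow itself does this in YAML/shell templating; `build_train_cmd` is a hypothetical helper, and the assumption that the underlying training command takes `--embodiment` / `--environment` with these exact spellings is not confirmed by the commit message.

```python
import shlex


def build_train_cmd(base_cmd, embodiment="", environment=""):
    """Append overrides only when they are non-empty, so defaults still apply."""
    cmd = list(base_cmd)
    if embodiment:
        cmd += ["--embodiment", embodiment]
    if environment:
        cmd += ["--environment", environment]
    return shlex.join(cmd)


base = ["python", "run.py", "-c", "configs/train_config.gin"]
default_cmd = build_train_cmd(base)
override_cmd = build_train_cmd(base, embodiment="h1")
```

Leaving the template variables empty reproduces the previous behavior, so existing submissions keep using the gin-config default.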
The internal repo's release/internal.yaml strips a 113-line benchmark.py
that hardcodes nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance and
afm_rl_enhance defaults. The public side currently has no benchmark
runner, and the No-regression benchmark gate (P0 release blocker) has no
concrete tooling. Track the sanitization-and-land work as a sub-bullet
under that gate so it has an owner-able shape; suggested landing path
osmo/run_benchmark.py mirrors osmo/run_osmo.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
Adds 8-GPU distributed residual-RL training (torchrun + manual
all-reduce), per-stage timing instrumentation, supporting OSMO
workflow, and a perf-analysis report.
Multi-GPU (run.py, compass/residual_rl/ppo.py, residual_ppo_trainer.py)
- --distributed flag, dist.init_process_group(nccl), per-rank device
(cuda:local_rank), per-rank env seed offset, env_cfg.sim.device
pinned per rank, torch.cuda.set_device(local_rank) before init so
NCCL object collectives route through per-rank GPUs, conditional
nn.DataParallel(device_ids=[local_rank]), rank-0-only Logger /
RecordVideo (no-op logger on other ranks).
- ResidualPPOTrainer: world_size / global_rank state, initial param
broadcast from rank 0, rank-0 gating on _save_ckpt /
_save_episode_logs / _upload_video, weighted per-rank episode-log
aggregation via dist.all_gather_object on CPU-converted dicts,
torch.load(map_location=self.device) on resume,
os.makedirs(exist_ok=True).
- PPO.update: manual gradient all-reduce (AVG) BEFORE clip
(canonical DDP order — clipping the averaged gradient, not the
per-rank pre-avg one), kl_mean all-reduce before LR adaptation so
all ranks pick the same LR, metric all-reduce on (value_loss,
surrogate_loss, entropy). Returns a diagnostics dict so the trainer
logs ppo/learning_rate, ppo/kl_mean, ppo/entropy,
ppo/action_std_mean each iter.
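Why the all-reduce has to come before the clip can be shown with a single-process simulation. This sketch uses plain Python lists in place of tensors and a list of per-rank gradients in place of a real collective; in the actual trainer the averaging would be a torch.distributed all-reduce and the clip a gradient-norm clip, neither of which is reproduced here. The function names are hypothetical.

```python
import math


def average_gradients(per_rank_grads):
    """Stand-in for an AVG all-reduce: element-wise mean across ranks."""
    world_size = len(per_rank_grads)
    return [sum(vals) / world_size for vals in zip(*per_rank_grads)]


def clip_by_global_norm(grad, max_norm):
    """Scale the gradient down so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# Two ranks whose gradients partly cancel along the first coordinate.
rank_grads = [[3.0, 4.0], [-3.0, 4.0]]

# Order used by the commit: average first, then clip the averaged gradient.
avg_then_clip = clip_by_global_norm(average_gradients(rank_grads), max_norm=1.0)

# The other order: each rank clips its own gradient, then the results are averaged.
clip_then_avg = average_gradients(
    [clip_by_global_norm(g, max_norm=1.0) for g in rank_grads])
```

In this example the averaged-then-clipped gradient has norm exactly 1.0, while clipping per rank first yields a norm of about 0.8, so the two orders produce different update sizes whenever per-rank gradients disagree.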
OSMO 8-GPU workflow (osmo/workflows/rl_es_train_8gpu_workflow.yaml,
osmo/run_osmo.py)
- 8-GPU / 80-CPU / 800-GiB resource block on ovx-l40; train phase
via torch.distributed.run --nproc_per_node=8 run.py --distributed
with --num_envs 32 per rank (256 total envs); eval phase as a
single-process call.
- run_osmo.py: --num-gpus flag (choices 2, 8) routes to the matching
workflow YAML.
Per-stage timing instrumentation (always on, ~2-3% overhead)
- ResidualPPOTrainer.learn(): CUDA-synced _timer context manager
around each top-level stage (env_reset_and_init, rollout,
compute_returns, update, logging, checkpoint) plus rollout
sub-steps (act, env_step, process_env_step, base_policy_process).
- _install_env_timers monkey-patches Isaac Lab managers + sim.step /
sim.render + per-ObsTerm .func so logs include
time/rollout/env_step/{obs, sim_step, ...} and
time/rollout/env_step/obs_term/<group>/<name>. Once-per-iter
log_dict via the existing logger.
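A stage timer of the kind described above can be sketched as a context manager that accumulates wall-clock time under hierarchical keys. This is not the trainer's `_timer`; `StageTimer` is a hypothetical class, and the optional `sync` callable stands in for a device synchronization call (for GPU work one would pass something like a CUDA synchronize function so that queued kernels are included in the measurement).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate elapsed seconds per named stage between flushes."""

    def __init__(self, sync=None):
        self._sync = sync  # optional callable run before each clock read
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        if self._sync is not None:
            self._sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self._sync is not None:
                self._sync()
            self.totals[f"time/{name}"] += time.perf_counter() - start

    def flush(self):
        """Return the accumulated timings and reset, e.g. once per iteration."""
        out = dict(self.totals)
        self.totals.clear()
        return out


sync_calls = []
timer = StageTimer(sync=lambda: sync_calls.append(1))
with timer.stage("rollout"):
    with timer.stage("rollout/env_step"):
        sum(range(1000))  # placeholder for real work
timings = timer.flush()
```

Synchronizing on both sides of the timed region is what makes the numbers attributable to a stage, and it is also the source of the small overhead the commit message mentions.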
Perf-analysis report (docs/PERF_ANALYSIS.{md,pdf})
- Methodology, baseline 32-env 256-step iter breakdown, A/B
experiments (depth-drop, DLAA + denoiser, BF16 autocast,
per-stage env_step breakdown, per-ObsTerm breakdown of
obs.compute), ranked recommendations.
Signed-off-by: Wei Liu <liuw@nvidia.com>
Sanitizes the 113-line benchmark.py from the internal repo and lands it as
osmo/run_benchmark.py next to run_osmo.py. Closes the "sanitize and land
internal benchmark.py" subtask under the No-regression benchmark gate.
Behavior: fires one rl_es_eval_workflow.yaml submission per --environments
entry, all using the same --embodiment. Each run writes the usual eval/*
metrics (goal_reached_rate / fall_down_rate / total_travel_time /
weighted_travel_time) to W&B at bm_<embodiment>_<env>_<experiment>; user
pulls those out for regression assessment.
Sanitization changes vs internal:
- Apache-2.0 SPDX header replacing NVIDIA proprietary copyright.
- Hardcoded registry nvcr.io/nvstaging/isaac-amr/groot_mobility_rl_enhance
-> --registry-prefix flag with $COMPASS_OSMO_REGISTRY fallback (mirrors
the run_osmo.py:80-83 pattern; errors fast if unset and --image-name not
given).
- --wandb-project-name: drop afm_rl_enhance_benchmark default, mark
required=True (mirrors run_osmo.py:93-95).
- Workflow path: ./workflows/rl_es_eval_workflow.yaml ->
Path(__file__).resolve().parent / "workflows" / ... so the script works
from any CWD and lives under osmo/.
- Adds --dry-run and --prompt for parity with run_osmo.py.
Reuses existing osmo/workflows/rl_es_eval_workflow.yaml unchanged. Default
sweep stays at 5 scenes (simple_office, warehouse_single_rack,
warehouse_multi_rack, combined_single_rack, combined_multi_rack), default
--embodiment stays h1; per user direction the matrix lives in the script
rather than a separate YAML.
docs/handbook/osmo.md: adds a Benchmark section + cross-reference under
the subcommand table.
release_tracker.md: ticks the sanitize-and-land subtask; widens the
section-8 grep gate to include groot_mobility_rl_enhance and
afm_rl_enhance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
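The fan-out described under "Behavior" amounts to one run name per environment. A minimal sketch, using the default scene list and naming scheme from the commit message; `benchmark_run_names` is a hypothetical helper and the experiment name in the example is made up.

```python
DEFAULT_ENVIRONMENTS = (
    "simple_office",
    "warehouse_single_rack",
    "warehouse_multi_rack",
    "combined_single_rack",
    "combined_multi_rack",
)


def benchmark_run_names(experiment, embodiment="h1",
                        environments=DEFAULT_ENVIRONMENTS):
    """Return one bm_<embodiment>_<env>_<experiment> name per environment."""
    return [f"bm_{embodiment}_{env}_{experiment}" for env in environments]


names = benchmark_run_names("nightly")
```

Each name would correspond to one eval-workflow submission, so a default sweep produces five runs that can be compared side by side in W&B.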
Closes the §8 sanitization tail in release_tracker.md and pastes the
iter-500 multi-GPU benchmark numbers into the No-regression gate.
Argparse defaults: drop the internal `afm_rl_enhance*` wandb-project
defaults and mark --wandb-project-name required in all three top-level
entry points. Mirrors the run_osmo.py:93-95 pattern we already applied to
osmo/run_benchmark.py:
- run.py:58 (was 'afm_rl_enhance')
- record.py:135 (was 'afm_rl_enhance_record')
- distillation_train.py:54 (was 'afm_rl_enhance_distillation')
After the fix, the extended §8 grep gate
grep -rnE "nvidia-isaac/|afm_train|groot_mobility_rl_enhance|afm_rl_enhance" .
returns zero live-source hits.
release_tracker.md hygiene:
- Title and version refs: "COMPASS 2.0" -> "COMPASS 1.6", target version
2.0.0 -> 1.6.0, integration branch updated to liuw/benchmark_port.
- §2&3 NuRec PR-2 (Buckets B+E+H) marked deferred to post-1.6: only
Bucket A (22b25ef) ships in 1.6.
- §11 multi-GPU PPO release-scope decision marked settled (ships in 1.6 by
squash-strategy default).
- §8 grep gate tightened: dropped `groot_mobility_rl_es_usds` from the
pattern, since it is the directory name inside the public HF
compass_usds.zip and is correctly referenced by osmo/workflows/*.yaml
after unzipping. Gate now returns zero live-source hits.
- §Pre-release gates → No-regression benchmark: pasted 4-embodiment ×
5-scene iter-500 multi-GPU results table (goal_reached / fall_down,
per-embodiment averages, headlines). Eval ran with the relaxed-heading
image compass_release_1_6_relaxed:c87052af (heading_threshold=π); source
ships default 0.1; release notes will document this.
- Marked "Define the regression matrix" complete (matrix is the script's
--environments default; documented). v1.5.0 baseline capture deferred;
the 1.6 numbers become the new published baseline.
- iter-1000 single-GPU baseline section stubbed for the in-flight cells.
ros2_deployment/compass_navigator/setup.py: removed a duplicated SPDX
Apache-2.0 license header (lines 16-28 were a verbatim copy of 1-14).
Maintainer fields kept as-is for 1.6 per user direction; team-alias swap
is a post-tag follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
yapf v0.31.0 imports lib2to3, which was removed from the standard
library in Python 3.13 (and broken on some 3.12 installations). The
pre-commit yapf hook fails to install on py3.12 with
`ModuleNotFoundError: No module named 'lib2to3'`, which blocks
`pre-commit run --all-files` at tag time and any local pre-commit run.
yapf v0.40.0+ bundles its own lib2to3 fork and no longer depends on the
stdlib copy. Bumping to v0.43.0 (latest stable as of this commit) fixes
the install on py3.12 with no formatting policy change.
Side effect: yapf added one PEP8 blank line in run.py between class
`_NoOpLogger` and the top-level `EmbodimentEnvCfgMap`. Pure formatting,
no behavior change.
Verification:
`pre-commit clean && pre-commit run yapf --files run.py record.py distillation_train.py ros2_deployment/compass_navigator/setup.py osmo/run_benchmark.py`
-> all Passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
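The dependency reasoning above can be expressed as a small check. This is a sketch only: the helper names are hypothetical, and the v0.40.0 cutoff is taken from the commit message rather than verified against yapf's release history here.

```python
import importlib.util

# Per the commit message, yapf v0.40.0+ bundles its own lib2to3 fork.
FIRST_BUNDLED_RELEASE = (0, 40, 0)


def parse_version(tag: str) -> tuple:
    """Turn a tag such as 'v0.43.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def needs_stdlib_lib2to3(yapf_tag: str) -> bool:
    """True if this yapf release relies on the standard-library lib2to3."""
    return parse_version(yapf_tag) < FIRST_BUNDLED_RELEASE


def stdlib_has_lib2to3() -> bool:
    """True if the running interpreter can still locate lib2to3."""
    return importlib.util.find_spec("lib2to3") is not None


def hook_should_install(yapf_tag: str, has_stdlib=None) -> bool:
    """The hook installs if lib2to3 is available or yapf does not need it."""
    if has_stdlib is None:
        has_stdlib = stdlib_has_lib2to3()
    return has_stdlib or not needs_stdlib_lib2to3(yapf_tag)
```

For example, `hook_should_install("v0.31.0", has_stdlib=False)` is false, while `hook_should_install("v0.43.0", has_stdlib=False)` is true, which matches the failure and the fix described in this commit.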
Distilled from release_tracker.md and the integration-branch commits.
Date kept as TBD; will be set to the tag-day date in the v1.6.0 commit.
The entry intentionally does NOT list full NuRec real2sim asset support
under "Added"; only Bucket A (Isaac Lab 3.0 API migration, commit
22b25ef) ships in 1.6. NuRec PR-2 (Buckets B+E+H) is deferred to a clean
post-1.6 PR off main.
Release notes (separate from CHANGELOG, drafted at tag time) will
additionally call out:
- Benchmark eval ran against compass_release_1_6_relaxed:c87052af, an
image with heading_threshold flipped from 0.1 to math.pi in
termination.py:33. Source ships the default 0.1; the relaxation is
image-only and is a known-test-config for the published v1.6.0
benchmark numbers.
- Pre-existing argparse defaults `afm_rl_enhance*` in run.py /
record.py / distillation_train.py were dropped; --wandb-project-name is
now required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
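To make the effect of the relaxed threshold concrete, here is a minimal sketch. It assumes the goal check compares an absolute, wrapped heading error (in radians) against `heading_threshold`; the actual logic in termination.py is not shown in this PR, so `wrap_to_pi` and `heading_ok` are hypothetical stand-ins.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Wrap an angle in radians into the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def heading_ok(robot_yaw: float, goal_yaw: float,
               heading_threshold: float = 0.1) -> bool:
    """Assumed check: absolute wrapped heading error within the threshold."""
    return abs(wrap_to_pi(goal_yaw - robot_yaw)) <= heading_threshold


# Default threshold (0.1 rad): a 1 rad heading error fails the check.
strict = heading_ok(0.0, 1.0)
# Relaxed threshold (pi): a wrapped error can never exceed pi, so it passes.
relaxed = heading_ok(0.0, 1.0, heading_threshold=math.pi)
```

Under this assumption, a threshold of π accepts every heading, so the published numbers would measure position success only. That is presumably why the release notes need to flag the eval image.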
The Sphinx config carries the only in-source COMPASS version string; its
`version` and `release` fields feed the handbook footer and the title
bar at nvlabs.github.io/COMPASS/docs/. The previous "2.0.0" was left
over from the early planning phase when this release was assumed to be a
major bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
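A minimal sketch of the two fields in question, assuming the new value is 1.6.0 (the target version named elsewhere in this PR). The project name and the choice to keep both fields identical are illustrative and not copied from the repository's conf.py.

```python
# Sphinx conf.py fragment (illustrative values, not copied from the repo).
project = "COMPASS"

# Assumed value: this PR retargets the release from 2.0.0 to 1.6.0.
release = "1.6.0"  # full version string, including any pre-release tags
version = release  # short version shown in titles; kept identical here
```

`version` and `release` are standard Sphinx configuration values, so updating them in one place is enough for themes that render them in the footer or title.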
All 20 cells of the iter-1000 single-GPU baseline sweep (run on pool
groot-l40-04 against the *_baseline wandb runs) completed. Pasted the
4×5 results table alongside the existing iter-500 multi-GPU table and
added a per-embodiment averages comparison with Δ deltas.
Key findings:
- carter (wheeled) improves on all axes at iter-1000 (+1.8 pp goal,
-1.6 pp fall, faster wtt): converges fast, benefits from more training.
- Bipeds (g1, h1) are roughly on-par or slightly regress on average at
iter-1000; the drift concentrates in `warehouse_multi_rack` (g1:
-25.9 pp goal, +20.8 pp fall; h1: -20.8 pp goal). Other scenes are
within seed-noise.
- spot is essentially flat.
- Validates the §11 multi-GPU PPO path in spirit: 8 GPUs × 500 iter ≈
1 GPU × 1000 iter in samples-seen, and resulting policies are within
seed-noise on most cells. The `warehouse_multi_rack` divergence for
bipeds is worth flagging in release notes but does not block tag.
- `simple_office` remains the universal weakness across configs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
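The samples-seen comparison above depends on per-GPU environment counts that this commit message does not state. The sketch below uses placeholder values (16 and 64 environments per GPU, 256 steps per iteration) purely to show the arithmetic. On raw GPU-iterations alone, 8 × 500 is four times 1 × 1000, so the two runs only match if the single-GPU run uses roughly 4× as many environments per GPU.

```python
def samples_seen(num_gpus: int, envs_per_gpu: int,
                 steps_per_iter: int, iterations: int) -> int:
    """Total environment steps collected over a training run."""
    return num_gpus * envs_per_gpu * steps_per_iter * iterations


# Placeholder configuration values; the real ones are not given in this PR.
multi_gpu = samples_seen(num_gpus=8, envs_per_gpu=16,
                         steps_per_iter=256, iterations=500)
single_gpu = samples_seen(num_gpus=1, envs_per_gpu=64,
                          steps_per_iter=256, iterations=1000)

# With a 4x larger per-GPU env count on the single-GPU run, the totals match.
matched = multi_gpu == single_gpu

# With the same per-GPU env count, the 8-GPU run sees 4x the samples.
same_envs_ratio = samples_seen(8, 64, 256, 500) / single_gpu
```

If the actual per-GPU environment counts differ from this 1:4 ratio, the "≈" in the bullet above should be read as approximate rather than exact.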
Surfaces and fixes issues that the older yapf v0.31.0 hook missed:
- yapf v0.43.0 wants slightly different formatting for a few existing
lines: line-break style in termination.py:43, dict comprehension in
residual_ppo_trainer.py:497, blank lines between nested defs in
residual_ppo_trainer.py (PEP8 nits), and a long print expression in
sage10k_search.py. Pure formatting, no behavior change.
- pylint flagged two real issues in residual_ppo_trainer.py:
1. C0301 line-too-long at line 328 — the tuple unpacking of
`self.base_policy_process(...)`. Wrapped with parentheses and
reformatted across three lines to fit the 100-char limit.
2. W0212 protected-access on obs_mgr._group_obs_term_cfgs /
_group_obs_term_names at lines 206-207. The access is intentional
and load-bearing — we need to reach into the obs manager's
internal term tables to wrap each obs term's `func` for the
rollout/obs perf-measurement breakdown. Added a scoped
`# pylint: disable=protected-access` comment with explanation.
Verified `pre-commit run --all-files` passes clean on liuw/release_ckpt
post-fix (all hooks: nbstripout, large-files, EOF, requirements-txt,
trailing-whitespace, yapf, clang-format, pylint).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wei Liu <liuw@nvidia.com>
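The W0212 fix can be pictured with the sketch below. It is not the code in residual_ppo_trainer.py: `wrap_with_timer` and `instrument_obs_terms` are hypothetical helpers, the manager is a `SimpleNamespace` stand-in, and the assumed layout (both private attributes being dicts keyed by group name) is an illustration. Only the two attribute names and the scoped pylint comment come from the commit message.

```python
import time
from collections import defaultdict
from types import SimpleNamespace


def wrap_with_timer(name, func, totals):
    """Return a wrapper that adds func's wall-clock time to totals[name]."""
    def timed(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            totals[name] += time.perf_counter() - start
    return timed


def instrument_obs_terms(obs_mgr, totals):
    """Wrap every observation term's `func` for a per-term timing breakdown."""
    # The private access is intentional: there is no public hook for wrapping
    # individual terms, so the lint suppression is scoped to these two lines.
    # pylint: disable=protected-access
    term_names = obs_mgr._group_obs_term_names
    term_cfgs = obs_mgr._group_obs_term_cfgs
    # pylint: enable=protected-access
    for group, names in term_names.items():
        for name, cfg in zip(names, term_cfgs[group]):
            cfg.func = wrap_with_timer(f"{group}/{name}", cfg.func, totals)


# Usage with a stand-in manager (assumed structure, for illustration only).
totals = defaultdict(float)
term_cfg = SimpleNamespace(func=lambda x: x + 1)
fake_mgr = SimpleNamespace(
    _group_obs_term_names={"policy": ["example_term"]},
    _group_obs_term_cfgs={"policy": [term_cfg]},
)
instrument_obs_terms(fake_mgr, totals)
result = term_cfg.func(1)
```

Pairing the `disable` with an `enable` keeps the suppression local, so any other protected access added to the file later is still reported.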