Closed
19 commits
22b25ef
Migrate mobility_es extension to Isaac Lab 3.0 API
nv-liuw May 7, 2026
1253f2a
Add OSMO cloud-submission workflows and Python launcher
nv-liuw May 7, 2026
000c3be
Migrate compass Claude Code skill from internal repo
nv-liuw May 8, 2026
b20fb67
Add USD-derived occupancy-map generator + loader fallback
nv-liuw May 8, 2026
725b79c
Add Docker-as-venv dev environment (docker/run.sh + docker/activate)
nv-liuw May 8, 2026
f4d89b8
Add COMPASS docs site: academic landing + Sphinx handbook
nv-liuw May 8, 2026
d0275cc
Sanitize OSMO workflows for public release: HF asset sources, no inte…
nv-liuw May 8, 2026
3cc5eaf
Add pre-commit CI workflow + pin requirements.txt to verified versions
nv-liuw May 8, 2026
e0b6f20
Refresh /compass for docker-as-venv; add deploy / debug / newembodime…
nv-liuw May 8, 2026
d605901
Thread --embodiment / --environment through OSMO train workflow
nv-liuw May 8, 2026
424a96a
Add benchmark.py sanitization subtask under No-regression benchmark gate
nv-liuw May 8, 2026
6ecc62e
Multi-GPU PPO training + perf instrumentation for residual RL
nv-liuw May 11, 2026
2cb460e
Add osmo/run_benchmark.py: sanitized port of internal benchmark.py
nv-liuw May 12, 2026
6b0d65c
Sanitization tail + iter-500 benchmark results for COMPASS 1.6
nv-liuw May 13, 2026
fdc8b92
Bump yapf pre-commit ref to v0.43.0 for Python 3.12 support
nv-liuw May 13, 2026
ed20cd9
Draft CHANGELOG [1.6.0] entry
nv-liuw May 13, 2026
980fad6
docs/handbook/conf.py: bump COMPASS_VERSION 2.0.0 -> 1.6.0
nv-liuw May 13, 2026
a3535a5
Add iter-1000 single-GPU benchmark results + side-by-side comparison
nv-liuw May 13, 2026
6ec66ad
Pre-commit cleanup: yapf reformats + pylint fixes for tag-time gate
nv-liuw May 13, 2026
140 changes: 140 additions & 0 deletions .claude/skills/compass-debug/SKILL.md
@@ -0,0 +1,140 @@
---
name: compass-debug
description: >
Diagnose why COMPASS isn't working: container, GPU, activated shell,
assets, Isaac Sim init, checkpoint validity. Use whenever the user
reports vague COMPASS errors — "training won't start", "something's
wrong", "why is this failing" — even without the word "debug". Make
sure to use this when the user mentions COMPASS isn't behaving and
doesn't have a more specific intent.
allowed-tools:
- Bash
- Read
- Grep
- Glob
---

You produce a diagnostic snapshot of a COMPASS dev environment and report what's working and what's broken. Read-only by design — you report problems but never auto-fix them. Once you've identified the root cause, point the user at the specialty skill that owns the fix.

This skill is the place users land when training "just won't start" or something obscure breaks. The goal is to surface the actual root cause in one screen, not to chase the symptom.

## When NOT to use this skill

- The user knows what they want to do (train, deploy, etc.) and just needs the workflow → `/compass`, `/compass-deploy`, etc.
- The user wants to add a robot platform → `/compass-newembodiment`.
- A specific error message has a clear fix in the handbook → quote the relevant page; debug isn't needed.

---

## Workflow

### Step 1: Run the diagnostic script

The skill bundles `scripts/compass_status.sh`, which runs the checks one after another and prints a Markdown table:

```bash
# dangerouslyDisableSandbox: true (script touches nvidia-smi)
bash <SKILL_BASE_DIR>/scripts/compass_status.sh
```

Output looks like:

```
| Status | Check | Detail |
|---|---|---|
| ✓ | Container | compass-rl up |
| ✓ | Activated shell | shim dir on PATH |
| ✓ | GPU | NVIDIA H100 80GB HBM3, 79980 MiB |
| ✓ | Base ckpt | ./assets/x_mobility.ckpt (1.2G) |
| ✓ | USDs | 8 entries in ./assets/usd/ |
| ✓ | Recent log | /tmp/isaaclab/logs/2026-05-08_14-23-02 |
```

For deeper diagnostics (slower, ~30s — runs Isaac Sim init headless to surface kit-init errors that don't show up in the quick check):

```bash
bash <SKILL_BASE_DIR>/scripts/compass_status.sh --deep
```

For checkpoint-specific issues (verifies a `.pt` file is a valid torch checkpoint and not corrupted):

```bash
bash <SKILL_BASE_DIR>/scripts/compass_status.sh --ckpt <PATH_TO_CKPT>
```

### Step 2: Interpret the output

The table is the report; just relay it to the user. Then add a one-paragraph interpretation: which row is the root cause, and what to do about it.

| Failed check | Most likely cause | Where to fix |
|---|---|---|
| Container | Container hasn't been started | `./docker/run.sh up` (or `./docker/run.sh build` if image missing) |
| Activated shell | User invoked Claude in a fresh shell | `source ./docker/activate` then retry |
| GPU | Driver issue OR sandbox blocking nvidia-smi | Check that `dangerouslyDisableSandbox: true` was set; if the failure persists, run `nvidia-smi` from the host shell to check the host driver |
| Base ckpt / USDs | Assets not downloaded yet | `./docker/run.sh assets` (needs `HF_TOKEN`) |
| Recent log | No training has run yet (informational) | Not a failure — just confirms a clean state |
| Isaac Sim init | Container build broken OR GPU not exposed | Try `./docker/run.sh down && ./docker/run.sh build` |
| Ckpt load | File corrupt OR wrong torch version | Re-download / re-train; if re-train, route to `/compass` |
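
As a rough sketch of the interpretation step, the failed rows can be pulled out of the report before writing the one-paragraph summary. The `failed_checks` function below is hypothetical (it is not part of the skill) and assumes the three-column layout and the `✗` marker shown above, with no `|` characters inside the Detail cell:

```bash
# Hypothetical helper, not shipped with the skill.
# Reads the markdown table on stdin and prints "Check: Detail" for every row
# whose Status cell is the failure marker.
failed_checks() {
  awk -F'|' '
    {
      status = $2; check = $3; detail = $4
      # Trim the padding spaces around each cell.
      gsub(/^ +| +$/, "", status)
      gsub(/^ +| +$/, "", check)
      gsub(/^ +| +$/, "", detail)
      if (status == "✗") print check ": " detail
    }
  '
}
```

Piping the report through it, for example `bash <SKILL_BASE_DIR>/scripts/compass_status.sh | failed_checks`, would leave only the rows that need a fix. Header and separator rows are skipped because their Status cell never equals the marker.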

### Step 3: Recommend the next skill

Don't try to fix the issue yourself — different fixes belong to different skills:

| Root cause class | Recommended next skill |
|---|---|
| Setup issue (container, assets, activated shell) | `/compass` (Setup COMPASS section) |
| Training-time issue (config, ckpt path, env name) | `/compass` (Train section) |
| Deploy-time issue (ONNX/TRT/ROS2) | `/compass-deploy` |
| New robot platform missing | `/compass-newembodiment` |

Anti-pattern guard: don't run fixes yourself. The user benefits from understanding what went wrong; running fixes blind hides root causes.
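
One way to connect the Step 2 check names to the Step 3 routing is a small lookup. This is a sketch only: `next_skill` is a hypothetical name, and placing "Isaac Sim init" under setup and "Ckpt load" under training is an assumption drawn from the two tables above, not something the skill defines. It only prints a recommendation and never runs a fix.

```bash
# Hypothetical helper: map a failed check name (the "Check" column) to the
# skill section that owns the fix. Mappings for "Isaac Sim init", "Ckpt load"
# and "GPU" are assumptions inferred from the tables above.
next_skill() {
  case "$1" in
    "Container"|"Activated shell"|"Base ckpt"|"USDs"|"Isaac Sim init")
      echo "/compass (Setup COMPASS section)" ;;
    "Ckpt load")
      echo "/compass (Train section)" ;;
    "GPU")
      echo "none (host driver or sandbox issue)" ;;
    "Recent log")
      echo "none (informational)" ;;
    *)
      echo "unknown check: $1" >&2
      return 1 ;;
  esac
}
```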

---

## Common patterns

### "Training crashed silently"

Look for the most-recent kit log:
```bash
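# Pick the newest kit log by modification time (GNU find/sort/xargs assumed).
find ~/.local/share/ov/pkg/isaac-sim-* -name "kit_*.log" -printf '%T@ %p\n' 2>/dev/null | sort -rn | head -1 | cut -d' ' -f2- | xargs -r -d '\n' tail -n 100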
```

Common silent-crash causes:
- GPU OOM (look for "out of memory" / "CUDA error" near the tail).
- USD asset missing (look for "Failed to load" / file-path lines).
- Quaternion convention mismatch on a custom embodiment (look for "wxyz" / "xyzw").
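
The signatures above can be checked in one pass. A minimal sketch, assuming the relevant messages land in the last 100 lines of the log; `scan_crash_signatures` is a hypothetical helper, and the patterns are just the strings listed above matched case-insensitively:

```bash
# Hypothetical helper: print lines from the tail of a log that match any of
# the silent-crash signatures listed above. Returns non-zero if none match.
scan_crash_signatures() {
  log_file="$1"
  tail -n 100 "${log_file}" \
    | grep -iE 'out of memory|CUDA error|Failed to load|wxyz|xyzw'
}
```

A match is a hint rather than a diagnosis: "wxyz" or "xyzw" can appear in ordinary log output too, so read the surrounding lines before naming a root cause.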

### "Pre-flight passes but `python run.py …` still errors"

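Confirm the user is in an activated shell. The status script only warns when the shim dir is missing from PATH (a warning does not fail the run), and the user may be running commands in a different shell from the one that was checked. Verify with: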
```bash
which python
# Should resolve to a /tmp/compass-shims.* path
```

If `which python` resolves to `/usr/bin/python` or similar, the shell isn't activated even if other env vars suggest it is.
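
The same test can be written as a predicate. This is a sketch under the assumption that shims always live under a `/tmp/compass-shims.*` directory, as the comment above suggests; `is_shim_path` is a hypothetical name, and a non-default temp directory would need a different pattern.

```bash
# Hypothetical helper: succeed only when the given interpreter path sits
# under a shim directory (prefix assumed to be /tmp/compass-shims.).
is_shim_path() {
  case "$1" in
    /tmp/compass-shims.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_shim_path "$(command -v python)" || echo "shell not activated"` reports an unactivated shell without relying on any other environment variable.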

### "OSMO submission fails on `osmo workflow submit`"

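That's usually a submission-side issue (credentials or registry) rather than a problem with the local container. Check that the required variables are set: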
```bash
test -n "${WANDB_API_KEY:-}" && echo "WANDB_API_KEY set" || echo "WANDB_API_KEY MISSING"
test -n "${HF_TOKEN:-}" && echo "HF_TOKEN set" || echo "HF_TOKEN MISSING"
test -n "${COMPASS_OSMO_REGISTRY:-}" && echo "registry set" || echo "registry MISSING (or pass --image)"
```

If all three are set, route the user to `/compass` (OSMO submission section in that skill body).
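
The three checks can also be folded into one loop that returns a usable exit status. A minimal sketch: `missing_vars` is a hypothetical helper, and it uses `printenv`, so it only sees exported variables (which is what a child process such as the submission command would see).

```bash
# Hypothetical helper: report every named variable that is unset or empty in
# the exported environment. Returns 1 if anything is missing, else 0.
missing_vars() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "${name}")" ]; then
      echo "${name} MISSING"
      missing=1
    fi
  done
  return "${missing}"
}
```

Calling `missing_vars WANDB_API_KEY HF_TOKEN COMPASS_OSMO_REGISTRY` prints one line per missing variable. As noted above, a missing registry is acceptable when `--image` is passed instead.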

---

## Key File Locations

| File | Purpose |
|------|---------|
| `<SKILL_BASE_DIR>/scripts/compass_status.sh` | The diagnostic script; usable as a standalone CLI |
| `./docker/run.sh` | Container lifecycle (status / up / down / build / assets / shell) |
| `./docker/activate` | Shell activate script (sets up python/pip shims) |
| `./assets/` | Asset bind-mount (USDs + base ckpt) |
| `/tmp/isaaclab/logs/` | Training run logs (latest is most useful) |
| `~/.local/share/ov/pkg/isaac-sim-*/kit_*.log` | Kit logs for crash diagnostics |
138 changes: 138 additions & 0 deletions .claude/skills/compass-debug/scripts/compass_status.sh
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
# compass_status.sh — diagnostic snapshot of a COMPASS dev environment.
#
# Usage:
# ./compass_status.sh # quick checks (~1s)
# ./compass_status.sh --deep # also runs Isaac Sim init test (~30s)
# ./compass_status.sh --ckpt PATH # also load a specific .pt file with torch
#
# Exit code: 0 if all required checks pass, 1 if any fail.
# WARN-level entries don't fail the run.

set -uo pipefail

DEEP=0
CKPT_PATH=""
while [ "$#" -gt 0 ]; do
case "$1" in
--deep) DEEP=1; shift ;;
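--ckpt)
# Without a value, `shift 2` fails and leaves $1 in place, so the loop never ends.
[ "$#" -ge 2 ] || { echo "--ckpt requires a PATH argument" >&2; exit 2; }
CKPT_PATH="$2"; shift 2 ;;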
-h|--help)
# Print the header comment (Usage through the WARN note), stripping the leading "# ".
sed -n '4,10p' "$0" | sed -E 's/^# ?//'
exit 0
;;
*) echo "Unknown flag: $1" >&2; exit 2 ;;
esac
done

# Resolve repo root (script lives at .claude/skills/compass-debug/scripts/).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../../.." && pwd)"
cd "${REPO_ROOT}"

PASS="✓"
FAIL="✗"
WARN="⚠"

declare -a TABLE
fail_count=0

row() {
local status="$1" name="$2" detail="$3"
TABLE+=("| ${status} | ${name} | ${detail} |")
# Use `if` so row() returns 0 for non-failing rows as well.
if [ "${status}" = "${FAIL}" ]; then fail_count=$((fail_count + 1)); fi
}

# 1. Container running?
if ./docker/run.sh status 2>/dev/null | grep -qE "Up|running"; then
row "${PASS}" "Container" "compass-rl up"
else
row "${FAIL}" "Container" "down — run \`./docker/run.sh up\`"
fi

# 2. Activated shell? Detect by PATH containing the shim dir created by
# docker/activate (mktemp pattern: compass-shims.XXXXXX).
if echo "${PATH}" | grep -q "compass-shims\."; then
row "${PASS}" "Activated shell" "shim dir on PATH"
else
row "${WARN}" "Activated shell" "no — run \`source ./docker/activate\`"
fi

# 3. GPU
if command -v nvidia-smi >/dev/null 2>&1; then
GPU_INFO="$(nvidia-smi --query-gpu=name,memory.free --format=csv,noheader 2>/dev/null | head -1)"
if [ -n "${GPU_INFO}" ]; then
row "${PASS}" "GPU" "${GPU_INFO}"
else
row "${FAIL}" "GPU" "nvidia-smi returned empty (driver issue?)"
fi
else
row "${FAIL}" "GPU" "nvidia-smi not on PATH"
fi

# 4. Base policy ckpt
if [ -f "./assets/x_mobility.ckpt" ]; then
SIZE="$(du -h ./assets/x_mobility.ckpt | cut -f1)"
row "${PASS}" "Base ckpt" "./assets/x_mobility.ckpt (${SIZE})"
else
row "${FAIL}" "Base ckpt" "missing — run \`./docker/run.sh assets\`"
fi

# 5. Built-in scene USDs
if [ -d "./assets/usd" ] && [ -n "$(ls -A ./assets/usd 2>/dev/null)" ]; then
USD_COUNT="$(ls ./assets/usd | wc -l)"
row "${PASS}" "USDs" "${USD_COUNT} entries in ./assets/usd/"
else
row "${FAIL}" "USDs" "missing — run \`./docker/run.sh assets\`"
fi

# 6. Recent training log (informational; non-blocking)
LATEST="$(ls -t /tmp/isaaclab/logs/ 2>/dev/null | head -1 || true)"
if [ -n "${LATEST}" ]; then
row "${PASS}" "Recent log" "/tmp/isaaclab/logs/${LATEST}"
else
row "${WARN}" "Recent log" "none — no training has run yet on this machine"
fi

# 7. Deep check: Isaac Sim init (only with --deep)
if [ "${DEEP}" = "1" ]; then
if python -c "
from isaacsim import SimulationApp
app = SimulationApp({'headless': True})
app.close()
" >/dev/null 2>&1; then
row "${PASS}" "Isaac Sim init" "headless app start/close OK"
else
row "${FAIL}" "Isaac Sim init" "FAILED — check container build / GPU access"
fi
fi

# 8. Optional: load a specific ckpt
if [ -n "${CKPT_PATH}" ]; then
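# Pass the path as an argument instead of splicing it into the Python source,
# so quotes or backslashes in the path cannot break or alter the snippet.
if [ -f "${CKPT_PATH}" ] && python -c "
import sys
import torch
torch.load(sys.argv[1], map_location='cpu')
" "${CKPT_PATH}" >/dev/null 2>&1; then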
SIZE="$(du -h "${CKPT_PATH}" | cut -f1)"
row "${PASS}" "Ckpt load" "${CKPT_PATH} (${SIZE})"
else
row "${FAIL}" "Ckpt load" "${CKPT_PATH} (missing or not a valid torch ckpt)"
fi
fi

# Print table
echo "| Status | Check | Detail |"
echo "|---|---|---|"
for line in "${TABLE[@]}"; do
echo "${line}"
done

if [ "${fail_count}" -eq 0 ]; then
echo ""
echo "All checks passed."
exit 0
else
echo ""
echo "${fail_count} check(s) failed. See https://nvlabs.github.io/COMPASS/docs/quickstart.html for setup."
exit 1
fi