Fixes OmniHub startup in Docker tests#5633
Conversation
There was a problem hiding this comment.
Code Review Summary
This PR addresses a CI timeout issue caused by the HUB__ARGS__DETECT_ONLY environment variable preventing OmniHub from starting in Docker test containers.
✅ Strengths
-
Root cause correctly identified: The author traced the timeout in
test_rsl_rl_export_flow.pyto OmniHub's detect-only mode blocking asset downloads. -
Minimal, surgical fix: The change adds just 4 lines (including helpful comments) in the right location - the test container setup phase.
-
Well-documented PR: Clear explanation of the problem, reproduction steps, and empirical timing data (600s timeout → 73s success).
-
Safe operation:
unset HUB__ARGS__DETECT_ONLYis a no-op if the variable was never set, so this won't break environments where the flag isn't present. -
No production impact: This change only affects CI test containers.
📋 Minor Observations
-
Placement is good: The
unsetis positioned early in the test setup, before any Isaac Lab commands run, ensuring OmniHub can start properly for asset downloads. -
Comment is helpful: The inline explanation will help future maintainers understand why this workaround exists.
🤔 Question (Non-blocking)
Is this environment variable set intentionally in some base images for resource/network reasons? If so, it might be worth documenting which images set it and why, so future image updates don't reintroduce the issue.
Overall: Clean fix with solid testing evidence. The change is correctly scoped and well-explained.
Update (ccb6d49): Reviewed the additional changes from #5620 that are now included:
✅ AppLauncher CUDA device deferral — Good fix for fork-crash issue by deferring torch.cuda.set_device() until after SimulationApp starts. Well-documented in changelog.
✅ LEAPP export timing instrumentation — Clean _timed_phase context manager for debugging CI timeouts.
✅ Import optimizations — Lazy-loading configclass avoids eager imports.
✅ Test cleanups — Removing unnecessary AppLauncher from pure-Python tests and improving RTX renderer test fixtures.
✅ Better test diagnostics — Improved timeout messages with timing info.
No new issues found. All changes are well-structured and improve code quality.
Update (7497d24): New commit adds export script hang detection improvements:
✅ Faulthandler integration — Adds traceback dumps if simulation_app.close() hangs for >120s. Good debugging aid for CI timeouts.
✅ _emit_timing() helper — Routes timing diagnostics to sys.__stderr__, ensuring output survives stream redirects.
✅ wait_for_replicator=False — Skips waiting for Replicator workflows during close, reducing potential hangs.
No issues found. Clean improvements to export script reliability.
Update (6ef2fbf): New commit batches RSL-RL export flow tests:
✅ Export script refactoring — Made export.py importable by deferring runtime dependencies to _load_runtime_dependencies(), extracting parse_export_args(), and refactoring main() into export_rsl_rl_agent() with explicit parameters. Clean separation of CLI parsing from export logic.
✅ Robust cleanup handling — Added try/finally with leapp_started and env_closed flags to ensure proper cleanup on exceptions. Good defensive coding.
✅ Test batching — Groups ~8 tasks per Kit process via _task_batches() and subprocess isolation. Reduces Kit startup churn while keeping processes short enough to avoid GPU memory accumulation.
✅ Module loading — Clean _load_export_module() using importlib.util to import export.py dynamically.
✅ Changelog skip file — Appropriate for CI-only changes.
No new issues found. Well-structured performance optimization for CI.
Update (63037a9): Timing instrumentation removed from export script and tests.
✅ Cleanup of diagnostic code — Removed _timed_phase(), _emit_timing(), _close_simulation_app(), and faulthandler integration. The export flow is now cleaner without the debugging scaffolding.
✅ Simplified test output — Removed timing extraction and timing-related log statements from tests.
✅ Reverted to standard close — simulation_app.close() without wait_for_replicator=False. Presumed safe now that root cause (OmniHub detect-only) is addressed.
No new issues. This is straightforward cleanup of temporary diagnostic code.
Greptile SummaryThis PR fixes a Docker test regression where the
Confidence Score: 5/5Safe to merge — single-line unset in the container entrypoint with no side effects outside the test container. The change removes one environment variable inside the container's bash entrypoint. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant GHA as GitHub Actions
participant Docker as Docker Container
participant OmniHub as OmniHub
participant Nucleus as Nucleus (asset server)
GHA->>Docker: "docker run (base image may set HUB__ARGS__DETECT_ONLY=true)"
Docker->>Docker: ln -s /isaac-sim _isaac_sim
Note over Docker: NEW: unset HUB__ARGS__DETECT_ONLY
Docker->>Docker: pip install extra packages (optional)
Docker->>OmniHub: isaaclab.sh pytest (OmniHub starts normally)
OmniHub->>Nucleus: Fetch cold assets
Nucleus-->>OmniHub: Assets downloaded (~73 s)
OmniHub-->>Docker: Ready
Docker-->>GHA: Tests pass
Reviews (1): Last reviewed commit: "Fix OmniHub startup in Docker tests" | Re-trigger Greptile |
115d3a9 to
d7746d9
Compare
LEAPP export does not use Replicator workflows, so avoid waiting for Replicator during SimulationApp shutdown. This keeps teardown from blocking on unrelated queued Replicator work and keeps traceback diagnostics around close for future CI timeouts.
Run the export-flow matrix in subprocess batches so each batch launches Kit once instead of restarting it per task. Keep export.py importable for the batched helper while preserving the CLI AppLauncher boundary before task/runtime imports.
Drop the temporary export-flow timing logs and restore normal SimulationApp close behavior now that the batched test structure has been verified.
Conflict resolution touched two files: * scripts/reinforcement_learning/leapp/rsl_rl/export.py Develop (isaac-sim#5633) restructured export.py into create_arg_parser / parse_export_args / run_export_with_hydra / main_cli functions so the test suite can drive the export flow in-process. Plug the preset CLI into the new shape: - parse_export_args uses setup_preset_cli(parser, argv) to attach the preset-selection help group and parse argv (now forwarded to parse_known_args for the in-process test path) - run_export_with_hydra folds typed selectors via fold_preset_tokens before assigning sys.argv for the Hydra decorator * source/isaaclab_tasks/isaaclab_tasks/utils/preset_cli.py setup_preset_cli grows an optional argv: list[str] | None = None parameter forwarded to parse_known_args. Defaults to None so existing module-level callers keep working unchanged. Help-time variant introspection still reads sys.argv since --help only fires from the interactive command line.
…_ik (#5644) ## Summary One-character fix in `source/isaaclab/test/controllers/test_pink_ik.py:309`: ```diff - quat_from_matrix(matrix_from_quat(target_rot_tensor) * matrix_from_quat(quat_inv(current_rot))) + quat_from_matrix(matrix_from_quat(target_rot_tensor) @ matrix_from_quat(quat_inv(current_rot))) ``` `calculate_rotation_error` was composing two rotation matrices with PyTorch's element-wise multiplication (`*`) where matrix multiplication (`@`) was intended. The Hadamard product of two rotation matrices is not generally a rotation matrix. ## Why this surfaced as test failures now The bug has been latent since [#3149](#3149) (2025-08-26) because the Hadamard product of two near-identity matrices is also near-identity — `quat_from_matrix` could still recover a near-unit quaternion and the assertion `rot_error ≈ 0` would pass for completely wrong mathematical reasons. It became visible when [#5609 (jmart)](#5609) (2026-05-14) added the unit-norm guard to `isaaclab/utils/math.py:quat_from_matrix`: ```python invalid = (quat.norm(p=2, dim=-1, keepdim=True) - 1.0).abs() > 2e-5 return torch.where(invalid, torch.full_like(quat, float("nan")), quat) ``` After that PR, any non-rotation input (the Hadamard mess) returns NaN, which `axis_angle_from_quat` propagates → `torch.max(NaN) = NaN` → `AssertionError: Left hand IK rotation error (nan) exceeds tolerance`. Both hands always went to NaN; left hand is just asserted first. ## Verification Local repro on the Horde VM against current `develop` (`isaaclab_physx` backend, `newton[sim]@v1.2.0rc2`): | Configuration | Result | |---|---| | Unfixed, `Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` | FAILED — `Left hand IK rotation error (nan)` | | Fixed, same parameterization | PASSED — rotation errors `1e-4` to `1e-7` (well within 0.02 rad tolerance) | | Fixed, all 12 GR1T2 cases, run 1 | 11 passed, 1 skipped | | Fixed, all 12 GR1T2 cases, run 2 | 11 passed, 1 skipped (deterministic) | ## Scope This addresses the consistent `Left hand IK rotation error (nan)` failures seen across recent develop PRs (e.g. [#5633 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25926139790/job/76211194676), [#5609 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25831490295/job/75897258188), [#5616 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25930392313/job/76222556444)). Remaining failures on G1 envs (finite ~0.03-0.05 rad rotation errors against the 0.030 rad tolerance) are a **separate** issue — IK convergence quality rather than the NaN math bug. Out of scope for this PR; needs its own ticket. ## Test plan - [x] Pre-commit clean. - [x] Unfixed branch reproduces NaN on `Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` locally. - [x] Fixed branch passes the same parameterization locally with finite rotation errors. - [x] Fixed branch passes all 12 GR1T2 parameterizations across two consecutive runs (deterministic).
Description
This PR is based on and includes the changes from #5620, then adds one CI fix on top: it unsets
HUB__ARGS__DETECT_ONLYinside the Docker test container before running Isaac Lab commands. Some base images set this flag, which prevents OmniHub from starting and makes cold Nucleus asset retrieval fall back to slow repeated retries.This was reproduced from the failing Actions job:
https://github.com/isaac-sim/IsaacLab/actions/runs/25904143763/job/76158743634
The affected
test_rsl_rl_export_flow.pyDexsuite Kuka-Allegro export timed out at 600 s with the flag set, then completed in about 73 s with the flag unset after clearing the local KukaAllegro mirror.Fixes # N/A
Type of change
Screenshots
N/A - CI-only change.
Checklist
pre-commitchecks with./isaaclab.sh --formatsource/<pkg>/changelog.d/for every touched package (N/A for the CI-only commit; Prevents early numpy imports to avoid Kit crash #5620 carries its own changelog fragments)CONTRIBUTORS.mdor my name already exists thereTest Plan
./isaaclab.sh -fHUB__ARGS__DETECT_ONLY=true:test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0]timed out after 600 s.HUB__ARGS__DETECT_ONLYunset after clearing the KukaAllegro mirror:test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0]passed in 72.75 s.