Fixes OmniHub startup in Docker tests by AntoineRichard · Pull Request #5633 · isaac-sim/IsaacLab

AntoineRichard · 2026-05-15T11:09:51Z

Description

This PR is based on and includes the changes from #5620, then adds one CI fix on top: it unsets HUB__ARGS__DETECT_ONLY inside the Docker test container before running Isaac Lab commands. Some base images set this flag, which prevents OmniHub from starting and makes cold Nucleus asset retrieval fall back to slow repeated retries.

This was reproduced from the failing Actions job:
https://github.com/isaac-sim/IsaacLab/actions/runs/25904143763/job/76158743634

The affected test_rsl_rl_export_flow.py Dexsuite Kuka-Allegro export timed out at 600 s with the flag set, then completed in about 73 s with the flag unset after clearing the local KukaAllegro mirror.
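The fix in this PR is a shell `unset` inside the container entrypoint. For readers who launch the same tests from a Python harness, the following is a minimal sketch of the equivalent idea: hand the child process an environment with the flag removed. The helper name `env_without_flag` is hypothetical and not part of Isaac Lab.

```python
import os
import subprocess
import sys

FLAG = "HUB__ARGS__DETECT_ONLY"


def env_without_flag(base_env, flag=FLAG):
    """Return a copy of base_env with the flag removed; a no-op if it is absent."""
    env = dict(base_env)
    env.pop(flag, None)
    return env


# Simulate a base image that sets the flag, then launch a child without it.
parent_env = dict(os.environ, **{FLAG: "true"})
child_env = env_without_flag(parent_env)

result = subprocess.run(
    [sys.executable, "-c", f"import os; print(os.environ.get({FLAG!r}))"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_view = result.stdout.strip()
print(child_view)  # the child does not see the flag
```

Like `unset`, `dict.pop(flag, None)` does nothing when the variable was never set, so the same code path is safe on images that do not define the flag.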

Fixes # N/A

Type of change

Bug fix (non-breaking change which fixes an issue)

Screenshots

N/A - CI-only change.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (N/A - CI-only change)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works (validated with the affected Docker export test)
I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package (N/A for the CI-only commit; #5620, "Prevents early numpy imports to avoid Kit crash", carries its own changelog fragments)
I have added my name to the CONTRIBUTORS.md or my name already exists there

Test Plan

./isaaclab.sh -f
Docker reproduction with HUB__ARGS__DETECT_ONLY=true: test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0] timed out after 600 s.
Docker reproduction with HUB__ARGS__DETECT_ONLY unset after clearing the KukaAllegro mirror: test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0] passed in 72.75 s.

isaaclab-review-bot

Code Review Summary

This PR addresses a CI timeout issue caused by the HUB__ARGS__DETECT_ONLY environment variable preventing OmniHub from starting in Docker test containers.

✅ Strengths

Root cause correctly identified: The author traced the timeout in test_rsl_rl_export_flow.py to OmniHub's detect-only mode blocking asset downloads.
Minimal, surgical fix: The change adds just 4 lines (including helpful comments) in the right location - the test container setup phase.
Well-documented PR: Clear explanation of the problem, reproduction steps, and empirical timing data (600s timeout → 73s success).
Safe operation: unset HUB__ARGS__DETECT_ONLY is a no-op if the variable was never set, so this won't break environments where the flag isn't present.
No production impact: This change only affects CI test containers.

📋 Minor Observations

Placement is good: The unset is positioned early in the test setup, before any Isaac Lab commands run, ensuring OmniHub can start properly for asset downloads.
Comment is helpful: The inline explanation will help future maintainers understand why this workaround exists.

🤔 Question (Non-blocking)

Is this environment variable set intentionally in some base images for resource/network reasons? If so, it might be worth documenting which images set it and why, so future image updates don't reintroduce the issue.

Overall: Clean fix with solid testing evidence. The change is correctly scoped and well-explained.

Update (ccb6d49): Reviewed the additional changes from #5620 that are now included:

✅ AppLauncher CUDA device deferral — Good fix for fork-crash issue by deferring torch.cuda.set_device() until after SimulationApp starts. Well-documented in changelog.

✅ LEAPP export timing instrumentation — Clean _timed_phase context manager for debugging CI timeouts.

✅ Import optimizations — Lazy-loading configclass avoids eager imports.

✅ Test cleanups — Removing unnecessary AppLauncher from pure-Python tests and improving RTX renderer test fixtures.

✅ Better test diagnostics — Improved timeout messages with timing info.

No new issues found. All changes are well-structured and improve code quality.
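The review names a `_timed_phase` context manager but does not show its body. As a rough illustration of what such a helper could look like, here is a sketch under the assumption that it only measures wall-clock time and prints one line per phase. The name `timed_phase`, the `sink` parameter, and the message format are all hypothetical.

```python
import contextlib
import io
import sys
import time


@contextlib.contextmanager
def timed_phase(name, sink=None):
    """Report how long the wrapped block took, even if it raises."""
    sink = sink if sink is not None else sys.stderr
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[timing] {name}: {elapsed:.3f} s", file=sink)


# Usage: capture the report in a buffer instead of stderr.
buffer = io.StringIO()
with timed_phase("load", sink=buffer):
    time.sleep(0.01)  # placeholder for real work
report = buffer.getvalue()
print(report, end="")
```

Putting the report in a `finally` clause means a phase that fails or is interrupted still leaves a timing line behind, which is usually the case of interest when chasing CI timeouts.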

Update (7497d24): New commit adds export script hang detection improvements:

✅ Faulthandler integration — Adds traceback dumps if simulation_app.close() hangs for >120s. Good debugging aid for CI timeouts.

✅ _emit_timing() helper — Routes timing diagnostics to sys.__stderr__, ensuring output survives stream redirects.

✅ wait_for_replicator=False — Skips waiting for Replicator workflows during close, reducing potential hangs.

No issues found. Clean improvements to export script reliability.
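The hang detection described above can be approximated with the standard-library `faulthandler` module. The sketch below is not the code from commit 7497d24; it assumes a generic `close_fn` callable standing in for `simulation_app.close()`, and the wrapper name `close_with_watchdog` is hypothetical.

```python
import faulthandler
import sys
import time


def close_with_watchdog(close_fn, timeout_s=120.0):
    """Call close_fn; dump all thread tracebacks if it runs longer than timeout_s."""
    # sys.__stderr__ is the interpreter's original stderr, so the dump is not
    # lost when sys.stderr has been redirected by the application.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.__stderr__)
    start = time.perf_counter()
    try:
        close_fn()
    finally:
        # Always disarm the timer so it cannot fire after a normal close.
        faulthandler.cancel_dump_traceback_later()
    return time.perf_counter() - start


# Usage with a stand-in close function that returns immediately.
elapsed = close_with_watchdog(lambda: None, timeout_s=5.0)
print(f"close took {elapsed:.4f} s")
```

With `exit=False` the watchdog only prints tracebacks and lets the process continue, so the surrounding test timeout remains the thing that actually ends a hung run.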

Update (6ef2fbf): New commit batches RSL-RL export flow tests:

✅ Export script refactoring — Made export.py importable by deferring runtime dependencies to _load_runtime_dependencies(), extracting parse_export_args(), and refactoring main() into export_rsl_rl_agent() with explicit parameters. Clean separation of CLI parsing from export logic.

✅ Robust cleanup handling — Added try/finally with leapp_started and env_closed flags to ensure proper cleanup on exceptions. Good defensive coding.

✅ Test batching — Groups ~8 tasks per Kit process via _task_batches() and subprocess isolation. Reduces Kit startup churn while keeping processes short enough to avoid GPU memory accumulation.

✅ Module loading — Clean _load_export_module() using importlib.util to import export.py dynamically.

✅ Changelog skip file — Appropriate for CI-only changes.

No new issues found. Well-structured performance optimization for CI.
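The batching idea mentioned in this update can be sketched in a few lines. The review names a `_task_batches()` helper without showing it, so the function below is an assumed shape (consecutive slices of about 8 tasks), and the task names are placeholders.

```python
def task_batches(tasks, batch_size=8):
    """Split tasks into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [tasks[i : i + batch_size] for i in range(0, len(tasks), batch_size)]


tasks = [f"Task-{i}" for i in range(19)]  # placeholder task names
batches = task_batches(tasks)
print([len(batch) for batch in batches])  # [8, 8, 3]
```

Each batch would then be handed to one subprocess, so the application starts once per batch rather than once per task, while the process lifetime stays bounded by the batch size.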

Update (63037a9): Timing instrumentation removed from export script and tests.

✅ Cleanup of diagnostic code — Removed _timed_phase(), _emit_timing(), _close_simulation_app(), and faulthandler integration. The export flow is now cleaner without the debugging scaffolding.

✅ Simplified test output — Removed timing extraction and timing-related log statements from tests.

✅ Reverted to standard close — simulation_app.close() without wait_for_replicator=False. Presumed safe now that root cause (OmniHub detect-only) is addressed.

No new issues. This is straightforward cleanup of temporary diagnostic code.

greptile-apps · 2026-05-15T11:16:11Z

Greptile Summary

This PR fixes a Docker test regression where the HUB__ARGS__DETECT_ONLY environment variable (set by some Isaac Sim base images) prevented OmniHub from starting inside the test container, causing cold Nucleus asset downloads to time out after 600 s instead of completing in ~73 s.

Adds a single unset HUB__ARGS__DETECT_ONLY statement inside the container's entrypoint bash string, placed after the symlink setup and before pip installs, so the flag is cleared for all subsequent test processes.
The fix is narrowly scoped to .github/actions/run-tests/action.yml and has no effect outside the Docker container entrypoint.

Confidence Score: 5/5

Safe to merge — single-line unset in the container entrypoint with no side effects outside the test container.

The change removes one environment variable inside the container's bash entrypoint. unset is a no-op when the variable is absent and has no impact on the host or other jobs. The fix is validated with timing data (600 s timeout → 73 s pass) and the code path is straightforward.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| .github/actions/run-tests/action.yml | Adds `unset HUB__ARGS__DETECT_ONLY` to the container entrypoint to clear a base-image flag that blocks OmniHub startup; change is correct, minimal, and well-scoped. |

Sequence Diagram

sequenceDiagram
    participant GHA as GitHub Actions
    participant Docker as Docker Container
    participant OmniHub as OmniHub
    participant Nucleus as Nucleus (asset server)

    GHA->>Docker: "docker run (base image may set HUB__ARGS__DETECT_ONLY=true)"
    Docker->>Docker: ln -s /isaac-sim _isaac_sim
    Note over Docker: NEW: unset HUB__ARGS__DETECT_ONLY
    Docker->>Docker: pip install extra packages (optional)
    Docker->>OmniHub: isaaclab.sh pytest (OmniHub starts normally)
    OmniHub->>Nucleus: Fetch cold assets
    Nucleus-->>OmniHub: Assets downloaded (~73 s)
    OmniHub-->>Docker: Ready
    Docker-->>GHA: Tests pass

Reviews (1): Last reviewed commit: "Fix OmniHub startup in Docker tests"

LEAPP export does not use Replicator workflows, so avoid waiting for Replicator during SimulationApp shutdown. This keeps teardown from blocking on unrelated queued Replicator work and keeps traceback diagnostics around close for future CI timeouts.

Run the export-flow matrix in subprocess batches so each batch launches Kit once instead of restarting it per task. Keep export.py importable for the batched helper while preserving the CLI AppLauncher boundary before task/runtime imports.
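Keeping a script importable usually means two things: expensive imports move into functions, and the CLI entry point sits behind a `__main__` guard. The sketch below shows one way a test helper could then load such a script by path with `importlib.util`. The file contents, module name, and function names are stand-ins, not the actual `export.py`.

```python
import importlib.util
import pathlib
import tempfile


def load_module_from_path(name, path):
    """Import a Python file by path without requiring it to be on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for an importable script: the heavy import lives inside a function,
# and the CLI entry point only runs under the __main__ guard.
SCRIPT = '''
def load_runtime_dependencies():
    import json  # placeholder for an expensive runtime import
    return json

def export(task):
    return f"exported {task}"

if __name__ == "__main__":
    raise SystemExit("CLI path, not taken on import")
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = pathlib.Path(tmp) / "export_demo.py"
    script_path.write_text(SCRIPT)
    module = load_module_from_path("export_demo", script_path)
    outcome = module.export("demo-task")

print(outcome)  # exported demo-task
```

Because the module is executed under the name `export_demo` rather than `__main__`, the guarded CLI branch is skipped and the caller decides when the runtime imports happen.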

Drop the temporary export-flow timing logs and restore normal SimulationApp close behavior now that the batched test structure has been verified.

Conflict resolution touched two files:

* scripts/reinforcement_learning/leapp/rsl_rl/export.py

  Develop (isaac-sim#5633) restructured export.py into create_arg_parser / parse_export_args / run_export_with_hydra / main_cli functions so the test suite can drive the export flow in-process. Plug the preset CLI into the new shape:

  - parse_export_args uses setup_preset_cli(parser, argv) to attach the preset-selection help group and parse argv (now forwarded to parse_known_args for the in-process test path)
  - run_export_with_hydra folds typed selectors via fold_preset_tokens before assigning sys.argv for the Hydra decorator

* source/isaaclab_tasks/isaaclab_tasks/utils/preset_cli.py

  setup_preset_cli grows an optional argv: list[str] | None = None parameter forwarded to parse_known_args. Defaults to None so existing module-level callers keep working unchanged. Help-time variant introspection still reads sys.argv since --help only fires from the interactive command line.
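The optional `argv` parameter described in this commit message relies on standard `argparse` behavior: `parse_known_args(None)` reads `sys.argv[1:]`, while an explicit list is parsed as given. The sketch below illustrates only that mechanism; `parse_cli` and the `--task` option are hypothetical and do not reproduce `setup_preset_cli`.

```python
import argparse


def parse_cli(parser, argv=None):
    """Parse argv if given; with argv=None argparse falls back to sys.argv[1:]."""
    return parser.parse_known_args(argv)


parser = argparse.ArgumentParser()
parser.add_argument("--task", default=None)

# An in-process caller passes an explicit list. Tokens the parser does not
# recognize come back separately so they can be forwarded to another layer.
args, extras = parse_cli(parser, ["--task", "demo", "env.seed=1"])
print(args.task, extras)  # demo ['env.seed=1']
```

Defaulting the parameter to `None` is what keeps existing command-line callers unchanged while letting tests drive the same parser without mutating `sys.argv`.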

…_ik (#5644)

## Summary

One-character fix in `source/isaaclab/test/controllers/test_pink_ik.py:309`:

```diff
- quat_from_matrix(matrix_from_quat(target_rot_tensor) * matrix_from_quat(quat_inv(current_rot)))
+ quat_from_matrix(matrix_from_quat(target_rot_tensor) @ matrix_from_quat(quat_inv(current_rot)))
```

`calculate_rotation_error` was composing two rotation matrices with PyTorch's element-wise multiplication (`*`) where matrix multiplication (`@`) was intended. The Hadamard product of two rotation matrices is not generally a rotation matrix.

## Why this surfaced as test failures now

The bug has been latent since [#3149](#3149) (2025-08-26) because the Hadamard product of two near-identity matrices is also near-identity; `quat_from_matrix` could still recover a near-unit quaternion and the assertion `rot_error ≈ 0` would pass for completely wrong mathematical reasons.

It became visible when [#5609 (jmart)](#5609) (2026-05-14) added the unit-norm guard to `isaaclab/utils/math.py:quat_from_matrix`:

```python
invalid = (quat.norm(p=2, dim=-1, keepdim=True) - 1.0).abs() > 2e-5
return torch.where(invalid, torch.full_like(quat, float("nan")), quat)
```

After that PR, any non-rotation input (the Hadamard mess) returns NaN, which `axis_angle_from_quat` propagates → `torch.max(NaN) = NaN` → `AssertionError: Left hand IK rotation error (nan) exceeds tolerance`. Both hands always went to NaN; left hand is just asserted first.

## Verification

Local repro on the Horde VM against current `develop` (`isaaclab_physx` backend, `newton[sim]@v1.2.0rc2`):

| Configuration | Result |
|---|---|
| Unfixed, `Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` | FAILED: `Left hand IK rotation error (nan)` |
| Fixed, same parameterization | PASSED: rotation errors `1e-4` to `1e-7` (well within 0.02 rad tolerance) |
| Fixed, all 12 GR1T2 cases, run 1 | 11 passed, 1 skipped |
| Fixed, all 12 GR1T2 cases, run 2 | 11 passed, 1 skipped (deterministic) |

## Scope

This addresses the consistent `Left hand IK rotation error (nan)` failures seen across recent develop PRs (e.g. [#5633 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25926139790/job/76211194676), [#5609 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25831490295/job/75897258188), [#5616 `test-curobo` log](https://github.com/isaac-sim/IsaacLab/actions/runs/25930392313/job/76222556444)).

Remaining failures on G1 envs (finite ~0.03-0.05 rad rotation errors against the 0.030 rad tolerance) are a **separate** issue: IK convergence quality rather than the NaN math bug. Out of scope for this PR; needs its own ticket.

## Test plan

- [x] Pre-commit clean.
- [x] Unfixed branch reproduces NaN on `Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` locally.
- [x] Fixed branch passes the same parameterization locally with finite rotation errors.
- [x] Fixed branch passes all 12 GR1T2 parameterizations across two consecutive runs (deterministic).
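The claim that an element-wise product of rotation matrices is generally not a rotation, yet stays close to one near the identity, can be checked numerically. The sketch below uses plain Python lists instead of PyTorch so it runs anywhere; the helper names and the chosen angles are illustrative assumptions, not taken from the test file.

```python
import math


def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Matrix product, the analogue of `@`."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def hadamard(a, b):
    """Element-wise product, the analogue of `*`."""
    return [[a[i][j] * b[i][j] for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def orthogonality_error(m):
    """Largest entry of |m m^T - I|; zero for a true rotation matrix."""
    mmt = matmul(m, transpose(m))
    return max(abs(mmt[i][j] - (1.0 if i == j else 0.0)) for i in range(3) for j in range(3))


target = rot_z(math.radians(60.0))
current_inv = rot_z(math.radians(-20.0))

err_matmul = orthogonality_error(matmul(target, current_inv))
err_hadamard = orthogonality_error(hadamard(target, current_inv))
# Near the identity the element-wise product is almost orthogonal,
# which is how a bug of this kind can stay hidden.
err_near_identity = orthogonality_error(hadamard(rot_z(1e-3), rot_z(-1e-3)))

print(f"matmul: {err_matmul:.2e}")
print(f"hadamard: {err_hadamard:.2e}")
print(f"hadamard near identity: {err_near_identity:.2e}")
```

For the 60 and -20 degree pair the element-wise result is far from orthogonal, while for milliradian angles its deviation is on the order of 1e-6, small enough to slip past a loose tolerance but not past a strict unit-norm guard.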

pbarejko added 9 commits May 14, 2026 20:07

Update

67d10a4

Defer use of torch

3b15efd

Explicit import of configclass

c51ba2b

Remove verbose flag

f52a750

Fix RTX Renderer utils tests

ba57283

rst files

fa628db

Remove app for noise tests

d04e1a1

Shutdown app for wrench composer

c70e9e0

Remove app launcher from wrench composer test

f8b2c7a

AntoineRichard requested review from hhansen-bdai and kellyguo11 as code owners May 15, 2026 11:09

github-actions Bot added the bug (Something isn't working) and infrastructure labels May 15, 2026

isaaclab-review-bot Bot reviewed May 15, 2026

ooctipus approved these changes May 15, 2026

Fix OmniHub startup in Docker tests

d7746d9

AntoineRichard force-pushed the antoiner/fix-hub-detect-only branch from 115d3a9 to d7746d9 Compare May 15, 2026 11:18

github-actions Bot added the isaac-lab (Related to Isaac Lab team) label May 15, 2026

AntoineRichard changed the title ~~Fix OmniHub startup in Docker tests~~ Fixes OmniHub startup in Docker tests May 15, 2026

AntoineRichard added 3 commits May 15, 2026 14:20

Add export timing diagnostics

ccb6d49

Batch RSL-RL export flow tests

6ef2fbf

Run the export-flow matrix in subprocess batches so each batch launches Kit once instead of restarting it per task. Keep export.py importable for the batched helper while preserving the CLI AppLauncher boundary before task/runtime imports.

pbarejko self-requested a review May 15, 2026 15:15

pbarejko approved these changes May 15, 2026

pbarejko mentioned this pull request May 15, 2026

Prevents early numpy imports to avoid Kit crash #5620

Closed

7 tasks

Remove export timing diagnostics

63037a9

Drop the temporary export-flow timing logs and restore normal SimulationApp close behavior now that the batched test structure has been verified.

pbarejko merged commit a5eb9ad into isaac-sim:develop May 15, 2026
33 of 34 checks passed

pbarejko merged 14 commits into isaac-sim:develop from AntoineRichard:antoiner/fix-hub-detect-only

3 participants
