Skip to content

Fixes OmniHub startup in Docker tests#5633

Merged
pbarejko merged 14 commits into
isaac-sim:developfrom
AntoineRichard:antoiner/fix-hub-detect-only
May 15, 2026
Merged

Fixes OmniHub startup in Docker tests#5633
pbarejko merged 14 commits into
isaac-sim:developfrom
AntoineRichard:antoiner/fix-hub-detect-only

Conversation

@AntoineRichard
Copy link
Copy Markdown
Collaborator

@AntoineRichard AntoineRichard commented May 15, 2026

Description

This PR is based on and includes the changes from #5620, then adds one CI fix on top: it unsets HUB__ARGS__DETECT_ONLY inside the Docker test container before running Isaac Lab commands. Some base images set this flag, which prevents OmniHub from starting and makes cold Nucleus asset retrieval fall back to slow repeated retries.

This was reproduced from the failing Actions job:
https://github.com/isaac-sim/IsaacLab/actions/runs/25904143763/job/76158743634

The affected test_rsl_rl_export_flow.py Dexsuite Kuka-Allegro export timed out at 600 s with the flag set, then completed in about 73 s with the flag unset after clearing the local KukaAllegro mirror.

Fixes # N/A

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

N/A - CI-only change.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (N/A - CI-only change)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (validated with the affected Docker export test)
  • I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package (N/A for the CI-only commit; Prevents early numpy imports to avoid Kit crash #5620 carries its own changelog fragments)
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Test Plan

  • ./isaaclab.sh -f
  • Docker reproduction with HUB__ARGS__DETECT_ONLY=true: test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0] timed out after 600 s.
  • Docker reproduction with HUB__ARGS__DETECT_ONLY unset after clearing the KukaAllegro mirror: test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0] passed in 72.75 s.

@github-actions github-actions Bot added bug Something isn't working infrastructure labels May 15, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR addresses a CI timeout issue caused by the HUB__ARGS__DETECT_ONLY environment variable preventing OmniHub from starting in Docker test containers.

✅ Strengths

  1. Root cause correctly identified: The author traced the timeout in test_rsl_rl_export_flow.py to OmniHub's detect-only mode blocking asset downloads.

  2. Minimal, surgical fix: The change adds just 4 lines (including helpful comments) in the right location - the test container setup phase.

  3. Well-documented PR: Clear explanation of the problem, reproduction steps, and empirical timing data (600s timeout → 73s success).

  4. Safe operation: unset HUB__ARGS__DETECT_ONLY is a no-op if the variable was never set, so this won't break environments where the flag isn't present.

  5. No production impact: This change only affects CI test containers.

📋 Minor Observations

  • Placement is good: The unset is positioned early in the test setup, before any Isaac Lab commands run, ensuring OmniHub can start properly for asset downloads.

  • Comment is helpful: The inline explanation will help future maintainers understand why this workaround exists.

🤔 Question (Non-blocking)

Is this environment variable set intentionally in some base images for resource/network reasons? If so, it might be worth documenting which images set it and why, so future image updates don't reintroduce the issue.


Overall: Clean fix with solid testing evidence. The change is correctly scoped and well-explained.


Update (ccb6d49): Reviewed the additional changes from #5620 that are now included:

AppLauncher CUDA device deferral — Good fix for fork-crash issue by deferring torch.cuda.set_device() until after SimulationApp starts. Well-documented in changelog.

LEAPP export timing instrumentation — Clean _timed_phase context manager for debugging CI timeouts.

Import optimizations — Lazy-loading configclass avoids eager imports.

Test cleanups — Removing unnecessary AppLauncher from pure-Python tests and improving RTX renderer test fixtures.

Better test diagnostics — Improved timeout messages with timing info.

No new issues found. All changes are well-structured and improve code quality.


Update (7497d24): New commit adds export script hang detection improvements:

Faulthandler integration — Adds traceback dumps if simulation_app.close() hangs for >120s. Good debugging aid for CI timeouts.

_emit_timing() helper — Routes timing diagnostics to sys.__stderr__, ensuring output survives stream redirects.

wait_for_replicator=False — Skips waiting for Replicator workflows during close, reducing potential hangs.

No issues found. Clean improvements to export script reliability.


Update (6ef2fbf): New commit batches RSL-RL export flow tests:

Export script refactoring — Made export.py importable by deferring runtime dependencies to _load_runtime_dependencies(), extracting parse_export_args(), and refactoring main() into export_rsl_rl_agent() with explicit parameters. Clean separation of CLI parsing from export logic.

Robust cleanup handling — Added try/finally with leapp_started and env_closed flags to ensure proper cleanup on exceptions. Good defensive coding.

Test batching — Groups ~8 tasks per Kit process via _task_batches() and subprocess isolation. Reduces Kit startup churn while keeping processes short enough to avoid GPU memory accumulation.

Module loading — Clean _load_export_module() using importlib.util to import export.py dynamically.

Changelog skip file — Appropriate for CI-only changes.

No new issues found. Well-structured performance optimization for CI.


Update (63037a9): Timing instrumentation removed from export script and tests.

Cleanup of diagnostic code — Removed _timed_phase(), _emit_timing(), _close_simulation_app(), and faulthandler integration. The export flow is now cleaner without the debugging scaffolding.

Simplified test output — Removed timing extraction and timing-related log statements from tests.

Reverted to standard closesimulation_app.close() without wait_for_replicator=False. Presumed safe now that root cause (OmniHub detect-only) is addressed.

No new issues. This is straightforward cleanup of temporary diagnostic code.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR fixes a Docker test regression where the HUB__ARGS__DETECT_ONLY environment variable (set by some Isaac Sim base images) prevented OmniHub from starting inside the test container, causing cold Nucleus asset downloads to time out after 600 s instead of completing in ~73 s.

  • Adds a single unset HUB__ARGS__DETECT_ONLY statement inside the container's entrypoint bash string, placed after the symlink setup and before pip installs, so the flag is cleared for all subsequent test processes.
  • The fix is narrowly scoped to .github/actions/run-tests/action.yml and has no effect outside the Docker container entrypoint.

Confidence Score: 5/5

Safe to merge — single-line unset in the container entrypoint with no side effects outside the test container.

The change removes one environment variable inside the container's bash entrypoint. unset is a no-op when the variable is absent and has no impact on the host or other jobs. The fix is validated with timing data (600 s timeout → 73 s pass) and the code path is straightforward.

No files require special attention.

Important Files Changed

Filename Overview
.github/actions/run-tests/action.yml Adds unset HUB__ARGS__DETECT_ONLY to the container entrypoint to clear a base-image flag that blocks OmniHub startup; change is correct, minimal, and well-scoped.

Sequence Diagram

sequenceDiagram
    participant GHA as GitHub Actions
    participant Docker as Docker Container
    participant OmniHub as OmniHub
    participant Nucleus as Nucleus (asset server)

    GHA->>Docker: "docker run (base image may set HUB__ARGS__DETECT_ONLY=true)"
    Docker->>Docker: ln -s /isaac-sim _isaac_sim
    Note over Docker: NEW: unset HUB__ARGS__DETECT_ONLY
    Docker->>Docker: pip install extra packages (optional)
    Docker->>OmniHub: isaaclab.sh pytest (OmniHub starts normally)
    OmniHub->>Nucleus: Fetch cold assets
    Nucleus-->>OmniHub: Assets downloaded (~73 s)
    OmniHub-->>Docker: Ready
    Docker-->>GHA: Tests pass
Loading

Reviews (1): Last reviewed commit: "Fix OmniHub startup in Docker tests" | Re-trigger Greptile

@AntoineRichard AntoineRichard force-pushed the antoiner/fix-hub-detect-only branch from 115d3a9 to d7746d9 Compare May 15, 2026 11:18
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 15, 2026
@AntoineRichard AntoineRichard changed the title Fix OmniHub startup in Docker tests Fixes OmniHub startup in Docker tests May 15, 2026
LEAPP export does not use Replicator workflows, so avoid waiting for Replicator during SimulationApp shutdown. This keeps teardown from blocking on unrelated queued Replicator work and keeps traceback diagnostics around close for future CI timeouts.
Run the export-flow matrix in subprocess batches so each batch launches Kit once instead of restarting it per task. Keep export.py importable for the batched helper while preserving the CLI AppLauncher boundary before task/runtime imports.
@pbarejko pbarejko self-requested a review May 15, 2026 15:15
Drop the temporary export-flow timing logs and restore normal SimulationApp close behavior now that the batched test structure has been verified.
@pbarejko pbarejko merged commit a5eb9ad into isaac-sim:develop May 15, 2026
33 of 34 checks passed
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 15, 2026
Conflict resolution touched two files:

* scripts/reinforcement_learning/leapp/rsl_rl/export.py
  Develop (isaac-sim#5633) restructured export.py into create_arg_parser /
  parse_export_args / run_export_with_hydra / main_cli functions so the
  test suite can drive the export flow in-process. Plug the preset CLI
  into the new shape:
  - parse_export_args uses setup_preset_cli(parser, argv) to attach the
    preset-selection help group and parse argv (now forwarded to
    parse_known_args for the in-process test path)
  - run_export_with_hydra folds typed selectors via fold_preset_tokens
    before assigning sys.argv for the Hydra decorator

* source/isaaclab_tasks/isaaclab_tasks/utils/preset_cli.py
  setup_preset_cli grows an optional argv: list[str] | None = None
  parameter forwarded to parse_known_args. Defaults to None so existing
  module-level callers keep working unchanged. Help-time variant
  introspection still reads sys.argv since --help only fires from the
  interactive command line.
ooctipus pushed a commit that referenced this pull request May 16, 2026
…_ik (#5644)

## Summary

One-character fix in
`source/isaaclab/test/controllers/test_pink_ik.py:309`:

```diff
- quat_from_matrix(matrix_from_quat(target_rot_tensor) * matrix_from_quat(quat_inv(current_rot)))
+ quat_from_matrix(matrix_from_quat(target_rot_tensor) @ matrix_from_quat(quat_inv(current_rot)))
```

`calculate_rotation_error` was composing two rotation matrices with
PyTorch's element-wise multiplication (`*`) where matrix multiplication
(`@`) was intended. The Hadamard product of two rotation matrices is not
generally a rotation matrix.

## Why this surfaced as test failures now

The bug has been latent since
[#3149](#3149)
(2025-08-26) because the Hadamard product of two near-identity matrices
is also near-identity — `quat_from_matrix` could still recover a
near-unit quaternion and the assertion `rot_error ≈ 0` would pass for
completely wrong mathematical reasons.

It became visible when [#5609
(jmart)](#5609) (2026-05-14)
added the unit-norm guard to `isaaclab/utils/math.py:quat_from_matrix`:

```python
invalid = (quat.norm(p=2, dim=-1, keepdim=True) - 1.0).abs() > 2e-5
return torch.where(invalid, torch.full_like(quat, float("nan")), quat)
```

After that PR, any non-rotation input (the Hadamard mess) returns NaN,
which `axis_angle_from_quat` propagates → `torch.max(NaN) = NaN` →
`AssertionError: Left hand IK rotation error (nan) exceeds tolerance`.
Both hands always went to NaN; left hand is just asserted first.

## Verification

Local repro on the Horde VM against current `develop` (`isaaclab_physx`
backend, `newton[sim]@v1.2.0rc2`):

| Configuration | Result |
|---|---|
| Unfixed, `Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` | FAILED —
`Left hand IK rotation error (nan)` |
| Fixed, same parameterization | PASSED — rotation errors `1e-4` to
`1e-7` (well within 0.02 rad tolerance) |
| Fixed, all 12 GR1T2 cases, run 1 | 11 passed, 1 skipped |
| Fixed, all 12 GR1T2 cases, run 2 | 11 passed, 1 skipped
(deterministic) |

## Scope

This addresses the consistent `Left hand IK rotation error (nan)`
failures seen across recent develop PRs (e.g. [#5633
`test-curobo`
log](https://github.com/isaac-sim/IsaacLab/actions/runs/25926139790/job/76211194676),
[#5609 `test-curobo`
log](https://github.com/isaac-sim/IsaacLab/actions/runs/25831490295/job/75897258188),
[#5616 `test-curobo`
log](https://github.com/isaac-sim/IsaacLab/actions/runs/25930392313/job/76222556444)).

Remaining failures on G1 envs (finite ~0.03-0.05 rad rotation errors
against the 0.030 rad tolerance) are a **separate** issue — IK
convergence quality rather than the NaN math bug. Out of scope for this
PR; needs its own ticket.

## Test plan

- [x] Pre-commit clean.
- [x] Unfixed branch reproduces NaN on
`Isaac-PickPlace-GR1T2-Abs-v0-horizontal_movement` locally.
- [x] Fixed branch passes the same parameterization locally with finite
rotation errors.
- [x] Fixed branch passes all 12 GR1T2 parameterizations across two
consecutive runs (deterministic).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants