Prevents early numpy imports to avoid Kit crash #5620

Closed
pbarejko wants to merge 10 commits into isaac-sim:develop from pbarejko:pbarejko/debugging

@pbarejko
Collaborator

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing functionality will not work without user modification)
  • Documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added the bug (Something isn't working) and infrastructure labels May 14, 2026
Comment thread tools/conftest.py
@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR contains two debugging-only changes that appear unintentional for a merge to develop: a broken pytest invocation and a CI runner switch to a staging pool.

  • tools/conftest.py: -s and -vv are inserted between -m and pytest in the subprocess command list, so they are parsed as Python interpreter flags rather than pytest arguments. This silently prevents output-capture disabling and verbose pytest output while also stripping user site-packages from sys.path.
  • .github/workflows/build.yaml: The build job runner label is changed from gpu to gpu-stg, routing production CI Docker builds to the staging GPU runner pool.

Confidence Score: 2/5

Not safe to merge — both changes are debugging artefacts that actively break the CI pipeline and test runner.

The pytest command in conftest.py is now malformed: -s and -vv land in the Python interpreter argument list, not pytest's, so the subprocess will either error on import-resolution or run with wrong sys.path settings. Separately, the CI build job is redirected to gpu-stg, meaning every Docker-image build triggered on develop would run on staging hardware until reverted.

tools/conftest.py and .github/workflows/build.yaml both need attention before this is merged.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tools/conftest.py | Debugging flags -s and -vv inserted at the wrong position in the pytest subprocess command (placed between -m and pytest), so they are consumed by the Python interpreter, not by pytest. |
| .github/workflows/build.yaml | Build job runner changed from gpu to gpu-stg; likely a temporary debugging change that routes CI builds to a staging runner pool instead of the production one. |

Comments Outside Diff (1)

  1. tools/conftest.py, lines 340-345

    P1 The -s and -vv flags are inserted between -m and pytest, so Python interprets them as interpreter-level flags rather than pytest arguments. Specifically, -s becomes Python's "don't add user site-packages to sys.path" flag (which can break imports in the test environment), and -vv would be treated as two -v (verbose interpreter) flags. Neither flag is forwarded to pytest, so the intended behaviour (no output capture, extra verbose pytest output) silently does not take effect.
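
The argument-ordering point can be illustrated with a minimal sketch. The helper name `build_pytest_command` and the test path are hypothetical and not taken from `tools/conftest.py`; the sketch only shows that options placed before `-m` belong to the interpreter, while everything after the module name is forwarded to pytest.

```python
import sys


def build_pytest_command(test_path, pytest_flags=("-s", "-vv")):
    """Return an argv list in which the flags reach pytest rather than Python."""
    # Options before "-m" are interpreter options; everything after the
    # module name ("pytest") is handed to that module as its own arguments.
    return [sys.executable, "-m", "pytest", *pytest_flags, str(test_path)]


# Hypothetical test path, used only for illustration.
cmd = build_pytest_command("path/to/test_example.py")
```

The resulting list can be passed to `subprocess.run` unchanged; keeping the flags in a separate parameter makes it harder to splice them in at the wrong position.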

Reviews (1): Last reviewed commit: "py test args"

Comment thread .github/workflows/build.yaml Outdated
  build:
    name: Build Base Docker Image
-    runs-on: [self-hosted, gpu]
+    runs-on: [self-hosted, gpu-stg]
Contributor

P1 The build job runner label was changed from gpu to gpu-stg, routing all Docker-image builds to the staging GPU runner pool. If this was intentional for debugging purposes only, it should be reverted before merging to develop — merging as-is means production CI will continue to run on staging hardware.

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Review Summary

This PR makes two infrastructure/debugging changes:

  1. Runner change (.github/workflows/build.yaml): gpu → gpu-stg
  2. Pytest verbosity (tools/conftest.py): Added -s -vv flags

⚠️ Concerns

PR Metadata Issues

  • Incomplete description: The PR body uses template placeholders without actual content explaining the purpose
  • No linked issue: "Fixes #" is empty
  • Unchecked checklist: All items remain unchecked
  • Vague title: "Debugging" doesn't describe the actual changes

Technical Concerns

Runner Change (gpu → gpu-stg)

  • Is gpu-stg a staging/test runner? Please clarify the intent
  • Will this affect CI reliability if staging runners have different availability or configuration?
  • Is this intended to be temporary for debugging, or permanent?

Pytest Flags (-s -vv)

  • Adding -s disables output capture, which will significantly increase CI log verbosity
  • -vv (very verbose) will further increase output volume
  • Impact: All CI test runs will produce much larger logs, which may:
    • Slow down CI execution
    • Make it harder to identify actual failures in verbose output
    • Increase storage/bandwidth for CI artifacts

🔍 Questions for Author

  1. Is this PR intended for debugging purposes only? If so, consider marking it as draft or adding [WIP] to the title
  2. Should these changes be reverted after debugging? If yes, please note that in the PR description
  3. What issue or problem are you debugging? This context would help reviewers understand the necessity of these changes

📋 Recommendations

  • Update PR description with actual context explaining the debugging purpose
  • Consider if verbose pytest output should be permanent or temporary
  • Clarify the runner change rationale
  • If temporary, consider using a feature flag or environment variable instead of hardcoding debug flags

Waiting for clarification before approving.


Update (8447f64): New commits add PYTHONFAULTHANDLER=1 and expand PYTHONUNBUFFERED=1 across multiple files (action.yml, Dockerfile.installci, utils.py, conftest.py, run_install_ci.py). These are reasonable diagnostic enhancements for better crash tracebacks and real-time output streaming. ✅ No concerns with these additions.
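
As a rough illustration of what these two variables do for a child process, the sketch below builds an environment with both set. The helper name `diagnostic_env` is hypothetical; the files named above are not shown in this thread, so this is not their actual code.

```python
import os


def diagnostic_env(base=None):
    """Return a copy of the environment with crash and streaming diagnostics on."""
    env = dict(os.environ if base is None else base)
    # A non-empty value makes Python enable faulthandler at startup, so fatal
    # signals such as SIGSEGV dump a Python traceback.
    env["PYTHONFAULTHANDLER"] = "1"
    # A non-empty value makes stdout and stderr unbuffered, so output streams
    # in real time instead of being lost if the process dies.
    env["PYTHONUNBUFFERED"] = "1"
    return env
```

The returned mapping would typically be passed as `env=` to `subprocess.run` or `subprocess.Popen` when launching the test process.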

However, original concerns remain unaddressed:

  • gpu-stg runner change still present
  • -s -vv verbose pytest flags still present

Still awaiting clarification on whether these debugging changes are temporary or intended for permanent merge.


Update (943d9fa): Runner label changed again: [self-hosted, gpu-stg] → [self-hosted] (removes GPU requirement entirely for build job). ⚠️ New concern: Build job now runs on any self-hosted runner without a GPU label. Is this intentional? The original was [self-hosted, gpu].

Original concerns still apply:

  • -s -vv verbose pytest flags still present
  • PR metadata still incomplete

Awaiting clarification on intended runner configuration.


Update (c08d001): Pytest verbosity reduced from -vv to -v. ✅ This partially addresses the verbose output concern — single verbose (-v) is more reasonable than very verbose (-vv).

Remaining concerns:

  • -s flag still disables output capture (increased log size)
  • Runner set to [self-hosted] (no GPU label) — clarification needed
  • PR metadata still incomplete

Still awaiting clarification on runner configuration and whether these changes are temporary.


Update (bc8d650): Added thread count constraints (OPENBLAS_NUM_THREADS=1, OMP_NUM_THREADS=1, MKL_NUM_THREADS=1) across action.yml, utils.py, conftest.py, and run_install_ci.py. ✅ These are reasonable additions to prevent thread oversubscription during parallel test execution — no concerns with these changes.
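
A minimal sketch of such a constraint follows, assuming the variables are applied to an environment mapping before any numeric library is loaded. The helper name `limit_numeric_threads` is hypothetical and the actual code in the listed files is not shown here.

```python
THREAD_LIMIT_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def limit_numeric_threads(env, count=1):
    """Set each thread-count variable unless the caller already chose a value."""
    for name in THREAD_LIMIT_VARS:
        # setdefault keeps an explicit user override intact.
        env.setdefault(name, str(count))
    return env
```

Calling `limit_numeric_threads(os.environ)` at the very top of an entry point is one way to use it; these libraries typically read the variables when they are first loaded, so setting them after `import numpy` is generally too late.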

Remaining concerns unchanged:

  • -s flag still disables pytest output capture
  • Runner configuration still needs clarification
  • PR metadata incomplete

Update (9b90222): New lazy import implementation for isaaclab.app:

  • Replaced lazy_export() from isaaclab.utils.module with a custom __getattr__ using only importlib
  • Added new test file test_app_imports.py verifying no numpy/torch/scipy loaded on import

This is a positive improvement — avoids pulling in heavy numeric dependencies when importing isaaclab.app, which is important for fast startup before Kit initializes.
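
The general technique, a module-level `__getattr__` (PEP 562) that resolves names on first access using only `importlib`, can be sketched as below. This is not the code from `isaaclab.app`, which is not shown in this thread; `make_lazy_module` and the name mapping are hypothetical, and the module is built with `types.ModuleType` only to keep the example self-contained.

```python
import importlib
import types


def make_lazy_module(name, lazy_attrs):
    """Build a module whose attributes are imported on first access.

    lazy_attrs maps a public name to a (module_name, attribute_name) pair.
    In a real package the hook would be a top-level __getattr__ function
    in __init__.py rather than one attached to a ModuleType instance.
    """
    module = types.ModuleType(name)

    def _lazy_getattr(attr):
        try:
            target_module, target_attr = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        # The target module is imported only now, on first attribute access.
        value = getattr(importlib.import_module(target_module), target_attr)
        # Cache the result so later lookups bypass the hook entirely.
        setattr(module, attr, value)
        return value

    module.__getattr__ = _lazy_getattr
    return module
```

Because the hook depends only on `importlib`, importing the package itself does not pull in any of the modules named in the mapping.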

Remaining concerns unchanged:

  • -s flag still disables pytest output capture
  • Runner [self-hosted] (no GPU) — clarification needed
  • PR metadata still incomplete

Update (f438131): ⚠️ Reverted the lazy import improvement:

  • isaaclab/app/__init__.py: Switched back to using lazy_export() from isaaclab.utils.module (which may pull in heavier dependencies)
  • Deleted test_app_imports.py — the test that verified no numpy/torch/scipy were loaded on import is now gone

This removes the isolation benefit from the previous commit. If lightweight imports before Kit startup are important, consider keeping the standalone __getattr__ implementation.
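
A check of the kind described for the deleted test could be sketched as follows. The contents of `test_app_imports.py` are not shown in this thread, so the function name and approach are assumptions: the target module is imported in a fresh interpreter and the watched names are looked up in `sys.modules`.

```python
import subprocess
import sys


def modules_loaded_by_import(module_name, watched=("numpy", "torch", "scipy")):
    """Import module_name in a fresh interpreter and return the watched modules it loaded."""
    script = (
        "import importlib, sys\n"
        f"importlib.import_module({module_name!r})\n"
        f"print(','.join(m for m in {tuple(watched)!r} if m in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", script], capture_output=True, text=True, check=True
    )
    # The report is the last line, even if the import itself printed something.
    lines = result.stdout.splitlines()
    last = lines[-1] if lines else ""
    return {name for name in last.split(",") if name}
```

A test would then assert that the returned set is empty for the package under inspection. Running the import in a subprocess matters because the test runner's own process may already have the heavy modules loaded.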

Remaining concerns unchanged:

  • -s flag still disables pytest output capture
  • Runner [self-hosted] (no GPU) — clarification needed
  • PR metadata still incomplete

Update (67d10a4): PR scope significantly reduced. Current diff now only contains:

  1. source/isaaclab/isaaclab/utils/__init__.py: Removed explicit configclass import (now lazy-loaded via attach_stub) — ✅ reasonable change for lazy import consistency
  2. tools/conftest.py: Added -s -v pytest flags — ⚠️ previous concern still applies (increases log verbosity)

The workflow/runner changes and environment variable additions appear to have been removed from this PR.

Current status: Only the pytest verbosity flags (-s -v) remain as a concern. Is this PR being used for debugging only, or are these flags intended to be permanent?


Update (3b15efd): New commit adds deferred CUDA device setting in app_launcher.py — torch import/set_device() is now deferred until after SimulationApp starts to avoid NumPy/OpenBLAS at-fork handlers crashing Kit's platform-info fork during startup. ✅ This is a well-implemented fix with clear comments explaining the rationale. No concerns with this change.
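
The deferral pattern can be sketched in isolation as below. The class `DeferredCudaDevice` is hypothetical and is not the code in `app_launcher.py`; it only shows the idea of recording the requested device early and importing torch at the point where the device is actually applied.

```python
class DeferredCudaDevice:
    """Remember the requested CUDA device and apply it only when told to."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.applied = False

    def apply(self, set_device=None):
        """Apply the device once; intended to be called after the app has started."""
        if self.applied:
            return
        if set_device is None:
            # Deferred import: torch (and whatever it pulls in) is loaded
            # here, not when the launcher module is first imported.
            import torch

            set_device = torch.cuda.set_device
        set_device(self.device_id)
        self.applied = True
```

The optional `set_device` parameter exists only so the pattern can be exercised without a GPU or a torch installation.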

Current PR scope:

  1. app_launcher.py: Deferred CUDA device setting — ✅ looks good
  2. utils/__init__.py: Lazy configclass import — ✅ looks good
  3. conftest.py: -s -v pytest flags — ⚠️ previous concern still applies (increases log verbosity)

Overall this PR is improving. The main technical changes are solid. Only remaining question is whether the pytest verbosity flags are intended to be permanent.


Update (ba57283): Pytest -v flag removed. Only -s remains in conftest.py. This addresses my verbosity concern.

New changes in this commit:

  • physics_manager_cfg.py, newton_collision_cfg.py, newton_manager_cfg.py: Changed to direct imports (from isaaclab.utils.configclass import configclass) — ✅ cleaner, more explicit
  • test_isaac_rtx_renderer_utils.py: Refactored tests with a mock_sim_context fixture — ✅ cleaner test code, removes duplicated patch blocks

Current PR scope looks clean. No remaining concerns with the technical changes. The -s flag still disables output capture but without -v the impact is reasonable for debugging purposes.

Update (fa628db): Added changelog entry (pbarejko-debugging.rst) documenting the CUDA device setting fix and .skip files for newton/physx sub-packages. ✅ Documentation-only changes — no new concerns.


Update (c70e9e0): Test infrastructure improvements:

  • test_noise.py: Removed unnecessary AppLauncher/simulation_app boilerplate — this test only exercises pure math utilities and doesn't need the Omniverse app. ✅ Good cleanup.
  • test_wrench_composer.py: Added teardown_module() to close simulation_app after tests complete. ✅ Proper resource cleanup.

No new concerns. PR continues to look solid.


Update (f8b2c7a): test_wrench_composer.py now also removes the AppLauncher/simulation_app boilerplate entirely (and the teardown_module added in the previous commit). ✅ This test only exercises pure math/torch utilities and doesn't need the Omniverse app — same cleanup as test_noise.py. Good improvement for faster test execution.

No new concerns. PR looks ready for merge.


Update (ee49ecf): Changelog consolidation — moved fragment files from changelog.d/ into main CHANGELOG.rst files and bumped versions (isaaclab 5.2.0→5.2.1, isaaclab_newton 0.9.0→0.9.1, isaaclab_physx 0.7.0→0.7.1). ✅ Standard release housekeeping. No code changes, no new concerns.

Final status: PR looks good to merge.

@pbarejko pbarejko requested a review from pascal-roth as a code owner May 14, 2026 22:47
@github-actions github-actions Bot added the isaac-lab (Related to Isaac Lab team) label May 14, 2026
@pbarejko pbarejko force-pushed the pbarejko/debugging branch from f438131 to 67d10a4 Compare May 15, 2026 03:08
Collaborator

@AntoineRichard AntoineRichard left a comment

LGTM

Comment thread tools/conftest.py
Collaborator

@AntoineRichard AntoineRichard May 15, 2026

Consider removing this if we're going to merge this?

@AntoineRichard AntoineRichard changed the title from "Debugging" to "Prevents early numpy imports to avoid Kit crash" May 15, 2026
@pbarejko
Collaborator Author

Closing because of #5633

@pbarejko pbarejko closed this May 15, 2026
pbarejko added a commit that referenced this pull request May 15, 2026
# Description

This PR is based on and includes the changes from #5620, then adds one
CI fix on top: it unsets `HUB__ARGS__DETECT_ONLY` inside the Docker test
container before running Isaac Lab commands. Some base images set this
flag, which prevents OmniHub from starting and makes cold Nucleus asset
retrieval fall back to slow repeated retries.
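
For a command launched from Python, the equivalent of unsetting the flag can be sketched as below. The helper name `env_without_detect_only` is hypothetical; the PR's actual change inside the Docker test container is not shown in this message.

```python
import os


def env_without_detect_only(base=None):
    """Return a copy of the environment with HUB__ARGS__DETECT_ONLY removed."""
    env = dict(os.environ if base is None else base)
    # pop with a default is a no-op when the base image never set the flag.
    env.pop("HUB__ARGS__DETECT_ONLY", None)
    return env
```

The copy leaves the parent process environment untouched, so only the child started with `env=` sees the change.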

This was reproduced from the failing Actions job:

https://github.com/isaac-sim/IsaacLab/actions/runs/25904143763/job/76158743634

The affected `test_rsl_rl_export_flow.py` Dexsuite Kuka-Allegro export
timed out at 600 s with the flag set, then completed in about 73 s with
the flag unset after clearing the local KukaAllegro mirror.

Fixes # N/A

## Type of change

- Bug fix (non-breaking change which fixes an issue)

## Screenshots

N/A - CI-only change.

## Checklist

- [x] I have read and understood the [contribution
guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation (N/A -
CI-only change)
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works (validated with the affected Docker export test)
- [x] I have added a changelog fragment under
`source/<pkg>/changelog.d/` for every touched package (N/A for the
CI-only commit; #5620 carries its own changelog fragments)
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

## Test Plan

- `./isaaclab.sh -f`
- Docker reproduction with `HUB__ARGS__DETECT_ONLY=true`:
`test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0]` timed out
after 600 s.
- Docker reproduction with `HUB__ARGS__DETECT_ONLY` unset after clearing
the KukaAllegro mirror:
`test_export_flow[Isaac-Dexsuite-Kuka-Allegro-Reorient-v0]` passed in
72.75 s.

---------

Co-authored-by: Piotr Barejko <pbarejko@nvidia.com>

Labels

bug (Something isn't working), infrastructure, isaac-lab (Related to Isaac Lab team)

3 participants