[BUG FIX] vis/rasterizer: defer OffscreenRenderer creation until first render (#31)
Open
gpinkert wants to merge 2 commits into amd-integration from lazy-offscreen-renderer
Conversation
Previously, `Rasterizer.build()` eagerly constructed a
`pyrender.OffscreenRenderer` whenever no interactive viewer was active,
which is the path taken by every `scene.build()` call with
`show_viewer=False`. Constructing the OffscreenRenderer immediately calls
`EGLPlatform.init_context()` (genesis/ext/pyrender/platforms/egl.py), which
in turn calls `eglInitialize` on a Mesa/`radeonsi` display.
This is fine when a single test process drives the GPU sequentially, but
it has two problems on the AMD/ROCm test setup:
1. It creates an EGL/GL context for *every* `scene.build()`, even when the
test never renders. The vast majority of the rigid-physics test suite
never instantiates a camera or calls `scene.render()` / `cam.render()`,
so the GL context, FBO, depth buffer, and shader compiles are pure
per-test overhead.
2. When the test runner uses `pytest-xdist` with multiple workers on a
single AMD GPU, the concurrent `eglInitialize` / `radeonsi` context
creations contend on the driver and reliably fail with:
       radeonsi: error: can't create eop_bug_scratch
       radeonsi: error: Failed to create a context.
followed by a SIGSEGV inside `eglInitialize` (egl.py:223). The crash
manifests under any combination of `-n N` and `-n N --forked` because
`pytest-forked` runs each test inside a fresh fork that re-enters
`scene.build()` and races against the other workers' GL initialisations.
The fix moves the OffscreenRenderer construction out of `build()` and
into a new `_ensure_renderer()` helper that is invoked on the first
`render_camera()` call. With this change:
- Tests that never render (the dominant case) never touch EGL, so they
cannot hit the `radeonsi` context-creation race. This unblocks
`pytest -n N` parallel execution on a single AMD GPU.
- Tests that do render are unaffected behaviourally: the same
`OffscreenRenderer` is created the first time `render_camera()` runs,
and reused on subsequent calls. `make_current()` / `make_uncurrent()`,
resize handling, and `destroy()` all already null-check `self._renderer`,
so no further refactor was required.
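The deferred-construction pattern described above can be sketched as follows. This is a simplified illustration, not the actual `Rasterizer` implementation: `ExpensiveRenderer` is a hypothetical stand-in for `pyrender.OffscreenRenderer`, counting constructions where the real class would create an EGL/GL context.

```python
class ExpensiveRenderer:
    """Hypothetical stand-in for pyrender.OffscreenRenderer."""

    instances = 0  # counts constructions, where the real class would init EGL

    def __init__(self, width, height):
        ExpensiveRenderer.instances += 1
        self.width, self.height = width, height


class Rasterizer:
    """Simplified sketch of the lazy-renderer pattern from this PR."""

    def __init__(self, width=640, height=480):
        self._width, self._height = width, height
        self._renderer = None  # previously constructed eagerly in build()

    def build(self):
        # No GL/EGL work here any more: build() stays cheap for tests
        # that never render.
        pass

    def _ensure_renderer(self):
        # Construct the offscreen renderer on first use, then reuse it.
        if self._renderer is None:
            self._renderer = ExpensiveRenderer(self._width, self._height)
        return self._renderer

    def render_camera(self):
        return self._ensure_renderer()

    def destroy(self):
        # Null-safe teardown: works whether or not a renderer was ever made.
        self._renderer = None


rast = Rasterizer()
rast.build()
assert ExpensiveRenderer.instances == 0  # build() no longer touches the renderer
first = rast.render_camera()
second = rast.render_camera()
assert first is second  # created once on first render, then reused
```

The key property is that a test which only calls `build()` never constructs the expensive object at all, while rendering tests see the same create-once/reuse behaviour as before.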
The visualizer/rasterizer wiring is otherwise unchanged: cameras are
still added in `add_camera()`, `pyrender.Renderer` (per-camera target) is
still allocated up-front (it only touches the JIT context, not GL), and
`destroy()` continues to clean up whichever renderer was actually
materialised.
Result: `pytest tests/test_rigid_physics.py -n 8 -m required` now runs
to completion on a single-GPU AMD/ROCm host. End-to-end wall-time on the
required set drops from ~30 minutes (sequential `-n 0 --forked`) to
roughly 8 minutes, with the speedup limited by per-process kernel
compilation rather than EGL or GPU contention.
The CI tester scripts that invoke the Genesis test suite live in another
repo and pass `-n 0` (sequential) explicitly. Now that EGL is initialised
lazily and concurrent `scene.build()` calls no longer race on `radeonsi`
context creation (see prior commit), it is safe — and substantially
faster — to run the suite in parallel on a single AMD GPU.
This change rewrites `-n 0` to `-n 8` from inside the existing
`pytest_cmdline_main` hook so the parallel default takes effect without
any modification to the external tester scripts. The override is gated
on:
* `os.path.exists("/dev/kfd")` — the AMDGPU kernel driver device file,
so non-AMD hosts (NVIDIA, Apple, CPU-only) keep the user-supplied
value.
* `not show_viewer` — the immediately-preceding block already pins
`numprocesses` to 0 when the interactive viewer is requested, and
that decision must win.
* `config.option.numprocesses == 0` — explicit `-n N` for any non-zero
`N` is preserved verbatim, so debugging with `-n 1` or experimenting
with other worker counts still works.
`pytest-xdist` is already installed in `genesis:amd-integration` (via
the `[dev]` extras pulled in by `Dockerfile.rocm`), and the existing
`-n 0` invocations already depend on it being present, so no packaging
changes are needed for this hook to take effect.
Net effect: `pytest -n 0 --forked` issued from the upstream tester
scripts now runs with eight workers on AMD/ROCm, dropping the
required-test wall-time on `tests/test_rigid_physics.py` from roughly
30 minutes to under 10 minutes on a single-GPU host.
Collaborator: /run-ci
yaoliu13 (Collaborator) requested changes on Apr 27, 2026:
This PR is not based on the latest amd-integration: lazy-offscreen-renderer...ROCm:Genesis:amd-integration