
Hotfix: probe torch.cuda via subprocess + opaque install log #25

Merged
cryptopoly merged 1 commit into main from fix/torch-dll-lock-and-zindex on May 1, 2026
Conversation

@cryptopoly (Owner)

Summary

Two regressions surfaced when smoke-testing the v0.7.2 rebuild on Windows / RTX 4090 — both blockers for shipping.

#1 — torch DLL lock prevents GPU bundle install

PR #22 added _snapshot_torch_cuda which did import torch in the backend process. On Windows that pins torch/lib/*.dll (asmjit, cublas, cudnn, ...) into the process handle table. The next click on Install GPU runtime runs pip install --target which calls shutil.rmtree on the existing torch dir, hits the locked DLLs, and crashes:

```
PermissionError: [WinError 5] Access is denied:
'...\extras\cp312\site-packages\torch\lib\asmjit.dll'
```

DiffusersImageEngine.probe() already documents this exact trap (it deliberately uses find_spec instead of importing torch). PR #22 was undoing that protection.

Fix: spawn a short-lived Python subprocess that imports torch, prints {gpu_name, total, used} as JSON, and exits. The OS releases the DLL handles on subprocess exit so the next install can swap torch in place. Prefer the embedded sidecar Python (CHAOSENGINE_EMBED_PYTHON_BIN); fall back to sys.executable.

Skip the probe entirely on macOS — Apple Silicon has no torch.cuda; _snapshot_macos owns that path.

#2 — Install log appears to overlap Prompt + Recent Outputs

PR #23's position: relative; z-index: 5 won the stacking battle, but .install-log-panel kept its translucent rgba(0, 0, 0, 0.22) background, so the Prompt and Recent Outputs panel headers bled through visually whenever the install log sat adjacent to them. This reads as "overlap" even when the layout doesn't actually intersect.

Fix: switch the background to var(--surface) for a fully opaque card, and add contain: layout so the panel's growth during a long torch download can't leak into sibling grid rows.
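The style change amounts to two declarations on the existing panel class (the `position`/`z-index` lines are from PR #23; surrounding structure is illustrative):

```css
.install-log-panel {
  position: relative;          /* from PR #23: win the stacking battle */
  z-index: 5;                  /* from PR #23 */
  background: var(--surface);  /* opaque card: sibling headers no longer bleed through */
  contain: layout;             /* growth during a long torch download stays in this box */
}
```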

Changes

  • backend_service/helpers/gpu.py: _snapshot_torch_cuda now spawns a Python subprocess; new _resolve_python_executable picks the embedded sidecar Python first; macOS short-circuits to None.
  • tests/test_gpu_detection.py — rewritten to mock subprocess.run instead of sys.modules['torch']. Adds an explicit assertion that the probe never imports torch in the main process — if anyone reverts to an in-process import, this test catches it.
  • src/styles.css: .install-log-panel gets background: var(--surface) + contain: layout.
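The regression guard can be sketched roughly as follows (pytest-style; the in-file `snapshot_torch_cuda` stand-in and all names are assumptions, not the actual tests/test_gpu_detection.py contents):

```python
import json
import subprocess
import sys
from unittest import mock


def snapshot_torch_cuda():
    """Minimal stand-in for the real probe: shells out, parses JSON stdout."""
    result = subprocess.run(
        # Probe script elided; never executed below because subprocess.run is mocked.
        [sys.executable, "-c", "import json, torch; print(json.dumps({}))"],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def test_probe_never_imports_torch_in_process():
    fake = subprocess.CompletedProcess(
        args=[], returncode=0,
        stdout=json.dumps({"gpu_name": "RTX 4090", "total": 24576, "used": 3072}),
        stderr="",
    )
    # Mock the subprocess boundary instead of sys.modules['torch'].
    with mock.patch("subprocess.run", return_value=fake) as run:
        snapshot = snapshot_torch_cuda()
    assert snapshot["gpu_name"] == "RTX 4090"
    run.assert_called_once()
    # The hard guarantee: a revert to an in-process import fails here.
    assert "torch" not in sys.modules
```

Mocking `subprocess.run` rather than `sys.modules['torch']` means the test exercises the real process boundary, and the final `sys.modules` assertion makes any in-process import an immediate failure.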

Test plan

  • .venv/bin/python -m pytest tests/test_gpu.py tests/test_gpu_detection.py tests/test_inference.py tests/test_setup_routes.py -q — all pass (30 in test_gpu* alone, 1 expected skip)
  • Manual verify on Windows / RTX 4090: install GPU runtime → restart backend → click Install GPU runtime AGAIN → confirm no PermissionError; uninstall + reinstall → confirm extras dir survives.
  • Manual verify on Windows: while a long torch install is streaming, scroll Image Studio → confirm install log card is opaque, no Prompt headers showing through.
  • Manual verify on macOS: VRAM detection still reports unified memory via _snapshot_macos (untouched).

@cryptopoly merged commit 0f84066 into main on May 1, 2026
1 of 2 checks passed