
Fix Windows CUDA detection + image gen#22

Merged
cryptopoly merged 1 commit into main from fix/windows-cuda-detection on May 1, 2026

Conversation

@cryptopoly (Owner)

Summary

Two related Windows bugs surfaced during the v0.7.2 smoke test on an RTX 4090 / 24 GB box:

  • Bug #6 — GPU memory misreported as 12 GB on a 24 GB card. GPUMonitor._snapshot_nvidia() shelled out to nvidia-smi, and on Windows boxes without it on PATH (driver installed but no CUDA toolkit) it fell through to _fallback_psutil(), which returns system RAM. The image / video safety estimators then read that as the GPU budget and produced spurious "Likely to crash" warnings.
  • Bug #7 — Image gen produces a gibberish placeholder after install. DiffusersImageEngine.probe() uses importlib.util.find_spec to choose between the placeholder engine and the real diffusers pipeline. After the GPU bundle install lands new packages in the extras dir, importlib's negative-lookup cache still answers None until invalidate_caches() is called, so the probe kept reporting realGenerationAvailable=False and the SVG placeholder leaked through.

Changes

backend_service/helpers/gpu.py

  • New _snapshot_torch_cuda() reads VRAM via torch.cuda.get_device_properties(0).total_memory first — it works whenever the GPU bundle is installed, with no PATH dependency.
  • _snapshot_nvidia() now tries torch.cuda → nvidia-smi → returns vram_total_gb=None (no system-RAM lie); see the sketch after this list.
  • The old _fallback_psutil() is kept untouched but is no longer reachable from the live path.
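
A minimal sketch of the new ordering, assuming the helper names and units described above (the real gpu.py differs in detail):

```python
import subprocess
from typing import Optional


def _snapshot_torch_cuda() -> Optional[float]:
    """Read total VRAM through the CUDA driver (no PATH dependency)."""
    try:
        import torch  # present only once the GPU bundle is installed
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            return props.total_memory / (1024 ** 3)  # bytes -> GiB
    except ImportError:
        pass
    return None


def _snapshot_nvidia_smi() -> Optional[float]:
    """Fallback: nvidia-smi, which may be missing from PATH on Windows."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True, timeout=5,
        )
        return float(out.strip().splitlines()[0]) / 1024  # MiB -> GiB
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def _vram_total_gb() -> Optional[float]:
    # torch.cuda first, nvidia-smi second, honest None last; never system RAM.
    for probe in (_snapshot_torch_cuda, _snapshot_nvidia_smi):
        total = probe()
        if total is not None:
            return total
    return None
```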

backend_service/image_runtime.py

  • DiffusersImageEngine.probe() calls importlib.invalidate_caches() before the find_spec checks so newly installed packages from the GPU bundle install are visible without a backend restart; see the sketch below.
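
A minimal sketch of the probe fix, assuming probe() keys availability off find_spec as described; the module names checked here are illustrative:

```python
import importlib
import importlib.util


def probe() -> bool:
    # Drop importlib's caches (including cached negative lookups) so
    # packages installed into the extras dir after startup are visible.
    importlib.invalidate_caches()
    return all(
        importlib.util.find_spec(mod) is not None
        for mod in ("torch", "diffusers")
    )
```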

backend_service/routes/setup.py

  • _gpu_bundle_job_worker invalidates the import cache and resets the VRAM total cache when transitioning to phase=done, so the immediately following capabilities snapshot reflects freshly importable torch; see the sketch below.
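
A sketch of the done-phase hook. The lru_cache shape for the VRAM cache and the _on_gpu_bundle_done / job names are assumptions; only the invalidate-and-reset behavior at phase=done comes from this PR:

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=1)
def get_device_vram_total_gb():
    # Process-lifetime cache around the snapshot chain sketched earlier.
    return _vram_total_gb()


def _on_gpu_bundle_done(job) -> None:
    # Freshly pip-installed packages stay invisible to the import
    # machinery until the caches are dropped.
    importlib.invalidate_caches()
    # Forget the memoized VRAM total so the next capabilities snapshot
    # re-probes through the now-importable torch.cuda.
    get_device_vram_total_gb.cache_clear()
    job.phase = "done"  # illustrative job/state shape
```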

tests/test_gpu_detection.py (new)

Nine unit tests covering:

  • torch.cuda returns the full 24 GB for a mocked RTX 4090 (example after this list).
  • torch.cuda unavailable / not installed returns None.
  • _snapshot_nvidia falls back to {"gpu_name": "No GPU detected", "vram_total_gb": None} when both torch.cuda and nvidia-smi fail.
  • _snapshot_nvidia does NOT fall back to system RAM via psutil any more.
  • torch.cuda takes precedence over nvidia-smi when both are available.
  • get_device_vram_total_gb caches result for process lifetime.
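
As an example, the mocked-4090 case could look roughly like this; the module path follows the Changes section above, while the mock shape is an assumption:

```python
import sys
from types import SimpleNamespace
from unittest import mock

from backend_service.helpers import gpu


def test_torch_cuda_reports_full_24_gb():
    fake_torch = mock.MagicMock()
    fake_torch.cuda.is_available.return_value = True
    fake_torch.cuda.get_device_properties.return_value = SimpleNamespace(
        total_memory=24 * 1024 ** 3,  # a mocked RTX 4090
    )
    with mock.patch.dict(sys.modules, {"torch": fake_torch}):
        assert gpu._snapshot_torch_cuda() == 24.0
```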

Test plan

  • .venv/bin/python -m pytest tests/test_gpu_detection.py -v — 9/9 pass
  • .venv/bin/python -m pytest tests/test_setup_routes.py tests/test_inference.py tests/test_services.py -q — pre-existing tests still pass
  • Manual verify on Windows / RTX 4090: Settings → Diagnostics reports 24 GB VRAM after restart; FLUX.1 Dev no longer triggers "Likely to crash" warning; clicking Generate after a fresh GPU bundle install produces a real image instead of the placeholder.
  • Manual verify on macOS (regression): VRAM detection still reports unified memory correctly via the existing _snapshot_macos path (untouched).

Out of scope

  • The diffusers/safetensors version-incompatibility warning ("safetensors 0.7.0 vs >=0.8.0-rc.0") observed in the install log is a separate issue. safetensors 0.8.0-rc.0 is a pre-release that pip won't install by default; in practice 0.7.0 works for the FLUX pipeline, so this PR leaves the pin at safetensors>=0.4.5 and treats the pip-resolver warning as benign.

Two related Windows-only bugs surfaced by the v0.7.2 smoke test on
an RTX 4090 box:

Bug #6 — RTX 4090 reported as 12 GB total
  GPUMonitor._snapshot_nvidia() shells out to nvidia-smi, and on
  Windows boxes without it on PATH (driver installed but no CUDA
  toolkit) it fell through to _fallback_psutil() which returns
  psutil.virtual_memory().total — system RAM, not VRAM. The image /
  video safety estimators then read that as the GPU budget and
  produced 'Likely to crash' warnings on a 24 GB card holding an
  11 GB FLUX model.

  Fix:
  - Try torch.cuda.get_device_properties(0).total_memory first.
    When the GPU bundle is installed this is the most reliable
    source — it reads through the CUDA driver, no PATH needed.
  - Fall back to nvidia-smi as before.
  - Drop the psutil fallback. When neither answers we now return
    {'vram_total_gb': None}, which the TS estimators
    (utils/images.ts, utils/videos.ts) already treat as 'unknown'
    via the DEFAULT_*_MEMORY_GB fallbacks. Better an honest
    'unknown' than a wrong 12 GB.

Bug #7 — Image gen produces gibberish placeholder after install
  DiffusersImageEngine.probe() uses importlib.util.find_spec to
  decide between the placeholder engine and the real diffusers
  pipeline. Once the GPU bundle install lands new packages into
  the extras dir, importlib's negative-lookup cache still answers
  None for the new modules until invalidate_caches() is called.
  The probe kept reporting realGenerationAvailable=False and the
  generation pipeline returned the SVG placeholder, which lands as
  a gibberish image when the frontend renders it as data:image/svg+xml.

  Fix:
  - probe() now calls importlib.invalidate_caches() before
    find_spec so newly-installed packages are picked up without a
    backend restart.
  - The GPU bundle worker (_gpu_bundle_job_worker) now also calls
    invalidate_caches and resets the VRAM total cache when it
    transitions to phase=done, so the immediately-following
    capabilities snapshot reflects the freshly-importable torch.

Tests
  tests/test_gpu_detection.py — 9 unit tests covering
  torch.cuda detection, nvidia-smi precedence, the new
  no-system-RAM fallback path, and the process-lifetime cache.
  All pass; existing pytest suite still green.
cryptopoly merged commit 3967db3 into main on May 1, 2026
1 of 2 checks passed
cryptopoly added a commit that referenced this pull request on May 1, 2026
PR #22 (Fix Windows CUDA detection) replaced the system-RAM-as-VRAM
fallback in _snapshot_nvidia with a no-GPU response that returns
{'vram_total_gb': None, 'vram_used_gb': None}. The pre-existing
test_snapshot_vram_values_are_numeric still required (int, float),
which broke on the Linux CI runner where neither torch.cuda nor
nvidia-smi is available.

Loosen the type check to (int, float, type(None)) so the no-GPU path
is accepted (sketched below). Garbage values still fail the test, e.g.
a string in vram_used_gb.

Renamed the test to ..._numeric_or_none to make the intent loud at
the call site.
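
The loosened assertion, roughly sketched; constructing GPUMonitor directly is an assumption, while the class, method, and key names come from the commit text:

```python
from backend_service.helpers.gpu import GPUMonitor


def test_snapshot_vram_values_are_numeric_or_none():
    snap = GPUMonitor()._snapshot_nvidia()
    for key in ("vram_total_gb", "vram_used_gb"):
        # Accept the honest no-GPU None; reject garbage like a string.
        assert isinstance(snap[key], (int, float, type(None)))
```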
cryptopoly added a commit that referenced this pull request on May 1, 2026
PR #22's no-system-RAM-fallback path returns {'vram_total_gb': None}
on Linux CI runners (no torch.cuda, no nvidia-smi). The pre-existing
test_snapshot_vram_values_are_numeric required (int, float) which
breaks on those runners.

This fix originally landed in branch fix/test-host-platform-mock
(commit 3b147a9) but was pushed after PR #24 had already merged, so
only the imageDiscoverMemoryEstimate Mac pin (commit 7bbeeef) made
it into main. The orphan commit went unnoticed until run 25223969487
on this PR's first CI ride resurfaced the same failure.

Loosen the type check to (int, float, type(None)) and rename the
test to ..._numeric_or_none so the intent is loud at the call site.