
fix(testing_utils): guard get_device_capability() with torch.cuda.is_available()#45427

Closed
Aftabbs wants to merge 1 commit into huggingface:main from Aftabbs:fix/testing-utils-cuda-available-check

Conversation

@Aftabbs

@Aftabbs Aftabbs commented Apr 14, 2026

What does this PR do?

Fixes #45341.

get_device_properties() in testing_utils.py calls torch.cuda.get_device_capability() whenever IS_CUDA_SYSTEM or IS_ROCM_SYSTEM is True. This raises a RuntimeError in environments where the CUDA toolkit is installed (so torch.version.cuda is not None) but no physical GPU is attached (e.g., Lightning AI Studio CPU-only instances, or CI runners with CUDA drivers but no GPU).

Root cause: IS_CUDA_SYSTEM reflects whether the CUDA toolkit is present, not whether a CUDA-capable device is available at runtime. torch.cuda.get_device_capability() requires an actual device.

Fix: add "and torch.cuda.is_available()" to the condition so that get_device_capability() is only called when a CUDA/ROCm device is actually present. When CUDA is installed but no device is available, the function falls through to the generic else branch and returns (torch_device, None, None).

-    if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
+    if (IS_CUDA_SYSTEM or IS_ROCM_SYSTEM) and torch.cuda.is_available():
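The guarded control flow can be sketched in isolation as follows. This is a minimal, torch-free mock, not the actual testing_utils.py code: IS_CUDA_SYSTEM, IS_ROCM_SYSTEM, and cuda_is_available are passed in as plain booleans standing in for the real module-level flags and torch.cuda.is_available(), and get_device_capability is a stub for torch.cuda.get_device_capability().

```python
# Sketch of the patched branch in get_device_properties().
# All torch calls are replaced with stand-ins so the logic runs standalone:
# - IS_CUDA_SYSTEM / IS_ROCM_SYSTEM mirror the toolkit-presence flags
# - cuda_is_available mirrors torch.cuda.is_available()
# - get_device_capability mirrors torch.cuda.get_device_capability()

def get_device_properties(torch_device, IS_CUDA_SYSTEM, IS_ROCM_SYSTEM,
                          cuda_is_available,
                          get_device_capability=lambda: (8, 0)):
    """Return (device, major, minor), or (device, None, None) when no GPU is attached."""
    if (IS_CUDA_SYSTEM or IS_ROCM_SYSTEM) and cuda_is_available:
        # Safe: a device is actually present, so querying its capability won't raise.
        major, minor = get_device_capability()
        return (torch_device, major, minor)
    # CUDA toolkit installed but no GPU attached: fall through to the generic branch.
    return (torch_device, None, None)

# Toolkit present, no GPU attached: returns the fallback instead of raising.
print(get_device_properties("cuda", True, False, False))  # ('cuda', None, None)
# GPU present: capability is queried.
print(get_device_properties("cuda", True, False, True))   # ('cuda', 8, 0)
```

Without the "and cuda_is_available" clause, the first call above would reach get_device_capability(), which is exactly the path that raises RuntimeError on driver-only systems.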

Before submitting

  • This PR fixes a bug (non-breaking change that fixes an issue)
  • This PR is a new feature (non-breaking change that adds functionality)
  • This PR is a breaking change (fix or feature that would cause existing functionality not to work as expected)
  • This PR adds tests that prove my fix is effective or that my feature works: N/A — the crash only occurs on systems with CUDA installed but no GPU, which is not a typical CI environment; the fix is a one-line guard matching existing patterns elsewhere in the file (e.g. line 995).

torch.cuda.get_device_capability() raises RuntimeError when CUDA
is installed (IS_CUDA_SYSTEM=True) but no physical GPU is present
(torch.cuda.is_available()=False). This happens on cloud environments
like Lightning AI Studio that have CUDA drivers but no attached GPU.

Add torch.cuda.is_available() to the condition so the function falls
through to the generic else-branch (returning (torch_device, None, None))
when the CUDA/ROCm system flag is set but no device is actually available.

Fixes huggingface#45341
@Rocketknight1
Member

There was already an open PR doing basically the same thing! Please check for other PRs first before sending your agent to fix issues



Development

Successfully merging this pull request may close these issues.

A little bug in testing_utils.py
