fix(testing_utils): guard get_device_capability with torch.cuda.is_available() #45351
RudrenduPaul wants to merge 6 commits into huggingface:main
Conversation
Hi @remi-or, the change in this PR is a 2-line guard: adding `if not torch.cuda.is_available(): return (torch_device, None, None)` before the `get_device_capability()` call. Happy to look into the CircleCI logs more closely if you can confirm this is expected to be investigated before merge. Thanks!
remi-or
left a comment
LGTM! The processors failure is unrelated.
@remi-or @RudrenduPaul the current fix is neat, but doesn't it mean that if both CUDA and XPU are installed and there is no GPU, the XPU case will be ignored? Also, this was actually already the case earlier due to the if/elif pattern.
Ok, then what about this, @MHRDYN7 @RudrenduPaul? That way we escape the cuda / rocm block if cuda is not available, and we can enter the XPU block afterwards, and exit it for the same reason.
Change elif chain to separate if blocks so that when CUDA is installed but no GPU is available, the code falls through to check XPU (and then NPU). Per @remi-or's suggestion in review. Built by Rudrendu Paul, developed with Claude Code
```python
gen_mask = 0x000000FF00000000
gen = (arch & gen_mask) >> 32
return ("xpu", gen, None)
if IS_NPU_SYSTEM:
```
Can we add a TODO so that after torch 2.5.1 we also use `if hasattr(torch, 'npu') and torch.npu.is_available()` there? To stay consistent. Thanks
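For reference, the guard being requested could look like the sketch below. Here `torch` is replaced by a `SimpleNamespace` stub so the control flow can be exercised without an actual NPU build; only the `hasattr(torch, 'npu') and torch.npu.is_available()` condition comes from the comment above, the rest of the names are illustrative:

```python
from types import SimpleNamespace

def npu_branch(torch, torch_device="cpu"):
    # Mirrors the proposed guard: only enter the NPU branch when the torch
    # build actually exposes torch.npu AND a device is available.
    if hasattr(torch, "npu") and torch.npu.is_available():
        return ("npu", None, None)
    return (torch_device, None, None)

# A torch build without the npu namespace falls through to the default:
no_npu = SimpleNamespace()
print(npu_branch(no_npu))  # ('cpu', None, None)

# A build that exposes torch.npu with an available device takes the branch:
with_npu = SimpleNamespace(npu=SimpleNamespace(is_available=lambda: True))
print(npu_branch(with_npu))  # ('npu', None, None)
```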
Thanks @remi-or @MHRDYN7, that refactored structure looks great. It cleanly handles the case where both CUDA and XPU are installed but neither has a device available, and it keeps the early-import guard intact. I'll implement that pattern and push an update. I'll also dig into the CircleCI failure.
Implemented @remi-or's refactored structure. The `elif` chain is now separate `if` blocks:

```python
if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        if IS_ROCM_SYSTEM:
            return ("rocm", major, minor)
        else:
            return ("cuda", major, minor)

if IS_XPU_SYSTEM:
    import torch

    if torch.xpu.is_available():
        arch = torch.xpu.get_device_capability()["architecture"]
        ...
        return ("xpu", gen, None)

if IS_NPU_SYSTEM:
    return ("npu", None, None)

return (torch_device, None, None)
```

This handles the case @MHRDYN7 raised: if both CUDA and XPU are installed but neither has a device available, the code now falls through cleanly to check XPU (and then NPU) rather than returning early.
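The behavioral difference between the two shapes can be sketched with the availability checks stubbed out as booleans (the capability tuples here are illustrative, not real device values):

```python
def guarded_elif(cuda_system, cuda_available, xpu_system, xpu_available):
    # 2-line guard inside an if/elif chain: a CUDA install with no GPU
    # returns early, so the XPU branch is never reached.
    if cuda_system:
        if not cuda_available:
            return ("cpu", None, None)
        return ("cuda", 8, 0)
    elif xpu_system and xpu_available:
        return ("xpu", 1, None)
    return ("cpu", None, None)

def separate_ifs(cuda_system, cuda_available, xpu_system, xpu_available):
    # Refactored structure: each block only returns when its device is
    # actually usable, so control falls through to the next backend.
    if cuda_system and cuda_available:
        return ("cuda", 8, 0)
    if xpu_system and xpu_available:
        return ("xpu", 1, None)
    return ("cpu", None, None)

# CUDA installed but no GPU, XPU installed and available:
print(guarded_elif(True, False, True, True))  # ('cpu', None, None): XPU ignored
print(separate_ifs(True, False, True, True))  # ('xpu', 1, None): falls through
```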
Hey @RudrenduPaul , can you add the TODO I requested please? That way we can close this. Thanks! |
What does this PR do?
Fixes a crash in `get_device_properties()` in `testing_utils.py` when CUDA is installed on the system but no GPU device is present (e.g., a CPU-only cloud studio with CUDA libraries installed).

The function called `torch.cuda.get_device_capability()` immediately after checking `IS_CUDA_SYSTEM` (which is `True` whenever `torch.version.cuda is not None`), without first verifying that an actual GPU is available. On CUDA-installed but GPU-less systems, `get_device_capability()` raises an error.

Fixes #45341
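To make the failure mode concrete, here is a minimal sketch with the torch calls stubbed out as plain functions (none of these names are the real API; only the guard condition quoted in the Changes section comes from the patch):

```python
# Stub standing in for torch.cuda.get_device_capability(): raises when no
# GPU is present, just like the real call on a CUDA-installed, GPU-less box.
def get_device_capability(gpu_present):
    if not gpu_present:
        raise RuntimeError("No CUDA GPUs are available")
    return (8, 0)

# Pre-fix control flow: only checks that CUDA libraries are installed
# (the IS_CUDA_SYSTEM condition), never that a device actually exists.
def device_properties_unguarded(cuda_installed, gpu_present):
    if cuda_installed:
        major, minor = get_device_capability(gpu_present)  # crashes if no GPU
        return ("cuda", major, minor)
    return ("cpu", None, None)

# Post-fix control flow: the 2-line guard, standing in for
# `if not torch.cuda.is_available(): return (torch_device, None, None)`.
def device_properties_guarded(cuda_installed, gpu_present, torch_device="cpu"):
    if cuda_installed:
        if not gpu_present:
            return (torch_device, None, None)
        major, minor = get_device_capability(gpu_present)
        return ("cuda", major, minor)
    return (torch_device, None, None)
```

With CUDA installed but no GPU, the unguarded version raises while the guarded one falls back to the default device tuple.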
Changes
`src/transformers/testing_utils.py`: Add an `if not torch.cuda.is_available(): return (torch_device, None, None)` guard inside the `IS_CUDA_SYSTEM or IS_ROCM_SYSTEM` branch of `get_device_properties()`, before the `get_device_capability()` call.
Tests
This is a fix to the test infrastructure itself (`testing_utils.py`). The change prevents a crash that occurs in environments where `IS_CUDA_SYSTEM` is `True` but no physical GPU is present (e.g., running `pytest` on a CPU-only Lightning AI studio).
No new tests were added because the existing test suite runs in environments where `torch.cuda.is_available()` is `True`; the crash scenario only reproduces on CUDA-installed, no-GPU systems.