
fix(testing_utils): guard get_device_capability with torch.cuda.is_available()#45351

Open
RudrenduPaul wants to merge 6 commits into huggingface:main from RudrenduPaul:fix/testing-utils-cuda-available-check

Conversation

@RudrenduPaul
Contributor

What does this PR do?

Fixes a crash in get_device_properties() in testing_utils.py when CUDA is installed on the system but no GPU device is present (e.g., a CPU-only cloud studio with CUDA libraries installed).

The function called torch.cuda.get_device_capability() immediately after checking IS_CUDA_SYSTEM (which is True whenever torch.version.cuda is not None), without first verifying that an actual GPU is available. On CUDA-installed but GPU-less systems, get_device_capability() raises an error.
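The mismatch can be sketched with a stand-in object instead of the real torch, so it runs anywhere. Here IS_CUDA_SYSTEM mirrors the real check (torch.version.cuda is not None), which is True for any CUDA-enabled torch build, even on a machine with no GPU attached; the FakeTorch class and its values are hypothetical.

```python
class FakeTorch:
    """Stand-in for torch on a CUDA-build, GPU-less machine (hypothetical)."""
    class version:
        cuda = "12.1"  # CUDA toolkit version baked into the wheel

    class cuda:
        @staticmethod
        def is_available():
            return False  # no physical device present

# Mirrors the real IS_CUDA_SYSTEM definition in testing_utils.py
IS_CUDA_SYSTEM = FakeTorch.version.cuda is not None

# True + False together is exactly the state that crashed the old code:
print(IS_CUDA_SYSTEM, FakeTorch.cuda.is_available())  # True False
```

On such a machine the old code would proceed past the IS_CUDA_SYSTEM check and call get_device_capability(), which raises because no device exists.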

Fixes #45341

Changes

  • src/transformers/testing_utils.py: Add if not torch.cuda.is_available(): return (torch_device, None, None) guard inside the IS_CUDA_SYSTEM or IS_ROCM_SYSTEM branch of get_device_properties(), before the get_device_capability() call.
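A runnable sketch of the guarded branch. The real function lives in src/transformers/testing_utils.py; here torch is passed in as a stub and the signature is simplified for illustration, so the parameter names are assumptions.

```python
def get_device_properties(torch, is_cuda_system, is_rocm_system, torch_device="cpu"):
    """Simplified stand-in for the patched function in testing_utils.py."""
    if is_cuda_system or is_rocm_system:
        if not torch.cuda.is_available():  # the guard this PR adds
            return (torch_device, None, None)
        major, minor = torch.cuda.get_device_capability()
        return ("rocm" if is_rocm_system else "cuda", major, minor)
    return (torch_device, None, None)

class NoGPUTorch:
    """CUDA build, no device: is_available() is False, capability would raise."""
    class cuda:
        @staticmethod
        def is_available():
            return False

        @staticmethod
        def get_device_capability():
            raise RuntimeError("No CUDA GPUs are available")

# Previously this call crashed; with the guard it degrades gracefully:
print(get_device_properties(NoGPUTorch, True, False))  # ('cpu', None, None)
```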

Tests

This is a fix to the test infrastructure itself (testing_utils.py). The change prevents a crash that occurs in environments where IS_CUDA_SYSTEM=True but no physical GPU is present (e.g., running pytest on a CPU-only Lightning AI studio).

No new tests were added because the existing test suite runs in environments where torch.cuda.is_available() is True — the crash scenario only reproduces on CUDA-installed, no-GPU systems.

Note: This PR was developed with AI assistance (Claude Code). I have reviewed every line and understand the change. This is not a duplicate of any existing open PR: I checked open PRs for issue 45341 in the body and ran keyword searches for get_device_capability + is_available.

@Rocketknight1
Member

cc @remi-or to this as well as #45341, feel free to merge this if you're happy with it!

@RudrenduPaul
Contributor Author

Hi @remi-or — the run_tests CircleCI check is showing a failure. Investigating whether this is related to this PR or a pre-existing issue on main.

The change in this PR is a two-line guard: adding a torch.cuda.is_available() check before the get_device_capability() call in testing_utils.py. It should only affect test utilities when CUDA is installed but no GPU is present; it shouldn't affect the processors tests at all.

Happy to look into the CircleCI logs more closely if you can confirm this is expected to be investigated before merge. Thanks!

@remi-or (Collaborator)
LGTM! The processors failure is unrelated.

@MHRDYN7
Contributor

MHRDYN7 commented Apr 13, 2026

@remi-or @RudrenduPaul the current fix is neat, but doesn't it mean that if both CUDA and XPU are installed and there is no GPU, the XPU case will be ignored? (To be fair, this was already the case before, due to the if/elif pattern.)

@remi-or
Collaborator

remi-or commented Apr 14, 2026

Ok, then what about this @MHRDYN7 @RudrenduPaul

    if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
        import torch

        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability()
            if IS_ROCM_SYSTEM:
                return ("rocm", major, minor)
            else:
                return ("cuda", major, minor)
    if IS_XPU_SYSTEM:
        import torch

        if torch.xpu.is_available():
            ...

That way we escape the CUDA / ROCm block if CUDA is not available, can enter the XPU block afterwards, and exit it for the same reason.
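The fall-through behaviour of this separate-if pattern can be sketched with the system flags and device availability stubbed as booleans, so no torch is needed; the return values are placeholders, not real capability numbers.

```python
def get_device_properties(cuda_system, xpu_system, cuda_available, xpu_available,
                          torch_device="cpu"):
    """Stubbed sketch of the separate-if pattern (flags replace real detection)."""
    if cuda_system:
        if cuda_available:
            return ("cuda", 9, 0)   # placeholder capability values
    if xpu_system:                  # reached even when cuda_system is True
        if xpu_available:
            return ("xpu", 1, None)
    return (torch_device, None, None)

# CUDA toolkit installed but no GPU, XPU present with a device:
# the old elif chain returned early; the new pattern falls through to XPU.
print(get_device_properties(True, True, False, True))  # ('xpu', 1, None)
```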

@remi-or remi-or self-requested a review April 14, 2026 00:46
RudrenduPaul and others added 2 commits April 13, 2026 17:49
…blocks

Change elif chain to separate if blocks so that when CUDA is installed
but no GPU is available, the code falls through to check XPU (and then NPU).
Per @remi-or's suggestion in review.

Built by Rudrendu Paul, developed with Claude Code
Review comment on src/transformers/testing_utils.py:

        gen_mask = 0x000000FF00000000
        gen = (arch & gen_mask) >> 32
        return ("xpu", gen, None)
    if IS_NPU_SYSTEM:

@remi-or (Collaborator)
Can we add a TODO so that after torch 2.5.1 we also use if hasattr(torch, 'npu') and torch.npu.is_available() there, to stay consistent? Thanks!

@RudrenduPaul
Contributor Author

Thanks @remi-or @MHRDYN7 — that refactored structure looks great. It cleanly handles the case where both CUDA and XPU are installed but neither has a device available, and it keeps the early-import guard intact. I'll implement that pattern and push an update.

I'll also dig into the tests_torch / run_tests failure to confirm whether it's related to this change or a pre-existing flake on main.

@RudrenduPaul
Contributor Author

Implemented @remi-or's refactored structure — the elif chain has been replaced with separate if blocks so CUDA/ROCm and XPU paths are fully independent:

if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    import torch
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        if IS_ROCM_SYSTEM:
            return ("rocm", major, minor)
        else:
            return ("cuda", major, minor)
if IS_XPU_SYSTEM:
    import torch
    if torch.xpu.is_available():
        arch = torch.xpu.get_device_capability()["architecture"]
        ...
        return ("xpu", gen, None)
if IS_NPU_SYSTEM:
    return ("npu", None, None)
return (torch_device, None, None)

This handles the case @MHRDYN7 raised — if both CUDA and XPU are installed but neither has a device available, the code now falls through cleanly to check XPU (and then NPU) rather than returning early.
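A quick check of the XPU generation extraction shown above: the mask selects the fifth byte (bits 32-39) of the architecture word and shifts it down 32 bits. The architecture value here is synthetic, chosen so the generation byte is 0x2A; real values come from torch.xpu.get_device_capability().

```python
# Same mask and shift as in the snippet above, applied to a synthetic value.
gen_mask = 0x000000FF00000000
arch = 0x0000002A12345678  # synthetic architecture word, generation byte = 0x2A
gen = (arch & gen_mask) >> 32
print(hex(gen))  # 0x2a
```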

@remi-or
Collaborator

remi-or commented Apr 22, 2026

Hey @RudrenduPaul , can you add the TODO I requested please? That way we can close this. Thanks!


Development

Successfully merging this pull request may close these issues.

A little bug in testing_utils.py

4 participants