Fix CUDA errors in sharded generation with Qwen3#41734

Open
SrijanUpadhyay wants to merge 1 commit into huggingface:main from
SrijanUpadhyay:fix-sharded-generation-nans
Conversation

@SrijanUpadhyay
Contributor

Issue #41720: CUDA asserts during multi-GPU generation with Qwen3 models due to NaN/Inf in hidden states.

Changes:

  • Enhanced InfNanRemoveLogitsProcessor to handle hidden-state stabilization
  • Added automatic remove_invalid_values=True for sharded models
  • Removed direct NaN handling from the Qwen3 model for a cleaner architecture

Fixes #41720
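For context, a minimal pure-Python sketch of the sanitization that InfNanRemoveLogitsProcessor applies to each row of logits (this mirrors the documented behavior, NaN becomes 0.0 and +Inf becomes the dtype maximum, but is not the transformers implementation itself, which operates on torch tensors):

```python
import math

# torch.finfo(torch.float32).max, hard-coded here to stay dependency-free
FLOAT32_MAX = 3.4028235e38

def remove_invalid_values(logits):
    """Replace NaN logits with 0.0 and +Inf logits with the finite
    float32 maximum, so downstream softmax/multinomial never sees
    invalid values. Pure-Python sketch of the processor's behavior."""
    cleaned = []
    for x in logits:
        if math.isnan(x):
            cleaned.append(0.0)          # NaN -> 0.0
        elif x == math.inf:
            cleaned.append(FLOAT32_MAX)  # +Inf -> finite max
        else:
            cleaned.append(x)
    return cleaned

print(remove_invalid_values([1.5, float("nan"), float("inf"), -2.0]))
```

Note that this only cleans the final logits; if the NaN/Inf originates earlier in the hidden states (as the linked issue describes), sanitizing logits alone may not be sufficient.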

@SrijanUpadhyay
Contributor Author

Hey @vasqu, I have made these changes. Please take a look and give me feedback on this PR.

@Bobchenyx

Hi there, thanks for this potential fix. I'm truly grateful that you're taking the time to look into this issue. I pulled and built your branch locally, but I'm still running into a similar problem.
I've attached the logs and error messages I'm seeing below, in case that helps with debugging.

(moe-pwe) user1@nnmc67:~/workspace/bobchenyx/MoE-PWE$ CUDA_VISIBLE_DEVICES=0,1 python qwen3-generate.py 
Loading model from: ../Qwen/Qwen3-30B-A3B-Instruct-2507
Using device: cuda
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 16/16 [00:17<00:00,  1.11s/it]
Model loaded successfully!

Prompt: Explain the concept of Mixture of Experts(MoE) in a few sentences.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/home/user1/workspace/bobchenyx/MoE-PWE/qwen3-generate.py", line 78, in <module>
    main()
  File "/home/user1/workspace/bobchenyx/MoE-PWE/qwen3-generate.py", line 42, in main
    outputs = model.generate(
  File "/home/user1/miniconda3/envs/moe-pwe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/user1/miniconda3/envs/moe-pwe/lib/python3.10/site-packages/transformers/generation/utils.py", line 2695, in generate
    result = decoding_method(
  File "/home/user1/miniconda3/envs/moe-pwe/lib/python3.10/site-packages/transformers/generation/utils.py", line 2903, in _sample
    while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
  File "/home/user1/miniconda3/envs/moe-pwe/lib/python3.10/site-packages/transformers/generation/utils.py", line 2721, in _has_unfinished_sequences
    elif this_peer_finished:
torch.AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(moe-pwe) user1@nnmc67:~/workspace/bobchenyx/MoE-PWE$ pip show transformers
Name: transformers
Version: 5.0.0.dev0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/user1/miniconda3/envs/moe-pwe/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm, typer-slim
Required-by: 
(moe-pwe) user1@nnmc67:~/workspace/bobchenyx/MoE-PWE$ 
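The assert in the log above (`probability tensor contains either inf, nan or element < 0`) fires inside `torch.multinomial` during sampling. A minimal sketch of the validity condition it enforces (an illustration of the check, not PyTorch's actual kernel code):

```python
import math

def probs_are_valid(probs):
    """A probability tensor passed to torch.multinomial must contain
    only finite, non-negative values; otherwise the CUDA kernel
    raises a device-side assert like the one in the log above."""
    return all(math.isfinite(p) and p >= 0.0 for p in probs)

print(probs_are_valid([0.2, 0.8]))           # valid distribution
print(probs_are_valid([0.5, float("nan")]))  # triggers the assert
```

Because the assert is raised asynchronously on the device, the Python traceback points at an unrelated line (`_has_unfinished_sequences`); rerunning with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests, gives a more accurate stack.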


Development

Successfully merging this pull request may close these issues.

Qwen3 with auto device mapping fails due to cudaErrorAssert on A800