fix(generation): handle CUDA multinomial limit in beam search sampling#45369
Closed
sharziki wants to merge 1 commit into huggingface:main from
Conversation
`torch.multinomial` on CUDA requires the last dimension to be <= 2^24. With large `num_beams * vocab_size` (e.g. 128 * 164K = 21M), this limit is exceeded, causing a `RuntimeError`. Pre-filter to the top 2^24 candidates via `torch.topk` before sampling when necessary.

Fixes huggingface#45245

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45369&sha=838fbd
Member
Hi @sharziki, as commented in the issue I don't think we need extra code paths to solve what is a very rare edge case. If you're doing
Summary
Fixes #45245: `torch.multinomial` crashes with `RuntimeError: number of categories cannot exceed 2^24` when `num_beams * vocab_size > 16,777,216` during beam search with `do_sample=True`.

**Root cause:** In `_get_top_k_continuations()`, the accumulated log-probs are flattened to shape `(batch_size, num_beams * vocab_size)` and passed directly to `torch.multinomial`. With large beam counts (e.g. 128) and large vocabularies (e.g. 164K), this exceeds PyTorch's CUDA limit of 2^24 categories.

**Fix:** When the flattened dimension exceeds 2^24, pre-filter to the top 2^24 candidates using `torch.topk` (which has no such limit), then sample from the filtered set. The candidate indices are mapped back to the original space. This preserves the sampling distribution: with 16.7M out of ~21M candidates retained, virtually all probability mass is covered.

The fix is 7 net new lines. No new files, no new dependencies, no behavioral change for users within the limit.
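The pre-filter approach can be sketched roughly as follows. This is a hypothetical standalone version, not the actual code in `_get_top_k_continuations()`; the function name and the `max_categories` parameter (defaulting to the CUDA limit, but overridable so the top-k branch can be exercised on small inputs) are illustrative:

```python
import torch

# torch.multinomial's category limit on CUDA (2^24 = 16,777,216)
CUDA_MULTINOMIAL_MAX = 2**24


def sample_with_topk_prefilter(
    logprobs: torch.Tensor,
    num_samples: int,
    max_categories: int = CUDA_MULTINOMIAL_MAX,
) -> torch.Tensor:
    """Sample indices from (batch, num_beams * vocab_size) log-probs,
    pre-filtering with topk when the last dim exceeds the limit."""
    if logprobs.shape[-1] > max_categories:
        # torch.topk has no 2^24 limit: keep the most probable candidates
        top_logprobs, top_idx = torch.topk(logprobs, max_categories, dim=-1)
        sampled = torch.multinomial(top_logprobs.softmax(dim=-1), num_samples)
        # map positions within the filtered set back to the original flat space
        return torch.gather(top_idx, -1, sampled)
    # within the limit: sample directly, as before
    return torch.multinomial(logprobs.softmax(dim=-1), num_samples)
```

The returned indices live in the original flattened `num_beams * vocab_size` space in both branches, so downstream beam/token decoding (dividing and modding by `vocab_size`) is unchanged.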
Coordination
Test plan
- `model.generate(num_beams=128, do_sample=True)` no longer crashes with large-vocab models
- The common case (`num_beams < 2^24 / vocab_size`) is unaffected (takes the else branch)
- `ruff check src/transformers/generation/utils.py` passes

🤖 Generated with Claude Code