fix(generation): handle CUDA multinomial limit in beam search sampling#45369
Closed
sharziki wants to merge 1 commit into huggingface:main from
Conversation
`torch.multinomial` on CUDA requires the last dimension to be <= 2^24. With large `num_beams * vocab_size` (e.g. 128 * 164K = 21M), this limit is exceeded, causing a `RuntimeError`. Pre-filter to the top 2^24 candidates via `torch.topk` before sampling when necessary.

Fixes huggingface#45245

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45369&sha=838fbd
Member
Hi @sharziki, as commented in the issue I don't think we need extra code paths to solve what is a very rare edge case. If you're doing
Summary
Fixes #45245: `torch.multinomial` crashes with `RuntimeError: number of categories cannot exceed 2^24` when `num_beams * vocab_size > 16,777,216` during beam search with `do_sample=True`.

**Root cause:** In `_get_top_k_continuations()`, the accumulated log-probs are flattened to shape `(batch_size, num_beams * vocab_size)` and passed directly to `torch.multinomial`. With large beam counts (e.g. 128) and large vocabularies (e.g. 164K), this exceeds PyTorch's CUDA limit of 2^24 categories.

**Fix:** When the flattened dimension exceeds 2^24, pre-filter to the top 2^24 candidates using `torch.topk` (which has no such limit), then sample from the filtered set. The candidate indices are mapped back to the original space. This preserves the sampling distribution: with 16.7M out of ~21M candidates retained, virtually all probability mass is covered.

The fix is 7 net new lines. No new files, no new dependencies, no behavioral change for users within the limit.
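The pre-filter approach can be sketched roughly as follows. This is a hypothetical standalone version, not the actual code in `_get_top_k_continuations()`; the function name and the `max_categories` parameter (defaulting to the CUDA limit, but overridable so the top-k branch can be exercised on small inputs) are illustrative:

```python
import torch

# torch.multinomial's category limit on CUDA (2^24 = 16,777,216)
CUDA_MULTINOMIAL_MAX = 2**24


def sample_with_topk_prefilter(
    logprobs: torch.Tensor,
    num_samples: int,
    max_categories: int = CUDA_MULTINOMIAL_MAX,
) -> torch.Tensor:
    """Sample indices from (batch, num_beams * vocab_size) log-probs,
    pre-filtering with topk when the last dim exceeds the limit."""
    if logprobs.shape[-1] > max_categories:
        # torch.topk has no 2^24 limit: keep the most probable candidates
        top_logprobs, top_idx = torch.topk(logprobs, max_categories, dim=-1)
        sampled = torch.multinomial(top_logprobs.softmax(dim=-1), num_samples)
        # map positions within the filtered set back to the original flat space
        return torch.gather(top_idx, -1, sampled)
    # within the limit: sample directly, as before
    return torch.multinomial(logprobs.softmax(dim=-1), num_samples)
```

The returned indices live in the original flattened `num_beams * vocab_size` space in both branches, so downstream beam/token decoding (dividing and modding by `vocab_size`) is unchanged.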
Coordination
Test plan
- `model.generate(num_beams=128, do_sample=True)` no longer crashes with large-vocab models
- The common case (`num_beams < 2^24 / vocab_size`) is unaffected (takes the else branch)
- `ruff check src/transformers/generation/utils.py` passes

🤖 Generated with Claude Code