
CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 #21181

Merged
CISC merged 2 commits into ggml-org:master from ORippler:osimons/cuda_fix_argsort_ceildiv_for_cccl_lt_3dot1
Mar 30, 2026

Conversation

@ORippler (Collaborator) commented Mar 30, 2026

Overview

We wrongly calculated offset_grid as ceildiv(nrows, block_size), while it must be ceildiv(nrows + 1, block_size). As a consequence, we had uninitialized values in offset_iterator[nrows] for the case when nrows % block_size == 0.

This bug affected CCCL < 3.1 only, as newer CCCL versions take the strided_iterator path instead.

Additional information

Fixes #21162

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, to ask some clarifying questions during implementation.

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162
@ORippler ORippler requested review from a team and ggerganov as code owners March 30, 2026 12:59
@github-actions github-actions bot added labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) on Mar 30, 2026
@CISC (Member) commented Mar 30, 2026

Maybe I misunderstood, but didn't @fairydreaming say that it crashed with CUB disabled?

@ORippler (Collaborator, Author) commented Mar 30, 2026

Maybe I misunderstood, but didn't @fairydreaming say that it crashed with CUB disabled?

Quoting from the issue

@fairydreaming How exactly did you configure your CUDA build? On CTK 13.1 and CCCL 3.1, tests pass for me.

@ORippler I tested:
CUDA 12.8, no CCCL -> crash
CUDA 13.2 and CCCL 13.2.27 -> all OK

CUDA Toolkit (CTK) >= 11.0 has always come bundled with CUB, which was later integrated into CCCL.

So in CTK 12.8 we have CCCL's CUB components available (see here for the compile-time check, which limits CUB availability to CTK >= 11.7). I see no way to disable this via CMake.

Moreover, when CUB is unavailable we correctly limit the support surface to bitonic sort in ggml_backend_cuda_device_supports_op.

CTK 12.8 comes bundled with CUB 2.7, which does not yet have strided_iterator available (this was added in CCCL 3.1, bundled with CTK 13.1). CTK 12.8 thus entered the explicit offset_index path, which was bugged exactly for nrows % block_size == 0 (where block_size = 256 currently).

@CISC (Member) commented Mar 30, 2026

@ORippler My confusion may stem from the statement in the related PR:

Due to limitations of the current CUDA ggml_top_k() implementation, the NVIDIA CUDA CCCL library (version > 3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation are needed; otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.

Either way, if @fairydreaming can confirm this completely fixes the issue we're good.

@fairydreaming (Collaborator)

Either way, if @fairydreaming can confirm this completely fixes the issue we're good.

@CISC I confirm that it's all OK with this PR, no more crashes observed, tested up to TOP_K(type=f32,ne=[164352,2048,1,1],k=2048,ties=0).

@CISC CISC merged commit 64ac9ab into ggml-org:master Mar 30, 2026
45 of 46 checks passed
@ORippler ORippler deleted the osimons/cuda_fix_argsort_ceildiv_for_cccl_lt_3dot1 branch March 30, 2026 14:20
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: CUDA ggml_top_k() implementation crashes for large tensor shapes

4 participants