
CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 #21181

Merged
CISC merged 2 commits into ggml-org:master from ORippler:osimons/cuda_fix_argsort_ceildiv_for_cccl_lt_3dot1
Mar 30, 2026

Conversation

@ORippler (Collaborator) commented Mar 30, 2026

Overview

We wrongly calculated offset_grid as ceildiv(nrows, block_size), while it must be ceildiv(nrows + 1, block_size). As a consequence, we had uninitialized values in offset_iterator[nrows] for the case when nrows % block_size == 0.

This bug affected CCCL < 3.1 only, as newer CCCL versions take the strided_iterator path instead.

Additional information

Fixes #21162

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, to ask some clarifying questions during implementation.

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162
@ORippler ORippler requested review from a team and ggerganov as code owners March 30, 2026 12:59
@github-actions github-actions bot added labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) on Mar 30, 2026
@CISC (Member) commented Mar 30, 2026

Maybe I misunderstood, but didn't @fairydreaming say that it crashed with CUB disabled?

@ORippler (Collaborator, Author) commented Mar 30, 2026

Maybe I misunderstood, but didn't @fairydreaming say that it crashed with CUB disabled?

Quoting from the issue

@fairydreaming How exactly did you configure your CUDA build? On CTK 13.1 and CCCL 3.1, tests pass for me.

@ORippler I tested:
CUDA 12.8, no CCCL -> crash
CUDA 13.2 and CCCL 13.2.27 -> all OK

CUDA Toolkit (CTK) >= 11.0 has always come bundled with CUB, which was later integrated into CCCL.

So in CTK 12.8 we have CCCL's CUB components available (see here for the compile-time check, which limits CUB availability to CTK >= 11.7). I see no way to disable this via CMake.

Moreover, when CUB is unavailable we correctly limit the support surface to bitonic sort in ggml_backend_cuda_device_supports_op.

CTK 12.8 comes bundled with CUB 2.7, which does not yet have strided_iterator available (this was added in CCCL 3.1, bundled with CTK 13.1). CTK 12.8 thus entered the explicit offset_index path, which was bugged exactly for nrows % block_size == 0 (where block_size = 256 currently).

@CISC (Member) commented Mar 30, 2026

@ORippler My confusion may stem from the statement in the related PR:

Due to limitations of the current CUDA ggml_top_k() implementation, the NVIDIA CUDA CCCL library (version > 3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation are needed; otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.

Either way, if @fairydreaming can confirm this completely fixes the issue we're good.

@fairydreaming (Collaborator)

Either way, if @fairydreaming can confirm this completely fixes the issue we're good.

@CISC I confirm that it's all OK with this PR, no more crashes observed, tested up to TOP_K(type=f32,ne=[164352,2048,1,1],k=2048,ties=0).

@CISC CISC merged commit 64ac9ab into ggml-org:master Mar 30, 2026
45 of 46 checks passed
@ORippler ORippler deleted the osimons/cuda_fix_argsort_ceildiv_for_cccl_lt_3dot1 branch March 30, 2026 14:20
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: CUDA ggml_top_k() implementation crashes for large tensor shapes

4 participants