CUDA: Limit DeviceSegmentedSort to immediate mode #21718
ORippler merged 2 commits into ggml-org:master
Conversation
DeviceSegmentedSort is currently not capturable in a CUDA graph. Hence, we have to fall back to the slower DeviceSegmentedRadixSort in that case.

Perf numbers on RTX Pro 6000 Blackwell Max-Q:

DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs):

```
ARGSORT(type=f32,ne=[2048,512,1,1],order=1):  12291 runs -  105.94 us/run -   8192 kB/run -  73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1):  10245 runs -  115.08 us/run -  16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1):   5125 runs -  221.22 us/run -  32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1):  2565 runs -  430.98 us/run -  65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1):  1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1):   387 runs - 2748.62 us/run - 262144 kB/run -  90.95 GB/s
```

DeviceSegmentedSort in immediate mode:

```
ARGSORT(type=f32,ne=[2048,512,1,1],order=1):  16388 runs -   71.17 us/run -   8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1):  12294 runs -   81.38 us/run -  16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1):   5125 runs -  240.81 us/run -  32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1):  2565 runs -  406.60 us/run -  65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1):  1285 runs -  873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1):   516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s
```
We currently lack a way to force graph mode in CUDA; patch the callback to invoke `ggml_backend_compare_graph_backend` twice so that each test is forced to run in graph mode.
Looks good to me! One thing I'm not sure about is the wild inconsistency in code formatting, which may force some poor OCD sufferers to seek therapy. 😵💫 But that's a minor issue.
am17an
left a comment
We also have the env variable `GGML_CUDA_DISABLE_GRAPHS`, which this PR doesn't seem to respect.
Care to elaborate? AFAIK
am17an
left a comment
Ah yes, I see. You're right, this should be fine
* CUDA: Limit DeviceSegmentedSort to immediate mode
* Add test case for dispatch to DeviceSegmentedRadixSort
Overview
DeviceSegmentedSort is currently not capturable in a CUDA graph. Hence, we have to fall back to the slower DeviceSegmentedRadixSort in that case.
Closes #21682
Additional information
There is no way to force graph mode in the CUDA backend at the moment: we execute each graph only once in
`ggml_backend_compare_graph_backend`, and depending on how the host OS allocates the first node of subsequent test cases' `ggml_cgraph`s, we currently get an arbitrary mix of some tests running in graph mode and some in immediate mode in `test-backend-ops`. While we do have a way to force immediate mode via `GGML_CUDA_DISABLE_GRAPHS`, I feel there may be a need to force graph mode for testing purposes.

I did a local run where I patched
`ggml_backend_compare_graph_backend` to evaluate each graph twice, triggering CUDA graph warmup reliably for each test config of argsort.

Requirements