cuda: refactored ssm_scan and use CUB by Your-Cheese · Pull Request #13291 · ggml-org/llama.cpp

Your-Cheese · 2025-05-04T02:11:59Z

I modified the structure of the CUDA kernel for the ssm scan such parallelization is performed per thread across the channel dimension (D). This allows A and the initial state (s0) to be loaded into registers and reused across the sequence (L) and SSM state dimensions (N). Additionally, B and C can be loaded into shared memory since blocks process the same timestep in parallel. I also added another CUDA kernel specifically for a sequence length of 1 (recurrent mode) in order to reduce the number of registers used by removing the loop over the sequence dimension.

I'm unsure about optimizing the number of threads per block or the minimum number of blocks per multiprocessor in the launch bounds, however, so I left them as is.

Benchmarks

I got the following results with the following test cases added to test-backend-ops.cpp.

for (int seq_len : {1, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 }) {
        for (int batch_size : { 1, 2, 4, 8, }) {
            test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1536, seq_len, batch_size));
        }
    }

./bin/test-backend-ops perf -o SSM_SCAN -b CUDA0

Hardware: Intel i7-13700K, Nvidia RTX 3090
Raw output:
cpu.txt
original_cuda.txt
improved_cuda.txt
improved_cuda_no_cub.txt

llama-bench

./bin/llama-bench -m falcon-mamba-7B-BF16.gguf

Original:

  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           pp512 |       2624.38 ± 2.44 |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           tg128 |         39.67 ± 0.22 |

Improved:

  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           pp512 |      2748.06 ± 11.23 |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           tg128 |         40.64 ± 0.11 |

Improved (No CUB):

  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           pp512 |       2723.28 ± 6.63 |
| mamba ?B BF16                  |  14.10 GiB |     7.27 B | CUDA       |  99 |           tg128 |         39.68 ± 0.15 |

JohannesGaessler

Consider deduplicating the code by adding an optional template parameter for L. Ignore the template parameter if it's 0, otherwise use it instead of the runtime parameter (add #pragma unroll to the loops over L). Adding additional template specializations for L <= 8 would likely also improve performance. You can look at softmax.cu for an example.

If you are not doing this already, my recommendation for optimizing CUDA performance would be to first use NVIDIA NSight Systems to identify which kernels take up a large percentage of the total runtime (and are thus worth optimizing). Then you can use NVIDIA NSight Compute to get a detailed breakdown of a specific kernel and to identify bottlenecks. For this kernel I assume the bottleneck is I/0.

JohannesGaessler · 2025-05-06T11:18:46Z

+    const float *s0_block = (const float *)((const char *)src0 + blockIdx.x * src0_nb2 + blockIdx.y * splitD * src0_nb1);
+    const float *x_block = (const float *)((const char *)src1 + (blockIdx.x * src1_nb2) + blockIdx.y * splitD * sizeof(float));
+    const float *dt_block = (const float *)((const char *)src2 + (blockIdx.x * src2_nb2) + blockIdx.y * splitD * sizeof(float));
+    const float *A_block = (const float *)((const char *)src3 + blockIdx.y * splitD * src3_nb1);
+    const float *B_block = (const float *)((const char *)src4 + (blockIdx.x * src4_nb2));
+    const float *C_block = (const float *)((const char *)src5 + (blockIdx.x * src5_nb2));
+    float *y_block = (float *)((char *)dst + (blockIdx.x * src1_nb2) + blockIdx.y * splitD * sizeof(float));
+    float *s_block = (float *)((char *)dst + src1_nb3 + blockIdx.x * src0_nb2 + blockIdx.y * splitD * src0_nb1);


In GPU code there can be performance issues if you cast to char *, do pointer arithmetic, and then cast back to float *. But since this is only done once here it should be fine and in my experience this mostly affects the HIP port for AMD anyways.

JohannesGaessler · 2025-05-06T11:37:06Z

 #include "ssm-scan.cuh"

 template <size_t splitD, size_t N>
 __global__ void __launch_bounds__(splitD, 2)


In CUDA there are 64k registers per SM and each thread can at most use 255 registers. So with 128 threads the occupancy limit in terms of registers is 4 and telling the compiler to limit register usage in order to fit 2 blocks effectively tells it to just use as many registers as it wants. You could maybe change the args to (splitD, 1) to make this a little clearer but I think it's also fine as-is.

I could just remove it if it's not doing anything then, so it would be (splitD) only.

No, this does in fact do something. The compiler is by default very conservative with how many registers it uses because this avoids the worst-performing cases but it also leaves potential performance on the table. If you explicitly tell the compiler to use as many registers as it wants the performance can be better (for this kernel it probably doesn't matter anyways).

Oh, I see that's why the register count used was 64 if I removed it. It does seem to make a small difference in performance. I'll change it to 1 since there doesn't seem to be a difference from 2 in the generated assembly.

JohannesGaessler · 2025-05-06T11:38:06Z

+        regA[n] = A_block[threadIdx.x * stride_A + n];
+        regs0[n] = s0_block[threadIdx.x * stride_s0 + n];


The memory access pattern here is inefficient though I also wouldn't know how to improve it.

Does the problem lie in that the loads aren't coalesced? Wouldn't using a coalesced loading pattern require the data to be in a different layout?

Yes, the problem is the uncoalesced I/O. If you could somehow re-write the kernel to make the loads coalesced or change the memory pattern the previous kernel puts out the performance would likely be better. (I did not try to analyze whether something like this is possible.)

JohannesGaessler · 2025-05-06T11:40:43Z

+#pragma unroll
+    for (size_t n = 0; n < N; ++n)
+    {
+        s_block[threadIdx.x * stride_s + n] = regs0[n];


The memory access pattern here is also inefficient.

Your-Cheese · 2025-05-11T01:59:17Z

Consider deduplicating the code by adding an optional template parameter for L. Ignore the template parameter if it's 0, otherwise use it instead of the runtime parameter (add #pragma unroll to the loops over L). Adding additional template specializations for L <= 8 would likely also improve performance. You can look at softmax.cu for an example.

If you are not doing this already, my recommendation for optimizing CUDA performance would be to first use NVIDIA NSight Systems to identify which kernels take up a large percentage of the total runtime (and are thus worth optimizing). Then you can use NVIDIA NSight Compute to get a detailed breakdown of a specific kernel and to identify bottlenecks. For this kernel I assume the bottleneck is I/0.

Here's what I saw when profiling with NSight Systems, and I think the speedup is visible, though it doesn't make too much of an impact overall since the ssm_scan only occupies around 10% of the overall execution time anyways.

It's more significant in prompt processing since the scan takes up a larger portion of time there, while it doesn't make much of a difference on text generation.

Original:

Improved:

I tried using the template parameter for L to deduplicate the functions, but it seems to have reduced the number of registers being used alongside slightly reducing the performance. Is there something I'm missing in how I'm using a template for L?

With 1 min block per MP:

WIthout min blocks per MP:

JohannesGaessler · 2025-08-06T09:15:43Z

Sorry, I kind of forgot about this PR. Regardless of whether or not this code is perfect, I don't remember there being any major issues with it, and it does provide a speedup over master. Are there still things you want to do or should we move towards merging it?

JohannesGaessler · 2025-08-06T09:20:57Z

Also there is a concurrent PR touching the code: #15101 . Can you check whether that PR conflicts with yours?

Your-Cheese · 2025-08-06T12:20:35Z

Also there is a concurrent PR touching the code: #15101 . Can you check whether that PR conflicts with yours?

It makes the same change as I did of using registers instead of shared memory to store A and s0, so the issue that PR solves would also be fixed by merging this one.

Your-Cheese · 2025-08-06T12:25:25Z

Sorry, I kind of forgot about this PR. Regardless of whether or not this code is perfect, I don't remember there being any major issues with it, and it does provide a speedup over master. Are there still things you want to do or should we move towards merging it?

Outside of any possible issues with the style of the code, I think it's fine to merge at this point.

IMbackK · 2025-08-06T14:30:00Z


-    __syncthreads();
+#pragma unroll
+    for (size_t i = 0; i < L; i++)


L is not known at compile time in the L_template == 0 case here, which means the #pragma unroll causes a warning when this is compiled via llvm.
At least for llvm, you can just remove the pragma as the compiler unrolls this loop anyhow for the L_template != 0 case.

I tried removing the #pragma unroll and compared the output from Nsight Compute after running a quick test to make sure again. It makes a difference for CUDA, even in the case where L isn't known at compile time for some reason. Without explicitly unrolling the loop, it uses 2 more registers per thread. I could suppress the warning like in softmax.cu where the same sort of thing is done.

In my experience the CUDA compiler is very conservative when it comes to unrolling loops so my preference would definitely be to keep the #pragma unroll and suppress the warning.

It makes a difference for CUDA, even in the case where L isn't known at compile time for some reason. Without explicitly unrolling the loop, it uses 2 more registers per thread.

Thats really strange and sounds like a mild compiler bug.
Anyhow, suppressing the warning is sufficant for me.

Said suppression of warning was sufficient.

* cuda: refactored ssm_scan to use CUB * fixed compilation error when when not using CUB * assign L to constant and use size_t instead of int * deduplicated functions * change min blocks per mp to 1 * Use cub load and store warp transpose * suppress clang warning

cuda: refactored ssm_scan to use CUB

b2f8eea

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 4, 2025

fixed compilation error when when not using CUB

c7d4d45

Your-Cheese force-pushed the ssm_scan_cub branch from 3a454c9 to c7d4d45 Compare May 4, 2025 15:29

JohannesGaessler reviewed May 6, 2025

View reviewed changes

assign L to constant and use size_t instead of int

949e4fa

Your-Cheese added 3 commits May 11, 2025 09:19

deduplicated functions

75520d6

change min blocks per mp to 1

7e559f3

Use cub load and store warp transpose

7d259d9

compilade mentioned this pull request May 14, 2025

llama : initial Mamba-2 support #9126

Merged

9 tasks

Merge https://github.com/ggml-org/llama.cpp into ssm_scan_cub

ae519a4

JohannesGaessler mentioned this pull request Aug 6, 2025

CUDA/HIP: ssm-scan: switch from shared memory to reisters, fixes indexing problem on warp64 devices #15101

Closed

IMbackK previously requested changes Aug 6, 2025

View reviewed changes

suppress clang warning

dd6ff8e

JohannesGaessler approved these changes Aug 9, 2025

View reviewed changes

JohannesGaessler merged commit 79c1160 into ggml-org:master Aug 9, 2025
47 checks passed

		regA[n] = A_block[threadIdx.x * stride_A + n];
		regs0[n] = s0_block[threadIdx.x * stride_s0 + n];

Conversation

Your-Cheese commented May 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

llama-bench

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Your-Cheese May 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Your-Cheese commented May 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Original:

Improved:

With 1 min block per MP:

WIthout min blocks per MP:

Uh oh!

JohannesGaessler commented Aug 6, 2025

Uh oh!

JohannesGaessler commented Aug 6, 2025

Uh oh!

Your-Cheese commented Aug 6, 2025

Uh oh!

Your-Cheese commented Aug 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Your-Cheese commented May 4, 2025 •

edited

Loading

Your-Cheese May 11, 2025 •

edited

Loading

Your-Cheese commented May 11, 2025 •

edited

Loading

JohannesGaessler Aug 7, 2025 •

edited

Loading