CUDA: use shared mem for ssm_conv #20128

Merged

am17an merged 5 commits into ggml-org:master from am17an:cuda_ssm_conv on Mar 6, 2026

Conversation

@am17an (Contributor) commented Mar 5, 2026

Add shared memory loading to ssm_conv_long_token for a mild benefit in prompt processing.
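For illustration, here is a minimal sketch of the staging pattern described above: each block loads its input row into shared memory once, and every thread then reads its sliding window from there instead of re-fetching overlapping data from global memory. This is a hypothetical kernel written for this writeup, not the PR's actual `ssm_conv_long_token` code; the names (`x`, `w`, `y`, `d_conv`, `n_t`) merely follow the ssm_conv shapes:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not the PR's kernel: a depthwise causal conv
// (the ssm_conv shape) where each block stages one channel's row of x
// into shared memory, so overlapping windows are read from smem rather
// than re-fetched from global memory by every thread.
template <int d_conv>
__global__ void ssm_conv_smem_sketch(const float * x,   // [d_inner, n_t + d_conv - 1]
                                     const float * w,   // [d_inner, d_conv]
                                     float * y,         // [d_inner, n_t]
                                     const int n_t) {
    extern __shared__ float sx[];
    const int channel = blockIdx.x;
    const int row_len = n_t + d_conv - 1;

    // Cooperative, coalesced load of the whole row into shared memory.
    for (int i = threadIdx.x; i < row_len; i += blockDim.x) {
        sx[i] = x[channel * row_len + i];
    }
    __syncthreads();

    // Per-thread copy of the d_conv filter taps (compile-time size).
    float wr[d_conv];
#pragma unroll
    for (int j = 0; j < d_conv; ++j) {
        wr[j] = w[channel * d_conv + j];
    }

    // Each thread computes output tokens strided by the block size,
    // reading its d_conv-wide window from shared memory.
    for (int t = threadIdx.x; t < n_t; t += blockDim.x) {
        float acc = 0.0f;
#pragma unroll
        for (int j = 0; j < d_conv; ++j) {
            acc += wr[j] * sx[t + j];
        }
        y[channel * n_t + t] = acc;
    }
}

// Launch sketch: one block per channel, smem sized to the staged row.
// ssm_conv_smem_sketch<4><<<d_inner, 128, (n_t + 3) * sizeof(float)>>>(x, w, y, n_t);
```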

At ub = 2048 on a 4090:

| Model               | Test    | t/s master | t/s cuda_ssm_conv | Speedup |
| ------------------- | ------- | ---------- | ----------------- | ------- |
| qwen35 2B Q8_0      | pp2048  | 27300.78   | 28136.03          | 1.03    |
| qwen35 2B Q8_0      | pp4096  | 26340.90   | 26944.11          | 1.02    |
| qwen35 2B Q8_0      | pp8192  | 25932.23   | 26401.70          | 1.02    |
| qwen35 2B Q8_0      | pp16384 | 24257.07   | 24673.20          | 1.02    |
| qwen35moe ?B Q4_K_S | pp2048  | 6784.72    | 6872.92           | 1.01    |
| qwen35moe ?B Q4_K_S | pp4096  | 6701.00    | 6810.24           | 1.02    |
| qwen35moe ?B Q4_K_S | pp8192  | 6684.46    | 6787.67           | 1.02    |
| qwen35moe ?B Q4_K_S | pp16384 | 6490.17    | 6578.51           | 1.01    |
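(For context: numbers like these typically come from llama.cpp's `llama-bench` tool, e.g. `llama-bench -m <model> -ub 2048 -p 2048,4096,8192,16384`; that invocation is an assumption, since the PR does not state the exact flags used.)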

@am17an requested a review from ggerganov as a code owner on March 5, 2026 08:17
@am17an requested a review from JohannesGaessler on March 5, 2026 08:18
@github-actions added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Mar 5, 2026
@am17an (Contributor, Author) commented Mar 5, 2026

Also added some trivial fusions:

| Model               | Test  | t/s 0141e9c | t/s cuda_ssm_conv | Speedup |
| ------------------- | ----- | ----------- | ----------------- | ------- |
| qwen35 2B Q8_0      | tg128 | 391.73      | 402.56            | 1.03    |
| qwen35moe ?B Q4_K_S | tg128 | 183.79      | 186.81            | 1.02    |
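As a hedged illustration of such a fusion (the general pattern, not the PR's exact code): the activation is applied in the conv kernel's epilogue, so the separate unary kernel launch disappears. Relative to the sketch above:

```cuda
#include <cuda_runtime.h>

// silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
__device__ __forceinline__ float silu(float v) {
    return v / (1.0f + expf(-v));
}

// In the fused variant, the conv sketch's final store
//     y[channel * n_t + t] = acc;
// becomes
//     y[channel * n_t + t] = silu(acc);
// which removes one kernel launch plus a full read and write of y
// through global memory. The "fuse unary + mul" commit follows the
// same idea, folding the elementwise multiply into the same epilogue.
```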

Comment thread on ggml/src/ggml-cuda/ggml-cuda.cu (outdated):
@JohannesGaessler (Contributor) commented:

Do you intend to add more features, or is this ready for review?

@am17an (Contributor, Author) commented Mar 5, 2026

This is it for now. I'll add more in a separate PR.

Three further comment threads on ggml/src/ggml-cuda/ggml-cuda.cu (outdated).

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@am17an merged commit 1e38a7a into ggml-org:master on Mar 6, 2026 (73 of 75 checks passed)
@am17an deleted the cuda_ssm_conv branch on March 6, 2026 15:10
```cuda
#pragma unroll
for (size_t j = 0; j < d_conv; j++) {
    w[j] = w_block[tid * stride_w + j];
}

for (int idx = tid; idx < total_elems; idx += split_d_inner) {
```
@IMbackK (Collaborator) commented:

@am17an The HIP compiler throws a bunch of warnings for this function because of this loop: tid is not known at compile time, so there is no way for the compiler to know how many iterations the loop should be unrolled to.

@IMbackK (Collaborator) commented:

@ggerganov What do you think about -Werror for the HIP build? The fact that the CUDA compiler doesn't warn for these cases has been a constant source of annoyance.

@am17an (Contributor, Author) commented:

Oh right, I think this should be inside the loop, like:

```cuda
for (int idx = 0; idx < total_elems; idx += split_d_inner) {
    const int idx0 = tid + idx;
    ....
}
```

I think there are a lot of warnings we need to clean up before enabling -Werror; let me do that in a separate PR.
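For completeness, a self-contained sketch of both the warning-triggering pattern and the fix; the kernel and names here are hypothetical, and it assumes (as in the real kernel) that the loop bounds are compile-time constants:

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel, assuming N and STRIDE are compile-time
// constants like total_elems/split_d_inner in the real kernel.
template <int N, int STRIDE>
__global__ void unroll_demo(const float * in, float * out) {
    const int tid = threadIdx.x;

    // Warns on HIP/clang (-Wpass-failed): the lower bound depends on tid,
    // so the trip count is unknown and a full unroll cannot be honored:
    //
    //   #pragma unroll
    //   for (int idx = tid; idx < N; idx += STRIDE) { ... }

    // Fix: keep the induction variable compile-time-known and move the
    // runtime offset into the body, guarding the upper bound.
#pragma unroll
    for (int idx = 0; idx < N; idx += STRIDE) {
        const int i = tid + idx;
        if (i < N) {
            out[i] = in[i];
        }
    }
}

// Usage sketch: unroll_demo<256, 64><<<1, 64>>>(in, out);
```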

@IMbackK (Collaborator) commented Mar 7, 2026:

At least building the HIP backend usually never throws any warnings, so we could restrict -Werror to just the HIP objects, and only for the CI build, or even just -Werror for the transformation warning. The goal is that a failing HIP backend CI run on a PR blocks this kind of thing from entering the codebase, which has happened a lot.

@am17an (Contributor, Author) commented:

Yes, if this is the only warning for now, I'm okay with enabling -Werror for HIP, since it would help catch these bugs in CI.

@ggerganov (Member) commented:

> @ggerganov what do you think about -Werror for the hip build?

@IMbackK Sounds good

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 10, 2026, with the squashed commit message:

* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request on Mar 20, 2026 (same commit message).
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026 (same commit message).
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 1, 2026 (same commit message).

Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)

Projects: none yet

5 participants