CUDA: use shared mem for ssm_conv #20128

Merged

am17an merged 5 commits into ggml-org:master from am17an:cuda_ssm_conv on Mar 6, 2026

Conversation

@am17an (Contributor) commented Mar 5, 2026

Add shared memory loading to ssm_conv_long_token for a mild benefit in prompt processing.
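For illustration, here is a minimal sketch of the staging pattern described above: each block loads its input row into shared memory once, and every thread then reads its sliding window from there instead of re-fetching overlapping data from global memory. This is a hypothetical kernel written for this writeup, not the PR's actual `ssm_conv_long_token` code; the names (`x`, `w`, `y`, `d_conv`, `n_t`) merely follow the ssm_conv shapes:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not the PR's kernel: a depthwise causal conv
// (the ssm_conv shape) where each block stages one channel's row of x
// into shared memory, so overlapping windows are read from smem rather
// than re-fetched from global memory by every thread.
template <int d_conv>
__global__ void ssm_conv_smem_sketch(const float * x,   // [d_inner, n_t + d_conv - 1]
                                     const float * w,   // [d_inner, d_conv]
                                     float * y,         // [d_inner, n_t]
                                     const int n_t) {
    extern __shared__ float sx[];
    const int channel = blockIdx.x;
    const int row_len = n_t + d_conv - 1;

    // Cooperative, coalesced load of the whole row into shared memory.
    for (int i = threadIdx.x; i < row_len; i += blockDim.x) {
        sx[i] = x[channel * row_len + i];
    }
    __syncthreads();

    // Per-thread copy of the d_conv filter taps (compile-time size).
    float wr[d_conv];
#pragma unroll
    for (int j = 0; j < d_conv; ++j) {
        wr[j] = w[channel * d_conv + j];
    }

    // Each thread computes output tokens strided by the block size,
    // reading its d_conv-wide window from shared memory.
    for (int t = threadIdx.x; t < n_t; t += blockDim.x) {
        float acc = 0.0f;
#pragma unroll
        for (int j = 0; j < d_conv; ++j) {
            acc += wr[j] * sx[t + j];
        }
        y[channel * n_t + t] = acc;
    }
}

// Launch sketch: one block per channel, smem sized to the staged row.
// ssm_conv_smem_sketch<4><<<d_inner, 128, (n_t + 3) * sizeof(float)>>>(x, w, y, n_t);
```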

At ub = 2048 on a 4090:

| Model               | Test    | t/s master | t/s cuda_ssm_conv | Speedup |
| ------------------- | ------- | ---------- | ----------------- | ------- |
| qwen35 2B Q8_0      | pp2048  | 27300.78   | 28136.03          | 1.03    |
| qwen35 2B Q8_0      | pp4096  | 26340.90   | 26944.11          | 1.02    |
| qwen35 2B Q8_0      | pp8192  | 25932.23   | 26401.70          | 1.02    |
| qwen35 2B Q8_0      | pp16384 | 24257.07   | 24673.20          | 1.02    |
| qwen35moe ?B Q4_K_S | pp2048  | 6784.72    | 6872.92           | 1.01    |
| qwen35moe ?B Q4_K_S | pp4096  | 6701.00    | 6810.24           | 1.02    |
| qwen35moe ?B Q4_K_S | pp8192  | 6684.46    | 6787.67           | 1.02    |
| qwen35moe ?B Q4_K_S | pp16384 | 6490.17    | 6578.51           | 1.01    |
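(For context: numbers like these typically come from llama.cpp's `llama-bench` tool, e.g. `llama-bench -m <model> -ub 2048 -p 2048,4096,8192,16384`; that invocation is an assumption, since the PR does not state the exact flags used.)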

@am17an requested a review from ggerganov as a code owner on March 5, 2026 08:17
@am17an requested a review from JohannesGaessler on March 5, 2026 08:18
@github-actions added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Mar 5, 2026
@am17an (Contributor, Author) commented Mar 5, 2026

Also added some trivial fusions:

| Model               | Test  | t/s 0141e9c | t/s cuda_ssm_conv | Speedup |
| ------------------- | ----- | ----------- | ----------------- | ------- |
| qwen35 2B Q8_0      | tg128 | 391.73      | 402.56            | 1.03    |
| qwen35moe ?B Q4_K_S | tg128 | 183.79      | 186.81            | 1.02    |
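As a hedged illustration of such a fusion (the general pattern, not the PR's exact code): the activation is applied in the conv kernel's epilogue, so the separate unary kernel launch disappears. Relative to the sketch above:

```cuda
#include <cuda_runtime.h>

// silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
__device__ __forceinline__ float silu(float v) {
    return v / (1.0f + expf(-v));
}

// In the fused variant, the conv sketch's final store
//     y[channel * n_t + t] = acc;
// becomes
//     y[channel * n_t + t] = silu(acc);
// which removes one kernel launch plus a full read and write of y
// through global memory. The "fuse unary + mul" commit follows the
// same idea, folding the elementwise multiply into the same epilogue.
```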

Comment thread on ggml/src/ggml-cuda/ggml-cuda.cu (outdated):
@JohannesGaessler (Contributor) commented:

Do you intend to add more features, or is this ready for review?

@am17an (Contributor, Author) commented Mar 5, 2026

This is it for now. I'll add more in a separate PR.

Three further comment threads on ggml/src/ggml-cuda/ggml-cuda.cu (outdated).

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@am17an merged commit 1e38a7a into ggml-org:master on Mar 6, 2026 (73 of 75 checks passed)
@am17an deleted the cuda_ssm_conv branch on March 6, 2026 15:10
```cuda
#pragma unroll
for (size_t j = 0; j < d_conv; j++) {
    w[j] = w_block[tid * stride_w + j];
}

for (int idx = tid; idx < total_elems; idx += split_d_inner) {
```
@IMbackK (Collaborator) commented:

@am17an The HIP compiler throws a bunch of warnings for this function because of this loop: tid is not known at compile time, so there is no way for the compiler to know how many iterations the loop should be unrolled to.

@IMbackK (Collaborator) commented:

@ggerganov What do you think about -Werror for the HIP build? The fact that the CUDA compiler doesn't warn for these cases has been a constant source of annoyance.

@am17an (Contributor, Author) commented:

Oh right, I think this should be inside the loop, like:

```cuda
for (int idx = 0; idx < total_elems; idx += split_d_inner) {
    const int idx0 = tid + idx;
    ....
}
```

I think there are a lot of warnings we need to clean up before enabling -Werror; let me do that in a separate PR.
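For completeness, a self-contained sketch of both the warning-triggering pattern and the fix; the kernel and names here are hypothetical, and it assumes (as in the real kernel) that the loop bounds are compile-time constants:

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel, assuming N and STRIDE are compile-time
// constants like total_elems/split_d_inner in the real kernel.
template <int N, int STRIDE>
__global__ void unroll_demo(const float * in, float * out) {
    const int tid = threadIdx.x;

    // Warns on HIP/clang (-Wpass-failed): the lower bound depends on tid,
    // so the trip count is unknown and a full unroll cannot be honored:
    //
    //   #pragma unroll
    //   for (int idx = tid; idx < N; idx += STRIDE) { ... }

    // Fix: keep the induction variable compile-time-known and move the
    // runtime offset into the body, guarding the upper bound.
#pragma unroll
    for (int idx = 0; idx < N; idx += STRIDE) {
        const int i = tid + idx;
        if (i < N) {
            out[i] = in[i];
        }
    }
}

// Usage sketch: unroll_demo<256, 64><<<1, 64>>>(in, out);
```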

@IMbackK (Collaborator) commented Mar 7, 2026:

At least building the HIP backend usually never throws any warnings, so we could restrict -Werror to just the HIP objects, and only for the CI build, or even just -Werror for the transformation warning. The goal is that a failing HIP backend CI run on a PR blocks this kind of thing from entering the codebase, which has happened a lot.

@am17an (Contributor, Author) commented:

Yes, if this is the only warning for now, I'm okay with enabling -Werror for HIP, since it would help catch these bugs in CI.

@ggerganov (Member) commented:

> @ggerganov what do you think about -Werror for the hip build?

@IMbackK Sounds good

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 10, 2026, with the squashed commit message:

* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request on Mar 20, 2026 (same commit message).
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026 (same commit message).
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 1, 2026 (same commit message).

Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)

Projects: none yet

5 participants