
cuda : Add conv2d Implicit GEMM #15805

Open
bssrdf wants to merge 130 commits into ggml-org:master from bssrdf:conv2d-implicit

Conversation

@bssrdf (Contributor) commented Sep 4, 2025

This PR adds another CUDA conv_2d op using the implicit GEMM approach. It is optimized only for CUDA cores, and its performance is up to 10x that of the direct method currently in llama.cpp.

On an RTX 4090:

| Case | Direct | Implicit GEMM |
|---|---|---|
| ne_input=[19,19,256,16], ne_kernel=[4,4,256,4096] | 2.23 TFLOPS | 38.76 TFLOPS |
| ne_input=[19,19,8,16], ne_kernel=[4,4,8,128] | 1.85 TFLOPS | 9.12 TFLOPS |
| ne_input=[19,19,8,16], ne_kernel=[4,4,8,130] | 1.76 TFLOPS | 9.27 TFLOPS |
| ne_input=[19,19,4,16], ne_kernel=[2,2,4,4] | 147.71 GFLOPS | 150.00 GFLOPS |
| ne_input=[224,224,3,1], ne_kernel=[3,3,3,8] | 1.04 TFLOPS | 1.02 TFLOPS |
| ne_input=[224,224,1,1], ne_kernel=[2,2,1,8] | 255.40 GFLOPS | 238.21 GFLOPS |
| ne_input=[224,224,1,8], ne_kernel=[2,2,1,8] | 308.44 GFLOPS | 324.17 GFLOPS |
| ne_input=[58,58,32,1], ne_kernel=[3,3,32,64] | 1.49 TFLOPS | 3.98 TFLOPS |
| ne_input=[58,58,32,8], ne_kernel=[3,3,32,64] | 1.88 TFLOPS | 15.85 TFLOPS |
| ne_input=[16,16,128,8], ne_kernel=[3,3,128,512] | 1.98 TFLOPS | 16.90 TFLOPS |
| ne_input=[19,19,256,16], ne_kernel=[4,4,256,4096] | 2.27 TFLOPS | 38.00 TFLOPS |
| ne_input=[19,19,8,16], ne_kernel=[4,4,8,128] | 1.86 TFLOPS | 8.64 TFLOPS |
| ne_input=[19,19,8,16], ne_kernel=[4,4,8,130] | 1.80 TFLOPS | 8.78 TFLOPS |
| ne_input=[19,19,4,16], ne_kernel=[2,2,4,4] | 150.12 GFLOPS | 147.95 GFLOPS |
| ne_input=[224,224,3,1], ne_kernel=[3,3,3,8] | 1.01 TFLOPS | 980.39 GFLOPS |
| ne_input=[224,224,1,1], ne_kernel=[2,2,1,8] | 245.83 GFLOPS | 212.52 GFLOPS |
| ne_input=[224,224,1,8], ne_kernel=[2,2,1,8] | 305.41 GFLOPS | 317.95 GFLOPS |
| ne_input=[58,58,32,1], ne_kernel=[3,3,32,64] | 1.43 TFLOPS | 3.74 TFLOPS |
| ne_input=[58,58,32,8], ne_kernel=[3,3,32,64] | 1.81 TFLOPS | 14.96 TFLOPS |
| ne_input=[16,16,128,8], ne_kernel=[3,3,128,512] | 1.84 TFLOPS | 15.80 TFLOPS |

Comparison with im2col+gemm

FP16 filter, FP32 activation

| (IC, OC, IW, IH) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
|---|---|---|---|---|
| (64, 64, 48, 64) | 0.03 ms | 4.12 MB | 0.07 ms | 0.75 MB |
| (320, 320, 104, 152) | 0.56 ms | 106.13 MB | 0.98 ms | 19.30 MB |
| (640, 640, 52, 76) | 0.32 ms | 53.07 MB | 1.24 ms | 9.65 MB |
| (640, 640, 104, 152) | 1.41 ms | 212.27 MB | 3.04 ms | 38.59 MB |
| (960, 320, 104, 152) | 1.48 ms | 279.80 MB | 2.68 ms | 19.30 MB |
| (1280, 1280, 26, 38) | 0.21 ms | 26.53 MB | 1.19 ms | 4.82 MB |
| (1280, 640, 52, 76) | 0.62 ms | 96.48 MB | 2.33 ms | 9.65 MB |
| (1920, 1280, 26, 38) | 0.30 ms | 37.39 MB | 1.79 ms | 4.82 MB |
| (2560, 1280, 26, 38) | 0.42 ms | 48.24 MB | 2.36 ms | 4.82 MB |
| (512, 512, 104, 152) | 0.91 ms | 169.81 MB | 1.88 ms | 30.88 MB |
| (512, 512, 208, 304) | 3.90 ms | 679.25 MB | 7.95 ms | 123.50 MB |
| (512, 256, 416, 608) | 12.55 ms | 2470.00 MB | 15.67 ms | 247.00 MB |
| (256, 128, 832, 1216) | 24.82 ms | 4940.00 MB | 15.67 ms | 494.00 MB |
| (256, 256, 832, 1216) | 27.43 ms | 5434.00 MB | 31.17 ms | 988.00 MB |
| (320, 256, 1024, 1920) | 66.56 ms | 12720.00 MB | 76.05 ms | 1920.00 MB |

FP32 filter, FP32 activation

| (IC, OC, IW, IH) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
|---|---|---|---|---|
| (64, 64, 48, 64) | 0.04 ms | 7.50 MB | 0.07 ms | 0.75 MB |
| (320, 320, 104, 152) | 0.92 ms | 192.97 MB | 0.90 ms | 19.30 MB |
| (640, 640, 52, 76) | 0.68 ms | 96.48 MB | 1.19 ms | 9.65 MB |
| (640, 640, 104, 152) | 2.41 ms | 385.94 MB | 2.95 ms | 38.59 MB |
| (960, 320, 104, 152) | 2.38 ms | 540.31 MB | 2.56 ms | 19.30 MB |
| (1280, 1280, 26, 38) | 0.71 ms | 48.24 MB | 1.10 ms | 4.82 MB |
| (1280, 640, 52, 76) | 1.18 ms | 183.32 MB | 2.20 ms | 9.65 MB |
| (1920, 1280, 26, 38) | 0.72 ms | 69.95 MB | 1.83 ms | 4.82 MB |
| (2560, 1280, 26, 38) | 0.94 ms | 91.66 MB | 2.35 ms | 4.82 MB |
| (512, 512, 104, 152) | 1.57 ms | 308.75 MB | 1.79 ms | 30.88 MB |
| (512, 512, 208, 304) | 6.34 ms | 1235.00 MB | 7.61 ms | 123.50 MB |
| (512, 256, 416, 608) | 17.49 ms | 4693.00 MB | 15.00 ms | 247.00 MB |
| (256, 128, 832, 1216) | 32.16 ms | 9386.00 MB | 15.06 ms | 494.00 MB |
| (256, 256, 832, 1216) | 36.54 ms | 9880.00 MB | 30.23 ms | 988.00 MB |
| (320, 256, 1024, 1920) | 562.36 ms | 23520.00 MB | 73.56 ms | 1920.00 MB |
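The VRAM columns follow a simple model: im2col allocates a KH*KW*IC by OW*OH*N scratch matrix in addition to the output, while implicit GEMM only allocates the output. A small C++ check of that model against the first table row (assumptions, not stated in the PR: 3x3 kernel, stride 1, same-size output, FP32 output; the helper name is ours, not ggml's):

```cpp
#include <cassert>

// Rough VRAM model behind the tables above (assumed: same-size output, FP32 output).
// im2col needs a [KH*KW*IC] x [OW*OH*N] scratch buffer on top of the output;
// implicit GEMM only needs the output itself.
double conv_vram_mb(int IC, int OC, int IW, int IH, int N,
                    int KH, int KW, int im2col_elem_size, bool use_im2col) {
    const double out_bytes    = (double)OC * IW * IH * N * 4;  // FP32 output
    const double im2col_bytes = (double)KH * KW * IC * IW * IH * N * im2col_elem_size;
    return (out_bytes + (use_im2col ? im2col_bytes : 0)) / (1024.0 * 1024.0);
}
```

For (IC, OC, IW, IH) = (64, 64, 48, 64) this reproduces the table: 4.12 MB for FP16 im2col, 7.50 MB for FP32 im2col, and 0.75 MB (the output alone) for implicit GEMM.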

@bssrdf bssrdf marked this pull request as draft September 4, 2025 20:17
@github-actions bot added labels Sep 4, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
@JohannesGaessler (Contributor):

Why are you adding a new ggml op?

@bssrdf (Contributor, Author) commented Sep 4, 2025

> Why are you adding a new ggml op?

Because of #15669 (comment)

@leejet (Contributor) commented Sep 5, 2025

I think the implementation of implicit gemm can directly use ggml_conv2d_direct. There's really no need to provide so many conv2d functions.

@bssrdf (Contributor, Author) commented Sep 5, 2025

> I think the implementation of implicit gemm can directly use ggml_conv2d_direct. There's really no need to provide so many conv2d functions.

I can reuse ggml_conv2d_direct. TBH it is not a very good or intuitive name (the best one, ggml_conv_2d, is already taken). I do wish it had an additional argument (which ggml_conv_2d should have carried from the beginning) selecting the implementation method.

@leejet (Contributor) commented Sep 5, 2025

If the performance of implicit gemm is on par with or even better than that of im2col + gemm, I think ggml_conv_2d can also adopt the implementation of implicit gemm.

@JohannesGaessler (Contributor):

What should be done regarding IM2COL vs. CONV2D is to construct the compute graph using CONV2D and to then let each backend decide how to do the operation. If a backend lacks support for convolution it should allocate a temporary buffer for IM2COL and use that as a workaround.

For kernel selection, please take a look at how e.g. FLASH_ATTN_EXT is being handled. There are multiple kernels that can be used, at runtime one is selected based on hardware capabilities and tensor shapes. All convolution kernels that do the exact same operation should be using the same ggml op. If we have multiple kernels that could be used we need to test which code paths are faster under which circumstances and write the logic accordingly. This is particularly relevant because there is a concurrent PR using tensor cores: #15813 . cc @mnehete32
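The FLASH_ATTN_EXT-style dispatch described above can be sketched as a pure selection function over hardware capability and problem shape. Everything here (the enum, the thresholds, the function name) is hypothetical, not the actual ggml-cuda dispatch code; real cutoffs would be tuned from benchmarks like the tables in this PR:

```cpp
#include <cassert>

// Hypothetical sketch of runtime kernel selection for a single CONV2D op:
// one ggml op, several device implementations, chosen from capability + shape.
enum class Conv2DKernel { DIRECT, IMPLICIT_GEMM, IMPLICIT_GEMM_TENSOR_CORES };

Conv2DKernel select_conv2d_kernel(int cc,            // CUDA compute capability, e.g. 89
                                  bool fp16_filter,  // filter stored as FP16
                                  long flops) {      // rough problem size
    // Tiny problems: launch overhead dominates, the simple kernel is fine.
    if (flops < (1L << 20)) {
        return Conv2DKernel::DIRECT;
    }
    // Tensor cores need Volta+ (cc >= 70) and an FP16 data path.
    if (cc >= 70 && fp16_filter) {
        return Conv2DKernel::IMPLICIT_GEMM_TENSOR_CORES;
    }
    return Conv2DKernel::IMPLICIT_GEMM;  // e.g. Pascal, or FP32 filters
}
```

The point is that all variants sit behind the same ggml op, so graph construction never needs to know which kernel will run.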

@JohannesGaessler (Contributor) left a comment:

For this PR, try removing the current conv2d kernel and replacing it with this one. Chances are it will be universally faster since it uses shared memory and has (unless I misread the code) coalesced memory accesses. I'll test the performance using a P40, RTX 3090, and RTX 4090 for NVIDIA and an RX 6800 and Mi 50 for AMD.

Comment thread ggml/src/ggml-cuda/conv2d-implicit.cu Outdated
#include "convert.cuh"

typedef struct{
unsigned int n; //batch szie
Contributor:

Suggested change:
- unsigned int n; //batch szie
+ unsigned int n; //batch size

Contributor (author):

Done

Comment thread ggml/src/ggml-cuda/conv2d-implicit.cu Outdated

typedef struct{
unsigned int n; //batch szie
unsigned int c; //channel number
Contributor:

Change to either "channel index" or "number of channels" depending on which this is.

Contributor (author):

Done

Comment thread ggml/src/ggml-cuda/conv2d-implicit.cu Outdated
int threadz = 1; // threadz number per block
dim3 thblock(threadx, thready, threadz);
dim3 grid(blockx, blocky, blockz);
int smem_size = 24 * 1024;
Contributor:

On some CUDA architectures shared memory comes out of the L1 cache, so if at all possible you should reserve only as much as will actually be used.

Comment thread ggml/src/ggml-cuda/conv2d-implicit.cu Outdated
float * __restrict__ output,
const param_t param) {

extern __shared__ __align__(16 * 1024) char smem[];
Contributor:

What is the purpose of __align__ here?

Contributor (author):

Removed; no difference in performance.

Comment thread ggml/src/ggml-cuda/conv2d-implicit.cu Outdated
Comment on lines +63 to +64
for (int i = 0; i < 4; ++i)
{
Contributor:

Suggested change:
- for (int i = 0; i < 4; ++i)
- {
+ for (int i = 0; i < 4; ++i) {

See contribution guidelines

Contributor (author):

Done; corrected the style in all places.

Contributor (author):

Thanks, @JohannesGaessler, for taking the time to review. I agree with your idea regarding kernel selection behind the scenes. Indeed, no single kernel is optimal for all input and filter shapes; that's why cuDNN provides many of them for the user to choose from. Previously I was not sure whether selecting kernels was possible; I'll look into the FLASH_ATTN_EXT example (thanks again).

Now that #15813 is adding tensor core support with shared memory, I don't want to step on its toes. This PR will be on hold for now. I may contribute to the current conv_2d_direct once the tensor core code is merged.

Contributor:

Even if there is a kernel with tensor core support, a good kernel without tensor cores would still be extremely useful. P40s and Mi50s are very cheap options for 24/32 GB of VRAM but they lack tensor cores. And from a ggml perspective it's much easier to squeeze out more performance than it is to compress the weights (without affecting quality).

Contributor:

Speaking of P40s, you should be careful with FP16 arithmetic since that is massively gimped on Pascal. You can use the macro FAST_FP16_AVAILABLE to check whether FP16 would be fast and use FP32 as a workaround if not. You can look at e.g. mmvf.cu for an example.

Contributor (author):

Will look into it. Thanks.

@Green-Sky (Collaborator) commented Sep 11, 2025

Erroneous numbers (old, kept for reference):

Took it for a small test drive in sd.cpp for VAE decoding:

768x1024 sd1 fp16 vae:

| method | time | memory |
|---|---|---|
| im2col+mul | ~1.68 s | 4992.19 MB |
| direct | ~35.35 s | 1920.19 MB |
| implicitgemm | ~35.21 s | 1920.19 MB |

The resulting images look correct.

For some reason implicitgemm is only as fast as the current direct implementation.

im2col+mat_mul
[DEBUG] ggml_extend.hpp:1425 - vae compute buffer size: 4992.19 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1457 - computing vae decode graph completed, taking 1.68s
[INFO ] stable-diffusion.cpp:2098 - latent 3 decoded, taking 1.68s
direct
[DEBUG] ggml_extend.hpp:1425 - vae compute buffer size: 1920.19 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1457 - computing vae decode graph completed, taking 35.35s
[INFO ] stable-diffusion.cpp:2098 - latent 3 decoded, taking 35.35s
implicitgemm
[DEBUG] ggml_extend.hpp:1425 - vae compute buffer size: 1920.19 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1457 - computing vae decode graph completed, taking 35.20s
[INFO ] stable-diffusion.cpp:2098 - latent 3 decoded, taking 35.21s

@bssrdf (Contributor, Author) commented Sep 11, 2025

> Took it for a small test drive in sd.cpp for VAE decoding: 768x1024 sd1 fp16 vae […] For some reason implicitgemm is as fast as the current direct implementation.

@Green-Sky, thanks for giving it a try-out. I am a bit puzzled by the results.

  • I think the current implicit GEMM needs to improve a lot to catch up with im2col+gemm, but 20x slower is too much. As I showed above, for small input dimensions, implicit is about 2x slower than im2col; on middle to large dimensions, implicit is close if not faster.
  • What's even more puzzling is that implicit is not faster than direct. Below is a 3-way comparison of direct, implicit and im2col for an fp16 3x3 filter:
| (IC, OC, IW, IH) | im2col+GEMM time | im2col+GEMM VRAM | direct time | direct VRAM | implicit GEMM time | implicit GEMM VRAM |
|---|---|---|---|---|---|---|
| (64, 64, 48, 64) | 0.03 ms | 4.12 MB | 0.15 ms | 0.75 MB | 0.07 ms | 0.75 MB |
| (320, 320, 104, 152) | 0.59 ms | 106.13 MB | 15.94 ms | 19.30 MB | 0.90 ms | 19.30 MB |
| (640, 640, 52, 76) | 0.31 ms | 53.07 MB | 15.52 ms | 9.65 MB | 1.20 ms | 9.65 MB |
| (640, 640, 104, 152) | 1.41 ms | 212.27 MB | 62.20 ms | 38.59 MB | 3.04 ms | 38.59 MB |
| (960, 320, 104, 152) | 1.44 ms | 279.80 MB | 48.39 ms | 19.30 MB | 2.73 ms | 19.30 MB |
| (1280, 1280, 26, 38) | 0.22 ms | 26.53 MB | 15.75 ms | 4.82 MB | 1.20 ms | 4.82 MB |
| (1280, 640, 52, 76) | 0.62 ms | 96.48 MB | 31.89 ms | 9.65 MB | 2.34 ms | 9.65 MB |
| (1920, 1280, 26, 38) | 0.30 ms | 37.39 MB | 23.69 ms | 4.82 MB | 1.76 ms | 4.82 MB |
| (2560, 1280, 26, 38) | 0.43 ms | 48.24 MB | 31.62 ms | 4.82 MB | 2.37 ms | 4.82 MB |
| (512, 512, 104, 152) | 0.94 ms | 169.81 MB | 40.08 ms | 30.88 MB | 2.04 ms | 30.88 MB |
| (512, 512, 208, 304) | 3.88 ms | 679.25 MB | 171.59 ms | 123.50 MB | 7.98 ms | 123.50 MB |
| (512, 256, 416, 608) | 12.62 ms | 2470.00 MB | 352.60 ms | 247.00 MB | 15.67 ms | 247.00 MB |
| (256, 128, 832, 1216) | 24.96 ms | 4940.00 MB | 352.86 ms | 494.00 MB | 15.85 ms | 494.00 MB |
| (256, 256, 832, 1216) | 27.46 ms | 5434.00 MB | 711.06 ms | 988.00 MB | 31.78 ms | 988.00 MB |
| (320, 256, 1024, 1920) | 69.85 ms | 12720.00 MB | 1767.82 ms | 1920.00 MB | 76.39 ms | 1920.00 MB |

@Green-Sky (Collaborator) commented Sep 11, 2025

diff --git a/ggml_extend.hpp b/ggml_extend.hpp
index 560d2861..36048bcc 100644
--- a/ggml_extend.hpp
+++ b/ggml_extend.hpp
@@ -851,7 +851,7 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_2d_direct(struct ggml_context
                                                              int p1 = 0,
                                                              int d0 = 1,
                                                              int d1 = 1) {
-    x = ggml_conv_2d_direct(ctx, w, x, s0, s1, p0, p1, d0, d1);
+    x = ggml_conv_2d_implicitgemm(ctx, w, x, s0, s1, p0, p1, d0, d1);
     if (b != NULL) {
         b = ggml_reshape_4d(ctx, b, 1, 1, b->ne[0], 1);
         // b = ggml_repeat(ctx, b, x);
> • What's even more puzzling is implicit is not faster than direct.

Yes, I first thought I did something wrong, but it is only one call in sd.cpp (see diff), and I looked at your code and it also looks like the op is properly emitted...

Giving it another try, just to make sure I really did nothing wrong here (: .

update: OK, getting (way) better values now. I do not know what I did wrong before; I made sure to recompile a couple of times with different settings, but it really did use the naive conv2d_direct impl. My mistake.

@Green-Sky (Collaborator) commented Sep 11, 2025

Corrected test drive in sd.cpp for VAE decoding:

768x1024 sd1 fp16 vae:

| method | time | memory |
|---|---|---|
| CUDA im2col+mul | ~1.68 s | 4992.19 MB |
| CUDA direct (master) | ~35.35 s | 1920.19 MB |
| CUDA direct (ac5e0c0) | ~9.00 s | 1920.19 MB |
| CUDA implicitgemm (2ec76aa) | ~2.20 s | 1920.19 MB |
| VULKAN im2col+mul | OOM | ~4992 MB |
| VULKAN direct | ~1.17 s | 1920.19 MB |

Ignore previous numbers. Now looking more like it. Good job @bssrdf :) , sorry for the confusion.

edit: added vulkan numbers from same device. (cm2)
edit2: added numbers from #15813

@bssrdf (Contributor, Author) commented Sep 11, 2025

> Corrected test drive in sd.cpp for VAE decoding: 768x1024 sd1 fp16 vae […] Ignore previous numbers. Now looking more like it. Good job @bssrdf :) , sorry for the confusion.

Thanks for the update, @Green-Sky. The sd.cpp result is consistent with what I tested in the ggml backend op.
Surprised to see Vulkan is so fast 🤔. Curious to know what optimizations it used. A pity I don't know how to program in it.

I hope to improve implicit's performance, especially for FP16. There should be plenty of low-hanging fruit there.

@etasnadi (Contributor) commented Sep 18, 2025

> Surprised to see VULKAN is so fast 🤔. Curious to know what optimization it used. A pity I don't know how to program in it. […] I hope to improve implicit's performance, especially for Fp16.

Hi, there are a few optimizations used in Vulkan: minimizing branch divergence and reducing the number of modulo/div computations with shuffles. Later @jeffbolznv further optimized it by using constant integer divisions and tuning the blocktile sizes for common inputs.

Although the GLSL code is not too hard to read, I added a CUDA translation of the GLSL shader in a new PR: #16088. It also introduces some bank conflict reduction and vectorized shmem loads, so it is now faster than Vulkan on large inputs.

@bssrdf (Contributor, Author) commented Sep 18, 2025

> Hi, there are a few optimizations used in Vulkan: mininizing branch divergence, reducing the number of modulo/div computations with shuffles. […] I added the CUDA translation of the GLSL shader in a new PR: #16088. […] it is now faster than Vulkan on large inputs.

@etasnadi, thanks for working on the CUDA backend. Yes, with these optimizations in place, CUDA should be on par with Vulkan if not faster. Great job!

@bssrdf (Contributor, Author) commented Sep 18, 2025

Now #16088 is a better implementation.

@bssrdf (Contributor, Author) commented Jan 29, 2026

> I think that's expected - the cuda graph is basically a "display list" of captured commands and the host code won't be rerun when it's reused.
>
> BTW, I think this failure started with #18934.

I see. ggml_cuda_graph_node_properties_match does not appear to check the tensor type.

@ggerganov (Member):

I just noticed the discussion. I think the changes from #19165 will likely fix the issue, but a type check would still need to be added to the properties in order to distinguish same-sized types.

@jeffbolznv (Contributor):

The tests are passing again at top of tree.

@bssrdf (Contributor, Author) commented Jan 29, 2026

> The tests are passing again at top of tree.

That's great. Will merge the latest and retest. Thanks, @jeffbolznv and @ggerganov.

@bssrdf (Contributor, Author) commented Jan 29, 2026

Now all test cases pass.

@Green-Sky (Collaborator):

So, how are we looking? How is HIP doing?

@bssrdf (Contributor, Author) commented Feb 11, 2026

> So, how are we looking? How is HIP doing?

I don't know how to resolve the HIP build error. Anyone who could lend a hand would be appreciated.

@daniandtheweb (Contributor) commented Feb 23, 2026

I've been able to fix the HIP build issue.

Apparently the compiler doesn't parse cudaFuncSetAttribute at lines 967 and 1165 of conv2d-implicit.cu as it should: instead of seeing 3 arguments, it sees everything as one argument, even the elements between brackets.

My solution has been to add a (const void*) cast to conv2d_implicit_kernel... .

By fixing this I've been able to compile the program and run it without any issue.

@bssrdf (Contributor, Author) commented Feb 23, 2026

> I've been able to fix the HIP build issue. […] My solution has been to add a (const void*) cast to conv2d_implicit_kernel... . By fixing this I've been able to compile the program and run it without any issue.

@daniandtheweb, thanks for your help on this issue. Unfortunately, adding (const void*) didn't work on Windows, where the compiler rejected it with errors. There is already an existing macro for the intended purpose, so I used that instead. Let's see how it goes.

@zhang-hui-yulo (Contributor):

May I ask how to compile conv2d-implicit on HIP? I just checked out bssrdf:conv2d-implicit and don't see a compile error on my Windows HIP 6.4.2 setup.

To my knowledge, hipFuncSetAttribute needs a const void* as the function argument, so CUDA_SET_SHARED_MEMORY_LIMIT needs to be modified to support HIP, or you can just get rid of the macro and directly call cudaFuncSetAttribute like fattn-mma-f16.cuh does.

@bssrdf (Contributor, Author) commented Mar 6, 2026

> May I ask how to compile conv2d-implicit on HIP? I just checkout bssrdf:conv2d-implicit and don't see compile error on my hip windows 6.4.2. […] CUDA_SET_SHARED_MEMORY_LIMIT needs to be modified to support hip, or you can just get rid of the macro and directly call cudaFuncSetAttribute like fattn-mma-f16.cuh

Hi, if there is no compile error, can you try it with the HIP backend and see if it works? The tensor core path is not meant for non-NVIDIA devices, but the other path using CUDA cores may work. Sorry, I don't know much about HIP.

@zhang-hui-yulo (Contributor):

> Hi, if there is no compile error, can you try it with HIP backends and see if it works? The tensor core path is not meant for non-NVIDIA devices but the other path using cuda core may work.

I don't know if conv2d-implicit.cu is compiled, as there is no conv2d-implicit.obj file; it looks like HIP is disabled in conv2d-implicit.cu by !defined(GGML_USE_HIP).

So I don't think this PR will have much impact on HIP unless you totally replace the current conv2d with implicit GEMM. I will have a try after removing !defined(GGML_USE_HIP).

@bssrdf (Contributor, Author) commented Mar 6, 2026

> I don't know if conv2d-implicit.cu is compiled as there is no conv2d-implicit.obj file, looks like that hip is disabled in conv2d-implicit.cu by !defined(GGML_USE_HIP) […] I will have a try after removing !defined(GGML_USE_HIP).

!defined(GGML_USE_HIP) is only used for the tensor core path. You should get the CUDA core conv2d_implicit_kernel.

@zhang-hui-yulo (Contributor):

I just went through the code in conv2d-implicit.cu; the CPU side also disables HIP, so only conv2d_implicit_cuda will be executed. All cases pass on my 7900 XTX. I don't think you need to worry about CUDA_SET_SHARED_MEMORY_LIMIT: it can contain unsupported code because it will never be executed.

For HIP support, as with mmf and flash attention, you can submit the CUDA version first; then I or someone else can adapt the HIP backend in the future.

@zhang-hui-yulo (Contributor):

I just looked at the main loop of the implicit GEMM; there is a lot of native PTX code in it, which makes the kernel work only on CUDA. I would suggest changing the code to use mma.cuh, or it will be hard to port to other platforms.

@bssrdf (Contributor, Author) commented Mar 9, 2026

> I just look at the main loop of implicit gemm, too many native ptx code in it which makes the kernel only works in cuda, I will suggest to change the code to adapter mma.cuh or it's hard to be ported to other platorms.

Unfortunately, I didn't use the existing mma tiles when developing the conv2d kernel, and I am not the right person to adapt this kernel to AMD devices due to a knowledge gap. I think it is probably better to merge this PR and leave development for other platforms to future PRs. Thanks.

@zhang-hui-yulo (Contributor) commented Mar 10, 2026

> Unfortunately I didn't use the existing mma tiles when developing the conv2d kernel. […] I think it is probably better to merge this PR and leave development for other platforms in future PRs.

You would still need to adapt it to other NVIDIA GPUs like Volta. Anyway, I'm not the moderator; using mma.cuh is just a suggestion, and the decision belongs to @JohannesGaessler.

@JohannesGaessler (Contributor):

I currently don't have the capacity to properly maintain convolution kernels, the biggest problem is that I don't have established workflows for proper QA. What exact device code implementation to go with should be left at the discretion of the person taking over the responsibility for long-term maintenance. My opinion is that using the interface in mma.cuh would be preferable but I don't think we should block an improvement based on that.

@bssrdf (Contributor, Author) commented Mar 11, 2026

> I currently don't have the capacity to properly maintain convolution kernels […] My opinion is that using the interface in mma.cuh would be preferable but I don't think we should block an improvement based on that.

Thanks, @JohannesGaessler. I can commit to the long-term maintenance of this CUDA kernel and to working with anyone on bringing in HIP code for AMD devices or newer NVIDIA architectures, e.g. Blackwell.


10 participants