CUDA: compress-mode size #12029
Conversation
That's quite a lot; I didn't realize that the build with all supported archs had gotten so big. In the Windows releases it seems to be 500M, so it's not that bad, but still pretty bad. I am not exactly sure what the downsides of enabling this option may be; it would be preferable if it were optional. Enabling it by default should be ok, though.
And so it is for Linux. Even before 12.8 it was compressing by default, either with a
They say it costs startup time, which I think would be ok for almost all ML use cases that use CUDA anyway. I just hope it's not incurred on every kernel launch. I don't have a setup right now where I can test that myself, so if anyone can help here, that would be nice. Ok, I will make it a ggml option and enable it by default. Or should I make the option a string and just pass that through? (none, speed, balance, size)
Yes, that sounds good to me.
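A minimal sketch of how such a string-valued option could look in CMake. The `GGML_CUDA_COMPRESSION_MODE` name and the flag come from this PR's diff; the version gate and the allowed-values property here are illustrative assumptions, not the PR's exact code:

```cmake
# Sketch only: string-valued cache option for the nvcc compression mode.
set(GGML_CUDA_COMPRESSION_MODE "size" CACHE STRING
    "ggml: CUDA binary compression mode (none, speed, balance, size); needs CUDA 12.8+")
set_property(CACHE GGML_CUDA_COMPRESSION_MODE
             PROPERTY STRINGS "none;speed;balance;size")

# Only pass the flag when the toolkit supports it and compression is requested.
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.8" AND
    NOT GGML_CUDA_COMPRESSION_MODE STREQUAL "none")
    list(APPEND CUDA_FLAGS -compress-mode=${GGML_CUDA_COMPRESSION_MODE})
endif()
```

The `STRINGS` property is just a convenience so cmake-gui/ccmake offer the four values as a dropdown; the string is passed through to nvcc unchanged.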
CUDA 12.8 added the option to specify stronger compression for binaries.
Force-pushed from d7580f2 to 6cdc5d3
```cmake
# - speed (nvcc's default)
# - balance
# - size
list(APPEND CUDA_FLAGS -compress-mode=${GGML_CUDA_COMPRESSION_MODE})
```
According to the CUDA documentation both are accepted, and I chose the single dash because the other options next to it are also single-dash.
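For reference, a sketch of the two spellings nvcc accepts (file names here are placeholders, not from the PR):

```shell
# nvcc accepts both the single- and double-dash form of the option:
nvcc -compress-mode=size  -c kernel.cu -o kernel.o
nvcc --compress-mode=size -c kernel.cu -o kernel.o
```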
CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
If somebody has this error: a pair of CUDA 12.8 and GCC 12 solved my issue. This Ubuntu shell script helped me to set things up. There is one thing: by default my nvcc points to version 11, which was likely the error. I tried to compile various versions with GCC-11, but for CUDA 12.8 I needed GCC-12.
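When multiple CUDA/GCC versions are installed, one way to pin both compilers explicitly is to pass them to CMake. This is a sketch; the install paths are typical Ubuntu locations, not taken from the thread:

```shell
# Example only: select CUDA 12.8's nvcc and GCC 12 as its host compiler.
cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc \
      -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12
cmake --build build --config Release
```

Pinning `CMAKE_CUDA_COMPILER` avoids relying on whichever nvcc happens to be first in `PATH`.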
This patch sets the CUDA compression mode to "size" for CUDA >= 12.8.
I ran some tests in CI with the new Ubuntu CUDA 12.8 Docker image:
89-real arch
In this scenario, it appears it is not compressing by default at all?
60;61;70;75;80 arches
I did not test the runtime load overhead this should incur, but for most ggml-cuda use cases the processes are usually long(er)-lived, so the trade-off seems reasonable to me.
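A quick way to reproduce such a size comparison locally is to build twice with different modes and diff the output sizes. A sketch, assuming the `GGML_CUDA_COMPRESSION_MODE` option from this PR and a standard `bin/` output directory:

```shell
# Sketch: build with and without stronger compression, then compare sizes.
cmake -B build-size -DGGML_CUDA=ON -DGGML_CUDA_COMPRESSION_MODE=size
cmake --build build-size --config Release
cmake -B build-none -DGGML_CUDA=ON -DGGML_CUDA_COMPRESSION_MODE=none
cmake --build build-none --config Release
du -sh build-size/bin build-none/bin
```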