supports running on CPU for GGML_USE_CUBLAS=ON build #3946

ggerganov merged 3 commits into ggml-org:master from …
Conversation
force-pushed from 1caf0c4 to 1a1ffd4
We might want to merge something like this PR, so that 3rd-party projects have an easier way to support optional CPU-only runs. Though I'm not sure if this is the best way to do it.
Some alternative ideas were considered when prototyping this pull request (assuming …).
Most of this will become obsolete after llama.cpp is adapted to use ggml-backend. After that, the way this will be implemented is by making …
force-pushed from 32f07ea to 42e642a
I cleaned it up a bit and think it should be relatively easy for a ggml-backend migration.
force-pushed from 9c655dc to b66fdd1

force-pushed from b66fdd1 to c58e809
```c
if (ggml_cublas_loaded()) {
    return ggml_cuda_host_malloc(n);
} else {
    return malloc(n);
}
```
Should we move the ggml_cublas_loaded() checks inside the ggml_cuda_host_malloc() and ggml_cuda_host_free() calls?
I can do that, but I feel it might be better to keep it explicit. This way, downstream code has a chance to tell whether it is actually allocating CUDA memory, which is likely to come with certain memory-alignment requirements.
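To make the trade-off concrete, here is a minimal sketch of the alternative being discussed: folding the check into the helper itself. The function names come from this PR, but the body is illustrative, not the actual implementation:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <cuda_runtime.h>

extern bool ggml_cublas_loaded(void);

// Sketch: hide the fallback inside the helper, so callers never branch.
void * ggml_cuda_host_malloc(size_t n) {
    if (!ggml_cublas_loaded()) {
        // No usable CUDA device: plain heap memory, without the pinning
        // and alignment guarantees of cudaMallocHost().
        return malloc(n);
    }
    void * ptr = NULL;
    if (cudaMallocHost(&ptr, n) != cudaSuccess) {
        return NULL;
    }
    return ptr;
}
```

The downside, per the reply above, is that callers can no longer tell whether they received pinned CUDA host memory or ordinary heap memory.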
ggerganov left a comment:
I think this can serve as a temporary solution until we start using the new backend and refactor this code.
One question: if I had a machine with a CUDA device but still wanted to force CPU-only computation, what would be my option: set CUDA_VISIBLE_DEVICES=-1? Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.
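For reference, a sketch of the environment-variable approach in C (setting CUDA_VISIBLE_DEVICES=-1 in the shell before launching is equivalent; the key constraint is that it must be set before the CUDA runtime is first initialized):

```c
#include <stdlib.h>

int main(void) {
    // Hide all CUDA devices from the runtime. cudaGetDeviceCount() will then
    // find no usable device, so ggml_cublas_loaded() should remain false.
    // This must happen before ggml_init_cublas() or any other CUDA call.
    setenv("CUDA_VISIBLE_DEVICES", "-1", 1);  // POSIX; use _putenv on Windows

    // ... continue with normal ggml/llama.cpp initialization ...
    return 0;
}
```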
```c
#define GGML_CUDA_MAX_DEVICES 16

GGML_API void ggml_init_cublas(void);
GGML_API bool ggml_cublas_loaded(void);
```
Add a comment explaining that we differentiate between the "initialized" and "loaded" states. The former means we have called ggml_init_cublas(), but it is not guaranteed that a CUDA device was available; in that case ggml_cublas_loaded() is false.
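Something along these lines would do it (a sketch of the requested comment, not the exact wording that ended up being merged):

```c
// Initializes the CUDA code path. Safe to call even when no CUDA device is
// present; use ggml_cublas_loaded() to check whether one was actually found.
GGML_API void ggml_init_cublas(void);

// Returns true only if ggml_init_cublas() found a usable CUDA device and
// loaded cuBLAS ("loaded"); returns false in a CUDA build that was merely
// "initialized" on a CPU-only machine.
GGML_API bool ggml_cublas_loaded(void);
```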
I experimented with …

It's possible that the behaviour with …

That's not the case: CUDA is still used for large matrix multiplication regardless of the value of …
Kindly ping @cebtenzzre
…#3946)

* prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
Due to newer changes linking against libcuda, the fix is no longer working. It generates the following error message for a CUDA build when running in a non-CUDA environment: … #4606 seems to be the culprit. Looking into a workaround.

Should work on a machine without a CUDA runtime, but with model.n_gpu_layers = 0. The current behavior in master is to throw the following error on a non-CUDA machine when GGML_USE_CUBLAS=ON: …

| scenario | computation runs on |
| --- | --- |
| master | CPU |
| this PR | CPU |
| CPU but requesting ngl > 0 | CUDA |
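For downstream users, the CPU rows above correspond to loading the model with zero offloaded layers. A sketch against the llama.cpp C API (the model path is a placeholder):

```c
#include <stdio.h>
#include "llama.h"

int main(void) {
    struct llama_model_params mparams = llama_model_default_params();
    // With this PR, a GGML_USE_CUBLAS=ON binary runs on a non-CUDA machine
    // as long as no layers are requested on the GPU.
    mparams.n_gpu_layers = 0;

    struct llama_model * model =
        llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }
    llama_free_model(model);
    return 0;
}
```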