ggml: avoid creating CUDA context during device init #20595
am17an merged 1 commit into ggml-org:master
Conversation
I guess this effectively reverts #20185?
Not quite a git revert, but almost! The `Free VRAM per device: (95000 MiB free)` log line was removed, since producing it is what caused the cudaSetDevice call.
@ServeurpersoCom Thanks for finding this!! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.
Completely at random, actually: I have an RTX PRO 6000 with 96 GB of memory, and the overhead is smaller on a smaller GPU, so I wanted to phrase it as "less than or equal to" rather than claim 550 MB for everyone.
According to NVIDIA internal team (developer forums, 2018): "The pre-allocated memory amount is related to GPU SMs number. The GPU with more SMs requires a larger memory." There is no public API to predict or measure the context overhead separately. |
Thanks!!! I just wrote a small test to understand the memory usage on an A100 (80 GB), and I think your observation is accurate :)
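A minimal sketch of how such a measurement could be done (this is an assumption, not the actual test from the thread): NVML can read free VRAM without creating a CUDA context, so comparing the NVML reading before and after forcing context creation with `cudaFree(0)` isolates the primary-context overhead.

```cuda
// Sketch: measure primary-context overhead on device 0.
// NVML queries do not create a CUDA context; cudaFree(0) forces one.
// Assumed build line: nvcc ctx_cost.cu -lnvidia-ml -o ctx_cost
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlMemory_t before, after;
    nvmlDeviceGetMemoryInfo(dev, &before);  // no CUDA context exists yet

    cudaSetDevice(0);
    cudaFree(0);                            // forces primary context creation

    nvmlDeviceGetMemoryInfo(dev, &after);
    printf("primary context overhead: ~%llu MiB\n",
           (unsigned long long)((before.free - after.free) >> 20));

    nvmlShutdown();
    return 0;
}
```

Run inside a fresh process, since the context, once created, persists for the lifetime of the process.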
ggml_cuda_init() calls cudaSetDevice() on every GPU just to query free VRAM for logging. This triggers the creation of a CUDA primary context (120-550 MB depending on GPU), which is irreversible for the lifetime of the process. Every process that loads the backend pays this cost, even if it never uses the GPU (router mode).
This PR removes cudaSetDevice + cudaMemGetInfo from device init. The log loses the free VRAM part but still shows total VRAM via cudaGetDeviceProperties (no context needed). Free VRAM is queried later by FIT through its own cudaSetDevice path, so the context creation is simply deferred to first real use.
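As a sketch of the idea (an assumption about the shape of the change, not the exact patch): total VRAM can be logged per device via `cudaGetDeviceProperties`, which needs no context, whereas the old `cudaSetDevice` + `cudaMemGetInfo` pair creates one.

```cuda
// Sketch: context-free per-device logging during backend init.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // does not create a context
        printf("Device %d: %s, %zu MiB total VRAM\n",
               i, prop.name, prop.totalGlobalMem >> 20);
        // Removed pattern: cudaSetDevice(i) + cudaMemGetInfo(&free, &total),
        // which would irreversibly create a 120-550 MB primary context per GPU.
    }
    return 0;
}
```

The free-VRAM figure is deferred to the first code path that genuinely needs a context on that device.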
Tested on an RTX PRO 6000 Blackwell: the router process no longer appears in nvidia-smi, and model loading is unchanged.
Fixes #20582