
ggml: avoid creating CUDA context during device init #20595

Merged

am17an merged 1 commit into ggml-org:master from ServeurpersoCom:cuda-defer-context-init on Mar 15, 2026

Conversation

@ServeurpersoCom (Contributor) commented Mar 15, 2026


ggml_cuda_init() calls cudaSetDevice() on every GPU just to query free VRAM for logging. This triggers the creation of a CUDA primary context (120-550 MB depending on GPU), which is irreversible for the lifetime of the process. Every process that loads the backend pays this cost, even if it never uses the GPU (router mode).

This PR removes cudaSetDevice + cudaMemGetInfo from device init. The log loses the free VRAM part but still shows total VRAM via cudaGetDeviceProperties (no context needed). Free VRAM is queried later by FIT through its own cudaSetDevice path, so the context creation is simply deferred to first real use.

Tested on an RTX PRO 6000 Blackwell: the router process no longer appears in nvidia-smi, and model loading is unchanged.

Fixes #20582
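For illustration, here is a minimal standalone sketch (my own, not the actual diff) of the driver-level distinction this PR relies on: cudaGetDeviceCount and cudaGetDeviceProperties are pure attribute queries that create no context, while cudaSetDevice (eager since CUDA 12) and cudaMemGetInfo materialize the primary context.

```cpp
// Minimal sketch, not the ggml code: which CUDA runtime calls can be
// made without materializing a primary context on each device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);             // no context created
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // still no context
        std::printf("Device %d: %s, total VRAM: %zu MiB\n",
                    i, prop.name, prop.totalGlobalMem / (1024 * 1024));

        // The pattern removed by this PR; these calls force the primary
        // context into existence on device i (120-550 MB of driver state
        // for the lifetime of the process):
        //
        //     cudaSetDevice(i);
        //     size_t free_b, total_b;
        //     cudaMemGetInfo(&free_b, &total_b);
    }
    return 0;
}
```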

ServeurpersoCom requested a review from a team as a code owner on March 15, 2026 at 15:35
@am17an (Contributor) commented Mar 15, 2026

I guess this effectively reverts #20185?

@ServeurpersoCom (Contributor, Author) commented:

> I guess this effectively reverts #20185?

Not quite a git revert, but almost!

- Total VRAM in the header: `found 1 CUDA device (Total VRAM: 98304 MiB)` -> kept
- Total VRAM per device: `VRAM: 98304 MiB` -> kept
- Free VRAM per device: `(95000 MiB free)` -> removed (querying it requires cudaSetDevice)


@tehsiuhuang (Contributor) commented:

@ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.

@ServeurpersoCom (Contributor, Author) commented:

> @ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.

Completely at random, actually: I have an RTX PRO 6000 with 96 GB of memory, and it takes less on a smaller GPU, so I wanted a "less than or equal to" range rather than saying 550 MB for everyone.

@ServeurpersoCom (Contributor, Author) commented:

According to an NVIDIA internal team post (developer forums, 2018): "The pre-allocated memory amount is related to GPU SMs number. The GPU with more SMs requires a larger memory." There is no public API to predict or measure the context overhead separately.

The primary context created by cudaSetDevice contains:

- the per-SM local-memory stack pool (scales linearly with SM count; the dominant factor),
- GPU page tables for virtual address translation (scale with total VRAM size),
- ELF/SASS module metadata (even with CUDA_MODULE_LOADING=LAZY, available since CUDA 11.7), and
- driver internal structures (stream descriptors, event pools, scheduling state).

This explains why the cost varies across GPUs: an RTX 3060 with 28 SMs and 12 GB VRAM pays around 120 MB, dual RTX 5090s around 300 MB per GPU, and an RTX PRO 6000 Blackwell with 192 SMs and 96 GB VRAM pays around 550 MB. The ratio roughly follows SM count; a rough estimate is sketched below.
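As a back-of-envelope check of the "dominant factor" claim (my own illustration, not from the PR; the 1 KiB default per-thread stack is an assumption, since reading the real limit via cudaDeviceGetLimit would itself create a context):

```cpp
// Back-of-envelope sketch (an assumption, not a documented formula):
// estimate the local-memory stack pool as
//   SM count x max resident threads per SM x default stack size.
// cudaGetDeviceProperties needs no context, so this stays cheap.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const size_t stack_bytes = 1024;  // assumed default per-thread stack
    size_t pool = (size_t) prop.multiProcessorCount
                * (size_t) prop.maxThreadsPerMultiProcessor
                * stack_bytes;
    std::printf("%s: %d SMs x %d threads x %zu B ~= %zu MiB\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor, stack_bytes,
                pool / (1024 * 1024));
    return 0;
}
```

For 192 SMs x 2048 resident threads x 1 KiB this gives ~384 MiB, the right order of magnitude for the ~550 MB observed once page tables and driver structures are added.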

@tehsiuhuang (Contributor) commented:

> > @ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.
>
> Completely at random, actually: I have an RTX PRO 6000 with 96 GB of memory, and it takes less on a smaller GPU, so I wanted a "less than or equal to" range rather than saying 550 MB for everyone.

Thanks! I just wrote a small test to understand the memory usage on an A100 (80 GB); I think your observation is accurate :)

| Stage | API called | Per-GPU memory | Delta |
|-------|------------|----------------|-------|
| 0 | (none) | 4 MiB | |
| 1 | cudaGetDeviceCount + cudaGetDeviceProperties | 4 MiB | +0 MiB |
| 2 | cudaSetDevice | 423 MiB | +419 MiB |
| 3 | cudaMemGetInfo | 423 MiB | +0 MiB |
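The test itself isn't shown in the thread; a hedged reconstruction of such a staged measurement could look like the sketch below, using NVML (which creates no CUDA context) to observe the GPU's used memory from the outside. Assumes device 0 and an otherwise idle GPU; build with `nvcc test.cu -lnvidia-ml`.

```cpp
// Hedged reconstruction of a staged measurement like the table above
// (not the original test). NVML itself creates no CUDA context, so it
// can read the device's used memory without perturbing the result.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

static size_t used_mib(void) {
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetMemoryInfo(dev, &mem);
    return (size_t)(mem.used / (1024 * 1024));
}

int main() {
    nvmlInit();
    std::printf("stage 0 (none):               %zu MiB\n", used_mib());

    int count = 0;
    cudaGetDeviceCount(&count);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("stage 1 (count + properties): %zu MiB\n", used_mib());

    cudaSetDevice(0);  // CUDA 12+: eagerly initializes the primary context
    std::printf("stage 2 (cudaSetDevice):      %zu MiB\n", used_mib());

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // context already exists: +0 MiB
    std::printf("stage 3 (cudaMemGetInfo):     %zu MiB\n", used_mib());

    nvmlShutdown();
    return 0;
}
```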

am17an merged commit ceef6b5 into ggml-org:master on Mar 15, 2026
22 of 49 checks passed
github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 15, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)


Development

Successfully merging this pull request may close: Misc. bug: llama-server router mode uses more VRAM than direct loading (#20582)

4 participants