
ggml: avoid creating CUDA context during device init #20595

Merged

am17an merged 1 commit into ggml-org:master from ServeurpersoCom:cuda-defer-context-init on Mar 15, 2026

Conversation

@ServeurpersoCom (Contributor) commented Mar 15, 2026


ggml_cuda_init() calls cudaSetDevice() on every GPU just to query free VRAM for logging. This triggers the creation of a CUDA primary context (120-550 MB depending on GPU), which is irreversible for the lifetime of the process. Every process that loads the backend pays this cost, even if it never uses the GPU (router mode).

This PR removes cudaSetDevice + cudaMemGetInfo from device init. The log loses the free VRAM part but still shows total VRAM via cudaGetDeviceProperties (no context needed). Free VRAM is queried later by FIT through its own cudaSetDevice path, so the context creation is simply deferred to first real use.

Tested on an RTX PRO 6000 Blackwell: the router process no longer appears in nvidia-smi, and model loading is unchanged.

Fixes #20582
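For illustration, here is a minimal standalone sketch (my own, not the actual diff) of the driver-level distinction this PR relies on: cudaGetDeviceCount and cudaGetDeviceProperties are pure attribute queries that create no context, while cudaSetDevice (eager since CUDA 12) and cudaMemGetInfo materialize the primary context.

```cpp
// Minimal sketch, not the ggml code: which CUDA runtime calls can be
// made without materializing a primary context on each device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);             // no context created
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // still no context
        std::printf("Device %d: %s, total VRAM: %zu MiB\n",
                    i, prop.name, prop.totalGlobalMem / (1024 * 1024));

        // The pattern removed by this PR; these calls force the primary
        // context into existence on device i (120-550 MB of driver state
        // for the lifetime of the process):
        //
        //     cudaSetDevice(i);
        //     size_t free_b, total_b;
        //     cudaMemGetInfo(&free_b, &total_b);
    }
    return 0;
}
```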

ServeurpersoCom requested a review from a team as a code owner on March 15, 2026 at 15:35
@am17an (Contributor) commented Mar 15, 2026

I guess this effectively reverts #20185?

@ServeurpersoCom (Contributor, Author) commented:

> I guess this effectively reverts #20185?

Not quite a git revert, but almost!

- Total VRAM in the header: `found 1 CUDA device (Total VRAM: 98304 MiB)` -> kept
- Total VRAM per device: `VRAM: 98304 MiB` -> kept
- Free VRAM per device: `(95000 MiB free)` -> removed (querying it requires cudaSetDevice)


@tehsiuhuang (Contributor) commented:

@ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.

@ServeurpersoCom (Contributor, Author) commented:

> @ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.

Completely at random, actually: I have an RTX PRO 6000 with 96 GB of memory, and it takes less on a smaller GPU, so I wanted a "less than or equal to" range rather than saying 550 MB for everyone.

@ServeurpersoCom (Contributor, Author) commented:

According to an NVIDIA internal team post (developer forums, 2018): "The pre-allocated memory amount is related to GPU SMs number. The GPU with more SMs requires a larger memory." There is no public API to predict or measure the context overhead separately.

The primary context created by cudaSetDevice contains:

- the per-SM local-memory stack pool (scales linearly with SM count; the dominant factor),
- GPU page tables for virtual address translation (scale with total VRAM size),
- ELF/SASS module metadata (even with CUDA_MODULE_LOADING=LAZY, available since CUDA 11.7), and
- driver internal structures (stream descriptors, event pools, scheduling state).

This explains why the cost varies across GPUs: an RTX 3060 with 28 SMs and 12 GB VRAM pays around 120 MB, dual RTX 5090s around 300 MB per GPU, and an RTX PRO 6000 Blackwell with 192 SMs and 96 GB VRAM pays around 550 MB. The ratio roughly follows SM count; a rough estimate is sketched below.
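As a back-of-envelope check of the "dominant factor" claim (my own illustration, not from the PR; the 1 KiB default per-thread stack is an assumption, since reading the real limit via cudaDeviceGetLimit would itself create a context):

```cpp
// Back-of-envelope sketch (an assumption, not a documented formula):
// estimate the local-memory stack pool as
//   SM count x max resident threads per SM x default stack size.
// cudaGetDeviceProperties needs no context, so this stays cheap.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const size_t stack_bytes = 1024;  // assumed default per-thread stack
    size_t pool = (size_t) prop.multiProcessorCount
                * (size_t) prop.maxThreadsPerMultiProcessor
                * stack_bytes;
    std::printf("%s: %d SMs x %d threads x %zu B ~= %zu MiB\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor, stack_bytes,
                pool / (1024 * 1024));
    return 0;
}
```

For 192 SMs x 2048 resident threads x 1 KiB this gives ~384 MiB, the right order of magnitude for the ~550 MB observed once page tables and driver structures are added.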

@tehsiuhuang (Contributor) commented:

> > @ServeurpersoCom Thanks for finding this! For learning purposes: how did you measure the "creation of a CUDA primary context (120-550 MB depending on GPU)"? Surprisingly, this is really high.
>
> Completely at random, actually: I have an RTX PRO 6000 with 96 GB of memory, and it takes less on a smaller GPU, so I wanted a "less than or equal to" range rather than saying 550 MB for everyone.

Thanks! I just wrote a small test to understand the memory usage on an A100 (80 GB); I think your observation is accurate :)

| Stage | API called | Per-GPU memory | Delta |
|-------|------------|----------------|-------|
| 0 | (none) | 4 MiB | |
| 1 | cudaGetDeviceCount + cudaGetDeviceProperties | 4 MiB | +0 MiB |
| 2 | cudaSetDevice | 423 MiB | +419 MiB |
| 3 | cudaMemGetInfo | 423 MiB | +0 MiB |
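The test itself isn't shown in the thread; a hedged reconstruction of such a staged measurement could look like the sketch below, using NVML (which creates no CUDA context) to observe the GPU's used memory from the outside. Assumes device 0 and an otherwise idle GPU; build with `nvcc test.cu -lnvidia-ml`.

```cpp
// Hedged reconstruction of a staged measurement like the table above
// (not the original test). NVML itself creates no CUDA context, so it
// can read the device's used memory without perturbing the result.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

static size_t used_mib(void) {
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetMemoryInfo(dev, &mem);
    return (size_t)(mem.used / (1024 * 1024));
}

int main() {
    nvmlInit();
    std::printf("stage 0 (none):               %zu MiB\n", used_mib());

    int count = 0;
    cudaGetDeviceCount(&count);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("stage 1 (count + properties): %zu MiB\n", used_mib());

    cudaSetDevice(0);  // CUDA 12+: eagerly initializes the primary context
    std::printf("stage 2 (cudaSetDevice):      %zu MiB\n", used_mib());

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // context already exists: +0 MiB
    std::printf("stage 3 (cudaMemGetInfo):     %zu MiB\n", used_mib());

    nvmlShutdown();
    return 0;
}
```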

am17an merged commit ceef6b5 into ggml-org:master on Mar 15, 2026
22 of 49 checks passed
github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 15, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)


Development

Successfully merging this pull request may close: Misc. bug: llama-server router mode uses more VRAM than direct loading (#20582)

4 participants