
server: support load model on startup, support preset-only options#18206

Merged
ServeurpersoCom merged 5 commits into ggml-org:master from ngxson:xsn/server_models_autoload on Dec 20, 2025

Conversation


ngxson commented Dec 19, 2025

Fix #18163

Fix #18035

Example config:

[my_model]
load-on-startup = 1
no-mmap = 0
temp = 123.000

Note: it will throw an error if the --models-max limit is less than the number of models that require loading on startup.
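
For example, a minimal invocation sketch (the preset file name models.ini is a placeholder): save the config above to a preset file and start the router with a --models-max value at least as large as the number of load-on-startup models:

# sketch: models.ini stands in for the preset file shown above
# --models-max must be >= the number of load-on-startup models (here: 1)
llama-server --models-preset models.ini --models-max 2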

ngxson requested a review from ggerganov as a code owner on December 19, 2025 at 16:59
The github-actions bot added the testing, examples, and server labels on Dec 19, 2025

ServeurpersoCom commented Dec 19, 2025

Basic use case test OK:

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off  ; Disable automatic memory fitting
ngl = 999  ; Full GPU offload
ctk = q8_0 ; KV cache key quantization
ctv = q8_0 ; KV cache value quantization
fa = on    ; Enable flash attention
mlock = on ; Lock model in RAM
np = 4     ; Parallel request batching
kvu = on   ; Unified KV cache buffer

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072 ; Context size in tokens for this model
load-on-startup = 1 ; Load immediately on server startup

srv   load_models:   * MoE-Qwen3-Next-80B-A3B-Instruct
srv   load_models:   * MoE-Uncensored-GLM-4.5-Air-Derestricted-106B
srv   load_models: autoloading model Dense-Devstral-Small-2-24B-Instruct-2512
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 33293
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --mlock
srv          load:   --port
srv          load:   33293
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   131072
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --kv-unified
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8082


elfarolab commented Dec 19, 2025

Faster than me! I was almost ready with the PR for this feature. Anyway, the code proposed here is better than mine. I recommend changing the name because it can confuse people: loading a model is different from starting it. I got confused by that too. I suggest renaming this .ini property to "autostart = true".

Thank you so much to everybody.

@ServeurpersoCom

"load" model weight in memory vs. "start" inference on a model, using load sound good!

@elfarolab

Are we talking about starting or loading a model? thx

@ServeurpersoCom

> Faster than me! I was almost ready with the PR for this feature. Anyway, the code proposed here is better than mine. I recommend changing the name because it can confuse people: loading a model is different from starting it. I got confused by that too. I suggest renaming this .ini property to "autostart = true".
>
> Thank you so much to everybody.

In the llama.cpp context, we "load" models (load/unload endpoints), not "start" them. The model gets loaded into memory and becomes available; "autoload" describes this action perfectly. What happens internally (spawning instances) is an implementation detail, but from the user's perspective you configure which models to auto-load at startup. I think "autoload" is the right term here.
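
For illustration only, a hypothetical sketch of what on-demand load/unload requests against the router could look like; the endpoint paths and payload shown here are assumptions, not taken from this PR:

# hypothetical routes: the actual server API paths/payloads may differ
curl -X POST http://127.0.0.1:8082/models/load -d '{"model": "my_model"}'
curl -X POST http://127.0.0.1:8082/models/unload -d '{"model": "my_model"}'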


ngxson commented Dec 19, 2025

hmm yeah, I think a more specific term like load-on-startup can make it clearer.

the problem with autoload is that there is already logic called "autoload" in the code base: it allows a model to be automatically loaded when it is requested via the API. we have a global flag --(no-)models-autoload for it, but we may implement autoload = true|false to control this behavior per-model
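
A preset sketch of how the two per-model switches could sit side by side; load-on-startup is what this PR implements, while a per-model autoload option is only a possibility mentioned above, not something that exists:

[my_model]
load-on-startup = 1 ; implemented by this PR: load this model when the router starts
; autoload = false  ; hypothetical: would opt this model out of on-demand loading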

ngxson changed the title from "server: support autoload model, support preset-only options" to "server: support load model on startup, support preset-only options" on Dec 19, 2025

elfarolab commented Dec 19, 2025

it works for me.

version = 1

[*]
fit = off
ngl = 999
fa = on
jinja = true
ctk = q8_0
ctv = q8_0
mlock = on
np = 2
kvu = on

[unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF]
load-on-startup = true
m = /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf
mm = /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_mmproj-F16.gguf
c = 8192
threads = 8
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
batch-size = 2048
ubatch-size = 2048

/opt/llama.cpp/bin/llama-server --host 0.0.0.0 --port 8088 --models-preset /opt/llama.cpp/etc/models.ini

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7487 (b3079c1) with GNU 11.4.0 for Linux aarch64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 11 threads for HTTP server
srv load_models: Loaded 0 cached model presets
srv load_models: Loaded 2 custom model presets from /opt/llama.cpp/etc/models.ini
srv load_models: Available models (2) (*: custom preset)
srv load_models: * default
srv load_models: * unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load_models: (startup) loading model unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load: spawning server instance with name=unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF on port 42061
srv load: spawning server instance with args:
srv load: /opt/llama.cpp/bin/llama-server
srv load: --host
srv load: 127.0.0.1
srv load: --jinja
srv load: -kvu
srv load: --min-p
srv load: 0.0
srv load: --mlock
srv load: --port
srv load: 42061
srv load: --presence-penalty
srv load: 1.5
srv load: --temp
srv load: 0.7
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.8
srv load: --alias
srv load: unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load: --batch-size
srv load: 2048
srv load: --ctx-size
srv load: 8192
srv load: --cache-type-k
srv load: q8_0
srv load: --cache-type-v
srv load: q8_0
srv load: --flash-attn
srv load: on
srv load: --fit
srv load: off
srv load: --model
srv load: /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf
srv load: --mmproj
srv load: /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_mmproj-F16.gguf
srv load: --n-gpu-layers
srv load: 999
srv load: --parallel
srv load: 2
srv load: --threads
srv load: 8
srv load: --ubatch-size
srv load: 2048
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://0.0.0.0:8088


ServeurpersoCom left a comment


tested successfully, waiting for RISC-V CI to pass

ServeurpersoCom merged commit 9e39a1e into ggml-org:master on Dec 20, 2025
135 of 136 checks passed
@Fluffkin

I'm not sure this works as intended: "load-on-startup = true" has the same behaviour as "load-on-startup = false". In fact, the only way to keep a model from loading at startup is to exclude "load-on-startup" from its preset entirely.


ngxson commented Dec 23, 2025

yes, that will need to be fixed. but in the meantime, you can also use an INI comment:

; load-on-startup = true

@dlippold

When the parameter load-on-startup is used in a preset file, the value 0 (zero) for the parameter --models-max (which should mean unlimited) is misinterpreted.

For a test I used a file /home/llamaserver/config/router-models.ini with the following content:

[Qwen3-Next]
load-on-startup = on
model = /opt/models/Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-Q5_K_M-00001-of-00002.gguf

When I execute the command

/home/llamaserver/bin/llama-server --port 9000 --offline --models-autoload --models-max 0 --models-preset /home/llamaserver/config/router-models.ini

I get the error message

main: failed to initialize router models: number of models to load on startup (1) exceeds models_max (0)
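
Until the 0-means-unlimited case is handled, a possible workaround (a sketch reusing the same files as above) is to pass an explicit limit at least as large as the number of load-on-startup models:

/home/llamaserver/bin/llama-server --port 9000 --offline --models-autoload --models-max 1 --models-preset /home/llamaserver/config/router-models.ini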

@dlippold

To differentiate the meaning of the parameter --models-autoload from the parameter load-on-startup, please extend the description of --models-autoload in the Server-specific params table in https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, e.g. to:

for router server, whether to automatically load a model when the model is requested
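
As a sketch of what that row could look like (assuming the README's two-column Argument/Explanation table layout):

| Argument | Explanation |
| -------- | ----------- |
| `--models-autoload` | for router server, whether to automatically load a model when the model is requested |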

