
server: support load model on startup, support preset-only options#18206

Merged
ServeurpersoCom merged 5 commits into ggml-org:master from ngxson:xsn/server_models_autoload on Dec 20, 2025

Conversation


ngxson commented Dec 19, 2025

Fix #18163

Fix #18035

Example config:

[my_model]
load-on-startup = 1
no-mmap = 0
temp = 123.000

Note: it will throw an error if the --models-max limit is less than the number of models that require loading on startup.
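
For example, a minimal invocation sketch (the preset file name models.ini is a placeholder): save the config above to a preset file and start the router with a --models-max value at least as large as the number of load-on-startup models:

# sketch: models.ini stands in for the preset file shown above
# --models-max must be >= the number of load-on-startup models (here: 1)
llama-server --models-preset models.ini --models-max 2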

ngxson requested a review from ggerganov as a code owner on December 19, 2025 at 16:59
The github-actions bot added the testing, examples, and server labels on Dec 19, 2025

ServeurpersoCom commented Dec 19, 2025

Basic use case test OK:

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off  ; Disable automatic memory fitting
ngl = 999  ; Full GPU offload
ctk = q8_0 ; KV cache key quantization
ctv = q8_0 ; KV cache value quantization
fa = on    ; Enable flash attention
mlock = on ; Lock model in RAM
np = 4     ; Parallel request batching
kvu = on   ; Unified KV cache buffer

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072 ; Context size in tokens for this model
load-on-startup = 1 ; Load immediately on server startup

srv   load_models:   * MoE-Qwen3-Next-80B-A3B-Instruct
srv   load_models:   * MoE-Uncensored-GLM-4.5-Air-Derestricted-106B
srv   load_models: autoloading model Dense-Devstral-Small-2-24B-Instruct-2512
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 33293
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --mlock
srv          load:   --port
srv          load:   33293
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   131072
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --kv-unified
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8082


elfarolab commented Dec 19, 2025

Faster than me! I was almost ready with the PR for this feature. Anyway, the code proposed here is better than mine. I recommend changing the name because it can confuse people: loading a model is different from starting it. I got confused by that too. I suggest renaming this .ini property to "autostart = true".

Thank you so much to everybody.

@ServeurpersoCom

"load" model weight in memory vs. "start" inference on a model, using load sound good!

@elfarolab

Are we talking about starting or loading a model? thx

@ServeurpersoCom

> Faster than me! I was almost ready with the PR for this feature. Anyway, the code proposed here is better than mine. I recommend changing the name because it can confuse people: loading a model is different from starting it. I got confused by that too. I suggest renaming this .ini property to "autostart = true".
>
> Thank you so much to everybody.

In the llama.cpp context, we "load" models (load/unload endpoints), not "start" them. The model gets loaded into memory and becomes available; "autoload" describes this action perfectly. What happens internally (spawning instances) is an implementation detail, but from the user's perspective you configure which models to auto-load at startup. I think "autoload" is the right term here.
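
For illustration only, a hypothetical sketch of what on-demand load/unload requests against the router could look like; the endpoint paths and payload shown here are assumptions, not taken from this PR:

# hypothetical routes: the actual server API paths/payloads may differ
curl -X POST http://127.0.0.1:8082/models/load -d '{"model": "my_model"}'
curl -X POST http://127.0.0.1:8082/models/unload -d '{"model": "my_model"}'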


ngxson commented Dec 19, 2025

hmm yeah, I think a more specific term like load-on-startup can make it clearer.

the problem with autoload is that there is already logic called "autoload" in the code base: it allows a model to be automatically loaded when it is requested via the API. we have a global flag --(no-)models-autoload for it, but we may implement autoload = true|false to control this behavior per-model
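
A preset sketch of how the two per-model switches could sit side by side; load-on-startup is what this PR implements, while a per-model autoload option is only a possibility mentioned above, not something that exists:

[my_model]
load-on-startup = 1 ; implemented by this PR: load this model when the router starts
; autoload = false  ; hypothetical: would opt this model out of on-demand loading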

ngxson changed the title from "server: support autoload model, support preset-only options" to "server: support load model on startup, support preset-only options" on Dec 19, 2025

elfarolab commented Dec 19, 2025

it works for me.

version = 1

[*]
fit = off
ngl = 999
fa = on
jinja = true
ctk = q8_0
ctv = q8_0
mlock = on
np = 2
kvu = on

[unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF]
load-on-startup = true
m = /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf
mm = /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_mmproj-F16.gguf
c = 8192
threads = 8
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
batch-size = 2048
ubatch-size = 2048

/opt/llama.cpp/bin/llama-server --host 0.0.0.0 --port 8088 --models-preset /opt/llama.cpp/etc/models.ini

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7487 (b3079c1) with GNU 11.4.0 for Linux aarch64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 11 threads for HTTP server
srv load_models: Loaded 0 cached model presets
srv load_models: Loaded 2 custom model presets from /opt/llama.cpp/etc/models.ini
srv load_models: Available models (2) (*: custom preset)
srv load_models: * default
srv load_models: * unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load_models: (startup) loading model unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load: spawning server instance with name=unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF on port 42061
srv load: spawning server instance with args:
srv load: /opt/llama.cpp/bin/llama-server
srv load: --host
srv load: 127.0.0.1
srv load: --jinja
srv load: -kvu
srv load: --min-p
srv load: 0.0
srv load: --mlock
srv load: --port
srv load: 42061
srv load: --presence-penalty
srv load: 1.5
srv load: --temp
srv load: 0.7
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.8
srv load: --alias
srv load: unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF
srv load: --batch-size
srv load: 2048
srv load: --ctx-size
srv load: 8192
srv load: --cache-type-k
srv load: q8_0
srv load: --cache-type-v
srv load: q8_0
srv load: --flash-attn
srv load: on
srv load: --fit
srv load: off
srv load: --model
srv load: /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf
srv load: --mmproj
srv load: /opt/llama-models/Qwen3-VL-30B-A3B-Instruct/unsloth_Qwen3-VL-30B-A3B-Instruct-GGUF_mmproj-F16.gguf
srv load: --n-gpu-layers
srv load: 999
srv load: --parallel
srv load: 2
srv load: --threads
srv load: 8
srv load: --ubatch-size
srv load: 2048
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://0.0.0.0:8088


ServeurpersoCom left a comment


tested successfully, waiting for RISC-V CI to pass

ServeurpersoCom merged commit 9e39a1e into ggml-org:master on Dec 20, 2025
135 of 136 checks passed
@Fluffkin

I'm not sure this works as intended: "load-on-startup = true" has the same behaviour as "load-on-startup = false". In fact, the only way to keep a model from loading at startup is to exclude "load-on-startup" from its preset entirely.


ngxson commented Dec 23, 2025

yes, that will need to be fixed. but in the meantime, you can also use an INI comment:

; load-on-startup = true

@dlippold

When the parameter load-on-startup is used in a preset file, the value 0 (zero) for the parameter --models-max (which should mean unlimited) is misinterpreted.

For a test I used a file /home/llamaserver/config/router-models.ini with the following content:

[Qwen3-Next]
load-on-startup = on
model = /opt/models/Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-Q5_K_M-00001-of-00002.gguf

When I execute the command

/home/llamaserver/bin/llama-server --port 9000 --offline --models-autoload --models-max 0 --models-preset /home/llamaserver/config/router-models.ini

I get the error message

main: failed to initialize router models: number of models to load on startup (1) exceeds models_max (0)
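
Until the 0-means-unlimited case is handled, a possible workaround (a sketch reusing the same files as above) is to pass an explicit limit at least as large as the number of load-on-startup models:

/home/llamaserver/bin/llama-server --port 9000 --offline --models-autoload --models-max 1 --models-preset /home/llamaserver/config/router-models.ini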

@dlippold

To differentiate the meaning of the parameter --models-autoload from the parameter load-on-startup, please extend the description of --models-autoload in the Server-specific params table in https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, e.g. to:

for router server, whether to automatically load a model when the model is requested
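
As a sketch of what that row could look like (assuming the README's two-column Argument/Explanation table layout):

| Argument | Explanation |
| -------- | ----------- |
| `--models-autoload` | for router server, whether to automatically load a model when the model is requested |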

