llama-server: add router multi-model tests (#17704) by ServeurpersoCom · Pull Request #17722 · ggml-org/llama.cpp

ServeurpersoCom · 2025-12-03T09:38:25Z

Add 4 test cases for model router:

- test_router_unload_model: explicit model unloading
- test_router_models_max_evicts_lru: LRU eviction with --models-max
- test_router_no_models_autoload: --no-models-autoload flag behavior
- test_router_api_key_required: API key authentication

Tests use async model loading with polling, and skip gracefully when too few models are available for eviction testing.
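As an illustration of the polling approach described above, here is a minimal sketch of a wait-with-timeout helper. The names `wait_for_status` and `get_status` are hypothetical and are not taken from the PR; the real tests query the router over HTTP, which is replaced here by a plain callable.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    desired: Iterable[str],
    timeout: float = 60.0,
    interval: float = 0.5,
) -> str:
    """Poll get_status() until it returns one of `desired` or the timeout expires."""
    desired_set = set(desired)
    deadline = time.monotonic() + timeout
    last_status = get_status()
    while last_status not in desired_set:
        if time.monotonic() >= deadline:
            # Report the last observed status so a failure is easy to diagnose.
            raise AssertionError(
                f"Timed out waiting to reach {desired_set}, last status: {last_status}"
            )
        time.sleep(interval)
        last_status = get_status()
    return last_status
```

A monotonic clock is used for the deadline so that wall-clock adjustments during a test run cannot shorten or extend the wait.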

utils.py changes:

- Add models_max, models_dir, no_models_autoload attributes to ServerProcess
- Handle JSONDecodeError for non-JSON error responses (fallback to text)
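The JSON fallback mentioned in the last item can be sketched roughly as follows. This is not the code from utils.py; `parse_body` is a hypothetical helper that only shows the pattern of trying JSON first and keeping the raw text otherwise.

```python
import json
from typing import Any


def parse_body(text: str) -> Any:
    """Return the decoded JSON body, or the raw text if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Non-JSON error responses (for example a plain-text error message)
        # are returned unchanged so the caller can still inspect them.
        return text
```

With this shape, a test can assert on the status code and still look at the body regardless of whether the server answered with JSON or plain text.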

ngxson · 2025-12-03T10:06:11Z

I mirrored the PR here for faster CI: ngxson#48

ngxson

Great job, thanks! Merging once CI passes.

ServeurpersoCom · 2025-12-03T10:44:35Z

With an empty tmp/* I get the failure below; I'll look into it after eating:

FAILED unit/test_router.py::test_router_models_max_evicts_lru - AssertionError: Timed out waiting for ggml-org/tinygemma3-GGUF:Q8_0 to reach {'loaded'}, last status: unloaded

ngxson · 2025-12-03T10:46:31Z

@ServeurpersoCom that can happen if ServerPreset.load_all() was never called. We had a fixture to automatically call it before any tests start, but maybe it's not called in your case?

ServeurpersoCom · 2025-12-03T10:51:48Z

> @ServeurpersoCom that can happen if ServerPreset.load_all() was never called. We had a fixture to automatically call it before any tests start, but maybe it's not called in your case?

Yes, it's missing. I need this:

import pytest
from utils import ServerPreset

@pytest.fixture(scope="module", autouse=True)
def ensure_models_cached():
    """Ensure all models are cached before router tests"""
    ServerPreset.load_all()

ngxson · 2025-12-03T11:38:30Z

Hmm not sure why we're getting IP rate-limited by HF. Can you make sure that the offline option is set for all models?

The idea is that load_all() is called once, just to download the models to the cache; the tests then run with offline = True to avoid sending too many requests to the HF server.

ServeurpersoCom · 2025-12-03T12:40:52Z

I also tried:

(root|~/llama.cpp.pascal) git diff
diff --git a/tools/server/tests/unit/test_router.py b/tools/server/tests/unit/test_router.py
index 87710d511..fb304386c 100644
--- a/tools/server/tests/unit/test_router.py
+++ b/tools/server/tests/unit/test_router.py
@@ -3,6 +3,11 @@ from utils import *

 server: ServerProcess

+@pytest.fixture(scope="module", autouse=True)
+def preload_models():
+    # Preload all models before running router tests
+    ServerPreset.load_all()
+
 @pytest.fixture(autouse=True)
 def create_server():
     global server
@@ -17,7 +22,6 @@ def create_server():
     ]
 )
 def test_router_chat_completion_stream(model: str, success: bool):
-    # TODO: make sure the model is in cache (ie. ServerProcess.load_all()) before starting the router server
     global server
     server.start()
     content = ""
(root|~/llama.cpp.pascal) ./test.sh

ngxson · 2025-12-03T12:54:31Z

I mean, the scope="module" fixture is supposed to live outside of the test file, as it will technically affect all test units.

The fixture is already added to conftest.py

ServeurpersoCom · 2025-12-03T13:07:58Z

OK I found the problem

Fix eviction test: load 2 models first, verify state, then load 3rd to trigger eviction. Previous logic loaded all 3 at once, causing first model to be evicted before verification could occur. Add module fixture to preload models via ServerPreset.load_all() and mark test presets as offline to use cached models
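To make the ordering issue concrete, here is a toy model of LRU eviction with a cap of two loaded models. `ToyRouter` and the model names are hypothetical stand-ins, and the sketch assumes a plain least-recently-used policy; the actual router's behavior is only what the test in this PR checks.

```python
from collections import OrderedDict
from typing import List


class ToyRouter:
    """Toy stand-in for a router that keeps at most `models_max` models loaded."""

    def __init__(self, models_max: int) -> None:
        self.models_max = models_max
        self._loaded: "OrderedDict[str, None]" = OrderedDict()

    def load(self, name: str) -> None:
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return
        if len(self._loaded) >= self.models_max:
            self._loaded.popitem(last=False)  # evict the least recently used
        self._loaded[name] = None

    def loaded(self) -> List[str]:
        return list(self._loaded)


router = ToyRouter(models_max=2)
router.load("model-a")
router.load("model-b")
# Verify the pre-eviction state while both models are still resident.
assert router.loaded() == ["model-a", "model-b"]
router.load("model-c")
# model-a was the least recently used, so it is the one evicted.
assert router.loaded() == ["model-b", "model-c"]
```

If all three loads were issued before the first check, "model-a" would already be gone by the time the test looked for it, which matches the failure mode described in the commit message.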

ServeurpersoCom · 2025-12-03T13:15:34Z

(root|~/llama.cpp.pascal) ./test.sh
===================================================================== test session starts ======================================================================
platform linux -- Python 3.11.2, pytest-8.3.5, pluggy-1.6.0 -- /root/llama.cpp.pascal/tools/server/tests/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /root/llama.cpp.pascal/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.12.0
collected 6 items

unit/test_router.py::test_router_chat_completion_stream[ggml-org/tinygemma3-GGUF:Q8_0-True] PASSED                                                       [ 16%]
unit/test_router.py::test_router_chat_completion_stream[non-existent/model-False] PASSED                                                                 [ 33%]
unit/test_router.py::test_router_unload_model PASSED                                                                                                     [ 50%]
unit/test_router.py::test_router_models_max_evicts_lru PASSED                                                                                            [ 66%]
unit/test_router.py::test_router_no_models_autoload PASSED                                                                                               [ 83%]
unit/test_router.py::test_router_api_key_required PASSED                                                                                                 [100%]

====================================================================== 6 passed in 15.92s ======================================================================

ngxson · 2025-12-03T13:18:18Z

Can be merged once the CI ngxson#48 passes

ngxson · 2025-12-03T13:29:34Z

~~Hmm, seems like on windows, the fixture isn't called: https://github.com/ngxson/llama.cpp/actions/runs/19895083434/job/57023680373?pr=48~~

Maybe there is a problem with the multi-shard model?

ngxson · 2025-12-03T13:42:49Z

If the problem still persists with the multi-shard model, we can remove the offline option for it, because it is only used once anyway.

We can add a TODO and fix the problem later (not very important)

Edit: ah, never mind, your fix seems to be OK!

* llama-server: add router multi-model tests (ggml-org#17704)

Add 4 test cases for model router:
- test_router_unload_model: explicit model unloading
- test_router_models_max_evicts_lru: LRU eviction with --models-max
- test_router_no_models_autoload: --no-models-autoload flag behavior
- test_router_api_key_required: API key authentication

Tests use async model loading with polling and graceful skip when insufficient models available for eviction testing.

utils.py changes:
- Add models_max, models_dir, no_models_autoload attributes to ServerProcess
- Handle JSONDecodeError for non-JSON error responses (fallback to text)

* llama-server: update test models to new HF repos

* add offline

* llama-server: fix router LRU eviction test and add preloading

Fix eviction test: load 2 models first, verify state, then load 3rd to trigger eviction. Previous logic loaded all 3 at once, causing first model to be evicted before verification could occur.

Add module fixture to preload models via ServerPreset.load_all() and mark test presets as offline to use cached models

* llama-server: fix split model download on Windows

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>


ServeurpersoCom self-assigned this Dec 3, 2025

ServeurpersoCom requested a review from ngxson December 3, 2025 09:39

ngxson reviewed Dec 3, 2025

Comment thread tools/server/tests/unit/test_router.py

llama-server: update test models to new HF repos

3905449

ServeurpersoCom force-pushed the test/router-load-unload branch from 1eabe34 to 3905449 Compare December 3, 2025 10:17

ngxson approved these changes Dec 3, 2025

loci-dev mentioned this pull request Dec 3, 2025

UPSTREAM PR #17722: llama-server: add router multi-model tests (#17704) auroralabs-loci/llama.cpp#415

Open

github-actions bot added the examples, python, and server labels Dec 3, 2025

ngxson reviewed Dec 3, 2025

Comment thread tools/server/tests/utils.py

add offline

ec7dc2e

llama-server: fix split model download on Windows

583463b

ngxson merged commit e7c2cf1 into ggml-org:master Dec 3, 2025
4 of 9 checks passed
