server: add router device memory margin parameter for dynamic unloading #21231
Conversation
| } | ||
| } | ||
| static uint64_t get_model_memory_mb(const common_preset& preset) { |
IIRC this is mostly the same logic as -fit, right? If so, is it possible to merge them into a new function called common_fit_get_model_memory()?
Btw, I think it can be useful to return more details about mem usage, for example: mem usage by context and by weight. In the future, we can also return usage per backend.
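For illustration, one possible shape for such a helper (the struct and function below are hypothetical sketches based on this suggestion, not an existing common API):

```cpp
// Hypothetical interface sketch based on the suggestion above; neither the struct
// nor the function exists in common today, and the real names may differ.
#include <cstdint>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

struct common_preset; // existing llama-server/common type, declared elsewhere

struct common_fit_model_memory {
    uint64_t weights = 0; // bytes taken by model weights on this device
    uint64_t context = 0; // bytes taken by the KV cache / context
    uint64_t compute = 0; // bytes taken by compute buffers

    uint64_t total() const { return weights + context + compute; }
};

// Shared between -fit and the router: projected memory use per backend device.
std::map<ggml_backend_dev_t, common_fit_model_memory>
common_fit_get_model_memory(const common_preset & preset);
```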
| add_opt(common_arg( | ||
| {"--models-memory-max"}, "N", | ||
| string_format("for router server, maximum memory usage in MB (default: %d, 0 = unlimited)", params.models_memory_max), | ||
| [](common_params & params, int value) { | ||
| params.models_memory_max = value; | ||
| } | ||
| ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MEMORY_MAX")); |
UX-wise, I would prefer specifying the reverse, like what --fit-target does. Instead of specifying the maximum used memory, it's more intuitive to specify a margin. But I'm not quite sure how hard it is to implement such logic here.
Good idea, this router feature can likely share both the memory calculation and the defaults with the autofit code.
@0cc4m I don't understand the description and the use case that you have. Can you provide an example with some sample model sizes?
Sure, I have a DGX Spark (~120GB shared memory) running llama-server in router mode. It serves a Qwen3 0.8B Embedding model (~6GB), a Qwen3.5 35B Q6_K (~37GB), and a Qwen3.5 122B Q4_K_M (~81GB). By default, llama-server will load up to 4 models before starting to unload.

A basic agent requires the Embedding model plus the 35B; their combined ~43GB can be loaded simultaneously without problems. For a coding agent I use the 122B. Once I load that, the server currently tries to add 81GB to the existing 43GB; the combined requirement of 124GB exceeds the limit, OOMs the server, and freezes the system until the process is killed. I can prevent that by only allowing a single model to be loaded at a time, but then simultaneous use of the 35B and the Embedding model constantly loads and unloads them while they are in use.

This PR adds a memory limit to llama-server, which allows it to recognize when loading a further model would overflow any device's memory and to start unloading models beforehand to try to make enough space. That way I can simultaneously use multiple small models, but also a large model that needs all the memory to work. Does that help?
Here's an example debug log of the current state when I load the Qwen3 Embedding model, the 35B, the 122B, and gpt-oss 120B: log
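To make the arithmetic above concrete, a small standalone example (the model sizes are the rough numbers quoted in this thread; the margin value is made up):

```cpp
// Self-contained illustration of the margin check described above.
#include <cstdio>

int main() {
    const double total_gb    = 120.0;        // DGX Spark shared memory
    const double margin_gb   = 4.0;          // hypothetical --models-memory-margin, expressed in GB here
    const double resident_gb = 6.0 + 37.0;   // Embedding model + 35B already loaded
    const double incoming_gb = 81.0;         // 122B model about to be loaded

    const double available_gb = total_gb - margin_gb;
    if (resident_gb + incoming_gb > available_gb) {
        // this is the point where the router would unload least-recently-used models
        std::printf("need %.0f GB but only %.0f GB fit within the margin -> unload LRU models first\n",
                    resident_gb + incoming_gb, available_gb);
    }
    return 0;
}
```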
| // Returns the projected memory use (model + context + compute) in bytes | ||
| // for the given device within this context. Returns 0 if the device is not used. | ||
| LLAMA_API uint64_t llama_context_device_memory( | ||
| const struct llama_context * ctx, | ||
| ggml_backend_dev_t device); | ||
Most likely the device-querying API should be more generic, so this signature is likely to become obsolete. Let's move it to llama-ext.h for now.
Force-pushed from 4312ed2 to 1d4a5f9.
| void add_model(server_model_meta && meta); | ||
| // not thread-safe, caller must hold mutex | ||
| uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const; |
Suggested change:
| uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const; | |
| uint64_t get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const; |
| // unload least recently used models if the limit is reached | ||
| void unload_lru(); | ||
| void unload_lru(const model_memory_map& new_model_memory_per_device); |
Suggested change:
| void unload_lru(const model_memory_map& new_model_memory_per_device); | |
| void unload_lru(const model_memory_map & new_model_memory_per_device); |
| ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MAX")); | ||
| add_opt(common_arg( | ||
| {"--models-memory-margin"}, "N", | ||
| string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), |
Suggested change:
| string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), | |
| string_format("for router server, MiB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), |
| if (total > 0) { | ||
| const uint64_t available = (free > memory_margin) ? free - memory_margin : 0; | ||
| memory_per_device[dev] = available; | ||
| SRV_DBG("device %s: available memory after margin=%lu MB\n", |
Suggested change:
| SRV_DBG("device %s: available memory after margin=%lu MB\n", | |
| SRV_DBG("device %s: available memory after margin=%lu MiB\n", |
| model_memory_map total_memory_per_device; | ||
| for (const auto & m : mapping) { | ||
| if (m.second.meta.is_running()) { | ||
| for (const auto& [key, value] : m.second.meta.memory_per_device) { |
Suggested change:
| for (const auto& [key, value] : m.second.meta.memory_per_device) { | |
| for (const auto & [key, value] : m.second.meta.memory_per_device) { |
| uint64_t memory_exceeded = 0; | ||
| for (const auto& [key, limit] : memory_per_device) { |
Suggested change:
| for (const auto& [key, limit] : memory_per_device) { | |
| for (const auto & [key, limit] : memory_per_device) { |
| void server_models::unload_lru() { | ||
| if (base_params.models_max <= 0) { | ||
| uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const { |
Suggested change:
| uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const { | |
| uint64_t server_models::get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const { |
| common_preset base_preset; // base preset from llama-server CLI args | ||
| // available memory per device | ||
| std::map<ggml_backend_dev_t, uint64_t> memory_per_device; |
Suggested change:
| std::map<ggml_backend_dev_t, uint64_t> memory_per_device; | |
| model_memory_map memory_per_device; |
| return it != m.end() ? it->second : 0; | ||
| }; | ||
| uint64_t memory_exceeded = 0; |
Let's keep all memory sizes in size_t type for consistency.
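Taken together with the model_memory_map suggestion above, the alias could then look like this (a sketch only; the PR's actual definition may differ):

```cpp
// One possible definition after applying both comments: keyed by backend device, sized in size_t.
#include <cstddef>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

using model_memory_map = std::map<ggml_backend_dev_t, size_t>; // memory per backend device
```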
| int port = 0; | ||
| server_model_status status = SERVER_MODEL_STATUS_UNLOADED; | ||
| int64_t last_used = 0; // for LRU unloading | ||
| model_memory_map memory_per_device; // projected bytes per device |
I am still not sure why we have to keep a memory map both in server_model_meta and server_models. Isn't one map in struct server_models going to be enough?
The idea here was to create a map of how much memory is available within the margin for each device when the server is started. That's what is stored in the server_models struct. It should at least get a clearer name.
server_model_meta then stores the requirement per device for that model. That way it can be summed up and compared to the values in server_models.
I see. Yes, better names are needed.
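For illustration, a sketch of how the two maps could be compared once they get clearer names (all identifiers below are illustrative, not the PR's actual code):

```cpp
// Sketch of the exceed check discussed above, with descriptive names.
#include <cstddef>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

using device_memory_map = std::map<ggml_backend_dev_t, size_t>;

// Returns by how much the per-device budget would be exceeded if `incoming`
// were loaded on top of the models that are already running.
static size_t memory_exceeded_if_loaded(
        const device_memory_map & available_after_margin,     // per device, computed at server start
        const device_memory_map & required_by_running_models, // summed over currently loaded models
        const device_memory_map & incoming) {                 // projection for the model to load
    size_t exceeded = 0;
    for (const auto & [dev, available] : available_after_margin) {
        size_t required = 0;
        if (auto it = required_by_running_models.find(dev); it != required_by_running_models.end()) {
            required += it->second;
        }
        if (auto it = incoming.find(dev); it != incoming.end()) {
            required += it->second;
        }
        if (required > available) {
            exceeded += required - available;
        }
    }
    return exceeded; // > 0 means LRU unloading should run before loading the new model
}
```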
I've addressed the feedback.

I'll take a look next week.
Force-pushed from 0124ec9 to 3c53be1.
| if (params.model.path.empty()) { | ||
| return {}; | ||
| } |
@0cc4m This check prevents using the functionality for models downloaded with -hf because they don't have a path. Should be fixed before merging.
You're right, I hadn't considered that. Currently it will let it through, but estimate it as having no size, which is not ideal. I could fill the map on second load, but that risks an OOM when it's loading for the first time. I could try to estimate with file size before downloading, but that's unreliable and hard to map to devices. Or I could trigger the download before actually loading with common_download_model, estimate and then load. Not sure if that would cause issues in other places, I'd have to try it. What do you think?
Hm not sure. Seems complicated.
Maybe the right way is to have a --download-only CLI argument. If set, the llama-server (and any other tool) would just download the model and exit.
This way, we can split the model loading into stages:
- llama-server ... --download-only
- Perform memory size checks
- llama-server ... --offline
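As a rough sketch of what the first stage could look like in the server entry point (the download_only field and the helper below are assumed names for illustration, not code from this PR):

```cpp
// Minimal sketch only: `download_only` and `try_download_model` are assumed names.
struct cli_params {
    bool download_only = false;
    // ... the existing llama-server parameters
};

// Placeholder: in the real server this would call the common download/resolve code
// that -hf already goes through.
static bool try_download_model(const cli_params & /*params*/) {
    return true;
}

// Returns the process exit code when --download-only is handled, or -1 to continue startup.
static int handle_download_only(const cli_params & params) {
    if (!params.download_only) {
        return -1; // flag not set: continue with the normal, size-checked load
    }
    // stage 1: fetch the model into the local cache, then exit; a later invocation
    // can run the memory checks against the cached file and load with --offline
    return try_download_model(params) ? 0 : 1;
}
```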
I implemented this, please take a look when you find some time.
Force-pushed from 61c2568 to cf0ebc4.
…ing models when they exceed a memory size threshold
Force-pushed from cf0ebc4 to da1f168.
This is an interesting case; I use both small and large models as well, on different systems. If I set my max models to something like 4, load a relatively big model, and then three smaller ones, the larger model gets partially evicted from VRAM and becomes very slow. If I then unload all but the large model, I'm left with a VRAM consumption of only a few GB; the rest of the layers have been moved to system memory and they don't come back automatically. At least this way the model gets completely unloaded.
@ggerganov This is still waiting for your review.
| // Returns the projected memory use (model + context + compute) in bytes | ||
| // for the given device within this context. Returns 0 if the device is not used. | ||
| LLAMA_API uint64_t llama_context_device_memory( | ||
| const struct llama_context * ctx, | ||
| ggml_backend_dev_t device); |
Instead of adding this new function, can you reuse the llama_get_memory_breakdown that we added recently (#22171)?
| device_memory_map mem; | ||
| if (base_params.models_memory_margin > 0) { | ||
| std::lock_guard<std::mutex> lk(mutex); | ||
| auto & meta = mapping[name].meta; | ||
| meta.dmm_req = get_model_memory_per_device(meta.preset); | ||
| if (meta.dmm_req.empty()) { | ||
| SRV_WRN("failed to estimate memory for model %s, memory limits will not apply\n", name.c_str()); | ||
| } | ||
| mem = meta.dmm_req; | ||
| } |
Should this chunk of code be moved to the start of _load() so you can deduplicate it here?
| SRV_ERR("failed to load model %s after download: %s\n", name.c_str(), e.what()); | ||
| update_status(name, SERVER_MODEL_STATUS_UNLOADED, 1); | ||
| } | ||
| }).detach(); |
In general, I consider detaching threads an anti-pattern. Add a TODO to keep track of the threads in server_models and join them periodically (maybe on each load() and/or update_status()).
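One possible shape for that TODO, sketched here with std::async futures instead of raw detached threads so finished loads can be reaped without blocking (names are illustrative only, not the PR's code):

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

struct loader_tasks {
    std::mutex mutex;
    std::vector<std::future<void>> pending; // one entry per in-flight model load

    // instead of std::thread{...}.detach(), launch the loader and keep its future
    // (fn must return void)
    template <typename Fn>
    void start(Fn && fn) {
        std::lock_guard<std::mutex> lk(mutex);
        pending.push_back(std::async(std::launch::async, std::forward<Fn>(fn)));
    }

    // call from load() / update_status(): drop tasks that have already finished,
    // without blocking on the ones that are still loading
    void reap_finished() {
        std::lock_guard<std::mutex> lk(mutex);
        for (auto it = pending.begin(); it != pending.end(); ) {
            if (it->wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                it->get(); // release the task; the loader already catches its own exceptions
                it = pending.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```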
Overview
I have a server running which serves small embedding models and large text models. Currently there is no way to let multiple small models coexist in memory while still unloading them when a large model needs the space. This PR solves that by adding a --models-memory-margin parameter which works similarly to the memory margin of the autofit feature. The router keeps track of the memory requirements per model and uses them to unload models, in the same order already defined for the maximum number of models, once the margin would be exceeded on any device. I've been running this on my server and it solves the issue I had.
Requirements