server: add router device memory margin parameter for dynamic unloading #21231
Conversation
| } | ||
| } | ||
| static uint64_t get_model_memory_mb(const common_preset& preset) { |
IIRC this is mostly the same logic as -fit, right? If so, is it possible to merge them into a new function called common_fit_get_model_memory()?
Btw, I think it can be useful to return more details about mem usage, for example: mem usage by context and by weight. In the future, we can also return usage per backend.
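For illustration, one possible shape for such a helper (the struct and function below are hypothetical sketches based on this suggestion, not an existing common API):

```cpp
// Hypothetical interface sketch based on the suggestion above; neither the struct
// nor the function exists in common today, and the real names may differ.
#include <cstdint>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

struct common_preset; // existing llama-server/common type, declared elsewhere

struct common_fit_model_memory {
    uint64_t weights = 0; // bytes taken by model weights on this device
    uint64_t context = 0; // bytes taken by the KV cache / context
    uint64_t compute = 0; // bytes taken by compute buffers

    uint64_t total() const { return weights + context + compute; }
};

// Shared between -fit and the router: projected memory use per backend device.
std::map<ggml_backend_dev_t, common_fit_model_memory>
common_fit_get_model_memory(const common_preset & preset);
```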
| add_opt(common_arg( | ||
| {"--models-memory-max"}, "N", | ||
| string_format("for router server, maximum memory usage in MB (default: %d, 0 = unlimited)", params.models_memory_max), | ||
| [](common_params & params, int value) { | ||
| params.models_memory_max = value; | ||
| } | ||
| ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MEMORY_MAX")); |
UX-wise, I would prefer specifying the reverse, like what --fit-target does. Instead of specifying the maximum used memory, it's more intuitive to specify a margin. But I'm not quite sure how hard it is to implement such logic here.
Good idea, this router feature can likely share both the memory calculation and the defaults with the autofit code.
@0cc4m I don't understand the description and the use case that you have. Can you provide an example with some sample model sizes?
Sure, I have a DGX Spark (~120GB shared memory) running llama-server in router mode. It serves a Qwen3 0.8B Embedding model (~6GB), a Qwen3.5 35B Q6_K (~37GB), and a Qwen3.5 122B Q4_K_M (~81GB). By default, llama-server will load up to 4 models before starting to unload.

A basic agent requires the Embedding model plus the 35B; their combined ~43GB can be loaded simultaneously without problems. For a coding agent I use the 122B. Once I load that, the server currently tries to add 81GB to the existing 43GB; the combined requirement of 124GB exceeds the limit, OOMs the server, and freezes the system until the process is killed. I can prevent that by only allowing a single model to be loaded at a time, but then simultaneous use of the 35B and the Embedding model constantly loads and unloads them while they are in use.

This PR adds a memory limit to llama-server, which allows it to recognize when loading a further model would overflow any device's memory and to start unloading models beforehand to try to make enough space. That way I can simultaneously use multiple small models, but also a large model that needs all the memory to work. Does that help?
Here's an example debug log of the current state when I load the Qwen3 Embedding model, the 35B, the 122B, and gpt-oss 120B: log
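To make the arithmetic above concrete, a small standalone example (the model sizes are the rough numbers quoted in this thread; the margin value is made up):

```cpp
// Self-contained illustration of the margin check described above.
#include <cstdio>

int main() {
    const double total_gb    = 120.0;        // DGX Spark shared memory
    const double margin_gb   = 4.0;          // hypothetical --models-memory-margin, expressed in GB here
    const double resident_gb = 6.0 + 37.0;   // Embedding model + 35B already loaded
    const double incoming_gb = 81.0;         // 122B model about to be loaded

    const double available_gb = total_gb - margin_gb;
    if (resident_gb + incoming_gb > available_gb) {
        // this is the point where the router would unload least-recently-used models
        std::printf("need %.0f GB but only %.0f GB fit within the margin -> unload LRU models first\n",
                    resident_gb + incoming_gb, available_gb);
    }
    return 0;
}
```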
| // Returns the projected memory use (model + context + compute) in bytes | ||
| // for the given device within this context. Returns 0 if the device is not used. | ||
| LLAMA_API uint64_t llama_context_device_memory( | ||
| const struct llama_context * ctx, | ||
| ggml_backend_dev_t device); | ||
Most likely the device-querying API should be more generic, so this signature is likely to become obsolete. Let's move it to llama-ext.h for now.
Force-pushed from 4312ed2 to 1d4a5f9.
| void add_model(server_model_meta && meta); | ||
| // not thread-safe, caller must hold mutex | ||
| uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const; |
Suggested change:
| uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const; | |
| uint64_t get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const; |
| // unload least recently used models if the limit is reached | ||
| void unload_lru(); | ||
| void unload_lru(const model_memory_map& new_model_memory_per_device); |
Suggested change:
| void unload_lru(const model_memory_map& new_model_memory_per_device); | |
| void unload_lru(const model_memory_map & new_model_memory_per_device); |
| ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MAX")); | ||
| add_opt(common_arg( | ||
| {"--models-memory-margin"}, "N", | ||
| string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), |
Suggested change:
| string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), | |
| string_format("for router server, MiB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin), |
| if (total > 0) { | ||
| const uint64_t available = (free > memory_margin) ? free - memory_margin : 0; | ||
| memory_per_device[dev] = available; | ||
| SRV_DBG("device %s: available memory after margin=%lu MB\n", |
Suggested change:
| SRV_DBG("device %s: available memory after margin=%lu MB\n", | |
| SRV_DBG("device %s: available memory after margin=%lu MiB\n", |
| model_memory_map total_memory_per_device; | ||
| for (const auto & m : mapping) { | ||
| if (m.second.meta.is_running()) { | ||
| for (const auto& [key, value] : m.second.meta.memory_per_device) { |
Suggested change:
| for (const auto& [key, value] : m.second.meta.memory_per_device) { | |
| for (const auto & [key, value] : m.second.meta.memory_per_device) { |
| uint64_t memory_exceeded = 0; | ||
| for (const auto& [key, limit] : memory_per_device) { |
Suggested change:
| for (const auto& [key, limit] : memory_per_device) { | |
| for (const auto & [key, limit] : memory_per_device) { |
| void server_models::unload_lru() { | ||
| if (base_params.models_max <= 0) { | ||
| uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const { |
Suggested change:
| uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const { | |
| uint64_t server_models::get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const { |
| common_preset base_preset; // base preset from llama-server CLI args | ||
| // available memory per device | ||
| std::map<ggml_backend_dev_t, uint64_t> memory_per_device; |
Suggested change:
| std::map<ggml_backend_dev_t, uint64_t> memory_per_device; | |
| model_memory_map memory_per_device; |
| return it != m.end() ? it->second : 0; | ||
| }; | ||
| uint64_t memory_exceeded = 0; |
Let's keep all memory sizes in size_t type for consistency.
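Taken together with the model_memory_map suggestion above, the alias could then look like this (a sketch only; the PR's actual definition may differ):

```cpp
// One possible definition after applying both comments: keyed by backend device, sized in size_t.
#include <cstddef>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

using model_memory_map = std::map<ggml_backend_dev_t, size_t>; // memory per backend device
```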
| int port = 0; | ||
| server_model_status status = SERVER_MODEL_STATUS_UNLOADED; | ||
| int64_t last_used = 0; // for LRU unloading | ||
| model_memory_map memory_per_device; // projected bytes per device |
I am still not sure why we have to keep a memory map both in server_model_meta and server_models. Isn't one map in struct server_models going to be enough?
The idea here was to create a map of how much memory is available within the margin for each device when the server is started. That's what is stored in the server_models struct. It should at least get a clearer name.
server_model_meta then stores the requirement per device for that model. That way it can be summed up and compared to the values in server_models.
I see. Yes, better names are needed.
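For illustration, a sketch of how the two maps could be compared once they get clearer names (all identifiers below are illustrative, not the PR's actual code):

```cpp
// Sketch of the exceed check discussed above, with descriptive names.
#include <cstddef>
#include <map>

#include "ggml-backend.h" // for ggml_backend_dev_t

using device_memory_map = std::map<ggml_backend_dev_t, size_t>;

// Returns by how much the per-device budget would be exceeded if `incoming`
// were loaded on top of the models that are already running.
static size_t memory_exceeded_if_loaded(
        const device_memory_map & available_after_margin,     // per device, computed at server start
        const device_memory_map & required_by_running_models, // summed over currently loaded models
        const device_memory_map & incoming) {                 // projection for the model to load
    size_t exceeded = 0;
    for (const auto & [dev, available] : available_after_margin) {
        size_t required = 0;
        if (auto it = required_by_running_models.find(dev); it != required_by_running_models.end()) {
            required += it->second;
        }
        if (auto it = incoming.find(dev); it != incoming.end()) {
            required += it->second;
        }
        if (required > available) {
            exceeded += required - available;
        }
    }
    return exceeded; // > 0 means LRU unloading should run before loading the new model
}
```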
I've addressed the feedback.

I'll take a look next week.
Force-pushed from 0124ec9 to 3c53be1.
| if (params.model.path.empty()) { | ||
| return {}; | ||
| } |
@0cc4m This check prevents using the functionality for models downloaded with -hf because they don't have a path. Should be fixed before merging.
You're right, I hadn't considered that. Currently it will let it through, but estimate it as having no size, which is not ideal. I could fill the map on second load, but that risks an OOM when it's loading for the first time. I could try to estimate with file size before downloading, but that's unreliable and hard to map to devices. Or I could trigger the download before actually loading with common_download_model, estimate and then load. Not sure if that would cause issues in other places, I'd have to try it. What do you think?
Hm not sure. Seems complicated.
Maybe the right way is to have a --download-only CLI argument. If set, the llama-server (and any other tool) would just download the model and exit.
This way, we can split the model loading into stages:
- llama-server ... --download-only
- Perform memory size checks
- llama-server ... --offline
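As a rough sketch of what the first stage could look like in the server entry point (the download_only field and the helper below are assumed names for illustration, not code from this PR):

```cpp
// Minimal sketch only: `download_only` and `try_download_model` are assumed names.
struct cli_params {
    bool download_only = false;
    // ... the existing llama-server parameters
};

// Placeholder: in the real server this would call the common download/resolve code
// that -hf already goes through.
static bool try_download_model(const cli_params & /*params*/) {
    return true;
}

// Returns the process exit code when --download-only is handled, or -1 to continue startup.
static int handle_download_only(const cli_params & params) {
    if (!params.download_only) {
        return -1; // flag not set: continue with the normal, size-checked load
    }
    // stage 1: fetch the model into the local cache, then exit; a later invocation
    // can run the memory checks against the cached file and load with --offline
    return try_download_model(params) ? 0 : 1;
}
```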
I implemented this, please take a look when you find some time.
Force-pushed from 61c2568 to cf0ebc4.
…ing models when they exceed a memory size threshold
Force-pushed from cf0ebc4 to da1f168.
This is an interesting case; I use both small and large models as well, on different systems. If I set my max models to something like 4, load a relatively big model, and then three smaller ones, the larger model gets partially evicted from VRAM and becomes very slow. If I then unload all but the large model, I'm left with a VRAM consumption of only a few GB; the rest of the layers have been moved to system memory and they don't come back automatically. At least this way the model gets completely unloaded.
@ggerganov This is still waiting for your review.
| // Returns the projected memory use (model + context + compute) in bytes | ||
| // for the given device within this context. Returns 0 if the device is not used. | ||
| LLAMA_API uint64_t llama_context_device_memory( | ||
| const struct llama_context * ctx, | ||
| ggml_backend_dev_t device); |
Instead of adding this new function, can you reuse the llama_get_memory_breakdown that we added recently (#22171)?
| device_memory_map mem; | ||
| if (base_params.models_memory_margin > 0) { | ||
| std::lock_guard<std::mutex> lk(mutex); | ||
| auto & meta = mapping[name].meta; | ||
| meta.dmm_req = get_model_memory_per_device(meta.preset); | ||
| if (meta.dmm_req.empty()) { | ||
| SRV_WRN("failed to estimate memory for model %s, memory limits will not apply\n", name.c_str()); | ||
| } | ||
| mem = meta.dmm_req; | ||
| } |
Should this chunk of code be moved to the start of _load() so you can deduplicate it here?
| SRV_ERR("failed to load model %s after download: %s\n", name.c_str(), e.what()); | ||
| update_status(name, SERVER_MODEL_STATUS_UNLOADED, 1); | ||
| } | ||
| }).detach(); |
In general, I consider detaching threads an anti-pattern. Add a TODO to keep track of the threads in server_models and join them periodically (maybe on each load() and/or update_status()).
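One possible shape for that TODO, sketched here with std::async futures instead of raw detached threads so finished loads can be reaped without blocking (names are illustrative only, not the PR's code):

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

struct loader_tasks {
    std::mutex mutex;
    std::vector<std::future<void>> pending; // one entry per in-flight model load

    // instead of std::thread{...}.detach(), launch the loader and keep its future
    // (fn must return void)
    template <typename Fn>
    void start(Fn && fn) {
        std::lock_guard<std::mutex> lk(mutex);
        pending.push_back(std::async(std::launch::async, std::forward<Fn>(fn)));
    }

    // call from load() / update_status(): drop tasks that have already finished,
    // without blocking on the ones that are still loading
    void reap_finished() {
        std::lock_guard<std::mutex> lk(mutex);
        for (auto it = pending.begin(); it != pending.end(); ) {
            if (it->wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                it->get(); // release the task; the loader already catches its own exceptions
                it = pending.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```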
Overview
I have a server running which serves small embedding models and large text models. Currently there is no way to let multiple small models coexist in memory while still unloading them when a large model needs the space. This PR solves that by adding a --models-memory-margin parameter which works similarly to the memory margin of the autofit feature. The router keeps track of the memory requirements per model and uses them to unload models, in the same order already defined for the maximum number of models, once the margin would be exceeded on any device. I've been running this on my server and it solves the issue I had.
Requirements