
server: add router device memory margin parameter for dynamic unloading #21231

Open

0cc4m wants to merge 14 commits into master from 0cc4m/server-memory-limit

Conversation

@0cc4m
Contributor

@0cc4m 0cc4m commented Mar 31, 2026

Overview

I have a server running which serves small embedding models and large text models. Currently there is no way to let multiple small models coexist in memory while still unloading them when a large model needs the space. This PR solves that by adding a --models-memory-margin parameter which works similarly to the memory margin of the autofit feature. The router keeps track of the memory requirement per model and, once that margin would be exceeded on any device, unloads models in the same (LRU) order already defined for the max number of models.

I've been running this on my server and it solves the issue I had.
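
Roughly, the check the router performs before spawning a new instance looks like this. This is only a simplified, self-contained sketch of the idea; the names below (loaded_model, memory_exceeded, unload_until_fits) are illustrative and not the actual symbols in this PR:

// Simplified sketch of the margin check and LRU eviction described above.
// Not the PR's actual code; names and types are illustrative.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using device_id  = int;                            // stand-in for ggml_backend_dev_t
using memory_map = std::map<device_id, uint64_t>;  // bytes per device

struct loaded_model {
    std::string name;
    int64_t     last_used; // for LRU ordering
    memory_map  required;  // projected bytes per device for this model
};

// How many bytes the new model would exceed the per-device budgets by, summed
// over all devices (0 means it fits). The budget is free memory minus the margin.
static uint64_t memory_exceeded(const memory_map & budget,
                                const std::vector<loaded_model> & loaded,
                                const memory_map & new_model) {
    uint64_t exceeded = 0;
    for (const auto & [dev, avail] : budget) {
        uint64_t used = 0;
        for (const auto & m : loaded) {
            auto it = m.required.find(dev);
            used += it != m.required.end() ? it->second : 0;
        }
        auto it = new_model.find(dev);
        used += it != new_model.end() ? it->second : 0;
        if (used > avail) {
            exceeded += used - avail;
        }
    }
    return exceeded;
}

// Unload least-recently-used models until the new model fits (or nothing is left).
static void unload_until_fits(const memory_map & budget, std::vector<loaded_model> & loaded, const memory_map & new_model) {
    while (memory_exceeded(budget, loaded, new_model) > 0 && !loaded.empty()) {
        auto lru = std::min_element(loaded.begin(), loaded.end(),
            [](const loaded_model & a, const loaded_model & b) { return a.last_used < b.last_used; });
        // in the real server this stops the corresponding child instance
        loaded.erase(lru);
    }
}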

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Qwen3.5 122B and Claude were used for assistance.

@0cc4m 0cc4m requested review from a team as code owners March 31, 2026 14:40
Comment thread tools/server/server-models.cpp Outdated
}
}

static uint64_t get_model_memory_mb(const common_preset& preset) {
Contributor

IIRC this is mostly the same logic as -fit, right? If so, is it possible to merge them into a new function called common_fit_get_model_memory()?

Btw, I think it can be useful to return more details about mem usage, for example: mem usage by context and by weight. In the future, we can also return usage per backend.
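
One possible shape for such a result, purely as a sketch of this suggestion (this struct does not exist in llama.cpp):

// Hypothetical return type splitting the estimate by weights and context,
// with room for a per-device breakdown later. Sketch only, not an existing API.
struct common_fit_model_memory {
    uint64_t total   = 0; // bytes, everything combined
    uint64_t weights = 0; // model weights
    uint64_t context = 0; // KV cache / context buffers
    // possible extension: std::map<ggml_backend_dev_t, uint64_t> per_device;
};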

Comment thread common/arg.cpp Outdated
Comment on lines +3045 to +3051
add_opt(common_arg(
    {"--models-memory-max"}, "N",
    string_format("for router server, maximum memory usage in MB (default: %d, 0 = unlimited)", params.models_memory_max),
    [](common_params & params, int value) {
        params.models_memory_max = value;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MEMORY_MAX"));
Contributor

UX-wise, I would prefer specifying the reverse, like what --fit-target does. Instead of specifying the maximum memory used, it's more intuitive to specify a margin. But I'm not quite sure how hard it would be to implement such logic here.

Contributor Author

Good idea, this router feature can likely share both the memory calculation and the defaults with the autofit code.

@ggerganov
Member

@0cc4m I don't understand the description and the use case that you have. Can you provide an example with some sample model sizes?

@0cc4m
Contributor Author

0cc4m commented Apr 2, 2026

Sure, I have a DGX Spark (~120GB shared memory) running llama-server in router mode. It serves a Qwen3 0.6B Embedding model (~6GB), a Qwen3.5 35B Q6_K (~37GB) and a Qwen3.5 122B Q4_K_M (~81GB). By default, llama-server will load up to 4 models before starting to unload.

A basic agent requires the Embedding model plus the 35B; their combined ~43GB can be loaded simultaneously without a problem. For a coding agent I use the 122B. Once I load that, it currently tries to add 81GB to the existing 43GB; the combined requirement of 124GB exceeds the limit, OOMs the server and freezes the system until the process is killed. I can prevent that by only allowing a single model to be loaded at a time, but then simultaneous use of the 35B and the Embedding model will constantly load and unload them while they are being used.

This PR adds a memory limit to llama-server, which allows it to recognize when loading a further model would overflow any device's memory and to start unloading models beforehand to try to make enough space. That way I can use multiple small models simultaneously, but also a large model that needs all of the memory to work.

Does that help?

@0cc4m 0cc4m changed the title from "server: add router max memory parameter for dynamic unloading" to "server: add router device memory margin parameter for dynamic unloading" on Apr 2, 2026
@0cc4m
Contributor Author

0cc4m commented Apr 2, 2026

Here's an example debug log of the current state when I load the Qwen3 Embedding model, the 35B, the 122B and gpt-oss 120B:

log
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122502 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122502 MiB
load_backend: loaded CUDA backend from /home/user/git/llama.cpp/build/bin/libggml-cuda.so
TU: error: ../src/freedreno/vulkan/tu_knl.cc:387: failed to open device /dev/dri/renderD128 (VK_ERROR_INCOMPATIBLE_DRIVER)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GB10 (NVIDIA) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from /home/user/git/llama.cpp/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/user/git/llama.cpp/build/bin/libggml-cpu.so
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8583 (011e44842) with GNU 13.3.0 for Linux aarch64
system info: n_threads = 20, n_threads_batch = 20, total_threads = 20

system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 19 threads for HTTP server
load_backend: loaded CUDA backend from /home/user/git/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded Vulkan backend from /home/user/git/llama.cpp/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/user/git/llama.cpp/build/bin/libggml-cpu.so
load_backend: loaded CUDA backend from /home/user/git/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded Vulkan backend from /home/user/git/llama.cpp/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/user/git/llama.cpp/build/bin/libggml-cpu.so
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 120607032
srv  server_model: device CUDA0: available memory after margin=116756 MB
srv  server_model: device Vulkan0: available memory after margin=90851 MB
srv  server_model: device CPU: available memory after margin=121478 MB
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 120607032
srv  server_model: device CUDA0: available memory after margin=116756 MB
srv  server_model: device Vulkan0: available memory after margin=90851 MB
srv  server_model: device CPU: available memory after margin=121478 MB
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 120607032
srv  server_model: device CUDA0: available memory after margin=116756 MB
srv  server_model: device Vulkan0: available memory after margin=90851 MB
srv  server_model: device CPU: available memory after margin=121478 MB
[...]
srv  get_memory_e: device CUDA0: total=0 MB, new=6737 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=568 MB, limit=121478 MB
srv  get_memory_e: device CUDA0: total=0 MB, new=6737 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=568 MB, limit=121478 MB
srv          load: spawning server instance with name=Qwen3-Embedding-0.6B on port 59659
[...]
srv  get_memory_e: device CUDA0: total=6737 MB, new=38712 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=568 MB, new=2477 MB, limit=121478 MB
srv  get_memory_e: device CUDA0: total=6737 MB, new=38712 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=568 MB, new=2477 MB, limit=121478 MB
srv          load: spawning server instance with name=Qwen3.5-35B-A3B on port 40255
[...]
srv  get_memory_e: device CUDA0: total=45450 MB, new=82819 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=3046 MB, new=2505 MB, limit=121478 MB
srv    unload_lru: limits reached (count=2, memory margin exceeded on 1 device(s)), removing LRU name=Qwen3-Embedding-0.6B
srv        unload: stopping model instance name=Qwen3-Embedding-0.6B
srv    operator(): stopping model instance name=Qwen3-Embedding-0.6B
[59659] srv    operator(): exit command received, exiting...
[59659] que    start_loop: processing new tasks
[59659] que    start_loop: terminate
[59659] srv    operator(): operator(): cleaning up before exit...
[59659] ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 71596380
[59659] llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free    self   model   context   compute    unaccounted |
[59659] llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122502 = 69918 + (6169 =  1136 +    3584 +    1448) +       46415 |
[59659] llama_memory_breakdown_print: |   - Host               |                    568 =   296 +       0 +     272                |
[59659] ~llama_context:      CUDA0 compute buffer size is 1448.9611 MiB, matches expectation of 1448.9611 MiB
[59659] ~llama_context:  CUDA_Host compute buffer size is 272.0547 MiB, matches expectation of 272.0547 MiB
srv    operator(): instance name=Qwen3-Embedding-0.6B exited with status 0
srv  get_memory_e: device CUDA0: total=38712 MB, new=82819 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=2477 MB, new=2505 MB, limit=121478 MB
srv    unload_lru: limits reached (count=1, memory margin exceeded on 1 device(s)), removing LRU name=Qwen3.5-35B-A3B
srv        unload: stopping model instance name=Qwen3.5-35B-A3B
srv    operator(): stopping model instance name=Qwen3.5-35B-A3B
[40255] srv    operator(): exit command received, exiting...
[40255] que    start_loop: processing new tasks
[40255] que    start_loop: terminate
[40255] srv    operator(): operator(): cleaning up before exit...
[40255] ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 79150948
[40255] llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free     self   model   context   compute    unaccounted |
[40255] llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122502 = 77295 + (36820 = 28233 +    5371 +    3216) +        8386 |
[40255] llama_memory_breakdown_print: |   - Host               |                    2477 =   397 +       0 +    2080                |
[40255] ~llama_context:      CUDA0 compute buffer size is 3216.0704 MiB, matches expectation of 3216.0704 MiB
[40255] ~llama_context:  CUDA_Host compute buffer size is 2080.0782 MiB, matches expectation of 2080.0782 MiB
srv    operator(): instance name=Qwen3.5-35B-A3B exited with status 0
srv  get_memory_e: device CUDA0: total=0 MB, new=82819 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=2505 MB, limit=121478 MB
srv  get_memory_e: device CUDA0: total=0 MB, new=82819 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=2505 MB, limit=121478 MB
srv          load: spawning server instance with name=Qwen3.5-122B-A10B on port 56689
[...]
srv  get_memory_e: device CUDA0: total=82819 MB, new=67886 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=2505 MB, new=1673 MB, limit=121478 MB
srv    unload_lru: limits reached (count=1, memory margin exceeded on 1 device(s)), removing LRU name=Qwen3.5-122B-A10B
srv        unload: stopping model instance name=Qwen3.5-122B-A10B
srv    operator(): stopping model instance name=Qwen3.5-122B-A10B
[56689] srv    operator(): exit command received, exiting...
[56689] que    start_loop: processing new tasks
[56689] que    start_loop: terminate
[56689] srv    operator(): operator(): cleaning up before exit...
[56689] ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 33223964
[56689] llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free     self   model   context   compute    unaccounted |
[56689] llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122502 = 32445 + (81170 = 71093 +    6740 +    3336) +        8887 |
[56689] llama_memory_breakdown_print: |   - Host               |                    2505 =   409 +       0 +    2096                |
[56689] ~llama_context:      CUDA0 compute buffer size is 3336.0704 MiB, matches expectation of 3336.0704 MiB
[56689] ~llama_context:  CUDA_Host compute buffer size is 2096.0782 MiB, matches expectation of 2096.0782 MiB
srv    operator(): instance name=Qwen3.5-122B-A10B exited with status 0
srv  get_memory_e: device CUDA0: total=0 MB, new=67886 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=1673 MB, limit=121478 MB
srv  get_memory_e: device CUDA0: total=0 MB, new=67886 MB, limit=116756 MB
srv  get_memory_e: device Vulkan0: total=0 MB, new=0 MB, limit=90851 MB
srv  get_memory_e: device CPU: total=0 MB, new=1673 MB, limit=121478 MB
srv          load: spawning server instance with name=gpt-oss-120B on port 33475

Comment thread include/llama.h Outdated
Comment on lines +1539 to +1544
// Returns the projected memory use (model + context + compute) in bytes
// for the given device within this context. Returns 0 if the device is not used.
LLAMA_API uint64_t llama_context_device_memory(
    const struct llama_context * ctx,
          ggml_backend_dev_t device);

Member

Most likely the device querying API should be more generic, so this signature is likely to be obsoleted. Let's move to llama-ext.h for now.
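
For reference, a minimal usage sketch of the call as declared above, assuming a valid llama_context * ctx and the ggml device-registry accessors (ggml_backend_dev_count/_get/_name); it just reports the projection per device:

// Usage sketch for the proposed llama_context_device_memory(); assumes a valid
// llama_context * ctx and the ggml backend device registry.
for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
    ggml_backend_dev_t dev = ggml_backend_dev_get(i);
    const uint64_t bytes = llama_context_device_memory(ctx, dev); // 0 if the device is unused
    printf("%s: %llu MiB\n", ggml_backend_dev_name(dev), (unsigned long long) (bytes / (1024ull * 1024ull)));
}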

@0cc4m 0cc4m force-pushed the 0cc4m/server-memory-limit branch from 4312ed2 to 1d4a5f9 on April 3, 2026 08:14
Comment thread tools/server/server-models.h Outdated
void add_model(server_model_meta && meta);

// not thread-safe, caller must hold mutex
uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const;
Member

Suggested change
uint64_t get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const;
uint64_t get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const;

Comment thread tools/server/server-models.h Outdated

// unload least recently used models if the limit is reached
void unload_lru();
void unload_lru(const model_memory_map& new_model_memory_per_device);
Member

Suggested change
void unload_lru(const model_memory_map& new_model_memory_per_device);
void unload_lru(const model_memory_map & new_model_memory_per_device);

Comment thread common/arg.cpp Outdated
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MAX"));
add_opt(common_arg(
{"--models-memory-margin"}, "N",
string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin),
Member

Suggested change
string_format("for router server, MB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin),
string_format("for router server, MiB of memory to leave free, per device (default: %d, 0 = unlimited)", params.models_memory_margin),

Comment thread tools/server/server-models.cpp Outdated
if (total > 0) {
    const uint64_t available = (free > memory_margin) ? free - memory_margin : 0;
    memory_per_device[dev] = available;
    SRV_DBG("device %s: available memory after margin=%lu MB\n",
Member

Suggested change
SRV_DBG("device %s: available memory after margin=%lu MB\n",
SRV_DBG("device %s: available memory after margin=%lu MiB\n",

Comment thread tools/server/server-models.cpp Outdated
model_memory_map total_memory_per_device;
for (const auto & m : mapping) {
    if (m.second.meta.is_running()) {
        for (const auto& [key, value] : m.second.meta.memory_per_device) {
Member

Suggested change
for (const auto& [key, value] : m.second.meta.memory_per_device) {
for (const auto & [key, value] : m.second.meta.memory_per_device) {

Comment thread tools/server/server-models.cpp Outdated

uint64_t memory_exceeded = 0;

for (const auto& [key, limit] : memory_per_device) {
Member

Suggested change
for (const auto& [key, limit] : memory_per_device) {
for (const auto & [key, limit] : memory_per_device) {

Comment thread tools/server/server-models.cpp Outdated

void server_models::unload_lru() {
if (base_params.models_max <= 0) {
uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const {
Member

Suggested change
uint64_t server_models::get_memory_exceeded(const model_memory_map& new_model_memory_per_device) const {
uint64_t server_models::get_memory_exceeded(const model_memory_map & new_model_memory_per_device) const {

Comment thread tools/server/server-models.h Outdated
common_preset base_preset; // base preset from llama-server CLI args

// available memory per device
std::map<ggml_backend_dev_t, uint64_t> memory_per_device;
Member

Suggested change
std::map<ggml_backend_dev_t, uint64_t> memory_per_device;
model_memory_map memory_per_device;

Comment thread tools/server/server-models.cpp Outdated
return it != m.end() ? it->second : 0;
};

uint64_t memory_exceeded = 0;
Member

Let's keep all memory sizes in size_t type for consistency.

Comment thread tools/server/server-models.h Outdated
int port = 0;
server_model_status status = SERVER_MODEL_STATUS_UNLOADED;
int64_t last_used = 0; // for LRU unloading
model_memory_map memory_per_device; // projected bytes per device
Member

I am still not sure why we have to keep a memory map both in server_model_meta and server_models. Isn't one map in struct server_models going to be enough?

Contributor Author

The idea here was to create a map of how much memory is available within the margin for each device when the server is started. That's what is stored in the server_models struct. It should at least get a clearer name.

server_model_meta then stores the requirement per device for that model. That way it can be summed up and compared to the values in server_models.
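
To illustrate the relationship between the two maps, a naming sketch (hypothetical names, not the PR's code):

// Naming sketch only: one map holds the per-device budget measured at startup,
// the other holds the per-model, per-device requirement estimated before loading.
struct server_model_meta_sketch {
    model_memory_map required_memory_per_device;  // projected bytes this model needs on each device
};

struct server_models_sketch {
    model_memory_map available_memory_per_device; // free memory minus margin, per device, at startup
    // a new model fits if, per device, the sum of required_memory_per_device over
    // all running models plus the new model stays within available_memory_per_device
};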

Member

I see. Yes, better names are needed.

@0cc4m
Contributor Author

0cc4m commented Apr 7, 2026

I've addressed the feedback.

@0cc4m 0cc4m requested review from ggerganov and ngxson April 10, 2026 12:18
@ggerganov
Member

I'll take a look next week.

@ggerganov ggerganov self-assigned this Apr 11, 2026
@0cc4m 0cc4m force-pushed the 0cc4m/server-memory-limit branch from 0124ec9 to 3c53be1 on April 13, 2026 08:15
Comment thread tools/server/server-models.cpp Outdated
Comment on lines +610 to +612
if (params.model.path.empty()) {
    return {};
}
Member

@0cc4m This check prevents using the functionality for models downloaded with -hf because they don't have a path. Should be fixed before merging.

Contributor Author

You're right, I hadn't considered that. Currently it will let it through, but estimate it as having no size, which is not ideal. I could fill the map on the second load, but that risks an OOM when it's loading for the first time. I could try to estimate from the file size before downloading, but that's unreliable and hard to map to devices. Or I could trigger the download before actually loading with common_download_model, estimate, and then load. I'm not sure if that would cause issues in other places; I'd have to try it. What do you think?

Member

Hm not sure. Seems complicated.

Maybe the right way is to have a --download-only CLI argument. If set, the llama-server (and any other tool) would just download the model and exit.

This way, we can split the model loading into stages:

  • llama-server ... --download-only
  • Perform memory size checks
  • llama-server ... --offline

Contributor Author

I implemented this, please take a look when you find some time.

@0cc4m 0cc4m force-pushed the 0cc4m/server-memory-limit branch from cf0ebc4 to da1f168 on May 1, 2026 13:39
@MGAndreasen

This is interesting; I use both small and large models as well, on different systems. If I set my max models to something like 4, load a relatively big model and then 3 smaller ones, the larger model gets partially evicted from VRAM and becomes very slow. If I then unload all but the large model, I'm looking at a VRAM consumption of only a few GB; the rest of the layers have been moved to system memory, and they don't come back into VRAM automatically. At least this way it gets completely unloaded.

@0cc4m
Contributor Author

0cc4m commented May 4, 2026

@ggerganov This is still waiting for your review.

Comment thread src/llama-ext.h
Comment on lines +91 to +96

// Returns the projected memory use (model + context + compute) in bytes
// for the given device within this context. Returns 0 if the device is not used.
LLAMA_API uint64_t llama_context_device_memory(
    const struct llama_context * ctx,
          ggml_backend_dev_t device);
Member

Instead of adding this new function, can you reuse the llama_get_memory_breakdown that we added recently (#22171)?

Comment on lines +757 to +766
device_memory_map mem;
if (base_params.models_memory_margin > 0) {
    std::lock_guard<std::mutex> lk(mutex);
    auto & meta = mapping[name].meta;
    meta.dmm_req = get_model_memory_per_device(meta.preset);
    if (meta.dmm_req.empty()) {
        SRV_WRN("failed to estimate memory for model %s, memory limits will not apply\n", name.c_str());
    }
    mem = meta.dmm_req;
}
Member

Should this chunk of code be moved to the start of _load() so you can deduplicate it here?

SRV_ERR("failed to load model %s after download: %s\n", name.c_str(), e.what());
update_status(name, SERVER_MODEL_STATUS_UNLOADED, 1);
}
}).detach();
Member

In general, I consider detaching threads an anti-pattern. Add a TODO to keep track of the threads in server_models and join them periodically (maybe on each load() and/or update_status()).
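
A small sketch of that direction, with hypothetical member names (not this PR's code): keep the loader threads in the struct and reap finished ones instead of detaching.

// Sketch: track loader threads and join the finished ones periodically.
// Member and type names here are illustrative only.
#include <atomic>
#include <memory>
#include <thread>
#include <vector>

struct loader_thread {
    std::thread       th;
    std::atomic<bool> done{false};
};

struct loader_pool {
    std::vector<std::unique_ptr<loader_thread>> loaders;

    void spawn_load(/* model name, params, ... */) {
        auto lt = std::make_unique<loader_thread>();
        loader_thread * raw = lt.get();
        lt->th = std::thread([raw]() {
            // ... download / load work ...
            raw->done = true;
        });
        loaders.push_back(std::move(lt));
    }

    // call from load() and/or update_status(); joins only threads that finished
    void reap_finished() {
        for (auto it = loaders.begin(); it != loaders.end(); ) {
            if ((*it)->done && (*it)->th.joinable()) {
                (*it)->th.join();
                it = loaders.erase(it);
            } else {
                ++it;
            }
        }
    }

    ~loader_pool() {
        // join everything on shutdown so no thread outlives the pool
        for (auto & lt : loaders) {
            if (lt->th.joinable()) {
                lt->th.join();
            }
        }
    }
};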
