[Mirror] feat: Add model pinning feature to protect critical models from LRU eviction by ngxson · Pull Request #70 · ngxson/llama.cpp

ngxson · 2025-12-25T19:55:54Z

Mirror from upstream PR: ggml-org#18373

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

New Features
- Model pinning in server mode: models marked as pinned remain loaded and are exempt from automatic LRU unloading when the max model limit is reached; pinned models persist until explicitly unloaded or the server restarts.
Documentation
- Server README updated to document the new preset-only "pin" option and its effects on model loading/unloading.

coderabbitai · 2025-12-25T19:56:08Z

Walkthrough

Adds a preset-only "pin" argument and a per-model pinned flag; models with the pin preset are marked during load and are excluded from LRU-based eviction when models_max is reached, causing an error if no unpinned models can be evicted.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Preset Argument Infrastructure: `common/arg.h`, `common/arg.cpp` | Add macro `COMMON_ARG_PRESET_PIN` and register a preset-only `"pin"` argument (env `COMMON_ARG_PRESET_PIN`) in `common_params_add_preset_options`. |
| Model Metadata: `tools/server/server-models.h` | Add `bool pinned = false;` to `struct server_model_meta`. |
| Model Loading & Eviction: `tools/server/server-models.cpp` | `load_models()` detects `COMMON_ARG_PRESET_PIN` in model presets and sets `inst.meta.pinned = true`; `unload_lru()` skips pinned models when selecting LRU candidates and throws a `runtime_error` if `models_max` is reached but no unpinned models are available for eviction. |
| Documentation: `tools/server/README.md` | Document new preset-only `pin` option and its effect on preventing automatic model unloading. |
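
The load-time half of the change can be illustrated with a minimal sketch. This is not the code from the PR: `PRESET_PIN_KEY`, `model_meta_sketch`, `preset_options`, and `make_meta` are hypothetical names introduced for illustration, and the real value of `COMMON_ARG_PRESET_PIN` is not shown on this page. The only detail taken from the summary is that the presence of the pin option sets a `pinned` flag that defaults to `false`.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the value of COMMON_ARG_PRESET_PIN; the real
// macro value is not shown in this PR summary.
static const char * const PRESET_PIN_KEY = "HYPOTHETICAL_PRESET_PIN";

// Simplified stand-in for server_model_meta; only `pinned` comes from the PR.
struct model_meta_sketch {
    std::string name;
    bool        pinned = false;
};

// Hypothetical preset representation: option key -> raw value.
using preset_options = std::map<std::string, std::string>;

// Mark the model as pinned when the pin option is present in its preset.
// Only the presence of the key matters; the value is ignored.
inline model_meta_sketch make_meta(const std::string & name, const preset_options & preset) {
    model_meta_sketch meta;
    meta.name   = name;
    meta.pinned = preset.find(PRESET_PIN_KEY) != preset.end();
    return meta;
}
```

Checking for presence rather than parsing a value keeps the option behaving like a flag, which matches the summary's description of a preset-only `pin` option.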

Sequence Diagram(s)

sequenceDiagram
  participant Startup as Server Startup
  participant Repo as Preset Repo
  participant Loader as Model Loader
  participant Meta as Model Meta
  participant Evictor as LRU Evictor

  Startup->>Repo: read model presets
  Repo-->>Loader: preset list per model
  Loader->>Meta: create server_model_meta
  alt preset contains COMMON_ARG_PRESET_PIN
    Loader->>Meta: set meta.pinned = true
  end
  Note right of Meta: pinned flag persists in memory

  Evictor->>Meta: request candidates when models_max reached
  alt unpinned candidates exist
    Evictor->>Loader: unload least-recently-used unpinned model
  else all active models pinned
    Evictor-->>Startup: throw runtime_error / fail to load additional model
  end
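
The eviction branch of the diagram can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's `unload_lru()`: `lru_entry_sketch`, `model_table`, and `unload_lru_sketch` are hypothetical names, the error message is invented, and the sketch assumes `models_max` is at least 1. It only shows the selection rule described above: pinned models are never candidates, and an error is raised when the limit is reached with nothing evictable.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical per-model state; the real struct carries more fields.
struct lru_entry_sketch {
    bool    active    = false;
    bool    pinned    = false;
    int64_t last_used = 0;
};

using model_table = std::map<std::string, lru_entry_sketch>;

// Returns the name of the evicted model, or "" if the limit is not reached.
// Throws std::runtime_error when the limit is reached and every active
// model is pinned.
inline std::string unload_lru_sketch(model_table & models, size_t models_max) {
    size_t  active_count  = 0;
    auto    lru_it        = models.end();
    int64_t lru_last_used = std::numeric_limits<int64_t>::max();

    for (auto it = models.begin(); it != models.end(); ++it) {
        if (!it->second.active) {
            continue;
        }
        active_count++;
        // Pinned models are counted as active but never chosen for eviction.
        if (!it->second.pinned && it->second.last_used < lru_last_used) {
            lru_it        = it;
            lru_last_used = it->second.last_used;
        }
    }

    if (active_count < models_max) {
        return "";  // room left, nothing to evict
    }
    if (lru_it == models.end()) {
        throw std::runtime_error("models_max reached and all active models are pinned");
    }
    lru_it->second.active = false;  // stands in for the real unload
    return lru_it->first;
}
```

Counting pinned models toward the active total while excluding them from candidate selection is what makes the "all active models pinned" case reachable, so the caller has to be prepared for the exception.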

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I found a tiny pin and tucked it in a row,
Now models sleep tight where the warm mem-currents flow.
When LRU comes sniffing for someone to send away,
My pinned pals stay resting until a new day.
Hop, nibble, guard—my tiny pin holds sway.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	❓ Inconclusive	The description identifies this as a mirror PR and provides the upstream source link, but lacks detailed information about what the feature does, why it's needed, and testing approach.	Expand the description to explain the model pinning feature, its purpose, and testing details, or confirm if the upstream PR description is considered sufficient.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main feature being added: a model pinning mechanism to prevent LRU eviction of critical models.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9e65c3 and 95d2017.

📒 Files selected for processing (1)

tools/server/server-models.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

tools/server/server-models.cpp

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

common/arg.cpp (1)
3527-3532: Inconsistent argument definition: value_hint should match handler signature.

The pin preset argument is defined with value_hint = nullptr (line 3529) but has a handler that accepts a string value (line 3531). This is inconsistent with other preset arguments:

- load-on-startup has value_hint = "NAME" (line 3516)
- stop-timeout has value_hint = "SECONDS" (line 3522)

Since the implementation in server-models.cpp (line 250) only checks for the option's presence and doesn't use its value, consider either:

- Using a void handler: [](common_params &) { /* unused */ } with no value_hint, or
- Adding a value_hint like "BOOL" and documenting that any value enables pinning
🔎 Option 1: Use void handler (simpler)
 args.push_back(common_arg(
     {"pin"},
-    nullptr,
     "in server router mode, do not unload this model if models_max is exceeded",
-    [](common_params &, const std::string &) { /* unused */ }
+    [](common_params &) { /* unused */ }
 ).set_env(COMMON_ARG_PRESET_PIN).set_preset_only());
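
The consistency rule behind this nitpick can be made concrete with a small sketch. It does not use the real `common_arg` type: `arg_sketch` and `is_consistent` are hypothetical names, and the rule encoded here (a value-taking handler should come with a value hint, a flag handler with none) is an assumption drawn from the comment above, not from the project's code.

```cpp
#include <functional>
#include <string>

// Hypothetical, simplified argument descriptor for illustration only.
struct arg_sketch {
    const char *                             value_hint = nullptr;  // e.g. "SECONDS"; nullptr for a flag
    std::function<void()>                    flag_handler;          // handler that takes no value
    std::function<void(const std::string &)> value_handler;         // handler that takes a value
};

// An argument is treated as consistent when exactly one handler is set and
// a value hint is present if and only if the handler takes a value.
inline bool is_consistent(const arg_sketch & a) {
    const bool takes_value = static_cast<bool>(a.value_handler);
    const bool is_flag     = static_cast<bool>(a.flag_handler);
    if (takes_value == is_flag) {
        return false;  // neither or both handlers set
    }
    return takes_value == (a.value_hint != nullptr);
}
```

Under this rule, the definition as reviewed (string handler, no hint) is flagged, while both suggested fixes (void handler with no hint, or string handler with a hint such as "BOOL") pass.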
tools/server/server-models.cpp (1)
394-396: Consider clarifying the comment about pinned models.

The comment explains that pinned models won't be evicted, but it could be more explicit about the consequence: if all active models are pinned, models_max can be exceeded and a warning will be logged (lines 413-415).
🔎 Suggested comment improvement
-                // If all active models are pinned, this condition never holds and no LRU eviction will occur.
-                // The server will keep all pinned models in memory, potentially exceeding models_max.
+                // Pinned models are excluded from LRU eviction. If all active models are pinned,
+                // no eviction occurs and models_max may be exceeded (a warning is logged below).
                 if (!m.second.meta.pinned && m.second.meta.last_used < lru_last_used) {

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2cc6d and d9e65c3.

📒 Files selected for processing (5)

common/arg.cpp
common/arg.h
tools/server/README.md
tools/server/server-models.cpp
tools/server/server-models.h

🚧 Files skipped from review as they are similar to previous changes (1)

tools/server/README.md

🧰 Additional context used

📓 Path-based instructions (2)

**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

common/arg.h
tools/server/server-models.h
tools/server/server-models.cpp
common/arg.cpp

{ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**}

📄 CodeRabbit inference engine (AGENTS.md)

Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to tools/server require AI usage disclosure if significant code is generated

Files:

tools/server/server-models.h
tools/server/server-models.cpp

🧠 Learnings (2)

📓 Common learnings

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

📚 Learning: 2025-12-24T11:28:22.582Z

Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

tools/server/server-models.cpp

🪛 Cppcheck (2.19.0)

tools/server/server-models.cpp

[error] 414-414: syntax error

(syntaxError)

🔇 Additional comments (4)

common/arg.h (1)

14-14: LGTM! Preset macro follows established pattern.

The new COMMON_ARG_PRESET_PIN macro is consistent with other preset identifiers and properly defines the environment variable for model pinning.

tools/server/server-models.h (1)

61-61: Well-documented field addition.

The new pinned field is clearly documented and has an appropriate default value. The inline comment makes the intent immediately clear to maintainers.

tools/server/server-models.cpp (2)

247-253: Pin option handling is correct.

The implementation correctly checks for the presence of the COMMON_ARG_PRESET_PIN option and sets the pinned flag accordingly. This follows the same pattern as the stop-timeout option handling above (lines 234-245).

413-415: Warning message is clear and appropriate.

The warning correctly alerts operators when models_max is reached but all active models are pinned, preventing automatic unloading. This is the expected behavior when users explicitly pin models.

Note: The static analysis tool reports a syntax error at line 414, but this appears to be a false positive. The code is syntactically correct.

github-actions bot added examples server labels Dec 25, 2025

Michel Belleau added 2 commits December 25, 2025 22:00

Add pin option to protect models from LRU eviction in router mode

8d59438

- Add COMMON_ARG_PRESET_PIN define
- Add pin preset option for model protection
- Add pinned field to server_model_meta
- Modify load_models to set pinned flag
- Update unload_lru to skip pinned models

Add warning when LRU eviction cannot succeed due to all models being pinned

d9e65c3

When models_max limit is reached but all active models are pinned, log a warning message to clarify that automatic unload cannot succeed. ... also add --pin preset option documentation.

malaiwah force-pushed the feature/pin-models branch from 1a2cc6d to d9e65c3 Compare December 26, 2025 03:01

coderabbitai bot reviewed Dec 26, 2025

Throw error when models_max limit reached with no unpinned models for LRU eviction

95d2017

ngxson wants to merge 3 commits into ngxson:master from malaiwah:feature/pin-models