
[Mirror] feat: Add model pinning feature to protect critical models from LRU eviction #70

Open
ngxson wants to merge 3 commits into ngxson:master from malaiwah:feature/pin-models

Conversation

@ngxson
Owner

@ngxson ngxson commented Dec 25, 2025

Mirror from upstream PR: ggml-org#18373

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

  • New Features

    • Model pinning in server mode: models marked as pinned remain loaded and are exempt from automatic LRU unloading when the max model limit is reached; pinned models persist until explicitly unloaded or the server restarts.
  • Documentation

    • Server README updated to document the new preset-only "pin" option and its effects on model loading/unloading.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai bot commented Dec 25, 2025

📝 Walkthrough

Adds a preset-only "pin" argument and a per-model pinned flag; models with the pin preset are marked during load and are excluded from LRU-based eviction when models_max is reached, causing an error if no unpinned models can be evicted.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Preset Argument Infrastructure: common/arg.h, common/arg.cpp | Add macro COMMON_ARG_PRESET_PIN and register a preset-only "pin" argument (env COMMON_ARG_PRESET_PIN) in common_params_add_preset_options. |
| Model Metadata: tools/server/server-models.h | Add bool pinned = false; to struct server_model_meta. |
| Model Loading & Eviction: tools/server/server-models.cpp | load_models() detects COMMON_ARG_PRESET_PIN in model presets and sets inst.meta.pinned = true; unload_lru() skips pinned models when selecting LRU candidates and throws a runtime_error if models_max is reached but no unpinned models are available for eviction. |
| Documentation: tools/server/README.md | Document new preset-only pin option and its effect on preventing automatic model unloading. |
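
For readers who want the eviction rule from the table in concrete form, here is a minimal sketch, not the PR's actual code. The struct model_meta_sketch and the functions pick_lru_unpinned and unload_lru_sketch are hypothetical stand-ins for server_model_meta and unload_lru(); only the pinned flag, the last_used comparison, and the runtime_error outcome are taken from the summary above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-in for server_model_meta (illustration only).
struct model_meta_sketch {
    bool    pinned    = false;  // set when the model's preset carries the "pin" option
    bool    active    = false;  // true while the model is loaded
    int64_t last_used = 0;      // timestamp of the most recent use
};

// Returns the name of the least-recently-used active model that is not pinned,
// or std::nullopt when every active model is pinned.
std::optional<std::string> pick_lru_unpinned(const std::map<std::string, model_meta_sketch> & models) {
    std::optional<std::string> lru_name;
    int64_t                    lru_last_used = INT64_MAX;
    for (const auto & m : models) {
        if (m.second.active && !m.second.pinned && m.second.last_used < lru_last_used) {
            lru_name      = m.first;
            lru_last_used = m.second.last_used;
        }
    }
    return lru_name;
}

// Frees one slot once models_max active models are loaded.
// Throws if the limit is reached and only pinned models remain.
void unload_lru_sketch(std::map<std::string, model_meta_sketch> & models, size_t models_max) {
    size_t n_active = 0;
    for (const auto & m : models) {
        n_active += m.second.active ? 1 : 0;
    }
    if (n_active < models_max) {
        return;  // limit not reached, nothing to evict
    }
    const auto victim = pick_lru_unpinned(models);
    if (!victim) {
        throw std::runtime_error("models_max reached but all active models are pinned");
    }
    models[*victim].active = false;  // stand-in for the real unload call
}
```

In this sketch a pinned model is never chosen even when it has the oldest last_used value, so the caller has to handle the exception when every loaded model is pinned.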

Sequence Diagram(s)

sequenceDiagram
  participant Startup as Server Startup
  participant Repo as Preset Repo
  participant Loader as Model Loader
  participant Meta as Model Meta
  participant Evictor as LRU Evictor

  Startup->>Repo: read model presets
  Repo-->>Loader: preset list per model
  Loader->>Meta: create server_model_meta
  alt preset contains COMMON_ARG_PRESET_PIN
    Loader->>Meta: set meta.pinned = true
  end
  Note right of Meta: pinned flag persists in memory

  Evictor->>Meta: request candidates when models_max reached
  alt unpinned candidates exist
    Evictor->>Loader: unload least-recently-used unpinned model
  else all active models pinned
    Evictor-->>Startup: throw runtime_error / fail to load additional model
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I found a tiny pin and tucked it in a row,
Now models sleep tight where the warm mem-currents flow.
When LRU comes sniffing for someone to send away,
My pinned pals stay resting until a new day.
Hop, nibble, guard—my tiny pin holds sway.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description check | ❓ Inconclusive | The description identifies this as a mirror PR and provides the upstream source link, but lacks detailed information about what the feature does, why it's needed, and testing approach. | Expand the description to explain the model pinning feature, its purpose, and testing details, or confirm if the upstream PR description is considered sufficient. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main feature being added: a model pinning mechanism to prevent LRU eviction of critical models. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9e65c3 and 95d2017.

📒 Files selected for processing (1)
  • tools/server/server-models.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/server-models.cpp

Comment @coderabbitai help to get the list of available commands and usage tips.

Michel Belleau added 2 commits December 25, 2025 22:00
- Add COMMON_ARG_PRESET_PIN define
- Add pin preset option for model protection
- Add pinned field to server_model_meta
- Modify load_models to set pinned flag
- Update unload_lru to skip pinned models
…pinned

When models_max limit is reached but all active models are pinned, log a
warning message to clarify that automatic unload cannot succeed.

... also add --pin preset option documentation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
common/arg.cpp (1)

3527-3532: Inconsistent argument definition: value_hint should match handler signature.

The pin preset argument is defined with value_hint = nullptr (line 3529) but has a handler that accepts a string value (line 3531). This is inconsistent with other preset arguments:

  • load-on-startup has value_hint = "NAME" (line 3516)
  • stop-timeout has value_hint = "SECONDS" (line 3522)

Since the implementation in server-models.cpp (line 250) only checks for the option's presence and doesn't use its value, consider either:

  1. Using a void handler: [](common_params &) { /* unused */ } with no value_hint, or
  2. Adding a value_hint like "BOOL" and documenting that any value enables pinning
🔎 Option 1: Use void handler (simpler)
 args.push_back(common_arg(
     {"pin"},
-    nullptr,
     "in server router mode, do not unload this model if models_max is exceeded",
-    [](common_params &, const std::string &) { /* unused */ }
+    [](common_params &) { /* unused */ }
 ).set_env(COMMON_ARG_PRESET_PIN).set_preset_only());
tools/server/server-models.cpp (1)

394-396: Consider clarifying the comment about pinned models.

The comment explains that pinned models won't be evicted, but it could be more explicit about the consequence: if all active models are pinned, models_max can be exceeded and a warning will be logged (lines 413-415).

🔎 Suggested comment improvement
-                // If all active models are pinned, this condition never holds and no LRU eviction will occur.
-                // The server will keep all pinned models in memory, potentially exceeding models_max.
+                // Pinned models are excluded from LRU eviction. If all active models are pinned,
+                // no eviction occurs and models_max may be exceeded (a warning is logged below).
                 if (!m.second.meta.pinned && m.second.meta.last_used < lru_last_used) {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2cc6d and d9e65c3.

📒 Files selected for processing (5)
  • common/arg.cpp
  • common/arg.h
  • tools/server/README.md
  • tools/server/server-models.cpp
  • tools/server/server-models.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • common/arg.h
  • tools/server/server-models.h
  • tools/server/server-models.cpp
  • common/arg.cpp
{ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**}

📄 CodeRabbit inference engine (AGENTS.md)

Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to tools/server require AI usage disclosure if significant code is generated

Files:

  • tools/server/server-models.h
  • tools/server/server-models.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/server/server-models.cpp
🪛 Cppcheck (2.19.0)
tools/server/server-models.cpp

[error] 414-414: syntax error

(syntaxError)

🔇 Additional comments (4)
common/arg.h (1)

14-14: LGTM! Preset macro follows established pattern.

The new COMMON_ARG_PRESET_PIN macro is consistent with other preset identifiers and properly defines the environment variable for model pinning.

tools/server/server-models.h (1)

61-61: Well-documented field addition.

The new pinned field is clearly documented and has an appropriate default value. The inline comment makes the intent immediately clear to maintainers.

tools/server/server-models.cpp (2)

247-253: Pin option handling is correct.

The implementation correctly checks for the presence of the COMMON_ARG_PRESET_PIN option and sets the pinned flag accordingly. This follows the same pattern as the stop-timeout option handling above (lines 234-245).
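
As a rough illustration of the presence-only check described here (and of why the nitpick on common/arg.cpp notes that the option's value is unused), the sketch below models a preset as a plain string map. Everything in it is an assumption for illustration: the map representation, the names preset_requests_pin and apply_pin_sketch, and the placeholder key, which is not the real value of COMMON_ARG_PRESET_PIN.

```cpp
#include <map>
#include <string>

// Placeholder key for illustration only; the actual value of
// COMMON_ARG_PRESET_PIN does not appear in this review.
static const char * const PRESET_PIN_KEY_SKETCH = "PRESET_PIN_PLACEHOLDER";

// Hypothetical stand-in for the pinned field of server_model_meta.
struct model_meta_pin_sketch {
    bool pinned = false;
};

// True when the preset contains the pin option at all; the value is ignored.
bool preset_requests_pin(const std::map<std::string, std::string> & preset_options) {
    return preset_options.find(PRESET_PIN_KEY_SKETCH) != preset_options.end();
}

// Marks the model as pinned at load time if its preset carries the option.
void apply_pin_sketch(const std::map<std::string, std::string> & preset_options, model_meta_pin_sketch & meta) {
    if (preset_requests_pin(preset_options)) {
        meta.pinned = true;
    }
}
```

Under this reading, a preset that sets the option to any value, including something like "false", would still pin the model, which is worth keeping in mind when weighing the two options suggested for the argument definition.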


413-415: Warning message is clear and appropriate.

The warning correctly alerts operators when models_max is reached but all active models are pinned, preventing automatic unloading. This is the expected behavior when users explicitly pin models.

Note: The static analysis tool reports a syntax error at line 414, but this appears to be a false positive. The code is syntactically correct.
