[Mirror] feat: Add model pinning feature to protect critical models from LRU eviction#70
ngxson wants to merge 3 commits into ngxson:master
Conversation
📝 Walkthrough

Adds a preset-only `pin` argument and a per-model `pinned` flag; pinned models are skipped by LRU eviction when `models_max` is reached.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Startup as Server Startup
    participant Repo as Preset Repo
    participant Loader as Model Loader
    participant Meta as Model Meta
    participant Evictor as LRU Evictor
    Startup->>Repo: read model presets
    Repo-->>Loader: preset list per model
    Loader->>Meta: create server_model_meta
    alt preset contains COMMON_ARG_PRESET_PIN
        Loader->>Meta: set meta.pinned = true
    end
    Note right of Meta: pinned flag persists in memory
    Evictor->>Meta: request candidates when models_max reached
    alt unpinned candidates exist
        Evictor->>Loader: unload least-recently-used unpinned model
    else all active models pinned
        Evictor-->>Startup: throw runtime_error / fail to load additional model
    end
```
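The eviction path in the diagram above can be sketched as a small C++ helper. This is an illustrative stand-in, not the real llama.cpp code: `model_meta` and `pick_lru_victim` are hypothetical names reduced to just the fields the eviction logic needs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for server_model_meta, reduced to the two fields
// the eviction logic needs.
struct model_meta {
    int64_t last_used = 0;
    bool    pinned    = false; // pinned models are never auto-unloaded
};

// Return the name of the least-recently-used unpinned model, or an empty
// string when every active model is pinned (the "all pinned" branch).
std::string pick_lru_victim(const std::map<std::string, model_meta> & models) {
    std::string victim;
    int64_t lru_last_used = INT64_MAX;
    for (const auto & m : models) {
        if (!m.second.pinned && m.second.last_used < lru_last_used) {
            lru_last_used = m.second.last_used;
            victim        = m.first;
        }
    }
    return victim; // empty => caller must warn / fail instead of evicting
}
```

When the helper returns an empty string, the caller would take the "all active models pinned" branch from the diagram instead of unloading anything.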
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
📜 Recent review details

Configuration used: defaults · Review profile: CHILL · Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
- Add COMMON_ARG_PRESET_PIN define
- Add pin preset option for model protection
- Add pinned field to server_model_meta
- Modify load_models to set pinned flag
- Update unload_lru to skip pinned models
…pinned

When the models_max limit is reached but all active models are pinned, log a warning message to clarify that automatic unload cannot succeed. Also add `--pin` preset option documentation.
Force-pushed from 1a2cc6d to d9e65c3 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (2)
common/arg.cpp (1)

3527-3532: Inconsistent argument definition: value_hint should match handler signature.

The `pin` preset argument is defined with `value_hint = nullptr` (line 3529) but has a handler that accepts a string value (line 3531). This is inconsistent with other preset arguments:

- `load-on-startup` has `value_hint = "NAME"` (line 3516)
- `stop-timeout` has `value_hint = "SECONDS"` (line 3522)

Since the implementation in `server-models.cpp` (line 250) only checks for the option's presence and doesn't use its value, consider either:

- Using a void handler: `[](common_params &) { /* unused */ }` with no value_hint, or
- Adding a value_hint like `"BOOL"` and documenting that any value enables pinning

🔎 Option 1: Use void handler (simpler)

```diff
 args.push_back(common_arg(
     {"pin"},
-    nullptr,
     "in server router mode, do not unload this model if models_max is exceeded",
-    [](common_params &, const std::string &) { /* unused */ }
+    [](common_params &) { /* unused */ }
 ).set_env(COMMON_ARG_PRESET_PIN).set_preset_only());
```

tools/server/server-models.cpp (1)
394-396: Consider clarifying the comment about pinned models.

The comment explains that pinned models won't be evicted, but it could be more explicit about the consequence: if all active models are pinned, `models_max` can be exceeded and a warning will be logged (lines 413-415).

🔎 Suggested comment improvement

```diff
-    // If all active models are pinned, this condition never holds and no LRU eviction will occur.
-    // The server will keep all pinned models in memory, potentially exceeding models_max.
+    // Pinned models are excluded from LRU eviction. If all active models are pinned,
+    // no eviction occurs and models_max may be exceeded (a warning is logged below).
     if (!m.second.meta.pinned && m.second.meta.last_used < lru_last_used) {
```
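The value_hint/handler mismatch flagged in the `common/arg.cpp` nitpick above can be illustrated with a toy argument struct. This is purely hypothetical — the real `common_arg` API differs — but it shows why a non-null value hint should pair with a value-taking handler.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy model of an argument definition: either it takes no value (void
// handler, no value_hint) or it takes one string value (string handler,
// non-null value_hint). Mixing the two is the inconsistency flagged above.
struct toy_arg {
    std::string  name;
    const char * value_hint = nullptr; // e.g. "NAME", "SECONDS", or nullptr
    std::function<void()>                    handler_void;
    std::function<void(const std::string &)> handler_value;

    bool consistent() const {
        // a value hint implies a value handler, and vice versa
        return (value_hint != nullptr) == static_cast<bool>(handler_value);
    }
};
```

Under this toy model, a flag like `pin` defined with a void handler and no value hint passes the check, while a `nullptr` hint combined with a string handler fails it.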
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)

- common/arg.cpp
- common/arg.h
- tools/server/README.md
- tools/server/server-models.cpp
- tools/server/server-models.h
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/server/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}
📄 CodeRabbit inference engine (AGENTS.md)
Always format C++ code before committing using `git clang-format` with the project's `.clang-format` configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment `void * ptr`, reference alignment `int & ref`)

Files:

- common/arg.h
- tools/server/server-models.h
- tools/server/server-models.cpp
- common/arg.cpp
{ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**}
📄 CodeRabbit inference engine (AGENTS.md)
Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Files:

- tools/server/server-models.h
- tools/server/server-models.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
tools/server/server-models.cpp
🪛 Cppcheck (2.19.0)
tools/server/server-models.cpp
[error] 414-414: syntax error (syntaxError)
🔇 Additional comments (4)
common/arg.h (1)
14-14: LGTM! Preset macro follows established pattern.

The new `COMMON_ARG_PRESET_PIN` macro is consistent with other preset identifiers and properly defines the environment variable for model pinning.

tools/server/server-models.h (1)
61-61: Well-documented field addition.

The new `pinned` field is clearly documented and has an appropriate default value. The inline comment makes the intent immediately clear to maintainers.

tools/server/server-models.cpp (2)
247-253: Pin option handling is correct.

The implementation correctly checks for the presence of the `COMMON_ARG_PRESET_PIN` option and sets the `pinned` flag accordingly. This follows the same pattern as the `stop-timeout` option handling above (lines 234-245).
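The presence-only pattern described in this comment can be sketched as follows. Hedged assumptions: the preset options are modeled here as a plain string map, and `preset_has_pin` is an illustrative helper, not the actual implementation in `server-models.cpp`.

```cpp
#include <cassert>
#include <map>
#include <string>

// For a flag-style preset option such as "pin", only the key's presence
// matters; any associated value is deliberately ignored.
bool preset_has_pin(const std::map<std::string, std::string> & preset_opts) {
    return preset_opts.count("pin") > 0;
}
```

The same check works whether the option was given with a value or bare, which is exactly why the review suggests a void handler for the argument definition.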
413-415: Warning message is clear and appropriate.

The warning correctly alerts operators when `models_max` is reached but all active models are pinned, preventing automatic unloading. This is the expected behavior when users explicitly pin models.

Note: The static analysis tool reports a syntax error at line 414, but this appears to be a false positive. The code is syntactically correct.
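The decision this warning guards can be written out as a small pure function. This is a sketch of the semantics as described in this review — evict an unpinned LRU model at the limit, otherwise warn and exceed `models_max` — with illustrative names, not the server's actual control flow.

```cpp
#include <cassert>
#include <cstddef>

enum class evict_action { none, unload_lru, warn_all_pinned };

// When a new model must be loaded: below models_max nothing happens;
// at the limit, evict an unpinned LRU model if one exists, otherwise
// only warn (models_max may then be exceeded).
evict_action on_model_load(std::size_t n_active, std::size_t models_max, std::size_t n_unpinned) {
    if (n_active < models_max) {
        return evict_action::none;
    }
    return n_unpinned > 0 ? evict_action::unload_lru : evict_action::warn_all_pinned;
}
```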
Mirror from upstream PR: ggml-org#18373
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit

New Features

- Added a `pin` preset option that protects selected models from automatic LRU unloading when `models_max` is reached.

Documentation

- Documented the `--pin` preset option in the server README.