[Mirror] server: improve slots scheduling for n_cmpl #80

Open
ngxson wants to merge 13 commits into master from xsn/n_cmpl_sync_barrier

Conversation

@ngxson
Owner

@ngxson ngxson commented Jan 12, 2026

Mirror from upstream PR: ggml-org#18789

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

  • Refactor
    • Centralized parent/child orchestration: tasks may include child entries and role checks now drive slot behavior and coordinated multi-slot launches.
  • Bug Fixes
    • Fewer scheduler conflicts with improved deferral, rollback and unreservation for partial or failed multi-slot launches.
  • API
    • Waiting-task registration now accepts task IDs; deferred-task popping can be prioritized by slot ID.
  • Tests
    • Tests updated to exercise repeated multi-request scenarios and verify cache reuse.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

Refactors orchestration to use task-level parent/child predicates, moves child tasks into server_task with add_child()/is_parent()/is_child(), adds multi-slot helpers (get_free_slots, launch_slots_with_parent_task), changes queue APIs to ID-based waiting and targeted deferred popping, and updates slot callbacks/logging.
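The parent/child task model described in the walkthrough can be sketched roughly as follows. This is an illustrative reconstruction based on the summary above, not the actual `tools/server/server-task.h`: field names follow the walkthrough, but the real header (for example, the `LLAMA_DEFAULT_SEED` guard around the seed bump) differs in detail.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the refactored task model: children are stored
// inside the parent task, and role predicates are derived from that storage.
struct server_task {
    int      id        = -1;
    int      id_parent = -1;   // -1: not a child
    int      id_slot   = -1;   // -1: no slot pinned
    uint32_t seed      = 0;

    // child tasks reusing this task's prompt (scheduling-time only)
    std::vector<server_task> child_tasks;

    bool is_parent() const { return !child_tasks.empty(); }
    bool is_child()  const { return id_parent != -1; }

    // append a copy of this task as a child, diversifying the seed so
    // that completions sharing one prompt do not produce identical output
    void add_child(int id_child) {
        server_task copy = *this;
        copy.child_tasks.clear();
        copy.id        = id_child;
        copy.id_parent = id;
        copy.id_slot   = -1;  // children may not pin a slot
        copy.seed     += (uint32_t) child_tasks.size() + 1;
        child_tasks.push_back(std::move(copy));
    }
};
```

Because `is_parent()` is derived from `child_tasks` rather than a separate counter, the parent/child invariant cannot desync (a concern raised later in this review).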

Changes

Cohort / File(s) Summary
Server context & launch flow
tools/server/server-context.cpp
Added get_free_slots(...) and launch_slots_with_parent_task(...); process_single_task now reserves/launches child slots before parent; updated get_slot_by_id(int id_slot) parameter name; control flow now checks slot.task->is_child()/is_parent().
Slot structure & callbacks
tools/server/server-context.cpp (struct changes)
server_slot::callback_on_release signature changed to std::function<void(int /* slot_id */)>; removed server_slot::is_child()/is_parent(); runtime role checks use slot.task->is_child()/is_parent(); slot.state initialized from slot.task->is_child().
Task model & child management
tools/server/server-task.h
Replaced n_children with std::vector<server_task> child_tasks; added is_parent()/is_child() predicates; create_child(...) → add_child(...), which appends child tasks into child_tasks; get_list_id() now includes child IDs.
Queue API / posting & deferred pop
tools/server/server-queue.h, tools/server/server-queue.cpp
pop_deferred_task() → pop_deferred_task(int id_slot) to prioritize deferred tasks for a slot; add_waiting_tasks(...) → add_waiting_task_ids(const std::unordered_set<int>&); post_task asserts non-parent; post_tasks indexes parents+children and populates waiting ID set.
Completions & task insertion
tools/server/server-context.cpp, tools/server/server-queue.cpp
handle_completions_impl now uses task.add_child(...) and appends parent tasks to the queue (instead of inserting at front); indexing and deferred/queuing adjusted for multi-slot launches.
CLI tokenization & safety checks
tools/server/server-context.cpp
tokenize_cli_input now asserts task.cli_input != nullptr; related logging/comments updated.
Tests
tools/server/tests/unit/test_chat_completion.py
test_chat_completions_multiple_choices repeats requests with id_slot: 0 to force same-slot usage and verify cache/reuse behavior across requests.
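The ID-based deferred pop from the queue changes above can be sketched as follows. This is a minimal stand-in, not the real `server-queue.cpp`: the `deferred_task`/`deferred_queue` shapes are assumed for illustration; only the prioritization behavior (prefer a task pinned to the freed slot, else pop the front) follows the summary.

```cpp
#include <deque>

// Hypothetical shape of a deferred task entry.
struct deferred_task {
    int task_id;
    int id_slot; // -1 if the task does not request a specific slot
};

struct deferred_queue {
    std::deque<deferred_task> tasks;

    // Returns the popped task id, or -1 if nothing is available.
    // With id_slot == -1, simply pops the front task (if any).
    int pop_deferred_task(int id_slot) {
        // first pass: prefer a task explicitly waiting for this slot
        if (id_slot != -1) {
            for (auto it = tasks.begin(); it != tasks.end(); ++it) {
                if (it->id_slot == id_slot) {
                    int id = it->task_id;
                    tasks.erase(it);
                    return id;
                }
            }
        }
        // fallback: pop the front available task
        if (!tasks.empty()) {
            int id = tasks.front().task_id;
            tasks.pop_front();
            return id;
        }
        return -1;
    }
};
```

This matches the behavior noted later in the review: a freed slot first drains tasks deferred for it specifically, and `-1` still pops the next available task rather than being a no-op.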

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Queue as ResponseQueue
    participant Context as ServerContext
    participant Slots as SlotManager

    Client->>Queue: post_tasks(parent + children)
    Queue->>Context: enqueue parent_task (with child metadata)
    Context->>Slots: get_free_slots(n_needed, exclude_id_slot)
    alt enough slots
        Slots-->>Context: child_slots list
        Context->>Slots: launch_slots_with_parent_task(parent_slot, child_slots, parent_task)
        Slots-->>Context: child slots started
        Slots-->>Context: parent slot started
        Context->>Queue: update states / notify_started
        Queue-->>Client: tasks_started
    else insufficient slots
        Slots-->>Context: not_enough_slots
        Context->>Queue: pop_deferred_task(id_slot) / defer parent_task
        Queue-->>Client: task_deferred
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudged the slots and set a queue,
I tucked my children in two by two,
They hop ahead, then I take my place,
I call back with an id — freed my space,
A tidy, scheduled rabbit race!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The PR description is incomplete, providing only a mirror reference and review instruction without substantive content about the changes, objectives, or rationale. Resolution: add a detailed description explaining the refactoring goals, key changes to slot lifecycle, task-based role checking, and improvements to scheduling logic for handling n_cmpl scenarios.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 36.36%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title '[Mirror] server: improve slots scheduling for n_cmpl' directly relates to the main changes, which refactor server slot scheduling by moving parent/child checks from slots to tasks and enhancing multi-slot launch orchestration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5474a and 64b48eb.

📒 Files selected for processing (2)
  • tools/server/server-context.cpp
  • tools/server/server-task.h
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🔇 Additional comments (16)
tools/server/server-task.h (4)

125-127: LGTM!

The child_tasks vector with clear documentation about its temporary scheduling usage and lifetime constraints is well-designed.


198-207: LGTM!

The updated get_list_id correctly aggregates both parent and child task IDs.


209-226: LGTM!

The add_child method correctly creates child task copies with appropriate seed diversification. Setting id_slot = -1 ensures child tasks cannot specify a slot, and the seed increment logic properly avoids identical outputs across completions.


234-240: LGTM!

The is_parent() and is_child() predicates are clean and provide clear semantics for task role identification.

tools/server/server-context.cpp (12)

161-161: LGTM!

The parameter name comment clarifies the callback's intent.


312-315: LGTM!

Correctly updated to use task-based is_child() predicate.


797-799: LGTM!

The callback now properly passes the slot_id to enable targeted deferred task retrieval.


912-920: LGTM!

Parameter rename improves clarity.


1587-1587: LGTM!

The assertion provides a defensive check. Callers already guard against null cli_input before invoking this function.


1621-1632: LGTM!

The get_free_slots helper correctly identifies available slots while excluding a specified slot (typically the parent slot).


1634-1677: LGTM!

The launch_slots_with_parent_task helper properly orchestrates launching child tasks before the parent, with coordinated rollback on failure. The return values are now correctly using true/false for success/failure.
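The children-first launch with coordinated rollback described here can be sketched as below. This is a simplified model, assuming a toy `slot` shape and a stand-in `launch`/`release` pair; the real helper in server-context.cpp operates on `server_slot` and task objects.

```cpp
#include <vector>

// Toy slot: can_launch stands in for launch preconditions.
struct slot {
    int  id         = 0;
    bool can_launch = true;
    bool running    = false;
};

bool launch(slot & s) {
    if (!s.can_launch) {
        return false;
    }
    s.running = true;
    return true;
}

void release(slot & s) { s.running = false; }

// Launch all child slots first, then the parent; if any launch fails,
// roll back every slot launched so far and report failure.
bool launch_slots_with_parent_task(slot & parent, std::vector<slot *> children) {
    std::vector<slot *> launched;
    auto rollback = [&]() {
        for (slot * l : launched) {
            release(*l);
        }
    };
    for (slot * c : children) {
        if (!launch(*c)) {
            rollback();
            return false;
        }
        launched.push_back(c);
    }
    if (!launch(parent)) {
        rollback();
        return false;
    }
    return true;
}
```

The key design point is that failure anywhere leaves no child running without its parent, which is exactly the partial-failure concern a later comment in this review raises about the real code.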


1719-1735: LGTM!

The parent task scheduling logic correctly reserves child slots separately from the parent slot by using get_free_slots(n_child_tasks, slot->id) with exclusion. This addresses the previous critical issue where the parent slot was incorrectly counted toward child slot requirements.
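The exclusion-based scan can be illustrated with a short sketch. The `slot` shape and function signature here are assumptions for illustration; the real `get_free_slots` works on `server_slot` state, but the exclusion logic (never counting the parent slot toward the child requirement) is the point.

```cpp
#include <cstddef>
#include <vector>

// Toy slot with just enough state for the scan.
struct slot {
    int  id         = 0;
    bool processing = false;
    bool is_free() const { return !processing; }
};

// Collect up to n_needed free slots, skipping exclude_id_slot
// (typically the already-reserved parent slot).
std::vector<slot *> get_free_slots(std::vector<slot> & slots, size_t n_needed, int exclude_id_slot) {
    std::vector<slot *> res;
    for (auto & s : slots) {
        if (s.id == exclude_id_slot) {
            continue; // the parent slot must not count toward child slots
        }
        if (s.is_free()) {
            res.push_back(&s);
            if (res.size() == n_needed) {
                break;
            }
        }
    }
    return res;
}
```

If the returned vector is shorter than `n_needed`, the caller defers the parent task instead of launching a partial set.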


1188-1192: LGTM!

Slot state initialization correctly uses task-based is_child() predicate.


2003-2007: LGTM!

Context shift is correctly disabled for both parent and child tasks to preserve shared prompt integrity.


2703-2720: LGTM!

The child slot activation logic correctly copies the parent's state to all waiting child slots after prompt processing completes.


2995-3024: LGTM!

The completions task construction correctly uses add_child() for child task creation. The commented tasks.reserve() note appropriately acknowledges the dynamic sizing due to child tasks.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tools/server/server-queue.cpp (1)

338-355: Potential correctness issue: post_tasks() requires task IDs to be assigned before calling it.

The sequence id_tasks = server_task::get_list_id(tasks) followed by add_waiting_task_ids(id_tasks) will misbehave if any task.id / child_task.id is still -1 (the queue’s post(std::vector<...>) assigns IDs later, but the response reader needs stable IDs now). Please either assert that IDs are already assigned or assign them here before building id_tasks.

Proposed fix (assert ids are assigned before using them)
 void server_response_reader::post_tasks(std::vector<server_task> && tasks, bool front) {
     GGML_ASSERT(id_tasks.empty() && "post_tasks() can only be called once per reader");
+    for (const auto & t : tasks) {
+        GGML_ASSERT(t.id != -1 && "post_tasks(): task.id must be assigned before posting");
+        for (const auto & c : t.child_tasks) {
+            GGML_ASSERT(c.id != -1 && "post_tasks(): child_task.id must be assigned before posting");
+        }
+    }
     id_tasks = server_task::get_list_id(tasks);
-    states.reserve(tasks.size());
+    states.reserve(id_tasks.size());
     size_t index = 0;
     for (auto & task : tasks) {
         task.index = index++;
         states.push_back(task.create_state());
         // for child tasks
         for (auto & child_task : task.child_tasks) {
             child_task.index = index++;
             states.push_back(child_task.create_state());
         }
     }
     GGML_ASSERT(states.size() == id_tasks.size());
     queue_results.add_waiting_task_ids(id_tasks);
     queue_tasks.post(std::move(tasks), front);
 }
tools/server/server-task.h (1)

123-211: Fix invariant: n_children vs child_tasks.size() can desync (possible OOB later).

Right now is_parent() is n_children > 0, but the actual child storage is child_tasks. If any code sets one without the other, downstream code that indexes children by n_children can read past child_tasks. Recommend making n_children derived (or at least kept in sync inside add_child).

Proposed fix (make “parent” derived from stored children)
 struct server_task {
@@
-    int n_children =  0; // number of tasks reusing this prompt
+    int n_children =  0; // number of tasks reusing this prompt (keep in sync with child_tasks.size())
@@
     void add_child(int id_parent, int id_child) {
         server_task copy;
@@
         if (copy.params.sampling.seed != LLAMA_DEFAULT_SEED) {
             copy.params.sampling.seed += (uint32_t)child_tasks.size() + 1;
         }
 
         child_tasks.push_back(std::move(copy));
+        n_children = (int) child_tasks.size();
     }
@@
     bool is_parent() const {
-        return n_children > 0;
+        return !child_tasks.empty();
     }
 };
🤖 Fix all issues with AI agents
In @tools/server/server-context.cpp:
- Around line 3075-3086: Remove the unused local variable child_tasks declared
in the block that prepares child tasks in server-context (the line
"std::vector<server_task> child_tasks;"); delete that declaration and keep the
existing logic that sets task.n_children, calls task.add_child(...), and pushes
task into tasks (no other modifications). Ensure there are no remaining
references to child_tasks after removal and rebuild to confirm the
unused-variable warning is gone.
- Around line 1653-1786: The bug is that try_reserve_n_slots counts the
already-reserved parent slot (slot->task_id_next == task.id) toward child slots;
change try_reserve_n_slots to accept an optional exclude_slot_id (e.g.,
signature try_reserve_n_slots(size_t n_required, int task_id, int
exclude_slot_id = -1) or similar) and skip any slot whose id equals
exclude_slot_id when scanning/reserving, then call it from process_single_task
with the parent slot id (try_reserve_n_slots(task.n_children, task.id,
slot->id)); update any other call sites accordingly and ensure
launch_slots_with_child_tasks still assumes the parent slot is excluded from
child reservations.
🧹 Nitpick comments (2)
tools/server/tests/unit/test_chat_completion.py (1)

494-507: Add a request timeout to avoid hangs if the scheduler regresses.

This test intentionally stresses “slot reuse”; if it deadlocks, pytest may hang. Consider passing a short timeout= to server.make_request(...) (if supported by the test utility) so failures are bounded-time.

tools/server/server-queue.cpp (1)

220-227: OK, but consider reducing per-id debug spam.

Looping RES_DBG(...) for every id in a batch can get very noisy for large batches; consider one aggregate log (or guard behind a higher debug level) if this shows up in practice.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcf7546 and 821e329.

📒 Files selected for processing (5)
  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
  • tools/server/server-queue.h
  • tools/server/server-task.h
  • tools/server/tests/unit/test_chat_completion.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:21.151Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:21.151Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).

Applied to files:

  • tools/server/server-queue.cpp
🧬 Code graph analysis (2)
tools/server/server-queue.h (1)
tools/server/server-task.h (1)
  • unordered_set (169-178)
tools/server/tests/unit/test_chat_completion.py (2)
tools/server/server-context.cpp (1)
  • res (1063-1088)
tools/server/tests/utils.py (1)
  • make_request (288-319)
🔇 Additional comments (4)
tools/server/server-queue.cpp (1)

328-336: Good guardrail: prevent posting “parent” via post_task().

The new GGML_ASSERT(!task.is_parent()) makes the API expectation explicit and avoids silently mis-scheduling parent/child sets.

tools/server/server-context.cpp (2)

87-90: Slot reservation primitive looks fine.

task_id_next + reset clearing it is a clear way to encode “only accept task X next” at the slot level.

Also applies to: 183-183


2060-2064: Good: explicitly forbid context shift for shared prompts.

Rejecting ctx-shift when is_parent()/is_child() prevents hard-to-reason-about KV mutations across linked tasks.

tools/server/server-queue.h (1)

124-140: API migration to add_waiting_task_ids(...) is complete.

Verification confirms no remaining calls to the old API. The new method is declared, implemented, and actively used in the codebase.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)

1684-1721: Potential partial failure scenario worth noting.

If launch_slot_with_task fails for a child task (line 1701), any previously-launched child slots continue processing without a parent being launched. This could result in orphaned task results.

The caller does call unreserve_slots on failure (line 1779), which prevents further slot reservations, but it doesn't cancel already-launched children.

This is likely acceptable for now given the complexity of proper cleanup, but worth documenting as a known limitation or addressing in a follow-up if it causes issues in practice.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 821e329 and 25702ba.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🔇 Additional comments (12)
tools/server/server-context.cpp (12)

87-90: LGTM!

The new task_id_next field provides a clean mechanism for slot reservation. The comment clearly explains its purpose for coordinating parent + child task scheduling.


183-183: LGTM!

Properly resets the reservation field when a slot is reset.


345-345: LGTM!

The task pointer is asserted non-null earlier in the function (line 335), so this access is safe.


1218-1224: LGTM!

The slot.task is assigned via make_unique immediately before this access, so it's guaranteed to be non-null.


1619-1619: LGTM!

The assertion is appropriate since the caller (process_single_task at line 1732) checks task.cli_input != nullptr before invoking this function.


1653-1674: LGTM!

The reservation logic is sound:

  1. Skips the current (parent) slot
  2. Stops early when enough slots are reserved
  3. Only reserves slots that are either already reserved for this task or completely free
  4. Returns success only when all required slots are actually available (not processing)

The design correctly handles the case where slots may be reserved but still finishing previous work.
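The four properties listed above can be captured in a compact sketch. This is loosely modeled on the behavior the review describes, with an assumed `slot` shape; reservations for the task are left in place on failure, matching the design where a later retry reuses them and `unreserve_slots` cleans up on cancellation.

```cpp
#include <cstddef>
#include <vector>

// Toy slot: task_id_next == -1 means "not reserved".
struct slot {
    int  id           = 0;
    int  task_id_next = -1;
    bool processing   = false;
};

// Try to reserve n_required child slots for task_id, skipping the
// parent slot (exclude_id_slot). A slot counts as ready only when it
// is reserved for this task AND not still processing previous work.
bool try_reserve_n_slots(std::vector<slot> & slots, size_t n_required, int task_id, int exclude_id_slot) {
    size_t n_ready = 0;
    for (auto & s : slots) {
        if (s.id == exclude_id_slot) {
            continue; // 1. skip the parent slot
        }
        if (n_ready == n_required) {
            break;    // 2. stop early once enough slots are ready
        }
        // 3. only touch slots already reserved for this task, or unreserved
        if (s.task_id_next != -1 && s.task_id_next != task_id) {
            continue;
        }
        s.task_id_next = task_id;
        if (!s.processing) {
            n_ready++; // 4. reserved *and* available
        }
    }
    return n_ready == n_required;
}
```

A slot that is reserved but still processing keeps its reservation without counting as ready, so a deferred parent task makes progress across retries instead of losing its claim.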


1676-1682: LGTM!

Simple and correct cleanup function that releases all slot reservations for a given task.


1767-1790: LGTM!

The parent task scheduling logic correctly implements a "sync barrier" pattern:

  1. Reserves the current slot for this task
  2. Attempts to reserve required child slots (reusing any existing reservations for this task)
  3. Only proceeds when all slots are both reserved and available
  4. If not all slots are ready, clears only the current slot's reservation and defers

The design handles the case where a deferred task may be assigned to a different slot on retry - existing reservations from prior attempts are correctly reused via the task_id matching.


1806-1806: LGTM!

Proper cleanup of slot reservations when a task is cancelled. This prevents reservation leaks.


2064-2064: LGTM!

The slot is in SLOT_STATE_GENERATING at this point (verified by line 2049), which guarantees slot.task is valid.


2764-2764: LGTM!

Consistent with the refactor moving predicates to server_task. The slot is in SLOT_STATE_DONE_PROMPT state, ensuring slot.task is valid.


3058-3088: LGTM!

The child task creation is refactored to use task.add_child(...), which attaches children to the parent task. This aligns with the new launch_slots_with_child_tasks helper that expects parent_task.child_tasks to contain the child tasks for coordinated launch.

@ngxson
Owner Author

ngxson commented Jan 12, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

1006-1027: Missing reservation check in LRU slot selection.

The LRU fallback loop does not check slot.task_id_next, unlike the LCP-based selection above it. This could cause a reserved slot to be selected for an unrelated task.

🐛 Proposed fix to add reservation check in LRU selection
         // find the slot that has been least recently used
         if (ret == nullptr) {
             int64_t t_last = -1;
 
             for (server_slot & slot : slots) {
                 // skip the slot if it is not available
                 if (slot.is_processing()) {
                     continue;
                 }
 
+                // skip slot reserved for other tasks
+                if (slot.task_id_next != -1 && slot.task_id_next != task.id) {
+                    continue;
+                }
+
                 // select the current slot if the criteria match
                 if (!ret || slot.t_last_used <= t_last) {
                     t_last = slot.t_last_used;
                     ret = &slot;
                 }
             }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da6e2ba and d6b0d23.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: openEuler-latest-cmake-cann (aarch64, 910b, Release)
  • GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (Release)
  • GitHub Check: server-windows
  • GitHub Check: server-build (Release)
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: WebUI Checks
  • GitHub Check: pyright type-check
🔇 Additional comments (12)
tools/server/server-context.cpp (12)

87-90: LGTM!

The new task_id_next field for slot reservation is well-documented with a clear comment explaining its purpose.


345-352: LGTM!

The release() method correctly passes task_id_next to the callback, enabling deferred task processing when a reserved slot becomes available.


970-973: LGTM!

Correctly skips slots reserved for other tasks while allowing a slot to be used if it's reserved for the requesting task.


1624-1657: LGTM!

The GGML_ASSERT is appropriate as the function is only called when cli_input is set (verified by the caller at line 1755).


1659-1690: LGTM!

The reservation logic correctly handles:

  • Reusing slots already reserved for this task
  • Only succeeding when all required slots are both reserved and not processing
  • Proper unreservation cleanup in unreserve_slots

1692-1744: LGTM!

The multi-slot launch logic is well-structured with proper cleanup paths:

  • Child slots are launched first, then the parent
  • On failure, all launched slots are released and reservations are cleared
  • Reservations are automatically cleared via slot.reset() in launch_slot_with_task

1792-1816: LGTM!

The parent task scheduling logic correctly:

  1. Reserves the current slot for this task
  2. Attempts to reserve additional child slots
  3. Launches all slots together on success, or defers while keeping partial reservations

2091-2095: LGTM!

Correctly updated to use slot.task->is_parent() and slot.task->is_child() after the refactoring that moved these predicates to the task.


2789-2814: LGTM!

The parent task prompt completion handling correctly uses the updated slot.task->is_parent() API.


3085-3116: LGTM!

The refactoring correctly uses task.add_child() to attach child tasks to the parent. Only the parent task is pushed to the queue, and children are launched together via launch_slots_with_child_tasks.


1226-1230: LGTM!

Child slots correctly start in SLOT_STATE_WAIT_OTHER to wait for the parent's prompt processing, while parent/standalone slots start in SLOT_STATE_STARTED.


829-832: The callback implementation is correct. When task_id_next is -1, pop_deferred_task() explicitly pops the front available deferred task from the queue (not a no-op). The function safely handles this case by checking if the deferred queue is not empty before proceeding, and the -1 value is intentionally used to pop the next available task without requiring a specific task ID.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/server/server-context.cpp`:
- Around line 1662-1704: The function launch_slots_with_parent_task currently
returns bool but uses integer return codes (0 for success, 2 for failure),
causing inverted logic; change its signature from bool to int and keep the
existing return values (0 = success, non-zero = failure), and update all call
sites that invoke launch_slots_with_parent_task (and any checks like
!launch_slots_with_parent_task(...)) to interpret the int correctly (e.g., check
for != 0 on failure or == 0 on success); no change needed to
launch_slot_with_task or release_slots logic besides adjusting callers to use
the int result.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba86ad9 and 79c1967.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: openEuler-latest-cmake-cann (x86, 310p, Release)
  • GitHub Check: openEuler-latest-cmake-cann (x86, 910b, Release)
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: pyright type-check
  • GitHub Check: WebUI Checks
  • GitHub Check: server-build (Release)
🔇 Additional comments (10)
tools/server/server-context.cpp (10)

170-170: LGTM!

Good documentation improvement to clarify the parameter represents slot_id.


340-342: LGTM!

The refactor to use slot.task->is_child() instead of slot-level checks is cleaner. The task is guaranteed non-null by the assert at line 330.


824-826: LGTM!

The callback properly uses the slot_id parameter to pop deferred tasks specifically associated with the released slot.


939-947: LGTM!

Good naming improvement to distinguish id_slot from other ID types.


1215-1219: LGTM!

Correct state assignment: child slots wait for the parent to process the prompt before they can proceed.


1614-1614: LGTM!

The assert makes the precondition explicit. The caller at line 1715 already checks task.cli_input != nullptr before calling this function.


1746-1762: LGTM, pending the return value fix.

The parent task scheduling logic is correct:

  1. Get free slots for child tasks (excluding parent slot)
  2. If not enough slots, defer the task
  3. Launch all slots together using launch_slots_with_parent_task

This addresses the previous review comment about counting the parent slot toward child slot requirements.
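The deferral decision described above can be sketched in isolation. Note that `get_free_slots` and `can_launch` here are illustrative stand-ins, not the server's actual API:

```cpp
#include <cstddef>
#include <vector>

struct server_slot { bool busy; };

// Collect free slots, excluding the slot reserved for the parent task.
std::vector<server_slot*> get_free_slots(std::vector<server_slot> & slots, const server_slot * exclude) {
    std::vector<server_slot*> free;
    for (auto & s : slots) {
        if (!s.busy && &s != exclude) {
            free.push_back(&s);
        }
    }
    return free;
}

// Returns true if enough free slots exist for all child tasks;
// otherwise the parent task must be deferred.
bool can_launch(std::vector<server_slot> & slots, const server_slot & parent, size_t n_children) {
    auto free = get_free_slots(slots, &parent); // parent slot does not count toward children
    return free.size() >= n_children;
}
```

The key point the review calls out is the `exclude` parameter: without it, the parent's own slot would be counted toward the child-slot requirement.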


2030-2034: LGTM!

Correctly prevents context shift for parent/child slots, which would desynchronize the shared prompt state.


2730-2748: LGTM!

The parent-child state synchronization is correct:

  1. Parent completes prompt processing (SLOT_STATE_DONE_PROMPT)
  2. Finds all children waiting (SLOT_STATE_WAIT_OTHER with matching id_parent)
  3. Copies KV cache and state to each child
  4. Children transition to SLOT_STATE_DONE_PROMPT to begin generation
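The steps above can be sketched as a small loop; the struct and state names here mirror the description only and are not the server's actual types:

```cpp
#include <vector>

// Hypothetical state names mirroring the description above.
enum slot_state { SLOT_STATE_WAIT_OTHER, SLOT_STATE_DONE_PROMPT, SLOT_STATE_GENERATING };

struct slot {
    int        id;
    int        id_parent; // -1 means no parent
    slot_state state;
};

// When the parent finishes its prompt, promote all waiting children
// so they can begin generation. (The real code also copies the KV
// cache and sampling state to each child at this point.)
void sync_children(std::vector<slot> & slots, const slot & parent) {
    for (auto & s : slots) {
        if (s.state == SLOT_STATE_WAIT_OTHER && s.id_parent == parent.id) {
            s.state = SLOT_STATE_DONE_PROMPT;
        }
    }
}
```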

3042-3049: LGTM!

The refactored child task creation using task.add_child() is cleaner than the previous approach. Child tasks are now embedded within the parent task, which allows launch_slots_with_parent_task to orchestrate them together.
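A minimal sketch of the embedded-child pattern, assuming a simplified `server_task` (field and method names follow the review's description, not the real struct):

```cpp
#include <vector>

// Simplified stand-in for the server's task type.
struct server_task {
    int id;
    int id_parent = -1;               // -1 means this task has no parent
    std::vector<server_task> child_tasks;

    // Embed a child task in the parent, linking it back via id_parent.
    void add_child(server_task && child) {
        child.id_parent = id;
        child_tasks.push_back(std::move(child));
    }

    bool is_parent() const { return !child_tasks.empty(); }
    bool is_child()  const { return id_parent != -1; }
};
```

Keeping children inside the parent task means a single `std::move` of the parent hands the whole group to the launcher, which is what lets `launch_slots_with_parent_task` orchestrate them atomically.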

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +1662 to +1704
bool launch_slots_with_parent_task(server_slot & parent_slot, std::vector<server_slot *> & child_slots, server_task && parent_task) {
GGML_ASSERT(!parent_slot.is_processing());
GGML_ASSERT(parent_task.is_parent());
GGML_ASSERT(child_slots.size() == parent_task.child_tasks.size());

int id_parent = parent_task.id;

SRV_INF("launching slots for parent task id_task = %d with %zu child tasks\n", id_parent, parent_task.child_tasks.size());

// to be called in case of failure to release all launched slots
auto release_slots = [this, id_parent]() {
for (auto & slot : slots) {
if (slot.is_processing() && (
slot.task->id == id_parent ||
slot.task->id_parent == id_parent
)) {
slot.release();
}
}
};

// launch all child tasks first
size_t idx = 0;
GGML_ASSERT(child_slots.size() == parent_task.child_tasks.size());
for (auto * slot : child_slots) {
int id_child = parent_task.child_tasks[idx].id;
if (!launch_slot_with_task(*slot, std::move(parent_task.child_tasks[idx]))) {
SRV_ERR("failed to launch slot with child task, id_task = %d\n", id_child);
release_slots();
return 2;
}
idx++;
}

// finally, launch the parent task
if (!launch_slot_with_task(parent_slot, std::move(parent_task))) {
SRV_ERR("failed to launch slot with task, id_task = %d\n", id_parent);
release_slots();
return 2;
}

return 0;
}

⚠️ Potential issue | 🔴 Critical

Critical: Return values are inverted due to bool/int mismatch.

The function returns bool but uses integer return values (return 0 for success, return 2 for failure). Due to implicit conversion:

  • return 0 → false (but should indicate success)
  • return 2 → true (but should indicate failure)

At the call site (line 1755), !launch_slots_with_parent_task(...) will:

  • On success (returns 0/false): !false = true → enters error branch incorrectly
  • On failure (returns 2/true): !true = false → skips error branch incorrectly
🐛 Proposed fix
     // launch multiple slots for parent + child tasks
     bool launch_slots_with_parent_task(server_slot & parent_slot, std::vector<server_slot *> & child_slots, server_task && parent_task) {
         GGML_ASSERT(!parent_slot.is_processing());
         GGML_ASSERT(parent_task.is_parent());
         GGML_ASSERT(child_slots.size() == parent_task.child_tasks.size());

         int id_parent = parent_task.id;

         SRV_INF("launching slots for parent task id_task = %d with %zu child tasks\n", id_parent, parent_task.child_tasks.size());

         // to be called in case of failure to release all launched slots
         auto release_slots = [this, id_parent]() {
             for (auto & slot : slots) {
                 if (slot.is_processing() && (
                         slot.task->id == id_parent ||
                         slot.task->id_parent == id_parent
                 )) {
                     slot.release();
                 }
             }
         };

         // launch all child tasks first
         size_t idx = 0;
         GGML_ASSERT(child_slots.size() == parent_task.child_tasks.size());
         for (auto * slot : child_slots) {
             int id_child = parent_task.child_tasks[idx].id;
             if (!launch_slot_with_task(*slot, std::move(parent_task.child_tasks[idx]))) {
                 SRV_ERR("failed to launch slot with child task, id_task = %d\n", id_child);
                 release_slots();
-                return 2;
+                return false;
             }
             idx++;
         }

         // finally, launch the parent task
         if (!launch_slot_with_task(parent_slot, std::move(parent_task))) {
             SRV_ERR("failed to launch slot with task, id_task = %d\n", id_parent);
             release_slots();
-            return 2;
+            return false;
         }

-        return 0;
+        return true;
     }
🤖 Prompt for AI Agents
In `@tools/server/server-context.cpp` around lines 1662 - 1704, The function
launch_slots_with_parent_task currently returns bool but uses integer return
codes (0 for success, 2 for failure), causing inverted logic; change its
signature from bool to int and keep the existing return values (0 = success,
non-zero = failure), and update all call sites that invoke
launch_slots_with_parent_task (and any checks like
!launch_slots_with_parent_task(...)) to interpret the int correctly (e.g., check
for != 0 on failure or == 0 on success); no change needed to
launch_slot_with_task or release_slots logic besides adjusting callers to use
the int result.
