feat: Add n>1 support to the frontend and the vLLM backend by kornelcsernai-harmonic · Pull Request #6350 · ai-dynamo/dynamo

kornelcsernai-harmonic · 2026-02-18T00:17:53Z

Overview:

Adds support for generating multiple choices (n>1) in chat completions for vLLM.

Details:

Passes n through to vLLM and tracks each choice's decoder independently, including per-choice stop handling. Request migration is not yet supported when n > 1.
Reports combined usage statistics across all choices.

Where should the reviewer start?

handlers.py: Both token mode and text mode now iterate over all res.outputs instead of just outputs[0]. Per-choice state is tracked in previous_text_per_choice, output_tokens_per_choice, and finished_choices (the set of choices that have already finished). Usage is attached only when all n choices finish. The n parameter is mapped through to vLLM's SamplingParams.
backend.rs: Creates a separate Decoder per choice index (keyed 0..n in a HashMap). Tracks finished_choices and only calls stop_generating() when all choices are done. Filters out already-finished choices via filter_map.
delta.rs: Uses delta.index.unwrap_or(0) instead of a hardcoded index = 0, so each choice's index propagates correctly into the streaming response.
migration.rs: Request migration is out of scope for this PR, so it is disabled when n > 1 (the migration limit is set to 0).
test_vllm.py: An e2e test that sends a streaming request with n=5.
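The text-mode bookkeeping described for handlers.py can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeOutput` and `stream_text_deltas` are hypothetical names, and the sketch assumes each output carries the cumulative text for its choice, so the delta is whatever follows the previously seen text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Set


@dataclass
class FakeOutput:
    """Hypothetical stand-in for one entry of res.outputs (cumulative text)."""

    index: int
    text: str
    finish_reason: Optional[str] = None


def stream_text_deltas(results: Iterable[List[FakeOutput]]) -> Iterator[dict]:
    """Yield one chunk per live choice per step, carrying only the new text."""
    previous_text_per_choice: Dict[int, str] = {}
    finished_choices: Set[int] = set()
    for outputs in results:
        for output in outputs:  # every choice, not just outputs[0]
            idx = output.index
            if idx in finished_choices:
                continue  # skip duplicate updates for a finished choice
            previous = previous_text_per_choice.get(idx, "")
            chunk = {"index": idx, "text": output.text[len(previous):]}
            previous_text_per_choice[idx] = output.text
            if output.finish_reason:
                finished_choices.add(idx)
                chunk["finish_reason"] = output.finish_reason
            yield chunk


# Two steps, two choices; choice 1 finishes in the first step.
steps = [
    [FakeOutput(0, "Hel"), FakeOutput(1, "Hi", "stop")],
    [FakeOutput(0, "Hello", "stop"), FakeOutput(1, "Hi", "stop")],
]
chunks = list(stream_text_deltas(steps))
```

Keying the buffers by choice index keeps one choice's text from leaking into another's delta, and the finished set drops the repeated final output of a choice that ended early.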

Things to review: correctness, performance.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: [FEATURE]: Add support for multiple outputs to vLLM backend #6116

Summary by CodeRabbit

Release Notes

New Features
- Added support for generating multiple completions simultaneously in a single request.
- Enhanced streaming to properly track and identify each completion independently.
- Improved token accounting to accurately count tokens across all generated outputs.
- Disabled automatic retries for multi-completion requests.
Tests
- Added comprehensive test coverage for multi-completion streaming functionality.

copy-pr-bot · 2026-02-18T00:17:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-02-18T00:18:01Z

👋 Hi kornelcsernai-harmonic! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

github-actions · 2026-03-21T09:42:05Z

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>

coderabbitai · 2026-04-21T21:36:24Z

Walkthrough

The pull request adds end-to-end support for the OpenAI n parameter to enable streaming multiple independent completions. Changes span Python handlers (per-choice token and text streaming), Rust backend (per-choice decoders and state tracking), migration logic (disabling retries for multi-choice requests), protocol handling (choice index passthrough), and an integration test validating five concurrent choices.

Changes

Cohort / File(s)	Summary
Multi-choice streaming support (Python) `components/src/dynamo/vllm/handlers.py`	Reworked token and text mode streaming to process all outputs independently per choice instead of only the first output. Added per-choice token tracking (`output_tokens_per_choice`), per-choice text buffering (`previous_text_per_choice`), and `finished_choices` set to skip duplicate updates. Updated `_build_completion_usage` to sum token counts across all outputs. Streamed chunks now include `"index"` field and correct `n` value.
Multi-choice backend processing (Rust) `lib/llm/src/backend.rs`	Replaced single `Decoder` with per-choice `HashMap<u32, Decoder>` and added stream-level `finished_choices` tracking. Modified decoder initialization to create `n` independent decoders and updated the `stream::unfold` loop to derive `choice_index` from `data.index`, skip already-finished choices, detect per-choice completion signals, and halt generation when all `n` choices finish.
Migration handling `lib/llm/src/migration.rs`	Added logic to read `sampling_options.n` and set `effective_migration_limit` to `0` when `n > 1`, preventing migration/retries for multi-choice requests. Otherwise preserves existing migration limit.
Protocol choice index `lib/llm/src/protocols/openai/chat_completions/delta.rs`	Changed streaming choice index from hardcoded `0` to `delta.index.unwrap_or(0)`, allowing backend-provided choice indices to flow through while retaining default behavior when absent.
Integration test `tests/frontend/test_vllm.py`	Added `test_multiple_choices_n5` to verify streaming with `n=5`. Validates that all five choice indices are observed, all complete with finish reasons, and each produces non-empty content.
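Two of the rules summarized in the table are small enough to sketch directly. The Python below is only an illustration under stated assumptions: `FakeChoice`, `FakeResult`, `build_combined_usage`, and `effective_migration_limit` are hypothetical names, the usage keys follow the OpenAI usage object, counting the prompt once (rather than once per choice) is an assumption, and the migration rule is a Python paraphrase of the behavior described for the Rust code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeChoice:
    """Hypothetical stand-in for one generated choice."""

    token_ids: List[int] = field(default_factory=list)


@dataclass
class FakeResult:
    """Hypothetical stand-in for a request output holding n choices."""

    prompt_token_ids: List[int]
    outputs: List[FakeChoice]


def build_combined_usage(result: FakeResult) -> Dict[str, int]:
    """Sum completion tokens over all choices; the prompt is counted once (assumption)."""
    prompt_tokens = len(result.prompt_token_ids)
    completion_tokens = sum(len(choice.token_ids) for choice in result.outputs)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


def effective_migration_limit(configured_limit: int, n: Optional[int]) -> int:
    """Disable migration for multi-choice requests, otherwise keep the configured limit."""
    return 0 if (n or 1) > 1 else configured_limit


usage = build_combined_usage(
    FakeResult(prompt_token_ids=[1, 2, 3], outputs=[FakeChoice([1, 2]), FakeChoice([3, 4, 5])])
)
```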

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main feature addition: support for n>1 (multiple choices) in the frontend and vLLM backend, which is the primary objective of this changeset.
Description check	✅ Passed	The PR description provides a comprehensive overview, detailed implementation changes across all modified files, clear guidance on where reviewers should focus, and references the related GitHub issue.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/src/dynamo/vllm/handlers.py (1)

1577-1644: ⚠️ Potential issue | 🟠 Major

Gate token-mode usage until the final choice finishes.

completion_usage is attached on every per-choice finish_reason; with n > 1, an early-finishing choice can report stale combined usage before other choices finish, and later choices can duplicate it. Mirror text mode and only attach usage when finished_choices ∪ {choice_index} reaches n.

🐛 Proposed fix

             output_tokens_per_choice: Dict[int, int] = {}
             finished_choices: set[int] = set()
+            n = getattr(sampling_params, "n", 1) or 1
             async for res in gen:
@@
                     if output.finish_reason:
                         out["finish_reason"] = normalize_finish_reason(
                             output.finish_reason
                         )
-                        out[
-                            "completion_usage"
-                        ] = BaseWorkerHandler._build_completion_usage(
-                            request_output=res,
-                            embedding_sequence_length=embedding_sequence_length,
-                        )
+                        if len(finished_choices | {choice_index}) >= n:
+                            out[
+                                "completion_usage"
+                            ] = BaseWorkerHandler._build_completion_usage(
+                                request_output=res,
+                                embedding_sequence_length=embedding_sequence_length,
+                            )
                         # Log completion with LoRA info (debug level to avoid log spam)
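The gating condition in the proposed fix can be exercised in isolation. The following is a minimal sketch, assuming a stream of `(choice_index, finish_reason)` events; `gate_usage` is a hypothetical helper, and the string `"attached"` merely marks where the real handler would call `_build_completion_usage`.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple


def gate_usage(events: Iterable[Tuple[int, Optional[str]]], n: int) -> Iterator[dict]:
    """Attach usage only on the chunk that completes the final outstanding choice."""
    finished_choices: Set[int] = set()
    for choice_index, finish_reason in events:
        out: dict = {"index": choice_index}
        if finish_reason:
            out["finish_reason"] = finish_reason
            # Same check as the proposed fix: does this finish complete all n choices?
            if len(finished_choices | {choice_index}) >= n:
                out["completion_usage"] = "attached"  # placeholder for the real usage dict
            finished_choices.add(choice_index)
        yield out


# Choice 1 stops early; choice 0 finishes last and is the only chunk with usage.
outs = list(gate_usage([(0, None), (1, "stop"), (0, None), (0, "length")], n=2))
with_usage = [o for o in outs if "completion_usage" in o]
```

With n=2 here, the early `stop` on choice 1 carries no usage, which avoids both the stale and the duplicated reports described above.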

🧹 Nitpick comments (1)

tests/frontend/test_vllm.py (1)

509-517: Avoid += string concatenation in the streaming loop.

Collect fragments per choice and join only when needed.

🧹 Proposed fix

-    content_per_choice: dict = {}
+    content_per_choice: dict[int, list[str]] = {}
     for chunk in chunks:
         for choice in chunk.get("choices", []):
             idx = choice["index"]
             all_indices.add(idx)
             delta_content = choice.get("delta", {}).get("content", "")
             if delta_content:
-                content_per_choice.setdefault(idx, "")
-                content_per_choice[idx] += delta_content
+                content_per_choice.setdefault(idx, []).append(delta_content)
@@
     # Verify each choice produced some content
     for i in range(5):
-        assert content_per_choice.get(i), f"Choice {i} has no content"
+        assert "".join(content_per_choice.get(i, [])), f"Choice {i} has no content"

As per coding guidelines, avoid += string concatenation inside loops.

Also applies to: 539-541
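As a standalone illustration of the list-and-join pattern suggested here, the sketch below groups streamed fragments by choice index and joins them once at the end. `collect_content` is a hypothetical helper, and the chunk shape assumes the OpenAI streaming format with `choices[].delta.content`.

```python
from collections import defaultdict
from typing import Dict, List


def collect_content(chunks: List[dict]) -> Dict[int, str]:
    """Group delta fragments by choice index, joining each group once at the end."""
    fragments: Dict[int, List[str]] = defaultdict(list)
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content", "")
            if content:
                fragments[choice["index"]].append(content)
    return {idx: "".join(parts) for idx, parts in fragments.items()}


sample_chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hel"}}]},
    {"choices": [{"index": 1, "delta": {"content": "Hi"}}]},
    {"choices": [{"index": 0, "delta": {"content": "lo"}}, {"index": 1, "delta": {}}]},
]
content_per_choice = collect_content(sample_chunks)
```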

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_vllm.py` around lines 509 - 517, The loop in the
streaming test uses in-loop string concatenation on content_per_choice
(content_per_choice[idx] += delta_content) which is inefficient; instead collect
fragments per choice into a list and join when needed: change content_per_choice
to map indices to lists (append delta_content to content_per_choice[idx]) within
the chunks/choice loop, and later (where code currently assumes a single string,
e.g., the same pattern around lines ~539-541) join the list with
''.join(content_per_choice[idx]) before assertions or further processing; update
any usages of content_per_choice to expect the joined string at consumption
time.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/frontend/test_vllm.py`:
- Around line 483-502: Wrap the streaming POST call in a context manager to
ensure the socket is always closed: change the bare requests.post(...) that
assigns to response and the subsequent response.iter_lines(...) loop to use
"with requests.post(...) as response:" so the returned Response is closed even
if assertions or json decoding fail; locate this change around the Response
usage in tests/frontend/test_vllm.py where response is created and iterated with
response.iter_lines(decode_unicode=True).
- Line 466: The local import "import json" currently placed inside a function in
test_vllm.py must be moved to module scope; remove the function-local "import
json" and add "import json" with the other top-level imports at the top of the
file so all imports remain at module scope and adhere to the repo import rule.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fd250d6b-394d-432b-9ca4-19b59148345d

📥 Commits

Reviewing files that changed from the base of the PR and between 55a949c and f3ad4b6.

📒 Files selected for processing (5)

components/src/dynamo/vllm/handlers.py
lib/llm/src/backend.rs
lib/llm/src/migration.rs
lib/llm/src/protocols/openai/chat_completions/delta.rs
tests/frontend/test_vllm.py

coderabbitai · 2026-04-21T21:36:27Z

+    request, start_services: ServicePorts, predownload_models
+) -> None:
+    """Test that n=5 streaming returns 5 distinct choices with correct indices."""
+    import json


⚠️ Potential issue | 🟡 Minor

Move the json import to module scope.

The new function-local import violates the repo’s import rule.

🧹 Proposed fix

 from __future__ import annotations
+import json
 import logging
 import os
 import shutil
@@
 def test_multiple_choices_n5(
     request, start_services: ServicePorts, predownload_models
 ) -> None:
     """Test that n=5 streaming returns 5 distinct choices with correct indices."""
-    import json
     payload = {

As per coding guidelines, keep all imports at the top of the file (no imports inside functions/classes).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/frontend/test_vllm.py` at line 466, The local import "import json" currently placed inside a function in test_vllm.py must be moved to module scope; remove the function-local "import json" and add "import json" with the other top-level imports at the top of the file so all imports remain at module scope and adhere to the repo import rule.

coderabbitai · 2026-04-21T21:36:27Z

+    response = requests.post(
+        f"{base_url}/v1/chat/completions",
+        headers={"Content-Type": "application/json"},
+        json=payload,
+        stream=True,
+        timeout=180,
+    )
+    assert (
+        response.status_code == 200
+    ), f"Streaming request failed with status {response.status_code}: {response.text}"
+
+    # Parse SSE events
+    chunks = []
+    for line in response.iter_lines(decode_unicode=True):
+        if not line or not line.startswith("data: "):
+            continue
+        data_str = line[len("data: ") :]
+        if data_str == "[DONE]":
+            break
+        chunks.append(json.loads(data_str))


⚠️ Potential issue | 🟡 Minor

Close the streamed response with a context manager.

stream=True keeps the connection open while iterating; wrap the response in with so the socket is released even on assertion/JSON failures.

🧹 Proposed fix

-    response = requests.post(
-        f"{base_url}/v1/chat/completions",
-        headers={"Content-Type": "application/json"},
-        json=payload,
-        stream=True,
-        timeout=180,
-    )
-    assert (
-        response.status_code == 200
-    ), f"Streaming request failed with status {response.status_code}: {response.text}"
-
-    # Parse SSE events
     chunks = []
-    for line in response.iter_lines(decode_unicode=True):
-        if not line or not line.startswith("data: "):
-            continue
-        data_str = line[len("data: ") :]
-        if data_str == "[DONE]":
-            break
-        chunks.append(json.loads(data_str))
+    with requests.post(
+        f"{base_url}/v1/chat/completions",
+        headers={"Content-Type": "application/json"},
+        json=payload,
+        stream=True,
+        timeout=180,
+    ) as response:
+        assert (
+            response.status_code == 200
+        ), f"Streaming request failed with status {response.status_code}: {response.text}"
+
+        # Parse SSE events
+        for line in response.iter_lines(decode_unicode=True):
+            if not line or not line.startswith("data: "):
+                continue
+            data_str = line[len("data: ") :]
+            if data_str == "[DONE]":
+                break
+            chunks.append(json.loads(data_str))

As per coding guidelines, for tests, ensure no leaked file handles (always use with).
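One way to keep the `with` block short is to move the SSE parsing into a pure helper that takes any iterable of lines. This is a sketch only: `parse_sse_lines` is a hypothetical name, and the parsing rules are the ones already used in the test (skip lines without a `data: ` prefix, stop at `[DONE]`).

```python
import json
from typing import Iterable, List


def parse_sse_lines(lines: Iterable[str]) -> List[dict]:
    """Decode the JSON payload of each `data: ` line until the [DONE] sentinel."""
    chunks: List[dict] = []
    for line in lines:
        if not line or not line.startswith("data: "):
            continue  # blank separators and SSE comments carry no payload
        data_str = line[len("data: "):]
        if data_str == "[DONE]":
            break
        chunks.append(json.loads(data_str))
    return chunks


# Offline check with canned lines; nothing after [DONE] is read.
parsed = parse_sse_lines(
    ["", ": keep-alive", 'data: {"a": 1}', "data: [DONE]", 'data: {"a": 2}']
)
```

Inside the test, the helper would be called on `response.iter_lines(decode_unicode=True)` within the `with requests.post(...) as response:` block, so the connection is released whether or not parsing raises.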


🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/frontend/test_vllm.py` around lines 483 - 502, Wrap the streaming POST call in a context manager to ensure the socket is always closed: change the bare requests.post(...) that assigns to response and the subsequent response.iter_lines(...) loop to use "with requests.post(...) as response:" so the returned Response is closed even if assertions or json decoding fail; locate this change around the Response usage in tests/frontend/test_vllm.py where response is created and iterated with response.iter_lines(decode_unicode=True).

indrajit96 · 2026-04-27T21:59:01Z

Hi @kornelcsernai-harmonic
Thanks for your PR.
We are working on a broader-scoped PR to add this n option across backends and the Rust frontend here:
#8744
It's currently running CI and should be merged this week, which should unblock you.
Once it is merged, we will close your PR.

pull-request-size Bot added the size/L label Feb 18, 2026

github-actions Bot added external-contribution Pull request is from an external contributor backend::vllm Relates to the vllm backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Feb 18, 2026

kornelcsernai-harmonic changed the title ~~Add n>1 support to the frontend and the vLLM backend~~ feat: Add n>1 support to the frontend and the vLLM backend Feb 18, 2026

github-actions Bot added the feat label Feb 18, 2026

kornelcsernai-harmonic changed the title ~~feat: Add n>1 support to the frontend and the vLLM backend~~ feat: [WIP] Add n>1 support to the frontend and the vLLM backend Feb 18, 2026

github-actions Bot added the Stale label Mar 21, 2026

kornelcsernai-harmonic force-pushed the multiple-choices branch 2 times, most recently from bc3ad9b to f5c9d22 Compare April 6, 2026 23:46

kornelcsernai-harmonic marked this pull request as ready for review April 7, 2026 00:19

kornelcsernai-harmonic requested a review from a team as a code owner April 7, 2026 00:19

kornelcsernai-harmonic requested a review from a team April 7, 2026 00:19

kornelcsernai-harmonic requested a review from a team as a code owner April 7, 2026 00:19

kornelcsernai-harmonic changed the title ~~feat: [WIP] Add n>1 support to the frontend and the vLLM backend~~ feat: Add n>1 support to the frontend and the vLLM backend Apr 21, 2026

kornelcsernai-harmonic added 4 commits April 21, 2026 21:28

add n>1 support

8ee9fed

Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>

test

b8f6bd6

Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>

lint

80f087c

Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>

lint

f3ad4b6

Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>

kornelcsernai-harmonic force-pushed the multiple-choices branch from fc3953c to f3ad4b6 Compare April 21, 2026 21:29

kornelcsernai-harmonic requested a review from a team as a code owner April 21, 2026 21:29

coderabbitai Bot reviewed Apr 21, 2026

github-actions Bot removed the Stale label Apr 23, 2026

feat: Add n>1 support to the frontend and the vLLM backend #6350
kornelcsernai-harmonic wants to merge 4 commits into ai-dynamo:main from kornelcsernai-harmonic:multiple-choices

2 participants
