
feat: Add n>1 support to the frontend and the vLLM backend#6350

Open
kornelcsernai-harmonic wants to merge 4 commits into
ai-dynamo:mainfrom
kornelcsernai-harmonic:multiple-choices

@kornelcsernai-harmonic
Contributor

@kornelcsernai-harmonic kornelcsernai-harmonic commented Feb 18, 2026

Overview:

Adds support for generating multiple choices (n>1) in chat completions for vLLM.

Details:

Passes n through to vLLM and tracks each choice's decoder independently, including per-choice stop handling. Request migration is not yet supported for n > 1.
Reports combined usage statistics across all choices.

Where should the reviewer start?

  • handlers.py: Token mode and text mode now iterate over all res.outputs instead of only outputs[0]. Per-choice state is tracked in previous_text_per_choice, output_tokens_per_choice, and finished_choices (the set of choices that have already finished). Usage is attached only once all n
    choices have finished. The n parameter is mapped through to vLLM's SamplingParams.
  • backend.rs: Creates a separate Decoder per choice index (keyed 0..n in a HashMap). Tracks finished_choices and calls stop_generating() only when all choices are done. Already-finished choices are filtered out via filter_map.
  • delta.rs: Uses delta.index.unwrap_or(0) instead of a hardcoded index = 0, so each choice's index propagates correctly into the streaming response.
  • migration.rs: Request migration is out of scope for this PR, so it is disabled when n > 1 (the migration limit is set to 0).
  • test_vllm.py: An e2e test that sends a request with n=5.
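
The per-choice bookkeeping described for handlers.py can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `FakeOutput`, `FakeResult`, and `stream_choices` are hypothetical stand-ins for the engine's output objects and the handler loop, outputs are assumed to carry cumulative text, and usage is reduced to a single summed token count.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Set


@dataclass
class FakeOutput:
    """Hypothetical stand-in for one entry of res.outputs (text is cumulative)."""
    index: int
    text: str
    token_count: int
    finish_reason: Optional[str] = None


@dataclass
class FakeResult:
    """Hypothetical stand-in for one streamed engine result."""
    outputs: List[FakeOutput]


def stream_choices(results: Iterable[FakeResult], n: int) -> Iterator[dict]:
    previous_text_per_choice: Dict[int, str] = {}
    output_tokens_per_choice: Dict[int, int] = {}
    finished_choices: Set[int] = set()

    for res in results:
        for output in res.outputs:
            idx = output.index
            if idx in finished_choices:
                continue  # ignore repeated updates for a finished choice

            # Emit only the text that is new since the last update for this choice.
            prev = previous_text_per_choice.get(idx, "")
            delta = output.text[len(prev):]
            previous_text_per_choice[idx] = output.text
            output_tokens_per_choice[idx] = output.token_count

            chunk: dict = {"index": idx, "text": delta}
            if output.finish_reason:
                chunk["finish_reason"] = output.finish_reason
                finished_choices.add(idx)
                # Attach combined usage only once every choice has finished.
                if len(finished_choices) >= n:
                    chunk["completion_tokens"] = sum(
                        output_tokens_per_choice.values()
                    )
            yield chunk
```

Keying all state by choice index keeps the loop independent of the order in which the engine reports outputs.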

Things to review: correctness, performance.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for generating multiple completions simultaneously in a single request.
    • Enhanced streaming to properly track and identify each completion independently.
    • Improved token accounting to accurately count tokens across all generated outputs.
    • Disabled automatic retries for multi-completion requests.
  • Tests

    • Added comprehensive test coverage for multi-completion streaming functionality.

@copy-pr-bot

copy-pr-bot Bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi kornelcsernai-harmonic! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added external-contribution Pull request is from an external contributor backend::vllm Relates to the vllm backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Feb 18, 2026
@kornelcsernai-harmonic kornelcsernai-harmonic changed the title Add n>1 support to the frontend and the vLLM backend feat: Add n>1 support to the frontend and the vLLM backend Feb 18, 2026
@github-actions github-actions Bot added the feat label Feb 18, 2026
@kornelcsernai-harmonic kornelcsernai-harmonic changed the title feat: Add n>1 support to the frontend and the vLLM backend feat: [WIP] Add n>1 support to the frontend and the vLLM backend Feb 18, 2026
@github-actions
Contributor

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the Stale label Mar 21, 2026
@kornelcsernai-harmonic kornelcsernai-harmonic force-pushed the multiple-choices branch 2 times, most recently from bc3ad9b to f5c9d22 Compare April 6, 2026 23:46
@kornelcsernai-harmonic kornelcsernai-harmonic marked this pull request as ready for review April 7, 2026 00:19
@kornelcsernai-harmonic kornelcsernai-harmonic requested a review from a team as a code owner April 7, 2026 00:19
@kornelcsernai-harmonic kornelcsernai-harmonic requested a review from a team April 7, 2026 00:19
@kornelcsernai-harmonic kornelcsernai-harmonic requested a review from a team as a code owner April 7, 2026 00:19
@kornelcsernai-harmonic kornelcsernai-harmonic changed the title feat: [WIP] Add n>1 support to the frontend and the vLLM backend feat: Add n>1 support to the frontend and the vLLM backend Apr 21, 2026
Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>
Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>
Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>
Signed-off-by: Kornel Csernai <239206175+kornelcsernai-harmonic@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Walkthrough

The pull request adds end-to-end support for the OpenAI n parameter to enable streaming multiple independent completions. Changes span Python handlers (per-choice token and text streaming), Rust backend (per-choice decoders and state tracking), migration logic (disabling retries for multi-choice requests), protocol handling (choice index passthrough), and an integration test validating five concurrent choices.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Multi-choice streaming support (Python): `components/src/dynamo/vllm/handlers.py` | Reworked token and text mode streaming to process all outputs independently per choice instead of only the first output. Added per-choice token tracking (output_tokens_per_choice), per-choice text buffering (previous_text_per_choice), and finished_choices set to skip duplicate updates. Updated _build_completion_usage to sum token counts across all outputs. Streamed chunks now include "index" field and correct n value. |
| Multi-choice backend processing (Rust): `lib/llm/src/backend.rs` | Replaced single Decoder with per-choice HashMap<u32, Decoder> and added stream-level finished_choices tracking. Modified decoder initialization to create n independent decoders and updated the stream::unfold loop to derive choice_index from data.index, skip already-finished choices, detect per-choice completion signals, and halt generation when all n choices finish. |
| Migration handling: `lib/llm/src/migration.rs` | Added logic to read sampling_options.n and set effective_migration_limit to 0 when n > 1, preventing migration/retries for multi-choice requests. Otherwise preserves existing migration limit. |
| Protocol choice index: `lib/llm/src/protocols/openai/chat_completions/delta.rs` | Changed streaming choice index from hardcoded 0 to delta.index.unwrap_or(0), allowing backend-provided choice indices to flow through while retaining default behavior when absent. |
| Integration test: `tests/frontend/test_vllm.py` | Added test_multiple_choices_n5 to verify streaming with n=5. Validates that all five choice indices are observed, all complete with finish reasons, and each produces non-empty content. |
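
The migration and choice-index rows above reduce to two small rules. The Python sketch below mirrors them for illustration only: the actual changes are in Rust, and both function names here are hypothetical.

```python
from typing import Optional


def effective_migration_limit(migration_limit: int, n: Optional[int]) -> int:
    """Return 0 (migration disabled) for multi-choice requests, else the given limit."""
    if n is not None and n > 1:
        return 0
    return migration_limit


def choice_index(delta_index: Optional[int]) -> int:
    """Python analogue of delta.index.unwrap_or(0): default to choice 0 when absent."""
    return 0 if delta_index is None else delta_index
```

Defaulting a missing index to 0 keeps single-choice requests behaving as they did before the change.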

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main feature addition: support for n>1 (multiple choices) in the frontend and vLLM backend, which is the primary objective of this changeset. |
| Description check | ✅ Passed | The PR description provides a comprehensive overview, detailed implementation changes across all modified files, clear guidance on where reviewers should focus, and references the related GitHub issue. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/handlers.py (1)

1577-1644: ⚠️ Potential issue | 🟠 Major

Gate token-mode usage until the final choice finishes.

completion_usage is attached on every per-choice finish_reason; with n > 1, an early-finishing choice can report stale combined usage before other choices finish, and later choices can duplicate it. Mirror text mode and only attach usage when finished_choices ∪ {choice_index} reaches n.

🐛 Proposed fix
             output_tokens_per_choice: Dict[int, int] = {}
             finished_choices: set[int] = set()
+            n = getattr(sampling_params, "n", 1) or 1
             async for res in gen:
@@
                     if output.finish_reason:
                         out["finish_reason"] = normalize_finish_reason(
                             output.finish_reason
                         )
-                        out[
-                            "completion_usage"
-                        ] = BaseWorkerHandler._build_completion_usage(
-                            request_output=res,
-                            embedding_sequence_length=embedding_sequence_length,
-                        )
+                        if len(finished_choices | {choice_index}) >= n:
+                            out[
+                                "completion_usage"
+                            ] = BaseWorkerHandler._build_completion_usage(
+                                request_output=res,
+                                embedding_sequence_length=embedding_sequence_length,
+                            )
                         # Log completion with LoRA info (debug level to avoid log spam)
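
The gating condition in the proposed fix can be checked in isolation. The sketch below is illustrative only; `should_attach_usage` and `usage_flags` are hypothetical helper names, not functions from the PR.

```python
from typing import List, Set, Tuple


def should_attach_usage(finished_choices: Set[int], choice_index: int, n: int) -> bool:
    """True only when choice_index finishing brings the finished count up to n."""
    return len(finished_choices | {choice_index}) >= n


def usage_flags(finish_order: List[int], n: int) -> List[Tuple[int, bool]]:
    """Replay a finish order and record whether each finish would carry usage."""
    finished: Set[int] = set()
    flags: List[Tuple[int, bool]] = []
    for idx in finish_order:
        flags.append((idx, should_attach_usage(finished, idx, n)))
        finished.add(idx)
    return flags
```

With n=3 and choices finishing in the order 2, 0, 1, only the last finish (choice 1) would carry usage, which avoids both the stale early report and the duplicates described above.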
🧹 Nitpick comments (1)
tests/frontend/test_vllm.py (1)

509-517: Avoid += string concatenation in the streaming loop.

Collect fragments per choice and join only when needed.

🧹 Proposed fix
-    content_per_choice: dict = {}
+    content_per_choice: dict[int, list[str]] = {}
     for chunk in chunks:
         for choice in chunk.get("choices", []):
             idx = choice["index"]
             all_indices.add(idx)
             delta_content = choice.get("delta", {}).get("content", "")
             if delta_content:
-                content_per_choice.setdefault(idx, "")
-                content_per_choice[idx] += delta_content
+                content_per_choice.setdefault(idx, []).append(delta_content)
@@
     # Verify each choice produced some content
     for i in range(5):
-        assert content_per_choice.get(i), f"Choice {i} has no content"
+        assert "".join(content_per_choice.get(i, [])), f"Choice {i} has no content"

As per coding guidelines, avoid += string concatenation inside loops.

Also applies to: 539-541

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_vllm.py` around lines 509 - 517, The loop in the
streaming test uses in-loop string concatenation on content_per_choice
(content_per_choice[idx] += delta_content) which is inefficient; instead collect
fragments per choice into a list and join when needed: change content_per_choice
to map indices to lists (append delta_content to content_per_choice[idx]) within
the chunks/choice loop, and later (where code currently assumes a single string,
e.g., the same pattern around lines ~539-541) join the list with
''.join(content_per_choice[idx]) before assertions or further processing; update
any usages of content_per_choice to expect the joined string at consumption
time.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/frontend/test_vllm.py`:
- Around line 483-502: Wrap the streaming POST call in a context manager to
ensure the socket is always closed: change the bare requests.post(...) that
assigns to response and the subsequent response.iter_lines(...) loop to use
"with requests.post(...) as response:" so the returned Response is closed even
if assertions or json decoding fail; locate this change around the Response
usage in tests/frontend/test_vllm.py where response is created and iterated with
response.iter_lines(decode_unicode=True).
- Line 466: The local import "import json" currently placed inside a function in
test_vllm.py must be moved to module scope; remove the function-local "import
json" and add "import json" with the other top-level imports at the top of the
file so all imports remain at module scope and adhere to the repo import rule.

---

Nitpick comments:
In `@tests/frontend/test_vllm.py`:
- Around line 509-517: The loop in the streaming test uses in-loop string
concatenation on content_per_choice (content_per_choice[idx] += delta_content)
which is inefficient; instead collect fragments per choice into a list and join
when needed: change content_per_choice to map indices to lists (append
delta_content to content_per_choice[idx]) within the chunks/choice loop, and
later (where code currently assumes a single string, e.g., the same pattern
around lines ~539-541) join the list with ''.join(content_per_choice[idx])
before assertions or further processing; update any usages of content_per_choice
to expect the joined string at consumption time.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fd250d6b-394d-432b-9ca4-19b59148345d

📥 Commits

Reviewing files that changed from the base of the PR and between 55a949c and f3ad4b6.

📒 Files selected for processing (5)
  • components/src/dynamo/vllm/handlers.py
  • lib/llm/src/backend.rs
  • lib/llm/src/migration.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • tests/frontend/test_vllm.py

    request, start_services: ServicePorts, predownload_models
) -> None:
    """Test that n=5 streaming returns 5 distinct choices with correct indices."""
    import json
Contributor


⚠️ Potential issue | 🟡 Minor

Move the json import to module scope.

The new function-local import violates the repo’s import rule.

🧹 Proposed fix
 from __future__ import annotations
 
+import json
 import logging
 import os
 import shutil
@@
 def test_multiple_choices_n5(
     request, start_services: ServicePorts, predownload_models
 ) -> None:
     """Test that n=5 streaming returns 5 distinct choices with correct indices."""
-    import json
 
     payload = {

As per coding guidelines, keep all imports at the top of the file (no imports inside functions/classes).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_vllm.py` at line 466, The local import "import json"
currently placed inside a function in test_vllm.py must be moved to module
scope; remove the function-local "import json" and add "import json" with the
other top-level imports at the top of the file so all imports remain at module
scope and adhere to the repo import rule.

Comment on lines +483 to +502
    response = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
        stream=True,
        timeout=180,
    )
    assert (
        response.status_code == 200
    ), f"Streaming request failed with status {response.status_code}: {response.text}"

    # Parse SSE events
    chunks = []
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data_str = line[len("data: ") :]
        if data_str == "[DONE]":
            break
        chunks.append(json.loads(data_str))
Contributor


⚠️ Potential issue | 🟡 Minor

Close the streamed response with a context manager.

stream=True keeps the connection open while iterating; wrap the response in with so the socket is released even on assertion/JSON failures.

🧹 Proposed fix
-    response = requests.post(
-        f"{base_url}/v1/chat/completions",
-        headers={"Content-Type": "application/json"},
-        json=payload,
-        stream=True,
-        timeout=180,
-    )
-    assert (
-        response.status_code == 200
-    ), f"Streaming request failed with status {response.status_code}: {response.text}"
-
-    # Parse SSE events
     chunks = []
-    for line in response.iter_lines(decode_unicode=True):
-        if not line or not line.startswith("data: "):
-            continue
-        data_str = line[len("data: ") :]
-        if data_str == "[DONE]":
-            break
-        chunks.append(json.loads(data_str))
+    with requests.post(
+        f"{base_url}/v1/chat/completions",
+        headers={"Content-Type": "application/json"},
+        json=payload,
+        stream=True,
+        timeout=180,
+    ) as response:
+        assert (
+            response.status_code == 200
+        ), f"Streaming request failed with status {response.status_code}: {response.text}"
+
+        # Parse SSE events
+        for line in response.iter_lines(decode_unicode=True):
+            if not line or not line.startswith("data: "):
+                continue
+            data_str = line[len("data: ") :]
+            if data_str == "[DONE]":
+                break
+            chunks.append(json.loads(data_str))

As per coding guidelines, for tests, ensure no leaked file handles (always use with).
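
One way to keep the `with` block short is to move the parsing into pure functions that take an iterable of lines, so they can be exercised without a network connection. The sketch below is a hypothetical refactoring, not part of the PR; `parse_sse_chunks` and `collect_content` are assumed names, and the chunk shape follows the test code quoted above.

```python
import json
from typing import Dict, Iterable, List


def parse_sse_chunks(lines: Iterable[str]) -> List[dict]:
    """Decode `data: ...` SSE lines into JSON chunks, stopping at [DONE]."""
    chunks: List[dict] = []
    for line in lines:
        if not line or not line.startswith("data: "):
            continue  # skip blank lines and anything that is not a data line
        data_str = line[len("data: "):]
        if data_str == "[DONE]":
            break
        chunks.append(json.loads(data_str))
    return chunks


def collect_content(chunks: Iterable[dict]) -> Dict[int, str]:
    """Gather delta content per choice index, joining fragments once at the end."""
    fragments: Dict[int, List[str]] = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content", "")
            if content:
                fragments.setdefault(choice["index"], []).append(content)
    return {idx: "".join(parts) for idx, parts in fragments.items()}
```

In the test, `parse_sse_chunks(response.iter_lines(decode_unicode=True))` could then be called inside `with requests.post(...) as response:`, so the connection is released as soon as parsing ends or fails.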

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_vllm.py` around lines 483 - 502, Wrap the streaming POST
call in a context manager to ensure the socket is always closed: change the bare
requests.post(...) that assigns to response and the subsequent
response.iter_lines(...) loop to use "with requests.post(...) as response:" so
the returned Response is closed even if assertions or json decoding fail; locate
this change around the Response usage in tests/frontend/test_vllm.py where
response is created and iterated with response.iter_lines(decode_unicode=True).

@github-actions github-actions Bot removed the Stale label Apr 23, 2026
@indrajit96
Contributor

Hi @kornelcsernai-harmonic
Thanks for your PR.
We are working on a broader-scoped PR that adds this n option across backends and the Rust frontend here:
#8744
It is currently running CI, should be merged this week, and should unblock you.
Once it is merged, we will close your PR.


Labels

backend::vllm Relates to the vllm backend external-contribution Pull request is from an external contributor feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/L
