feat: Support n optional OAI field in dynamo frontend and vLLM backend#8744
Conversation
Seems like this isn't true given the SGLang follow up PR: #8745? |
WalkthroughThis change enables handling multiple parallel choices (OpenAI Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/llm/src/backend.rs (1)
195-222:⚠️ Potential issue | 🟡 MinorTwo small observations on per-choice finish tracking in the early-return branches.
Lines 195–205 (text already populated): you correctly insert into
finished_choiceswhendata.finish_reason.is_some(), but never callstate.stream.context().stop_generating()or setstate.finished = trueeven when all choices have finished. In this branch the upstream engine owns termination, so this is probably fine, but it's inconsistent with the symmetric block at lines 274–280 — worth a one-line comment explaining why upstream is trusted here.Lines 207–222 (out-of-range choice index): emitting
FinishReason::Errorand continuing is reasonable (don't poison the whole stream over a bogus index), but if a misbehaving engine keeps emitting the same bad index, this will spam error chunks until upstream ends. Consider also insertingchoice_idx(or a sentinel) intofinished_choicesand short-circuiting subsequent occurrences.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/backend.rs` around lines 195 - 222, In the early-return branch that handles already-decoded text (the block that checks output.data.text and !state.validate_engine_decode), add a one-line comment explaining we trust the upstream engine to signal termination so we don't call state.stream.context().stop_generating() or set state.finished there; additionally, when you insert into state.finished_choices on data.finish_reason.is_some(), also check whether all choices are finished and if so set state.finished = true and call state.stream.context().stop_generating(). In the out-of-range choice index error path (the branch that uses state.decoders.get_mut(&choice_idx) and emits FinishReason::Error), also insert the offending choice_idx (or a sentinel like usize::MAX) into state.finished_choices and return early so repeated bad-index events are short-circuited; ensure the same all-choices-finished check (set state.finished and call stop_generating()) runs after inserting the sentinel.components/src/dynamo/vllm/handlers.py (1)
1958-1965:⚠️ Potential issue | 🟡 MinorHardcoded
"index": 0in the empty-outputs error path leaves other choices stranded forn > 1.When the engine returns a
RequestOutputwith nooutputs, this yields a single chunk with"index": 0and breaks. Forn > 1, choices1..n-1will never receive a final chunk; the Rust backend's per-choicefinished_choicesset will only mark choice 0 as finished and rely on the upstream stream ending to terminate. This works today (stream ends → unfold returnsNone) but means downstream clients won't see explicit error finish reasons for the other choices. Consider yielding one error chunk per requested choice index.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/vllm/handlers.py` around lines 1958 - 1965, The empty-outputs branch yields a single error chunk with "index": 0 which leaves other choices unmarked when n>1; instead iterate over the number of requested choices (the same n used to create the RequestOutput / completions request — e.g., request.num_completions or request_output.num_choices) and yield one error chunk per choice index with finish_reason "error: No outputs from vLLM engine" and the corresponding "index" (0..n-1), then return/stop the generator; update the code around the RequestOutput empty-outputs handling where the single yield currently occurs so every choice gets an explicit final error chunk.
♻️ Duplicate comments (1)
components/src/dynamo/vllm/handlers.py (1)
2298-2313:⚠️ Potential issue | 🟡 MinorSame hardcoded
index: 0issue in the text-mode empty-outputs error path.Mirrors the token-mode path on line 1962 — for
n > 1only one error chunk is yielded withindex: 0. Recommend emitting an error chunk per requested choice index here as well so the OpenAI-shaped response contains a finish for every choice.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/vllm/handlers.py` around lines 2298 - 2313, The current empty-output error path inside the async for loop over gen yields a single error chunk with a hardcoded "index: 0"; change it to emit an error choice entry for every requested choice index (the same behavior as the token-mode path that checks n > 1). Build the "choices" array by iterating over the requested choice count (n) and creating entries with "index" set to each choice number and the same delta/finish_reason ("role": "assistant", "content": "", "finish_reason": "error"), then yield that single error chunk using openai_request_id so each requested choice receives a finish signal.
🧹 Nitpick comments (5)
tests/serve/test_vllm.py (1)
92-99: Optional: strengthen then=2assertion beyond just count.The new payload only verifies that two
choicesentries are returned (viaexpected_num_choices=2), not that each contains valid content nor that theirindexvalues are{0, 1}. With greedy sampling (default temperature for Qwen) both choices may be identical, but they should each still carry independentmessage.contentand a uniqueindex.Consider extending
ChatPayload.validate(or asserting inline) to walk every choice and verify each has non-empty content and a uniqueindex, so a regression that, say, returns two empty choices or duplicated index=0 would be caught here rather than only by the Rust unit test.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/serve/test_vllm.py` around lines 92 - 99, The test only asserts expected_num_choices=2 but doesn't verify each returned choice contains non-empty content or unique indices; update ChatPayload.validate (or add inline assertions in chat_payload usage) to iterate over response.choices and assert each choice.message.content is non-empty (truthy) and that the set of choice.index values has length equal to expected_num_choices (i.e., indices are unique and cover {0,1} for n=2), failing the test if any choice is empty or indices duplicate.components/src/dynamo/vllm/llm_engine.py (1)
144-156: Minor:completion_usageis recomputed/emitted for every finishing output and reflects cumulative-but-not-final totals.When
n > 1and multiple outputs in the sameresfinish (or finish across nearby iterations), this block emitscompletion_usageonce per finishing output. Each emission usessum(len(choice.token_ids) for choice in res.outputs)evaluated at that moment, so:
- If output 0 finishes while output 1 is still mid-generation, the
completion_tokensreported here is partial.- A later finish for output 1 will emit again with a different total.
The Rust frontend (
delta.rs) currently only readsprompt_tokensfromcompletion_usageand accumulatescompletion_tokensitself per delta, so this is not a correctness bug today. But the field is misleading to anyone reading the chunk in isolation, and it's redundant work per chunk.Suggestion (optional): only attach
completion_usageon the last finishing output of the request (e.g. when allres.outputshave afinish_reason), or just emitprompt_tokensand let the frontend own the completion total.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/vllm/llm_engine.py` around lines 144 - 156, The code currently emits out["completion_usage"] whenever an individual output has output.finish_reason, causing partial or changing completion_tokens because it sums len(choice.token_ids) over res.outputs at that moment; update the logic in the block that checks output.finish_reason so that completion_usage is only attached when all outputs are finished (e.g., check that every choice in res.outputs has a finish_reason) or alternatively only include prompt_tokens (using res.prompt_token_ids) and omit completion_tokens entirely so the frontend can accumulate per-delta totals; rename or gate the out["completion_usage"] assignment accordingly to avoid recomputing per finishing output.tests/utils/payloads.py (1)
169-182: LGTM —expected_num_choicesvalidation is correct and additive.
super().validate(...)keeps the existing substring check on the first-choice content, and the new block precisely asserts the choices list length againstexpected_num_choices. Backwards compatible (no-op whenNone).Nit (optional):
response.json()is parsed both here and insideextract_content(requests.Response.json()re-parses on each call). You could either cache the parsed result on the response object viagetattr(response, "_cached_json", None)or pass it down. Trivial perf, not worth blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utils/payloads.py` around lines 169 - 182, The validate method currently calls response.json() again which re-parses the response; to avoid redundant parsing, modify validate (the validate method in tests/utils/payloads.py) to accept or retrieve a cached parsed JSON and use that instead of calling response.json() twice—either add an optional parameter (e.g., parsed_result) or check for a cached attribute on response (getattr(response, "_cached_json", None)) and fall back to response.json() only if missing, then use that parsed_result for the choices length check and ensure any callers (e.g., extract_content) pass or set the cached JSON accordingly.lib/llm/src/migration.rs (1)
155-163: LGTM — clean mirror of the guided-decoding guard.The condition correctly disables migration when
n > 1because per-choice generation state cannot be replayed by re-feeding accumulated tokens (each choice diverges independently).Optional: consider adding a
test_retry_manager_no_migration_for_n_gt_1test analogous totest_retry_manager_no_migration_for_guided_decodingto lock in this behavior, since the n>1 path is now PR-significant.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/migration.rs` around lines 155 - 163, Add a unit test that asserts migration is disabled when sampling_options.n > 1, mirroring the existing test for guided decoding; specifically, create a test (e.g., test_retry_manager_no_migration_for_n_gt_1) that constructs a PreprocessedRequest with sampling_options.n = Some(2) and verifies the same behavior as the guided-decoding test (retries_left becomes 0 and a warning is emitted) to prevent regressions around the branch that checks preprocessed_request.sampling_options.n and sets retries_left = 0 in the migration logic.components/src/dynamo/frontend/vllm_processor.py (1)
619-637: Inconsistent error handling: postprocessor-missbreakonly exits the inner loop.Line 599 handles an unknown engine
indexbybreak-ing the outerasync for dynamo_responseloop, terminating streaming. The symmetric case at line 634 (postprocessor not found foroutput.index) onlybreaks the innerfor output in ...loop — the outer loop continues processing further backend responses, and any subsequent outputs with a valid index will still be appended tochoicesand emitted, despite the request being in a half-broken state. Recommendreturn(after running cleanup via thefinallyblock) to halt the request entirely on this error, matching the line 599 behavior.♻️ Proposed change
for output in vllm_out.request_outputs[0].outputs: post = post_processors.get(output.index) if post is None: yield { "error": { "message": ( f"Invalid postprocessor choice index {output.index} " f"for request {request_id}" ), "type": "internal_error", } } - break + return choice = post.process_output(output) if choice: choices.append(choice)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/frontend/vllm_processor.py` around lines 619 - 637, The error handling for a missing postprocessor is inconsistent: inside the loop over vllm_out.request_outputs[0].outputs you call post_processors.get(output.index) and on miss you yield an error then break the inner loop, but you should halt the entire request like the earlier unknown engine case. Replace the inner-loop break with a return (so the generator stops and the surrounding finally cleanup runs) in the block where post is None (the yield error branch referencing post_processors, output.index, request_id, and choices) to match the behavior at the other error site.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/src/dynamo/vllm/llm_engine.py`:
- Around line 125-142: Replace the assert checks in the initialization path with
explicit runtime checks: in the method where self.engine_client and
self._default_sampling_params are validated (in llm_engine.py, locate the code
using self.engine_client and self._default_sampling_params before using the vLLM
client), change the assert statements to explicit if checks that raise
RuntimeError("Engine not initialized") when either self.engine_client is None or
self._default_sampling_params is None so failures surface predictably.
---
Outside diff comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 1958-1965: The empty-outputs branch yields a single error chunk
with "index": 0 which leaves other choices unmarked when n>1; instead iterate
over the number of requested choices (the same n used to create the
RequestOutput / completions request — e.g., request.num_completions or
request_output.num_choices) and yield one error chunk per choice index with
finish_reason "error: No outputs from vLLM engine" and the corresponding "index"
(0..n-1), then return/stop the generator; update the code around the
RequestOutput empty-outputs handling where the single yield currently occurs so
every choice gets an explicit final error chunk.
In `@lib/llm/src/backend.rs`:
- Around line 195-222: In the early-return branch that handles already-decoded
text (the block that checks output.data.text and !state.validate_engine_decode),
add a one-line comment explaining we trust the upstream engine to signal
termination so we don't call state.stream.context().stop_generating() or set
state.finished there; additionally, when you insert into state.finished_choices
on data.finish_reason.is_some(), also check whether all choices are finished and
if so set state.finished = true and call
state.stream.context().stop_generating(). In the out-of-range choice index error
path (the branch that uses state.decoders.get_mut(&choice_idx) and emits
FinishReason::Error), also insert the offending choice_idx (or a sentinel like
usize::MAX) into state.finished_choices and return early so repeated bad-index
events are short-circuited; ensure the same all-choices-finished check (set
state.finished and call stop_generating()) runs after inserting the sentinel.
---
Duplicate comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 2298-2313: The current empty-output error path inside the async
for loop over gen yields a single error chunk with a hardcoded "index: 0";
change it to emit an error choice entry for every requested choice index (the
same behavior as the token-mode path that checks n > 1). Build the "choices"
array by iterating over the requested choice count (n) and creating entries with
"index" set to each choice number and the same delta/finish_reason ("role":
"assistant", "content": "", "finish_reason": "error"), then yield that single
error chunk using openai_request_id so each requested choice receives a finish
signal.
---
Nitpick comments:
In `@components/src/dynamo/frontend/vllm_processor.py`:
- Around line 619-637: The error handling for a missing postprocessor is
inconsistent: inside the loop over vllm_out.request_outputs[0].outputs you call
post_processors.get(output.index) and on miss you yield an error then break the
inner loop, but you should halt the entire request like the earlier unknown
engine case. Replace the inner-loop break with a return (so the generator stops
and the surrounding finally cleanup runs) in the block where post is None (the
yield error branch referencing post_processors, output.index, request_id, and
choices) to match the behavior at the other error site.
In `@components/src/dynamo/vllm/llm_engine.py`:
- Around line 144-156: The code currently emits out["completion_usage"] whenever
an individual output has output.finish_reason, causing partial or changing
completion_tokens because it sums len(choice.token_ids) over res.outputs at that
moment; update the logic in the block that checks output.finish_reason so that
completion_usage is only attached when all outputs are finished (e.g., check
that every choice in res.outputs has a finish_reason) or alternatively only
include prompt_tokens (using res.prompt_token_ids) and omit completion_tokens
entirely so the frontend can accumulate per-delta totals; rename or gate the
out["completion_usage"] assignment accordingly to avoid recomputing per
finishing output.
In `@lib/llm/src/migration.rs`:
- Around line 155-163: Add a unit test that asserts migration is disabled when
sampling_options.n > 1, mirroring the existing test for guided decoding;
specifically, create a test (e.g., test_retry_manager_no_migration_for_n_gt_1)
that constructs a PreprocessedRequest with sampling_options.n = Some(2) and
verifies the same behavior as the guided-decoding test (retries_left becomes 0
and a warning is emitted) to prevent regressions around the branch that checks
preprocessed_request.sampling_options.n and sets retries_left = 0 in the
migration logic.
In `@tests/serve/test_vllm.py`:
- Around line 92-99: The test only asserts expected_num_choices=2 but doesn't
verify each returned choice contains non-empty content or unique indices; update
ChatPayload.validate (or add inline assertions in chat_payload usage) to iterate
over response.choices and assert each choice.message.content is non-empty
(truthy) and that the set of choice.index values has length equal to
expected_num_choices (i.e., indices are unique and cover {0,1} for n=2), failing
the test if any choice is empty or indices duplicate.
In `@tests/utils/payloads.py`:
- Around line 169-182: The validate method currently calls response.json() again
which re-parses the response; to avoid redundant parsing, modify validate (the
validate method in tests/utils/payloads.py) to accept or retrieve a cached
parsed JSON and use that instead of calling response.json() twice—either add an
optional parameter (e.g., parsed_result) or check for a cached attribute on
response (getattr(response, "_cached_json", None)) and fall back to
response.json() only if missing, then use that parsed_result for the choices
length check and ensure any callers (e.g., extract_content) pass or set the
cached JSON accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a915cb4b-a6a7-43e4-8106-6f9313826b9d
📒 Files selected for processing (9)
components/src/dynamo/frontend/vllm_processor.pycomponents/src/dynamo/vllm/handlers.pycomponents/src/dynamo/vllm/llm_engine.pylib/llm/src/backend.rslib/llm/src/migration.rslib/llm/src/protocols/openai/chat_completions/delta.rstests/serve/test_vllm.pytests/utils/payload_builder.pytests/utils/payloads.py
I meant engines support it natively. |
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
…mo into ibhosale/n-options-output
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
…end (#8744) Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Overview:
Add OpenAI-compatible
nchoice support across Dynamo chat completions so a request withn > 1can return multiple choices instead of only choice0.WHY:
OAI Chat Completions supports
nas the number of choices to generate for one input message. In the OpenAI API reference, this is documented herevLLM support here
All Backends (engines natively) support it. Dynamo was missing the request and response plumbing for the same.
Will follow-up with PRs for Sglang and TRTLLM
Details:
Rust OpenAI request/response path
lib/llm/src/backend.rsandlib/llm/src/migration.rscarrynthrough Dynamo’s preprocessed request path.lib/llm/src/protocols/openai/chat_completions/delta.rsaddsindexto streamed backend deltas so each generated choice can be routed back to the correct OpenAI choice.npath remains unchanged logically because missing backend indexes default to choice0.vLLM frontend processor + backend
components/src/dynamo/frontend/vllm_processor.pycopiesnintoSamplingParamsand forwards it insampling_options.n > 1, it recreates vLLM’s parent/child request bookkeeping withParentRequest, because Dynamo bypasses vLLM’s normal HTTP engine path and feeds backend token chunks directly intoOutputProcessor.StreamingPostProcessorper choice index so text deltas, tool parsing, reasoning parsing, and finish reasons do not bleed across choices.components/src/dynamo/vllm/llm_engine.pyandcomponents/src/dynamo/vllm/handlers.pypreserve each vLLM output’sindex.Tests
tests/utils/payloads.pyandtests/utils/payload_builder.pyextend the existing chat payload builder to optionally setn.tests/serve/test_vllm.pyadd pre-merge E2E coverage validatingn=2returns two choices.Where should the reviewer start?
Start with the request/response contract:
lib/llm/src/protocols/openai/chat_completions/delta.rslib/llm/src/backend.rslib/llm/src/migration.rsThen backend-specific behavior:
components/src/dynamo/frontend/vllm_processor.pyThen validation:
tests/serve/test_vllm.pyRelated Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes