feat: Expose async vLLM engine as HTTP server#1110
Conversation
Walkthrough

Adds an optional in-process FastAPI/uvicorn HTTP bridge for async vLLM workers, a token-sequence correction utility, new vLLM config flags, exposes per-worker OpenAI-style server base URLs from VllmGeneration, factors out node IP/port helpers, and expands unit tests for HTTP, token-correction, and generation/weight/memory scenarios.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client as Client
    participant Server as In-process HTTP Server
    participant Worker as VllmAsyncGenerationWorker
    participant Engine as AsyncLLM Engine
    Note over Worker,Server: Started only when expose_http_server=True
    Worker->>Worker: _create_engine() (store AsyncEngineArgs, init llm)
    Worker->>Worker: _setup_vllm_server() -> (thread, base_url, server)
    Worker-->>Caller: _report_dp_openai_server_base_urls() returns base_url list
    rect #E6F7FF
        Client->>Server: POST /v1/chat/completions
        Server->>Worker: build request (mixins, derive prefix token ids)
        Server->>Server: call _maybe_correct_merged_tokens()
        Server->>Engine: submit generation request
        Engine-->>Server: tokens / logprobs
        Server-->>Client: ChatCompletionResponse (or streaming)
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant Test as Test Suite
    participant Server as In-process HTTP Server
    Test->>Server: POST /tokenize
    Server->>Server: perform tokenization via tokenizer
    Server-->>Test: tokenization response
    Test->>Server: request shutdown
    Server-->>Test: server stops
    Test-xServer: further requests -> ConnectionError
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 9
🧹 Nitpick comments (7)
nemo_rl/models/generation/vllm/config.py (1)
32-33: Config keys look good; consider a typed kwargs shape.

Optional: replace `dict[str, Any]` with a dedicated `TypedDict` (e.g., `HttpServingChatKwargs`) to document supported fields and catch misconfigurations at type-check time.

nemo_rl/models/generation/vllm/vllm_generation.py (1)
172-173: Capture of dp_openai_server_base_urls is fine; add an explicit type.

Avoid mypy "implicit dynamic" on a new attribute by annotating the field.

Apply:

```diff
-        # dp_openai_server_base_urls is only returned by Async vLLM flow when http server is active
-        self.dp_openai_server_base_urls = self._post_init()
+        # dp_openai_server_base_urls is only returned by Async vLLM flow when http server is active
+        self.dp_openai_server_base_urls: list[str | None] = self._post_init()
```

nemo_rl/distributed/virtual_cluster.py (1)
63-64: Minor: free-port race window.

`_get_free_port_local()` + a later bind can race under load. Consider a retry loop in the server startup on EADDRINUSE (or let the server bind to port 0 and discover the actual port before exporting the base URL).

nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
296-304: HTTP 400 is clearer than an assertion for on-policy enforcement.

Replace asserts with a 400 response to avoid 500s and to work when Python runs with `-O`.

Apply:

```diff
-        assert request.temperature == generation_config["temperature"]
-        assert request.top_p == generation_config["top_p"]
+        if request.temperature != generation_config["temperature"] or request.top_p != generation_config["top_p"]:
+            err = ErrorResponse(
+                message="Sampling params must match policy config (temperature/top_p).",
+                type="invalid_request_error",
+                code=400,
+            )
+            return JSONResponse(content=err.model_dump(), status_code=400)
```
886-896: Shutdown guards are good; avoid re-declaring types at runtime.

The runtime type reassignments (`self.http_server: Server`, `self.server_thread: Thread`) are unnecessary here; consider removing them.

```diff
-        from threading import Thread
-
-        from uvicorn import Server
-
-        self.http_server: Server
-        self.server_thread: Thread
-
         self.http_server.should_exit = True
         self.server_thread.join()
```

tests/unit/models/generation/test_vllm_generation.py (2)
1162-1164: Avoid ".." in paths; use `urljoin` and add a timeout.

`/v1/../tokenize` works but is brittle and noisy; normalize the URL and include a timeout.

```diff
-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+    from urllib.parse import urljoin
+    tokenize_url = urljoin(base_urls[0] + "/", "../tokenize")
+    response = requests.post(url=tokenize_url, json=body, timeout=30)
```

If you prefer a top-level import, I can place `from urllib.parse import urljoin` with the other imports.
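The `urljoin` normalization suggested above can be checked standalone; the base URL below is a made-up stand-in for `base_urls[0]`:

```python
from urllib.parse import urljoin

base = "http://10.0.0.5:8000/v1"  # hypothetical worker base URL
# Appending "/" makes urljoin treat ".../v1" as a directory, so "../tokenize"
# steps out of it and the dot-segments are resolved away per RFC 3986.
tokenize_url = urljoin(base + "/", "../tokenize")
print(tokenize_url)  # http://10.0.0.5:8000/tokenize
```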
1189-1201: Tighten shutdown assertion: include a timeout and accept timeout exceptions.

Post-shutdown, the call should fail quickly. Include a short timeout and cover `Timeout` as well.

```diff
-    with pytest.raises(requests.ConnectionError):
+    with pytest.raises((requests.ConnectionError, requests.Timeout)):
         requests.post(
             url=f"{base_urls[0]}/chat/completions",
             json=dict(
                 messages=[
                     {"role": "user", "content": "count to 5"},
                 ],
                 temperature=0.0,
                 logprobs=True,
                 return_tokens_as_token_ids=True,
                 max_tokens=1,
             ),
+            timeout=5,
         )
```

If this still flakes due to late teardown, we can loop/poll for up to ~15s before failing.
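The bind-to-port-0 alternative mentioned for the free-port race (in the `virtual_cluster.py` comment above) can be sketched with the stdlib alone; the helper name is illustrative, not from the PR:

```python
import socket

def bind_ephemeral_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Let the OS choose the port at bind() time, eliminating the pick-then-bind race."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))          # port 0: the kernel assigns a free ephemeral port
    sock.listen(1)
    port = sock.getsockname()[1]  # read back the port that was actually assigned
    return sock, port

sock, port = bind_ephemeral_port()
base_url = f"http://127.0.0.1:{port}/v1"  # publish only after the bind succeeded
sock.close()
```

The key point is that the port is discovered from the already-bound socket, so no other process can steal it between selection and bind.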
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- nemo_rl/distributed/virtual_cluster.py (1 hunks)
- nemo_rl/models/generation/vllm/config.py (1 hunks)
- nemo_rl/models/generation/vllm/vllm_generation.py (1 hunks)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
- tests/unit/models/generation/test_vllm_generation.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
`_get_free_port_local` (73-81), `_get_node_ip_local` (66-70)
nemo_rl/distributed/virtual_cluster.py (1)
tests/unit/distributed/test_virtual_cluster.py (1)
test_get_node_ip_and_free_port_does_not_start_with_zero(32-43)
tests/unit/models/generation/test_vllm_generation.py (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
`_maybe_correct_merged_tokens` (36-115), `shutdown` (867-900)

nemo_rl/models/generation/__init__.py (1)
`configure_generation_config` (24-45)

nemo_rl/distributed/worker_groups.py (1)
shutdown(907-981)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
365-365: Possible binding to all interfaces
(S104)
tests/unit/models/generation/test_vllm_generation.py
1102-1102: Probable use of requests call without timeout
(S113)
1162-1162: Probable use of requests call without timeout
(S113)
1190-1190: Probable use of requests call without timeout
(S113)
🔇 Additional comments (4)
nemo_rl/distributed/virtual_cluster.py (2)
66-71: LGTM: helper split improves reuse and testability.
73-82: Local free-port finder is fine; keep the socket import local.

tests/unit/models/generation/test_vllm_generation.py (2)
1049-1061: Helper to enable HTTP server: LGTM.

Clear, minimal config toggle for async + HTTP exposure.
1082-1085: Make the DP server count assertion robust.

`len(base_urls) == cluster.num_gpus_per_node` may be brittle if DP != GPUs-per-node in some CI shapes. Prefer asserting against the actual vLLM DP size (if exposed) or at least `>= 1`.

For example:

```python
# if available:
assert len(base_urls) == getattr(vllm_generation, "dp_world_size", len(base_urls)) and len(base_urls) >= 1
```

Would you like me to search the codebase and wire an explicit dp_world_size onto VllmGeneration for this?
Signed-off-by: Brian Yu <bxyu@nvidia.com>
terrykong
left a comment
nice addition @bxyu-nvidia!
Am I correct to assume a functional test like this will come in a later PR?
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 4
♻️ Duplicate comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
101-106: Clarify the intentional off-by-one bound in the append loop.

The `< len(actual_token_ids) - 1` guard is intentional (per prior discussion) but non-obvious. Add an inline comment to prevent future "fixes."

```diff
-            and len(candidate_token_ids) < len(actual_token_ids) - 1
+            # Intentional: keep one token in reserve to avoid overshooting and to
+            # handle token-merge edge cases discussed in PR #1110.
+            and len(candidate_token_ids) < len(actual_token_ids) - 1
```
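For intuition, the merge-correction idea can be mimicked on toy data. `correct_merged_tokens` below is an invented, simplified stand-in for `_maybe_correct_merged_tokens`: tokens are plain strings, detokenization is concatenation, and the off-by-one guard discussed above is omitted:

```python
def correct_merged_tokens(reference: list[str], actual: list[str]) -> list[str]:
    """Append tokens from `actual` until their concatenation covers `reference`."""
    reference_str = "".join(reference)
    candidate: list[str] = []
    for tok in actual:
        if "".join(candidate) == reference_str:
            break
        candidate.append(tok)
    if "".join(candidate) != reference_str:
        # Mirrors the invariant under discussion: anything else is not a pure merge.
        raise ValueError("divergence is not a simple token merge")
    return candidate

# "ba" + "r" was re-tokenized into "bar"; the loop keeps the merged span and stops.
print(correct_merged_tokens(["foo", "ba", "r"], ["foo", "bar", "baz"]))  # ['foo', 'bar']
```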
348-349: Document that /tokenize is intentionally root-scoped (not under /v1).

Prevents confusion given `base_url` includes `/v1`.

```diff
-    @app.post("/tokenize")
+    # Intentionally exposed at root (not under "/v1") for Penguin compatibility.
+    @app.post("/tokenize")
```
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
149-149: Remove the duplicate import of threading.

`threading` is imported at the module level (line 18), so the local re-import is redundant.

```diff
-        import threading
```
73-83: Avoid a bare assert in the production path; raise a typed error (or degrade gracefully).

This assertion can crash request handling. Prefer raising a typed exception that the caller can translate into a structured 4xx response, or fall back to `actual_token_ids` with a warning.

```diff
-    assert (
-        reference_str == actual_str[: len(reference_str)]
-    ), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+    if reference_str != actual_str[: len(reference_str)]:
+        raise ValueError(
+            "Non-monotonic trajectory not attributable to token-merge during re-tokenization."
+        )
```

If you want, I can wire this into the FastAPI routes to return a 400 instead of a 500.
170-175: Prefer logger over print for the server start message.

Route this through the repo's logging to respect the global log config.

```diff
-    base_url = f"http://{node_ip}:{free_port}/v1"
-    print(f"Starting server on {base_url}")
+    base_url = f"http://{node_ip}:{free_port}/v1"
+    import logging
+    logging.getLogger(__name__).info("Starting vLLM HTTP server on %s", base_url)
```
365-369: Intentional 0.0.0.0 bind: silence Ruff S104.

Binding to all interfaces is by design here (acknowledged in discussion). Add a per-line ignore to keep CI green.

```diff
-        host="0.0.0.0",
+        host="0.0.0.0",  # noqa: S104 - intentional: expose on all interfaces
```
888-897: Don't hang shutdown if the server thread won't exit.

Add a join timeout and attempt a forceful exit to avoid blocking teardown.

```diff
-        self.http_server.should_exit = True
-        self.server_thread.join()
+        self.http_server.should_exit = True
+        self.server_thread.join(timeout=5)
+        if self.server_thread.is_alive():
+            # Try harder and avoid hanging shutdown.
+            try:
+                self.http_server.force_exit = True  # type: ignore[attr-defined]
+            except Exception:
+                pass
+            self.server_thread.join(timeout=5)
```
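The cooperative-exit pattern behind `should_exit` plus a bounded `join` can be exercised without uvicorn; `StoppableWorker` is an invented stand-in for the uvicorn `Server`/thread pair:

```python
import threading
import time

class StoppableWorker:
    """A thread that polls a cooperative exit flag, mimicking uvicorn's should_exit."""

    def __init__(self) -> None:
        self.should_exit = False
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self.should_exit:
            time.sleep(0.01)

    def shutdown(self, timeout: float = 5.0) -> bool:
        """Request exit, wait at most `timeout` seconds, report whether it stopped."""
        self.should_exit = True
        self.thread.join(timeout=timeout)
        return not self.thread.is_alive()

worker = StoppableWorker()
worker.thread.start()
stopped = worker.shutdown()
print(stopped)  # True
```

The bounded `join(timeout=...)` is what keeps teardown from blocking forever when the loop ignores the flag.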
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- 3rdparty/Penguin-workspace/Penguin (1 hunks)
- nemo_rl/distributed/virtual_cluster.py (1 hunks)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_rl/distributed/virtual_cluster.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
`_get_free_port_local` (73-81), `_get_node_ip_local` (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
367-367: Possible binding to all interfaces
(S104)
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
148-148: Good fix on the return annotation.

An accurate tuple type makes intent clear and helps type-checkers.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (4)
24-24: Avoid a hard dependency on uvicorn at import time.

Drop the top-level `import uvicorn`; you already import it lazily in `_setup_vllm_server`, which keeps environments without uvicorn working when HTTP is disabled.

```diff
-from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-import uvicorn
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase
```
100-107: Off-by-one guard is intentional; document it and add a test.

Per prior discussion, the `< len(actual_token_ids) - 1` bound is deliberate. Please add an inline comment explaining the rationale and a unit test covering a "last-token merge" case so future edits don't regress it.

```diff
     elif len(candidate_str) < len(reference_str):
+        # Intentionally stop before the final token to avoid edge re-merge cases
+        # observed in our environments (see PR #1110 discussion).
         while (
             candidate_str != reference_str
             and len(candidate_str) < len(reference_str)
             and len(candidate_token_ids) < len(actual_token_ids) - 1
         ):
```
365-375: Startup robustness: add a readiness probe; silence Ruff S104 intentionally.

- Binding to `0.0.0.0` is intentional here, so add `# noqa: S104` to quiet lint.
- Add a short readiness loop to avoid client races immediately after returning.

```diff
     config = uvicorn.Config(
         app,
-        host="0.0.0.0",
+        host="0.0.0.0",  # noqa: S104 - intentional: bind all interfaces
         port=free_port,
     )
     server = uvicorn.Server(config=config)
     thread = threading.Thread(target=server.run, daemon=True)
     thread.start()
+    # Wait briefly until the socket is accepting connections to avoid races.
+    import socket, time
+    deadline = time.time() + 5
+    while time.time() < deadline:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            if s.connect_ex((node_ip, free_port)) == 0:
+                break
+        time.sleep(0.05)
+
     return thread, base_url, server
```
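The readiness loop in the diff above can be verified in isolation against a throwaway listener (pure stdlib, no uvicorn involved):

```python
import socket
import threading
import time

# Stand-in for the HTTP server: a listener on an OS-assigned port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
port = listener.getsockname()[1]

def serve() -> None:
    listener.listen(1)
    conn, _ = listener.accept()
    conn.close()

threading.Thread(target=serve, daemon=True).start()

# The probe: poll connect_ex() until the socket accepts connections.
ready = False
deadline = time.time() + 5
while time.time() < deadline:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if s.connect_ex(("127.0.0.1", port)) == 0:
            ready = True
            break
    time.sleep(0.05)
print(ready)
```

`connect_ex` returns 0 on success and an errno otherwise, so the loop retries through the brief window before `listen()` takes effect instead of raising.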
299-302: Return 400 on sampling-param mismatch instead of asserting.

This avoids 500s and makes client errors explicit.

```diff
-        assert request.temperature == generation_config["temperature"]
-        assert request.top_p == generation_config["top_p"]
+        if (
+            request.temperature != generation_config["temperature"]
+            or request.top_p != generation_config["top_p"]
+        ):
+            msg = (
+                "Sampling params must match server config: "
+                f"temperature={generation_config['temperature']}, "
+                f"top_p={generation_config['top_p']}."
+            )
+            return JSONResponse(
+                content=ErrorResponse(message=msg, code=400).model_dump(),
+                status_code=400,
+            )
```
🧹 Nitpick comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
173-176: Use logging instead of print; also clarify the `/tokenize` path difference in docs.

Note that `/tokenize` is served at the root, without the `/v1` prefix.

```diff
-    base_url = f"http://{node_ip}:{free_port}/v1"
-    print(f"Starting server on {base_url}")
+    base_url = f"http://{node_ip}:{free_port}/v1"
+    import logging
+    logging.getLogger(__name__).info("Starting server on %s", base_url)
```
888-898: Graceful shutdown: guard against None and add a join timeout.

Avoid potential hangs if the server thread doesn't exit promptly.

```diff
-        if self.server_thread is not None:
+        if getattr(self, "server_thread", None) is not None:
             from threading import Thread

             from uvicorn import Server

             self.http_server: Server
             self.server_thread: Thread

-            self.http_server.should_exit = True
-            self.server_thread.join()
+            if getattr(self, "http_server", None) is not None:
+                self.http_server.should_exit = True
+            self.server_thread.join(timeout=5)
+            if self.server_thread.is_alive():
+                print("Warning: HTTP server thread did not exit within timeout.")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_rl/models/generation/vllm/vllm_generation.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
`_get_free_port_local` (73-81), `_get_node_ip_local` (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
367-367: Possible binding to all interfaces
(S104)
🔇 Additional comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (4)
135-143: Engine init + optional HTTP exposure looks good.

Clean separation of args and conditional server startup.

144-147: Returning the base_url from post_init_async is good API design.

Surfacing the URL here is convenient for callers.

148-153: Accurate return type for _setup_vllm_server.

The string-annotated tuple avoids runtime import issues. LGTM.

348-361: Root-level `/tokenize` endpoint is fine (intentional divergence).

Acknowledging the intended API structure: `/v1/chat/completions` under `/v1`, `/tokenize` at root. Ensure README/usage samples reflect this.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
24-24: Avoid a hard dependency on uvicorn at import time.

`uvicorn` is imported both at module top and inside `_setup_vllm_server`. Keep only the local import so environments not using the HTTP server don't require uvicorn.

```diff
-from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-import uvicorn
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase
```
375-378: Add a readiness probe before returning `base_url`.

Avoid client races and the port-steal window by waiting until the listener is reachable.

```diff
     thread = threading.Thread(target=server.run, daemon=True)
     thread.start()
-    return thread, base_url, server
+    # Wait briefly for server readiness to avoid client races.
+    import socket, time
+    deadline = time.time() + 5
+    while time.time() < deadline:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            if s.connect_ex((node_ip, free_port)) == 0:
+                break
+        time.sleep(0.05)
+
+    return thread, base_url, server
```
115-117: The final invariant shouldn't be an `assert`.

Raise a structured exception so callers can map it to a client error.

```diff
-    assert candidate_str == reference_str
+    if candidate_str != reference_str:
+        raise ValueError(
+            "Unable to align reference and actual tokenization after correction."
+        )
```
290-319: Return 400 on sampling-param mismatches instead of asserting.

Client-visible validation improves debuggability and stability during training.

```diff
-        assert request.top_k in (None, -1), (
-            f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`"
-        )
+        if request.top_k not in (None, -1):
+            msg = f"top_k must be unset or -1; got {request.top_k}"
+            return JSONResponse(
+                content=ErrorResponse(message=msg, code=400).model_dump(),
+                status_code=400,
+            )
         request.top_k = -1
-        assert request.temperature == generation_config["temperature"]
-        assert request.top_p == generation_config["top_p"]
+        if (
+            request.temperature != generation_config["temperature"]
+            or request.top_p != generation_config["top_p"]
+        ):
+            msg = (
+                "Sampling params must match server config: "
+                f"temperature={generation_config['temperature']}, "
+                f"top_p={generation_config['top_p']}."
+            )
+            return JSONResponse(
+                content=ErrorResponse(message=msg, code=400).model_dump(),
+                status_code=400,
+            )
```
73-83: Don't use `assert` in the request path; return structured errors.

Asserts can be optimized out and yield 500s. Raise a clear error (or propagate an HTTP 400 upstream).

```diff
-    assert (
-        reference_str == actual_str[: len(reference_str)]
-    ), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+    if reference_str != actual_str[: len(reference_str)]:
+        raise ValueError(
+            "Non-monotonic trajectory not due to token-merge during re-tokenization. "
+            f"ref_ids={reference_token_ids} actual_ids={actual_token_ids}"
+        )
```
🧹 Nitpick comments (9)
nemo_rl/models/generation/vllm/vllm_worker_async.py (9)
100-107: Intentional off-by-one: document it inline.

Per the author's note, excluding the last token is intentional. Add an explanatory comment to prevent regressions.

```diff
     elif len(candidate_str) < len(reference_str):
         while (
             candidate_str != reference_str
             and len(candidate_str) < len(reference_str)
-            and len(candidate_token_ids) < len(actual_token_ids) - 1
+            # Intentionally stop before consuming the final token in actual_token_ids.
+            # This avoids out-of-bounds on the next append and handles edge merges at
+            # the sequence end; the off-by-one is deliberate (see PR discussion).
+            and len(candidate_token_ids) < len(actual_token_ids) - 1
         ):
```
170-175: Prefer logger over `print`.

Use the module logger to integrate with existing logging.

```diff
-    base_url = f"http://{node_ip}:{free_port}/v1"
-    print(f"Starting server on {base_url}")
+    base_url = f"http://{node_ip}:{free_port}/v1"
+    logger.info("Starting vLLM HTTP server on %s", base_url)
```

Add once at module top (outside this hunk):

```python
import logging

logger = logging.getLogger(__name__)
```
351-364: /tokenize sits at root by design; ensure docs/tests reflect it.

Since `base_url` includes `/v1` while `/tokenize` does not, please document that clients should call `{base_url}/../tokenize` or separately advertise the tokenize URL.
368-372: Binding to 0.0.0.0 is intentional: add a comment and an optional override.

To preempt security lint and future changes, document the intent and optionally support an env override.

```diff
-    config = uvicorn.Config(
-        app,
-        host="0.0.0.0",
+    # Intentionally bind to all interfaces (cluster visibility).
+    # NOTE: In untrusted environments, set NEMORL_VLLM_HTTP_HOST to limit exposure.
+    config = uvicorn.Config(
+        app,
+        host=os.getenv("NEMORL_VLLM_HTTP_HOST", "0.0.0.0"),
         port=free_port,
     )
```

Add once at module top (outside this hunk):

```python
import os
```
725-732: Drop dead code: you already awaited `collective_rpc`.

`asyncio.iscoroutine(result_or_coro)` will always be False here. Simplify.

```diff
-        result_or_coro = await self.llm.collective_rpc("report_device_id", args=tuple())
-
-        if asyncio.iscoroutine(result_or_coro):
-            list_of_worker_results = await result_or_coro
-        else:
-            list_of_worker_results = result_or_coro
+        list_of_worker_results = await self.llm.collective_rpc(
+            "report_device_id", args=tuple()
+        )
```
760-769: Same coroutine check issue as above.

```diff
-        result_or_coro = await self.llm.collective_rpc(
-            "update_weights_from_global_ipc_handles", args=(ipc_handles,)
-        )
-
-        if asyncio.iscoroutine(result_or_coro):
-            worker_results = await result_or_coro
-        else:
-            worker_results = result_or_coro
+        worker_results = await self.llm.collective_rpc(
+            "update_weights_from_global_ipc_handles", args=(ipc_handles,)
+        )
```
800-803: Same coroutine check issue here too.

```diff
-        if asyncio.iscoroutine(result_or_coro):
-            worker_results = await result_or_coro
-        else:
-            worker_results = result_or_coro
+        worker_results = result_or_coro
```

Note: or simply `worker_results = await self.llm.collective_rpc("update_weights_from_collective", args=tuple())` as above.
888-901: Make HTTP server shutdown non-blocking and safe.

Add a join timeout and a guard for when the server didn't start.

```diff
-        if self.server_thread is not None:
+        if getattr(self, "server_thread", None) is not None:
             from threading import Thread

             from uvicorn import Server

             self.http_server: Server
             self.server_thread: Thread

             self.http_server.should_exit = True
-            self.server_thread.join()
+            self.server_thread.join(timeout=5)
```
320-323: Add a simple health endpoint.

A `/healthz` endpoint (200 OK) helps readiness checks and orchestration.

```diff
     ########################################
     # /tokenize endpoint
     ########################################
+    @app.get("/healthz")
+    async def healthz():
+        return {"status": "ok"}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
`_get_free_port_local` (73-81), `_get_node_ip_local` (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
370-370: Possible binding to all interfaces
(S104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
135-143: Engine setup + optional HTTP exposure looks good.

Capturing `AsyncEngineArgs` for reuse and gating server spin-up behind config is clean.

144-147: post_init_async API shape is appropriate.

Returning `Optional[str]` to expose `base_url` when enabled is clear.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
… report device id Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
377-380: Add a readiness probe to avoid a client race after start().

Prevents early requests from failing before the listener is ready.

```diff
     thread = threading.Thread(target=server.run, daemon=True)
     thread.start()
+    # Wait briefly for server readiness to avoid client races.
+    import socket, time
+    deadline = time.time() + 5
+    while time.time() < deadline:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            if s.connect_ex((node_ip, free_port)) == 0:
+                break
+        time.sleep(0.05)
+
     return thread, base_url, server
```
23-23: Remove the top-level uvicorn import to keep it optional.

This makes uvicorn a hard dependency even when the HTTP server isn't used. You already import it lazily inside `_setup_vllm_server()`.

```diff
-import uvicorn
```

Optionally, for type checkers only:

```python
# at module top-level
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import uvicorn  # noqa: F401
```
73-83: Replace assert with a structured error (non-monotonic trajectory). `assert` can be stripped with `-O` and yields 500s. Raise a clear exception instead.

```diff
-assert (
-    reference_str == actual_str[: len(reference_str)]
-), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+if reference_str != actual_str[: len(reference_str)]:
+    raise ValueError(
+        "Non-monotonic trajectory not caused by token merge during re-tokenization."
+    )
```
115-117: Replace terminal assert with a structured error. Avoid `assert` for request-path invariants.

```diff
-assert candidate_str == reference_str
+if candidate_str != reference_str:
+    raise ValueError(
+        "Failed to align tokens after correction; likely non-merge re-tokenization."
+    )
```
303-307: Return 400 on sampling-param mismatch instead of asserting. Client errors should not surface as 500s; emit a clear 400 with details.

```diff
-assert request.temperature == generation_config["temperature"]
-assert request.top_p == generation_config["top_p"]
+if (
+    request.temperature != generation_config["temperature"]
+    or request.top_p != generation_config["top_p"]
+):
+    msg = (
+        "Sampling params must match server config: "
+        f"temperature={generation_config['temperature']}, "
+        f"top_p={generation_config['top_p']}."
+    )
+    return JSONResponse(
+        content=ErrorResponse(message=msg, code=400).model_dump(),
+        status_code=400,
+    )
```
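The guard can also be factored into a plain helper that is easy to unit-test; a sketch (the helper name and dict-shaped request are illustrative, not the PR's code):

```python
from typing import Optional


def check_sampling_params(request: dict, server_config: dict) -> Optional[dict]:
    """Return None when client params match the server config, else a 400 payload."""
    mismatches = {
        key: {"got": request.get(key), "expected": server_config[key]}
        for key in ("temperature", "top_p")
        if request.get(key) != server_config[key]
    }
    if not mismatches:
        return None
    return {
        "code": 400,
        "message": f"Sampling params must match server config: {mismatches}",
    }


config = {"temperature": 1.0, "top_p": 0.95}
assert check_sampling_params({"temperature": 1.0, "top_p": 0.95}, config) is None
error = check_sampling_params({"temperature": 0.0, "top_p": 0.95}, config)
assert error is not None and error["code"] == 400
```

The route handler would then wrap a non-None result in a `JSONResponse` with `status_code=400` instead of asserting.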
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
175-177: Use logging instead of print for server start message.

```diff
-print(f"Starting server on {base_url}")
+logger.info("Starting server on %s", base_url)
```

Add once at module top:

```python
import logging

logger = logging.getLogger(__name__)
```
370-374: Suppress Ruff S104 intentionally (public bind). Since binding to 0.0.0.0 is by design here, annotate to quiet the linter.

```diff
-host="0.0.0.0",
+host="0.0.0.0",  # noqa: S104 - intentional public bind
```
151-156: Trim redundant local imports. `threading` is already imported at module scope; `Optional` too. Keep uvicorn/fastapi local for optional dependency.

```diff
-import threading
-from typing import List, Optional, Union
+from typing import List, Union
```
727-734: Simplify coroutine handling; remove double "await or coroutine" pattern. You already await `collective_rpc`; the subsequent coroutine check is dead code.

```diff
-result_or_coro = await self.llm.collective_rpc("report_device_id", args=tuple())
-
-if asyncio.iscoroutine(result_or_coro):
-    list_of_worker_results = await result_or_coro
-else:
-    list_of_worker_results = result_or_coro
+list_of_worker_results = await self.llm.collective_rpc(
+    "report_device_id", args=tuple()
+)
```

```diff
-result_or_coro = await self.llm.collective_rpc(
-    "update_weights_from_global_ipc_handles", args=(ipc_handles,)
-)
-
-if asyncio.iscoroutine(result_or_coro):
-    worker_results = await result_or_coro
-else:
-    worker_results = result_or_coro
+worker_results = await self.llm.collective_rpc(
+    "update_weights_from_global_ipc_handles", args=(ipc_handles,)
+)
```

```diff
-result_or_coro = await self.llm.collective_rpc(
-    "update_weights_from_collective", args=tuple()
-)
-
-if asyncio.iscoroutine(result_or_coro):
-    worker_results = await result_or_coro
-else:
-    worker_results = result_or_coro
+worker_results = await self.llm.collective_rpc(
+    "update_weights_from_collective", args=tuple()
+)
```

Also applies to: 761-770, 798-806
893-903: Drop type-only imports in shutdown; annotate at class level if needed. Local `from threading import Thread` / `from uvicorn import Server` are unnecessary; just use the attributes.

```diff
-if self.server_thread is not None:
-    from threading import Thread
-
-    from uvicorn import Server
-
-    self.http_server: Server
-    self.server_thread: Thread
-
-    self.http_server.should_exit = True
-    self.server_thread.join()
+if self.server_thread is not None:
+    self.http_server.should_exit = True
+    self.server_thread.join()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_rl/models/generation/vllm/vllm_generation.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
_get_free_port_local (73-81)
_get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
372-372: Possible binding to all interfaces
(S104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
100-107: Intentional off-by-one bound for append loop looks good. Acknowledging the deliberate `< len(actual_token_ids) - 1` guard to avoid consuming the final token per prior discussion.
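The merge scenario this correction targets can be illustrated with a toy vocabulary and a stand-in `decode` function (not the real tokenizer API, and much simpler than `_maybe_correct_merged_tokens` itself): re-tokenizing a string can yield one merged token where the reference had two, and the fix walks the actual tokens until their decoded prefix matches the reference string.

```python
# Toy vocabulary: id 3 is a merge of ids 1 and 2 ("a" + "b" -> "ab").
VOCAB = {1: "a", 2: "b", 3: "ab", 4: "c"}


def decode(token_ids):
    return "".join(VOCAB[t] for t in token_ids)


def align_to_reference(reference_ids, actual_ids):
    """Take tokens from `actual_ids` until they decode to the reference string."""
    reference_str = decode(reference_ids)
    candidate = []
    while decode(candidate) != reference_str and len(candidate) < len(actual_ids):
        candidate.append(actual_ids[len(candidate)])
    if decode(candidate) != reference_str:
        raise ValueError("re-tokenization difference is not a simple merge")
    return candidate


# Reference [1, 2] decodes to "ab"; re-tokenization merged it into [3].
assert align_to_reference([1, 2], [3, 4]) == [3]
```

The real function additionally enforces the prefix invariant and the final-token guard discussed above; this sketch only shows the alignment loop.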
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
388-391: Avoid client race: wait until the server is reachable before returning base_url. A small readiness probe prevents flaky callers right after startup.
thread = threading.Thread(target=server.run, daemon=True) thread.start() + # Readiness probe to avoid client races. + import socket, time + deadline = time.time() + 5 + while time.time() < deadline: + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + if s.connect_ex((node_ip, free_port)) == 0: + break + time.sleep(0.05) + return thread, base_url, server
289-298: Return 400 on sampling-param mismatch instead of assert. These are client-controlled fields; asserts produce 500s and can be stripped.
- assert request.top_k in (None, -1), ( - f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`" - ) - request.top_k = -1 + if request.top_k not in (None, -1): + msg = f"top_k must be unset or -1; got {request.top_k}." + return JSONResponse( + content=ErrorResponse(message=msg, code=400).model_dump(), + status_code=400, + ) + request.top_k = -1 @@ - assert request.temperature == generation_config["temperature"] - assert request.top_p == generation_config["top_p"] + if ( + request.temperature != generation_config["temperature"] + or request.top_p != generation_config["top_p"] + ): + msg = ( + "Sampling params must match server config: " + f"temperature={generation_config['temperature']}, " + f"top_p={generation_config['top_p']}." + ) + return JSONResponse( + content=ErrorResponse(message=msg, code=400).model_dump(), + status_code=400, + )
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
19-25: Avoid hard dependency on FastAPI/uvicorn; gate imports and quote annotations. Make the HTTP server truly optional by removing top-level imports and quoting annotations. This prevents ImportError when the server is disabled and aligns with prior feedback.
Apply:
-from typing import Any, AsyncGenerator, Optional, cast +from typing import Any, AsyncGenerator, Optional, TYPE_CHECKING, cast @@ -import uvicorn -from fastapi import FastAPI +# Optional deps: only needed when HTTP server is enabled. +if TYPE_CHECKING: + import uvicorn + from fastapi import FastAPI @@ -def _setup_vllm_openai_api_server(self, app: FastAPI) -> FastAPI: +def _setup_vllm_openai_api_server(self, app: "FastAPI") -> "FastAPI":Also applies to: 151-152
381-385: Bind to 0.0.0.0 is intentional; suppress the linter. Given the explicit decision to expose on all interfaces, annotate to silence Ruff S104.

```diff
 config = uvicorn.Config(
     app,
-    host="0.0.0.0",
+    host="0.0.0.0",  # noqa: S104 - intentional: expose to cluster network
     port=free_port,
 )
```

If you prefer repo-wide config, ignore S104 for this path in ruff.toml.
379-380: Prefer logging over print for server start message. Use a module logger for consistency and testability.

Example:

```python
import logging

logger = logging.getLogger(__name__)
# ...
logger.info("Starting server on %s", base_url)
```
904-914: Graceful HTTP server shutdown: add timeout to join. Avoid potential hangs if the server loop doesn't exit promptly.

```diff
 self.http_server.should_exit = True
-self.server_thread.join()
+self.server_thread.join(timeout=5)
+if self.server_thread.is_alive():
+    print("Warning: HTTP server thread did not exit within timeout")
```
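The join-with-timeout pattern in isolation, using a plain worker thread and an event (names are illustrative):

```python
import threading
import time


def stop_worker(
    stop_event: threading.Event, thread: threading.Thread, timeout: float = 5.0
) -> bool:
    """Signal the worker to exit, wait up to `timeout`, and report whether it stopped."""
    stop_event.set()
    thread.join(timeout=timeout)
    return not thread.is_alive()


stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)
worker.start()
time.sleep(0.05)  # let the worker reach its wait
assert stop_worker(stop, worker) is True
```

The caller can log a warning instead of blocking forever when `stop_worker` returns False, which is the failure mode the review comment is guarding against.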
344-357: Document mixed route prefixes to prevent client confusion. Since base_url includes "/v1" while /tokenize is root-scoped by design, add a short comment/doc so users don't call f"{base_url}/tokenize" and hit 404s.
E.g., add:
# Note: /tokenize is intentionally not under /v1; clients should call http://<host>:<port>/tokenize
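Deriving both routes from the reported base URL keeps clients in sync with this split; a small sketch (the address and port are made up):

```python
from urllib.parse import urlsplit, urlunsplit

base_url = "http://10.0.0.1:8000/v1"  # hypothetical value reported by the worker
parts = urlsplit(base_url)
root = urlunsplit((parts.scheme, parts.netloc, "", "", ""))

chat_url = f"{base_url}/chat/completions"  # under the /v1 prefix
tokenize_url = f"{root}/tokenize"  # intentionally at the server root

assert chat_url == "http://10.0.0.1:8000/v1/chat/completions"
assert tokenize_url == "http://10.0.0.1:8000/tokenize"
```

This avoids the "../" path trick used in some of the tests while still respecting the intentional prefix split.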
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py(4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
nemo_rl/distributed/virtual_cluster.py (2)
_get_free_port_local (73-81)
_get_node_ip_local (66-70)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
llm (391-392)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
383-383: Possible binding to all interfaces
(S104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
232-247: Guard looks correct for VLM case. Early-return when `required_prefix_token_ids` is None cleanly bypasses text-only correction. LGTM.
Please confirm tests cover both presence/absence of required_prefix_token_ids (text vs VLM prompts).
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (7)
tests/unit/models/generation/test_vllm_generation.py (7)
1101-1106: Replace fixed sleep with readiness polling and add request timeouts. Use a readiness loop to avoid flakiness and ensure all HTTP calls time out; also fail fast on non-2xx.
- # Take a short nap for the server to spinup. Maybe there is a better way to do this? - sleep(3) + # Wait for server readiness (max 60s) instead of fixed sleep + import time + deadline = time.time() + 60 + while time.time() < deadline: + try: + # Any HTTP status implies the server is up; only connectivity matters here. + requests.get(base_urls[0], timeout=1) + break + except requests.RequestException: + sleep(0.25) + else: + pytest.fail("HTTP server did not start within 60s") - # Generate and check result - response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body) + # Generate and check result + response = requests.post( + url=f"{base_urls[0]}/chat/completions", + json=body, + timeout=30, + ) + response.raise_for_status()
1191-1204: Tighten shutdown assertion: explicit exception type and timeout. Use the canonical exception import and ensure the call can't hang.
- with pytest.raises(requests.ConnectionError): + with pytest.raises(requests.exceptions.ConnectionError): requests.post( url=f"{base_urls[0]}/chat/completions", json=dict( messages=[ {"role": "user", "content": "count to 5"}, ], temperature=0.0, logprobs=True, return_tokens_as_token_ids=True, max_tokens=1, ), + timeout=3, )
1236-1243: Add the missing assertion in the no-op case. Without this, the "no-op" path is unverified.
actual_result = _maybe_correct_merged_tokens( tokenizer=tokenizer, reference_token_ids=[26951, 3834], actual_token_ids=[26951, 3834], ) expected_result = [26951, 3834] + assert expected_result == actual_result
1308-1315: Same flakiness risk: replace sleep with readiness polling and add timeouts. Mirror the earlier test's readiness loop.
- # Take a short nap for the server to spinup. Maybe there is a better way to do this? - sleep(3) + import time + deadline = time.time() + 60 + while time.time() < deadline: + try: + requests.get(base_urls[0], timeout=1) + break + except requests.RequestException: + sleep(0.25) + else: + pytest.fail("HTTP server did not start within 60s") - response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body) + from urllib.parse import urljoin + tokenize_url = urljoin(base_urls[0], "/tokenize") + response = requests.post(url=tokenize_url, json=body, timeout=10) + response.raise_for_status()
1325-1334: Reuse tokenize_url and add timeout/raise_for_status. Keeps consistency and prevents hangs.
- response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body_with_reference_token_ids) + response = requests.post( + url=tokenize_url, + json=body_with_reference_token_ids, + timeout=10, + ) + response.raise_for_status()
1337-1341: Add timeout and status check to /chat/completions. Prevents indefinite waits and surfaces server errors.
- response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body_with_reference_token_ids) + response = requests.post( + url=f"{base_urls[0]}/chat/completions", + json=body_with_reference_token_ids, + timeout=30, + ) + response.raise_for_status()
1066-1066: Guard HTTP tests with per-test timeouts. If the server stalls, these tests can hang.
- def test_vllm_http_server(cluster, tokenizer): + @pytest.mark.timeout(180) + def test_vllm_http_server(cluster, tokenizer):- def test_vllm_http_server_correct_merged_tokens_matches_baseline(cluster, tokenizer): + @pytest.mark.timeout(180) + def test_vllm_http_server_correct_merged_tokens_matches_baseline(cluster, tokenizer):Also applies to: 1269-1269
🧹 Nitpick comments (1)
tests/unit/models/generation/test_vllm_generation.py (1)
1165-1166: Avoid '/../' in URLs; build the root tokenize path explicitly and add a timeout. This is clearer and immune to base path changes.
- response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body) + from urllib.parse import urljoin + tokenize_url = urljoin(base_urls[0], "/tokenize") + response = requests.post(url=tokenize_url, json=body, timeout=10) + response.raise_for_status()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/unit/models/generation/test_vllm_generation.py(3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
tests/unit/models/generation/test_vllm_generation.py
🧬 Code graph analysis (1)
tests/unit/models/generation/test_vllm_generation.py (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
_maybe_correct_merged_tokens (39-118)
nemo_rl/models/generation/interfaces.py (1)
GenerationDatumSpec (127-158)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
VllmGeneration (47-801)
🪛 Ruff (0.12.2)
tests/unit/models/generation/test_vllm_generation.py
1105-1105: Probable use of requests call without timeout
(S113)
1165-1165: Probable use of requests call without timeout
(S113)
1193-1193: Probable use of requests call without timeout
(S113)
1313-1313: Probable use of requests call without timeout
(S113)
1326-1326: Probable use of requests call without timeout
(S113)
1337-1337: Probable use of requests call without timeout
(S113)
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
terrykong left a comment
aside from last minor comment on config, lgtm
@parthchadha to review and @yfw to suggest where to assert guard the vlm stuff for now
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (6)
tests/unit/models/generation/test_vllm_generation.py (6)
1080-1080: Protect the HTTP test with a per-test timeout.

```diff
+@pytest.mark.timeout(180)
 def test_vllm_http_server(cluster, tokenizer):
```
1249-1256: Missing assertion in the no-op case. The no-op path computes results but never asserts equality.
actual_result = _maybe_correct_merged_tokens( tokenizer=tokenizer, reference_token_ids=[26951, 3834], actual_token_ids=[26951, 3834], ) expected_result = [26951, 3834] + assert expected_result == actual_result
1115-1119: Add timeout and status check to POST /chat/completions. Prevents hangs and surfaces HTTP errors earlier.
- response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body) + response = requests.post( + url=f"{base_urls[0]}/chat/completions", json=body, timeout=REQUEST_TIMEOUT + ) + response.raise_for_status()
1373-1384: Add timeout and status check for /chat/completions.

```diff
-response = requests.post(
-    url=f"{base_urls[0]}/chat/completions", json=body_with_reference_token_ids
-)
+response = requests.post(
+    url=f"{base_urls[0]}/chat/completions",
+    json=body_with_reference_token_ids,
+    timeout=REQUEST_TIMEOUT,
+)
+response.raise_for_status()
```
1358-1371: Add timeout for /tokenize (with reference).

```diff
-response = requests.post(
-    url=f"{base_urls[0]}/../tokenize", json=body_with_reference_token_ids
-)
+response = requests.post(
+    url=tokenize_url,
+    json=body_with_reference_token_ids,
+    timeout=REQUEST_TIMEOUT,
+)
+response.raise_for_status()
```
1066-1079: Bound the readiness wait; avoid infinite loop and silent exceptions. The current `while True` + bare `pass` can hang forever. Add a deadline, small backoff, and fail fast if unreachable.
-def _wait_for_vllm_http_server_spinup(base_url: str): - while True: - try: - requests.get(base_url, timeout=5) - # We don't check the status code since there may not be a route at / - break - except ( - requests.exceptions.ConnectionError, - requests.exceptions.Timeout, - Exception, - ): - pass +def _wait_for_vllm_http_server_spinup(base_url: str, deadline_s: float = 60.0): + deadline = time.time() + deadline_s + while time.time() < deadline: + try: + # 404/405 still indicates the server is accepting connections. + requests.get(base_url, timeout=5) + return + except (requests.exceptions.ConnectionError, requests.exceptions.Timeout): + time.sleep(0.5) + pytest.fail(f"HTTP server did not start within {deadline_s:.0f}s: {base_url}")
🧹 Nitpick comments (5)
tests/unit/models/generation/test_vllm_generation.py (5)
1178-1179: Avoid "../" in URLs; add timeout and status check. Use urljoin to reach the root /tokenize route (design is intentional) and add a timeout.
- response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body) + tokenize_url = urljoin(base_urls[0].rstrip("/") + "/", "../tokenize") + response = requests.post(url=tokenize_url, json=body, timeout=REQUEST_TIMEOUT) + response.raise_for_status()
1204-1217: Add timeout to negative-case POST and use an explicit exception type. Ensure fast failure after shutdown.
- with pytest.raises(requests.ConnectionError): - requests.post( + with pytest.raises(requests.exceptions.ConnectionError): + requests.post( url=f"{base_urls[0]}/chat/completions", json=dict( messages=[ {"role": "user", "content": "count to 5"}, ], temperature=0.0, logprobs=True, return_tokens_as_token_ids=True, max_tokens=1, ), + timeout=REQUEST_TIMEOUT, )
1282-1284: Add timeout to the async HTTP/tokenization alignment test.

```diff
 @pytest.mark.asyncio
+@pytest.mark.timeout(180)
 async def test_vllm_http_server_correct_merged_tokens_matches_baseline(
```
1385-1399: Avoid using the loop variable outside the async loop (Ruff B007). Capture the last result explicitly for clarity and to appease linters.
- async for _, generate_result in vllm_generation.generate_async( + last_generate_result = None + async for _, generate_result in vllm_generation.generate_async( BatchedDataDict[GenerationDatumSpec]( { "input_ids": torch.tensor([initial_tokenized_query_ids]), "input_lengths": torch.tensor([len(initial_tokenized_query_ids)]), } ) ): - pass + last_generate_result = generate_result + + assert last_generate_result is not None - generate_generated_token_id = generate_result["output_ids"][0][ + generate_generated_token_id = last_generate_result["output_ids"][0][ len(initial_tokenized_query_ids) ].item()
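The capture-the-last-item pattern for an async generator, reduced to a standalone sketch:

```python
import asyncio
from typing import AsyncIterator, Optional


async def numbers() -> AsyncIterator[int]:
    for i in range(3):
        yield i


async def last_item(agen: AsyncIterator[int]) -> Optional[int]:
    """Iterate fully and return the final yielded value (None if empty)."""
    last: Optional[int] = None
    async for item in agen:
        last = item
    return last


assert asyncio.run(last_item(numbers())) == 2
```

Assigning inside the loop and asserting the result is not None afterwards makes the dependency on "at least one chunk" explicit instead of relying on the leaked loop variable.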
1118-1176: Optional: reduce brittleness by asserting key fields instead of full JSON payloads. Full deep equality on model outputs can break across minor dependency changes. Consider schema/partial field checks (e.g., keys present, token ids/logprob token id) after standardization.
If you want, I can refactor these assertions to schema-based checks in a follow-up.
Also applies to: 1328-1377, 1206-1217
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/models/generation/vllm/config.py (1 hunks)
tests/unit/models/generation/test_vllm_generation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_rl/models/generation/vllm/config.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:34:35.406Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.406Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
tests/unit/models/generation/test_vllm_generation.py
🧬 Code graph analysis (1)
tests/unit/models/generation/test_vllm_generation.py (4)
nemo_rl/models/generation/interfaces.py (1)
GenerationDatumSpec (127-158)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
shutdown (691-698)
generate_async (617-638)
nemo_rl/models/generation/vllm/vllm_worker_async.py (3)
_maybe_correct_merged_tokens (39-118)
shutdown (885-918)
generate_async (406-624)
nemo_rl/distributed/batched_data_dict.py (1)
BatchedDataDict (75-839)
🪛 Ruff (0.12.2)
tests/unit/models/generation/test_vllm_generation.py
1072-1077: try-except-pass detected, consider logging the exception
(S110)
1118-1118: Probable use of requests call without timeout
(S113)
1178-1178: Probable use of requests call without timeout
(S113)
1206-1206: Probable use of requests call without timeout
(S113)
1328-1328: Probable use of requests call without timeout
(S113)
1361-1361: Probable use of requests call without timeout
(S113)
1374-1374: Probable use of requests call without timeout
(S113)
1385-1385: Loop control variable generate_result not used within loop body
(B007)
🔇 Additional comments (3)
tests/unit/models/generation/test_vllm_generation.py (3)
31-33: LGTM: typed import used correctly.
35-37: LGTM: internal helper is intentionally exercised by tests.
1052-1064: LGTM: minimal helper cleanly enables HTTP server for async engine.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (3)
21-25: Drop top-level uvicorn import to keep it optional. You already import uvicorn inside `_setup_vllm_server`; the top-level import makes it a hard dep unnecessarily.

```diff
 import ray
 import torch
-import uvicorn
 from fastapi import FastAPI
 from transformers.tokenization_utils_base import PreTrainedTokenizerBase
```
103-110: Document intentional off-by-one bound to prevent future regressions. Prior discussion confirms this is intentional; add a brief comment to deter "fixes."
- elif len(candidate_str) < len(reference_str): - while ( + elif len(candidate_str) < len(reference_str): + # Intentional: do not consume the final available token; see PR #1110 discussion. + while ( candidate_str != reference_str and len(candidate_str) < len(reference_str) and len(candidate_token_ids) < len(actual_token_ids) - 1 ):
291-300: Return 400s for invalid/mismatched sampling params instead of asserting. Client errors should be explicit and non-fatal to the process.
- assert request.top_k in (None, -1), ( - f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`" - ) + if request.top_k not in (None, -1): + msg = ( + "Top-k must be unset or -1 to stay on-policy. " + f"Got top_k={request.top_k}." + ) + return JSONResponse( + content=ErrorResponse(message=msg, code=400).model_dump(), + status_code=400, + ) request.top_k = -1 @@ - assert request.temperature == generation_config["temperature"] - assert request.top_p == generation_config["top_p"] + if ( + request.temperature != generation_config["temperature"] + or request.top_p != generation_config["top_p"] + ): + msg = ( + "Sampling params must match server config to stay on-policy: " + f"temperature={generation_config['temperature']}, " + f"top_p={generation_config['top_p']}." + ) + return JSONResponse( + content=ErrorResponse(message=msg, code=400).model_dump(), + status_code=400, + )
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
63-64: Fix typo in docstring.

```diff
-... then we will need to uppdate this function.
+... then we will need to update this function.
```
380-383: Use logging instead of print for server startup.

```diff
-base_url = f"http://{node_ip}:{free_port}/v1"
-print(f"Starting server on {base_url}")
+base_url = f"http://{node_ip}:{free_port}/v1"
+from logging import getLogger
+getLogger(__name__).info("Starting server on %s", base_url)
```
390-393: Wait for readiness to avoid client race and port snatch. A small socket probe ensures the listener is up before returning.
thread = threading.Thread(target=server.run, daemon=True) thread.start() + # Basic readiness wait to avoid client race conditions. + import socket, time + deadline = time.time() + 5 + while time.time() < deadline: + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + if s.connect_ex((node_ip, free_port)) == 0: + break + time.sleep(0.05) + return thread, base_url, server
437-441: Don't use assert for a runtime precondition; raise a proper error.

```diff
-assert batch_size == 1, (
-    f"generate_async is restricted to handle only single samples, "
-    f"but received batch_size={batch_size}. Please handle batching outside this method."
-)
+if batch_size != 1:
+    raise ValueError(
+        f"generate_async expects batch_size=1; got {batch_size}. Handle batching upstream."
+    )
```
906-916: Null out server refs after shutdown to aid GC and reuse.

```diff
         if self.server_thread is not None:
             from threading import Thread
             from uvicorn import Server

             self.http_server: Server
             self.server_thread: Thread
             self.http_server.should_exit = True
             self.server_thread.join()
+            self.http_server = None
+            self.server_thread = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`nemo_rl/models/generation/vllm/vllm_worker_async.py` (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.406Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.406Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
`_get_free_port_local` (73-81)
`_get_node_ip_local` (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py
385-385: Possible binding to all interfaces
(S104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
361-389: HTTP server setup and base_url wiring look good.

Consider adding a brief docstring to `_setup_vllm_server` noting the intentional `0.0.0.0` bind and the root-level `/tokenize` path, to preempt future changes.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
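A hedged usage sketch for this feature: calling one of the per-worker OpenAI-style servers exposed when `expose_http_server=True`. The base URL would come from the worker-reported server URLs; the helper name `build_chat_request` and the model name are illustrative, not from the PR.

```python
def build_chat_request(base_url, model, prompt, temperature, top_p):
    """Build the endpoint URL and an on-policy chat-completions payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Per the server's checks: top_k must be -1, and temperature/top_p
        # must match the server's generation config to stay on-policy.
        "top_k": -1,
        "temperature": temperature,
        "top_p": top_p,
    }
    return f"{base_url}/chat/completions", payload

# Against a running worker server (requires the `requests` package):
# url, payload = build_chat_request(
#     "http://10.0.0.1:8000/v1", "my-model", "hello", temperature=1.0, top_p=1.0
# )
# resp = requests.post(url, json=payload)
```

Note that `/tokenize` is served at the root level (e.g. `http://10.0.0.1:8000/tokenize`), not under the `/v1` prefix.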
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Improvements
Refactor
Tests