
feat: Expose async vLLM engine as HTTP server #1110

Merged
terrykong merged 24 commits into main from bxyu/async-vllm-http-server
Sep 17, 2025

Conversation

@bxyu-nvidia
Contributor

@bxyu-nvidia bxyu-nvidia commented Sep 10, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
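A possible client-side sketch, inferred from this PR's unit tests (the helper names and base URL are illustrative; real URLs come from VllmGeneration.dp_openai_server_base_urls when expose_http_server is enabled):

```python
# Hypothetical client for the in-process OpenAI-compatible server added by
# this PR. build_chat_request mirrors the request body used in the PR's
# tests; the base URL below is a placeholder, not a real deployment.
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # must match the policy's sampling config
        "logprobs": True,
        "return_tokens_as_token_ids": True,
        "max_tokens": 16,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat_completion(base_url: str, prompt: str) -> dict:
    with urllib.request.urlopen(build_chat_request(base_url, prompt), timeout=30) as resp:
        return json.load(resp)

req = build_chat_request("http://127.0.0.1:8000/v1", "count to 5")
print(req.full_url)  # http://127.0.0.1:8000/v1/chat/completions
```

Per the review discussion, /tokenize is served at the server root rather than under /v1, so its URL drops the /v1 suffix.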

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Optional in-process HTTP server exposing OpenAI-compatible chat and tokenization endpoints; server base URL(s) are now surfaced.
  • Improvements

    • Token alignment/correction for merged-token scenarios.
    • Graceful server startup/shutdown integrated with worker lifecycle.
    • Public attribute reporting server base URLs for data-parallel workers.
  • Refactor

    • Simplified internal IP/port acquisition logic.
  • Tests

    • Extensive new tests for HTTP serving, token-correction, and many weight-update scenarios (duplicate HTTP-server test block present).

Signed-off-by: Brian Yu <bxyu@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Sep 10, 2025

Walkthrough

Adds an optional in-process FastAPI/uvicorn HTTP bridge for async vLLM workers, a token-sequence correction utility, new vLLM config flags, exposes per-worker OpenAI-style server base URLs from VllmGeneration, factors node IP/port helpers, and expands unit tests for HTTP, token-correction, and generation/weight/memory scenarios.
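The bridge's lifecycle can be illustrated with a dependency-free stdlib sketch (the PR itself uses FastAPI + uvicorn; the handler and helper below are illustrative stand-ins, not the repo's code): start the HTTP server on a daemon thread, surface its base URL, and stop it from the worker's shutdown path.

```python
# Illustrative stand-in for the start/report/shutdown pattern described in
# the walkthrough, using stdlib http.server instead of FastAPI/uvicorn so
# the sketch runs anywhere. The /tokenize handler is a placeholder.
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class TokenizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder tokenization: the real endpoint calls the engine's tokenizer.
        payload = json.dumps({"tokens": body.get("prompt", "").split()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the sketch quiet

def setup_http_server(host="127.0.0.1"):
    # Bind to port 0 so the OS assigns a free port, then report it back.
    server = ThreadingHTTPServer((host, 0), TokenizeHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    base_url = f"http://{host}:{server.server_address[1]}/v1"
    return thread, base_url, server

thread, base_url, server = setup_http_server()
# Shutdown mirrors the PR's worker lifecycle integration.
server.shutdown()
thread.join()
```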

Changes

  • vLLM async HTTP server & token alignment (nemo_rl/models/generation/vllm/vllm_worker_async.py): Adds _maybe_correct_merged_tokens; persists AsyncEngineArgs when creating AsyncLLM; implements an optional in-process FastAPI/uvicorn HTTP server when expose_http_server is enabled (endpoints /v1/chat/completions and /tokenize), OpenAI-serving shims, _setup_vllm_server, and report_dp_openai_server_base_url; integrates the server lifecycle (start/shutdown) with the worker.
  • vLLM generation: expose server URLs (nemo_rl/models/generation/vllm/vllm_generation.py): Adds attribute dp_openai_server_base_urls to VllmGeneration, initialized in __init__ via _report_dp_openai_server_base_urls(), which gathers per-worker base URLs when the async engine/server is active.
  • vLLM config additions (nemo_rl/models/generation/vllm/config.py): Extends the VllmSpecificArgs TypedDict with optional expose_http_server: NotRequired[bool] and http_server_serving_chat_kwargs: NotRequired[dict[str, Any]].
  • Distributed utils: local IP/port helpers (nemo_rl/distributed/virtual_cluster.py): Refactors _get_node_ip_and_free_port into two helpers, _get_node_ip_local() and _get_free_port_local(); moves the socket import into the free-port helper; external behavior unchanged.
  • Tests: HTTP, token correction, generation/weight/memory/batching (tests/unit/models/generation/test_vllm_generation.py): Adds extensive tests: HTTP server integration (chat/tokenize), unit tests for _maybe_correct_merged_tokens, and expanded vLLM generation/weight-update/memory/stop/batching/megatron scenarios. Note: duplicated HTTP server test blocks are present in the diff.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client as Client
  participant Server as In-process HTTP Server
  participant Worker as VllmAsyncGenerationWorker
  participant Engine as AsyncLLM Engine

  Note over Worker,Server: Started only when ``expose_http_server=True``
  Worker->>Worker: _create_engine() (store AsyncEngineArgs, init llm)
  Worker->>Worker: _setup_vllm_server() -> (thread, base_url, server)
  Worker-->>Caller: _report_dp_openai_server_base_urls() returns base_url list

  rect #E6F7FF
  Client->>Server: POST /v1/chat/completions
  Server->>Worker: build request (mixins, derive prefix token ids)
  Server->>Server: call _maybe_correct_merged_tokens()
  Server->>Engine: submit generation request
  Engine-->>Server: tokens / logprobs
  Server-->>Client: ChatCompletionResponse (or streaming)
  end
sequenceDiagram
  autonumber
  participant Test as Test Suite
  participant Server as In-process HTTP Server

  Test->>Server: POST /tokenize
  Server->>Server: perform tokenization via tokenizer
  Server-->>Test: tokenization response
  Test->>Server: request shutdown
  Server-->>Test: server stops
  Test-x Server: further requests -> ConnectionError

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • wangshangsam

Poem

I hopped through sockets, ports, and threads,
Spun up a server where the vLLM treads.
I stitched token seams with whiskered delight,
Chats bloom like carrots in soft morning light.
Tests dug the burrow—now everything's bright. 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "feat: Expose async vLLM engine as HTTP server" is a concise, single-sentence summary that directly describes the PR's primary change (adding optional in-process HTTP-server exposure for the async vLLM engine and related worker/server code). The phrasing is specific, avoids vague terms and unnecessary file lists, and a reviewer scanning history will understand the main intent quickly.

@bxyu-nvidia bxyu-nvidia changed the title Expose async vLLM engine as HTTP server feat: Expose async vLLM engine as HTTP server Sep 10, 2025
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (7)
nemo_rl/models/generation/vllm/config.py (1)

32-33: Config keys look good; consider a typed kwargs shape.

Optional: replace dict[str, Any] with a dedicated TypedDict (e.g., HttpServingChatKwargs) to document supported fields and catch misconfigurations at type-check time.

nemo_rl/models/generation/vllm/vllm_generation.py (1)

172-173: Capture of dp_openai_server_base_urls is fine; add an explicit type.

Avoid mypy “implicit dynamic” on a new attribute by annotating the field.

Apply:

-        # dp_openai_server_base_urls is only returned by Async vLLM flow when http server is active
-        self.dp_openai_server_base_urls = self._post_init()
+        # dp_openai_server_base_urls is only returned by Async vLLM flow when http server is active
+        self.dp_openai_server_base_urls: list[str | None] = self._post_init()
nemo_rl/distributed/virtual_cluster.py (1)

63-64: Minor: free-port race window.

_get_free_port_local() + later bind can race under load. Consider a retry loop in the server startup on EADDRINUSE (or let the server bind to port 0 and discover the actual port before exporting the base URL).
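The port-0 alternative mentioned here can be sketched with the standard library (illustrative; not the repo's _get_free_port_local):

```python
# Bind to port 0 so the kernel assigns an ephemeral port, then read the
# assigned port back before advertising the base URL. Unlike probe-then-bind,
# the socket is already held, so no other process can steal the port.
import socket

def bind_ephemeral_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))
    sock.listen(1)
    return sock, sock.getsockname()[1]

sock, port = bind_ephemeral_port()
sock.close()
```

uvicorn can be configured the same way by passing port 0 and discovering the bound port from the server's listening socket after startup, though the exact wiring depends on the uvicorn version in use.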

nemo_rl/models/generation/vllm/vllm_worker_async.py (2)

296-304: HTTP 400 is clearer than an assertion for on-policy enforcement.

Replace asserts with a 400 response to avoid 500s and to work when Python runs with -O.

Apply:

-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if request.temperature != generation_config["temperature"] or request.top_p != generation_config["top_p"]:
+                err = ErrorResponse(
+                    message="Sampling params must match policy config (temperature/top_p).",
+                    type="invalid_request_error",
+                    code=400,
+                )
+                return JSONResponse(content=err.model_dump(), status_code=400)

886-896: Shutdown guards are good; avoid re-declaring types at runtime.

The runtime type reassignments (self.http_server: Server, self.server_thread: Thread) are unnecessary here; consider removing them.

-                from threading import Thread
-
-                from uvicorn import Server
-
-                self.http_server: Server
-                self.server_thread: Thread
-
                 self.http_server.should_exit = True
                 self.server_thread.join()
tests/unit/models/generation/test_vllm_generation.py (2)

1162-1164: Avoid “..” in paths; use urljoin and add a timeout.

/v1/../tokenize works but is brittle and noisy; normalize the URL and include a timeout.

-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+    from urllib.parse import urljoin
+    tokenize_url = urljoin(base_urls[0] + "/", "../tokenize")
+    response = requests.post(url=tokenize_url, json=body, timeout=30)

If you prefer a top-level import, I can place from urllib.parse import urljoin with the others.


1189-1201: Tighten shutdown assertion: include timeout and accept timeout exceptions.

Post-shutdown, the call should fail quickly. Include a short timeout and cover Timeout as well.

-    with pytest.raises(requests.ConnectionError):
+    with pytest.raises((requests.ConnectionError, requests.Timeout)):
         requests.post(
             url=f"{base_urls[0]}/chat/completions",
             json=dict(
                 messages=[
                     {"role": "user", "content": "count to 5"},
                 ],
                 temperature=0.0,
                 logprobs=True,
                 return_tokens_as_token_ids=True,
                 max_tokens=1,
             ),
+            timeout=5,
         )

If this still flakes due to late teardown, we can loop/poll for up to ~15s before failing.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b060d1d and ce1906e.

📒 Files selected for processing (5)
  • nemo_rl/distributed/virtual_cluster.py (1 hunks)
  • nemo_rl/models/generation/vllm/config.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_generation.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
nemo_rl/distributed/virtual_cluster.py (1)
tests/unit/distributed/test_virtual_cluster.py (1)
  • test_get_node_ip_and_free_port_does_not_start_with_zero (32-43)
tests/unit/models/generation/test_vllm_generation.py (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
  • _maybe_correct_merged_tokens (36-115)
  • shutdown (867-900)
nemo_rl/models/generation/__init__.py (1)
  • configure_generation_config (24-45)
nemo_rl/distributed/worker_groups.py (1)
  • shutdown (907-981)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

365-365: Possible binding to all interfaces

(S104)

tests/unit/models/generation/test_vllm_generation.py

1102-1102: Probable use of requests call without timeout

(S113)


1162-1162: Probable use of requests call without timeout

(S113)


1190-1190: Probable use of requests call without timeout

(S113)

🔇 Additional comments (4)
nemo_rl/distributed/virtual_cluster.py (2)

66-71: LGTM: helper split improves reuse and testability.


73-82: Local free-port finder is fine; keep socket import local.

tests/unit/models/generation/test_vllm_generation.py (2)

1049-1061: Helper to enable HTTP server: LGTM.

Clear, minimal config toggle for async + HTTP exposure.


1082-1085: Make the DP server count assertion robust.

len(base_urls) == cluster.num_gpus_per_node may be brittle if DP != GPUs-per-node in some CI shapes. Prefer asserting against the actual vLLM DP size (if exposed) or at least >= 1.

For example:

# if available:
assert len(base_urls) == getattr(vllm_generation, "dp_world_size", len(base_urls)) and len(base_urls) >= 1

Would you like me to search the codebase and wire an explicit dp_world_size onto VllmGeneration for this?

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Collaborator

@terrykong terrykong left a comment


nice addition @bxyu-nvidia !

Am I correct to assume a functional test like this will come in a later PR?

@terrykong terrykong requested a review from yfw September 10, 2025 05:35
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)

101-106: Clarify intentional off-by-one bound in append loop.

The < len(actual_token_ids) - 1 guard is intentional (per prior discussion) but non-obvious. Add an inline comment to prevent future “fixes.”

-            and len(candidate_token_ids) < len(actual_token_ids) - 1
+            # Intentional: keep one token in reserve to avoid overshooting and to
+            # handle token-merge edge cases discussed in PR #1110.
+            and len(candidate_token_ids) < len(actual_token_ids) - 1

348-349: Document that /tokenize is intentionally root-scoped (not under /v1).

Prevents confusion given base_url includes /v1.

-        @app.post("/tokenize")
+        # Intentionally exposed at root (not under "/v1") for Penguin compatibility.
+        @app.post("/tokenize")
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

149-149: Remove duplicate import of threading.

threading is imported at the module level (Line 18). The local re-import is redundant.

-        import threading

73-83: Avoid bare assert in production path; raise a typed error (or degrade gracefully).

This assertion can crash request handling. Prefer raising a typed exception that the caller can translate into a structured 4xx response, or fall back to actual_token_ids with a warning.

-    assert (
-        reference_str == actual_str[: len(reference_str)]
-    ), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+    if reference_str != actual_str[: len(reference_str)]:
+        raise ValueError(
+            "Non-monotonic trajectory not attributable to token-merge during re-tokenization."
+        )

If you want, I can wire this into the FastAPI routes to return a 400 instead of a 500.


170-175: Prefer logger over print for server start message.

Route this through the repo’s logging to respect global log config.

-        base_url = f"http://{node_ip}:{free_port}/v1"
-        print(f"Starting server on {base_url}")
+        base_url = f"http://{node_ip}:{free_port}/v1"
+        import logging
+        logging.getLogger(__name__).info("Starting vLLM HTTP server on %s", base_url)

365-369: Intentional 0.0.0.0 bind: silence Ruff S104.

Binding to all interfaces is by design here (acknowledged in discussion). Add a per-line ignore to keep CI green.

-            host="0.0.0.0",
+            host="0.0.0.0",  # noqa: S104 - intentional: expose to all interfaces

888-897: Don’t hang shutdown if the server thread won’t exit.

Add a join timeout and attempt a forceful exit to avoid blocking teardown.

-                self.http_server.should_exit = True
-                self.server_thread.join()
+                self.http_server.should_exit = True
+                self.server_thread.join(timeout=5)
+                if self.server_thread.is_alive():
+                    # Try harder and avoid hanging shutdown.
+                    try:
+                        self.http_server.force_exit = True  # type: ignore[attr-defined]
+                    except Exception:
+                        pass
+                    self.server_thread.join(timeout=5)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce1906e and 78b444c.

📒 Files selected for processing (3)
  • 3rdparty/Penguin-workspace/Penguin (1 hunks)
  • nemo_rl/distributed/virtual_cluster.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_rl/distributed/virtual_cluster.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

367-367: Possible binding to all interfaces

(S104)

🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

148-148: Good fix on return annotation.

Accurate tuple type makes intent clear and helps type-checkers.

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (4)

24-24: Avoid hard dependency on uvicorn at import time.

Drop the top-level import uvicorn; you already import it lazily in _setup_vllm_server, which keeps environments without uvicorn working when HTTP is disabled.

-from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-import uvicorn
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase

100-107: Off-by-one guard is intentional; document it and add a test.

Per prior discussion, the < len(actual_token_ids) - 1 bound is deliberate. Please add an inline comment explaining the rationale and a unit test covering a “last-token merge” case so future edits don’t regress it.

-    elif len(candidate_str) < len(reference_str):
-        while (
+    elif len(candidate_str) < len(reference_str):
+        # Intentionally stop before the final token to avoid edge re-merge cases
+        # observed in our environments (see PR #1110 discussion).
+        while (
             candidate_str != reference_str
             and len(candidate_str) < len(reference_str)
             and len(candidate_token_ids) < len(actual_token_ids) - 1
         ):

365-375: Startup robustness: add readiness probe; silence Ruff S104 intentionally.

  • Binding to 0.0.0.0 is intentional here—add # noqa: S104 to quiet lint.
  • Add a short readiness loop to avoid client races immediately after returning.
         config = uvicorn.Config(
             app,
-            host="0.0.0.0",
+            host="0.0.0.0",  # noqa: S104 — intentional: bind all interfaces
             port=free_port,
         )
         server = uvicorn.Server(config=config)
 
         thread = threading.Thread(target=server.run, daemon=True)
         thread.start()
 
+        # Wait briefly until the socket is accepting connections to avoid races.
+        import socket, time
+        deadline = time.time() + 5
+        while time.time() < deadline:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex((node_ip, free_port)) == 0:
+                    break
+            time.sleep(0.05)
+
         return thread, base_url, server

299-302: Return 400 on sampling-param mismatch instead of asserting.

This avoids 500s and makes client errors explicit.

-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if (
+                request.temperature != generation_config["temperature"]
+                or request.top_p != generation_config["top_p"]
+            ):
+                msg = (
+                    "Sampling params must match server config: "
+                    f"temperature={generation_config['temperature']}, "
+                    f"top_p={generation_config['top_p']}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
🧹 Nitpick comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)

173-176: Use logging instead of print; also clarify /tokenize path difference in docs.

Replace print with logging, and ensure user-facing docs/tests call /tokenize without /v1.

-        base_url = f"http://{node_ip}:{free_port}/v1"
-        print(f"Starting server on {base_url}")
+        base_url = f"http://{node_ip}:{free_port}/v1"
+        import logging
+        logging.getLogger(__name__).info("Starting server on %s", base_url)

888-898: Graceful shutdown: guard None and add join timeout.

Avoid potential hangs if the server thread doesn’t exit promptly.

-            if self.server_thread is not None:
+            if getattr(self, "server_thread", None) is not None:
                 from threading import Thread
 
                 from uvicorn import Server
 
                 self.http_server: Server
                 self.server_thread: Thread
 
-                self.http_server.should_exit = True
-                self.server_thread.join()
+                if getattr(self, "http_server", None) is not None:
+                    self.http_server.should_exit = True
+                self.server_thread.join(timeout=5)
+                if self.server_thread.is_alive():
+                    print("Warning: HTTP server thread did not exit within timeout.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78b444c and 05ebf8b.

📒 Files selected for processing (2)
  • nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_rl/models/generation/vllm/vllm_generation.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

367-367: Possible binding to all interfaces

(S104)

🔇 Additional comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (4)

135-143: Engine init + optional HTTP exposure looks good.

Clean separation of args and conditional server startup.


144-147: Return the base_url from post_init_async—good API design.

Surfacing the URL here is convenient for callers.


148-153: Accurate return type for _setup_vllm_server.

String-annotated tuple avoids runtime import issues. LGTM.


348-361: Root-level /tokenize endpoint is fine (intentional divergence).

Acknowledging the intended API structure: /v1/chat/completions under /v1, /tokenize at root. Ensure README/usage samples reflect this.

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

24-24: Avoid hard dependency on uvicorn at import time.

uvicorn is imported both at module top and inside _setup_vllm_server. Keep the local import only so environments not using the HTTP server don't require uvicorn.

-from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-import uvicorn
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase

375-378: Add readiness probe before returning base_url.

Avoid client races and port-steal window by waiting until the listener is reachable.

         thread = threading.Thread(target=server.run, daemon=True)
         thread.start()

-        return thread, base_url, server
+        # Wait briefly for server readiness to avoid client races.
+        import socket, time
+        deadline = time.time() + 5
+        while time.time() < deadline:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex((node_ip, free_port)) == 0:
+                    break
+            time.sleep(0.05)
+
+        return thread, base_url, server
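The inline probe above can be factored into a reusable helper. A self-contained sketch (the `wait_for_port` name is hypothetical, not part of the codebase), demonstrated against a throwaway local listener:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Poll until a TCP listener is reachable, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex((host, port)) == 0:
                return True
        time.sleep(0.05)
    return False


# Demo: probe a throwaway listener on an OS-assigned port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
_, port = listener.getsockname()

ok = wait_for_port("127.0.0.1", port, timeout=2.0)
listener.close()
print(ok)  # True
```

A helper like this could also back the readiness polling suggested later for the unit tests, replacing the fixed `sleep(3)`.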

115-117: Final invariant shouldn’t be an assert.

Return a structured exception so callers can map to a client error.

-    assert candidate_str == reference_str
+    if candidate_str != reference_str:
+        raise ValueError(
+            "Unable to align reference and actual tokenization after correction."
+        )

290-319: Return 400 on sampling-param mismatches instead of asserting.

Client-visible validation improves debuggability and stability during training.

-            assert request.top_k in (None, -1), (
-                f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`"
-            )
+            if request.top_k not in (None, -1):
+                msg = f"top_k must be unset or -1; got {request.top_k}"
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
             request.top_k = -1

-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if (
+                request.temperature != generation_config["temperature"]
+                or request.top_p != generation_config["top_p"]
+            ):
+                msg = (
+                    "Sampling params must match server config: "
+                    f"temperature={generation_config['temperature']}, "
+                    f"top_p={generation_config['top_p']}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )

73-83: Don’t use assert in request path; return structured errors.

Asserts can be optimized out and yield 500s. Raise a clear error (or propagate an HTTP 400 upstream).

-    assert (
-        reference_str == actual_str[: len(reference_str)]
-    ), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+    if reference_str != actual_str[: len(reference_str)]:
+        raise ValueError(
+            "Non-monotonic trajectory not due to token-merge during re-tokenization. "
+            f"reference_str[:{len(reference_str)}]!={actual_str[:len(reference_str)]} | "
+            f"ref_ids={reference_token_ids} actual_ids={actual_token_ids}"
+        )
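The distinction matters because `python -O` strips `assert` statements entirely, while a raised exception survives and can be mapped to a client-visible error. A minimal stand-alone illustration (function names are hypothetical, not the actual worker code):

```python
def check_prefix_assert(reference: str, actual: str) -> None:
    # Stripped entirely under `python -O`; surfaces as an opaque 500 in a request path.
    assert reference == actual[: len(reference)], "non-monotonic trajectory"


def check_prefix_raise(reference: str, actual: str) -> None:
    # Survives -O, and the caller can map it to an HTTP 400 with a clear message.
    if reference != actual[: len(reference)]:
        raise ValueError("non-monotonic trajectory")


try:
    check_prefix_raise("count to 5", "count down")
    caught = None
except ValueError as exc:
    caught = str(exc)

print(caught)  # non-monotonic trajectory
```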
🧹 Nitpick comments (9)
nemo_rl/models/generation/vllm/vllm_worker_async.py (9)

100-107: Intentional off-by-one: document it inline.

Per author note, excluding the last token is intentional. Add an explanatory comment to prevent regressions.

-    elif len(candidate_str) < len(reference_str):
+    elif len(candidate_str) < len(reference_str):
         while (
             candidate_str != reference_str
             and len(candidate_str) < len(reference_str)
-            and len(candidate_token_ids) < len(actual_token_ids) - 1
+            # Intentionally stop before consuming the final token in actual_token_ids.
+            # This avoids out-of-bounds on the next append and handles edge merges at sequence end.
+            # See PR discussion: off-by-one is deliberate for vLLM re-tokenization behavior.
+            and len(candidate_token_ids) < len(actual_token_ids) - 1
         ):
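A toy illustration of the merge scenario this guard handles, using a made-up four-entry vocabulary rather than a real tokenizer (this is a simplified sketch, not the actual `_maybe_correct_merged_tokens` implementation and without its final-token guard):

```python
# Hypothetical vocabulary: id 3 is what ids 1 and 2 merge into on re-tokenization.
VOCAB = {0: "count", 1: " to", 2: "5", 3: " to5"}


def decode(ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in ids)


def correct_merged_tokens(reference_ids: list[int], actual_ids: list[int]) -> list[int]:
    """Grow a candidate from actual_ids until it decodes to the reference string."""
    reference_str = decode(reference_ids)
    candidate: list[int] = []
    while decode(candidate) != reference_str and len(candidate) < len(actual_ids):
        candidate.append(actual_ids[len(candidate)])
    if decode(candidate) != reference_str:
        raise ValueError("unable to align reference and actual tokenizations")
    return candidate


# [0, 1, 2] decodes to "count to5"; re-tokenization merged it into [0, 3].
corrected = correct_merged_tokens([0, 1, 2], [0, 3])
print(corrected)  # [0, 3]
```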

170-175: Prefer logger over print for server banner.

Use module logger to integrate with existing logging.

-        base_url = f"http://{node_ip}:{free_port}/v1"
-        print(f"Starting server on {base_url}")
+        base_url = f"http://{node_ip}:{free_port}/v1"
+        logger.info("Starting vLLM HTTP server on %s", base_url)

Add once at module top (outside this hunk):

import logging
logger = logging.getLogger(__name__)

351-364: /tokenize sits at root by design—ensure docs/tests reflect it.

Since base_url includes /v1 while /tokenize does not, please document that clients should call {base_url}/../tokenize or separately advertise the tokenize URL.


368-372: Binding to 0.0.0.0 is intentional—add a comment and optional override.

To preempt security lint and future changes, document intent and optionally support an env override.

-        config = uvicorn.Config(
-            app,
-            host="0.0.0.0",
+        # Intentionally bind to all interfaces (cluster visibility).
+        # NOTE: In untrusted environments, set NEMORL_VLLM_HTTP_HOST to limit exposure.
+        config = uvicorn.Config(
+            app,
+            host=os.getenv("NEMORL_VLLM_HTTP_HOST", "0.0.0.0"),
             port=free_port,
         )

Add once at module top (outside this hunk):

import os

725-732: Drop dead code: you already awaited collective_rpc.

asyncio.iscoroutine(result_or_coro) will always be False here. Simplify.

-        result_or_coro = await self.llm.collective_rpc("report_device_id", args=tuple())
-
-        if asyncio.iscoroutine(result_or_coro):
-            list_of_worker_results = await result_or_coro
-        else:
-            list_of_worker_results = result_or_coro
+        list_of_worker_results = await self.llm.collective_rpc(
+            "report_device_id", args=tuple()
+        )

760-769: Same coroutine check issue as above.

-            result_or_coro = await self.llm.collective_rpc(
-                "update_weights_from_global_ipc_handles", args=(ipc_handles,)
-            )
-
-            if asyncio.iscoroutine(result_or_coro):
-                worker_results = await result_or_coro
-            else:
-                worker_results = result_or_coro
+            worker_results = await self.llm.collective_rpc(
+                "update_weights_from_global_ipc_handles", args=(ipc_handles,)
+            )

800-803: Same coroutine check issue here too.

-            if asyncio.iscoroutine(result_or_coro):
-                worker_results = await result_or_coro
-            else:
-                worker_results = result_or_coro
+            worker_results = result_or_coro

Note: Or simply worker_results = await self.llm.collective_rpc("update_weights_from_collective", args=tuple()) as above.
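The reason the check is dead code: `await` on `collective_rpc(...)` already yields the final value, so `asyncio.iscoroutine` on the result is always False. A stand-alone illustration with a stand-in coroutine (not the vLLM API):

```python
import asyncio


async def collective_rpc(method: str) -> list[str]:
    # Stand-in for the engine RPC; just returns per-worker results.
    return [f"{method}:ok"]


async def main() -> tuple[bool, list[str]]:
    result = await collective_rpc("report_device_id")
    # After `await`, `result` is the plain return value, never a coroutine.
    return asyncio.iscoroutine(result), result


is_coro, result = asyncio.run(main())
print(is_coro, result)  # False ['report_device_id:ok']
```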


891-901: Make HTTP server shutdown non-blocking and safe.

Add a join timeout and guard when server didn’t start.

-            if self.server_thread is not None:
+            if getattr(self, "server_thread", None) is not None:
                 from threading import Thread

                 from uvicorn import Server

                 self.http_server: Server
                 self.server_thread: Thread

                 self.http_server.should_exit = True
-                self.server_thread.join()
+                self.server_thread.join(timeout=5)
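The stop-flag-plus-bounded-join pattern generalizes to any background thread; a sketch with `threading.Event` standing in for uvicorn's `should_exit` (the `Worker` class and its names are illustrative, not from the codebase):

```python
import threading
import time


class Worker:
    def __init__(self) -> None:
        self._stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Stand-in for server.run(): loop until asked to exit.
        while not self._stop.is_set():
            time.sleep(0.01)

    def start(self) -> None:
        self.thread.start()

    def shutdown(self, timeout: float = 5.0) -> bool:
        # Mirrors: http_server.should_exit = True; server_thread.join(timeout).
        self._stop.set()
        self.thread.join(timeout=timeout)
        return not self.thread.is_alive()


w = Worker()
w.start()
stopped = w.shutdown()
print(stopped)  # True
```

The bounded join keeps shutdown from hanging if the loop fails to exit; the return value lets the caller log a warning instead of blocking forever.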

320-323: Add a simple health endpoint.

A /healthz (200 OK) helps readiness checks and orchestration.

         ########################################
         # /tokenize endpoint
         ########################################
 
+        @app.get("/healthz")
+        async def healthz():
+            return {"status": "ok"}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05ebf8b and 377b0a0.

📒 Files selected for processing (1)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

370-370: Possible binding to all interfaces

(S104)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)

135-143: Engine setup + optional HTTP exposure looks good.

Capturing AsyncEngineArgs for reuse and gating server spin-up behind config is clean.


144-147: post_init_async API shape is appropriate.

Returning Optional[str] to expose base_url when enabled is clear.

Signed-off-by: Brian Yu <bxyu@nvidia.com>
… report device id

Signed-off-by: Brian Yu <bxyu@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

377-380: Add readiness probe to avoid client race after start().

Prevents early requests from failing before the listener is ready.

         thread = threading.Thread(target=server.run, daemon=True)
         thread.start()
 
+        # Wait briefly for server readiness to avoid client races.
+        import socket, time
+        deadline = time.time() + 5
+        while time.time() < deadline:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex((node_ip, free_port)) == 0:
+                    break
+            time.sleep(0.05)
+
         return thread, base_url, server

23-23: Remove top‑level uvicorn import to keep it optional.

This makes uvicorn a hard dependency even when the HTTP server isn’t used. You already import it lazily inside _setup_vllm_server().

-import uvicorn

Optionally, for type checkers only:

# at top-level
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    import uvicorn  # noqa: F401

73-83: Replace assert with a structured error (non‑monotonic trajectory).

assert can be stripped with -O and yields 500s. Raise a clear exception instead.

-    assert (
-        reference_str == actual_str[: len(reference_str)]
-    ), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
+    if reference_str != actual_str[: len(reference_str)]:
+        raise ValueError(
+            "Non-monotonic trajectory not caused by token merge during re-tokenization."
+        )

115-117: Replace terminal assert with a structured error.

Avoid assert for request-path invariants.

-    assert candidate_str == reference_str
+    if candidate_str != reference_str:
+        raise ValueError(
+            "Failed to align tokens after correction; likely non-merge re-tokenization."
+        )

303-307: Return 400 on sampling‑param mismatch instead of asserting.

Client errors should not surface as 500s; emit a clear 400 with details.

-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if (
+                request.temperature != generation_config["temperature"]
+                or request.top_p != generation_config["top_p"]
+            ):
+                msg = (
+                    "Sampling params must match server config: "
+                    f"temperature={generation_config['temperature']}, "
+                    f"top_p={generation_config['top_p']}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

175-177: Use logging instead of print for server start message.

-        print(f"Starting server on {base_url}")
+        logger.info("Starting server on %s", base_url)

Add once at module top:

import logging
logger = logging.getLogger(__name__)

370-374: Suppress Ruff S104 intentionally (public bind).

Since binding to 0.0.0.0 is by design here, annotate to quiet the linter.

-            host="0.0.0.0",
+            host="0.0.0.0",  # noqa: S104 - intentional public bind

151-156: Trim redundant local imports.

threading is already imported at module scope; Optional too. Keep uvicorn/fastapi local for optional dependency.

-        import threading
-        from typing import List, Optional, Union
+        from typing import List, Union

727-734: Simplify coroutine handling; remove double “await or coroutine” pattern.

You already await collective_rpc; the subsequent coroutine check is dead code.

-        result_or_coro = await self.llm.collective_rpc("report_device_id", args=tuple())
-
-        if asyncio.iscoroutine(result_or_coro):
-            list_of_worker_results = await result_or_coro
-        else:
-            list_of_worker_results = result_or_coro
+        list_of_worker_results = await self.llm.collective_rpc(
+            "report_device_id", args=tuple()
+        )
-            result_or_coro = await self.llm.collective_rpc(
-                "update_weights_from_global_ipc_handles", args=(ipc_handles,)
-            )
-
-            if asyncio.iscoroutine(result_or_coro):
-                worker_results = await result_or_coro
-            else:
-                worker_results = result_or_coro
+            worker_results = await self.llm.collective_rpc(
+                "update_weights_from_global_ipc_handles", args=(ipc_handles,)
+            )
-            result_or_coro = await self.llm.collective_rpc(
-                "update_weights_from_collective", args=tuple()
-            )
-
-            if asyncio.iscoroutine(result_or_coro):
-                worker_results = await result_or_coro
-            else:
-                worker_results = result_or_coro
+            worker_results = await self.llm.collective_rpc(
+                "update_weights_from_collective", args=tuple()
+            )

Also applies to: 761-770, 798-806


893-903: Drop type‑only imports in shutdown; annotate at class level if needed.

Local from threading import Thread / from uvicorn import Server are unnecessary; just use the attributes.

-            if self.server_thread is not None:
-                from threading import Thread
-
-                from uvicorn import Server
-
-                self.http_server: Server
-                self.server_thread: Thread
-
-                self.http_server.should_exit = True
-                self.server_thread.join()
+            if self.server_thread is not None:
+                self.http_server.should_exit = True
+                self.server_thread.join()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 549ea1c and e59261a.

📒 Files selected for processing (2)
  • nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_rl/models/generation/vllm/vllm_generation.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

372-372: Possible binding to all interfaces

(S104)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

100-107: Intentional off‑by‑one bound for append loop looks good.

Acknowledging the deliberate < len(actual_token_ids) - 1 guard to avoid consuming the final token per prior discussion.

Signed-off-by: Brian Yu <bxyu@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)

388-391: Avoid client race: wait until the server is reachable before returning base_url.

Small readiness probe prevents flaky callers right after startup.

         thread = threading.Thread(target=server.run, daemon=True)
         thread.start()
 
+        # Readiness probe to avoid client races.
+        import socket, time
+        deadline = time.time() + 5
+        while time.time() < deadline:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex((node_ip, free_port)) == 0:
+                    break
+            time.sleep(0.05)
+
         return thread, base_url, server

289-298: Return 400 on sampling-param mismatch instead of assert.

These are client-controlled fields; asserts produce 500s and can be stripped.

-            assert request.top_k in (None, -1), (
-                f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`"
-            )
-            request.top_k = -1
+            if request.top_k not in (None, -1):
+                msg = f"top_k must be unset or -1; got {request.top_k}."
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
+            request.top_k = -1
@@
-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if (
+                request.temperature != generation_config["temperature"]
+                or request.top_p != generation_config["top_p"]
+            ):
+                msg = (
+                    "Sampling params must match server config: "
+                    f"temperature={generation_config['temperature']}, "
+                    f"top_p={generation_config['top_p']}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

19-25: Avoid hard dependency on FastAPI/uvicorn; gate imports and quote annotations.

Make the HTTP server truly optional by removing top-level imports and quoting annotations. This prevents ImportError when the server is disabled and aligns with prior feedback.

Apply:

-from typing import Any, AsyncGenerator, Optional, cast
+from typing import Any, AsyncGenerator, Optional, TYPE_CHECKING, cast
@@
-import uvicorn
-from fastapi import FastAPI
+# Optional deps: only needed when HTTP server is enabled.
+if TYPE_CHECKING:
+    import uvicorn
+    from fastapi import FastAPI
@@
-def _setup_vllm_openai_api_server(self, app: FastAPI) -> FastAPI:
+def _setup_vllm_openai_api_server(self, app: "FastAPI") -> "FastAPI":

Also applies to: 151-152


381-385: Bind to 0.0.0.0 is intentional—suppress the linter.

Given the explicit decision to expose on all interfaces, annotate to silence Ruff S104.

         config = uvicorn.Config(
             app,
-            host="0.0.0.0",
+            host="0.0.0.0",  # noqa: S104 - intentional: expose to cluster network
             port=free_port,
         )

If you prefer repo-wide config, ignore S104 for this path in ruff.toml.


379-380: Prefer logging over print for server start message.

Use module logger for consistency and testability.

Example:

import logging
logger = logging.getLogger(__name__)
# ...
logger.info("Starting server on %s", base_url)

904-914: Graceful HTTP server shutdown: add timeout to join.

Avoid potential hangs if the server loop doesn’t exit promptly.

                 self.http_server.should_exit = True
-                self.server_thread.join()
+                self.server_thread.join(timeout=5)
+                if self.server_thread.is_alive():
+                    print("Warning: HTTP server thread did not exit within timeout")

344-357: Document mixed route prefixes to prevent client confusion.

Since base_url includes “/v1” while /tokenize is root-scoped by design, add a short comment/doc so users don’t call f"{base_url}/tokenize" and hit 404s.

E.g., add:

# Note: /tokenize is intentionally not under /v1; clients should call http://<host>:<port>/tokenize
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e59261a and ae13a07.

📒 Files selected for processing (1)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.319Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.319Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.395Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.395Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.800Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.800Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • llm (391-392)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

383-383: Possible binding to all interfaces

(S104)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

232-247: Guard looks correct for VLM case.

Early-return when required_prefix_token_ids is None cleanly bypasses text-only correction. LGTM.

Please confirm tests cover both presence/absence of required_prefix_token_ids (text vs VLM prompts).

Signed-off-by: Brian Yu <bxyu@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (7)
tests/unit/models/generation/test_vllm_generation.py (7)

1101-1106: Replace fixed sleep with readiness polling and add request timeouts.

Use a readiness loop to avoid flakiness and ensure all HTTP calls time out; also fail fast on non-2xx.

-    # Take a short nap for the server to spinup. Maybe there is a better way to do this?
-    sleep(3)
+    # Wait for server readiness (max 60s) instead of fixed sleep
+    import time
+    deadline = time.time() + 60
+    while time.time() < deadline:
+        try:
+            # Any HTTP status implies the server is up; only connectivity matters here.
+            requests.get(base_urls[0], timeout=1)
+            break
+        except requests.RequestException:
+            sleep(0.25)
+    else:
+        pytest.fail("HTTP server did not start within 60s")
 
-    # Generate and check result
-    response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body)
+    # Generate and check result
+    response = requests.post(
+        url=f"{base_urls[0]}/chat/completions",
+        json=body,
+        timeout=30,
+    )
+    response.raise_for_status()

1191-1204: Tighten shutdown assertion: explicit exception type and timeout.

Use the canonical exception import and ensure the call can’t hang.

-    with pytest.raises(requests.ConnectionError):
+    with pytest.raises(requests.exceptions.ConnectionError):
         requests.post(
             url=f"{base_urls[0]}/chat/completions",
             json=dict(
                 messages=[
                     {"role": "user", "content": "count to 5"},
                 ],
                 temperature=0.0,
                 logprobs=True,
                 return_tokens_as_token_ids=True,
                 max_tokens=1,
             ),
+            timeout=3,
         )

1236-1243: Add the missing assertion in the no‑op case.

Without this, the “no-op” path is unverified.

     actual_result = _maybe_correct_merged_tokens(
         tokenizer=tokenizer,
         reference_token_ids=[26951, 3834],
         actual_token_ids=[26951, 3834],
     )
     expected_result = [26951, 3834]
+    assert expected_result == actual_result

1308-1315: Same flakiness risk: replace sleep with readiness polling and add timeouts.

Mirror the earlier test’s readiness loop.

-    # Take a short nap for the server to spinup. Maybe there is a better way to do this?
-    sleep(3)
+    import time
+    deadline = time.time() + 60
+    while time.time() < deadline:
+        try:
+            requests.get(base_urls[0], timeout=1)
+            break
+        except requests.RequestException:
+            sleep(0.25)
+    else:
+        pytest.fail("HTTP server did not start within 60s")
 
-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+    from urllib.parse import urljoin
+    tokenize_url = urljoin(base_urls[0], "/tokenize")
+    response = requests.post(url=tokenize_url, json=body, timeout=10)
+    response.raise_for_status()

1325-1334: Reuse tokenize_url and add timeout/raise_for_status.

Keeps consistency and prevents hangs.

-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body_with_reference_token_ids)
+    response = requests.post(
+        url=tokenize_url,
+        json=body_with_reference_token_ids,
+        timeout=10,
+    )
+    response.raise_for_status()

1337-1341: Add timeout and status check to /chat/completions.

Prevents indefinite waits and surfaces server errors.

-    response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body_with_reference_token_ids)
+    response = requests.post(
+        url=f"{base_urls[0]}/chat/completions",
+        json=body_with_reference_token_ids,
+        timeout=30,
+    )
+    response.raise_for_status()

1066-1066: Guard HTTP tests with per‑test timeouts.

If the server stalls, these tests can hang.

- def test_vllm_http_server(cluster, tokenizer):
+ @pytest.mark.timeout(180)
+ def test_vllm_http_server(cluster, tokenizer):
- def test_vllm_http_server_correct_merged_tokens_matches_baseline(cluster, tokenizer):
+ @pytest.mark.timeout(180)
+ def test_vllm_http_server_correct_merged_tokens_matches_baseline(cluster, tokenizer):

Also applies to: 1269-1269

🧹 Nitpick comments (1)
tests/unit/models/generation/test_vllm_generation.py (1)

1165-1166: Avoid '/../' in URLs; build the root tokenize path explicitly and add a timeout.

This is clearer and immune to base path changes.

-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+    from urllib.parse import urljoin
+    tokenize_url = urljoin(base_urls[0], "/tokenize")
+    response = requests.post(url=tokenize_url, json=body, timeout=10)
+    response.raise_for_status()
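For reference, `urljoin` resolves an absolute-path target against the host alone, which is what makes the root-level `/tokenize` route reachable from a `/v1` base without the `"../"` trick (the URL below is a placeholder):

```python
from urllib.parse import urljoin

base = "http://10.0.0.5:8000/v1"  # placeholder base URL in the /v1 style

# An absolute path ("/tokenize") discards the base's /v1 path entirely.
root_tokenize = urljoin(base, "/tokenize")

# A relative "../tokenize" only climbs out of /v1 if the base ends in "/".
relative_tokenize = urljoin(base + "/", "../tokenize")

print(root_tokenize)      # http://10.0.0.5:8000/tokenize
print(relative_tokenize)  # http://10.0.0.5:8000/tokenize
```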
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae13a07 and f3b07b0.

📒 Files selected for processing (1)
  • tests/unit/models/generation/test_vllm_generation.py (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • tests/unit/models/generation/test_vllm_generation.py
🧬 Code graph analysis (1)
tests/unit/models/generation/test_vllm_generation.py (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
  • _maybe_correct_merged_tokens (39-118)
nemo_rl/models/generation/interfaces.py (1)
  • GenerationDatumSpec (127-158)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • VllmGeneration (47-801)
🪛 Ruff (0.12.2)
tests/unit/models/generation/test_vllm_generation.py

1105-1105: Probable use of requests call without timeout

(S113)


1165-1165: Probable use of requests call without timeout

(S113)


1193-1193: Probable use of requests call without timeout

(S113)


1313-1313: Probable use of requests call without timeout

(S113)


1326-1326: Probable use of requests call without timeout

(S113)


1337-1337: Probable use of requests call without timeout

(S113)

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Comment thread nemo_rl/models/generation/vllm/config.py
terrykong
terrykong previously approved these changes Sep 16, 2025
Collaborator

@terrykong terrykong left a comment


aside from last minor comment on config, lgtm

@parthchadha to review and @yfw to suggest where to assert guard the vlm stuff for now

Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
parthchadha
parthchadha previously approved these changes Sep 16, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (6)
tests/unit/models/generation/test_vllm_generation.py (6)

1080-1080: Protect the HTTP test with a per‑test timeout.

+@pytest.mark.timeout(180)
 def test_vllm_http_server(cluster, tokenizer):

1249-1256: Missing assertion in the no‑op case.

The no‑op path computes results but never asserts equality.

     actual_result = _maybe_correct_merged_tokens(
         tokenizer=tokenizer,
         reference_token_ids=[26951, 3834],
         actual_token_ids=[26951, 3834],
     )
     expected_result = [26951, 3834]
+    assert expected_result == actual_result

1115-1119: Add timeout and status check to POST /chat/completions.

Prevents hangs and surfaces HTTP errors earlier.

-    response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body)
+    response = requests.post(
+        url=f"{base_urls[0]}/chat/completions", json=body, timeout=REQUEST_TIMEOUT
+    )
+    response.raise_for_status()

1373-1384: Add timeout and status check for /chat/completions.

-    response = requests.post(
-        url=f"{base_urls[0]}/chat/completions", json=body_with_reference_token_ids
-    )
+    response = requests.post(
+        url=f"{base_urls[0]}/chat/completions",
+        json=body_with_reference_token_ids,
+        timeout=REQUEST_TIMEOUT,
+    )
+    response.raise_for_status()

1358-1371: Add timeout for /tokenize (with reference).

-    response = requests.post(
-        url=f"{base_urls[0]}/../tokenize", json=body_with_reference_token_ids
-    )
+    response = requests.post(
+        url=tokenize_url,
+        json=body_with_reference_token_ids,
+        timeout=REQUEST_TIMEOUT,
+    )
+    response.raise_for_status()

1066-1079: Bound the readiness wait; avoid infinite loop and silent exceptions.

Current while True + bare pass can hang forever. Add a deadline, small backoff, and fail fast if unreachable.

-def _wait_for_vllm_http_server_spinup(base_url: str):
-    while True:
-        try:
-            requests.get(base_url, timeout=5)
-            # We don't check the status code since there may not be a route at /
-            break
-        except (
-            requests.exceptions.ConnectionError,
-            requests.exceptions.Timeout,
-            Exception,
-        ):
-            pass
+def _wait_for_vllm_http_server_spinup(base_url: str, deadline_s: float = 60.0):
+    deadline = time.time() + deadline_s
+    while time.time() < deadline:
+        try:
+            # 404/405 still indicates the server is accepting connections.
+            requests.get(base_url, timeout=5)
+            return
+        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
+            time.sleep(0.5)
+    pytest.fail(f"HTTP server did not start within {deadline_s:.0f}s: {base_url}")
🧹 Nitpick comments (5)
tests/unit/models/generation/test_vllm_generation.py (5)

1178-1179: Avoid “../” in URLs; add timeout and status check.

Use urljoin to reach the root /tokenize route (design is intentional) and add a timeout.

-    response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+    tokenize_url = urljoin(base_urls[0].rstrip("/") + "/", "../tokenize")
+    response = requests.post(url=tokenize_url, json=body, timeout=REQUEST_TIMEOUT)
+    response.raise_for_status()

1204-1217: Add timeout to negative‑case POST and use explicit exceptions type.

Ensure fast failure after shutdown.

-    with pytest.raises(requests.ConnectionError):
-        requests.post(
+    with pytest.raises(requests.exceptions.ConnectionError):
+        requests.post(
             url=f"{base_urls[0]}/chat/completions",
             json=dict(
                 messages=[
                     {"role": "user", "content": "count to 5"},
                 ],
                 temperature=0.0,
                 logprobs=True,
                 return_tokens_as_token_ids=True,
                 max_tokens=1,
             ),
+            timeout=REQUEST_TIMEOUT,
         )

1282-1284: Add timeout to the async HTTP/tokenization alignment test.

 @pytest.mark.asyncio
+@pytest.mark.timeout(180)
 async def test_vllm_http_server_correct_merged_tokens_matches_baseline(

1385-1399: Avoid using loop variable outside the async loop (Ruff B007).

Capture the last result explicitly for clarity and to appease linters.

-    async for _, generate_result in vllm_generation.generate_async(
+    last_generate_result = None
+    async for _, generate_result in vllm_generation.generate_async(
         BatchedDataDict[GenerationDatumSpec](
             {
                 "input_ids": torch.tensor([initial_tokenized_query_ids]),
                 "input_lengths": torch.tensor([len(initial_tokenized_query_ids)]),
             }
         )
     ):
-        pass
+        last_generate_result = generate_result
+
+    assert last_generate_result is not None
-    generate_generated_token_id = generate_result["output_ids"][0][
+    generate_generated_token_id = last_generate_result["output_ids"][0][
         len(initial_tokenized_query_ids)
     ].item()

1118-1176: Optional: reduce brittleness by asserting key fields instead of full JSON payloads.

Full deep equality on model outputs can break across minor dependency changes. Consider schema/partial field checks (e.g., keys present, token ids/logprob token id) after standardization.

If you want, I can refactor these assertions to schema-based checks in a follow-up.

Also applies to: 1328-1377, 1206-1217
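As a sketch of what such schema-based checks could look like — the payload below is fabricated, with field names following the OpenAI chat-completions shape rather than actual model output:

```python
# Hypothetical response payload; values are made up for illustration.
payload = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "1 2 3 4 5"},
            "logprobs": {"content": [{"token": "token_id:16", "logprob": -0.01}]},
        }
    ],
    "model": "test-model",
}

def assert_chat_completion_shape(data):
    # Assert structure and key fields rather than deep-equality on full JSON.
    assert "choices" in data and len(data["choices"]) == 1
    msg = data["choices"][0]["message"]
    assert msg["role"] == "assistant"
    assert isinstance(msg["content"], str)
    entries = data["choices"][0]["logprobs"]["content"]
    assert all("token" in e and "logprob" in e for e in entries)

assert_chat_completion_shape(payload)  # passes silently
```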

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17aec53 and 66f1fc5.

📒 Files selected for processing (2)
  • nemo_rl/models/generation/vllm/config.py (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_rl/models/generation/vllm/config.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • tests/unit/models/generation/test_vllm_generation.py
📚 Learning: 2025-09-10T05:34:35.406Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.406Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • tests/unit/models/generation/test_vllm_generation.py
🧬 Code graph analysis (1)
tests/unit/models/generation/test_vllm_generation.py (4)
nemo_rl/models/generation/interfaces.py (1)
  • GenerationDatumSpec (127-158)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
  • shutdown (691-698)
  • generate_async (617-638)
nemo_rl/models/generation/vllm/vllm_worker_async.py (3)
  • _maybe_correct_merged_tokens (39-118)
  • shutdown (885-918)
  • generate_async (406-624)
nemo_rl/distributed/batched_data_dict.py (1)
  • BatchedDataDict (75-839)
🪛 Ruff (0.12.2)
tests/unit/models/generation/test_vllm_generation.py

1072-1077: try-except-pass detected, consider logging the exception

(S110)


1118-1118: Probable use of requests call without timeout

(S113)


1178-1178: Probable use of requests call without timeout

(S113)


1206-1206: Probable use of requests call without timeout

(S113)


1328-1328: Probable use of requests call without timeout

(S113)


1361-1361: Probable use of requests call without timeout

(S113)


1374-1374: Probable use of requests call without timeout

(S113)


1385-1385: Loop control variable generate_result not used within loop body

(B007)

🔇 Additional comments (3)
tests/unit/models/generation/test_vllm_generation.py (3)

31-33: LGTM: typed import used correctly.


35-37: LGTM: internal helper is intentionally exercised by tests.


1052-1064: LGTM: minimal helper cleanly enables HTTP server for async engine.

Comment thread tests/unit/models/generation/test_vllm_generation.py
Comment thread tests/unit/models/generation/test_vllm_generation.py
Comment thread tests/unit/models/generation/test_vllm_generation.py
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
nemo_rl/models/generation/vllm/vllm_worker_async.py (3)

21-25: Drop top-level uvicorn import to keep it optional.

You already import uvicorn inside _setup_vllm_server; the top-level import makes it a hard dep unnecessarily.

 import ray
 import torch
-import uvicorn
 from fastapi import FastAPI
 from transformers.tokenization_utils_base import PreTrainedTokenizerBase

103-110: Document intentional off‑by‑one bound to prevent future regressions.

Prior discussion confirms this is intentional; add a brief comment to deter “fixes.”

-    elif len(candidate_str) < len(reference_str):
-        while (
+    elif len(candidate_str) < len(reference_str):
+        # Intentional: do not consume the final available token; see PR #1110 discussion.
+        while (
             candidate_str != reference_str
             and len(candidate_str) < len(reference_str)
             and len(candidate_token_ids) < len(actual_token_ids) - 1
         ):
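To make the intent of that bound concrete, here is a hedged sketch of the alignment loop — the `decode` helper, function name, and toy vocabulary are all illustrative; the real logic lives in `_maybe_correct_merged_tokens`:

```python
def align_prefix(decode, reference_ids, actual_ids):
    """Grow a prefix of actual_ids until its decoded text matches the
    decoded reference text, never consuming the final actual token."""
    reference_str = decode(reference_ids)
    candidate_ids = []
    candidate_str = ""
    while (
        candidate_str != reference_str
        and len(candidate_str) < len(reference_str)
        and len(candidate_ids) < len(actual_ids) - 1  # intentional off-by-one
    ):
        candidate_ids.append(actual_ids[len(candidate_ids)])
        candidate_str = decode(candidate_ids)
    return candidate_ids

# Toy vocabulary: the tokenizer merged "he" + "llo" into a single token.
vocab = {1: "he", 2: "llo", 3: "hello", 4: "!"}
decode = lambda ids: "".join(vocab[i] for i in ids)

print(align_prefix(decode, reference_ids=[1, 2], actual_ids=[3, 4]))  # [3]
```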

291-300: Return 400s for invalid/mismatched sampling params instead of asserting.

Client errors should be explicit and non-fatal to the process.

-            assert request.top_k in (None, -1), (
-                f"Top k sampling parameter must be unset, empty, or -1. Got `{request.top_k}`"
-            )
+            if request.top_k not in (None, -1):
+                msg = (
+                    "Top-k must be unset or -1 to stay on-policy. "
+                    f"Got top_k={request.top_k}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
             request.top_k = -1
@@
-            assert request.temperature == generation_config["temperature"]
-            assert request.top_p == generation_config["top_p"]
+            if (
+                request.temperature != generation_config["temperature"]
+                or request.top_p != generation_config["top_p"]
+            ):
+                msg = (
+                    "Sampling params must match server config to stay on-policy: "
+                    f"temperature={generation_config['temperature']}, "
+                    f"top_p={generation_config['top_p']}."
+                )
+                return JSONResponse(
+                    content=ErrorResponse(message=msg, code=400).model_dump(),
+                    status_code=400,
+                )
🧹 Nitpick comments (5)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

63-64: Fix typo in docstring.

-    ... then we will need to uppdate this function.
+    ... then we will need to update this function.

380-383: Use logging instead of print for server startup.

-        base_url = f"http://{node_ip}:{free_port}/v1"
-        print(f"Starting server on {base_url}")
+        base_url = f"http://{node_ip}:{free_port}/v1"
+        from logging import getLogger
+        getLogger(__name__).info("Starting server on %s", base_url)

390-393: Wait for readiness to avoid client race and port snatch.

Small socket probe ensures the listener is up before returning.

         thread = threading.Thread(target=server.run, daemon=True)
         thread.start()
 
+        # Basic readiness wait to avoid client race conditions.
+        import socket
+        import time
+        deadline = time.time() + 5
+        while time.time() < deadline:
+            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                if s.connect_ex((node_ip, free_port)) == 0:
+                    break
+            time.sleep(0.05)
+
         return thread, base_url, server

437-441: Don’t use assert for runtime precondition; raise a proper error.

-        assert batch_size == 1, (
-            f"generate_async is restricted to handle only single samples, "
-            f"but received batch_size={batch_size}. Please handle batching outside this method."
-        )
+        if batch_size != 1:
+            raise ValueError(
+                f"generate_async expects batch_size=1; got {batch_size}. Handle batching upstream."
+            )

906-916: Null out server refs after shutdown to aid GC and reuse.

             if self.server_thread is not None:
                 from threading import Thread
 
                 from uvicorn import Server
 
                 self.http_server: Server
                 self.server_thread: Thread
 
                 self.http_server.should_exit = True
                 self.server_thread.join()
+                self.http_server = None
+                self.server_thread = None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66f1fc5 and f4eda22.

📒 Files selected for processing (1)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:34:35.406Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.406Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
📚 Learning: 2025-09-10T05:35:59.840Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
nemo_rl/distributed/virtual_cluster.py (2)
  • _get_free_port_local (73-81)
  • _get_node_ip_local (66-70)
🪛 Ruff (0.12.2)
nemo_rl/models/generation/vllm/vllm_worker_async.py

385-385: Possible binding to all interfaces

(S104)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (1)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

361-389: HTTP server setup and base_url wiring look good.

Consider adding a brief docstring to _setup_vllm_server noting the intentional 0.0.0.0 bind and /tokenize root path to preempt future changes.

Comment thread nemo_rl/models/generation/vllm/vllm_worker_async.py
@terrykong terrykong enabled auto-merge September 16, 2025 18:47
@terrykong terrykong added this pull request to the merge queue Sep 16, 2025
Merged via the queue into main with commit 5a9f7ac Sep 17, 2025
24 checks passed
@terrykong terrykong deleted the bxyu/async-vllm-http-server branch September 17, 2025 05:20
@coderabbitai coderabbitai Bot mentioned this pull request Oct 31, 2025
4 tasks
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
6 participants