feat: support DP inside vLLM for EP #1081
Conversation
terrykong left a comment:
lgtm. @parthchadha to review
Walkthrough
Configs replace the boolean enable_expert_parallel with an integer expert_parallel_size.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Orchestrator
    participant VirtualCluster
    participant WorkerGroups
    participant Worker as "vLLM Worker"
    Orchestrator->>VirtualCluster: init/create placement groups (retries/backoff, timeout)
    VirtualCluster-->>Orchestrator: placement groups ready + (ip,port) per bundle
    Orchestrator->>WorkerGroups: request workers (bundle_indices_list)
    loop per (pg_idx,bundle_idx)
        WorkerGroups->>VirtualCluster: get_available_address_and_port(pg_idx,bundle_idx)
        VirtualCluster-->>WorkerGroups: (ip, port)
    end
    WorkerGroups->>Worker: spawn with env { AVAILABLE_ADDR_LIST, AVAILABLE_PORT_LIST, ... }
    Worker->>Worker: read cfg.expert_parallel_size and tensor_parallel_size
    alt expert_parallel_size > tensor_parallel_size
        Worker->>Worker: compute DP sizing, set VLLM_DP_RANK / VLLM_DP_RANK_LOCAL
        Worker->>Worker: set VLLM_DP_MASTER_IP and VLLM_DP_MASTER_PORT from AVAILABLE_* lists
    end
    Worker-->>Orchestrator: worker ready
```
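To make the worker-side branch concrete, here is a minimal sketch of the env wiring the diagram describes. The helper name and the rank arithmetic are illustrative assumptions; the actual logic lives in nemo_rl/models/generation/vllm/vllm_worker.py and may differ.

```python
import os


def setup_vllm_dp_env(cfg: dict, global_rank: int, local_rank: int) -> None:
    # Illustrative sketch only; not the actual vllm_worker.py implementation.
    tp_size = int(cfg["tensor_parallel_size"])
    ep_size = int(cfg["expert_parallel_size"])
    if ep_size <= tp_size:
        return  # no vLLM-internal DP needed

    # vLLM's EP = DP * TP, so the internal DP size is EP / TP.
    dp_size = ep_size // tp_size
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    # Assumed rank layout: every tp_size consecutive ranks form one DP replica.
    os.environ["VLLM_DP_RANK"] = str((global_rank // tp_size) % dp_size)
    os.environ["VLLM_DP_RANK_LOCAL"] = str(local_rank // tp_size)

    # Pre-allocated (ip, port) lists come from the worker group; DP rank 0's
    # entry becomes the DP master address.
    addrs = os.environ["AVAILABLE_ADDR_LIST"].split(",")
    ports = os.environ["AVAILABLE_PORT_LIST"].split(",")
    os.environ["VLLM_DP_MASTER_IP"] = addrs[0]
    os.environ["VLLM_DP_MASTER_PORT"] = ports[0]
```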
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1)
41-49: No 'enable_expert_parallel' config key found; fix vllm config typing
- Sweep: no configs/tests contain an "enable_expert_parallel" config key — it's only a runtime variable in nemo_rl/models/generation/vllm/vllm_worker.py (set from expert_parallel_size and passed to vLLM).
- Action (critical): nemo_rl/models/generation/vllm/config.py currently declares "expert_parallel_size: bool" (line ~23) but configs/tests use integers. Change that to "expert_parallel_size: int" and run validation/tests.
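A minimal sketch of the suggested typing fix, assuming the vLLM config is expressed as a TypedDict; apart from expert_parallel_size, the field names shown are illustrative:

```python
from typing import TypedDict


class VllmConfig(TypedDict):
    tensor_parallel_size: int
    pipeline_parallel_size: int
    expert_parallel_size: int  # was annotated as bool; configs and tests pass ints (e.g. 1, 64)
```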
🧹 Nitpick comments (22)
examples/configs/recipes/llm/grpo-gspo-deepscaler-1.5b-8K.yaml (1)
105-109: Consistent migration to expert_parallel_size. EP=1 with TP=1 is a no-op; OK. Add the EP%TP note for clarity.

```diff
- expert_parallel_size: 1
+ # EP > 1 requires EP % TP == 0
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1)
151-154: Migration LGTM; EP disabled by default. Here TP=4; keep a comment to avoid misconfig when enabling EP later.

```diff
- expert_parallel_size: 1
+ # If you enable EP, set expert_parallel_size to a multiple of TP (=4 here)
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1)
99-101: Rename is correct; matches PR semantics (EP=1 => vLLM DP off). Optional: add the EP%TP constraint comment.

```diff
- expert_parallel_size: 1
+ # EP > 1 requires EP % TP == 0
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt.v3.yaml (1)
99-101: Good: default EP off; TP=4 here. Add a guardrail comment to prevent invalid EP later.

```diff
- expert_parallel_size: 1
+ # If enabling EP, choose a multiple of TP (=4)
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.yaml (1)
99-101: Change is consistent; EP off by default. Same optional comment suggestion.

```diff
- expert_parallel_size: 1
+ # EP > 1 requires EP % TP == 0
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1)
133-138: Rename aligns with code; a maintainers' note for the EP/TP constraint is recommended. TP=4 in this config; a comment will help future edits.

```diff
- expert_parallel_size: 1
+ # To enable EP, set to a multiple of TP (=4). Otherwise keep at 1.
+ expert_parallel_size: 1
```

examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml (1)
104-108: LGTM; EP=1 keeps previous behavior. Add the EP%TP reminder to reduce misconfiguration.

```diff
- expert_parallel_size: 1
+ # EP > 1 requires EP % TP == 0
+ expert_parallel_size: 1
```

examples/configs/evals/eval.yaml (1)
24-24: EP size switch ACK; default safe for single-GPU eval. Value 1 keeps EP disabled. Consider adding a comment that EP>1 on eval requires EP % TP == 0 and enough GPUs.
examples/configs/recipes/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.yaml (1)
99-99: Migration is correct; keep EP constraints visible to users. Optional: add a YAML comment noting EP > 1 requires EP % TP == 0 and may enable vLLM-internal DP = EP/TP.
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1)
129-129: Good rename; default preserves behavior. No functional change. If you plan to exercise the new DP-in-vLLM path in CI, consider adding a variant with EP > 1 and TP = 1 for coverage.
examples/configs/recipes/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.yaml (1)
99-99: Rename LGTM; add a minimal guardrail note (optional). A short inline comment like "1 disables EP; >1 requires EP % TP == 0" could prevent misconfigurations.
examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4sp.v3.yaml (1)
99-99: Config update is correct. Given TP=4, if someone sets EP>1 later, remind them that EP must be a multiple of 4; DP_in_vLLM = EP/4. Optional comment.
examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1)
41-49: Rename to expert_parallel_size looks good; default preserves behavior. Setting expert_parallel_size: 1 is backward-compatible. When users tune EP > 1, please document that EP must be divisible by TP (EP % TP == 0) so vLLM DP = EP/TP is integral.
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.yaml (1)
95-102: Good default for EP. expert_parallel_size: 1 preserves previous behavior; note in docs that EP > 1 requires EP % TP == 0.
tests/unit/environments/test_code_environment.py (1)
51-62: Use boolean for enforce_eager (avoid string). The current value is a string "False"; prefer boolean False to match runtime expectations.
Apply this diff:
"gpu_memory_utilization": 0.6, - "enforce_eager": "False", + "enforce_eager": False,tests/unit/models/generation/test_vllm_generation.py (2)
53-65: EP key migration OK; also fix enforce_eager type. The key rename is correct. Please switch enforce_eager from string to boolean for consistency.
"load_format": "auto", - "enforce_eager": "False", + "enforce_eager": False,
53-65: Add a minimal EP>1 test to cover the new DP-in-vLLM path. Parametrize a case with TP=1, EP=2 to validate layout checks and env wiring.
```python
# Outside this block: add a new test
def test_vllm_ep_gt_1_layout(cluster, tokenizer):
    cfg = deepcopy(basic_vllm_test_config)
    cfg = configure_generation_config(cfg, tokenizer, is_eval=True)
    cfg["vllm_cfg"]["tensor_parallel_size"] = 1
    cfg["vllm_cfg"]["expert_parallel_size"] = 2  # triggers vLLM-internal DP
    v = VllmGeneration(cluster, cfg)
    v.finish_generation()  # init path exercises EP layout
    v.shutdown()
```

tests/unit/environments/test_retriever.py (1)
51-61: Set enforce_eager to boolean False. Avoid quoting booleans in the config dict.
"gpu_memory_utilization": 0.6, - "enforce_eager": "False", + "enforce_eager": False,nemo_rl/models/generation/vllm/vllm_generation.py (1)
154-163: Make vLLM DP env setting robust; optional NCCL resiliency
- Gate VLLM_DP_SIZE on the computed vllm_dp_size > 1 (covers future changes).
- Optional: set NCCL_ASYNC_ERROR_HANDLING=1 for safer comms.

Apply this diff:

```diff
         if not self.cfg["colocated"]["enabled"]:
             env_vars["NCCL_CUMEM_ENABLE"] = "1"
-        # We should use vLLM DP if ep_size > tp_size since EP_SIZE = DP_SIZE * TP_SIZE in vLLM.
-        # See details in https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py
-        if self.ep_size > self.tp_size:
-            env_vars["VLLM_DP_SIZE"] = str(self.vllm_dp_size)
+        # Use vLLM internal DP when DP_SIZE > 1 (EP = DP * TP).
+        if self.vllm_dp_size > 1:
+            env_vars["VLLM_DP_SIZE"] = str(self.vllm_dp_size)
+        # Optional resiliency:
+        # env_vars.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
```

nemo_rl/distributed/virtual_cluster.py (3)
354-372: Validate pg_idx and bundle_idx before indexing. Out-of-range indices will surface as IndexError. Fail fast with a clear message.

Apply this diff:

```diff
-        placement_groups = self.get_placement_groups()
-        if len(placement_groups) == 1:
-            pg = placement_groups[0]
-        else:
-            pg = placement_groups[pg_idx]
+        placement_groups = self.get_placement_groups()
+        if not placement_groups:
+            raise RuntimeError("No placement groups available")
+        if len(placement_groups) == 1:
+            pg = placement_groups[0]
+        else:
+            if not (0 <= pg_idx < len(placement_groups)):
+                raise ValueError(
+                    f"pg_idx out of range: {pg_idx} (available={len(placement_groups)})"
+                )
+            pg = placement_groups[pg_idx]
+        if not (0 <= bundle_idx < pg.bundle_count):
+            raise ValueError(
+                f"bundle_idx out of range: {bundle_idx} (bundle_count={pg.bundle_count})"
+            )
```
375-383: Confirm Ray behavior with num_cpus=0 + PG scheduling across Ray versions. Setting num_cpus=0 is intentional to avoid contention, but older Ray versions had quirks with PG enforcement on zero-resource tasks. If you target older clusters, consider reserving a fractional CPU (e.g., 0.01) on the bundle. Would you like a small compatibility matrix check script for your supported Ray versions?
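If older Ray versions do need to be supported, a hedged sketch of the fractional-CPU variant could look like the following; the helper names are hypothetical, not the repo's get_available_address_and_port:

```python
import socket

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_cpus=0.01)  # tiny reservation instead of 0, for older Ray PG scheduling
def _node_ip_and_free_port() -> tuple[str, int]:
    # Bind an ephemeral port on whichever node this bundle landed on.
    ip = ray.util.get_node_ip_address()
    with socket.socket() as s:
        s.bind(("", 0))
        return ip, s.getsockname()[1]


def address_for_bundle(pg, bundle_idx: int) -> tuple[str, int]:
    strategy = PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_bundle_index=bundle_idx
    )
    return ray.get(
        _node_ip_and_free_port.options(scheduling_strategy=strategy).remote()
    )
```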
386-388: Minor: tighten the error message or use a dedicated exception. The message is fine; if you want to appease linters, shorten it or use a custom exception type.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (32)
examples/configs/evals/eval.yaml (1 hunks)
examples/configs/grpo_math_1B.yaml (1 hunks)
examples/configs/grpo_sliding_puzzle.yaml (1 hunks)
examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1 hunks)
examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml (1 hunks)
examples/configs/recipes/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.yaml (1 hunks)
examples/configs/recipes/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.yaml (1 hunks)
examples/configs/recipes/llm/grpo-gspo-deepscaler-1.5b-8K.yaml (1 hunks)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1 hunks)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1 hunks)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.yaml (1 hunks)
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.yaml (1 hunks)
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1 hunks)
examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1 hunks)
examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (1 hunks)
examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt.v3.yaml (1 hunks)
examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4sp.v3.yaml (1 hunks)
examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1 hunks)
examples/configs/recipes/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.yaml (1 hunks)
examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1 hunks)
examples/configs/recipes/vlm/vlm_grpo-smolvlm2-2.2b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1 hunks)
examples/configs/vlm_grpo_3B.yaml (1 hunks)
nemo_rl/distributed/virtual_cluster.py (2 hunks)
nemo_rl/distributed/worker_groups.py (2 hunks)
nemo_rl/models/generation/vllm/config.py (1 hunks)
nemo_rl/models/generation/vllm/vllm_generation.py (4 hunks)
nemo_rl/models/generation/vllm/vllm_worker.py (2 hunks)
tests/unit/environments/test_code_environment.py (1 hunks)
tests/unit/environments/test_retriever.py (1 hunks)
tests/unit/experience/test_rollouts.py (1 hunks)
tests/unit/models/generation/test_vllm_generation.py (1 hunks)
tests/unit/models/generation/test_vllm_large_model.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_rl/distributed/worker_groups.py (1)
nemo_rl/distributed/virtual_cluster.py (1)
get_available_address_and_port(354-388)
nemo_rl/models/generation/vllm/vllm_generation.py (3)
nemo_rl/distributed/worker_groups.py (1)
dp_size (590-592)
tests/unit/models/generation/test_vllm_generation.py (1)
cluster (211-222)
nemo_rl/distributed/virtual_cluster.py (1)
world_size(348-349)
🪛 Ruff (0.12.2)
nemo_rl/distributed/worker_groups.py
463-463: Loop control variable group_idx not used within loop body
(B007)
464-464: Loop control variable local_rank not used within loop body
(B007)
nemo_rl/models/generation/vllm/vllm_worker.py
347-347: Use of possibly insecure function; consider using ast.literal_eval
(S307)
348-348: Use of possibly insecure function; consider using ast.literal_eval
(S307)
nemo_rl/distributed/virtual_cluster.py
386-388: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (11)
examples/configs/grpo_sliding_puzzle.yaml (1)
44-46: Rename to expert_parallel_size looks good; preserves default off (1). File: examples/configs/grpo_sliding_puzzle.yaml (lines 44–46)
Add a short inline note that when setting expert_parallel_size > 1 it must be a multiple of tensor_parallel_size.
```diff
- expert_parallel_size: 1
+ # Set >1 to enable EP. Constraint: expert_parallel_size % tensor_parallel_size == 0
+ expert_parallel_size: 1
```

Verification not completed: the ripgrep run returned "No files were searched". Re-run from the repo root to confirm there are no lingering enable_expert_parallel keys and that an alias/deprecation path exists. Suggested command:

```bash
rg -n --hidden -S --no-ignore 'enable_expert_parallel|expert_parallel_size' -g '!**/site-packages/**'
```

examples/configs/recipes/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.yaml (1)
98-98: Rename LGTM. Defaults are sane; nothing further.
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1)
128-128: Migration OK; FP8 generation unaffected. Nothing blocking. If you intend to demo EP>1 here, ensure kernel/layout support matches the FP8 path.
examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (1)
99-99: expert_parallel_size default preserved — add YAML hint & confirm TP mismatch
- vllm_worker sets enable_expert_parallel = expert_parallel_size > 1 and passes it through (nemo_rl/models/generation/vllm/vllm_worker.py:138, 386) — expert_parallel_size: 1 preserves current behavior.
- Add a one-line YAML comment next to examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (around line 99) stating: "If >1, expert_parallel_size must be a multiple of generation.vllm_cfg.tensor_parallel_size; vLLM internal DP = EP / TP."
- Quick rg run did not surface tensor_parallel_size entries under examples/configs — confirm whether the dtensor_cfg.tensor_parallel_size=8 vs generation.vllm_cfg.tensor_parallel_size=4 (train vs gen) mismatch is intentional; if not, align or document.
examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1)
110-117: EP size key migration OK. expert_parallel_size: 1 aligns with the new runtime logic and keeps EP disabled by default.
examples/configs/vlm_grpo_3B.yaml (1)
111-118: Consistent EP config. Good: expert_parallel_size replaces the old boolean and stays at 1.
examples/configs/recipes/vlm/vlm_grpo-smolvlm2-2.2b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1)
109-116: Looks good. EP key rename applied consistently with value 1.
examples/configs/grpo_math_1B.yaml (1)
181-182: YAML update is consistent with the new EP semantics. The comment is correct (EP must be a multiple of TP).
tests/unit/experience/test_rollouts.py (1)
201-202: Config rename to expert_parallel_size is correct here. Keeps default EP=1 for tests.
nemo_rl/models/generation/vllm/vllm_worker.py (1)
137-139: Deriving enable_expert_parallel from expert_parallel_size is correct. This aligns with the new config contract (EP enabled iff size > 1).
Confirm all config sources now pass an int for expert_parallel_size (no strings).
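One way to run that check, as a rough sketch (assumes PyYAML is available and that the example configs live under examples/configs; adjust paths as needed):

```python
import pathlib

import yaml


def find_key(node, key):
    # Yield every value stored under `key`, at any nesting depth.
    if isinstance(node, dict):
        if key in node:
            yield node[key]
        for child in node.values():
            yield from find_key(child, key)
    elif isinstance(node, list):
        for child in node:
            yield from find_key(child, key)


for path in pathlib.Path("examples/configs").rglob("*.yaml"):
    for value in find_key(yaml.safe_load(path.read_text()), "expert_parallel_size"):
        # Exclude bool explicitly, since bool is a subclass of int in Python.
        assert isinstance(value, int) and not isinstance(value, bool), (
            f"{path}: expert_parallel_size={value!r} is not an int"
        )
```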
nemo_rl/distributed/virtual_cluster.py (1)
390-397: Delegation wrapper looks good. Backwards compatibility via delegation to (pg_idx=0, bundle_idx=0) is clean.
Actionable comments posted: 0
♻️ Duplicate comments (3)
nemo_rl/models/generation/vllm/vllm_generation.py (3)
58-69: Compute sizes as ints, raise instead of assert, and defer the vLLM-DP calc. Casts avoid accidental str math; explicit exceptions survive -O; compute vllm_dp_size only after EP/TP validation.

```diff
-        self.tp_size = self.cfg["vllm_cfg"]["tensor_parallel_size"]
-        self.pp_size = self.cfg["vllm_cfg"]["pipeline_parallel_size"]
-        self.ep_size = self.cfg["vllm_cfg"]["expert_parallel_size"]
-        self.model_parallel_size = self.tp_size * self.pp_size
-
-        assert cluster.world_size() % self.model_parallel_size == 0, (
-            "World size must be a multiple of model parallel size. "
-            f"Got world size {cluster.world_size()} and model parallel size (TP * PP) {self.model_parallel_size}."
-        )
-        self.dp_size = cluster.world_size() // self.model_parallel_size
-        self.vllm_dp_size = self.ep_size // self.tp_size
+        vllm_cfg = self.cfg["vllm_cfg"]
+        self.tp_size = int(vllm_cfg["tensor_parallel_size"])
+        self.pp_size = int(vllm_cfg["pipeline_parallel_size"])
+        self.ep_size = int(vllm_cfg["expert_parallel_size"])
+        self.model_parallel_size = self.tp_size * self.pp_size
+
+        total_mp = self.model_parallel_size
+        if cluster.world_size() % total_mp != 0:
+            raise ValueError(
+                f"world_size={cluster.world_size()} is not divisible by TP*PP={self.tp_size}*{self.pp_size}={total_mp}"
+            )
+        self.dp_size = cluster.world_size() // total_mp
```
70-88: Replace assert-based config checks; validate EP vs TP before computing vLLM-DP. Asserts can be optimized out; make the checks deterministic and only then set vllm_dp_size.

```diff
-        if self.pp_size > 1:
-            assert self.cfg["vllm_cfg"]["async_engine"], (
-                "When pipeline_parallel_size > 1, async_engine must be set to True in the vLLM configuration. "
-                "You can enable it by adding `policy.generation.vllm_cfg.async_engine=true` to your command."
-            )
+        if self.pp_size > 1 and not vllm_cfg["async_engine"]:
+            raise ValueError(
+                "When pipeline_parallel_size > 1, set vllm_cfg.async_engine=true."
+            )
-        if self.ep_size > 1:
-            assert self.ep_size % self.tp_size == 0, (
-                "When EP > 1, EP must be a multiple of TP since vLLM's EP = DP * TP. "
-                "Please update your configuration to set expert_parallel_size to a multiple of tensor_parallel_size."
-            )
-            if self.ep_size != self.tp_size:
-                # vLLM's EP = DP * TP, so here we need to use DP inside vLLM.
-                assert not self.cfg["vllm_cfg"]["async_engine"], (
-                    "vLLM async_engine has some issues when using DP inside vLLM. "
-                    "Please update your configuration to set `policy.generation.vllm_cfg.async_engine=false`. "
-                    "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
-                )
+        if self.ep_size > 1:
+            if self.ep_size % self.tp_size != 0:
+                raise ValueError(
+                    "When EP > 1, expert_parallel_size must be a multiple of tensor_parallel_size (EP = DP * TP in vLLM)."
+                )
+            if self.ep_size != self.tp_size and vllm_cfg["async_engine"]:
+                raise ValueError(
+                    "async_engine is not supported when using DP inside vLLM (EP != TP). Set vllm_cfg.async_engine=false. "
+                    "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
+                )
+        # vLLM DP size computed after EP/TP validation
+        self.vllm_dp_size = max(1, self.ep_size // self.tp_size)
```
199-201: Don't use assert for the DP-size contract; raise a deterministic error. Prevents silent bypass under python -O.

```diff
-        assert self.dp_size == self.worker_group.dp_size, (
-            f"Data parallel size mismatch. Expected {self.dp_size}, got {self.worker_group.dp_size}"
-        )
+        if self.dp_size != self.worker_group.dp_size:
+            raise RuntimeError(
+                f"Data-parallel size mismatch: expected {self.dp_size}, got {self.worker_group.dp_size}"
+            )
```
🧹 Nitpick comments (1)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
93-114: Fix non-interpolated constants in error messages. These lines reference placeholders without f-strings; users will see the braces literally.

```diff
-            "does not return logprobs after top_k filtering. Values >= {TOP_K_THRESHOLD} are allowed "
+            f"does not return logprobs after top_k filtering. Values >= {TOP_K_THRESHOLD} are allowed "
 ...
-            "does not return logprobs after top_p filtering. Values >= {TOP_P_THRESHOLD} are allowed "
+            f"does not return logprobs after top_p filtering. Values >= {TOP_P_THRESHOLD} are allowed "
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/models/generation/vllm/vllm_generation.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_generation.py (3)
tests/unit/models/generation/test_vllm_generation.py (1)
cluster (211-222)
nemo_rl/distributed/virtual_cluster.py (1)
world_size (348-349)
nemo_rl/distributed/worker_groups.py (1)
dp_size(590-592)
🔇 Additional comments (2)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
127-132: NamedSharding reshape looks good given the divisibility guard. With the explicit divisibility check above, this reshape is safe.
160-169: Env wiring for VLLM_DP_SIZE is correct and scoped. Setting VLLM_DP_SIZE only when EP > TP matches the DP-inside-vLLM contract.
This commit passed as part of this merge queue run: https://github.com/NVIDIA-NeMo/RL/actions/runs/17828559136. Will manually merge this one.
What does this PR do?
Rename enable_expert_parallel to expert_parallel_size. Configs that previously used enable_expert_parallel now set expert_parallel_size: 1 by default, so nothing is affected. Here is how DP on the generation side works after this PR:
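As a rough sketch of the sizing arithmetic the review comments describe (the concrete numbers below are hypothetical, chosen to mirror a TP8/EP64-style run):

```python
# Hypothetical example only; the real validation lives in vllm_generation.py.
world_size = 64            # total generation GPUs in the cluster (assumed)
tp_size, pp_size = 8, 1    # generation.vllm_cfg.tensor_parallel_size / pipeline_parallel_size
ep_size = 64               # generation.vllm_cfg.expert_parallel_size (the renamed key)

assert ep_size % tp_size == 0, "EP must be a multiple of TP, since vLLM's EP = DP * TP"
vllm_dp_size = ep_size // tp_size                  # DP inside vLLM -> 8
outer_dp_size = world_size // (tp_size * pp_size)  # NeMo-RL data-parallel replicas -> 8
print(f"vLLM-internal DP = {vllm_dp_size}, outer DP = {outer_dp_size}")
```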
Issues
Related #908.
Test Result
green: main branch baseline, TP32
blue: feature branch, using the same setting as baseline
red/pink: feature branch, TP8EP64
Convergence

Time cost

The EP setting is not fine-tuned, so the generation time is slower.
Summary by CodeRabbit
New Features
Reliability
Refactor
Tests