feat: support non-colocated sync vllm #489

Merged
chtruong814 merged 8 commits into main from yukih/non-colocated-inference
Jun 12, 2025

yuki-97 commented Jun 6, 2025

What does this PR do?

Adds support for non-colocated sync vLLM, i.e. running vLLM generation on GPUs separate from the ones used for training.

Convergence

[Image: convergence plot]

Time Cost

pink: colocated baseline, using 4 nodes.
green/blue: non-colocated, using 4 nodes for training and 1 node for inference.
[Image: time cost plot]

Usage

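Training resources are inferred from the overall cluster resources and the inference resources,
i.e. training nodes = overall nodes - inference nodes. On a single node, the same subtraction applies to GPUs, as in the first example below.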

# 1 node with 8 GPUs, 4 GPUs for train and 4 GPUs for inference
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=4 \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8

# 5 nodes with 8 GPUs each, 4 nodes for train and 1 node for inference
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.num_nodes=1 \
    cluster.num_nodes=5 \
    cluster.gpus_per_node=8
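
The resource split described under Usage can be sketched as follows. This is a minimal illustration, not NeMo RL's actual implementation: the `Resources` dataclass and the `infer_train_resources` function are hypothetical names, and the single-node GPU split is an assumption drawn from the first example command above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for a node count and GPUs per node."""
    num_nodes: int
    gpus_per_node: int


def infer_train_resources(
    cluster: Resources,
    inference_num_nodes: Optional[int] = None,
    inference_gpus_per_node: Optional[int] = None,
) -> Resources:
    """Derive training resources as overall resources minus inference resources.

    Assumption: on a single node the GPUs are split between training and
    inference; on multiple nodes whole nodes are reserved for inference.
    """
    if cluster.num_nodes == 1:
        if inference_gpus_per_node is None:
            raise ValueError("single-node setups need inference_gpus_per_node")
        train_gpus = cluster.gpus_per_node - inference_gpus_per_node
        if train_gpus <= 0:
            raise ValueError("no GPUs left for training")
        return Resources(num_nodes=1, gpus_per_node=train_gpus)

    if inference_num_nodes is None:
        raise ValueError("multi-node setups need inference_num_nodes")
    train_nodes = cluster.num_nodes - inference_num_nodes
    if train_nodes <= 0:
        raise ValueError("no nodes left for training")
    return Resources(num_nodes=train_nodes, gpus_per_node=cluster.gpus_per_node)


# Mirrors the two example commands above.
single = infer_train_resources(Resources(1, 8), inference_gpus_per_node=4)
multi = infer_train_resources(Resources(5, 8), inference_num_nodes=1)
print(single)  # 1 node, 4 GPUs for training
print(multi)   # 4 nodes, 8 GPUs each for training
```

Raising an error when nothing is left for training is a design choice of this sketch; the PR description does not state how such a configuration is handled.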

yuki-97 changed the title from "feat: non-colocated inference" to "feat: support non-colocated sync vllm" Jun 6, 2025
yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jun 9, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from b1633bb to 9806368 June 9, 2025 03:35
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 9, 2025
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 9, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from 0d8b927 to 0bc1cb7 June 10, 2025 05:49
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 10, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from 6535ede to d74e144 June 10, 2025 12:29
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 10, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from d74e144 to 067b937 June 10, 2025 13:13
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 10, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from 067b937 to 375f2c0 June 10, 2025 14:15
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 10, 2025
yuki-97 force-pushed the yukih/non-colocated-inference branch from 375f2c0 to 0e06197 June 10, 2025 15:18
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 10, 2025
yuki-97 marked this pull request as ready for review June 10, 2025 15:18
yuki-97 force-pushed the yukih/non-colocated-inference branch from 002c56e to fa6b7f3 June 11, 2025 09:31
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 11, 2025
yuki-97 added 2 commits June 11, 2025 13:39
Signed-off-by: Yuki Huang <yukih@nvidia.com>

lint

Signed-off-by: Yuki Huang <yukih@nvidia.com>

fix ip and config

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>

update config structure

Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuki-97 force-pushed the yukih/non-colocated-inference branch from fa6b7f3 to 5d35feb June 11, 2025 13:40
Comment thread examples/configs/grpo_math_1B.yaml Outdated
Comment thread examples/configs/grpo_math_1B.yaml Outdated
Comment thread nemo_rl/algorithms/grpo.py Outdated
Comment thread nemo_rl/algorithms/grpo.py Outdated
Comment thread nemo_rl/algorithms/grpo.py
Comment thread nemo_rl/distributed/worker_groups.py Outdated
Comment thread nemo_rl/models/generation/vllm.py
parthchadha previously approved these changes Jun 12, 2025
yuki-97 added 3 commits June 12, 2025 03:31
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuki-97 force-pushed the yukih/non-colocated-inference branch from 9feb008 to 21b6193 June 12, 2025 03:50
github-actions bot added the CI Relating to CI label Jun 12, 2025
yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jun 12, 2025
terrykong added this pull request to the merge queue Jun 12, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 12, 2025
chtruong814 added this pull request to the merge queue Jun 12, 2025
Merged via the queue into main with commit 4f1cd1a Jun 12, 2025
21 of 23 checks passed
chtruong814 deleted the yukih/non-colocated-inference branch June 12, 2025 13:09
parthchadha pushed a commit that referenced this pull request Jun 17, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jul 14, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Labels

CI:L1 (Run doctests, unit tests, and functional tests)
CI (Relating to CI)

4 participants