
[trainer] Force multi-node ranks to be allocated in continuous order #338

Merged
erictang000 merged 5 commits into NovaSky-AI:main from erictang000:multi_node_pg_fix
Sep 24, 2025

Conversation

@erictang000 (Collaborator) commented on Sep 23, 2025

Overview

Closes #324

We need to create a temporary actor on each node and then kill it (rather than just using `ray.util.placement_group_table` to look up the node ID), because we need the GPU ID for each bundle index, and that can only be obtained from inside a Ray actor.
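
For readers unfamiliar with the trick, a minimal sketch of the discovery step is below. It is an illustration only, not the code merged in `skyrl_train/utils/utils.py`; the helper name `discover_bundle_layout` is hypothetical, while `InfoActor` mirrors the temporary actor referenced in the review.

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_gpus=1)
class InfoActor:
    """Throwaway actor that reports which node and GPU its bundle landed on."""

    def get_node_and_gpu(self):
        # GPU assignments are only visible from inside an actor or task,
        # which is why placement_group_table alone is not enough.
        return ray.get_runtime_context().get_node_id(), ray.get_gpu_ids()[0]


def discover_bundle_layout(pg, num_bundles):
    # Pin one InfoActor to each bundle, collect (node_id, gpu_id), then kill them.
    actors = [
        InfoActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i
            )
        ).remote()
        for i in range(num_bundles)
    ]
    layout = ray.get([a.get_node_and_gpu.remote() for a in actors])
    for a in actors:
        ray.kill(a)
    return layout  # layout[i] == (node_id, gpu_id) for bundle i
```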

Ran GSM8K on 2 nodes of 4xL40S, both colocated and non-colocated:

[results image]

cfg.trainer.placement.colocate_all = True
cfg.generator.weight_sync_backend = "nccl"
cfg.trainer.strategy = "fsdp"
cfg.trainer.placement.policy_num_nodes = 5

@erictang000 (Collaborator, Author) commented:
Note that I only saw this fail consistently once I had more than 16 GPUs, but we can scale this down when we include multi-node tests in CI later.

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to ensure that multi-node ranks are allocated in a continuous order, which is important for distributed training. This is achieved by using temporary actors to discover the physical GPU layout within a placement group and then reordering the bundle indices before creating the main worker actors. The changes are logical and are accompanied by a new, thorough test case for multi-node scenarios. My feedback includes a suggestion to refactor the temporary actor logic for cleaner resource management and a minor cleanup in the new test file to improve code clarity.
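
To make the reordering concrete: once the layout above is known, sorting bundle indices by (node ID, GPU ID) is enough to give consecutive ranks on the same node. A sketch, with a hypothetical helper name:

```python
def continuous_bundle_order(layout):
    # layout[i] == (node_id, gpu_id) for bundle i (see the discovery sketch above).
    # Sorting by node first and GPU second groups each node's bundles together,
    # so rank r is assigned to order[r] and consecutive ranks share a node.
    return sorted(range(len(layout)), key=lambda i: layout[i])
```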

Comment thread skyrl-train/skyrl_train/utils/utils.py

def test_multi_node_pg_init(ray_init_fixture, cfg):
    try:
        cfg = get_test_actor_config()

@gemini-code-assist (Contributor, bot) commented (severity: medium):

The cfg object is already provided to the test function by the cfg pytest fixture (defined on line 33). This line re-initializes it, which is redundant and can be removed for clarity.
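
For context, the pattern the bot is describing looks roughly like the following; the fixture body is an assumption for illustration (only `get_test_actor_config`, `ray_init_fixture`, and the test signature appear in the diff):

```python
import pytest

# get_test_actor_config comes from the repo's test utilities (import path omitted here).

@pytest.fixture
def cfg():
    # Built once by pytest and handed to each test as an argument.
    return get_test_actor_config()

def test_multi_node_pg_init(ray_init_fixture, cfg):
    # `cfg` is already populated by the fixture; calling get_test_actor_config()
    # again inside the test body is redundant.
    ...
```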

@SumanthRH self-assigned this on Sep 23, 2025
Comment thread skyrl-train/skyrl_train/utils/utils.py
Comment thread skyrl-train/skyrl_train/workers/worker.py Outdated
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py

@SumanthRH (Member) left a comment:

Nice! Some minor issues to fix.

@erictang000 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to force continuous rank allocation in multi-node setups, which is a crucial fix for correctly mapping ranks to GPU IDs. The approach of using temporary InfoActors to gather node and GPU information before creating the main worker actors is clever and effective. The new logic is well-tested with the addition of test_multi_node_pg.py.

I have a couple of suggestions to improve code conciseness and address a potential resource leak in the new test cases. Overall, this is a solid contribution that addresses an important issue in multi-node training.
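
A common way to avoid the kind of resource leak mentioned above is to tear the temporary actors and placement group down in a `finally` block. This is an assumption about the shape of the fix, not the code that was merged:

```python
import ray
from ray.util import remove_placement_group


def run_with_cleanup(pg, actors):
    try:
        # ... exercise the actors / assert on the bundle ordering ...
        pass
    finally:
        for a in actors:
            ray.kill(a)               # release GPU bundles even if an assertion fails
        remove_placement_group(pg)    # hand the reserved resources back to the cluster
```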

Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated

@erictang000 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to ensure continuous rank allocation in multi-node training environments by using temporary actors to determine GPU IDs for sorting placement group bundles. This is a solid approach to work around the limitations of Ray's placement group introspection. The changes are well-supported by new tests that cover various configurations and error conditions. My review includes suggestions for improving code clarity, a minor performance optimization during initialization, and enhancing test structure. Overall, this is a valuable contribution that improves the robustness of multi-node training.

Comment thread skyrl-train/examples/gsm8k/run_gsm8k.sh Outdated
Comment thread skyrl-train/skyrl_train/utils/utils.py
Comment thread skyrl-train/skyrl_train/workers/worker.py
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated
@erictang000 merged commit 8395f83 into NovaSky-AI:main on Sep 24, 2025
3 checks passed
@erictang000 deleted the multi_node_pg_fix branch on September 24, 2025 at 02:36
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request on Nov 23, 2025: [trainer] Force multi-node ranks to be allocated in continuous order (NovaSky-AI#338)


Development

Successfully merging this pull request may close these issues: Ensure consecutive worker ranks are grouped together on nodes (#324)