
[trainer] Force multi-node ranks to be allocated in continuous order #338

Merged
erictang000 merged 5 commits into NovaSky-AI:main from erictang000:multi_node_pg_fix
Sep 24, 2025

Conversation

@erictang000 (Collaborator) commented on Sep 23, 2025

Overview

Closes #324

We need to create a temporary actor on each node and then kill it (rather than just using `ray.util.placement_group_table` to look up the node ID), because we need the GPU ID for each bundle index, and that can only be obtained from inside a Ray actor.
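
For readers unfamiliar with the trick, a minimal sketch of the discovery step is below. It is an illustration only, not the code merged in `skyrl_train/utils/utils.py`; the helper name `discover_bundle_layout` is hypothetical, while `InfoActor` mirrors the temporary actor referenced in the review.

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_gpus=1)
class InfoActor:
    """Throwaway actor that reports which node and GPU its bundle landed on."""

    def get_node_and_gpu(self):
        # GPU assignments are only visible from inside an actor or task,
        # which is why placement_group_table alone is not enough.
        return ray.get_runtime_context().get_node_id(), ray.get_gpu_ids()[0]


def discover_bundle_layout(pg, num_bundles):
    # Pin one InfoActor to each bundle, collect (node_id, gpu_id), then kill them.
    actors = [
        InfoActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i
            )
        ).remote()
        for i in range(num_bundles)
    ]
    layout = ray.get([a.get_node_and_gpu.remote() for a in actors])
    for a in actors:
        ray.kill(a)
    return layout  # layout[i] == (node_id, gpu_id) for bundle i
```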

Ran GSM8K on 2 nodes of 4xL40S, both colocated and non-colocated:

[results image]

cfg.trainer.placement.colocate_all = True
cfg.generator.weight_sync_backend = "nccl"
cfg.trainer.strategy = "fsdp"
cfg.trainer.placement.policy_num_nodes = 5

@erictang000 (Collaborator, Author) commented:
Note that I only saw this fail consistently once I had more than 16 GPUs, but we can scale this down when we include multi-node tests in CI later.

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to ensure that multi-node ranks are allocated in a continuous order, which is important for distributed training. This is achieved by using temporary actors to discover the physical GPU layout within a placement group and then reordering the bundle indices before creating the main worker actors. The changes are logical and are accompanied by a new, thorough test case for multi-node scenarios. My feedback includes a suggestion to refactor the temporary actor logic for cleaner resource management and a minor cleanup in the new test file to improve code clarity.
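
To make the reordering concrete: once the layout above is known, sorting bundle indices by (node ID, GPU ID) is enough to give consecutive ranks on the same node. A sketch, with a hypothetical helper name:

```python
def continuous_bundle_order(layout):
    # layout[i] == (node_id, gpu_id) for bundle i (see the discovery sketch above).
    # Sorting by node first and GPU second groups each node's bundles together,
    # so rank r is assigned to order[r] and consecutive ranks share a node.
    return sorted(range(len(layout)), key=lambda i: layout[i])
```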

Comment thread skyrl-train/skyrl_train/utils/utils.py

def test_multi_node_pg_init(ray_init_fixture, cfg):
    try:
        cfg = get_test_actor_config()

@gemini-code-assist (Contributor, bot) commented (severity: medium):

The cfg object is already provided to the test function by the cfg pytest fixture (defined on line 33). This line re-initializes it, which is redundant and can be removed for clarity.
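
For context, the pattern the bot is describing looks roughly like the following; the fixture body is an assumption for illustration (only `get_test_actor_config`, `ray_init_fixture`, and the test signature appear in the diff):

```python
import pytest

# get_test_actor_config comes from the repo's test utilities (import path omitted here).

@pytest.fixture
def cfg():
    # Built once by pytest and handed to each test as an argument.
    return get_test_actor_config()

def test_multi_node_pg_init(ray_init_fixture, cfg):
    # `cfg` is already populated by the fixture; calling get_test_actor_config()
    # again inside the test body is redundant.
    ...
```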

@SumanthRH self-assigned this on Sep 23, 2025
Comment thread skyrl-train/skyrl_train/utils/utils.py
Comment thread skyrl-train/skyrl_train/workers/worker.py Outdated
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py

@SumanthRH (Member) left a comment:

Nice! Some minor issues to fix.

@erictang000 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to force continuous rank allocation in multi-node setups, which is a crucial fix for correctly mapping ranks to GPU IDs. The approach of using temporary InfoActors to gather node and GPU information before creating the main worker actors is clever and effective. The new logic is well-tested with the addition of test_multi_node_pg.py.

I have a couple of suggestions to improve code conciseness and address a potential resource leak in the new test cases. Overall, this is a solid contribution that addresses an important issue in multi-node training.
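
A common way to avoid the kind of resource leak mentioned above is to tear the temporary actors and placement group down in a `finally` block. This is an assumption about the shape of the fix, not the code that was merged:

```python
import ray
from ray.util import remove_placement_group


def run_with_cleanup(pg, actors):
    try:
        # ... exercise the actors / assert on the bundle ordering ...
        pass
    finally:
        for a in actors:
            ray.kill(a)               # release GPU bundles even if an assertion fails
        remove_placement_group(pg)    # hand the reserved resources back to the cluster
```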

Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated

@erictang000 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request introduces a mechanism to ensure continuous rank allocation in multi-node training environments by using temporary actors to determine GPU IDs for sorting placement group bundles. This is a solid approach to work around the limitations of Ray's placement group introspection. The changes are well-supported by new tests that cover various configurations and error conditions. My review includes suggestions for improving code clarity, a minor performance optimization during initialization, and enhancing test structure. Overall, this is a valuable contribution that improves the robustness of multi-node training.

Comment thread skyrl-train/examples/gsm8k/run_gsm8k.sh Outdated
Comment thread skyrl-train/skyrl_train/utils/utils.py
Comment thread skyrl-train/skyrl_train/workers/worker.py
Comment thread skyrl-train/tests/gpu/test_multi_node_pg.py Outdated
@erictang000 merged commit 8395f83 into NovaSky-AI:main on Sep 24, 2025
3 checks passed
@erictang000 deleted the multi_node_pg_fix branch on September 24, 2025 at 02:36
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request on Nov 23, 2025: [trainer] Force multi-node ranks to be allocated in continuous order (NovaSky-AI#338)


Development

Successfully merging this pull request may close these issues: Ensure consecutive worker ranks are grouped together on nodes (#324)