[trainer] Force multi-node ranks to be allocated in continuous order #338
erictang000 merged 5 commits into NovaSky-AI:main
Conversation
cfg.trainer.placement.colocate_all = True
cfg.generator.weight_sync_backend = "nccl"
cfg.trainer.strategy = "fsdp"
cfg.trainer.placement.policy_num_nodes = 5
Note that I only saw this fail consistently once I had more than 16 GPUs, but we can scale this down when we include multi-node tests in CI later.
Code Review
This pull request introduces a mechanism to ensure that multi-node ranks are allocated in a continuous order, which is important for distributed training. This is achieved by using temporary actors to discover the physical GPU layout within a placement group and then reordering the bundle indices before creating the main worker actors. The changes are logical and are accompanied by a new, thorough test case for multi-node scenarios. My feedback includes a suggestion to refactor the temporary actor logic for cleaner resource management and a minor cleanup in the new test file to improve code clarity.
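A minimal sketch of that mechanism, assuming a Ray placement group with one GPU per bundle; names such as `InfoActor` and `continuous_bundle_order` are illustrative, not necessarily the PR's own identifiers:

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_gpus=1)
class InfoActor:
    """Throwaway probe scheduled onto a single placement-group bundle."""

    def get_node_and_gpu(self):
        # GPU IDs are only visible from inside a Ray worker, which is why a
        # temporary actor is needed rather than placement_group_table alone.
        return ray.get_runtime_context().get_node_id(), ray.get_gpu_ids()[0]


def continuous_bundle_order(pg):
    """Return bundle indices sorted by (node_id, gpu_id)."""
    probes = [
        InfoActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=i,
            )
        ).remote()
        for i in range(len(pg.bundle_specs))
    ]
    infos = ray.get([p.get_node_and_gpu.remote() for p in probes])
    # The probes are temporary: kill them so the GPUs are free for the real
    # worker actors created afterwards.
    for p in probes:
        ray.kill(p)
    # Sorting by (node_id, gpu_id) makes ranks on the same node contiguous.
    return sorted(range(len(infos)), key=lambda i: infos[i])
```

The reordered bundle indices can then be used when assigning ranks to the real worker actors, so each rank maps to a predictable node and GPU.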
def test_multi_node_pg_init(ray_init_fixture, cfg):
    try:
        cfg = get_test_actor_config()
SumanthRH left a comment
Nice! Some minor issues to fix
/gemini review
Code Review
This pull request introduces a mechanism to force continuous rank allocation in multi-node setups, which is a crucial fix for correctly mapping ranks to GPU IDs. The approach of using temporary InfoActors to gather node and GPU information before creating the main worker actors is clever and effective. The new logic is well-tested with the addition of test_multi_node_pg.py.
I have a couple of suggestions to improve code conciseness and address a potential resource leak in the new test cases. Overall, this is a solid contribution that addresses an important issue in multi-node training.
/gemini review
Code Review
This pull request introduces a mechanism to ensure continuous rank allocation in multi-node training environments by using temporary actors to determine GPU IDs for sorting placement group bundles. This is a solid approach to work around the limitations of Ray's placement group introspection. The changes are well-supported by new tests that cover various configurations and error conditions. My review includes suggestions for improving code clarity, a minor performance optimization during initialization, and enhancing test structure. Overall, this is a valuable contribution that improves the robustness of multi-node training.
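For context, a hypothetical snippet showing the introspection that is available without temporary actors: `ray.util.placement_group_table` reports where bundles landed at the node level, but (as the PR description notes) not which GPU each bundle was assigned.

```python
import ray
from ray.util import placement_group_table
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
ray.get(pg.ready())

# Node-level placement information for each bundle; no per-bundle GPU IDs,
# hence the temporary-actor workaround described in this PR.
print(placement_group_table(pg))
```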
Overview
Closes #324
We need to create an actor to place on each node and kill it (rather than just using `ray.util.placement_group_table` to look up the node ID), since we need to get the GPU ID for each bundle index, which can only be obtained from inside a Ray actor.

Ran GSM8k on 2 nodes of 4xL40S, colocated and non-colocated:
