fix: fix nccl P2P initialization error for non-colocated #636
Merged
parthchadha merged 4 commits into main on Jul 11, 2025
Conversation
Signed-off-by: Zhanda <zhandazhu@gmail.com>
yuki-97 reviewed Jul 10, 2025
Signed-off-by: Zhanda Zhu <49645678+Dazz993@users.noreply.github.com>
wangshangsam approved these changes Jul 10, 2025
Contributor
Hmmm ... I was looking into the mypy errors in #632 until I realized that mypy fails for this PR too. It's hard to imagine why this PR would trigger any mypy failures. @terrykong @parthchadha is the mypy failure expected?
Collaborator
@wangshangsam the mypy job is expected to fail. It won't block a PR; it's just there as an FYI about typing issues. Once we're completely in the green, we'll change it so that it gates PRs.
parthchadha approved these changes Jul 10, 2025
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Jul 21, 2025
Signed-off-by: Zhanda <zhandazhu@gmail.com> Signed-off-by: Zhanda Zhu <49645678+Dazz993@users.noreply.github.com> Co-authored-by: Zhanda Zhu <zhandaz@cw-dfw-cs-001-vscode-02.cm.cluster> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
…#636) Signed-off-by: Zhanda <zhandazhu@gmail.com> Signed-off-by: Zhanda Zhu <49645678+Dazz993@users.noreply.github.com> Co-authored-by: Zhanda Zhu <zhandaz@cw-dfw-cs-001-vscode-02.cm.cluster> Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Zhanda <zhandazhu@gmail.com> Signed-off-by: Zhanda Zhu <49645678+Dazz993@users.noreply.github.com> Co-authored-by: Zhanda Zhu <zhandaz@cw-dfw-cs-001-vscode-02.cm.cluster>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
…#636) Signed-off-by: Zhanda <zhandazhu@gmail.com> Signed-off-by: Zhanda Zhu <49645678+Dazz993@users.noreply.github.com> Co-authored-by: Zhanda Zhu <zhandaz@cw-dfw-cs-001-vscode-02.cm.cluster>
What does this PR do?
Adds an explicit `NCCL_CUMEM_ENABLE=1` environment variable setting to resolve P2P initialization failures in distributed training with vLLM. Please see the detailed analysis in #564 (comment).
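The shape of the change can be sketched as follows. This is a hedged illustration, not the actual diff: the helper name and call site are assumptions; the real change lives in `nemo_rl/models/generation/vllm_backend.py`.

```python
import os


def enable_nccl_cumem() -> None:
    # Hypothetical sketch of the fix: NCCL_CUMEM_ENABLE=1 makes NCCL use
    # cuMem*-based allocations, which avoids the P2P initialization error
    # seen in non-colocated (training and generation on separate nodes)
    # setups. The variable must be set in the worker process before any
    # NCCL communicator is created.
    os.environ["NCCL_CUMEM_ENABLE"] = "1"


# Must run before torch.distributed / NCCL initialization in the worker.
enable_nccl_cumem()
```

The key constraint is ordering: environment variables are read once when NCCL initializes, so setting this after the first communicator is created has no effect.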
Issues
Closes #564.
This PR may also be helpful for #613. Maybe @YUki-666 could take a look. The only change you may need to make is to delete the `os.environ["NCCL_CUMEM_ENABLE"] = "0"` line in the `init_collective` function of `nemo_rl/models/policy/megatron_policy_worker.py`.
Test results
I have tested the settings @YUki-666 provided: an 8B model GRPO run on 5 nodes, where:
- `exp1_5n_non_colocated_p2p_disabled`: before the fix, running with `NCCL_P2P_DISABLE=1`.
- `exp2_5n_non_colocated_fix`: after this PR's fix.
Usage
The fix applies automatically when using distributed training with vLLM generation workers; no user action is required.
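For builds that predate this fix, one could in principle export the variable manually before launching so that every spawned worker inherits it. A hedged sketch (the launch command itself is omitted and would depend on your setup):

```shell
# Manual fallback for older builds without this fix: export the variable
# in the launching shell before starting the job, so all worker processes
# inherit it at NCCL initialization time.
export NCCL_CUMEM_ENABLE=1
echo "NCCL_CUMEM_ENABLE=$NCCL_CUMEM_ENABLE"
```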
Additional Information
This PR works for both the current `vllm==0.9.0` and newer versions like `vllm>=0.9.1rc1`. If we upgrade our pinned version, we can remove the additional environment variable setting in `nemo_rl/models/generation/vllm_backend.py`.
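If the override is kept around after an upgrade, it could be gated on the installed vLLM version. The helper below is hypothetical, not part of this PR, and a production check should use `packaging.version` rather than this naive parser:

```python
import os


def needs_cumem_workaround(vllm_version: str) -> bool:
    """Return True when NCCL_CUMEM_ENABLE=1 must still be set explicitly.

    Assumption from this PR's notes: vllm >= 0.9.1rc1 no longer needs the
    explicit override, so only apply it for older releases.
    """
    major_s, minor_s, rest = vllm_version.split(".", 2)
    # Take only the leading digits of the patch field ("1rc1" -> 1).
    patch_digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        patch_digits += ch
    patch = int(patch_digits or "0")
    return (int(major_s), int(minor_s), patch) < (0, 9, 1)


if needs_cumem_workaround("0.9.0"):  # e.g. the currently pinned version
    os.environ["NCCL_CUMEM_ENABLE"] = "1"
```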