feat: Allow P2P communication across different PIDs #1951
Title:
feat: Allow P2P communication across different PIDs

This PR introduces support for P2P transport (including NVLink) between processes with different PIDs on the same host. This capability is essential for Kubernetes deployments where different ranks run in separate pods but share the same physical node.
Problem:
When running NCCL in environments like Kubernetes, different ranks of the same node may be in different pods with different Process IDs (PIDs). NCCL's P2P transport currently requires ranks to have the same PID to establish a direct connection, preventing the use of high-speed interconnects like NVLink between pods. This forces communication to fall back to slower transports (P2P/CUMEM or P2P/IPC instead of P2P/DirectPoint), significantly impacting performance for AI inference workloads using disaggregated architectures (e.g., Prefill/Decode separation).
Solution:
Introduces a new environment variable, `NCCL_P2P_ALLOW_CROSS_PID`. When set to 1, the PID check is bypassed, allowing P2P transport (including NVLink) to be used between processes with different PIDs on the same host. An informational message is logged when this override is active so that users are aware of the non-standard configuration.
Related Issues
Fixes #1781
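
For reviewers who want a concrete picture of the bypass described under Solution, here is a minimal, self-contained sketch of the intended predicate. It is not the code in this PR: `PeerInfoSketch`, its field names, and `p2pSamePidSketch` are hypothetical stand-ins, and the `allowCrossPid` argument models `NCCL_P2P_ALLOW_CROSS_PID=1`. The sketch assumes the same-host requirement stays in force and only the PID comparison is relaxed.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the per-rank identity that the P2P setup compares.
// The field names are assumptions made for this illustration.
struct PeerInfoSketch {
  std::uint64_t hostHash;  // identifies the physical host
  std::uint64_t pidHash;   // identifies the process
};

// Returns true when two ranks may be treated as "same PID" for P2P setup.
// allowCrossPid models NCCL_P2P_ALLOW_CROSS_PID=1.
inline bool p2pSamePidSketch(const PeerInfoSketch& me,
                             const PeerInfoSketch& peer,
                             bool allowCrossPid) {
  // The override only relaxes the PID comparison; ranks must share a host.
  if (me.hostHash != peer.hostHash) return false;
  // Same process: the standard check already passes.
  if (me.pidHash == peer.pidHash) return true;
  if (allowCrossPid) {
    // Make the non-standard configuration visible to the user.
    std::fprintf(stderr,
                 "NCCL_P2P_ALLOW_CROSS_PID set, allowing P2P connection between "
                 "processes with different PIDs. This is not a recommended "
                 "configuration.\n");
    return true;
  }
  return false;
}
```

With the flag off, two ranks on one host with different PIDs fail the check; with it on, they pass and the message is printed. Ranks on different hosts fail in both cases.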
Changes & Impact
Code Changes
- New environment variable: `NCCL_P2P_ALLOW_CROSS_PID` (default: 0)
- `src/transport/p2p.cc`: Added the `NCCL_PARAM` definition for the new variable and updated the `P2P_SAME_PID` macro to bypass the PID check when the variable is set.

(Note: Changes to `src/include/param.h` and `src/init.cc` were not required, as the `NCCL_PARAM` macro system allows for self-contained parameter definition within the C++ file where it is used.)
Behavior Changes
- When enabled (`NCCL_P2P_ALLOW_CROSS_PID=1`), an informational message is logged: "NCCL_P2P_ALLOW_CROSS_PID set, allowing P2P connection between processes with different PIDs. This is not a recommended configuration."
Breaking Changes
None. This is a purely additive feature that does not affect existing behavior when the environment variable is not set.
API Modifications
None. Changes are configuration-only via an environment variable.
Performance Impact
Testing Performed
Recommended testing (to be performed by reviewers or in CI):
- Run with `NCCL_P2P_ALLOW_CROSS_PID=0` (default).
- Run with `NCCL_P2P_ALLOW_CROSS_PID=1` in a multi-pod Kubernetes setup on a single node.
- Run `nccl-tests` in the multi-pod setup to confirm NVLink speeds are achieved.
Limitations and Caveats
Security & Safety Considerations
⚠️ Important: This feature bypasses a safety check designed to prevent unintended memory access between processes.
Safe usage requires:
Do NOT use if:
Documentation Updates
(Note: No central documentation file like `docs/env.md` exists in the repository. The following is a proposed documentation entry for the official NCCL documentation.)

NCCL_P2P_ALLOW_CROSS_PID
Type: Integer (0 or 1)
Default: 0 (disabled)

Allows P2P transport between processes with different PIDs on the same host.

When set to 1, this variable bypasses the PID check in the P2P transport setup. This is primarily intended to enable high-speed interconnects like NVLink between containerized processes (e.g., Kubernetes pods) that are running on the same physical node but in different PID namespaces.

Use Case: Multi-pod, single-node Kubernetes deployments where different ranks of a distributed job share physical GPUs connected via NVLink. For example, LLM inference with Prefill/Decode disaggregation.
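
The documented type and default (integer, 0 or 1, default 0) can be illustrated with a small parsing sketch. This is an assumption-laden stand-in, not the `NCCL_PARAM` machinery: `allowCrossPidFromEnv` is a hypothetical helper that re-reads the environment on every call, and treating unparsable values as disabled is a choice made for this example only.

```cpp
#include <cstdlib>

// Hypothetical helper: interprets NCCL_P2P_ALLOW_CROSS_PID as described in the
// proposed documentation entry. Only the value 1 enables the override.
inline bool allowCrossPidFromEnv() {
  const char* raw = std::getenv("NCCL_P2P_ALLOW_CROSS_PID");
  if (raw == nullptr || *raw == '\0') return false;  // unset: default 0 (disabled)

  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  // Assumption for this sketch: a value that is not a plain integer is ignored.
  if (end == raw || *end != '\0') return false;

  return value == 1;
}
```

Because the flag is read from the process environment, each pod in a multi-pod deployment would need it set in its own container spec for the ranks to agree on the behavior.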