@holynakamoto commented Dec 20, 2025

Title: feat: Allow P2P communication across different PIDs

This PR introduces support for P2P transport (including NVLink) between processes with different PIDs on the same host. This capability is essential for Kubernetes deployments where different ranks run in separate pods but share the same physical node.

Problem:
When NCCL runs in environments like Kubernetes, ranks on the same node may run in different pods and therefore have different process IDs (PIDs). NCCL's P2P transport currently requires ranks to have the same PID to establish a direct connection, which prevents the use of high-speed interconnects like NVLink between pods. Communication then falls back to slower transports (P2P/CUMEM or P2P/IPC instead of P2P/DirectPoint), significantly impacting performance for AI inference workloads that use disaggregated architectures (e.g., Prefill/Decode separation).

Solution:
Introduces a new environment variable NCCL_P2P_ALLOW_CROSS_PID. When set to 1, the PID check is bypassed, allowing P2P transport (including NVLink) to be used between processes with different PIDs on the same host. An informational message is logged when this override is active to ensure users are aware of the non-standard configuration.

Related Issues
Fixes #1781

Changes & Impact

Code Changes

  • New environment variable: NCCL_P2P_ALLOW_CROSS_PID (default: 0)
  • Modified files:
    • src/transport/p2p.cc: Added the NCCL_PARAM definition for the new variable and updated the P2P_SAME_PID macro to bypass the PID check when the variable is set.

(Note: Changes to src/include/param.h and src/init.cc were not required, as the NCCL_PARAM macro system allows for self-contained parameter definition within the C++ file where it is used.)
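
The diff itself is not reproduced in this description, so the following is only a minimal, self-contained sketch of the logic described above, not the actual NCCL code. `PeerInfoSketch`, `allowCrossPidParam`, and `p2pSamePidSketch` are hypothetical names; in NCCL the parameter would come from `NCCL_PARAM` and the message from NCCL's own logging, both of which are replaced here by plain `getenv` and `fprintf` stand-ins.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the per-rank info that the check compares;
// the real NCCL structure carries more fields.
struct PeerInfoSketch {
  uint64_t hostHash;  // identifies the host
  uint64_t pidHash;   // identifies the process
};

// Stand-in for the accessor NCCL_PARAM would provide: read the variable
// from the environment on each call, defaulting to 0 when it is unset.
static int64_t allowCrossPidParam() {
  const char* value = std::getenv("NCCL_P2P_ALLOW_CROSS_PID");
  return value ? std::strtoll(value, nullptr, 10) : 0;
}

// Sketch of the relaxed check: the same host is always required, and the
// PID comparison is skipped only when the override is set to 1.
static bool p2pSamePidSketch(const PeerInfoSketch& me, const PeerInfoSketch& peer) {
  if (me.hostHash != peer.hostHash) return false;
  if (me.pidHash == peer.pidHash) return true;
  if (allowCrossPidParam() != 1) return false;
  std::fprintf(stderr,
               "NCCL_P2P_ALLOW_CROSS_PID set, allowing P2P connection between "
               "processes with different PIDs. This is not a recommended "
               "configuration.\n");
  return true;
}
```

With the variable unset, two ranks on the same host with different `pidHash` values are rejected, matching the default behavior; with `NCCL_P2P_ALLOW_CROSS_PID=1` the same pair is accepted and the message is written. Ranks on different hosts are rejected in both cases.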

Behavior Changes

  • Default behavior: Unchanged. The PID check remains enforced.
  • When enabled (NCCL_P2P_ALLOW_CROSS_PID=1):
    • PID mismatch no longer blocks P2P transport between ranks on the same host.
    • An info message is logged: "NCCL_P2P_ALLOW_CROSS_PID set, allowing P2P connection between processes with different PIDs. This is not a recommended configuration."

Breaking Changes
None. This is a purely additive feature that does not affect existing behavior when the environment variable is not set.

API Modifications

None. Changes are configuration-only via an environment variable.

Performance Impact

  • Expected Performance Improvement: In Kubernetes environments with multiple pods per node, this change is intended to enable P2P/DirectPoint transport over NVLink, which was previously unavailable. It is expected to yield a significant bandwidth improvement over the previous fallback to P2P/CUMEM or P2P/IPC transport; this has not yet been benchmarked.

Testing Performed

⚠️ Local validation note: The build was not validated locally due to the NCCL build system's incompatibility with the development environment (macOS), which is not a target platform. The changes are self-contained and expected to build correctly on supported Linux environments.

Recommended testing (to be performed by reviewers or in CI):

  • Basic functionality: Verify P2P still works as expected with NCCL_P2P_ALLOW_CROSS_PID=0 (default).
  • Cross-PID enablement: Test with NCCL_P2P_ALLOW_CROSS_PID=1 in a multi-pod Kubernetes setup on a single node.
  • Performance validation: Benchmark with nccl-tests in the multi-pod setup to confirm NVLink speeds are achieved.
  • Logging validation: Verify the informational message appears correctly in the logs when the feature is enabled.

Limitations and Caveats

Security & Safety Considerations
⚠️ Important: This feature bypasses a safety check designed to prevent unintended memory access between processes.

Safe usage requires:

  • Process isolation: Containers/pods must have proper namespace isolation (PID, network, etc.).
  • Trusted environment: Only enable in controlled environments (e.g., Kubernetes with proper pod security policies).
  • Same user: Processes should ideally run as the same UID for an additional layer of safety.

Do NOT use if:

  • Running untrusted code in different pods on the same node.
  • Processes have different privilege levels.
  • You are unsure about the container isolation mechanisms of your environment.

Documentation Updates

(Note: No central documentation file like docs/env.md exists in the repository. The following is a proposed documentation entry for the official NCCL documentation.)


NCCL_P2P_ALLOW_CROSS_PID

Type: Integer (0 or 1)
Default: 0 (disabled)

Allows P2P transport between processes with different PIDs on the same host.

When set to 1, this variable bypasses the PID check in the P2P transport setup. This is primarily intended to enable high-speed interconnects like NVLink between containerized processes (e.g., Kubernetes pods) that are running on the same physical node but in different PID namespaces.

⚠️ Security Warning: This bypasses a safety mechanism. It should only be used in controlled and trusted environments where process isolation is guaranteed by other means (e.g., Kubernetes namespaces, cgroups, and seccomp profiles).

Use Case: Multi-pod, single-node Kubernetes deployments where different ranks of a distributed job share physical GPUs connected via NVLink. For example, LLM inference with Prefill/Decode disaggregation.

Problem:
When NCCL runs in environments like Kubernetes, ranks on the same node may run in different pods and therefore have different process IDs (PIDs). NCCL's P2P transport requires ranks to have the same PID to establish a direct connection, which prevents the use of high-speed interconnects like NVLink between pods on the same node. Communication then falls back to slower transports, impacting performance.

Solution:
Introduce a new environment variable NCCL_P2P_ALLOW_CROSS_PID. When this variable is set to 1, the PID check is bypassed, allowing P2P transport (including NVLink) to be used between processes with different PIDs on the same host. An informational message is printed when this override is active to ensure the user is aware of the non-standard configuration.

Limitations and Caveats:
This feature bypasses a safety check. It should only be used in controlled environments where the user understands the implications of allowing direct memory access between different processes.

The build was not validated locally due to the build system's incompatibility with macOS, which is not a target platform. The changes are self-contained and are expected to build correctly on a supported Linux environment.

Signed-off-by: Your Name <your.email@example.com>
@sjeaugey (Member) commented Jan 3, 2026

I'm quite confused by the PR. Forcing NCCL to believe that two processes are the same PID will make NCCL think that it can access other ranks' memory by just passing the address of the pointer.

This is not supported by CUDA between processes, even less between containers. I'm struggling to understand how that can work...
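
For readers following the thread, the concern can be illustrated with a host-memory analogy. This is a sketch only: it uses plain POSIX `fork` rather than CUDA, and the function name `valueSeenByParentAfterChildWrite` is made up for illustration. It shows that a write through the same virtual address in another process is not visible to the original process, because each process has its own address space. CUDA device pointers are likewise only meaningful in the process that allocated them, which is why cross-process sharing normally goes through an exported handle (for example `cudaIpcGetMemHandle` and `cudaIpcOpenMemHandle`) instead of a raw pointer value.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the value the parent observes after a child process has written
// to the same virtual address, or -1 if fork() fails.
static int valueSeenByParentAfterChildWrite() {
  static volatile int value = 1;
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    // Child: same address as in the parent, but a private copy of the page.
    value = 2;
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  // Parent: the child's write never reached this address space.
  return value;
}
```

The parent still reads 1 after the child has exited, even though both processes used the same pointer value.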

Development

Successfully merging this pull request may close these issues.

Problems with using NVLink across Kubernetes Pods on a same node

2 participants