
[train] Support packing for CUDA IPC transfer with new inference codepath#1557

Merged
SumanthRH merged 3 commits into main from new-inference-packing on Apr 22, 2026

Conversation

@SumanthRH (Member) commented Apr 22, 2026:

What does this PR do?

Support for CUDA IPC-based weight transfer in the new inference codepath was added in #1512, but it sent tensors one at a time. This PR packs all tensors in a chunk into a single contiguous CUDA buffer, so a whole chunk is transferred with one IPC handle instead of one handle per tensor.
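The packing idea can be sketched as follows. This is a minimal illustration, not the PR's actual implementation (which lives in cuda_ipc_strategy.py and transfers the buffer over CUDA IPC); the function names `pack_tensors` and `unpack_tensors` are hypothetical.

```python
import torch


def pack_tensors(named_tensors):
    """Flatten a chunk of tensors into one contiguous buffer plus slicing metadata."""
    names, shapes, sizes, flats = [], [], [], []
    for name, t in named_tensors:
        names.append(name)
        shapes.append(tuple(t.shape))
        sizes.append(t.numel())
        flats.append(t.reshape(-1))
    # One contiguous buffer means a single IPC handle for the whole chunk.
    packed = torch.cat(flats)
    return packed, {"names": names, "shapes": shapes, "sizes": sizes}


def unpack_tensors(packed, meta):
    """Rebuild the original tensors as zero-copy views into the packed buffer."""
    weights, offset = [], 0
    for name, shape, size in zip(meta["names"], meta["shapes"], meta["sizes"]):
        weights.append((name, packed[offset:offset + size].view(*shape)))
        offset += size
    return weights
```

The receiver side mirrors the unpacking loop in the diff below: it slices the packed buffer by per-tensor sizes and reshapes each slice, so no per-tensor copies are made.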

Test Plan

I manually ran FSDP and Megatron colocated weight sync tests and they pass:

  1. uv run --isolated --extra megatron --extra dev -- pytest -s -vvv tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py::test_megatron_policy_weight_sync[colocate_all]
  2. uv run --isolated --extra fsdp --extra dev -- pytest -s -vv tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_nccl_fsdp2_vllm]

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review April 22, 2026 18:38
@gemini-code-assist (Contributor, Bot) left a comment:


Code Review

This pull request optimizes the weight transfer process by packing tensors within a chunk into a single contiguous CUDA buffer, which reduces the overhead of managing multiple IPC handles. The review feedback identifies a security risk in using pickle for deserialization and suggests refactoring the manual unpacking logic to reuse existing components. Further recommendations include replacing magic indices with comments and using consistent utility functions for GPU identification to improve maintainability.

Comment on lines +109 to +133
```python
# --- unpack SkyRL packed CUDA IPC format ---
import base64
import pickle

names = update_info["names"]
shapes = update_info["shapes"]
sizes = update_info["sizes"]
pickled = update_info["ipc_handles_pickled"]
handles = pickle.loads(base64.b64decode(pickled))

device_index = torch.cuda.current_device()
physical_gpu_id = str(torch.cuda.get_device_properties(device_index).uuid)
if physical_gpu_id not in handles:
    raise ValueError(f"IPC handle not found for GPU UUID {physical_gpu_id}. Available: {list(handles)}")
func, args = handles[physical_gpu_id]
# Remap device index to the LOCAL current device.
list_args = list(args)
list_args[6] = device_index
packed_tensor = func(*list_args)

weights: list[tuple[str, torch.Tensor]] = []
offset = 0
for name, shape, size in zip(names, shapes, sizes):
    weights.append((name, packed_tensor[offset : offset + size].view(*shape)))
    offset += size
```
gemini-code-assist (Contributor), severity medium:

This block manually implements the logic for unpacking the packed CUDA IPC format and rebuilding the tensors. This logic is already present in CudaIpcWeightTransferReceiver.receive_weights (in cuda_ipc_strategy.py). To improve maintainability and reduce duplication, consider refactoring this method to leverage the existing weight_transfer_engine (which is the receiver) to handle the unpacking of update_info. This would centralize the IPC handling logic and make it easier to update in the future.

SumanthRH (Member, Author) replied:

The old codepath will be removed soon.

Comment thread: skyrl/backends/skyrl_train/weight_sync/cuda_ipc_strategy.py
@devin-ai-integration (Contributor, Bot) left a comment:

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


@SumanthRH SumanthRH merged commit 57591db into main Apr 22, 2026
6 of 7 checks passed
