[train] Support packing for CUDA IPC transfer with new inference codepath#1557
Conversation
Code Review
This pull request optimizes the weight transfer process by packing tensors within a chunk into a single contiguous CUDA buffer, which reduces the overhead of managing multiple IPC handles. The review feedback identifies a security risk in using pickle for deserialization and suggests refactoring the manual unpacking logic to reuse existing components. Further recommendations include replacing magic indices with comments and using consistent utility functions for GPU identification to improve maintainability.
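The packing scheme described above can be sketched as follows. This is a minimal illustration using flat Python lists in place of CUDA tensors; `pack_tensors` and its return format are hypothetical stand-ins, not the PR's actual API:

```python
def pack_tensors(named_tensors):
    """Pack flat tensors into one contiguous buffer, recording the metadata
    (names, shapes, element counts) the receiver needs to rebuild them.

    `named_tensors` is an iterable of (name, shape, flat_data) triples.
    """
    names, shapes, sizes, buffer = [], [], [], []
    for name, shape, data in named_tensors:
        names.append(name)
        shapes.append(shape)
        sizes.append(len(data))
        buffer.extend(data)  # concatenate into one contiguous region
    # One buffer means one IPC handle per chunk instead of one per tensor.
    return {"names": names, "shapes": shapes, "sizes": sizes}, buffer
```

The key point is that only the single `buffer` crosses the IPC boundary; the metadata dict travels out of band and drives the unpacking loop on the receiver.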
```python
# --- unpack SkyRL packed CUDA IPC format ---
import base64
import pickle

import torch

names = update_info["names"]
shapes = update_info["shapes"]
sizes = update_info["sizes"]
pickled = update_info["ipc_handles_pickled"]
handles = pickle.loads(base64.b64decode(pickled))

device_index = torch.cuda.current_device()
physical_gpu_id = str(torch.cuda.get_device_properties(device_index).uuid)
if physical_gpu_id not in handles:
    raise ValueError(f"IPC handle not found for GPU UUID {physical_gpu_id}. Available: {list(handles)}")
func, args = handles[physical_gpu_id]
# Remap the device index to the LOCAL current device.
list_args = list(args)
list_args[6] = device_index  # args[6] is the device index in the IPC rebuild args
packed_tensor = func(*list_args)

# Slice the packed buffer back into individual named tensors.
weights: list[tuple[str, torch.Tensor]] = []
offset = 0
for name, shape, size in zip(names, shapes, sizes):
    weights.append((name, packed_tensor[offset : offset + size].view(*shape)))
    offset += size
```
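The slice-and-view loop at the end rebuilds each tensor by advancing an offset through the flat buffer. A pure-Python analogue of that arithmetic, with no torch dependency (`unpack` and `reshape` are hypothetical helpers for illustration):

```python
import math

def unpack(buffer, names, shapes, sizes):
    """Rebuild (name, nested-list) pairs from a flat buffer, mirroring the
    offset/view loop above: each tensor occupies `size` consecutive elements."""

    def reshape(vals, dims):
        # Recursively nest a flat slice into lists matching `dims`.
        if len(dims) <= 1:
            return list(vals)
        step = math.prod(dims[1:])
        return [reshape(vals[i * step : (i + 1) * step], dims[1:]) for i in range(dims[0])]

    weights, offset = [], 0
    for name, shape, size in zip(names, shapes, sizes):
        flat = buffer[offset : offset + size]
        offset += size
        weights.append((name, reshape(flat, list(shape))))
    return weights
```

Because offsets are derived purely from the `sizes` list, sender and receiver must agree on tensor order; the metadata lists are positional.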
This block manually implements the logic for unpacking the packed CUDA IPC format and rebuilding the tensors. This logic is already present in CudaIpcWeightTransferReceiver.receive_weights (in cuda_ipc_strategy.py). To improve maintainability and reduce duplication, consider refactoring this method to leverage the existing weight_transfer_engine (which is the receiver) to handle the unpacking of update_info. This would centralize the IPC handling logic and make it easier to update in the future.
The old codepath will be removed soon.
What does this PR do?
Support for CUDA IPC-based weight transfer in the new inference codepath was added in #1512, but it sent tensors one at a time. This PR packs the tensors in a chunk into a single contiguous buffer before transfer.
Test Plan
I manually ran FSDP and Megatron colocated weight sync tests and they pass:
uv run --isolated --extra megatron --extra dev -- pytest -s -vvv tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py::test_megatron_policy_weight_sync[colocate_all]

uv run --isolated --extra fsdp --extra dev -- pytest -s -vv tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_nccl_fsdp2_vllm]