[feat] chunked ipc support for new inference #1512

SumanthRH merged 31 commits into NovaSky-AI:main from
Conversation
Set _SKYRL_USE_NEW_INFERENCE=1 globally in the CI script instead of running new-inference tests as separate pytest invocations. This ensures all GPU CI tests exercise the new inference codepath.

Fixes:
- ServerActorPool.shutdown() now kills Ray actors to release GPU memory
- VLLMRouter uses dynamic ports for both the router and Prometheus metrics to avoid address-already-in-use crashes between tests
- test_new_inference_generation: fix tokenizer return value
- test_pause_and_continue_generation: adapt 3 tests to work with RemoteInferenceClient (use router URL directly, fix .engines access, remove fragile tokenizer.decode comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Use the same port reservation pattern as vLLMServerActor to prevent TOCTOU races. Release reservations in a try/except to avoid socket leaks on early failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
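A minimal sketch of the reserve-then-release pattern described above; the helper names and the `start_server` call are illustrative, not SkyRL's actual API:

```python
import socket

def reserve_port() -> tuple[socket.socket, int]:
    # Bind to port 0 so the OS picks a free port; holding the bound socket
    # open prevents another process from grabbing the port, closing the
    # TOCTOU window between "find a free port" and "bind the server".
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", 0))
    return sock, sock.getsockname()[1]

sock, port = reserve_port()
try:
    # hypothetical server launch; on Linux, a server that also sets
    # SO_REUSEADDR can bind this port while the non-listening
    # reservation socket is still open
    start_server(port=port)
except Exception:
    # release the reservation on early failure so the socket does not leak
    sock.close()
    raise
sock.close()
```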
# Conflicts:
#   skyrl/backends/skyrl_train/inference_servers/vllm_router.py
…gration

# Conflicts:
#   skyrl/backends/skyrl_train/inference_servers/vllm_router.py
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Looking good @hao-aaron! I'm planning to get #1476 in first, which also has the world size fixes for num_engines > 1 with DP. Feel free to leave comments on that PR.
# What does this PR do?

Migrates GPU CI to run only on the new inference codepath. This is the first part in a series of PRs to migrate completely to the new inference codepath. I will wait for R3 to land before merging changes from this PR: #1428

## Fixes

1. Fix port collisions on the Prometheus port for the vLLM router with multiple concurrent runs of SkyRL.
2. Fix world size calculation for the new inference codepath with DP > 1: previously, we incorrectly calculated offsets per server URL (a count of `num_engines * data_parallel_size`), when we should really be calculating offsets per deployment (i.e., a count of `num_engines`). This PR fixes it by including the data parallel size in the offset calculation (a hypothetical illustration follows this description).
3. Fix sleep/wake-up for `tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`: the old codepath performed a sleep + wake-up by default, which saved some memory in temporary buffers etc. The new codepath OOMs because engines stay on GPU by default. Added proper sleep and wake-up calls at the inference/training boundaries as the fix.
4. Migrates `test_pause_and_continue_generation.py` to the new inference codepath.
5. Removes tests meant solely for the legacy inference codepath in `test_engine_generation.py`.

## Test Plan

I ran GPU CI E2E with the new changes and all tests pass.

TODO:
- [ ] Run GPU CI again and ensure tests pass

## Future work

Not all tests are fully migrated to the new codepath. There are two major items pending:
1. Megatron worker tests skip colocated tests for new inference: we should be able to run these after #1512 lands.
2. Gloo backend support for weight syncing: we should probably just get rid of these for now; should be implementable on top of SkyRL.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
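A hypothetical illustration of the offset fix in item 2 above; the variable names and formula are illustrative of the described change, not the repository's exact code:

```python
num_engines = 2           # deployments
data_parallel_size = 2    # DP replicas per deployment
tensor_parallel_size = 4  # ranks per DP replica

# Before: one offset per server URL (num_engines * data_parallel_size entries),
# spacing DP replicas as if each replica were its own deployment.
# After: one offset per deployment, scaled by dp * tp, so all ranks of a
# deployment (across its DP replicas) occupy one contiguous rank range.
for engine_idx in range(num_engines):
    offset = engine_idx * data_parallel_size * tensor_parallel_size
    print(f"engine {engine_idx}: world-rank offset {offset}")
# engine 0: world-rank offset 0
# engine 1: world-rank offset 8
```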
Code Review
This pull request introduces a chunked weight update protocol for vLLM inference workers to reduce peak memory usage during weight synchronization. It implements a new worker extension, `NewInferenceWorkerWrap`, and updates the `RemoteInferenceClient` and `CudaIpcStrategy` to utilize a three-phase lifecycle (start, update, and finish) for weight transfers. Review feedback suggests using `torch.cuda.device` for safer CUDA context management, improving the weight loading logic to handle model buffers, and using `torch.cuda.synchronize` for better compatibility and consistency.
```python
model = self.model_runner.model
with torch.device(self.device):
```
The `with torch.device(self.device):` context manager (introduced in PyTorch 2.0) sets the default device for factory methods (like `torch.empty`) but does not change the active CUDA device. If the underlying logic (e.g., `initialize_layerwise_reload`) relies on the current CUDA device via `torch.cuda.current_device()`, this might lead to issues in multi-GPU environments. Using `with torch.cuda.device(self.device):` is generally safer for ensuring the correct CUDA context is active.
```diff
- with torch.device(self.device):
+ with torch.cuda.device(self.device):
```
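A small standalone illustration of the distinction the comment draws, assuming a machine with at least two GPUs (not repository code):

```python
import torch

with torch.device("cuda:1"):
    x = torch.empty(8)                  # factory default device: allocated on cuda:1
    print(torch.cuda.current_device())  # still 0: the active CUDA context is unchanged

with torch.cuda.device(1):
    print(torch.cuda.current_device())  # 1: the active CUDA device is switched
```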
```python
def load_weights_direct(
    weights: list[tuple[str, torch.Tensor]],
) -> None:
    for name, weight in weights:
        param = model.get_parameter(name)
        param.copy_(weight)


self.weight_transfer_engine.receive_weights(
    typed_update_info,
    load_weights=load_weights_direct,
)
```
The `load_weights_direct` implementation uses `model.get_parameter(name)`, which will raise an `AttributeError` if the weight name refers to a buffer (e.g., quantization scales or running statistics) rather than a parameter. Additionally, it performs a direct `copy_` without verifying that the destination parameter is on the correct device or has a matching dtype/shape, which `model.load_weights` typically handles. Consider using a more robust lookup mechanism that handles both parameters and buffers if this path is intended for general use.
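A minimal sketch of the more robust lookup this comment suggests; `model` is assumed to be the `nn.Module` from the hunk above, and the function name is illustrative:

```python
import torch

def load_weights_robust(weights: list[tuple[str, torch.Tensor]]) -> None:
    named = dict(model.named_parameters())
    named.update(model.named_buffers())  # cover buffers (e.g. quant scales) too
    for name, weight in weights:
        dest = named.get(name)
        if dest is None:
            raise KeyError(f"unknown weight name: {name}")
        if dest.shape != weight.shape:
            raise ValueError(
                f"shape mismatch for {name}: {tuple(dest.shape)} vs {tuple(weight.shape)}"
            )
        with torch.no_grad():
            # move/cast to the destination's device and dtype before copying
            dest.copy_(weight.to(device=dest.device, dtype=dest.dtype))
```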
```python
    load_weights=load_weights_direct,
)

torch.accelerator.synchronize()
```
Using `torch.accelerator.synchronize()` introduces an inconsistency with the rest of the codebase (e.g., `cuda_ipc_strategy.py` uses `torch.cuda.synchronize()`) and may cause compatibility issues with PyTorch versions older than 2.4. It is recommended to use `torch.cuda.synchronize()` for better portability and consistency within the repository.
```diff
- torch.accelerator.synchronize()
+ torch.cuda.synchronize()
```
SumanthRH left a comment
Can you unskip the full param test in `test_megatron_worker.py`?
…path (#1557)

# What does this PR do?

Support for CUDA IPC based weight transfer for the new inference codepath was added in #1512, but it sent tensors one at a time. This PR packs tensors in the same chunk together.

## Test Plan

I manually ran the FSDP and Megatron colocated weight sync tests and they pass:
1. `uv run --isolated --extra megatron --extra dev -- pytest -s -vvv tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py::test_megatron_policy_weight_sync[colocate_all]`
2. `uv run --isolated --extra fsdp --extra dev -- pytest -s -vv tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_nccl_fsdp2_vllm]`

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
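A hypothetical sketch of what packing a chunk's tensors together might look like; all names here are illustrative, not SkyRL's actual implementation. Packing the chunk into one contiguous device buffer means a single CUDA IPC handle covers the whole chunk instead of one handle per tensor:

```python
import torch

ALIGN = 16  # keep per-tensor byte offsets aligned so view(dtype) is legal

def pack_chunk(weights: list[tuple[str, torch.Tensor]]):
    metas, offset = [], 0
    for name, w in weights:
        nbytes = w.numel() * w.element_size()
        metas.append((name, w.dtype, tuple(w.shape), offset, nbytes))
        offset += (nbytes + ALIGN - 1) // ALIGN * ALIGN
    buf = torch.empty(offset, dtype=torch.uint8, device="cuda")
    for (name, w), (_, _, _, off, nbytes) in zip(weights, metas):
        buf[off:off + nbytes].copy_(w.contiguous().view(torch.uint8).flatten())
    return buf, metas  # share `buf` once via IPC; send `metas` as plain metadata

def unpack_chunk(buf: torch.Tensor, metas):
    # reinterpret slices of the shared buffer as the original dtypes/shapes
    return [
        (name, buf[off:off + nbytes].view(dtype).view(shape))
        for name, dtype, shape, off, nbytes in metas
    ]
```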
While we wait for vllm-project/vllm#39212 and vllm-project/vllm#37476, support for chunked updates is still unavailable in vLLM. In the meantime, this PR adds a new worker extension class for new inference that adds the relevant start/finish weight update API, allowing chunked weight updates for SkyRL IPC. We maintain the old, one-shot weight update API that NCCL uses, while also exposing a new API that only IPC uses to do chunked weight updates.
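A hypothetical sketch of the three-phase lifecycle from the client's perspective; the method and variable names are illustrative, not the actual `RemoteInferenceClient` API (`client` and `chunks` are assumed to exist):

```python
# start: workers allocate any state needed for an incremental update
client.start_weight_update(num_chunks=len(chunks))

# update: each chunk is transferred via CUDA IPC and loaded immediately,
# so only one chunk's worth of tensors is materialized at a time on the
# inference worker, keeping peak memory low
for chunk in chunks:  # chunk: list[tuple[str, torch.Tensor]]
    client.update_weights_chunk(chunk)

# finish: workers run any deferred post-load steps and release update state
client.finish_weight_update()
```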
Tested by running `test_policy_local_engines_e2e.py` and ensuring all tests pass.