
[feat] chunked ipc support for new inference #1512

Merged
SumanthRH merged 31 commits into NovaSky-AI:main from hao-aaron:chunked-fix
Apr 15, 2026

Conversation

@hao-aaron
Collaborator

@hao-aaron commented Apr 14, 2026

While we wait for vllm-project/vllm#39212 and vllm-project/vllm#37476, chunked weight updates remain unavailable in vLLM. In the meantime, this PR adds a new worker extension class for new inference that exposes the relevant start/finish weight-update APIs, enabling chunked weight updates for SkyRL IPC. We keep the old one-shot weight-update API that NCCL uses, and expose a new API, used only by IPC, for chunked weight updates.
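The start/update/finish lifecycle described above can be sketched in plain Python. This is a minimal illustration only: the method names (`start_weight_update`, `update_weights_chunk`, `finish_weight_update`) and the dict-based chunk representation are stand-ins, not the actual SkyRL worker extension API.

```python
class ChunkedWeightReceiver:
    """Illustrative receiver that accepts model weights in chunks
    rather than as one monolithic payload."""

    def __init__(self):
        self._pending = None

    def start_weight_update(self, total_params: int) -> None:
        # Phase 1: set up bookkeeping before any chunk arrives.
        self._pending = {"expected": total_params, "received": {}}

    def update_weights_chunk(self, chunk: dict) -> None:
        # Phase 2: apply one chunk; called repeatedly, so peak memory is
        # bounded by the chunk size instead of the full model size.
        assert self._pending is not None, "start_weight_update not called"
        self._pending["received"].update(chunk)

    def finish_weight_update(self) -> bool:
        # Phase 3: verify completeness and release transfer state.
        done = len(self._pending["received"]) == self._pending["expected"]
        self._pending = None
        return done
```

The one-shot NCCL path corresponds to calling the whole sequence once with a single chunk; the IPC path issues many `update_weights_chunk` calls between start and finish.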

Tested by running `test_policy_local_engines_e2e.py` and ensuring all tests pass.

SumanthRH and others added 16 commits April 8, 2026 00:54
Set _SKYRL_USE_NEW_INFERENCE=1 globally in the CI script instead of
running new-inference tests as separate pytest invocations. This ensures
all GPU CI tests exercise the new inference codepath.

Fixes:
- ServerActorPool.shutdown() now kills Ray actors to release GPU memory
- VLLMRouter uses dynamic ports for both the router and Prometheus
  metrics to avoid address-already-in-use crashes between tests
- test_new_inference_generation: fix tokenizer return value
- test_pause_and_continue_generation: adapt 3 tests to work with
  RemoteInferenceClient (use router URL directly, fix .engines access,
  remove fragile tokenizer.decode comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Use the same port reservation pattern as vLLMServerActor to prevent
TOCTOU races. Release reservations in a try/except to avoid socket leaks
on early failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/vllm_router.py
…gration

# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/vllm_router.py
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH
Member

Looking good @hao-aaron! I'm planning to get #1476 in first, which also has the world size fixes for num_engines > 1 with DP. Feel free to leave comments on that PR.

hao-aaron and others added 8 commits April 15, 2026 03:12
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH added a commit that referenced this pull request Apr 15, 2026

# What does this PR do?

Migrate GPU CI to run only on the new inference
codepath.

This is the first part in a series of PRs to migrate completely to the
new inference codepath.

I will wait for R3 to land before merging changes from this PR: #1428 


## Fixes
1. Fix port collisions for prometheus port for vLLM router with multiple
concurrent runs of SkyRL
2. Fix world size calculation for new inference codepath with DP > 1:
Previously, we incorrectly calculated offsets per server url (count of
`num_engines * data_parallel_size`) - we should really be calculating
offsets per deployment (i.e for the count of `num_engines`). This PR
fixes it by including data parallel size in the offset calculation.
3. Fix sleep/wake-up for
`tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`: the old codepath
performed a sleep + wake-up by default, which led to some memory
savings in temporary buffers etc. The new codepath OOMs because engines
are on GPU by default. Added proper sleep and wake-up calls at
inference/training boundaries as the fix.
4. Migrates `test_pause_and_continue_generation.py` to the new inference
codepath.
5. Removes tests meant solely for legacy inference codepath in
`test_engine_generation.py`
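The offset fix in item 2 above can be illustrated with a small sketch. The function name and parameters are assumptions for illustration, not the real SkyRL code; the point is that each deployment (engine) owns `data_parallel_size * tp_size` ranks, so the per-engine offset must advance by that stride.

```python
def rank_offsets(num_engines: int, data_parallel_size: int, tp_size: int) -> list[int]:
    """Starting global rank for each engine (deployment).

    Each engine spans data_parallel_size * tp_size ranks, so offsets are
    computed per deployment with that full stride -- not per server URL.
    """
    stride = data_parallel_size * tp_size
    return [engine * stride for engine in range(num_engines)]
```

For example, two engines with DP=2 and TP=4 give offsets `[0, 8]`, so the engines' rank ranges do not overlap.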


## Test Plan

I ran GPU CI E2E with the new changes and all tests pass. 

TODO:

- [ ] Run GPU CI again and ensure tests pass

## Future work

Not all tests are fully migrated to the new codepath. There are two
major items pending
1. Megatron migration worker tests skip colocated tests for new
inference: We should be able to run these after #1512 lands.
2. Gloo backend support for weight syncing: we should probably just
remove these for now; they should be implementable on top of SkyRL later.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron marked this pull request as ready for review April 15, 2026 19:08
Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a chunked weight update protocol for vLLM inference workers to reduce peak memory usage during weight synchronization. It implements a new worker extension, NewInferenceWorkerWrap, and updates the RemoteInferenceClient and CudaIpcStrategy to utilize a three-phase lifecycle (start, update, and finish) for weight transfers. Review feedback suggests using torch.cuda.device for safer CUDA context management, improving the weight loading logic to handle model buffers, and using torch.cuda.synchronize for better compatibility and consistency.


model = self.model_runner.model
with torch.device(self.device):


medium

The with torch.device(self.device): context manager (introduced in PyTorch 2.0) sets the default device for factory methods (like torch.empty) but does not change the active CUDA device. If the underlying logic (e.g., initialize_layerwise_reload) relies on the current CUDA device via torch.cuda.current_device(), this might lead to issues in multi-GPU environments. Using with torch.cuda.device(self.device): is generally safer for ensuring the correct CUDA context is active.

Suggested change
with torch.device(self.device):
with torch.cuda.device(self.device):

Comment on lines +114 to +124
def load_weights_direct(
weights: list[tuple[str, torch.Tensor]],
) -> None:
for name, weight in weights:
param = model.get_parameter(name)
param.copy_(weight)

self.weight_transfer_engine.receive_weights(
typed_update_info,
load_weights=load_weights_direct,
)


medium

The load_weights_direct implementation uses model.get_parameter(name), which will raise an AttributeError if the weight name refers to a buffer (e.g., quantization scales or running statistics) rather than a parameter. Additionally, it performs a direct copy_ without verifying if the destination parameter is on the correct device or has a matching dtype/shape, which model.load_weights typically handles. Consider using a more robust lookup mechanism that handles both parameters and buffers if this path is intended for general use.
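A lookup along the lines the reviewer suggests might look like the following sketch. Here `params` and `buffers` are plain-dict stand-ins for `dict(model.named_parameters())` and `dict(model.named_buffers())`, lists stand in for tensors, and the helper name is hypothetical; a real implementation would use `tensor.copy_()` and check dtype/device as well.

```python
def load_weights_robust(params: dict, buffers: dict,
                        weights: list[tuple[str, list]]) -> None:
    """Copy incoming (name, value) pairs into parameters OR buffers."""
    for name, value in weights:
        # Look among both parameters and buffers before copying.
        if name in params:
            dest = params[name]
        elif name in buffers:
            dest = buffers[name]
        else:
            raise KeyError(f"unknown weight: {name}")
        # Basic shape check; a real version would also verify dtype/device.
        if len(dest) != len(value):
            raise ValueError(f"shape mismatch for {name}")
        dest[:] = value  # in-place copy, analogous to tensor.copy_()
```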

load_weights=load_weights_direct,
)

torch.accelerator.synchronize()


medium

Using torch.accelerator.synchronize() introduces an inconsistency with the rest of the codebase (e.g., cuda_ipc_strategy.py uses torch.cuda.synchronize()) and may cause compatibility issues with PyTorch versions older than 2.4. It is recommended to use torch.cuda.synchronize() for better portability and consistency within the repository.

Suggested change
torch.accelerator.synchronize()
torch.cuda.synchronize()

Member

@SumanthRH left a comment


Can you unskip the full param test in `test_megatron_worker.py`?

x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Contributor

@devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


@SumanthRH merged commit 996698d into NovaSky-AI:main Apr 15, 2026
5 of 6 checks passed
SumanthRH added a commit that referenced this pull request Apr 22, 2026
…path (#1557)

# What does this PR do?

Support for CUDA IPC-based weight transfer for the new inference
codepath was added in #1512, but it sent tensors one at a time. This PR
packs tensors in the same chunk together.
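The packing idea can be sketched with plain lists standing in for flat GPU tensors. The function names and the `(name, offset, length)` metadata layout are illustrative assumptions, not the actual SkyRL implementation; the point is that one contiguous buffer per chunk needs only a single IPC handle.

```python
def pack_chunk(tensors: list[tuple[str, list]]) -> tuple[list, list]:
    """Concatenate all tensors in a chunk into one flat buffer.

    Returns the buffer plus (name, offset, length) metadata so the
    receiver can slice individual tensors back out.
    """
    buffer, meta, offset = [], [], 0
    for name, data in tensors:
        buffer.extend(data)                 # one contiguous allocation
        meta.append((name, offset, len(data)))
        offset += len(data)
    return buffer, meta


def unpack_chunk(buffer: list, meta: list) -> dict:
    """Recover named tensors from the packed buffer using the metadata."""
    return {name: buffer[off:off + n] for name, off, n in meta}
```

With real CUDA tensors, the sender would share one IPC handle for `buffer` and send `meta` over the control channel, instead of one handle per tensor.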


## Test Plan

I manually ran FSDP and Megatron colocated weight sync tests and they
pass:

1. `uv run --isolated --extra megatron --extra dev -- pytest -s -vvv
tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py::test_megatron_policy_weight_sync[colocate_all]`
2. `uv run --isolated --extra fsdp --extra dev -- pytest -s -vv
tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_nccl_fsdp2_vllm]`

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
