[R3] Enable R3 with new inference #1428
Conversation
…nference codepath

Adds a custom `/skyrl/v1/generate` endpoint to `VLLMServerActor` that calls the vLLM engine directly and returns `routed_experts` alongside token output. The standard `/inference/v1/generate` endpoint's `GenerateResponseChoice` does not include `routed_experts` (it is only available on the Python `CompletionOutput` object), so a custom endpoint is required.

Changes:
- `vllm_server_actor.py`: Add `/skyrl/v1/generate` endpoint with correct logprobs serialisation (placeholder `-9999.0` for missing entries, matching vLLM's `ChatCompletionLogProb` default) and `routed_experts` extraction (see the serialisation sketch below). Raises `NotImplementedError` if LoRA is enabled.
- `remote_inference_client.py`: Switch `_generate_single` to `/skyrl/v1/generate`; extract and propagate `routed_experts` through to `InferenceEngineOutput.rollout_expert_indices`.
- `inference_servers/utils.py`: Pass `enable_return_routed_experts` to the vLLM CLI args so the engine computes routed experts.
- `train/utils/utils.py`: Gate the `mp` backend assertion for R3 behind `if not _SKYRL_USE_NEW_INFERENCE` (the new path uses the ray backend); remove the `ValueError` blocking R3 on the new inference path; add startup validation that LoRA and R3 cannot be combined on the new path.
- `main_base.py`, `tests/gpu/utils.py`: Pass `enable_return_routed_experts` when constructing `RemoteInferenceClient`.
- `test_remote_inference_client.py`: Update the mock endpoint to `/skyrl/v1/generate` returning a single choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
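For illustration, a minimal sketch of the logprobs/`routed_experts` serialisation described in the first bullet. `serialize_completion`, its payload keys, and the duck-typed `output` argument are assumptions for this sketch; the `-9999.0` placeholder and the `routed_experts` field on `CompletionOutput` come from the commit message above.

```python
from typing import Any

# Matches vLLM's ChatCompletionLogProb default for missing logprob entries.
LOGPROB_PLACEHOLDER = -9999.0


def serialize_completion(output: Any) -> dict:
    """Serialize a CompletionOutput-like object into a JSON-safe payload.

    Assumes output.logprobs is a per-token list of {token_id: Logprob} dicts
    (vLLM's SampleLogprobs shape) and that output.routed_experts is populated
    when the engine runs with enable_return_routed_experts.
    """
    token_ids = list(output.token_ids)
    logprob_dicts = output.logprobs or [None] * len(token_ids)
    logprobs = []
    for token_id, lp_dict in zip(token_ids, logprob_dicts):
        entry = lp_dict.get(token_id) if lp_dict else None
        # Substitute the placeholder so the client always receives a float.
        logprobs.append(entry.logprob if entry is not None else LOGPROB_PLACEHOLDER)
    return {
        "token_ids": token_ids,
        "logprobs": logprobs,
        # May need a .tolist() conversion depending on the array type returned.
        "routed_experts": getattr(output, "routed_experts", None),
    }
```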
…pported

R3 requires the mp backend to avoid hangs, but mp is not yet supported on the new inference path (tracked in NovaSky-AI#1309). Restore the `ValueError` blocking R3 on new inference, and un-gate the mp assertion so it applies to both the old and new inference paths consistently. The infrastructure changes (the `/skyrl/v1/generate` endpoint and the `RemoteInferenceClient` propagation) remain as pre-work for when mp support lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Made-with: Cursor

# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/remote_inference_client.py
#	skyrl/train/entrypoints/main_base.py
#	tests/backends/skyrl_train/gpu/utils.py
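For context, a minimal sketch of the startup validation this revert restores, assuming hypothetical config field names (`enable_router_replay`, `vllm_backend`) rather than the actual SkyRL config schema:

```python
import os

# Hypothetical config fields: `enable_router_replay` stands in for the real
# R3 toggle and `vllm_backend` for the distributed executor backend setting.
_SKYRL_USE_NEW_INFERENCE = os.environ.get("_SKYRL_USE_NEW_INFERENCE") == "1"


def validate_r3(cfg) -> None:
    if not cfg.enable_router_replay:  # R3 disabled, nothing to validate
        return
    if _SKYRL_USE_NEW_INFERENCE:
        # mp is not yet supported on the new inference path (NovaSky-AI#1309).
        raise ValueError("R3 is not yet supported on the new inference path")
    # R3 hangs without the mp backend, so assert it unconditionally for R3 runs.
    assert cfg.vllm_backend == "mp", "R3 requires the mp executor backend"
```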
For reference: We ran the Moonlight-16B script with the old and the new inference and got matching curves: https://api.wandb.ai/links/sky-posttraining-uc-berkeley/lwaaqy73
SumanthRH
left a comment
Let's address the issue with the sample API
…ence_client.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

…client.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
```diff
  # needed for megatron tests
  env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
- env_vars["NVTE_FUSED_ATTN"] = "0"
+ env_vars["NVTE_FUSED_ATTN"] = "1"
```
SkyRL/skyrl/train/utils/utils.py, line 582 in 7ba2490
It was needed to run the R3 tests, but I can revert it.
Could we move this to scripts? It will be lost here.
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
SumanthRH
left a comment
Can you add `test_router_replay.py` to GPU CI with `_SKYRL_USE_NEW_INFERENCE=1`? The primary integration test is currently skipped, but it will be good to have it in the CI script. @erictang000 is working on re-enabling this test, so we will have coverage as soon as his changes land.
Add it here: SkyRL/ci/gpu_ci_run_skyrl_train.sh, line 39 in 7ba2490
Can you fix lint @hao-aaron?
# What does this PR do?

Migrate tests to run only on the new inference codepath. This is the first part in a series of PRs to migrate completely to the new inference codepath. I will wait for R3 to land before merging changes from this PR: #1428

## Fixes

1. Fix port collisions on the Prometheus port for the vLLM router with multiple concurrent runs of SkyRL.
2. Fix the world size calculation for the new inference codepath with DP > 1: previously, we incorrectly calculated offsets per server URL (a count of `num_engines * data_parallel_size`) when we should really be calculating offsets per deployment (i.e. a count of `num_engines`). This PR fixes it by including the data parallel size in the offset calculation (see the sketch after this list).
3. Fix sleep/wake-up for `tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`: the old codepath performed a sleep + wake-up by default, which led to some memory savings in temporary buffers etc. The new codepath OOMs because engines are on GPU by default. Added proper sleep and wake-up calls at the inference-training boundaries as the fix.
4. Migrate `test_pause_and_continue_generation.py` to the new inference codepath.
5. Remove tests meant solely for the legacy inference codepath in `test_engine_generation.py`.

## Test Plan

I ran GPU CI E2E with the new changes and all tests pass.

TODO:
- [ ] Run GPU CI again and ensure tests pass

## Future work

Not all tests are fully migrated to the new codepath. There are two major items pending:
1. Megatron migration worker tests skip colocated tests for new inference: we should be able to run these after #1512 lands.
2. Gloo backend support for weight syncing: we should probably just get rid of these for now; it should be implementable on top of SkyRL.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
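To make fix 2 concrete, here is a minimal sketch of the corrected offset arithmetic; the function and parameter names are illustrative assumptions, not the actual SkyRL implementation:

```python
# Hypothetical names: each deployment spans data_parallel_size DP ranks, and
# each DP rank owns world_size_per_dp_rank workers (e.g. its TP group).
def deployment_rank_offsets(
    num_engines: int, data_parallel_size: int, world_size_per_dp_rank: int
) -> list[int]:
    # Buggy version: one offset step of world_size_per_dp_rank per server URL
    # (num_engines * data_parallel_size of them), which double-counts DP.
    # Fixed version: one step per deployment, with DP folded into the stride.
    ranks_per_deployment = data_parallel_size * world_size_per_dp_rank
    return [i * ranks_per_deployment for i in range(num_engines)]


# Example: 2 engines, DP=2, 4 workers per DP rank -> offsets [0, 8], not [0, 4].
assert deployment_rank_offsets(2, 2, 4) == [0, 8]
```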

