
[CI] Migrate non-Megatron GPU CI to run on new inference codepath#1476

Merged
SumanthRH merged 22 commits into main from gpu-ci-migration
Apr 15, 2026

Conversation

@SumanthRH
Member

@SumanthRH SumanthRH commented Apr 8, 2026

What does this PR do?

Migrate the non-Megatron GPU CI to run only on the new inference codepath.

This is the first part in a series of PRs to migrate completely to the new inference codepath.

I will wait for R3 to land before merging changes from this PR: #1428

Fixes

  1. Fix Prometheus port collisions for the vLLM router when multiple SkyRL runs execute concurrently.
  2. Fix the world size calculation for the new inference codepath with DP > 1: previously, we incorrectly computed offsets per server URL (a count of num_engines * data_parallel_size), when we should be computing offsets per deployment (a count of num_engines). This PR fixes it by including the data parallel size in the offset calculation.
  3. Fix sleep/wake-up for tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py: the old codepath performed a sleep + wake-up by default, which saved some memory in temporary buffers. The new codepath OOMs because engines stay on GPU by default. The fix adds proper sleep and wake-up calls at the inference/training boundaries.
  4. Migrate test_pause_and_continue_generation.py to the new inference codepath.
  5. Remove tests in test_engine_generation.py that were meant solely for the legacy inference codepath.
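The per-deployment offset logic described in fix 2 can be sketched as follows. This is a minimal illustration, assuming each deployment spans data_parallel_size server URLs; compute_rank_offsets and its parameters are hypothetical names, not the actual RemoteInferenceClient API.

```python
def compute_rank_offsets(server_urls, per_server_world_size, data_parallel_size):
    """Compute the total world size and one rank offset per deployment.

    Offsets are counted per deployment (num_engines), not per server URL
    (num_engines * data_parallel_size), matching the fix in this PR.
    """
    assert data_parallel_size > 0, "data_parallel_size must be at least 1"
    num_deployments, remainder = divmod(len(server_urls), data_parallel_size)
    assert remainder == 0, "server URLs must be a multiple of data_parallel_size"
    total_world_size = per_server_world_size * num_deployments
    # One cumulative offset per deployment, not per server URL.
    offsets = [i * per_server_world_size for i in range(num_deployments)]
    return total_world_size, offsets
```

With 4 server URLs and data_parallel_size=2, this yields 2 deployments and 2 offsets, rather than the 4 offsets the old per-URL logic would have produced.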

Test Plan

I ran GPU CI E2E with the new changes and all tests pass.

TODO:

  • Run GPU CI again and ensure tests pass

Future work

Not all tests are fully migrated to the new codepath. There are two major items pending:

  1. Megatron migration worker tests skip colocated tests for new inference: we should be able to run these after [feat] chunked ipc support for new inference #1512 lands.
  2. Gloo backend support for weight syncing: we should probably drop these tests for now; support should be implementable on top of SkyRL.

SumanthRH and others added 3 commits April 8, 2026 00:54
Set _SKYRL_USE_NEW_INFERENCE=1 globally in the CI script instead of
running new-inference tests as separate pytest invocations. This ensures
all GPU CI tests exercise the new inference codepath.

Fixes:
- ServerActorPool.shutdown() now kills Ray actors to release GPU memory
- VLLMRouter uses dynamic ports for both the router and Prometheus
  metrics to avoid address-already-in-use crashes between tests
- test_new_inference_generation: fix tokenizer return value
- test_pause_and_continue_generation: adapt 3 tests to work with
  RemoteInferenceClient (use router URL directly, fix .engines access,
  remove fragile tokenizer.decode comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
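The dynamic-port fix above can be sketched with the standard bind-to-port-0 trick, where the kernel assigns an unused port. This is illustrative only; the actual VLLMRouter implementation may differ.

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port (binding port 0 lets the kernel pick one)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note that the port can be grabbed by another process between closing this socket and the server binding it, which is the TOCTOU race the later port-reservation commit addresses.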
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH and others added 8 commits April 8, 2026 20:43
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Use the same port reservation pattern as vLLMServerActor to prevent
TOCTOU races. Release reservations in a try/except to avoid socket leaks
on early failures.
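The reserve-then-release pattern described above might look like the following sketch. The names are assumptions for illustration, not the actual vLLMServerActor code.

```python
import socket

def reserve_port() -> tuple[int, socket.socket]:
    """Bind an OS-chosen port and keep the socket open as a reservation."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR lets the server re-bind the port while the
    # reservation socket is still in the kernel's tables.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", 0))
    return s.getsockname()[1], s

port, holder = reserve_port()
try:
    # ... hand `port` to the server; close the holder just before it binds ...
    pass
finally:
    holder.close()  # always release the reservation, even on early failures
```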

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/vllm_router.py
…gration

# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/vllm_router.py
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH changed the title [CI] Migrate GPU CI to run on new inference codepath [CI] Migrate FSDP GPU CI to run on new inference codepath Apr 15, 2026
@SumanthRH SumanthRH changed the title [CI] Migrate FSDP GPU CI to run on new inference codepath [CI] Migrate non-Megatron GPU CI to run on new inference codepath Apr 15, 2026
@SumanthRH SumanthRH marked this pull request as ready for review April 15, 2026 01:40
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request migrates the CI to the new inference pathway, adds support for data parallelism in the RemoteInferenceClient, and cleans up legacy tests. Key modifications include updating world size and rank offset logic to account for data_parallel_size and implementing port reservation in VLLMRouter to avoid race conditions. Feedback identifies potential ZeroDivisionError issues in RemoteInferenceClient and BroadcastInitInfo if data_parallel_size is zero or if the server count is not a multiple of the parallel size.

Comment on lines +1005 to +1006
num_deployments = len(self.server_urls) // self.data_parallel_size
self._world_size = (per_server[0] * num_deployments, per_server[0])
Contributor


medium

The calculation of num_deployments assumes data_parallel_size is at least 1 and that len(self.server_urls) is an exact multiple of it. If data_parallel_size is 0, this will raise a ZeroDivisionError. If the division is not exact, it will silently truncate the number of deployments, leading to an incorrect total_world_size. It is recommended to validate these invariants.

Suggested change
num_deployments = len(self.server_urls) // self.data_parallel_size
self._world_size = (per_server[0] * num_deployments, per_server[0])
assert self.data_parallel_size > 0, "data_parallel_size must be at least 1"
num_deployments, remainder = divmod(len(self.server_urls), self.data_parallel_size)
assert remainder == 0, "Number of server URLs must be a multiple of data_parallel_size"
self._world_size = (per_server[0] * num_deployments, per_server[0])

Returns:
List of BroadcastInitInfo, one per server, with cumulative rank_offset.
"""
result: List[BroadcastInitInfo] = []
Contributor


medium

Potential ZeroDivisionError at line 97 if dp_size is 0. Although it has a default value of 1 in the method signature, it is passed dynamically from the client's configuration. Adding a validation check at the start of the method would improve robustness.

Suggested change
result: List[BroadcastInitInfo] = []
assert dp_size > 0, "dp_size must be at least 1"
result: List[BroadcastInitInfo] = []


x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

@SumanthRH SumanthRH merged commit 2facd80 into main Apr 15, 2026
6 checks passed
