[CI] Migrate non-Megatron GPU CI to run on new inference codepath #1476
Conversation
Set `_SKYRL_USE_NEW_INFERENCE=1` globally in the CI script instead of running new-inference tests as separate pytest invocations. This ensures all GPU CI tests exercise the new inference codepath.

Fixes:
- `ServerActorPool.shutdown()` now kills Ray actors to release GPU memory
- `VLLMRouter` uses dynamic ports for both the router and Prometheus metrics to avoid address-already-in-use crashes between tests
- `test_new_inference_generation`: fix tokenizer return value
- `test_pause_and_continue_generation`: adapt 3 tests to work with `RemoteInferenceClient` (use the router URL directly, fix `.engines` access, remove a fragile `tokenizer.decode` comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
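For readers unfamiliar with the dynamic-port trick: binding to port 0 lets the OS hand out a free ephemeral port. A minimal sketch of that technique (the `find_free_port` helper is hypothetical, not the actual `VLLMRouter` code):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for a free ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 means "kernel, pick any free port"
        return s.getsockname()[1]

# e.g. distinct ports for the router and its Prometheus metrics endpoint
router_port = find_free_port()
metrics_port = find_free_port()
```

Note that the socket is closed before the port is reused, which leaves a small race window; the reservation pattern in the next commit exists to close exactly that gap.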
Use the same port reservation pattern as vLLMServerActor to prevent TOCTOU races. Release reservations in a try/except to avoid socket leaks on early failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#   skyrl/backends/skyrl_train/inference_servers/vllm_router.py
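A sketch of one way such a reservation pattern can look: hold the bound socket open until handoff and release it in an except branch. The helper names are hypothetical; this is not the actual vLLMServerActor code.

```python
import socket

def reserve_port() -> tuple[socket.socket, int]:
    """Bind to an OS-assigned port and hold the socket open as a reservation."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", 0))
    return sock, sock.getsockname()[1]

def launch_server(reservation: socket.socket, port: int) -> socket.socket:
    """Release the reservation at the last possible moment, then bind the
    real server socket; SO_REUSEADDR shrinks the remaining race window."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    reservation.close()
    server.bind(("", port))
    server.listen()
    return server

reservation, port = reserve_port()
try:
    server = launch_server(reservation, port)
except Exception:
    reservation.close()  # release in except so early failures don't leak sockets
    raise
```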
Code Review
This pull request migrates the CI to the new inference pathway, adds support for data parallelism in the RemoteInferenceClient, and cleans up legacy tests. Key modifications include updating world size and rank offset logic to account for data_parallel_size and implementing port reservation in VLLMRouter to avoid race conditions. Feedback identifies potential ZeroDivisionError issues in RemoteInferenceClient and BroadcastInitInfo if data_parallel_size is zero or if the server count is not a multiple of the parallel size.
```python
num_deployments = len(self.server_urls) // self.data_parallel_size
self._world_size = (per_server[0] * num_deployments, per_server[0])
```
The calculation of num_deployments assumes data_parallel_size is at least 1 and that len(self.server_urls) is an exact multiple of it. If data_parallel_size is 0, this will raise a ZeroDivisionError. If the division is not exact, it will silently truncate the number of deployments, leading to an incorrect total_world_size. It is recommended to validate these invariants.
Suggested change:

```diff
- num_deployments = len(self.server_urls) // self.data_parallel_size
- self._world_size = (per_server[0] * num_deployments, per_server[0])
+ assert self.data_parallel_size > 0, "data_parallel_size must be at least 1"
+ num_deployments, remainder = divmod(len(self.server_urls), self.data_parallel_size)
+ assert remainder == 0, "Number of server URLs must be a multiple of data_parallel_size"
+ self._world_size = (per_server[0] * num_deployments, per_server[0])
```
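To make the arithmetic concrete, here is a standalone toy version of the suggested calculation (the function name and numbers are illustrative only, not from the repo):

```python
def total_world_size(num_servers: int, per_server_world_size: int, dp_size: int) -> int:
    """Count deployments as groups of dp_size servers, then scale."""
    assert dp_size > 0, "dp_size must be at least 1"
    num_deployments, remainder = divmod(num_servers, dp_size)
    assert remainder == 0, "server count must be a multiple of dp_size"
    return per_server_world_size * num_deployments

# 4 servers, world size 8 each, dp_size=2 -> 2 deployments -> total 16
assert total_world_size(4, 8, dp_size=2) == 16
# dp_size=3 would trip the divisibility assert instead of silently truncating
```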
```python
    Returns:
        List of BroadcastInitInfo, one per server, with cumulative rank_offset.
    """
    result: List[BroadcastInitInfo] = []
```
Potential ZeroDivisionError at line 97 if dp_size is 0. Although it has a default value of 1 in the method signature, it is passed dynamically from the client's configuration. Adding a validation check at the start of the method would improve robustness.
Suggested change:

```diff
- result: List[BroadcastInitInfo] = []
+ assert dp_size > 0, "dp_size must be at least 1"
+ result: List[BroadcastInitInfo] = []
```
non-Megatron GPU CI is passing: https://github.com/NovaSky-AI/SkyRL/actions/runs/24435811449/job/71389704452
What does this PR do?
Migrate the non-Megatron GPU CI to run only on the new inference codepath.
This is the first part in a series of PRs to migrate completely to the new inference codepath.
I will wait for R3 to land before merging changes from this PR: #1428
Fixes
- Rank offsets were previously computed across all servers (i.e. `num_engines * data_parallel_size`) - we should really be calculating offsets per deployment (i.e. for the count of `num_engines`). This PR fixes it by including data parallel size in the offset calculation.
- `tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`: The old codepath performed a sleep + wake up by default, which led to some memory savings in temporary buffers etc. The new codepath OOMs because engines are on GPU by default. Added proper sleep and wake-up calls at the inference/training boundaries as the fix (a toy sketch of this pattern follows the list).
- Migrated `test_pause_and_continue_generation.py` to the new inference codepath.
- Cleaned up the legacy `test_engine_generation.py`.
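A toy sketch of the sleep/wake-at-boundaries pattern described above; `FakeClient` and its methods are stand-ins for whatever offload hooks the inference backend exposes, not the repo's actual API:

```python
class FakeClient:
    """Stub inference client; sleep()/wake_up() stand in for real offload hooks."""
    def wake_up(self): print("engines -> GPU")
    def generate(self, prompts): return [p + " ..." for p in prompts]
    def sleep(self): print("engines -> offloaded")

def train_step(rollouts):
    print(f"training on {len(rollouts)} rollouts")

client = FakeClient()
for step in range(2):
    client.wake_up()                    # bring engines onto GPU for rollout
    rollouts = client.generate(["hi"])  # inference phase
    client.sleep()                      # free GPU memory before training
    train_step(rollouts)                # training phase gets the headroom
```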
Test Plan

I ran GPU CI E2E with the new changes and all tests pass.
TODO:
Future work
Not all tests are fully migrated to the new codepath. There are two major items pending.