[skyrl] vLLM Renderer for rendering Multi-Modal ModelInputChunks for training backend #1464
Conversation
```python
from skyrl.train.config import SkyRLTrainConfig
from tests.backends.skyrl_train.gpu.utils import InferenceEngineState


requires_local_vllm = pytest.mark.skipif(
```
All of the current vision-language rendering and generation tests require a local vLLM install from main until the next vLLM release.
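As a rough sketch, the truncated `requires_local_vllm` marker in the diff above could look something like the following, assuming the `SKYRL_LOCAL_VLLM=1` gate mentioned in the summary; the exact condition and reason string in the PR may differ:

```python
import os

import pytest

# Gate the VLM rendering/generation tests behind a local vLLM install;
# skipped unless SKYRL_LOCAL_VLLM=1 is set (condition and reason are illustrative).
requires_local_vllm = pytest.mark.skipif(
    os.environ.get("SKYRL_LOCAL_VLLM") != "1",
    reason="requires a local vLLM install from main until the next vLLM release",
)
```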
```diff
 class RenderedModelInput(BaseModel):
     prompt_ids: list[int]
-    multi_modal_kwargs: dict[str, bytes] | None = None
+    multi_modal_kwargs: dict[str, list[str]] | None = None
```
Would it make sense to make this a TypedDict? To make it easier to understand which keys can be there (even if they are optional)
That makes sense to me. Added it as a TypedDict, but I kept the value typing as `Any` instead of `torch.Tensor` since types.py is also used by the JAX backend.
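A minimal sketch of what that TypedDict could look like; the class name here is illustrative, the keys follow the vision tensors used elsewhere in this PR series, and the values stay `Any` for JAX-backend compatibility:

```python
from typing import Any, TypedDict


class MultiModalFeatures(TypedDict, total=False):
    # Values are Any rather than torch.Tensor because types.py is also
    # used by the JAX backend; total=False keeps every key optional.
    pixel_values: Any
    image_grid_thw: Any
```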
```python
]


def decode_mm_kwargs(rendered: RenderedModelInput) -> Tuple[torch.Tensor, torch.Tensor]:
```
It seems more natural to me to put only the `multi_modal_kwargs` as an argument (which will go well with the suggestion below to introduce a TypedDict for it), since that's the only part of the `RenderedModelInput` that is used here.
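A sketch of the suggested narrower signature; the body is elided and only the argument change is illustrated:

```python
import torch


def decode_mm_kwargs(multi_modal_kwargs: dict[str, list[str]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Decode the serialized multi-modal kwargs into (pixel_values, image_grid_thw)."""
    ...


# Call sites would then pass the field directly:
# pixel_values, image_grid_thw = decode_mm_kwargs(rendered.multi_modal_kwargs)
```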
This looks great to me! It is slightly sad that we depend on

```python
from vllm.entrypoints.serve.disagg.mm_serde import (
    decode_mm_kwargs_item as _vllm_decode,
)
```

which seems more of an internal vLLM API (and depending on it violates the client/server separation for vLLM a little). If the messagepack protocol is stable, maybe we would want to replicate it here. We can also do that going forward (e.g. maybe in the future it makes sense to have skyrl/backends/renderer.py not depend on vLLM and put the VLLMRenderer into the skyrl_train folder). So feel free to just move forward with the PR for now :)
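One way to contain that dependency, as a rough sketch, is to funnel it through a single wrapper that could later be swapped for an in-repo msgpack implementation; the wrapper name and the one-argument call signature of `decode_mm_kwargs_item` are assumptions:

```python
from typing import Any


def _decode_mm_item(item: Any) -> Any:
    """Single touch point for vLLM's internal mm_serde module, so the
    dependency can later be replaced without changing callers."""
    # Deferred import keeps the internal-API dependency local to this helper.
    from vllm.entrypoints.serve.disagg.mm_serde import (
        decode_mm_kwargs_item as _vllm_decode,
    )

    return _vllm_decode(item)
```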
## Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train backend so that VLM training batches include image placeholder tokens and decoded vision tensors (`pixel_values`, `image_grid_thw`).

- When using new inference (`_SKYRL_USE_NEW_INFERENCE`), `_to_training_batch` lazily creates a `VLLMRenderer` and renders all `ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from rendered outputs and adds them to the `TrainingInputBatch` as `TensorList` entries (one tensor per batch element, since patch counts vary per image).
- Extends `_pad_batch` to handle `TensorList` fields by cycling and cloning entries, matching the existing padding strategy for regular tensors (see the sketch after this description).
- Reorders `forward_backward` and `forward` to call `_to_training_batch` before `_sleep_inference_engines`, since the renderer needs the inference servers to be initialized. Note that this does not wake the KV cache or model GPU memory, since that is explicitly done in `save_weights_for_sampler` via the dispatcher.

## E2E Tinker VLM Classifier Curves

With #1484, we can now run tinker vision-language recipes against the SkyRL backend. Merging both closes #1200.

Example:

```bash
_SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
  --base-model "Qwen/Qwen3-VL-8B-Instruct" \
  --backend fsdp \
  --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook:

```bash
TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
  tinker_cookbook.recipes.vlm_classifier.train \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-VL-4B-Instruct" \
  dataset=caltech101
```

Train nll:
<img width="1200" height="675" alt="train_nll" src="https://github.com/user-attachments/assets/82e36767-edee-43b7-ab4a-7fbf496c8cbb" />

Val nll:
<img width="1200" height="675" alt="val_nll" src="https://github.com/user-attachments/assets/1dc6e96b-7e1b-4ead-bf0e-71e42eab0491" />

Val accuracy:
<img width="1200" height="675" alt="accuracy" src="https://github.com/user-attachments/assets/ec6f92b8-a544-42d9-9a00-4c06292e7ae3" />
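The `_pad_batch` extension mentioned above can be sketched roughly as follows; the helper name and its exact integration into `_pad_batch` are assumptions:

```python
import itertools

import torch


def _pad_tensor_list(entries: list[torch.Tensor], target_len: int) -> list[torch.Tensor]:
    """Pad a per-example tensor list up to target_len by cycling over the
    existing entries and appending clones, mirroring the existing padding
    strategy for regular tensors. (Helper name is hypothetical.)"""
    padded = list(entries)
    for entry in itertools.cycle(entries):
        if len(padded) >= target_len:
            break
        padded.append(entry.clone())
    return padded
```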
Summary
This PR introduces a vLLM renderer which is used by the SkyRL backend to convert `ModelInputChunks` into tokenized text, `pixel_values`, and `image_grid_thw`. The scope of this PR is limited to the actual renderer implementation; it is not added to the backend yet.

- Adds `VLLMRenderer`.
- `RenderedImage` NamedTuple return type from `_render_images`.
- Uses `render_model_input` for the text-only fast path in `_render_single` instead of duplicating the token concatenation logic.
- Changes the `RenderedModelInput.multi_modal_kwargs` type annotation from `dict[str, bytes]` to `dict[str, list[str]]` to match actual usage (a list of base64-encoded strings per modality key).
- Unit tests for `VLLMRenderer` with a mocked `RemoteInferenceClient` (text-only, image-only, mixed, error cases).
- A GPU test (`test_vlm_renderer.py`) that exercises the full renderer against a real VLM via `InferenceEngineState`. The GPU tests are gated behind `SKYRL_LOCAL_VLLM=1` since they depend on a local vLLM fork with `/v1/chat/completions/render` support that is not yet upstreamed.
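For orientation, a rough usage sketch of the rendering/decoding flow; the import path, constructor arguments, and method name here are assumptions rather than the exact API added in this PR:

```python
# Hypothetical import path and API shape, shown only to illustrate the flow
# of rendering a ModelInput and decoding its vision tensors.
from skyrl.backends.renderer import VLLMRenderer, decode_mm_kwargs


def render_and_decode(client, model_input):
    renderer = VLLMRenderer(client)                      # client: a RemoteInferenceClient
    rendered = renderer.render_model_input(model_input)  # -> RenderedModelInput
    prompt_ids = rendered.prompt_ids                     # token ids, incl. image placeholders
    if rendered.multi_modal_kwargs is None:              # text-only fast path
        return prompt_ids, None, None
    pixel_values, image_grid_thw = decode_mm_kwargs(rendered)
    return prompt_ids, pixel_values, image_grid_thw
```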
Test plan

- `python -m pytest tests/backends/skyrl_train/test_renderer.py -v` (8 tests, no GPU required)
- `SKYRL_LOCAL_VLLM=1 uv run --extra fsdp --extra dev --extra tinker pytest tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_vlm_inference_generation.py -m vllm -v`