[skyrl][tinker] Multi-modal Tinker Sampling #1484
Conversation
## Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train backend so that VLM training batches include image placeholder tokens and decoded vision tensors (`pixel_values`, `image_grid_thw`).

- When using new inference (`_SKYRL_USE_NEW_INFERENCE`), `_to_training_batch` lazily creates a `VLLMRenderer` and renders all `ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from the rendered outputs and adds them to the `TrainingInputBatch` as `TensorList` entries (one tensor per batch element, since patch counts vary per image).
- Extends `_pad_batch` to handle `TensorList` fields by cycling and cloning entries, matching the existing padding strategy for regular tensors (a hedged sketch appears after the curves below).
- Reorders `forward_backward` and `forward` to call `_to_training_batch` before `_sleep_inference_engines`, since the renderer needs the inference servers to be initialized. Note that this does not wake the KV cache or model GPU memory, since that is done explicitly in `save_weights_for_sampler` via the dispatcher.

## E2E Tinker VLM Classifier Curves

With #1484, we can now run Tinker vision-language recipes against the SkyRL backend. Merging both closes #1200.

Example:

```bash
_SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
  --base-model "Qwen/Qwen3-VL-8B-Instruct" \
  --backend fsdp \
  --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook:

```bash
TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
  tinker_cookbook.recipes.vlm_classifier.train \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-VL-4B-Instruct" \
  dataset=caltech101
```

Train nll:
<img width="1200" height="675" alt="train_nll" src="https://github.com/user-attachments/assets/82e36767-edee-43b7-ab4a-7fbf496c8cbb" />

Val nll:
<img width="1200" height="675" alt="val_nll" src="https://github.com/user-attachments/assets/1dc6e96b-7e1b-4ead-bf0e-71e42eab0491" />

Val accuracy:
<img width="1200" height="675" alt="accuracy" src="https://github.com/user-attachments/assets/ec6f92b8-a544-42d9-9a00-4c06292e7ae3" />
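A minimal sketch of the cycle-and-clone padding for `TensorList` fields mentioned above. This is not the actual `_pad_batch` code: `pad_tensor_list` and the tensor shapes are hypothetical, assuming one vision tensor per batch element with a varying leading (patch-count) dimension.

```python
import torch

def pad_tensor_list(entries: list[torch.Tensor], target_size: int) -> list[torch.Tensor]:
    """Pad a per-element tensor list up to target_size by cycling and cloning.

    Hypothetical sketch: entries hold one vision tensor per batch element
    (e.g. pixel_values), and the leading dimension (patch count) varies per
    image, so the field cannot be stacked and padded like a regular tensor.
    """
    padded = list(entries)
    i = 0
    while len(padded) < target_size:
        # Cycle through the existing entries, cloning so padded rows don't
        # alias the originals -- mirroring the strategy for regular tensors.
        padded.append(entries[i % len(entries)].clone())
        i += 1
    return padded

# Example: a 3-image batch padded to size 5 reuses clones of entries 0 and 1.
pixel_values = [torch.randn(16, 1176), torch.randn(64, 1176), torch.randn(4, 1176)]
padded = pad_tensor_list(pixel_values, 5)
assert len(padded) == 5
assert padded[3].shape == pixel_values[0].shape and padded[4].shape == pixel_values[1].shape
```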
- add types, clean up comment
- raise error when mismatched token counts
- resolve merge conflicts
- fix docstring
- move types
force-pushed from cc1d988 to eb8ef37
```python
raw = c["data"]
b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
```
**🔴 UnicodeDecodeError when `_render_for_sample` receives raw bytes from `model_dump()`**

The code assumes `model_dump()` on Pydantic's `Base64Bytes` produces base64-encoded bytes, but it actually produces decoded raw bytes (e.g., raw JPEG binary). The production caller `_sample_with_remote_client` at `skyrl/backends/skyrl_train_backend.py:735` does `model_input.model_dump()`, which yields `{"data": b'\xff\xd8\xff...'}` (raw image bytes). When `_render_for_sample` then calls `raw.decode("ascii")` on these raw bytes, it raises `UnicodeDecodeError`.
**Comparison with the correct pattern in VLLMRenderer**

The existing `VLLMRenderer._render_images` at `skyrl/backends/renderer.py:133` handles this correctly:

```python
b64_data = base64.b64encode(chunk.data).decode("ascii")
```

It first base64-encodes the raw bytes, then decodes to an ASCII string. The fix here should follow the same pattern.
Suggested change:

```diff
  raw = c["data"]
- b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
+ b64_str = base64.b64encode(raw).decode("ascii") if isinstance(raw, bytes) else raw
```
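A minimal, runnable demo of the failure mode and of the suggested fix (the byte string is just an illustrative JPEG prefix, not real payload data):

```python
import base64

# Raw JPEG bytes, as produced by model_dump() on Pydantic's Base64Bytes
# (illustrative prefix only; real image data follows the SOI marker).
raw = b"\xff\xd8\xff\xe0"

try:
    raw.decode("ascii")  # the buggy path: raw binary is not ASCII
except UnicodeDecodeError as e:
    print(f"buggy path fails: {e}")

# The suggested fix: base64-encode first, then decode the ASCII result.
b64_str = base64.b64encode(raw).decode("ascii")
print(b64_str)  # '/9j/4A=='
```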
## Summary

Adds VLM sampling support to the `RemoteInferenceClient` sample endpoint. Finalizes the inference-side changes for #1200.

- Adds `_render_for_sample` to handle both text-only and image-containing prompts. For text-only prompts, it flattens chunk tokens directly. When images are present, it calls `/v1/chat/completions/render` to process images, then splices placeholder tokens into the pre-tokenized text stream with adjusted offsets (see the sketch after this list).
- Updates `sample()` to pass multi-modal features through to the generate payload when present.
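The splice-with-adjusted-offsets step can be sketched as follows. This is a hypothetical illustration, not the actual `_render_for_sample` code: `splice_placeholders`, the span format, and all token IDs are made up, assuming the renderer reports each image as an insertion offset plus its placeholder token IDs.

```python
def splice_placeholders(
    text_tokens: list[int],
    image_spans: list[tuple[int, list[int]]],
) -> list[int]:
    """Insert each image's placeholder tokens into a pre-tokenized stream.

    Each (offset, placeholder_ids) pair refers to an offset in the ORIGINAL
    token stream; later offsets must be shifted by the number of tokens
    already inserted, which processing the spans in order handles.
    """
    out = list(text_tokens)
    shift = 0
    for offset, placeholder_ids in sorted(image_spans):
        pos = offset + shift
        out[pos:pos] = placeholder_ids
        shift += len(placeholder_ids)
    return out

# Example: one image whose renderer output is three placeholder tokens
# (IDs are made up) spliced at offset 2 of a five-token prompt.
tokens = splice_placeholders([101, 102, 103, 104, 105], [(2, [90000, 90000, 90000])])
assert tokens == [101, 102, 90000, 90000, 90000, 103, 104, 105]
```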
## Test plan

- Text-only prompts (multi-modal features remain `None`)
- `test_vlm_inference_generation.py`