[skyrl][tinker] Multi-modal Tinker Sampling #1484
Conversation
## Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train backend so that VLM training batches include image placeholder tokens and decoded vision tensors (`pixel_values`, `image_grid_thw`).

- When using new inference (`_SKYRL_USE_NEW_INFERENCE`), `_to_training_batch` lazily creates a `VLLMRenderer` and renders all `ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from the rendered outputs and adds them to the `TrainingInputBatch` as `TensorList` entries (one tensor per batch element, since patch counts vary per image).
- Extends `_pad_batch` to handle `TensorList` fields by cycling and cloning entries, matching the existing padding strategy for regular tensors (a hedged sketch appears after the curves below).
- Reorders `forward_backward` and `forward` to call `_to_training_batch` before `_sleep_inference_engines`, since the renderer needs the inference servers to be initialized. Note that this does not wake the KV cache or model GPU memory, since that is done explicitly in `save_weights_for_sampler` via the dispatcher.

## E2E Tinker VLM Classifier Curves

With #1484, we can now run Tinker vision-language recipes against the SkyRL backend. Merging both closes #1200.

Example:

```bash
_SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
  --base-model "Qwen/Qwen3-VL-8B-Instruct" \
  --backend fsdp \
  --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook:

```bash
TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
  tinker_cookbook.recipes.vlm_classifier.train \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-VL-4B-Instruct" \
  dataset=caltech101
```

Train nll:
<img width="1200" height="675" alt="train_nll" src="https://github.com/user-attachments/assets/82e36767-edee-43b7-ab4a-7fbf496c8cbb" />

Val nll:
<img width="1200" height="675" alt="val_nll" src="https://github.com/user-attachments/assets/1dc6e96b-7e1b-4ead-bf0e-71e42eab0491" />

Val accuracy:
<img width="1200" height="675" alt="accuracy" src="https://github.com/user-attachments/assets/ec6f92b8-a544-42d9-9a00-4c06292e7ae3" />
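A minimal sketch of the cycle-and-clone padding for `TensorList` fields mentioned above. This is not the actual `_pad_batch` code: `pad_tensor_list` and the tensor shapes are hypothetical, assuming one vision tensor per batch element with a varying leading (patch-count) dimension.

```python
import torch

def pad_tensor_list(entries: list[torch.Tensor], target_size: int) -> list[torch.Tensor]:
    """Pad a per-element tensor list up to target_size by cycling and cloning.

    Hypothetical sketch: entries hold one vision tensor per batch element
    (e.g. pixel_values), and the leading dimension (patch count) varies per
    image, so the field cannot be stacked and padded like a regular tensor.
    """
    padded = list(entries)
    i = 0
    while len(padded) < target_size:
        # Cycle through the existing entries, cloning so padded rows don't
        # alias the originals -- mirroring the strategy for regular tensors.
        padded.append(entries[i % len(entries)].clone())
        i += 1
    return padded

# Example: a 3-image batch padded to size 5 reuses clones of entries 0 and 1.
pixel_values = [torch.randn(16, 1176), torch.randn(64, 1176), torch.randn(4, 1176)]
padded = pad_tensor_list(pixel_values, 5)
assert len(padded) == 5
assert padded[3].shape == pixel_values[0].shape and padded[4].shape == pixel_values[1].shape
```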
- add types, clean up comment
- raise error when mismatched token counts
- resolve merge conflicts
- fix docstring
- move types
force-pushed from cc1d988 to eb8ef37
```python
raw = c["data"]
b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
```
**🔴 UnicodeDecodeError when `_render_for_sample` receives raw bytes from `model_dump()`**

The code assumes `model_dump()` on Pydantic's `Base64Bytes` produces base64-encoded bytes, but it actually produces decoded raw bytes (e.g., raw JPEG binary). The production caller `_sample_with_remote_client` at `skyrl/backends/skyrl_train_backend.py:735` does `model_input.model_dump()`, which yields `{"data": b'\xff\xd8\xff...'}` (raw image bytes). When `_render_for_sample` then calls `raw.decode("ascii")` on these raw bytes, it raises `UnicodeDecodeError`.
**Comparison with the correct pattern in VLLMRenderer**

The existing `VLLMRenderer._render_images` at `skyrl/backends/renderer.py:133` handles this correctly:

```python
b64_data = base64.b64encode(chunk.data).decode("ascii")
```

It first base64-encodes the raw bytes, then decodes to an ASCII string. The fix here should follow the same pattern.
Suggested change:

```diff
  raw = c["data"]
- b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
+ b64_str = base64.b64encode(raw).decode("ascii") if isinstance(raw, bytes) else raw
```
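A minimal, runnable demo of the failure mode and of the suggested fix (the byte string is just an illustrative JPEG prefix, not real payload data):

```python
import base64

# Raw JPEG bytes, as produced by model_dump() on Pydantic's Base64Bytes
# (illustrative prefix only; real image data follows the SOI marker).
raw = b"\xff\xd8\xff\xe0"

try:
    raw.decode("ascii")  # the buggy path: raw binary is not ASCII
except UnicodeDecodeError as e:
    print(f"buggy path fails: {e}")

# The suggested fix: base64-encode first, then decode the ASCII result.
b64_str = base64.b64encode(raw).decode("ascii")
print(b64_str)  # '/9j/4A=='
```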
## Summary

Adds VLM sampling support to the `RemoteInferenceClient` sample endpoint. Finalizes the inference-side changes for #1200.

- Adds `_render_for_sample` to handle both text-only and image-containing prompts. For text-only prompts, it flattens chunk tokens directly. When images are present, it calls `/v1/chat/completions/render` to process images, then splices placeholder tokens into the pre-tokenized text stream with adjusted offsets (see the sketch after this list).
- Updates `sample()` to pass multi-modal features through to the generate payload when present.
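The splice-with-adjusted-offsets step can be sketched as follows. This is a hypothetical illustration, not the actual `_render_for_sample` code: `splice_placeholders`, the span format, and all token IDs are made up, assuming the renderer reports each image as an insertion offset plus its placeholder token IDs.

```python
def splice_placeholders(
    text_tokens: list[int],
    image_spans: list[tuple[int, list[int]]],
) -> list[int]:
    """Insert each image's placeholder tokens into a pre-tokenized stream.

    Each (offset, placeholder_ids) pair refers to an offset in the ORIGINAL
    token stream; later offsets must be shifted by the number of tokens
    already inserted, which processing the spans in order handles.
    """
    out = list(text_tokens)
    shift = 0
    for offset, placeholder_ids in sorted(image_spans):
        pos = offset + shift
        out[pos:pos] = placeholder_ids
        shift += len(placeholder_ids)
    return out

# Example: one image whose renderer output is three placeholder tokens
# (IDs are made up) spliced at offset 2 of a five-token prompt.
tokens = splice_placeholders([101, 102, 103, 104, 105], [(2, [90000, 90000, 90000])])
assert tokens == [101, 102, 90000, 90000, 90000, 103, 104, 105]
```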
## Test plan

- Text-only prompts (multi-modal features remain `None`)
- `test_vlm_inference_generation.py`