
[skyrl][tinker] Multi-modal Tinker Sampling#1484

Merged
SumanthRH merged 2 commits into NovaSky-AI:main from nithinvc:nithinc/mm-sample
Apr 24, 2026
Conversation

@nithinvc
Contributor

@nithinvc nithinvc commented Apr 9, 2026

Summary

Adds VLM sampling support to the RemoteInferenceClient sample endpoint. Finalizes the inference-side changes for #1200.

  • Extract _render_for_sample to handle both text-only and image-containing prompts. For text-only prompts, it flattens chunk tokens directly. When images are present, it calls /v1/chat/completions/render to process images, then splices placeholder tokens into the pre-tokenized text stream with adjusted offsets.
  • Update sample() to pass multi-modal features through to the generate payload when present.
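The splice step described above can be sketched as follows. This is a minimal illustration with hypothetical helper and variable names, not the actual `_render_for_sample` code: it assumes the render endpoint reports, for each image, an offset into the text token stream and the placeholder token IDs to insert there.

```python
# Hypothetical sketch: splice image placeholder token runs into a
# pre-tokenized text stream, shifting later offsets as runs are inserted.
# Names and data shapes are illustrative, not the SkyRL API.
from typing import List, Tuple

def splice_placeholders(
    text_tokens: List[int],
    placeholders: List[Tuple[int, List[int]]],
) -> List[int]:
    """Insert each placeholder run at its offset in the text stream.

    `placeholders` is a list of (offset_in_text_tokens, placeholder_ids),
    sorted by offset. Offsets refer to positions in the original text
    stream, so each insertion shifts subsequent offsets by the run length.
    """
    out = list(text_tokens)
    shift = 0
    for offset, ids in placeholders:
        pos = offset + shift  # adjust for runs already inserted
        out[pos:pos] = ids
        shift += len(ids)
    return out

# Two placeholder runs (IDs 900/901 are made up) spliced into 6 text tokens.
tokens = [1, 2, 3, 4, 5, 6]
spliced = splice_placeholders(tokens, [(2, [900, 900]), (4, [901])])
print(spliced)  # [1, 2, 900, 900, 3, 4, 901, 5, 6]
```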

Test plan

  • Verify text-only sampling still works (no render call made, features is None)
  • Verify image sampling works end-to-end (render call is made, placeholder tokens are spliced correctly, features are included in the generate payload)
  • New multi-modal sampling tests in test_vlm_inference_generation.py


@nithinvc nithinvc marked this pull request as ready for review April 9, 2026 16:46
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


gemini-code-assist[bot]

This comment was marked as resolved.

@SumanthRH SumanthRH self-assigned this Apr 13, 2026
pcmoritz pushed a commit that referenced this pull request Apr 17, 2026
## Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train
backend so that VLM training batches include image placeholder tokens
and decoded vision tensors (`pixel_values`, `image_grid_thw`).

- When using new inference (`_SKYRL_USE_NEW_INFERENCE`),
`_to_training_batch` lazily creates a `VLLMRenderer` and renders all
`ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from rendered outputs and
adds them to the `TrainingInputBatch` as `TensorList` entries (one
tensor per batch element, since patch counts vary per image).
- Extends `_pad_batch` to handle `TensorList` fields by cycling and
cloning entries, matching the existing padding strategy for regular
tensors.
- Reorders `forward_backward` and `forward` to call `_to_training_batch`
before `_sleep_inference_engines`, since the renderer needs the
inference servers to be initialized. Note that this does not wake the
KV cache or model GPU memory, since that is done explicitly in
`save_weights_for_sampler` via the dispatcher.
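The cycle-and-clone padding for `TensorList` fields can be sketched like this. The function name and data are hypothetical, not SkyRL's actual `_pad_batch`; plain nested lists with `copy.deepcopy` stand in for per-image tensors and `Tensor.clone()`, since patch counts vary per image and the entries cannot be stacked into one tensor.

```python
# Illustrative sketch: pad a per-element list to a target batch size by
# cycling over the existing entries and appending deep copies, so padded
# rows never alias the real data (mirroring clone() for tensors).
import copy

def pad_tensor_list(entries, target_len):
    """Return `entries` extended to `target_len` by cycling and copying."""
    padded = list(entries)
    i = 0
    while len(padded) < target_len:
        padded.append(copy.deepcopy(entries[i % len(entries)]))
        i += 1
    return padded

# Two "images" with different patch counts, padded to a batch of 5.
batch = [[0.1, 0.2, 0.3], [0.4, 0.5]]
padded = pad_tensor_list(batch, 5)
print(len(padded), padded[2])  # 5 [0.1, 0.2, 0.3]
```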

## E2E Tinker VLM Classifier Curves
With #1484, we can now run Tinker vision-language recipes against the
SkyRL backend. Merging both closes #1200

Example:
```bash
 _SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
    --base-model "Qwen/Qwen3-VL-8B-Instruct" \
    --backend fsdp \
    --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook
```bash
 TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
    tinker_cookbook.recipes.vlm_classifier.train  \
    base_url=http://localhost:8000 \
    model_name="Qwen/Qwen3-VL-4B-Instruct" \
    dataset=caltech101 
```

Train nll:
<img width="1200" height="675" alt="train_nll"
src="https://github.com/user-attachments/assets/82e36767-edee-43b7-ab4a-7fbf496c8cbb"
/>

Val nll:
<img width="1200" height="675" alt="val_nll"
src="https://github.com/user-attachments/assets/1dc6e96b-7e1b-4ead-bf0e-71e42eab0491"
/>

Val accuracy:
<img width="1200" height="675" alt="accuracy"
src="https://github.com/user-attachments/assets/ec6f92b8-a544-42d9-9a00-4c06292e7ae3"
/>

- add types, clean up comment
- raise error when mismatched token counts
- resolve merge conflicts
- fix docstring
- move types
@nithinvc nithinvc force-pushed the nithinc/mm-sample branch from cc1d988 to eb8ef37 on April 23, 2026 22:06
devin-ai-integration[bot]

This comment was marked as resolved.

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +487 to +488

```python
raw = c["data"]
b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
```

🔴 UnicodeDecodeError when `_render_for_sample` receives raw bytes from `model_dump()`

The code assumes `model_dump()` on Pydantic's `Base64Bytes` produces base64-encoded bytes, but it actually produces decoded raw bytes (e.g., raw JPEG binary). The production caller `_sample_with_remote_client` at skyrl/backends/skyrl_train_backend.py:735 does `model_input.model_dump()`, which yields `{"data": b'\xff\xd8\xff...'}` (raw image bytes). When `_render_for_sample` then calls `raw.decode("ascii")` on these raw bytes, it raises `UnicodeDecodeError`.
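The failure mode can be reproduced in isolation. This is a minimal sketch: the four bytes below are just the JPEG magic prefix used as stand-in image data, not values taken from the PR.

```python
# Raw image bytes (JPEG magic prefix) are not ASCII-decodable, so the
# buggy path raises; the correct path base64-encodes first, then decodes
# the (pure-ASCII) base64 output to a str.
import base64

raw = b"\xff\xd8\xff\xe0"  # stand-in for raw JPEG bytes from model_dump()

try:
    raw.decode("ascii")  # the buggy pattern
except UnicodeDecodeError as e:
    print("buggy path raises:", type(e).__name__)

b64_str = base64.b64encode(raw).decode("ascii")  # the correct pattern
print(b64_str)  # /9j/4A==
```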

Comparison with correct pattern in VLLMRenderer

The existing `VLLMRenderer._render_images` at skyrl/backends/renderer.py:133 handles this correctly:

```python
b64_data = base64.b64encode(chunk.data).decode("ascii")
```

It first base64-encodes the raw bytes, then decodes the result to an ASCII string. The fix here should follow the same pattern.

Suggested change

```diff
 raw = c["data"]
-b64_str = raw.decode("ascii") if isinstance(raw, bytes) else raw
+b64_str = base64.b64encode(raw).decode("ascii") if isinstance(raw, bytes) else raw
```

@SumanthRH SumanthRH merged commit 4c6b1a6 into NovaSky-AI:main Apr 24, 2026
4 of 7 checks passed