We are pleased to open-source Thinker, a state-of-the-art vision-language foundation model engineered for embodied intelligence. While conventional VLMs often struggle with perspective confusion and temporal oversight, Thinker is designed to bridge the gap between general scene understanding and robust, robot-centric task-level capabilities. Through high-quality dataset curation, multi-stage training, and reinforcement learning, Thinker develops advanced capabilities along four core dimensions: Task Planning with future-state prediction, Spatial Intelligence grounded in an egocentric coordinate system, Temporal Understanding through historical state integration, and precise Visual Grounding. Building on these capabilities, Thinker sets new records on 7 embodied AI benchmarks covering Task Planning, Visual Grounding, and Spatial Understanding, significantly outperforming existing open-source, closed-source, and specialized baselines and demonstrating its potential as a foundation for embodied intelligence and autonomous robotic decision-making.
2026-01-28: 🤗 The Thinker-4B model checkpoint has been released on Hugging Face.
Clone this repo and set up the environment with a few commands.

```bash
# The Thinker model requires transformers >= 4.57.0
pip install "transformers>=4.57.0"
```

The following code snippet illustrates how to use Thinker; for more details, refer to `inference.py`. A further grounding-style example is sketched after the model table below.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "UBTECH-Robotics/Thinker-4B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("UBTECH-Robotics/Thinker-4B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {"type": "text", "text": "Please describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
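# Trim the prompt tokens so that only the newly generated tokens are decoded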
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

| Model Name | Checkpoint | Description |
|---|---|---|
| Thinker 4B | 🤗 UBTECH-Robotics/Thinker-4B | 4B parameter Instruct version of Thinker |
| Thinker Thinking 4B | ⌛ Coming soon | 4B parameter Reasoning (Thinking) version |
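
Beyond plain image captioning, the same interface can be used to probe the grounding and spatial abilities described above. The snippet below is a minimal sketch rather than an official example: it reuses the COCO image from the snippet above, and the prompt wording (and whatever coordinate or textual format the model answers with) is our own assumption, not a documented query format.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "UBTECH-Robotics/Thinker-4B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("UBTECH-Robotics/Thinker-4B")

# Grounding-style query: ask the model to localize an object and relate it
# to another object in the scene (the prompt wording is illustrative).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {
                "type": "text",
                "text": "Locate the remote control nearest the left edge of the image and describe its position relative to the cats.",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, dropping the prompt
answer = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print(answer)
```

The decoding step mirrors the snippet above: the prompt tokens are stripped before `batch_decode` so that only the model's answer is printed.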
We use the FlagEval and EvalScope frameworks for evaluation. More evaluation results and scripts will be added soon.
This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝 :)
```bibtex
@article{UBTECH_Thinker_short_report,
  title={Thinker: A vision-language foundation model for embodied intelligence},
  author={Baiyu Pan and Daqin Luo and Junpeng Yang and Jiyuan Wang and Yixuan Zhang and Hailin Shi and Jichao Jiao},
  journal={arXiv preprint arXiv:2601.21199},
  year={2026}
}
```



