We are pleased to open-source Thinker, a state-of-the-art vision-language foundation model engineered for embodied intelligence. While conventional VLMs often struggle with perspective confusion and temporal oversight, Thinker is designed to bridge the gap between general scene understanding and robust, robot-centric task-level capabilities. Through high-quality dataset curation, multi-stage training, and reinforcement learning, Thinker develops advanced capabilities along four core dimensions: Task Planning with future-state prediction, Spatial Intelligence grounded in an egocentric coordinate system, Temporal Understanding through historical state integration, and precise Visual Grounding. Building on these capabilities, Thinker sets new records on 7 embodied AI benchmarks covering Task Planning, Visual Grounding, and Spatial Understanding, significantly outperforming existing open-source, closed-source, and specialized baselines and demonstrating its potential as a foundation for embodied intelligence and autonomous robotic decision-making.
2026-01-28: 🤗 The Thinker-4B model checkpoint has been released on Hugging Face.
Clone this repo and set up the environment with a few commands.

```bash
# The Thinker model requires transformers >= 4.57.0
pip install "transformers>=4.57.0"
```

The following code snippet illustrates how to use Thinker; for more details, refer to `inference.py`. A further grounding-style example is sketched after the model table below.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "UBTECH-Robotics/Thinker-4B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("UBTECH-Robotics/Thinker-4B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {"type": "text", "text": "Please describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
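# Trim the prompt tokens so that only the newly generated tokens are decoded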
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

| Model Name | Checkpoint | Description |
|---|---|---|
| Thinker 4B | 🤗 UBTECH-Robotics/Thinker-4B | 4B parameter Instruct version of Thinker |
| Thinker Thinking 4B | ⌛ Coming soon | 4B parameter Reasoning (Thinking) version |
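
Beyond plain image captioning, the same interface can be used to probe the grounding and spatial abilities described above. The snippet below is a minimal sketch rather than an official example: it reuses the COCO image from the snippet above, and the prompt wording (and whatever coordinate or textual format the model answers with) is our own assumption, not a documented query format.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "UBTECH-Robotics/Thinker-4B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("UBTECH-Robotics/Thinker-4B")

# Grounding-style query: ask the model to localize an object and relate it
# to another object in the scene (the prompt wording is illustrative).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {
                "type": "text",
                "text": "Locate the remote control nearest the left edge of the image and describe its position relative to the cats.",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, dropping the prompt
answer = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print(answer)
```

The decoding step mirrors the snippet above: the prompt tokens are stripped before `batch_decode` so that only the model's answer is printed.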
We use the FlagEval and EvalScope frameworks for evaluation. More evaluation results and scripts will be added soon.
This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝 :)
```bibtex
@article{UBTECH_Thinker_short_report,
  title={Thinker: A vision-language foundation model for embodied intelligence},
  author={Baiyu Pan and Daqin Luo and Junpeng Yang and Jiyuan Wang and Yixuan Zhang and Hailin Shi and Jichao Jiao},
  journal={arXiv preprint arXiv:2601.21199},
  year={2026}
}
```



