$\text{ViT}^3$: Unlocking Test-Time Training in Vision

This repo contains the official PyTorch code and pre-trained models for Vision Test-Time Training ($\text{ViT}^3$).

$\text{ViT}^3$: Unlocking Test-Time Training in Vision

News

April 09 2026: Selected as an oral.
February 21 2026: Accepted to CVPR 2026. Final ratings: 6, 6, 5 (min: 1 Reject, max: 6 Accept).

Abstract

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training ($\text{ViT}^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate $\text{ViT}^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that $\text{ViT}^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the $\text{ViT}^3$ baseline can facilitate future work on visual TTT models.

Usage

We provide a minimal implementation of $\text{ViT}^3$ block in ttt_block.py, which can act as a plug-in module in various vision tasks.

Example:

from ttt_block import TTT
block = TTT(dim=512, num_heads=16)
x = torch.rand(1, 256, 512)
x = block(x, h=16, w=16)

Results

Please go to the folder vittt for specific document.

Acknowledgements

This code is developed on the top of Swin Transformer and MILA.

Citation

If you find this repo helpful, please consider citing us.

@inproceedings{han2025vit,
  title={ViT$^3$: Unlocking Test-Time Training in Vision},
  author={Han, Dongchen and Li, Yining and Li, Tianyu and Cao, Zixuan and Wang, Ziming and Song, Jun and Cheng, Yu and Zheng, Bo and Huang, Gao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}

Contact

If you have any questions, please feel free to contact the authors.

Dongchen Han: hdc23@mails.tsinghua.edu.cn

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
figures		figures
vittt		vittt
.gitignore		.gitignore
LICENSE		LICENSE
readme.md		readme.md
ttt_block.py		ttt_block.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

$\text{ViT}^3$: Unlocking Test-Time Training in Vision

News

Abstract

Usage

Results

Acknowledgements

Citation

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

$\text{ViT}^3$: Unlocking Test-Time Training in Vision

News

Abstract

Usage

Results

Acknowledgements

Citation

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages