This repo contains the official PyTorch code and pre-trained models for Vision Test-Time Training ($\text{ViT}^3$).
$\text{ViT}^3$: Unlocking Test-Time Training in Vision

- April 09, 2026: Selected as an oral presentation.
- February 21, 2026: Accepted to CVPR 2026. Final ratings: 6, 6, 5 (min: 1 Reject, max: 6 Accept).
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this gap, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training ($\text{ViT}^3$) model.
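To make the core idea concrete, here is a minimal illustrative sketch of a TTT-style inner loop (not the paper's exact formulation): a linear inner model `W` is trained online with one gradient step per key-value pair, and the updated model is applied to the corresponding query. The function name `ttt_linear` and the squared-error inner loss are assumptions for illustration only.

```python
import torch

def ttt_linear(q, k, v, lr=1.0):
    """Illustrative TTT-style update: the inner model is a single linear
    map W, trained online on (key, value) pairs and applied to queries.
    q, k, v: tensors of shape (seq_len, dim)."""
    seq_len, dim = k.shape
    W = torch.zeros(dim, dim)  # inner model, (re)initialized at test time
    out = []
    for t in range(seq_len):
        # Inner training step: regress v_t from k_t under squared loss.
        pred = k[t] @ W
        grad = torch.outer(k[t], pred - v[t])  # d/dW of 0.5 * ||k_t W - v_t||^2
        W = W - lr * grad
        # Apply the updated inner model to the query.
        out.append(q[t] @ W)
    return torch.stack(out)
```

Because each token triggers one constant-cost update of `W`, the loop runs in time linear in sequence length, in contrast to the quadratic cost of full attention.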
We provide a minimal implementation of the TTT block.

- Example:

```python
import torch
from ttt_block import TTT

block = TTT(dim=512, num_heads=16)
x = torch.rand(1, 256, 512)  # (batch, sequence length, dim)
x = block(x, h=16, w=16)     # spatial layout of the 16 x 16 = 256 tokens
```

Please go to the vittt folder for detailed documentation.
This code is developed on top of Swin Transformer and MILA.
If you find this repo helpful, please consider citing us.
```bibtex
@inproceedings{han2025vit,
  title={ViT$^3$: Unlocking Test-Time Training in Vision},
  author={Han, Dongchen and Li, Yining and Li, Tianyu and Cao, Zixuan and Wang, Ziming and Song, Jun and Cheng, Yu and Zheng, Bo and Huang, Gao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

If you have any questions, please feel free to contact the authors.
Dongchen Han: hdc23@mails.tsinghua.edu.cn
