PyTorch · arXiv:2506.10390 · MIT License
This repository contains the official PyTorch implementation for the paper: DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models.
DART is a fully differentiable tokenizer that adaptively partitions images into content-dependent patches of varying sizes, allocating more tokens to information-rich regions. It can be seamlessly integrated into Vision Transformer (ViT) and Vision Mamba (Vim) architectures to enhance performance with minimal or even reduced computational overhead.
DART adaptively allocates tokens, focusing on important regions (e.g., the bird) while using fewer tokens for the background.
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation.
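For intuition only, the 1-D sketch below illustrates the quantile-partitioning idea: region boundaries are placed at equal quantiles of the cumulative score mass, so high-score positions receive more, narrower regions. This is our own illustration under simplifying assumptions (the function name, normalization, and the hard, non-differentiable CDF inversion are not the repository's actual 2-D, fully differentiable implementation in dart/).

```python
# Illustrative 1-D sketch of quantile-based partitioning (not the DART module itself).
import torch

def quantile_partition(scores: torch.Tensor, num_regions: int) -> torch.Tensor:
    """scores: (L,) non-negative importance scores along one axis.
    Returns num_regions + 1 boundary indices in [0, L]."""
    cdf = torch.cumsum(scores / scores.sum(), dim=0)   # cumulative score mass
    levels = torch.linspace(0, 1, num_regions + 1)     # equally spaced quantile levels
    bounds = torch.searchsorted(cdf, levels)            # invert the CDF
    bounds[0], bounds[-1] = 0, scores.numel()           # pin the outer boundaries
    return bounds

scores = torch.tensor([0.1, 0.1, 2.0, 3.0, 2.5, 0.2, 0.1, 0.1])
print(quantile_partition(scores, num_regions=4))  # finer cuts around the high-score indices 2-4
```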
DART significantly boosts the performance of various backbones on the ImageNet-1K dataset while efficiently managing computational resources.
DART consistently improves Top-1 accuracy for DeiT, Vim, and VideoMamba models. Notably, during long-sequence fine-tuning, DART achieves superior or comparable accuracy with a substantial reduction in GFLOPs.
| Backbone | Tokenizer | Params (M) | Patches | GFLOPs | Top-1 (%) |
|---|---|---|---|---|---|
| Transformers | | | | | |
| DeiT-Ti | | 6 | 196 | 1.26 | 72.2 |
| DeiT-Ti | DART | 7 | 196 | 1.32 | 73.8 (+1.6) |
| DeiT-S | | 22 | 196 | 4.61 | 79.8 |
| DeiT-S | DART | 24 | 196 | 4.84 | 80.6 (+0.8) |
| DeiT-S† | | 22 | 576 | 15.5 | 81.6 |
| DeiT-S† | DART | 24 | 392 | 10.1 (-35%) | 81.8 (+0.2) |
| SSMs | | | | | |
| Vim-Ti | | 7 | 196 | 1.60 | 76.1 |
| Vim-Ti | DART | 8 | 196 | 1.68 | 77.2 (+1.1) |
| Vim-S | | 26 | 196 | 5.30 | 80.5 |
| Vim-S | DART | 29 | 196 | 5.55 | 81.5 (+1.0) |
| VideoMamba-Ti | | 7 | 196 | 1.08 | 76.9 |
| VideoMamba-Ti | DART | 8 | 196 | 1.15 | 78.2 (+1.3) |
| Vim-Ti† | | 7 | 784 | 5.95 | 78.3 |
| Vim-Ti† | DART | 8 | 392 | 3.29 (-45%) | 78.9 (+0.6) |
| Vim-S† | | 26 | 784 | 19.6 | 81.6 |
| Vim-S† | DART | 29 | 392 | 10.9 (-44%) | 82.2 (+0.6) |
| VideoMamba-Ti† | | 7 | 784 | 4.30 | 79.3 |
| VideoMamba-Ti† | | 7 | 1296 | 7.11 | 79.6 |
| VideoMamba-Ti† | DART | 8 | 392 | 2.24 (-69%) | 79.7 (+0.1) |

† denotes long-sequence fine-tuning.
DART demonstrates superior performance and efficiency compared to other dynamic inference methods for ViT.
| Model | Patches | GFLOPs | Acc. (%) |
|---|---|---|---|
| A-ViT-T | dynamic | 0.8 | 71.0 |
| DeiT-Ti + DART | 121 | 0.8 | 71.8 |
| DeiT-Ti + DART | 196 | 1.3 | 73.8 |
| DeiT-S | 196 | 4.61 | 79.8 |
| DynamicViT-S/0.5 | dynamic | 7.0 | 80.3 |
| DeiT-S + DART | 196 | 4.8 | 80.6 |
| DeiT-S + DART | 392 | 10.1 | 81.8 |
The choice of scoring backbone in DART offers a trade-off between parameter count and accuracy improvement.
| Scoring Network | Params (M) | GFLOPs | Top-1 (%) |
|---|---|---|---|
| w/o (DeiT-Ti baseline) | 6 | 1.26 | 72.2 |
| MobileNetV3 Small | 7 | 1.32 | 73.8 (+1.6) |
| MnasNet | 7 | 1.37 | 74.0 (+1.8) |
| SqueezeNet | 7 | 1.54 | 74.3 (+2.1) |
| EfficientNet-B0 | 10 | 2.41 | 75.1 (+2.9) |
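As a rough sketch of what such a scoring network can look like (an illustration under our own assumptions, not the repository's implementation: the 1x1 head, output resolution, and softmax normalization are made up for clarity), a lightweight torchvision backbone such as MobileNetV3-Small can produce a coarse per-region importance map:

```python
# Hypothetical region-scoring head on top of a lightweight torchvision backbone.
# The real DART scoring network may differ in head design and normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small

class RegionScorer(nn.Module):
    def __init__(self, grid: int = 14):
        super().__init__()
        self.features = mobilenet_v3_small(weights=None).features  # 576-channel feature map
        self.head = nn.Conv2d(576, 1, kernel_size=1)                # 1-channel score map
        self.grid = grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                                        # (B, 576, H/32, W/32)
        s = self.head(f)                                            # (B, 1, h, w)
        s = F.interpolate(s, size=(self.grid, self.grid),
                          mode="bilinear", align_corners=False)     # resample to the token grid
        return s.flatten(1).softmax(dim=1)                          # per-region importance weights

scorer = RegionScorer()
weights = scorer(torch.randn(2, 3, 224, 224))  # (2, 196) importance distribution
```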
DART is designed as a self-contained component and has minimal dependencies.
- Clone the repository:

  ```bash
  git clone https://github.com/HCPLab-SYSU/DART.git
  cd DART
  ```

- Set up the environment: DART's primary dependency is PyTorch. We recommend using a virtual environment.

  ```bash
  # Example using conda
  conda create -n dart_env python=3.10
  conda activate dart_env
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```
For video fine-tuning, we follow the training recipes from VideoMamba. We only add a new flag --num_patches to control the total number of dynamic patches. See video/README.md for a runnable example.
- Reference: VideoMamba
We provide pretrained model weights for various backbones enhanced with DART.
| Model | Download Link |
|---|---|
| DeiT-Tiny variants | |
| `darvit-tiny.pth` | Link |
| `darvit-tiny-v2.pth` | Link |
| `darvit-tiny-ft121p.pth` | Link |
| DeiT-Small variants | |
| `darvit-sm.pth` | Link |
| `darvit-sm-ft144p.pth` | Link |
| `darvit-sm-ft288p.pth` | Link |
| `darvit-sm-ft392p.pth` | Link |
| Vim variants | |
| `darvim-tiny.pth` | Link |
| `darvim-tiny-ft2seq.pth` | Link |
| `darvim-sm.pth` | Link |
| `darvim-sm-ft2seq.pth` | Link |
| RMT variant | |
| `darmt-L6.pth` | Link |
The training and evaluation scripts are adapted from the DeiT repository.
Use the following command for multi-GPU training. Hyperparameters should align with the baseline models.
```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
    --model darvit_tiny \
    --batch-size 256 \
    --data-path /path/to/your/imagenet \
    --data-set IMNET \
    --output_dir /path/to/save/models \
    --input-size 448
```

To evaluate a pretrained model, use the --eval flag.
```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
    --model darvit_tiny \
    --batch-size 256 \
    --data-path /path/to/your/imagenet \
    --data-set IMNET \
    --resume /path/to/your/darvit-tiny.pth \
    --input-size 448 \
    --eval
```

The core logic of DART is encapsulated in the dart/ directory, designed as a standard Python package. To integrate DART into a new vision model, you can replace the standard static patch embedding layer with the DART module. See models_deit.py for an example of how DART is integrated into the DeiT architecture.
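The sketch below shows this integration pattern in isolation; the `AdaptiveRegionTokenizer` name and its call signature are hypothetical placeholders (consult dart/ and models_deit.py for the real interface). The point is that the only change to the backbone is which tokenizer module produces the token sequence.

```python
# Minimal integration sketch: swap a fixed-grid patch embedding for an adaptive tokenizer.
# The DART-specific module name and constructor below are assumptions, not the repo's API.
import torch
import torch.nn as nn

class FixedPatchEmbed(nn.Module):
    """Standard ViT tokenizer: a strided conv over a fixed 16x16 grid."""
    def __init__(self, patch: int = 16, dim: int = 192):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim) for 224x224 input

class TinyViT(nn.Module):
    def __init__(self, tokenizer: nn.Module, dim: int = 192, depth: int = 2, num_classes: int = 1000):
        super().__init__()
        self.tokenizer = tokenizer  # <- the only line that changes when adopting DART
        layer = nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.tokenizer(x)  # (B, N, dim); N may vary with an adaptive tokenizer
        return self.head(self.encoder(tokens).mean(dim=1))

# Baseline:        model = TinyViT(FixedPatchEmbed())
# With DART-like:  model = TinyViT(AdaptiveRegionTokenizer(...))  # hypothetical module name
model = TinyViT(FixedPatchEmbed())
logits = model(torch.randn(2, 3, 224, 224))  # (2, 1000)
```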
If you find our work useful in your research, please consider citing our paper:
```bibtex
@article{yin2025dart,
  title={DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models},
  author={Shicheng Yin and Kaixuan Yin and Yang Liu and Weixing Chen and Liang Lin},
  journal={arXiv preprint arXiv:2506.10390},
  year={2025}
}
```

This project is built upon the excellent work of several open-source repositories. We gratefully acknowledge the contributions from:
