PyTorch · arXiv:2506.10390 · MIT License
This repository contains the official PyTorch implementation for the paper: DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models.
DART is a fully differentiable tokenizer that adaptively partitions images into content-dependent patches of varying sizes, allocating more tokens to information-rich regions. It can be seamlessly integrated into Vision Transformer (ViT) and Vision Mamba (Vim) architectures to enhance performance with minimal or even reduced computational overhead.
DART adaptively allocates tokens, focusing on important regions (e.g., the bird) while using fewer tokens for the background.
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation.
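For intuition only, the 1-D sketch below illustrates the quantile-partitioning idea: region boundaries are placed at equal quantiles of the cumulative score mass, so high-score positions receive more, narrower regions. This is our own illustration under simplifying assumptions (the function name, normalization, and the hard, non-differentiable CDF inversion are not the repository's actual 2-D, fully differentiable implementation in dart/).

```python
# Illustrative 1-D sketch of quantile-based partitioning (not the DART module itself).
import torch

def quantile_partition(scores: torch.Tensor, num_regions: int) -> torch.Tensor:
    """scores: (L,) non-negative importance scores along one axis.
    Returns num_regions + 1 boundary indices in [0, L]."""
    cdf = torch.cumsum(scores / scores.sum(), dim=0)   # cumulative score mass
    levels = torch.linspace(0, 1, num_regions + 1)     # equally spaced quantile levels
    bounds = torch.searchsorted(cdf, levels)            # invert the CDF
    bounds[0], bounds[-1] = 0, scores.numel()           # pin the outer boundaries
    return bounds

scores = torch.tensor([0.1, 0.1, 2.0, 3.0, 2.5, 0.2, 0.1, 0.1])
print(quantile_partition(scores, num_regions=4))  # finer cuts around the high-score indices 2-4
```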
DART significantly boosts the performance of various backbones on the ImageNet-1K dataset while efficiently managing computational resources.
DART consistently improves Top-1 accuracy for DeiT, Vim, and VideoMamba models. Notably, during long-sequence fine-tuning, DART achieves superior or comparable accuracy with a substantial reduction in GFLOPs.
| Backbone | Tokenizer | Params (M) | Patches | GFLOPs | Top-1 (%) |
|---|---|---|---|---|---|
| Transformers | | | | | |
| DeiT-Ti | | 6 | 196 | 1.26 | 72.2 |
| DeiT-Ti | DART | 7 | 196 | 1.32 | 73.8 (+1.6) |
| DeiT-S | | 22 | 196 | 4.61 | 79.8 |
| DeiT-S | DART | 24 | 196 | 4.84 | 80.6 (+0.8) |
| DeiT-S† | | 22 | 576 | 15.5 | 81.6 |
| DeiT-S† | DART | 24 | 392 | 10.1 (-35%) | 81.8 (+0.2) |
| SSMs | | | | | |
| Vim-Ti | | 7 | 196 | 1.60 | 76.1 |
| Vim-Ti | DART | 8 | 196 | 1.68 | 77.2 (+1.1) |
| Vim-S | | 26 | 196 | 5.30 | 80.5 |
| Vim-S | DART | 29 | 196 | 5.55 | 81.5 (+1.0) |
| VideoMamba-Ti | | 7 | 196 | 1.08 | 76.9 |
| VideoMamba-Ti | DART | 8 | 196 | 1.15 | 78.2 (+1.3) |
| Vim-Ti† | | 7 | 784 | 5.95 | 78.3 |
| Vim-Ti† | DART | 8 | 392 | 3.29 (-45%) | 78.9 (+0.6) |
| Vim-S† | | 26 | 784 | 19.6 | 81.6 |
| Vim-S† | DART | 29 | 392 | 10.9 (-44%) | 82.2 (+0.6) |
| VideoMamba-Ti† | | 7 | 784 | 4.30 | 79.3 |
| VideoMamba-Ti† | | 7 | 1296 | 7.11 | 79.6 |
| VideoMamba-Ti† | DART | 8 | 392 | 2.24 (-69%) | 79.7 (+0.1) |

† denotes long-sequence fine-tuning.
DART demonstrates superior performance and efficiency compared to other dynamic inference methods for ViT.
| Model | Patches | GFLOPs | Acc. (%) |
|---|---|---|---|
| A-ViT-T | dynamic | 0.8 | 71.0 |
| DeiT-Ti + DART | 121 | 0.8 | 71.8 |
| DeiT-Ti + DART | 196 | 1.3 | 73.8 |
| DeiT-S | 196 | 4.61 | 79.8 |
| DynamicViT-S/0.5 | dynamic | 7.0 | 80.3 |
| DeiT-S + DART | 196 | 4.8 | 80.6 |
| DeiT-S + DART | 392 | 10.1 | 81.8 |
The choice of scoring backbone in DART offers a trade-off between parameter count and accuracy improvement.
| Scoring Network | Params (M) | GFLOPs | Top-1 (%) |
|---|---|---|---|
| w/o (DeiT-Ti baseline) | 6 | 1.26 | 72.2 |
| MobileNetV3 Small | 7 | 1.32 | 73.8 (+1.6) |
| MnasNet | 7 | 1.37 | 74.0 (+1.8) |
| SqueezeNet | 7 | 1.54 | 74.3 (+2.1) |
| EfficientNet-B0 | 10 | 2.41 | 75.1 (+2.9) |
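As a rough sketch of what such a scoring network can look like (an illustration under our own assumptions, not the repository's implementation: the 1x1 head, output resolution, and softmax normalization are made up for clarity), a lightweight torchvision backbone such as MobileNetV3-Small can produce a coarse per-region importance map:

```python
# Hypothetical region-scoring head on top of a lightweight torchvision backbone.
# The real DART scoring network may differ in head design and normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small

class RegionScorer(nn.Module):
    def __init__(self, grid: int = 14):
        super().__init__()
        self.features = mobilenet_v3_small(weights=None).features  # 576-channel feature map
        self.head = nn.Conv2d(576, 1, kernel_size=1)                # 1-channel score map
        self.grid = grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                                        # (B, 576, H/32, W/32)
        s = self.head(f)                                            # (B, 1, h, w)
        s = F.interpolate(s, size=(self.grid, self.grid),
                          mode="bilinear", align_corners=False)     # resample to the token grid
        return s.flatten(1).softmax(dim=1)                          # per-region importance weights

scorer = RegionScorer()
weights = scorer(torch.randn(2, 3, 224, 224))  # (2, 196) importance distribution
```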
DART is designed as a self-contained component and has minimal dependencies.
- Clone the repository:

  ```bash
  git clone https://github.com/HCPLab-SYSU/DART.git
  cd DART
  ```

- Set up the environment: DART's primary dependency is PyTorch. We recommend using a virtual environment.

  ```bash
  # Example using conda
  conda create -n dart_env python=3.10
  conda activate dart_env
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```
For video fine-tuning, we follow the training recipes from VideoMamba. We only add a new flag --num_patches to control the total number of dynamic patches. See video/README.md for a runnable example.
- Reference: VideoMamba
We provide pretrained model weights for various backbones enhanced with DART.
| Model | Download Link |
|---|---|
| DeiT-Tiny variants | |
| `darvit-tiny.pth` | Link |
| `darvit-tiny-v2.pth` | Link |
| `darvit-tiny-ft121p.pth` | Link |
| DeiT-Small variants | |
| `darvit-sm.pth` | Link |
| `darvit-sm-ft144p.pth` | Link |
| `darvit-sm-ft288p.pth` | Link |
| `darvit-sm-ft392p.pth` | Link |
| Vim variants | |
| `darvim-tiny.pth` | Link |
| `darvim-tiny-ft2seq.pth` | Link |
| `darvim-sm.pth` | Link |
| `darvim-sm-ft2seq.pth` | Link |
| RMT variant | |
| `darmt-L6.pth` | Link |
The training and evaluation scripts are adapted from the DeiT repository.
Use the following command for multi-GPU training. Hyperparameters should align with the baseline models.
```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
    --model darvit_tiny \
    --batch-size 256 \
    --data-path /path/to/your/imagenet \
    --data-set IMNET \
    --output_dir /path/to/save/models \
    --input-size 448
```

To evaluate a pretrained model, use the --eval flag.
```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
    --model darvit_tiny \
    --batch-size 256 \
    --data-path /path/to/your/imagenet \
    --data-set IMNET \
    --resume /path/to/your/darvit-tiny.pth \
    --input-size 448 \
    --eval
```

The core logic of DART is encapsulated in the dart/ directory, designed as a standard Python package. To integrate DART into a new vision model, you can replace the standard static patch embedding layer with the DART module. See models_deit.py for an example of how DART is integrated into the DeiT architecture.
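The sketch below shows this integration pattern in isolation; the `AdaptiveRegionTokenizer` name and its call signature are hypothetical placeholders (consult dart/ and models_deit.py for the real interface). The point is that the only change to the backbone is which tokenizer module produces the token sequence.

```python
# Minimal integration sketch: swap a fixed-grid patch embedding for an adaptive tokenizer.
# The DART-specific module name and constructor below are assumptions, not the repo's API.
import torch
import torch.nn as nn

class FixedPatchEmbed(nn.Module):
    """Standard ViT tokenizer: a strided conv over a fixed 16x16 grid."""
    def __init__(self, patch: int = 16, dim: int = 192):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim) for 224x224 input

class TinyViT(nn.Module):
    def __init__(self, tokenizer: nn.Module, dim: int = 192, depth: int = 2, num_classes: int = 1000):
        super().__init__()
        self.tokenizer = tokenizer  # <- the only line that changes when adopting DART
        layer = nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.tokenizer(x)  # (B, N, dim); N may vary with an adaptive tokenizer
        return self.head(self.encoder(tokens).mean(dim=1))

# Baseline:        model = TinyViT(FixedPatchEmbed())
# With DART-like:  model = TinyViT(AdaptiveRegionTokenizer(...))  # hypothetical module name
model = TinyViT(FixedPatchEmbed())
logits = model(torch.randn(2, 3, 224, 224))  # (2, 1000)
```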
If you find our work useful in your research, please consider citing our paper:
```bibtex
@article{yin2025dart,
  title={DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models},
  author={Shicheng Yin and Kaixuan Yin and Yang Liu and Weixing Chen and Liang Lin},
  journal={arXiv preprint arXiv:2506.10390},
  year={2025}
}
```

This project is built upon the excellent work of several open-source repositories. We gratefully acknowledge the contributions from:
