DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models

PyTorch · arXiv:2506.10390 · MIT License

This repository contains the official PyTorch implementation for the paper: DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models.

DART is a fully differentiable tokenizer that adaptively partitions images into content-dependent patches of varying sizes, allocating more tokens to information-rich regions. It can be seamlessly integrated into Vision Transformer (ViT) and Vision Mamba (Vim) architectures to enhance performance with minimal or even reduced computational overhead.

[Figure: DART method illustration]

DART adaptively allocates tokens, focusing on important regions (e.g., the bird) while using fewer tokens for the background.

Abstract

The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation.
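To make the quantile-based partitioning idea concrete, the sketch below cuts a single image axis into strips of roughly equal total importance. It is not the repository's implementation: the actual DART tokenizer is fully differentiable and operates on 2-D score maps, whereas this sketch uses a hard searchsorted cut, and the function name is purely illustrative.

    import torch

    def quantile_boundaries(scores_1d: torch.Tensor, num_regions: int) -> torch.Tensor:
        # Cumulative "mass" of the (non-negative) importance scores along one axis.
        cdf = torch.cumsum(scores_1d, dim=0)
        cdf = cdf / cdf[-1]
        # Equal-mass quantile targets, excluding the trivial 0 and 1 endpoints.
        targets = torch.linspace(0.0, 1.0, num_regions + 1)[1:-1]
        # First index whose cumulative mass reaches each target quantile.
        inner = torch.searchsorted(cdf, targets)
        zero = torch.zeros(1, dtype=inner.dtype)
        end = torch.full((1,), scores_1d.numel(), dtype=inner.dtype)
        return torch.cat([zero, inner, end])

    scores = torch.rand(224)                # e.g. column-wise sums of a learned score map
    cuts = quantile_boundaries(scores, 14)  # 15 boundaries -> 14 variable-width strips
    print(cuts)

Regions with high scores reach the target mass quickly and end up narrower (higher token density), while low-score regions are covered by wider strips.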

Main Results

DART improves the performance of various backbones on ImageNet-1K while keeping computational cost comparable to, or lower than, the baselines.

Performance on Transformers and SSMs

DART consistently improves Top-1 accuracy for DeiT, Vim, and VideoMamba models. Notably, during long-sequence fine-tuning, DART achieves superior or comparable accuracy with a substantial reduction in GFLOPs.

| Backbone | Tokenizer | Params (M) | Patches | GFLOPs | Top-1 (%) |
|---|---|---|---|---|---|
| Transformers | | | | | |
| DeiT-Ti | | 6 | 196 | 1.26 | 72.2 |
| DeiT-Ti | DART | 7 | 196 | 1.32 | 73.8 (+1.6) |
| DeiT-S | | 22 | 196 | 4.61 | 79.8 |
| DeiT-S | DART | 24 | 196 | 4.84 | 80.6 (+0.8) |
| DeiT-S† | | 22 | 576 | 15.5 | 81.6 |
| DeiT-S† | DART | 24 | 392 | 10.1 (-35%) | 81.8 (+0.2) |
| SSMs | | | | | |
| Vim-Ti | | 7 | 196 | 1.60 | 76.1 |
| Vim-Ti | DART | 8 | 196 | 1.68 | 77.2 (+1.1) |
| Vim-S | | 26 | 196 | 5.30 | 80.5 |
| Vim-S | DART | 29 | 196 | 5.55 | 81.5 (+1.0) |
| VideoMamba-Ti | | 7 | 196 | 1.08 | 76.9 |
| VideoMamba-Ti | DART | 8 | 196 | 1.15 | 78.2 (+1.3) |
| Vim-Ti† | | 7 | 784 | 5.95 | 78.3 |
| Vim-Ti† | DART | 8 | 392 | 3.29 (-45%) | 78.9 (+0.6) |
| Vim-S† | | 26 | 784 | 19.6 | 81.6 |
| Vim-S† | DART | 29 | 392 | 10.9 (-44%) | 82.2 (+0.6) |
| VideoMamba-Ti† | | 7 | 784 | 4.30 | 79.3 |
| VideoMamba-Ti† | | 7 | 1296 | 7.11 | 79.6 |
| VideoMamba-Ti† | DART | 8 | 392 | 2.24 (-69%) | 79.7 (+0.1) |

† denotes long‐sequence fine‐tuning.

Comparison with Dynamic Tokenizers

DART demonstrates superior performance and efficiency compared to other dynamic inference methods for ViT.

| Model | Patches | GFLOPs | Acc. (%) |
|---|---|---|---|
| A-ViT-T | dynamic | 0.8 | 71.0 |
| DeiT-Ti + DART | 121 | 0.8 | 71.8 |
| DeiT-Ti + DART | 196 | 1.3 | 73.8 |
| DeiT-S | 196 | 4.61 | 79.8 |
| DynamicViT-S/0.5 | dynamic | 7.0 | 80.3 |
| DeiT-S + DART | 196 | 4.8 | 80.6 |
| DeiT-S + DART | 392 | 10.1 | 81.8 |

Ablation on Scoring Network

The choice of scoring backbone in DART offers a trade-off between parameter count and accuracy improvement.

| Scoring Network | Params (M) | GFLOPs | Top-1 (%) |
|---|---|---|---|
| w/o (DeiT-Ti baseline) | 6 | 1.26 | 72.2 |
| MobileNetV3 Small | 7 | 1.32 | 73.8 (+1.6) |
| MnasNet | 7 | 1.37 | 74.0 (+1.8) |
| SqueezeNet | 7 | 1.54 | 74.3 (+2.1) |
| EfficientNet-B0 | 10 | 2.41 | 75.1 (+2.9) |

Installation

DART is designed as a self-contained component and has minimal dependencies.

  1. Clone the repository:

    git clone https://github.com/HCPLab-SYSU/DART.git
    cd DART
  2. Set up the environment: DART's primary dependency is PyTorch. We recommend using a virtual environment.

    # Example using conda
    conda create -n dart_env python=3.10
    conda activate dart_env
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
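  3. Verify the installation (optional): the following one-liner prints the installed PyTorch version and whether CUDA is available.

    # Optional sanity check for the PyTorch install
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"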

Video Training

For video fine-tuning, we follow the training recipes from VideoMamba. We only add a new flag --num_patches to control the total number of dynamic patches. See video/README.md for a runnable example.
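For illustration only, the change amounts to appending this flag to VideoMamba's usual fine-tuning command; everything in the sketch below except --num_patches (the script name, model flag, and remaining hyperparameters) is a placeholder standing in for VideoMamba's own recipe rather than a command from this repository.

    # Placeholder sketch: only --num_patches is DART-specific; the script name and all
    # other options follow VideoMamba's recipe (see video/README.md for the runnable version).
    torchrun --nproc_per_node=8 <videomamba_finetuning_script>.py \
        --model videomamba_tiny \
        --num_patches 392 \
        <other VideoMamba hyperparameters>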

Model Zoo

We provide pretrained model weights for various backbones enhanced with DART.

| Model | Download Link |
|---|---|
| DeiT-Tiny variants | |
| darvit-tiny.pth | Link |
| darvit-tiny-v2.pth | Link |
| darvit-tiny-ft121p.pth | Link |
| DeiT-Small variants | |
| darvit-sm.pth | Link |
| darvit-sm-ft144p.pth | Link |
| darvit-sm-ft288p.pth | Link |
| darvit-sm-ft392p.pth | Link |
| Vim variants | |
| darvim-tiny.pth | Link |
| darvim-tiny-ft2seq.pth | Link |
| darvim-sm.pth | Link |
| darvim-sm-ft2seq.pth | Link |
| RMT variant | |
| darmt-L6.pth | Link |

Training and Evaluation

The training and evaluation scripts are adapted from the DeiT repository.

Training

Use the following command for multi-GPU training. Hyperparameters should align with the baseline models.

python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
--model darvit_tiny \
--batch-size 256 \
--data-path /path/to/your/imagenet \
--data-set IMNET \
--output_dir /path/to/save/models \
--input-size 448

Evaluation

To evaluate a pretrained model, use the --eval flag.

python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
--model darvit_tiny \
--batch-size 256 \
--data-path /path/to/your/imagenet \
--data-set IMNET \
--resume /path/to/your/darvit-tiny.pth \
--input-size 448 \
--eval

Integrating DART with New Models

The core logic of DART is encapsulated in the dart/ directory, which is organized as a standard Python package. To integrate DART into a new vision model, replace the model's static patch embedding layer with the DART module. See models_deit.py for an example of how DART is integrated into the DeiT architecture.
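As a rough sketch of this pattern (not the repository's actual API: the import path, class name, and constructor arguments below are assumptions, so consult the dart/ package and models_deit.py for the real names):

    import torch.nn as nn

    # Hypothetical import path and class name; check the dart/ package for the real ones.
    from dart import DARTTokenizer

    def attach_dart(vit_model: nn.Module, num_patches: int = 196, embed_dim: int = 192) -> nn.Module:
        """Swap a ViT-style model's fixed-grid patch embedding for an adaptive tokenizer.

        Assumes the model exposes a `patch_embed` module mapping images (B, 3, H, W)
        to tokens (B, N, D), as in timm/DeiT-style implementations, and that the DART
        module follows the same input/output contract.
        """
        vit_model.patch_embed = DARTTokenizer(num_patches=num_patches, embed_dim=embed_dim)
        return vit_model

The number of output tokens then comes from the tokenizer rather than the image grid, so positional embeddings and any downstream sequence-length assumptions must be sized to match.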

Citation

If you find our work useful in your research, please consider citing our paper:

@article{yin2025dart,
  title={DART: Differentiable Adaptive Region Tokenizer for Vision Foundation Models},
  author={Shicheng Yin and Kaixuan Yin and Yang Liu and Weixing Chen and Liang Lin},
  journal={arXiv preprint arXiv:2506.10390},
  year={2025}
}

Acknowledgements

This project is built upon the excellent work of several open-source repositories. In particular, we gratefully acknowledge the contributions of DeiT, Vim, and VideoMamba.
