
# 🚕 CAB: Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge

## ✨ Overview

*Figure 1: Towards Effective Attention-to-SSM Distillation. We highlight the structural complementarity between attention-based and SSM-based models, and the limitations of direct attention transfer, motivating our proposed alignment-based distillation approach.*

CAB is a data-efficient framework for transferring attention knowledge from Transformer teachers to state-space student models such as Mamba. It introduces a lightweight MLP-based bridge that aligns Transformer’s attention projections (Q/K) with Mamba’s dynamic projections (B/C), enabling fine-grained, token-level supervision. CAB further adopts a hierarchical layer alignment strategy to handle architectural heterogeneity. Across both vision and language tasks, CAB achieves superior performance and efficiency, demonstrating that attention-based inductive biases can be effectively transferred to recurrent models.
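To make the bridge idea concrete, here is a minimal sketch of what an MLP-based attention bridge could look like: a small MLP maps the student's dynamic projections (B/C) into the teacher's attention-projection space (K/Q), and the two are compared token by token. All names, dimensions, and the choice of MSE as the alignment loss are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBridge(nn.Module):
    """Hypothetical sketch of an attention bridge in the spirit of CAB:
    lightweight MLPs map Mamba's dynamic projections (B, C) into the
    Transformer teacher's projection space (K, Q) for token-level supervision.
    Dimensions and loss choice are illustrative assumptions."""

    def __init__(self, student_dim: int, teacher_dim: int, hidden_dim: int = 256):
        super().__init__()
        # One small MLP per projection pair: B -> K and C -> Q.
        self.b_to_k = nn.Sequential(
            nn.Linear(student_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, teacher_dim),
        )
        self.c_to_q = nn.Sequential(
            nn.Linear(student_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, B, C, K, Q):
        # B, C: (batch, seq, student_dim); K, Q: (batch, seq, teacher_dim).
        # Token-level alignment loss between bridged student projections
        # and the teacher's attention projections.
        loss_k = F.mse_loss(self.b_to_k(B), K)
        loss_q = F.mse_loss(self.c_to_q(C), Q)
        return loss_k + loss_q
```

Because the bridge operates on per-token projection vectors rather than full attention maps, its cost grows linearly with sequence length.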

## 🔍 Key Features

*Figure 2: Top-1 accuracy comparison between pretraining and distillation methods on ImageNet classification under varying proportions of training data.*

- **Attention Bridge** – A lightweight MLP module that aligns Transformer attention (Q/K) with Mamba’s dynamic projections (B/C). This enables fine-grained, token-level supervision and allows effective transfer of attention structures into recurrent state-space models.
- **Dual Efficiency** – CAB achieves both computational and data efficiency: it avoids the heavy quadratic cost of dense attention alignment and remains effective in low-data regimes, making it a scalable solution for cross-architecture knowledge transfer.
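The efficiency claim is easy to verify by counting tensor elements: aligning dense attention maps requires materializing an L × L matrix per head, while projection-level alignment only touches L × d tensors. The numbers below are an illustrative sketch, not measurements from the paper.

```python
import torch

L, d = 1024, 64  # sequence length, head dimension (illustrative)
q = torch.randn(L, d)
k = torch.randn(L, d)

# Dense attention-map alignment must materialize an L x L matrix per head...
attn = torch.softmax(q @ k.T / d**0.5, dim=-1)

# ...whereas projection-level alignment only touches the L x d tensors.
assert attn.numel() == L * L               # quadratic in sequence length
assert q.numel() + k.numel() == 2 * L * d  # linear in sequence length
```

At L = 1024 and d = 64, that is a 1,048,576-element map versus 131,072 projection elements, and the gap widens quadratically with sequence length.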

*Figure 3: Similarity between the attention matrices of Vim and a pretrained ViT, with and without attention alignment. Higher similarity indicates better-aligned attention representations.*
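A similarity measurement like the one in Figure 3 can be sketched as a mean cosine similarity between flattened per-head attention maps. This helper and its name are hypothetical; the paper's exact metric may differ.

```python
import torch
import torch.nn.functional as F

def attention_similarity(student_attn: torch.Tensor,
                         teacher_attn: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between per-head attention maps.

    Both inputs have shape (heads, L, L); each head's map is flattened
    and compared against the teacher's, then averaged over heads.
    Illustrative sketch, not the paper's exact metric.
    """
    s = student_attn.flatten(start_dim=-2)  # (heads, L * L)
    t = teacher_attn.flatten(start_dim=-2)
    return F.cosine_similarity(s, t, dim=-1).mean()
```

Identical maps score 1.0; uncorrelated maps score near their baseline overlap, so the metric gives a simple scalar readout of how closely the student reproduces the teacher's attention structure.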


## ⚙️ Quick Start

We recommend Python 3.10+.

### Create Environment

```bash
conda create -n CAB python=3.10
conda activate CAB
```

### 🖼️ Vision Task Setup (vision_CAB)

```bash
# Install vision task dependencies
pip install -r requirements.txt
pip install -e causal_conv1d   # requires causal_conv1d >= 1.1.0
pip install -e mamba-1p1p1

# (Optional) Create a subset of the dataset, e.g., 10% of ImageNet
python create_subset.py

# Run distillation
bash run_distill.sh
```

### 💬 Language Task Setup (phi_mamba_CAB)

```bash
# Install language task dependencies
pip install -r requirements.txt

# Run distillation
bash run.sh
```

## 🤝 Acknowledgements

This project builds on:

- **Vim** — Vision Mamba: Efficient visual state-space models for image understanding.
- **Phi-Mamba** — A Mamba-based language model for efficient sequence modeling.
- **Attention Transfer** — A PyTorch implementation of attention-based knowledge distillation methods.

## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


## 📚 Citation

If you find CAB useful, please cite our paper:

```bibtex
@misc{wang2025dataefficienttransformertomambadistillation,
      title={Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge},
      author={Penghao Wang and Yuhao Zhou and Mengxuan Wu and Panpan Zhang and Zhangyang Wang and Kai Wang},
      year={2025},
      eprint={2510.19266},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.19266},
}
```
