
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan1†, Jiaming Han2, Joey Tsai3, Hongwei Xue1, Rongyao Fang2,
Lingyi Hong, Ziyu Guo2, Ray Zhang2‡

1Accio Team, Alibaba Group 2CUHK MMLab 3Tsinghua University

†Project Leader ‡Corresponding author

CrossLMM Framework

Paper • Introduction • Model

🔥 News

  • [2025-05-23] 🔥🔥🔥 We released the paper.

🧠 Introduction

We present CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the number of visual tokens with minimal performance degradation. Specifically, we first apply a pooling methodology to significantly reduce the token count produced by pretrained visual encoders. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, in which the pooled visual tokens serve as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension.

👀 Model

CrossLMM Architecture

🚩 Main Innovations

1. 🌟 Token Reduction via Pooling

  • Significantly compress the number of tokens from pretrained visual encoders for efficient representation.
  • Apply a simple pooling strategy that retains critical visual information while reducing the token count (see the sketch below).
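To make the idea concrete, here is a minimal sketch of pooling-based token reduction, assuming per-frame encoder tokens laid out on a square grid. The class name (`TokenPooler`), pooling factor, and shapes are illustrative assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class TokenPooler(nn.Module):
    """Compress per-frame visual tokens with simple 2D average pooling (hypothetical sketch)."""

    def __init__(self, pool_factor: int = 4):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=pool_factor, stride=pool_factor)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (batch * frames, grid * grid, dim) from the visual encoder
        bf, _, dim = tokens.shape
        x = tokens.transpose(1, 2).reshape(bf, dim, grid, grid)
        x = self.pool(x)                     # (bf, dim, grid // p, grid // p)
        return x.flatten(2).transpose(1, 2)  # (bf, reduced_tokens, dim)

# e.g. a 24x24 grid (576 tokens per frame) compresses to 6x6 = 36 tokens
pooled = TokenPooler(pool_factor=4)(torch.randn(8, 576, 1024), grid=24)
print(pooled.shape)  # torch.Size([8, 36, 1024])
```

With a factor of 4, each frame's token count drops 16x while average pooling preserves a coarse spatial summary for the cross-attention stages below.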

2. 🚀 Visual-to-Visual Cross-Attention

  • Novel architecture design: pooled visual tokens act as queries attending over the original visual token set.
  • Enables the model to capture fine-grained visual details, maintaining fidelity even under strong token compression (see the sketch below).
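Below is a minimal sketch of such a visual-to-visual cross-attention block, assuming standard multi-head attention with a residual connection; the module name and hyperparameters are hypothetical stand-ins for whatever the repo uses.

```python
import torch
import torch.nn as nn

class VisualToVisualCrossAttention(nn.Module):
    """Pooled visual tokens (queries) attend over the original token set (hypothetical sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pooled: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        # pooled:   (batch, n_pooled, dim)   -- queries
        # original: (batch, n_original, dim) -- keys and values
        out, _ = self.attn(query=pooled, key=original, value=original)
        return self.norm(pooled + out)  # residual update keeps the pooled shape

v2v = VisualToVisualCrossAttention()
refined = v2v(torch.randn(2, 36, 1024), torch.randn(2, 576, 1024))
print(refined.shape)  # torch.Size([2, 36, 1024])
```

Because only the compact pooled tokens enter the LLM's sequence, the attention cost over the full token set is paid once per block rather than at every self-attention layer.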

3. 🔮 Text-to-Visual Cross-Attention

  • Enhances text token representations through interaction with the original visual tokens.
  • Deepens text-visual alignment, offering richer contextual understanding for multimodal downstream tasks (see the sketch below).
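A text-to-visual block can follow the same pattern, with text tokens as queries against the original (uncompressed) visual tokens. As above, this is an assumption-level sketch rather than the repository's code.

```python
import torch
import torch.nn as nn

class TextToVisualCrossAttention(nn.Module):
    """Text tokens (queries) attend over the original visual tokens (hypothetical sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (batch, n_text, dim)   -- queries
        # visual: (batch, n_visual, dim) -- keys and values
        out, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + out)  # text tokens enriched with visual context

t2v = TextToVisualCrossAttention()
enriched = t2v(torch.randn(2, 64, 1024), torch.randn(2, 576, 1024))
print(enriched.shape)  # torch.Size([2, 64, 1024])
```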

🔗 Framework Benefits

  • The dual cross-attention mechanism maximizes model efficiency while preserving the ability to handle long-form video content.
  • Achieves a strong balance between computational efficiency and fine-grained multimodal understanding, empowering advanced video-language applications.

This architecture enables efficient and scalable video-text modeling while maintaining state-of-the-art accuracy.

🥳 Acknowledgements

We would like to thank LLaVA-NeXT, upon which our repo is built.

📄 Cite

@article{yan2025crosslmm,
  title={CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms},
  author={Yan, Shilin and Han, Jiaming and Tsai, Joey and Xue, Hongwei and Fang, Rongyao and Hong, Lingyi and Guo, Ziyu and Zhang, Ray},
  journal={arXiv preprint arXiv:2505.17020},
  year={2025}
}

📧 Contact

If you have any questions about this project, please feel free to contact tattoo.ysl@gmail.com.
