
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan1†, Jiaming Han2, Joey Tsai3, Hongwei Xue1, Rongyao Fang2,
Lingyi Hong, Ziyu Guo2, Ray Zhang2‡

1Accio Team, Alibaba Group 2CUHK MMLab 3Tsinghua University

†Project Leader ‡Corresponding author

CrossLMM Framework

Paper • Introduction • Model

🔥 News

  • [2025-05-23] 🔥🔥🔥 We released the paper.

🧠 Introduction

We present CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the number of visual tokens with minimal performance degradation. Specifically, we first apply a pooling methodology to significantly reduce the token count produced by pretrained visual encoders. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, in which the pooled visual tokens serve as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension.

👀 Model

CrossLMM Architecture

🚩 Main Innovations

1. 🌟 Token Reduction via Pooling

  • Significantly compress the number of tokens from pretrained visual encoders for efficient representation.
  • Apply a simple pooling strategy that retains critical visual information while reducing the token count (see the sketch below).
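To make the idea concrete, here is a minimal sketch of pooling-based token reduction, assuming per-frame encoder tokens laid out on a square grid. The class name (`TokenPooler`), pooling factor, and shapes are illustrative assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class TokenPooler(nn.Module):
    """Compress per-frame visual tokens with simple 2D average pooling (hypothetical sketch)."""

    def __init__(self, pool_factor: int = 4):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=pool_factor, stride=pool_factor)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (batch * frames, grid * grid, dim) from the visual encoder
        bf, _, dim = tokens.shape
        x = tokens.transpose(1, 2).reshape(bf, dim, grid, grid)
        x = self.pool(x)                     # (bf, dim, grid // p, grid // p)
        return x.flatten(2).transpose(1, 2)  # (bf, reduced_tokens, dim)

# e.g. a 24x24 grid (576 tokens per frame) compresses to 6x6 = 36 tokens
pooled = TokenPooler(pool_factor=4)(torch.randn(8, 576, 1024), grid=24)
print(pooled.shape)  # torch.Size([8, 36, 1024])
```

With a factor of 4, each frame's token count drops 16x while average pooling preserves a coarse spatial summary for the cross-attention stages below.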

2. 🚀 Visual-to-Visual Cross-Attention

  • Novel architecture design: pooled visual tokens act as queries attending over the original visual token set.
  • Enables the model to capture fine-grained visual details, maintaining fidelity even under strong token compression (see the sketch below).
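Below is a minimal sketch of such a visual-to-visual cross-attention block, assuming standard multi-head attention with a residual connection; the module name and hyperparameters are hypothetical stand-ins for whatever the repo uses.

```python
import torch
import torch.nn as nn

class VisualToVisualCrossAttention(nn.Module):
    """Pooled visual tokens (queries) attend over the original token set (hypothetical sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pooled: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        # pooled:   (batch, n_pooled, dim)   -- queries
        # original: (batch, n_original, dim) -- keys and values
        out, _ = self.attn(query=pooled, key=original, value=original)
        return self.norm(pooled + out)  # residual update keeps the pooled shape

v2v = VisualToVisualCrossAttention()
refined = v2v(torch.randn(2, 36, 1024), torch.randn(2, 576, 1024))
print(refined.shape)  # torch.Size([2, 36, 1024])
```

Because only the compact pooled tokens enter the LLM's sequence, the attention cost over the full token set is paid once per block rather than at every self-attention layer.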

3. 🔮 Text-to-Visual Cross-Attention

  • Enhances text token representations through interaction with the original visual tokens.
  • Deepens text-visual alignment, offering richer contextual understanding for multimodal downstream tasks (see the sketch below).
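A text-to-visual block can follow the same pattern, with text tokens as queries against the original (uncompressed) visual tokens. As above, this is an assumption-level sketch rather than the repository's code.

```python
import torch
import torch.nn as nn

class TextToVisualCrossAttention(nn.Module):
    """Text tokens (queries) attend over the original visual tokens (hypothetical sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (batch, n_text, dim)   -- queries
        # visual: (batch, n_visual, dim) -- keys and values
        out, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + out)  # text tokens enriched with visual context

t2v = TextToVisualCrossAttention()
enriched = t2v(torch.randn(2, 64, 1024), torch.randn(2, 576, 1024))
print(enriched.shape)  # torch.Size([2, 64, 1024])
```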

🔗 Framework Benefits

  • The dual cross-attention mechanism maximizes model efficiency while preserving the ability to handle long-form video content.
  • Achieves a strong balance between computational efficiency and fine-grained multimodal understanding, empowering advanced video-language applications.

This architecture enables efficient and scalable video-text modeling while maintaining state-of-the-art accuracy.

🥳 Acknowledgements

We would like to thank LLaVA-NeXT, upon which our repo is built.

📄 Cite

@article{yan2025crosslmm,
  title={CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms},
  author={Yan, Shilin and Han, Jiaming and Tsai, Joey and Xue, Hongwei and Fang, Rongyao and Hong, Lingyi and Guo, Ziyu and Zhang, Ray},
  journal={arXiv preprint arXiv:2505.17020},
  year={2025}
}

📧 Contact

If you have any questions about this project, please feel free to contact tattoo.ysl@gmail.com.
