# FMVR: Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan, Zhihao Dou, and Shuo Li

CVPR Findings, 2026
## Overview

FMVR disentangles the visual representation of a reduced set of visual tokens into low- and high-frequency components via AvgPool and MaxPool. The high-frequency component from AvgPool acts as a saliency filter that enhances salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter that strengthens weak visual semantics. We further inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, so the number of visual tokens can be adjusted elastically at inference time while maintaining comparable performance.
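As a reading aid, below is a minimal PyTorch sketch of one plausible instantiation of this decomposition. The residual-of-pooling formulation, the kernel size, and the nested token budgets are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FrequencyDecomposition(nn.Module):
    """Toy sketch: split visual tokens into high-/low-frequency parts.
    The residual-of-pooling formulation and kernel size are assumptions."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.avg_pool = nn.AvgPool1d(kernel_size, stride=1, padding=pad)
        self.max_pool = nn.MaxPool1d(kernel_size, stride=1, padding=pad)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim); pool along the token axis
        x = tokens.transpose(1, 2)       # (batch, dim, num_tokens)
        high = x - self.avg_pool(x)      # residual of the smoothed signal (saliency filter)
        low = x - self.max_pool(x)       # residual of the peak signal (anti-saliency filter)
        return high.transpose(1, 2), low.transpose(1, 2)

# Matryoshka-style nesting: coarse-to-fine prefixes of the token sequence.
# The token budgets below are hypothetical.
decomp = FrequencyDecomposition()
tokens = torch.randn(2, 576, 1024)       # e.g. a 24x24 CLIP-ViT-L/336 token grid
for budget in (36, 144, 576):
    high, low = decomp(tokens[:, :budget])
    print(budget, high.shape, low.shape)
```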
## Install

- Clone this repository and navigate to the FMVR folder:

```bash
git clone https://github.com/QingtaoPan/FMVR.git
cd FMVR
```

- Install the package:

```bash
conda create -n FMVR python=3.10 -y
conda activate FMVR
pip install --upgrade pip
pip install -e .
```

- [Optional] Install additional packages for training cases:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
```

## LLaVA Checkpoints

Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) |
| LLaVA-1.5 | Vicuna-13B | [liuhaotian/llava-v1.5-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b) |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | [liuhaotian/llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) |
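Optionally, a checkpoint can be fetched programmatically with the `huggingface_hub` library; the local directory below is an arbitrary example path:

```python
from huggingface_hub import snapshot_download

# Download LLaVA-1.5-7B into a local folder (the path is an example choice).
snapshot_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    local_dir="checkpoints/llava-v1.5-7b",
)
```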
## Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here.

Please refer to the LLaVA-1.5 documentation, set up the environment in the same way, and organize the training data under ./playground. Then run the following command to start pretraining:
```bash
bash scripts/v1_5/pretrain.sh
```

## Visual Instruction Tuning

Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
Download the dataset images as in the LLaVA-1.5 finetuning process and place them under ./playground/data; a sketch of the expected layout is shown below.
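Assuming FMVR follows LLaVA-1.5's data convention (this layout is taken from the LLaVA repository, not verified against FMVR's scripts), the images should be organized roughly as:

```
playground/data
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_val_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```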
Then run:

```bash
bash scripts/v1_5/finetune.sh
```

## Evaluation

For evaluation we largely reuse the LLaVA-1.5 testing code, and the basic usage is the same. Please refer here for help. We provide the same scripts to complete the testing.
## Citation

If you find FMVR useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{pan2026frequency,
title={Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models},
author={Pan, Qingtao and Dou, Zhihao and Li, Shuo},
journal={arXiv preprint arXiv:2603.11220},
year={2026}
}
```

## Acknowledgement

We appreciate the open-source efforts of LLaVA and CDPruner.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
