# FMVR: Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan, Zhihao Dou, and Shuo Li

CVPR Findings, 2026
## Overview

FMVR disentangles the visual representation of a reduced set of visual tokens into low- and high-frequency components via AvgPool and MaxPool. The high-frequency component from AvgPool acts as a saliency filter that enhances salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter that strengthens weak visual semantics. We further inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, so the number of visual tokens can be adjusted elastically at inference time while maintaining comparable performance.
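As a reading aid, below is a minimal PyTorch sketch of one plausible instantiation of this decomposition. The residual-of-pooling formulation, the kernel size, and the nested token budgets are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FrequencyDecomposition(nn.Module):
    """Toy sketch: split visual tokens into high-/low-frequency parts.
    The residual-of-pooling formulation and kernel size are assumptions."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.avg_pool = nn.AvgPool1d(kernel_size, stride=1, padding=pad)
        self.max_pool = nn.MaxPool1d(kernel_size, stride=1, padding=pad)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim); pool along the token axis
        x = tokens.transpose(1, 2)       # (batch, dim, num_tokens)
        high = x - self.avg_pool(x)      # residual of the smoothed signal (saliency filter)
        low = x - self.max_pool(x)       # residual of the peak signal (anti-saliency filter)
        return high.transpose(1, 2), low.transpose(1, 2)

# Matryoshka-style nesting: coarse-to-fine prefixes of the token sequence.
# The token budgets below are hypothetical.
decomp = FrequencyDecomposition()
tokens = torch.randn(2, 576, 1024)       # e.g. a 24x24 CLIP-ViT-L/336 token grid
for budget in (36, 144, 576):
    high, low = decomp(tokens[:, :budget])
    print(budget, high.shape, low.shape)
```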
## Install

- Clone this repository and navigate to the FMVR folder:

```bash
git clone https://github.com/QingtaoPan/FMVR.git
cd FMVR
```

- Install the package:

```bash
conda create -n FMVR python=3.10 -y
conda activate FMVR
pip install --upgrade pip
pip install -e .
```

- [Optional] Install additional packages for training cases:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
```

## LLaVA Checkpoints

Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) |
| LLaVA-1.5 | Vicuna-13B | [liuhaotian/llava-v1.5-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b) |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | [liuhaotian/llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) |
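Optionally, a checkpoint can be fetched programmatically with the `huggingface_hub` library; the local directory below is an arbitrary example path:

```python
from huggingface_hub import snapshot_download

# Download LLaVA-1.5-7B into a local folder (the path is an example choice).
snapshot_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    local_dir="checkpoints/llava-v1.5-7b",
)
```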
## Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here.

Please refer to the LLaVA-1.5 documentation, set up the environment in the same way, and organize the training data under ./playground. Then run the following command to start pretraining:
```bash
bash scripts/v1_5/pretrain.sh
```

## Visual Instruction Tuning

Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
Download the dataset images as in the LLaVA-1.5 finetuning process and place them under ./playground/data; a sketch of the expected layout is shown below.
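Assuming FMVR follows LLaVA-1.5's data convention (this layout is taken from the LLaVA repository, not verified against FMVR's scripts), the images should be organized roughly as:

```
playground/data
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_val_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```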
Then run:

```bash
bash scripts/v1_5/finetune.sh
```

## Evaluation

For evaluation we largely reuse the LLaVA-1.5 testing code, and the basic usage is the same. Please refer here for help. We provide the same scripts to complete the testing.
## Citation

If you find FMVR useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{pan2026frequency,
title={Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models},
author={Pan, Qingtao and Dou, Zhihao and Li, Shuo},
journal={arXiv preprint arXiv:2603.11220},
year={2026}
}
```

## Acknowledgement

We appreciate the open-source efforts of LLaVA and CDPruner.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
