M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision (ICCV 2025)
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene’s generalizability across eleven datasets for four RGBT downstream tasks.
To pretrain a multispectral foundation model with robust generalization capabilities, we made a comprehensive effort to collect the available RGBT datasets; the multispectral (RGBT) image datasets are cataloged at A Summary of Multispectral (RGBT) Image Datasets. Our careful collection and preprocessing yields RGBT550K, a comprehensive dataset of 548,238 high-quality samples. It covers diverse scenarios, tasks, lighting conditions, resolutions, and object categories, providing a solid foundation for the self-supervised pre-training of the multispectral foundation model. You can download the RGBT550K dataset from Baidu Cloud (code: rwf7) or OneDrive.
# RGBT550K Usage
sudo apt install p7zip-full
7z x RGBT550K_archive.7z.partaa
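After extraction, it is worth verifying that every thermal image has an RGB counterpart before pre-training. Below is a minimal sketch, assuming the archive unpacks into sibling rgb/ and ir/ directories with matching filenames (mirroring the symlink targets used later in this README; the root path is a placeholder):

```python
# check_pairs.py -- sanity-check RGB/thermal filename pairing (illustrative)
from pathlib import Path

root = Path("/path/to/RGBT550K")  # adjust to your extraction path
rgb = {p.name for p in (root / "rgb").rglob("*") if p.is_file()}
ir = {p.name for p in (root / "ir").rglob("*") if p.is_file()}

print(f"rgb: {len(rgb)}  ir: {len(ir)}  paired: {len(rgb & ir)}")
for name in sorted(rgb ^ ir)[:20]:  # list up to 20 unpaired files
    print("unpaired:", name)
```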
| Foundation Model | Backbone | Model Weights |
|---|---|---|
| M-SpecGene | ViT-B | M-SpecGene_VIT-B.pth |
Since the pretrained foundation model M-SpecGene above retains all parameters from self-supervised training, we extract the encoder for the detection (ViTDet) and segmentation (UperNet) tasks.
cd tool
python M-SpecGeneTransform_det.py # M-SpecGene_VIT-B_det_transform.pth
python M-SpecGeneTransform_seg.py # M-SpecGene_VIT-B_seg_transform.pth
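In essence, these scripts strip the self-supervised decoder and re-key the encoder weights so that the ViTDet and UperNet configs can load them. A minimal sketch of the idea, assuming a standard mmpretrain-style MAE checkpoint layout (the exact key prefixes used by this repo's scripts may differ):

```python
# encoder extraction sketch (illustrative; key prefixes are assumptions)
import torch

ckpt = torch.load("M-SpecGene_VIT-B.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)

encoder = {}
for k, v in state.items():
    if k.startswith("backbone."):          # keep encoder weights only
        encoder[k[len("backbone."):]] = v  # drop the 'backbone.' prefix
    # decoder, mask-token, and loss-related keys are discarded

torch.save({"state_dict": encoder}, "M-SpecGene_VIT-B_det_transform.pth")
```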
| Task | Backbone | Model Weights |
|---|---|---|
| Detection | ViT-B | M-SpecGene_VIT-B_det_transform.pth |
| Segmentation | ViT-B | M-SpecGene_VIT-B_seg_transform.pth |
| Task | Dataset | Trained Models | Performance |
|---|---|---|---|
| Detection | KAIST | KAIST_iter_25000.pth | MR^-2 23.74 |
| Detection | LLVIP | LLVIP_iter_105625.pth | mAP 65.3% |
| Detection | FLIR | FLIR_iter_90000.pth | mAP 44.7% |
| Segmentation | SemanticRT | SRT_iter_320000.pth | mIoU 79.84% |
| Segmentation | MVSEG | MVSEG_iter_240000.pth | mIoU 63.02% |
| Segmentation | FMB | FMB_iter_224000.pth | mIoU ~60% |
| SOD | VT5000 | VT5000_iter_54000.pth | S 0.892, MAE 0.028 |
a. RGBT550K dataset
# link the dataset
cd pretrain/mmpretrain-main_rgbt
ln -s /path/to/RGBT_CLEAN_v6/ir ./data/imagenet
ln -s /path/to/RGBT_CLEAN_v6/rgb ./data/imagenet2
b. preparation
Please refer to the mmpretrain documentation for detailed installation instructions, and download mae_single_modality_in148w_t48w_vit-b_epoch_400_dual_decoder.pth to ./work_dirs/mae_vit-base-p16_8xb512-amp-coslr-500e_in1k_siam/.
# after installation
mim install -e .
c. cross-modality self-supervised pretraining
bash tools/dist_train.sh configs/mae/mae_vit-base-p16_8xb512-amp-coslr-500e_in1k_siam.py 8
a. dataset preparation
Please download the FLIR, LLVIP, and KAIST datasets to the appropriate paths.
# link the dataset (FLIR by default)
cd det/mmdetection_rgbt
ln -s /path/to/COCO_FLIR/FLIR_ir ./data/FLIR/coco
ln -s /path/to/COCO_FLIR/FLIR_rgb ./data/FLIR/coco2
b. installation
Please refer to the mmdetection get_started.md for installation. You can also refer to mmdet_env_refer.txt to check the package versions.
# after installation
cd det/mmdetection_rgbt
pip install -v -e .
c. evaluation (FLIR by default)
python tools/test.py projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py /path/to/FLIR_iter_90000.pth
d. train (FLIR by default)
Please download M-SpecGene_VIT-B_det_transform.pth and change the pretrained model path in projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py.
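For reference, mmengine-style configs usually wire pretrained weights in through the backbone's init_cfg; a hedged sketch of the edit (the exact field nesting in this project's config may differ):

```python
# projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py (excerpt, illustrative)
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='/path/to/M-SpecGene_VIT-B_det_transform.pth')))
```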
bash tools/dist_train.sh projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py 2
e. evaluation or training on the other datasets
1. change the dataset link in ./data and data_root (line 7) in projects/ViTDet/configs/lsj-100e_coco-instance_5w.py
2. change num_classes (FLIR->3, LLVIP->1, KAIST->1) in ./configs/_base_/models/mask-rcnn_r50_fpn.py (lines 54 and 73)
3. change the name of ann_file in ./projects/ViTDet/configs/lsj-100e_coco-instance_5w.py
4. train or evaluate as above (a sketch of these config edits follows this list)
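Taken together, switching datasets touches three config values. A hedged sketch assuming LLVIP (the data root and annotation filename below are placeholders, not necessarily the repo's actual names):

```python
# projects/ViTDet/configs/lsj-100e_coco-instance_5w.py (excerpt, illustrative)
data_root = 'data/LLVIP/'  # line 7: point at the newly linked dataset
train_dataloader = dict(
    dataset=dict(ann_file='annotations/train.json'))  # hypothetical filename

# configs/_base_/models/mask-rcnn_r50_fpn.py (excerpt, illustrative)
model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=1),   # line 54: LLVIP has a single class
        mask_head=dict(num_classes=1)))  # line 73
```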
For MR^-2 metric evaluation on KAIST, please use the KAISTdevkit-matlab-wrapper from MBNet.
# MR^-2 metric for KAIST
1. uncomment lines 388-406 in mmdet/evaluation/metrics/coco_metric.py
2. change the dataset link & root, num_classes, and ann_file, then run tools/test.py as above
3. the txt files will be saved at data/result
4. open the KAISTdevkit-matlab-wrapper and run demo_test.m
a. dataset preparation
Please download the SemanticRT, MVSEG, and FMB datasets to the appropriate paths.
# link the dataset (MVSEG by default)
cd seg/mmsegmentation-main-rgbt
ln -s /path/to/MVSEG_ALL/MVSEG ./data/ade/ADEChallengeData2016
ln -s /path/to/MVSEG_ALL/MVSEG_T ./data/ade/ADEChallengeData2016_T
b. installation
Please refer to the mmsegmentation-v1.2.2 get_started.md for installation. You can also refer to mmseg_env_refer.txt to check the package versions.
# after installation
cd seg/mmsegmentation-main-rgbt
pip install -v -e .
c. evaluation (MVSEG by default)
python tools/test.py configs/mae/mae-base_upernet_8xb2-amp-320k_ade20k-768x768.py /path/to/MVSEG_iter_240000.pth
d. train (MVSEG by default)
Please download M-SpecGene_VIT-B_seg_transform.pth and change the pretrained model path in configs/mae/mae-base_upernet_8xb2-amp-320k_ade20k-768x768.py (the init_cfg edit mirrors the detection sketch above).
bash tools/dist_train.sh configs/mae/mae-base_upernet_8xb2-amp-320k_ade20k-768x768.py 2
e. evaluation or training on the other datasets
1. change the dataset link in ./data/ade/
2. change mmseg/datasets/ade.py (refer to ade_FMB.py, ade_MVSEG.py, and ade_SRT.py; see the sketch after this list)
3. change num_classes (FMB->15, MVSEG->26, SRT->13) in configs/mae/mae-base_upernet_8xb2-amp-320k_ade20k-768x768.py
4. train or evaluate as above
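The per-dataset files in step 2 typically register a BaseSegDataset subclass whose METAINFO lists that dataset's classes and palette. A minimal sketch in mmsegmentation v1.x style (the label set below is a placeholder, not the actual SemanticRT/MVSEG/FMB classes):

```python
# mmseg/datasets/ade.py (illustrative sketch, mmsegmentation v1.x style)
from mmseg.registry import DATASETS
from mmseg.datasets.basesegdataset import BaseSegDataset


@DATASETS.register_module(force=True)
class ADE20KDataset(BaseSegDataset):  # keep the name the configs expect
    METAINFO = dict(
        classes=('background', 'person', 'car'),  # placeholder label set
        palette=[[0, 0, 0], [220, 20, 60], [0, 0, 142]])

    def __init__(self, img_suffix='.jpg', seg_map_suffix='.png', **kwargs):
        super().__init__(img_suffix=img_suffix,
                         seg_map_suffix=seg_map_suffix, **kwargs)
```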
a. dataset preparation
Please download the VT5000, VT1000, VT821, and VI-RGBT1500 datasets to the appropriate paths.
# link the dataset (VT5000 by default)
cd sod/mmsegmentation-main-rgbt
ln -s /path/to/VT5000_ALL/VT5000 ./data/ade/ADEChallengeData2016
ln -s /path/to/VT5000_ALL/VT5000_T ./data/ade/ADEChallengeData2016_T
b. installation
This part is the same as 2) RGBT Multispectral Semantic Segmentation.
c. evaluation (VT5000 by default)
1. python tools/test.py configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-768x768.py /path/to/VT5000_iter_54000.pth --out ./pred_mask/54000/VT5000
2. cd ../SOD_Evaluation_Metrics-main
3. cp -r ../mmsegmentation-main-rgbt/pred_mask ./
4. cp -r /path/to/VT5000_ALL/VT5000/annotations/validation ./gt/VT5000
5. python 01to0255.py # rescale the labels from 0/1 to 0/255, which SOD_Evaluation_Metrics-main requires (change the file paths in 01to0255.py first); see the sketch after this list
6. python main.py # ensure the directory structure in pred_mask/54000 matches that of gt/
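For reference, the conversion in step 5 amounts to rescaling binary masks in place. A minimal sketch, assuming single-channel PNG masks and OpenCV (the directory path is a placeholder for whatever you set inside 01to0255.py):

```python
# 01to0255 sketch (illustrative): rescale masks from {0,1} to {0,255}
from pathlib import Path

import cv2

mask_dir = Path("./gt/VT5000")  # placeholder path
for p in mask_dir.rglob("*.png"):
    m = cv2.imread(str(p), cv2.IMREAD_GRAYSCALE)
    cv2.imwrite(str(p), (m > 0).astype("uint8") * 255)  # 0/1 -> 0/255
```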
d. train (VT5000 by default)
Please download M-SpecGene_VIT-B_seg_transform.pth and change the pretrained model path in configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-768x768.py.
bash tools/dist_train.sh configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-768x768.py 2
e. evaluation on the other datasets
1. change the dataset link in ./data/ade/
2. evaluate as above
If you find this repository useful in your research, please consider giving a star ⭐ and a citation.
@inproceedings{zhou2025m,
  title={M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision},
  author={Zhou, Kailai and Yang, Fuqiang and Wang, Shixian and Wen, Bihan and Zi, Chongde and Chen, Linsen and Shen, Qiu and Cao, Xun},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}

