Nermin Samet · Gilles Puy · Renaud Marlet
Valeo.ai, Paris, France
Open-vocabulary perception in 3D is still far behind 2D vision, largely because transferring semantic knowledge from vision-language models to LiDAR data is not straightforward.
Instead of relying only on image-text alignment, IGLOSS uses text-to-image generation to create prototype images for target classes. These prototypes act as a visual semantic bridge between language and LiDAR. Combined with a 3D network distilled from a 2D vision foundation model, our method labels points by matching 3D point features with 2D prototype image features.
Some highlights:
• State-of-the-art open-vocabulary semantic segmentation on both nuScenes and SemanticKITTI
• State-of-the-art annotation-free closed-set segmentation on both datasets as well
• A simple framework built from three ingredients: image generation, a strong 2D VFM, and an aligned 3D VFM
• No need for direct image-text alignment, heavy prompt engineering, images at test time, or scan sequences at test time
• Opens new perspectives for research at the intersection of 3D scene understanding, foundation models, and generative methods.
If you find IGLOSS useful for your research, please cite our paper as follows.
N. Samet, G. Puy, R. Marlet, "IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation", arXiv, 2026.
BibTeX entry:
@inproceedings{igloss,
author = {Nermin Samet and Gilles Puy and Renaud Marlet},
title = {IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation},
booktitle = {arXiv},
year = {2026},
}
Please follow the ScaLR installation instructions to set up the required environment. Also, please install any other required packages as needed via pip during the setup process.
Ensure the ScaLR+ pretrained backbone weights are downloaded, unpacked, and placed under the checkpoints/ folder.
Set your project directory path:
export PROJECT_PATH="path/to/IGLOSS"1. Extract template features
Using DINOv2, extract the template features by running:
./PROJECT_PATH/open_voc_segmentation/image_feature_extraction.sh2. Segment LiDAR scenes
Once features are extracted, run open-vocabulary segmentation with IGLOSS:
./PROJECT_PATH/open_voc_segmentation/open_voc_segmentation_w_igloss.shBy default, the script runs on nuScenes. Settings for SemanticKITTI are also provided within the script.
In order to distill this model, run the following distillation script:
./PROJECT_PATH/scripts/WI_768_droppath_gelu-dinov2_vitl-448x896-mlp_2048-65_epoch-checkpointing.shNote: For more on ScaLR+ (finetuning, downstream evaluation etc.), please refer to the ScaLR repository.
