🤗 Laser-7B ｜ 🤗 Laser-7B-GTA1
- [September 3, 2025]: 🎉 Full codebase released. Laser now supports a self-evolving pipeline with models such as Qwen2.5-VL-7B or GTA1-7B.
- Code
- Data Generation
- Training
- Evaluation
- Model
- Laser (Qwen2.5-VL-7B)
- Laser (GTA1-7B)
- Training Dataset
Laser is a self-evolving optimization framework that enables the model to bootstrap its active perception capabilities through rejection sampling–based SFT and region-wise preference learning, without relying on extensive human supervision.
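At inference time, this active perception amounts to iteratively cropping toward a focus region before predicting the click point. A minimal sketch of that loop, assuming hypothetical `predict_focus`/`predict_click` interfaces (not the actual Laser API):

```python
# Sketch of multi-step active perception: the model repeatedly proposes a
# tighter focus region, then predicts a click inside the final crop.
# `predict_focus` and `predict_click` are hypothetical interfaces, not the
# actual Laser API; regions are (x0, y0, x1, y1) boxes in pixels.

def active_perception(model, image_size, instruction, max_steps=3):
    width, height = image_size
    region = (0, 0, width, height)  # start from the full screenshot
    for _ in range(max_steps):
        # The model proposes a sub-region relative to the current crop,
        # or signals that the crop is already focused enough.
        sub, done = model.predict_focus(region, instruction)
        if done:
            break
        x0, y0, x1, y1 = sub
        region = (region[0] + x0, region[1] + y0,
                  region[0] + x1, region[1] + y1)
    # Click coordinates are predicted relative to the final crop, then
    # mapped back to absolute screen coordinates.
    cx, cy = model.predict_click(region, instruction)
    return region[0] + cx, region[1] + cy
```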
As shown above, the evaluation covers six GUI domains and two task types (Text and Icon grounding). Our method, LASER, consistently outperforms previous models in both overall grounding accuracy and generalization ability across different domains, demonstrating the effectiveness and robustness of our self-evolving training strategy.

The framework of Laser is shown above. Given a user instruction and the original image, the trained M_LASER model progressively focuses on key regions through a multi-step reasoning process. At each step, the Visual CoT captures critical cues (highlighted in red within the tag) based on the current focus region. Below, we also illustrate the multi-stage self-evolving optimization process that elicits LASER's multi-step active perception capabilities.
- Eliciting Active Perception through Visual Cropping. Given the paired training data, we prompt the VLM backbone M_raw to predict a focused region. The corresponding region is then cropped from the original image and integrated into the CoT as visual context, guiding the model toward accurate click-coordinate prediction. To improve the quality of reasoning trajectories, we adopt a STaR-style rejection sampling strategy to construct the dataset D_sft, which is used to finetune M_sft.
- Learning Focused Region Preferences. We sample multiple reasoning trajectories from M_sft and estimate region-wise preferences using Monte Carlo estimation. An IoU-based filter is applied to remove low-quality candidates. The resulting preference-pair dataset D_dpo is used to train a stronger model M_dpo via DPO.
- Difficulty-Aware Multi-step Perception. While M_dpo supports single-step perception, it is prone to failure in complex scenarios that demand deeper reasoning. To overcome this limitation, we allow M_dpo to iteratively generate multi-step reasoning trajectories, enabling the construction of a diverse and difficulty-aware training dataset. The final model is then trained on this multi-step dataset D″, equipping it with the ability to dynamically adjust reasoning depth based on the difficulty of the query.
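The IoU-based filtering used to clean candidate preference pairs can be sketched as follows; the threshold value and the pair layout are illustrative assumptions, not values from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x0, y0, x1, y1) pixel boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def filter_pairs(pairs, threshold=0.3):
    # Drop candidate preference pairs whose chosen focus region barely
    # overlaps the ground-truth box (threshold is an assumed value).
    return [p for p in pairs if iou(p["chosen_region"], p["gt_box"]) >= threshold]
```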
```shell
conda create --name llama_factory python==3.11
conda activate llama_factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
```

```shell
git clone https://github.com/wwfnb/Laser.git
conda create --name Laser python==3.11
conda activate Laser
cd Laser
pip install qwen-vl-utils
pip install 'vllm>0.7.2'
pip install -e .
```

The two environments are used separately: Laser is used for data generation and evaluation, while LLaMA-Factory is used for model training.
The project uses Laser for data generation. Before generating data, you need to download the raw dataset and preprocess it. Make sure you are in the Laser environment.
The data used for generation comes from GTA1: GUI Test-time Scaling Agent, available on Hugging Face as grounding_dataset. Please download the dataset and place it under data/opensource_data:
```shell
mkdir data/opensource_data
# download the grounding_dataset
huggingface-cli download --repo-type dataset --resume-download "HelloKKMe/grounding_dataset" --local-dir "data/opensource_data"
# unzip the images
cd data/opensource_data
unzip image.part.aa
```

We preprocess the dataset using the following script:

```shell
python src/laser/prodata_para.py
```

The processed dataset is stored in JSONL format, where each line corresponds to one sample.
Each sample contains:
```json
{
  "image_url": "image/dataset/Aria-UI_Data/web/images/screenshot_bb37986a-b810-44db-a28b-5cf5d5bd97cd_part_5.png",
  "instruction": "manage my information preferences.",
  "action_type": null,
  "coordinate": [854, 1034, 1062, 1068],
  "id": "47215f78-38f1-497a-8963-e3538ee32bd7",
  "source": "aria"
}
```
💡 Notes:
In the original grounding_dataset, bounding box coordinates were normalized to [0, 1000]. During our preprocessing, they are converted into absolute pixel values based on the corresponding image resolution.
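For reference, the conversion from [0, 1000]-normalized coordinates to absolute pixels looks like this; the exact rounding behavior of prodata_para.py may differ from this sketch:

```python
def denormalize_box(box, width, height):
    """Map a (x0, y0, x1, y1) box normalized to [0, 1000] onto an image of
    the given pixel size. Rounding is illustrative; the actual preprocessing
    script may truncate instead of rounding."""
    return tuple(round(v / 1000 * size)
                 for v, size in zip(box, (width, height, width, height)))

# e.g. a normalized box on a 1920x1080 screenshot
print(denormalize_box((445, 958, 553, 989), 1920, 1080))
```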
We generate our dataset in four stages, each contributing to a specific training purpose.
To make it easier to follow, we group the data generation steps by purpose and list the corresponding scripts.
- Stage 1: Single-step SFT Data Generation

```shell
python src/laser/generator/single_step_sft_generator.py
```

- Stage 2: Single-step DPO Data Generation

```shell
python src/laser/generator/single_step_dpo_generator.py
```

- Stage 3: Multi-step SFT Data Generation

```shell
python src/laser/generator/multi_step_sft_generator.py
```

- Stage 4: Multi-step DPO Data Generation

```shell
python src/laser/generator/multi_step_dpo_generator.py
```

After running the scripts, the processed SFT and DPO datasets will be saved under:

```
data/llamafactory_training_data
```

They follow the LLaMA-Factory training format, making them ready for immediate use in training.
You can either construct the datasets using the data generation process described above, or directly download our training data to start model training. You can download our training data from Hugging Face. The dataset is split into multiple parts, e.g.:
```
llamafactory_training_data.tar.gz.part_aa
llamafactory_training_data.tar.gz.part_ab
...
```

Use cat to merge them into a single archive:

```shell
cat llamafactory_training_data.tar.gz.part_* > llamafactory_training_data.tar.gz
```

Then extract it under the data/ directory:

```shell
mkdir -p data
tar -xzvf llamafactory_training_data.tar.gz -C data
```

We train our models in four stages, using the datasets prepared above. Each stage focuses on a specific training purpose. Make sure you are in the LLaMA-Factory training environment.
- Stage 1: Single-step SFT Training

```shell
bash scripts/train/train_single_sft.sh
```

- Stage 2: Single-step DPO Training

```shell
bash scripts/train/train_single_dpo.sh
```

- Stage 3: Multi-step SFT Training

```shell
bash scripts/train/train_multi_sft.sh
```

- Stage 4: Multi-step DPO Training

```shell
bash scripts/train/train_multi_dpo.sh
```

We evaluate Laser on two widely used GUI grounding benchmarks: ScreenSpot-Pro and ScreenSpot-V2. Put ScreenSpot-Pro and ScreenSpot-V2 under data/benchmark.
Make sure you are in the Laser environment and have downloaded the datasets. We provide scripts for easy evaluation:
```shell
bash scripts/eval/eval_sceenspot_pro.sh
```

```shell
## process the data, just once.
python scripts/transfer.py
## evaluate
bash scripts/eval/eval_screenspot_v2.sh
```

If you find this work helpful, please cite our paper:
```bibtex
@misc{wang2025learningactiveperceptionselfevolving,
      title={Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding},
      author={Wanfu Wang and Qipeng Huang and Guangquan Xue and Xiaobo Liang and Juntao Li},
      year={2025},
      eprint={2509.04243},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.04243},
}
```

This project is released under the MIT License.





