Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Paper Blog Dataset License Organization

🤗 Laser-7B | 🤗 Laser-7B-GTA1 | Laser-7B | Laser-7B-GTA1

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

📣 Latest News

  • [September 3, 2025]: 🚀 Full codebase released. Laser now supports a self-evolving pipeline with models such as Qwen2.5-VL-7B or GTA1-7B.

Release Plans

  • Code
    • Data Generation
    • Training
    • Evaluation
  • Model
    • Laser(qwen2.5_vl-7b)
    • Laser(GTA1-7b)
  • Training Dataset

💡 Overview

Laser is a self-evolving optimization framework that enables the model to bootstrap its active perception capabilities through rejection sampling–based SFT and region-wise preference learning, without relying on extensive human supervision.

📊 Overall Performance

As shown above, the evaluation covers six GUI domains and two task types (Text and Icon grounding). Our method, LASER, consistently outperforms previous models in both overall grounding accuracy and generalization ability across domains, demonstrating the effectiveness and robustness of our self-evolving training strategy.

✨ The Laser Framework

The framework of Laser is shown above. Given a user instruction and the original image, the trained LASER model progressively focuses on key regions through a multi-step reasoning process. At each step, the Visual CoT captures critical cues (highlighted in red within the tag) based on the current focus region. Below, we also illustrate the multi-stage self-evolving optimization process that elicits LASER's multi-step active perception capabilities.
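As a rough illustration of the cropping step, the predicted focus region can be expanded slightly before cropping so that some surrounding context survives. This is a minimal sketch; the function name, the 10% padding, and the clamping behavior are our own assumptions, not the repository's API:

```python
def padded_focus_box(bbox, img_w, img_h, pad=0.1):
    """Expand a predicted focus box (x1, y1, x2, y2) by `pad` of its
    width/height on each side, clamped to the image bounds, so the
    crop keeps some surrounding context."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    return (
        max(0, int(x1 - pad * w)),
        max(0, int(y1 - pad * h)),
        min(img_w, int(x2 + pad * w)),
        min(img_h, int(y2 + pad * h)),
    )

# with Pillow, the cropped patch would then be re-encoded and fed back
# to the model as extra visual context for the next reasoning step:
#   patch = image.crop(padded_focus_box(box, image.width, image.height))
```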

  • Eliciting Active Perception through Visual Cropping. Given the paired training data, we prompt the VLM backbone Mraw to predict a focused region. The corresponding region is then cropped from the original image and integrated into the CoT as visual context, guiding the model toward accurate click-coordinate prediction. To improve the quality of reasoning trajectories, we adopt a STaR-style rejection sampling strategy to construct the dataset Dsft, which is used to finetune Msft.
  • Learning Focused Region Preferences. We sample multiple reasoning trajectories from Msft and estimate region-wise preferences using Monte Carlo estimation. An IoU-based filter is applied to remove low-quality candidates. The resulting preference-pair dataset Ddpo is used to train a stronger model Mdpo via DPO.
  • Difficulty-Aware Multi-step Perception. While Mdpo supports single-step perception, it is prone to failure in complex scenarios that demand deeper reasoning. To overcome this limitation, we allow Mdpo to iteratively generate multi-step reasoning trajectories, enabling the construction of a diverse and difficulty-aware training dataset. The final model is then trained on this multi-step dataset D⟳, equipping it with the ability to dynamically adjust reasoning depth based on the difficulty of the query.
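The IoU-based filtering of candidate regions in the preference-learning stage can be sketched as follows. This is illustrative only: the 0.5 threshold and the all-pairs construction are our assumptions, not necessarily the paper's exact procedure:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_preference_pairs(candidate_boxes, gt_box, min_iou=0.5):
    """Split sampled focus regions into chosen (high overlap with the
    ground-truth box) and rejected (low overlap), then pair them up
    as (chosen, rejected) preference data for DPO."""
    chosen = [c for c in candidate_boxes if iou(c, gt_box) >= min_iou]
    rejected = [c for c in candidate_boxes if iou(c, gt_box) < min_iou]
    return [(good, bad) for good in chosen for bad in rejected]
```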

🔧 Installation

Install LLaMA-Factory

conda create --name llama_factory python=3.11
conda activate llama_factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

Install Laser

git clone https://github.com/wwfnb/Laser.git
conda create --name Laser python=3.11
conda activate Laser
cd Laser
pip install qwen-vl-utils
pip install 'vllm>0.7.2'
pip install -e .

The two environments are used separately. Laser is used for data generation and evaluation, while LLaMA-Factory is used for model training.

πŸ› οΈ Data Generation

Data generation runs in the Laser environment. Before generating data, download the raw dataset and preprocess it as described below.

1️⃣ Step 1: Preprocessing

📂 Download Dataset

The data used for generation comes from GTA1: GUI Test-time Scaling Agent, available on Hugging Face: grounding_dataset. Please download the dataset and place it under data/opensource_data:

mkdir data/opensource_data
# download the grounding_dataset
huggingface-cli download --repo-type dataset --resume-download "HelloKKMe/grounding_dataset" --local-dir "data/opensource_data"
# unzip the images
cd data/opensource_data
unzip image.part.aa

⚙️ Preprocess Dataset

We preprocess the dataset using the following script:

python src/laser/prodata_para.py

The processed dataset is stored in JSONL format, where each line corresponds to one sample.

Each sample contains:

{
  "image_url": "image/dataset/Aria-UI_Data/web/images/screenshot_bb37986a-b810-44db-a28b-5cf5d5bd97cd_part_5.png",
  "instruction": "manage my information preferences.",
  "action_type": null,
  "coordinate": [854, 1034, 1062, 1068],
  "id": "47215f78-38f1-497a-8963-e3538ee32bd7",
  "source": "aria"
}

💡 Notes:

In the original grounding_dataset, bounding box coordinates were normalized to [0, 1000]. During our preprocessing, they are converted into absolute pixel values based on the corresponding image resolution.
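The conversion itself amounts to rescaling each coordinate by the image size. A minimal sketch, where the function name and the 2000×2000 example resolution are illustrative rather than taken from the preprocessing script:

```python
def denormalize_bbox(bbox, img_w, img_h, scale=1000):
    """Map a bbox normalized to [0, scale] onto absolute pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 / scale * img_w),
        round(y1 / scale * img_h),
        round(x2 / scale * img_w),
        round(y2 / scale * img_h),
    ]

# e.g. on a hypothetical 2000x2000 screenshot, a normalized box
# becomes absolute pixel values:
print(denormalize_bbox([427, 517, 531, 534], 2000, 2000))  # [854, 1034, 1062, 1068]
```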

2️⃣ Step 2: Generation

We generate our dataset in four stages, each contributing to a specific training purpose.
To make it easier to follow, we group the data generation steps by purpose and list the corresponding scripts.

🔍 Eliciting Active Perception through Visual Cropping

  • Stage 1: Single-step SFT Data Generation
python src/laser/generator/single_step_sft_generator.py

🎯 Learning Focused Region Preferences

  • Stage 2: Single-step DPO Data Generation
python src/laser/generator/single_step_dpo_generator.py

🧩 Difficulty-Aware Multi-step Perception

  • Stage 3: Multi-step SFT Data Generation
python src/laser/generator/multi_step_sft_generator.py
  • Stage 4: Multi-step DPO Data Generation
python src/laser/generator/multi_step_dpo_generator.py

After running the scripts, the processed SFT and DPO datasets will be saved under:

data/llamafactory_training_data

They will follow the LLaMA-Factory training format, making them ready for immediate use in training.

πŸ‹οΈβ€β™‚οΈ Training

📂 Dataset Preparation

You can either construct the datasets using the data generation process described above, or download our prebuilt training data from Hugging Face to start training directly. The dataset is split into multiple parts, e.g.:

llamafactory_training_data.tar.gz.part_aa
llamafactory_training_data.tar.gz.part_ab
...

Use cat to merge them into a single archive:

cat llamafactory_training_data.tar.gz.part_* > llamafactory_training_data.tar.gz

Then extract it under the data/ directory:

mkdir -p data
tar -xzvf llamafactory_training_data.tar.gz -C data

🚀 Start Training

We train our models in four stages, using the datasets prepared above. Each stage focuses on a specific training purpose. Make sure you are in the LLaMA-Factory training environment.

🔍 Eliciting Active Perception through Visual Cropping

  • Stage 1: Single-step SFT Training
bash scripts/train/train_single_sft.sh

🎯 Learning Focused Region Preferences

  • Stage 2: Single-step DPO Training
bash scripts/train/train_single_dpo.sh

🧩 Difficulty-Aware Multi-step Perception

  • Stage 3: Multi-step SFT Training
bash scripts/train/train_multi_sft.sh
  • Stage 4: Multi-step DPO Training
bash scripts/train/train_multi_dpo.sh

📊 Evaluation

We evaluate Laser on two widely used GUI grounding benchmarks: ScreenSpot-Pro and ScreenSpot-V2. Place both benchmarks under data/benchmark. Make sure you are in the Laser environment and have downloaded the datasets. We provide scripts for easy evaluation:

Evaluate on ScreenSpot-Pro

bash scripts/eval/eval_sceenspot_pro.sh

Evaluate on ScreenSpot-V2

# process the data (only needs to run once)
python scripts/transfer.py
# evaluate
bash scripts/eval/eval_screenspot_v2.sh

📄 Citation

If you find this work helpful, please cite our paper:

@misc{wang2025learningactiveperceptionselfevolving,
      title={Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding}, 
      author={Wanfu Wang and Qipeng Huang and Guangquan Xue and Xiaobo Liang and Juntao Li},
      year={2025},
      eprint={2509.04243},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.04243}, 
}

📄 License

This project is released under the MIT License.
