Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Paper Blog Dataset License Organization

🤗 Laser-7B | 🤗 Laser-7B-GTA1 | Laser-7B | Laser-7B-GTA1

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

📣 Latest News

  • [September 3, 2025]: 🚀 Full codebase released. Laser now supports a self-evolving pipeline with models such as Qwen2.5-VL-7B or GTA1-7B.

Release Plans

  • Code
    • Data Generation
    • Training
    • Evaluation
  • Model
    • Laser(qwen2.5_vl-7b)
    • Laser(GTA1-7b)
  • Training Dataset

💡 Overview

Laser is a self-evolving optimization framework that enables the model to bootstrap its active perception capabilities through rejection sampling–based SFT and region-wise preference learning, without relying on extensive human supervision.

📊 Overall Performance

As shown above, the evaluation covers six GUI domains and two task types (Text and Icon grounding). Our method, LASER, consistently outperforms previous models in both overall grounding accuracy and generalization ability across domains, demonstrating the effectiveness and robustness of our self-evolving training strategy.

✨ The Laser Framework

The framework of Laser is shown above. Given a user instruction and the original image, the trained LASER model progressively focuses on key regions through a multi-step reasoning process. At each step, the Visual CoT captures critical cues (highlighted in red within the tag) based on the current focus region. Below, we also illustrate the multi-stage self-evolving optimization process that elicits LASER's multi-step active perception capabilities.
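As a rough illustration of the cropping step, the predicted focus region can be expanded slightly before cropping so that some surrounding context survives. This is a minimal sketch; the function name, the 10% padding, and the clamping behavior are our own assumptions, not the repository's API:

```python
def padded_focus_box(bbox, img_w, img_h, pad=0.1):
    """Expand a predicted focus box (x1, y1, x2, y2) by `pad` of its
    width/height on each side, clamped to the image bounds, so the
    crop keeps some surrounding context."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    return (
        max(0, int(x1 - pad * w)),
        max(0, int(y1 - pad * h)),
        min(img_w, int(x2 + pad * w)),
        min(img_h, int(y2 + pad * h)),
    )

# with Pillow, the cropped patch would then be re-encoded and fed back
# to the model as extra visual context for the next reasoning step:
#   patch = image.crop(padded_focus_box(box, image.width, image.height))
```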

  • Eliciting Active Perception through Visual Cropping. Given the paired training data, we prompt the VLM backbone Mraw to predict a focused region. The corresponding region is then cropped from the original image and integrated into the CoT as visual context, guiding the model toward accurate click-coordinate prediction. To improve the quality of reasoning trajectories, we adopt a STaR-style rejection sampling strategy to construct the dataset Dsft, which is used to finetune Msft.
  • Learning Focused Region Preferences. We sample multiple reasoning trajectories from Msft and estimate region-wise preferences using Monte Carlo estimation. An IoU-based filter is applied to remove low-quality candidates. The resulting preference-pair dataset Ddpo is used to train a stronger model Mdpo via DPO.
  • Difficulty-Aware Multi-step Perception. While Mdpo supports single-step perception, it is prone to failure in complex scenarios that demand deeper reasoning. To overcome this limitation, we allow Mdpo to iteratively generate multi-step reasoning trajectories, enabling the construction of a diverse and difficulty-aware training dataset. The final model is then trained on this multi-step dataset D⟳, equipping it with the ability to dynamically adjust reasoning depth based on the difficulty of the query.
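The IoU-based filtering of candidate regions in the preference-learning stage can be sketched as follows. This is illustrative only: the 0.5 threshold and the all-pairs construction are our assumptions, not necessarily the paper's exact procedure:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_preference_pairs(candidate_boxes, gt_box, min_iou=0.5):
    """Split sampled focus regions into chosen (high overlap with the
    ground-truth box) and rejected (low overlap), then pair them up
    as (chosen, rejected) preference data for DPO."""
    chosen = [c for c in candidate_boxes if iou(c, gt_box) >= min_iou]
    rejected = [c for c in candidate_boxes if iou(c, gt_box) < min_iou]
    return [(good, bad) for good in chosen for bad in rejected]
```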

🔧 Installation

Install LLaMA-Factory

conda create --name llama_factory python=3.11
conda activate llama_factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

Install Laser

git clone https://github.com/wwfnb/Laser.git
conda create --name Laser python=3.11
conda activate Laser
cd Laser
pip install qwen-vl-utils
pip install 'vllm>0.7.2'
pip install -e .

The two environments are used separately. Laser is used for data generation and evaluation, while LLaMA-Factory is used for model training.

πŸ› οΈ Data Generation

Data generation runs in the Laser environment. Before generating data, download the raw dataset and preprocess it as described below.

1️⃣ Step 1: Preprocessing

📂 Download Dataset

The data used for generation comes from GTA1: GUI Test-time Scaling Agent, available on Hugging Face: grounding_dataset. Please download the dataset and place it under data/opensource_data:

mkdir data/opensource_data
# download the grounding_dataset
huggingface-cli download --repo-type dataset --resume-download "HelloKKMe/grounding_dataset" --local-dir "data/opensource_data"
# unzip the images
cd data/opensource_data
unzip image.part.aa

⚙️ Preprocess Dataset

We preprocess the dataset using the following script:

python src/laser/prodata_para.py

The processed dataset is stored in JSONL format, where each line corresponds to one sample.

Each sample contains:

{
  "image_url": "image/dataset/Aria-UI_Data/web/images/screenshot_bb37986a-b810-44db-a28b-5cf5d5bd97cd_part_5.png",
  "instruction": "manage my information preferences.",
  "action_type": null,
  "coordinate": [854, 1034, 1062, 1068],
  "id": "47215f78-38f1-497a-8963-e3538ee32bd7",
  "source": "aria"
}

💡 Notes:

In the original grounding_dataset, bounding box coordinates were normalized to [0, 1000]. During our preprocessing, they are converted into absolute pixel values based on the corresponding image resolution.
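The conversion itself amounts to rescaling each coordinate by the image size. A minimal sketch, where the function name and the 2000×2000 example resolution are illustrative rather than taken from the preprocessing script:

```python
def denormalize_bbox(bbox, img_w, img_h, scale=1000):
    """Map a bbox normalized to [0, scale] onto absolute pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 / scale * img_w),
        round(y1 / scale * img_h),
        round(x2 / scale * img_w),
        round(y2 / scale * img_h),
    ]

# e.g. on a hypothetical 2000x2000 screenshot, a normalized box
# becomes absolute pixel values:
print(denormalize_bbox([427, 517, 531, 534], 2000, 2000))  # [854, 1034, 1062, 1068]
```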

2️⃣ Step 2: Generation

We generate our dataset in four stages, each contributing to a specific training purpose.
To make it easier to follow, we group the data generation steps by purpose and list the corresponding scripts.

🔍 Eliciting Active Perception through Visual Cropping

  • Stage 1: Single-step SFT Data Generation
python src/laser/generator/single_step_sft_generator.py

🎯 Learning Focused Region Preferences

  • Stage 2: Single-step DPO Data Generation
python src/laser/generator/single_step_dpo_generator.py

🧩 Difficulty-Aware Multi-step Perception

  • Stage 3: Multi-step SFT Data Generation
python src/laser/generator/multi_step_sft_generator.py
  • Stage 4: Multi-step DPO Data Generation
python src/laser/generator/multi_step_dpo_generator.py

After running the scripts, the processed SFT and DPO datasets will be saved under:

data/llamafactory_training_data

They will follow the LLaMA-Factory training format, making them ready for immediate use in training.

πŸ‹οΈβ€β™‚οΈ Training

📂 Dataset Preparation

You can either construct the datasets using the data generation process described above, or download our prebuilt training data from Hugging Face to start training directly. The dataset is split into multiple parts, e.g.:

llamafactory_training_data.tar.gz.part_aa
llamafactory_training_data.tar.gz.part_ab
...

Use cat to merge them into a single archive:

cat llamafactory_training_data.tar.gz.part_* > llamafactory_training_data.tar.gz

Then extract it under the data/ directory:

mkdir -p data
tar -xzvf llamafactory_training_data.tar.gz -C data

🚀 Start Training

We train our models in four stages, using the datasets prepared above. Each stage focuses on a specific training purpose. Make sure you are in the LLaMA-Factory training environment.

🔍 Eliciting Active Perception through Visual Cropping

  • Stage 1: Single-step SFT Training
bash scripts/train/train_single_sft.sh

🎯 Learning Focused Region Preferences

  • Stage 2: Single-step DPO Training
bash scripts/train/train_single_dpo.sh

🧩 Difficulty-Aware Multi-step Perception

  • Stage 3: Multi-step SFT Training
bash scripts/train/train_multi_sft.sh
  • Stage 4: Multi-step DPO Training
bash scripts/train/train_multi_dpo.sh

📊 Evaluation

We evaluate Laser on two widely used GUI grounding benchmarks: ScreenSpot-Pro and ScreenSpot-V2. Place both benchmarks under data/benchmark. Make sure you are in the Laser environment and have downloaded the datasets. We provide scripts for easy evaluation:

Evaluate on ScreenSpot-Pro

bash scripts/eval/eval_sceenspot_pro.sh

Evaluate on ScreenSpot-V2

# process the data (only needs to run once)
python scripts/transfer.py
# evaluate
bash scripts/eval/eval_screenspot_v2.sh

📄 Citation

If you find this work helpful, please cite our paper:

@misc{wang2025learningactiveperceptionselfevolving,
      title={Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding}, 
      author={Wanfu Wang and Qipeng Huang and Guangquan Xue and Xiaobo Liang and Juntao Li},
      year={2025},
      eprint={2509.04243},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.04243}, 
}

📄 License

This project is released under the MIT License.
