A Real-time, Intent-Driven Visual Agent featuring "Spotlight" Prompting and Self-Verification.
FMU-Agent is a real-time multimodal agent equipped with fine-grained spatial perception. Users can click any object in an image or enter text instructions to receive precise descriptions and object localization.
Addressing the limitations of existing MLLMs in small object localization and complex referencing, FMU-Agent introduces the innovative "Spotlight" Visual Prompting algorithm. Combined with the real-time segmentation of SAM2 and the high-throughput inference of vLLM, it achieves millisecond-level visual interaction.
Unlike traditional "Image Captioning," FMU-Agent possesses System 2 level self-reflection capabilities. Through the Inference-Time Cycle-Verification mechanism, it significantly reduces spatial hallucinations.
- 🖱️ Click-to-Chat: Real-time segmentation and multi-turn dialogue for any fine-grained object using SAM2.
- 🔍 Text-to-Box (Intent-Driven Grounding): Supports explicit commands (e.g., "Find: the red cup") to accurately locate and draw target BBoxes.
- 🎨 Spotlight Visual Prompting: A unique "Background Dimming + Fidelity Preservation + White Outline" algorithm that solves the texture-occlusion issues caused by traditional red masks.
- 🛡️ Cycle-Consistency Verification: Automatic reverse verification during inference (Text $\to$ Box $\to$ IoU) to intercept hallucinated outputs.
- ⚡ Decoupled Architecture: Three-layer decoupling of Front-end (Gradio), Vision (SAM2), and Reasoning (vLLM) to support multi-GPU pipeline deployment.
| Task 1: Fine-grained Image Captioning | Task 2: Precise Grounding |
|---|---|
| ![]() | ![]() |
- Introduction
- Demo
- Quick Start
- Methodology
- Architecture
- Data Preparation
- Fine-Tuning
- Evaluation Results
- Roadmap
- Citation
- License
- Linux (Tested on Ubuntu 20.04)
- Python 3.10+ (Tested with 3.11.14)
- NVIDIA GPU (Tested on RTX 3090, 24 GB)
- CUDA 12.1+
- PyTorch 2.0+ (Tested with 2.9.1)
- vLLM (for efficient LLM serving)
- SAM2 (for real-time segmentation)
- Clone the repository

```bash
git clone https://github.com/wolfvoid/FMU-Agent.git
cd FMU-Agent
```

- Install the SAM2 model

```bash
pip install git+https://github.com/facebookresearch/segment-anything-2.git
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Download checkpoints

SAM2: Download sam2_hiera_large.pt to checkpoints/.

```bash
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
```

MLLM-LoRA: Download our fine-tuned LoRA weights to checkpoints/mllm-lora/.

```bash
wget https://huggingface.co/wolfvoid/FMU-Agent-lora
```

- Merge LoRA weights

```bash
# Please ensure you modify the base model path, LoRA path, and merged path in tools/merge_lora.py
python tools/merge_lora.py
```

- Check the environment

```bash
python tools/check_env.py
```

Step 1: Start the vLLM backend (GPU 0)

```bash
bash mllm.sh
```

Step 2: Start the application (GPU 1 for SAM2)

```bash
CUDA_VISIBLE_DEVICES=1 python app.py
```

Access the Web UI at http://localhost:8081.
Traditional overlays (semi-transparent red masks) alter object pixel colors, causing models to misidentify, for example, red objects as pink. FMU-Agent instead employs Spotlight Mode (a minimal rendering sketch follows this list):
- Dimming: Background brightness is reduced by 50% for high contrast.
- Fidelity: The target area retains 100% of its original pixels, with zero color pollution.
- Indicator: A 1px ultra-thin white outline serves as a non-semantic boundary indicator.
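The renderer itself reduces to a few array operations. Below is a minimal sketch with NumPy and OpenCV; the function and variable names are illustrative, not the repository's actual API:

```python
import cv2
import numpy as np

def spotlight_render(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim the background, keep target pixels untouched, add a 1px white outline.

    image: HxWx3 uint8 image; mask: HxW boolean target mask.
    """
    out = (image * 0.5).astype(np.uint8)   # dim the whole frame by 50%
    out[mask] = image[mask]                # restore original target pixels (zero color pollution)
    # Draw a thin white contour as a non-semantic boundary indicator.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(out, contours, -1, (255, 255, 255), thickness=1)
    return out
```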
A "self-play" mechanism applied at inference time (see the verification sketch after this list):

1. Generate: The user clicks $\to$ the model generates a description $T$.
2. Reverse: $T$ is fed back into the model $\to$ it predicts coordinates $B_{pred}$.
3. Verify: Compute the Intersection over Union between $B_{pred}$ and the original mask $M$: $\text{IoU}(B_{pred}, M)$.
4. Decision: If $\text{IoU} < \text{threshold}$, mark the output as low confidence or trigger regeneration.
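Concretely, the verification step is a box-vs-mask IoU plus a threshold test. A self-contained sketch (the threshold value and helper names are illustrative):

```python
import numpy as np

def box_mask_iou(box, mask: np.ndarray) -> float:
    """IoU between a predicted box (x1, y1, x2, y2) and a binary mask."""
    x1, y1, x2, y2 = [int(v) for v in box]
    box_region = np.zeros_like(mask, dtype=bool)
    box_region[y1:y2, x1:x2] = True
    inter = np.logical_and(box_region, mask).sum()
    union = np.logical_or(box_region, mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def cycle_verify(description: str, mask: np.ndarray, predict_box, threshold: float = 0.5):
    """Reverse-localize the description (Text -> Box) and accept it only if IoU passes."""
    b_pred = predict_box(description)   # MLLM predicts coordinates from the text
    iou = box_mask_iou(b_pred, mask)
    return ("accept", iou) if iou >= threshold else ("low_confidence", iou)
```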
FMU-Agent utilizes a three-layer decoupled architecture to maximize throughput:
```mermaid
graph LR
    User["Web UI (Gradio)"] -- "Click/Text" --> Controller
    subgraph "GPU 1: Vision Worker"
        Controller -- Image --> SAM2["SAM2 Encoder"]
        SAM2 -- Mask --> Visual["Spotlight Renderer"]
    end
    subgraph "GPU 0: Reasoning Worker"
        Visual -- "Processed Img" --> vLLM["Qwen3-VL LoRA"]
        vLLM -- "Stream Text" --> Controller
    end
    Controller -- Response --> User
```
We use the COCO 2017 validation set for evaluation; other datasets can be used as well. Download the COCO 2017 Val images and annotations:

```bash
# Download the COCO 2017 Val set
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

# Alternatively, use the provided Python script
python tools/download_coco.py
```

We provide the Spotlight Data Engine used to train FMU-Agent. To reproduce our training data:

```bash
# Generate the Spotlight-augmented COCO dataset
python tools/prepare_finetune_data.py \
    --coco_dir ./coco/val2017 \
    --output_dir ./dataset/spotlight_images
```

Note: This script requires an OpenAI API key for GPT-4o distillation.
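For intuition, the distillation request boils down to sending each Spotlight-rendered image to GPT-4o and asking it to describe only the highlighted region. A minimal sketch using the OpenAI Python SDK; the prompt wording and function name are illustrative, not the actual contents of tools/prepare_finetune_data.py:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def distill_caption(spotlight_png_path: str) -> str:
    """Ask GPT-4o to describe only the spotlighted object in a rendered image."""
    with open(spotlight_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe only the highlighted (spotlighted) object."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```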
We use LLaMA-Factory for efficient, easy-to-use fine-tuning. Follow the steps below to fine-tune the model on your own dataset.

If you haven't installed LLaMA-Factory yet, set it up first:

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .[metrics]
```

LLaMA-Factory requires a specific data format (Alpaca or ShareGPT style) and a registry entry in dataset_info.json.
Step 1: Format your data. For vision-language tasks, we recommend the ShareGPT format. Save your data as data/my_custom_data.json:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "<image>Describe this image."
      },
      {
        "from": "assistant",
        "value": "This is a detailed description of the image..."
      }
    ],
    "images": [
      "/path/to/image_01.jpg"
    ]
  }
]
```

Step 2: Register your dataset. Edit LLaMA-Factory/data/dataset_info.json and append your dataset definition:

```json
"my_custom_dataset": {
  "file_name": "my_custom_data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "images": "images"
  }
}
```

Step 3: Start training. Use the following command to start LoRA fine-tuning. We provide a sample script scripts/finetune.sh for reference; a command sketch follows the notes below.
Note:

- Ensure --template matches your base model (e.g., qwen3_vl).
- Adjust --per_device_train_batch_size and --gradient_accumulation_steps based on your GPU memory.
- If you encounter OOM (out-of-memory) errors, try enabling DeepSpeed ZeRO-2 or ZeRO-3 by adding --deepspeed examples/deepspeed/ds_z2_config.json.
- Once training is complete, refer to the Merge LoRA weights section to merge the adapter into the base model.
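For concreteness, a LoRA fine-tuning invocation could look like the following. The paths, dataset name, and hyperparameters are placeholders; consult scripts/finetune.sh and the LLaMA-Factory documentation for the exact flags used in this project:

```bash
llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/Qwen3-VL-base \
    --dataset my_custom_dataset \
    --template qwen3_vl \
    --finetuning_type lora \
    --output_dir ./checkpoints/mllm-lora \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --bf16 true
```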
We evaluated FMU-Agent across four dimensions against the baseline Qwen3-VL-Instruct, using our Spotlight validation set.
We evaluated the model's performance in generating fine-grained object descriptions. We used four metrics: BLEU-4, ROUGE-L, BERT-Score, and CLIP-Score.
| Model | BLEU-4 | ROUGE-L | BERT-Score | CLIP-Score |
|---|---|---|---|---|
| Qwen3-VL (Base) | 0.0951 | 0.3005 | 0.8977 | 27.00 |
| FMU-Agent (Ours) | 0.3465 | 0.5584 | 0.9383 | 26.23 |
| Improvement | +264.3% 🔺 | +85.8% 🔺 | +4.5% 🔺 | -2.8% |
Analysis: FMU-Agent achieved significant improvements in BLEU-4 and ROUGE-L, indicating that the fine-tuned agent better follows fine-grained spatial description instructions. The slight decrease in CLIP-Score is expected, as the baseline model tends to generate general descriptions covering the entire image, while FMU-Agent strictly focuses on the target area highlighted by the Spotlight.
We evaluated the model's accuracy in target localization and understanding. We used mIoU (mean intersection over union) and Acc@0.5 (localization accuracy) as core metrics.
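For reference, both metrics follow the standard definitions:

$$\text{IoU}(B_{pred}, B_{gt}) = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|}, \qquad \text{Acc@0.5} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\text{IoU}_i \geq 0.5\big]$$

where mIoU averages $\text{IoU}_i$ over all $N$ validation samples.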
| Model | mIoU (Mean Intersection over Union) | Acc@0.5 (Localization Accuracy) |
|---|---|---|
| Qwen3-VL (Base) | 0.7402 | 0.7800 |
| FMU-Agent (Ours) | 0.8696 | 0.9200 |
| Improvement | +17.5% 🔺 | +17.9% 🔺 |
Analysis: Thanks to the Spotlight visual prompt and CoT coordinate training, FMU-Agent performs excellently in spatial localization tasks (Acc@0.5 reaches 92%), demonstrating its ability to accurately locate and understand user-specified targets.
We evaluated the "reference shift" phenomenon in description generation. We define a hallucination as a description that is grammatically correct but cannot be mapped back to the user-specified Spotlight area via reverse localization (Text-to-Box). We quantify and reduce this failure mode through the Inference-Time Cycle-Verification mechanism.
| Setting | Hallucination Rate | Correction Rate |
|---|---|---|
| Baseline | 18.7% | - |
| Ours (Cycle Verification) | 9.9% | 52.9% |
| Improvement | Reduced by 8.8 pp 📉 | Intercepted and corrected over half of the errors ✅ |
Analysis: With the System 2 level cycle-verification mechanism, the system automatically detects and corrects over half of the reference errors (a 52.9% correction rate), cutting the hallucination rate from 18.7% to 9.9% and significantly improving the reliability of responses.
We evaluated the system's response speed in real-time interaction. We compared the stateless serial architecture (Baseline) with the memory-resident decoupled architecture (Ours) in terms of end-to-end time to first token (TTFT).
| Pipeline Step | Stateless Mode / Baseline (ms) | Cached Mode / Ours (ms) |
|---|---|---|
| SAM2 Image Encode | ~348.1 | 0.0 (Cached) |
| SAM2 Mask Decode | ~10.0 | ~10.0 |
| Visual Process CPU | ~30.0 | ~30.0 |
| vLLM Inference + Network | (included in total) | (included in total) |
| Total E2E TTFT | 565.55 | 215.76 |
| Improvement | - | 2.62x speedup (61.85% latency reduction) 🚀 |
Analysis: Thanks to the decoupled architecture and feature-caching design, the most time-consuming step, SAM2 image encoding, is eliminated entirely during interaction, improving end-to-end response speed by 2.6x and delivering a truly millisecond-level real-time experience (a caching sketch follows).
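The caching design amounts to running the expensive image encoder once per uploaded image and reusing its embedding for every subsequent click. A minimal sketch assuming SAM2's image-predictor interface; the cache wrapper and config/checkpoint names here are illustrative:

```python
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml",
                                          "checkpoints/sam2_hiera_large.pt"))
_current_image_id = None

def segment_click(image: np.ndarray, image_id: str, x: int, y: int) -> np.ndarray:
    """Encode the image only when it changes; each click costs one mask decode."""
    global _current_image_id
    if image_id != _current_image_id:
        predictor.set_image(image)          # ~350 ms: heavy encoder, run once per image
        _current_image_id = image_id
    masks, scores, _ = predictor.predict(   # ~10 ms: lightweight mask decoder per click
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
    )
    return masks[int(np.argmax(scores))]
```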
You can use the provided scripts to reproduce the above results on the Spotlight validation set or your own dataset.

Start the vLLM backend (start a service for the base or the custom model, respectively):

```bash
bash mllm.sh &
```

Step 1: Generate answers (produces answers_base.jsonl or answers_custom.jsonl):

```bash
python evaluation/benchmark_generate_answer.py \
    --model custom \
    --dataset /path/to/val_set.jsonl \
    --samples 100
```

Step 2: Calculate metrics (BLEU, ROUGE, CLIP-Score, mIoU, etc.):

```bash
python evaluation/benchmark_quality_metrics.py \
    --model custom \
    --clip_path /path/to/clip/model
```

Run the cycle-consistency verification test:

```bash
bash mllm.sh &
python evaluation/benchmark_hallucination.py \
    --dataset /path/to/val_set.jsonl \
    --samples 100
```

Measure end-to-end TTFT:

```bash
bash mllm.sh &
python -m evaluation.benchmark_ttft
```

- [x] Release Inference Code & Gradio Demo
- [x] Release "Spotlight" Data Generation Script
- [x] Release Pre-trained LoRA Weights
- [ ] Optimize SAM2 Preprocessing Pipeline to Further Reduce Latency
- [ ] Support Video Input and Temporal Consistency Modeling
If you find this project useful, please cite:
```bibtex
@misc{fmu-agent2026,
  title={FMU-Agent: Fine-Grained Multimodal Understanding Agent with MLLM & SAM2},
  author={wolf void},
  year={2026},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/wolfvoid/FMU-Agent}}
}
```

This project is licensed under the Apache 2.0 License. It is based on Qwen3-VL and SAM2.

