A Real-time, Intent-Driven Visual Agent featuring "Spotlight" Prompting and Self-Verification.
FMU-Agent is a real-time multimodal agent equipped with fine-grained spatial perception. Users can click any object in an image or enter text instructions to receive precise descriptions and object localization.
Addressing the limitations of existing MLLMs in small object localization and complex referencing, FMU-Agent introduces the innovative "Spotlight" Visual Prompting algorithm. Combined with the real-time segmentation of SAM2 and the high-throughput inference of vLLM, it achieves millisecond-level visual interaction.
Unlike traditional "Image Captioning," FMU-Agent possesses System 2 level self-reflection capabilities. Through the Inference-Time Cycle-Verification mechanism, it significantly reduces spatial hallucinations.
- 🖱️ Click-to-Chat: Real-time segmentation and multi-turn dialogue for any fine-grained object using SAM2.
- 🔍 Text-to-Box (Intent-Driven Grounding): Supports explicit commands (e.g., "Find: the red cup") to accurately locate and draw target BBoxes.
- 🎨 Spotlight Visual Prompting: A unique "Background Dimming + Fidelity Preservation + White Outline" algorithm that solves the texture-occlusion issues caused by traditional red masks.
- 🛡️ Cycle-Consistency Verification: Automatic reverse verification during inference (Text $\to$ Box $\to$ IoU) to intercept hallucinated outputs.
- ⚡ Decoupled Architecture: Three-layer decoupling of Front-end (Gradio), Vision (SAM2), and Reasoning (vLLM) to support multi-GPU pipeline deployment.
| Task 1: Fine-grained Image Captioning | Task 2: Precise Grounding |
|---|---|
| ![]() | ![]() |
- Introduction
- Demo
- Quick Start
- Methodology
- Architecture
- Data Preparation
- Fine-Tuning
- Evaluation Results
- Roadmap
- Citation
- License
- Linux (Tested on Ubuntu 20.04)
- Python 3.10+ (Tested with 3.11.14)
- NVIDIA GPU (Tested on RTX 3090, 24 GB)
- CUDA 12.1+
- PyTorch 2.0+ (Tested with 2.9.1)
- vLLM (for efficient LLM serving)
- SAM2 (for real-time segmentation)
- Clone the repository

```bash
git clone https://github.com/wolfvoid/FMU-Agent.git
cd FMU-Agent
```

- Install the SAM2 model

```bash
pip install git+https://github.com/facebookresearch/segment-anything-2.git
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Download checkpoints

SAM2: Download sam2_hiera_large.pt to checkpoints/.

```bash
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
```

MLLM-LoRA: Download our fine-tuned LoRA weights to checkpoints/mllm-lora/.

```bash
wget https://huggingface.co/wolfvoid/FMU-Agent-lora
```

- Merge LoRA weights

```bash
# Please ensure you modify the base model path, LoRA path, and merged path in tools/merge_lora.py
python tools/merge_lora.py
```

- Check the environment

```bash
python tools/check_env.py
```

Step 1: Start the vLLM backend (GPU 0)

```bash
bash mllm.sh
```

Step 2: Start the application (GPU 1 for SAM2)

```bash
CUDA_VISIBLE_DEVICES=1 python app.py
```

Access the Web UI at http://localhost:8081.
Traditional overlays (semi-transparent red masks) alter object pixel colors, causing models to misidentify, for example, red objects as pink. FMU-Agent instead employs Spotlight Mode (a minimal rendering sketch follows this list):
- Dimming: Background brightness is reduced by 50% for high contrast.
- Fidelity: The target area retains 100% of its original pixels, with zero color pollution.
- Indicator: A 1px ultra-thin white outline serves as a non-semantic boundary indicator.
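The renderer itself reduces to a few array operations. Below is a minimal sketch with NumPy and OpenCV; the function and variable names are illustrative, not the repository's actual API:

```python
import cv2
import numpy as np

def spotlight_render(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim the background, keep target pixels untouched, add a 1px white outline.

    image: HxWx3 uint8 image; mask: HxW boolean target mask.
    """
    out = (image * 0.5).astype(np.uint8)   # dim the whole frame by 50%
    out[mask] = image[mask]                # restore original target pixels (zero color pollution)
    # Draw a thin white contour as a non-semantic boundary indicator.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(out, contours, -1, (255, 255, 255), thickness=1)
    return out
```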
A "self-play" mechanism applied at inference time (see the verification sketch after this list):

1. Generate: The user clicks $\to$ the model generates a description $T$.
2. Reverse: $T$ is fed back into the model $\to$ it predicts coordinates $B_{pred}$.
3. Verify: Compute the Intersection over Union between $B_{pred}$ and the original mask $M$: $\text{IoU}(B_{pred}, M)$.
4. Decision: If $\text{IoU} < \text{threshold}$, mark the output as low confidence or trigger regeneration.
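Concretely, the verification step is a box-vs-mask IoU plus a threshold test. A self-contained sketch (the threshold value and helper names are illustrative):

```python
import numpy as np

def box_mask_iou(box, mask: np.ndarray) -> float:
    """IoU between a predicted box (x1, y1, x2, y2) and a binary mask."""
    x1, y1, x2, y2 = [int(v) for v in box]
    box_region = np.zeros_like(mask, dtype=bool)
    box_region[y1:y2, x1:x2] = True
    inter = np.logical_and(box_region, mask).sum()
    union = np.logical_or(box_region, mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def cycle_verify(description: str, mask: np.ndarray, predict_box, threshold: float = 0.5):
    """Reverse-localize the description (Text -> Box) and accept it only if IoU passes."""
    b_pred = predict_box(description)   # MLLM predicts coordinates from the text
    iou = box_mask_iou(b_pred, mask)
    return ("accept", iou) if iou >= threshold else ("low_confidence", iou)
```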
FMU-Agent utilizes a three-layer decoupled architecture to maximize throughput:
```mermaid
graph LR
    User["Web UI (Gradio)"] -- "Click/Text" --> Controller
    subgraph "GPU 1: Vision Worker"
        Controller -- Image --> SAM2["SAM2 Encoder"]
        SAM2 -- Mask --> Visual["Spotlight Renderer"]
    end
    subgraph "GPU 0: Reasoning Worker"
        Visual -- "Processed Img" --> vLLM["Qwen3-VL LoRA"]
        vLLM -- "Stream Text" --> Controller
    end
    Controller -- Response --> User
```
We use the COCO 2017 validation set for evaluation; other datasets can be used as well. Download the COCO 2017 Val images and annotations:

```bash
# Download the COCO 2017 Val set
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

# Alternatively, use the provided Python script
python tools/download_coco.py
```

We provide the Spotlight Data Engine used to train FMU-Agent. To reproduce our training data:

```bash
# Generate the Spotlight-augmented COCO dataset
python tools/prepare_finetune_data.py \
    --coco_dir ./coco/val2017 \
    --output_dir ./dataset/spotlight_images
```

Note: This script requires an OpenAI API key for GPT-4o distillation.
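For intuition, the distillation request boils down to sending each Spotlight-rendered image to GPT-4o and asking it to describe only the highlighted region. A minimal sketch using the OpenAI Python SDK; the prompt wording and function name are illustrative, not the actual contents of tools/prepare_finetune_data.py:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def distill_caption(spotlight_png_path: str) -> str:
    """Ask GPT-4o to describe only the spotlighted object in a rendered image."""
    with open(spotlight_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe only the highlighted (spotlighted) object."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```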
We use LLaMA-Factory for efficient, easy-to-use fine-tuning. Follow the steps below to fine-tune the model on your own dataset.

If you haven't installed LLaMA-Factory yet, set it up first:

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .[metrics]
```

LLaMA-Factory requires a specific data format (Alpaca or ShareGPT style) and a registry entry in dataset_info.json.
Step 1: Format your data. For vision-language tasks, we recommend the ShareGPT format. Save your data as data/my_custom_data.json:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "<image>Describe this image."
      },
      {
        "from": "assistant",
        "value": "This is a detailed description of the image..."
      }
    ],
    "images": [
      "/path/to/image_01.jpg"
    ]
  }
]
```

Step 2: Register your dataset. Edit LLaMA-Factory/data/dataset_info.json and append your dataset definition:

```json
"my_custom_dataset": {
  "file_name": "my_custom_data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "images": "images"
  }
}
```

Step 3: Start training. Use the following command to start LoRA fine-tuning. We provide a sample script scripts/finetune.sh for reference; a command sketch follows the notes below.
Note:

- Ensure --template matches your base model (e.g., qwen3_vl).
- Adjust --per_device_train_batch_size and --gradient_accumulation_steps based on your GPU memory.
- If you encounter OOM (out-of-memory) errors, try enabling DeepSpeed ZeRO-2 or ZeRO-3 by adding --deepspeed examples/deepspeed/ds_z2_config.json.
- Once training is complete, refer to the Merge LoRA weights section to merge the adapter into the base model.
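For concreteness, a LoRA fine-tuning invocation could look like the following. The paths, dataset name, and hyperparameters are placeholders; consult scripts/finetune.sh and the LLaMA-Factory documentation for the exact flags used in this project:

```bash
llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/Qwen3-VL-base \
    --dataset my_custom_dataset \
    --template qwen3_vl \
    --finetuning_type lora \
    --output_dir ./checkpoints/mllm-lora \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --bf16 true
```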
We evaluated FMU-Agent across four dimensions against the baseline Qwen3-VL-Instruct, using our Spotlight validation set.
We evaluated the model's performance in generating fine-grained object descriptions. We used four metrics: BLEU-4, ROUGE-L, BERT-Score, and CLIP-Score.
| Model | BLEU-4 | ROUGE-L | BERT-Score | CLIP-Score |
|---|---|---|---|---|
| Qwen3-VL (Base) | 0.0951 | 0.3005 | 0.8977 | 27.00 |
| FMU-Agent (Ours) | 0.3465 | 0.5584 | 0.9383 | 26.23 |
| Improvement | +264.3% 🔺 | +85.8% 🔺 | +4.5% 🔺 | -2.8% |
Analysis: FMU-Agent achieved significant improvements in BLEU-4 and ROUGE-L, indicating that the fine-tuned agent better follows fine-grained spatial description instructions. The slight decrease in CLIP-Score is expected, as the baseline model tends to generate general descriptions covering the entire image, while FMU-Agent strictly focuses on the target area highlighted by the Spotlight.
We evaluated the model's accuracy in target localization and understanding. We used mIoU (mean intersection over union) and Acc@0.5 (localization accuracy) as core metrics.
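For reference, both metrics follow the standard definitions:

$$\text{IoU}(B_{pred}, B_{gt}) = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|}, \qquad \text{Acc@0.5} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\text{IoU}_i \geq 0.5\big]$$

where mIoU averages $\text{IoU}_i$ over all $N$ validation samples.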
| Model | mIoU (Mean Intersection over Union) | Acc@0.5 (Localization Accuracy) |
|---|---|---|
| Qwen3-VL (Base) | 0.7402 | 0.7800 |
| FMU-Agent (Ours) | 0.8696 | 0.9200 |
| Improvement | +17.5% 🔺 | +17.9% 🔺 |
Analysis: Thanks to the Spotlight visual prompt and CoT coordinate training, FMU-Agent performs excellently in spatial localization tasks (Acc@0.5 reaches 92%), demonstrating its ability to accurately locate and understand user-specified targets.
We evaluated the "reference shift" phenomenon in description generation. We define a hallucination as a description that is grammatically correct but cannot be mapped back to the user-specified Spotlight area via reverse localization (Text-to-Box). We quantify and reduce this failure mode through the Inference-Time Cycle-Verification mechanism.
| Setting | Hallucination Rate | Correction Rate |
|---|---|---|
| Baseline | 18.7% | - |
| Ours (Cycle Verification) | 9.9% | 52.9% |
| Improvement | Reduced by 8.8 pp 📉 | Intercepted and corrected over half of the errors ✅ |
Analysis: With the System 2 level cycle-verification mechanism, the system automatically detects and corrects over half of the reference errors (a 52.9% correction rate), cutting the hallucination rate from 18.7% to 9.9% and significantly improving the reliability of responses.
We evaluated the system's response speed in real-time interaction. We compared the stateless serial architecture (Baseline) with the memory-resident decoupled architecture (Ours) in terms of end-to-end time to first token (TTFT).
| Pipeline Step | Stateless Mode / Baseline (ms) | Cached Mode / Ours (ms) |
|---|---|---|
| SAM2 Image Encode | ~348.1 | 0.0 (Cached) |
| SAM2 Mask Decode | ~10.0 | ~10.0 |
| Visual Process CPU | ~30.0 | ~30.0 |
| vLLM Inference + Network | (included in total) | (included in total) |
| Total E2E TTFT | 565.55 | 215.76 |
| Improvement | - | 2.62x speedup (61.85% latency reduction) 🚀 |
Analysis: Thanks to the decoupled architecture and feature-caching design, the most time-consuming step, SAM2 image encoding, is eliminated entirely during interaction, improving end-to-end response speed by 2.6x and delivering a truly millisecond-level real-time experience (a caching sketch follows).
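The caching design amounts to running the expensive image encoder once per uploaded image and reusing its embedding for every subsequent click. A minimal sketch assuming SAM2's image-predictor interface; the cache wrapper and config/checkpoint names here are illustrative:

```python
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml",
                                          "checkpoints/sam2_hiera_large.pt"))
_current_image_id = None

def segment_click(image: np.ndarray, image_id: str, x: int, y: int) -> np.ndarray:
    """Encode the image only when it changes; each click costs one mask decode."""
    global _current_image_id
    if image_id != _current_image_id:
        predictor.set_image(image)          # ~350 ms: heavy encoder, run once per image
        _current_image_id = image_id
    masks, scores, _ = predictor.predict(   # ~10 ms: lightweight mask decoder per click
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
    )
    return masks[int(np.argmax(scores))]
```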
You can use the provided scripts to reproduce the above results on the Spotlight validation set or your own dataset.

Start the vLLM backend (start a service for the base or the custom model, respectively):

```bash
bash mllm.sh &
```

Step 1: Generate answers (produces answers_base.jsonl or answers_custom.jsonl):

```bash
python evaluation/benchmark_generate_answer.py \
    --model custom \
    --dataset /path/to/val_set.jsonl \
    --samples 100
```

Step 2: Calculate metrics (BLEU, ROUGE, CLIP-Score, mIoU, etc.):

```bash
python evaluation/benchmark_quality_metrics.py \
    --model custom \
    --clip_path /path/to/clip/model
```

Run the cycle-consistency verification test:

```bash
bash mllm.sh &
python evaluation/benchmark_hallucination.py \
    --dataset /path/to/val_set.jsonl \
    --samples 100
```

Measure end-to-end TTFT:

```bash
bash mllm.sh &
python -m evaluation.benchmark_ttft
```

- [x] Release Inference Code & Gradio Demo
- [x] Release "Spotlight" Data Generation Script
- [x] Release Pre-trained LoRA Weights
- [ ] Optimize SAM2 Preprocessing Pipeline to Further Reduce Latency
- [ ] Support Video Input and Temporal Consistency Modeling
If you find this project useful, please cite:
```bibtex
@misc{fmu-agent2026,
  title={FMU-Agent: Fine-Grained Multimodal Understanding Agent with MLLM & SAM2},
  author={wolf void},
  year={2026},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/wolfvoid/FMU-Agent}}
}
```

This project is licensed under the Apache 2.0 License. It is based on Qwen3-VL and SAM2.

