A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
A comprehensive multi-modal evaluation protocol using VLM-based scores and multi-scale representation metrics.
Overview • Dataset • Evaluation • Installation • Quick Start • Citation
OpenVTON-Bench is a large-scale, high-resolution benchmark designed for the systematic evaluation of controllable virtual try-on (VTON) models.
Unlike existing datasets and evaluation protocols that struggle with texture details and semantic consistency, OpenVTON-Bench provides:
- 🖼️ ~100K Image Pairs: Resolutions up to 1536×1536 to evaluate fine-grained texture generation.
- 🏷️ Fine-Grained Taxonomy: Semantically balanced across 20 garment categories.
- 📊 Multi-Level Automated Evaluation: Comprehensively covering:
  - Pixel fidelity
  - Garment consistency
  - Semantic realism
This benchmark enables fair, reproducible, and scalable comparison across modern diffusion-based and transformer-based try-on systems.
The dataset is constructed through a rigorous three-stage pipeline ensuring category diversity, visual quality, and semantic consistency:
- 🌐 Web-Scale Crawling: With strict resolution filtering to maintain commercial-grade quality.
- ✍️ Hybrid Annotation: Combining human verification with Vision-Language Model (VLM) dense captioning.
- ⚖️ Semantic-Aware Balancing: Utilizing DINOv3 hierarchical clustering for a uniform distribution over categories.
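As an illustration of the balancing stage, here is a minimal sketch of hierarchical clustering over image embeddings with a per-cluster cap. The embedding source, cluster count, and cap are placeholders, not the benchmark's exact settings:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def balance_by_clusters(embeddings: np.ndarray, per_cluster: int,
                        n_clusters: int, seed: int = 0) -> list:
    """Cluster embeddings hierarchically, then cap each cluster's size.

    Returns indices of retained samples, approximating a uniform
    distribution over visual sub-categories.
    """
    rng = np.random.default_rng(seed)
    # Ward linkage on the (n_samples, dim) embedding matrix.
    tree = linkage(embeddings, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        if len(members) > per_cluster:
            members = rng.choice(members, size=per_cluster, replace=False)
        keep.extend(members.tolist())
    return sorted(keep)

# Toy example: two well-separated blobs, capped at 5 samples each.
pts = np.vstack([np.random.randn(50, 8), np.random.randn(30, 8) + 10.0])
kept = balance_by_clusters(pts, per_cluster=5, n_clusters=2)
print(len(kept))  # 10
```

In practice the embeddings would come from a DINOv3 backbone; the toy data above only shows the capping behavior.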
The full dataset is open-source and publicly available on HuggingFace:
Important
| Property | Value | Description |
|---|---|---|
| Image Pairs | ~100,000 | High-quality garment and person pairs |
| Resolution | Up to 1536×1536 | Critical for fine-grained texture assessment |
| Categories | 20 | Fine-grained garment taxonomy |
| Annotation | Hybrid | VLM dense captioning + human verification |
OpenVTON-Bench introduces a hybrid evaluation paradigm featuring four complementary protocols. This multi-view design captures both perceptual and structural quality of generated try-on results:
| Type | Description | Key Metrics |
|---|---|---|
| 🧠 VLM-based | Semantic realism via Vision-Language Models (VLM-as-a-Judge) | Background, Identity, Texture, Shape, Overall |
| ✂️ Garment-based | Region-level evaluation via SAM3 | SSIM, PSNR, LPIPS, Cosine |
| 🖼️ All-based | Full-image feature comparison using DINOv3 | SSIM, LPIPS, Cosine |
| 💾 Pixel-based | Raw pixel structural comparison | MSE, PSNR, SSIM, LPIPS, FID |
| Metric | Goal | Description |
|---|---|---|
| SSIM | ↑ Higher | Structural Similarity Index |
| PSNR | ↑ Higher | Peak Signal-to-Noise Ratio |
| Cosine Sim | ↑ Higher | Feature-level cosine similarity |
| LPIPS | ↓ Lower | Learned Perceptual Image Patch Similarity |
| FID | ↓ Lower | Fréchet Inception Distance |
| MSE | ↓ Lower | Mean Squared Error |
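For reference, the pixel-level metrics reduce to short NumPy expressions. This is a minimal sketch, not the benchmark's own implementation (which lives under `benchmark/metrics/`):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error over float images in [0, 1]; lower is better."""
    return float(np.mean((a - b) ** 2))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher is better."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))

def cosine_sim(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between feature vectors (e.g. DINOv3 embeddings)."""
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(round(mse(a, b), 4))   # 0.01
print(round(psnr(a, b), 2))  # 20.0
```

SSIM, LPIPS, and FID need learned or windowed models and are typically taken from libraries such as `scikit-image`, `lpips`, or `torchmetrics`.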
> [!NOTE]
> VLM evaluation uses a 1–5 scoring scale across five semantic dimensions (Background, Identity, Texture, Shape, and Overall realism).
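A judge response can be reduced to per-dimension scores with a small parser. The response format assumed here (`Dimension: score` lines) is illustrative, not the benchmark's exact prompt contract:

```python
import re

DIMENSIONS = ("background", "identity", "texture", "shape", "overall")

def parse_vlm_scores(response: str) -> dict:
    """Extract 1-5 scores for each dimension from a judge response.

    Assumes lines like 'Texture: 4'; dimensions without a valid
    score are simply omitted from the result.
    """
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*[:=]\s*([1-5])", response, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores

reply = "Background: 5\nIdentity: 4\nTexture: 3\nShape: 4\nOverall: 4"
print(parse_vlm_scores(reply))
# {'background': 5, 'identity': 4, 'texture': 3, 'shape': 4, 'overall': 4}
```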
- Python 3.12+
- CUDA 12.4+
- Recommended: 4ร GPUs for large-scale evaluation
1. Clone the repository:

   ```bash
   git clone https://github.com/RenxingIntelligence/OpenVTON-Bench.git
   cd OpenVTON-Bench
   ```

2. Create and activate the environment, either with `conda` (recommended):

   ```bash
   conda env create -f env.yaml
   conda activate bench_new
   ```

   or with `pip`:

   ```bash
   pip install -r requirements.txt
   ```
Before running the benchmark, place the backbone weights for feature extraction and segmentation in the `models/` directory:

```
models/
├── dinov3-vith16plus/   # Feature extraction
└── sam3/                # Garment segmentation
```
Copy the template configuration file:

```bash
cp benchmark/config.yaml benchmark/config.local.yaml
```

Update the configuration (`benchmark/config.local.yaml`) with your generated-image directories and model paths:
```yaml
data:
  test_jsonl: "./data/test_samples.jsonl"
  generated_dirs:
    - name: "your_model"
      path: "./generated_images/your_model"
models:
  dinov3:
    path: "./models/dinov3-vith16plus"
```

Run the full benchmark suite:

```bash
bash run_benchmark.sh --config benchmark/config.local.yaml
```

Or run specific evaluation types individually:
```bash
bash run_benchmark.sh --eval-type pixel
bash run_benchmark.sh --eval-type garment
bash run_benchmark.sh --eval-type vlm
```

> [!WARNING]
> **Dataset Format Requirement:** generated images must use filenames identical to those of the source/target images specified in the JSONL:
>
> ```json
> {"source": "00001.jpg", "target": "00001.jpg"}
> ```
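A quick sanity check for this requirement can be written in a few lines. This is a sketch with placeholder paths, not a script shipped with the benchmark:

```python
import json
from pathlib import Path

def check_filenames(jsonl_path: str, generated_dir: str) -> list:
    """Return target filenames listed in the JSONL that have no
    matching file in the generated-image directory."""
    present = {p.name for p in Path(generated_dir).iterdir()}
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["target"] not in present:
                missing.append(rec["target"])
    return missing
```

Running it over each entry of `generated_dirs` before a full benchmark run avoids late failures on mismatched filenames.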
The benchmark automatically generates a rich suite of analytics, written to the `results/` directory:

```
results/
└── YYYYMMDD_HHMMSS/
    ├── summary.json        # Aggregate metric scores
    ├── per_model/          # Detailed model-specific data
    └── visualizations/     # Radar plots, comparison charts, per-sample diagnostics
```
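To skim aggregate scores without opening the JSON by hand, a few lines suffice. The `{model: {metric: value}}` layout assumed here is a guess at the schema, not documented behavior:

```python
import json

def load_summary(path: str) -> dict:
    """Load summary.json from a results run."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def format_summary(summary: dict) -> list:
    """One line per model: 'model: metric=value  ...'.

    Assumes a {model: {metric: value}} layout (an assumption, see above).
    """
    return [
        f"{model}: " + "  ".join(f"{k}={v:.4f}" for k, v in metrics.items())
        for model, metrics in summary.items()
    ]

demo = {"your_model": {"ssim": 0.8123, "lpips": 0.0912}}
print("\n".join(format_summary(demo)))
# your_model: ssim=0.8123  lpips=0.0912
```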
To measure the agreement between our automated multi-modal metrics and human subjective evaluation:
```bash
python benchmark/analyze_correlation.py \
    --result_dir results/... \
    --human_ratings data/human.json
```
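Metric-human agreement of this kind is commonly reported as a rank correlation. A minimal sketch with SciPy (the actual script may compute additional statistics):

```python
from scipy.stats import spearmanr

def metric_human_agreement(metric_scores, human_scores):
    """Spearman rank correlation between an automated metric and
    human ratings over the same set of samples."""
    rho, pvalue = spearmanr(metric_scores, human_scores)
    return rho, pvalue

# Toy example: a metric that ranks samples exactly as humans do.
rho, p = metric_human_agreement([0.2, 0.5, 0.7, 0.9], [1, 2, 4, 5])
print(round(rho, 3))  # 1.0
```

Spearman is preferred over Pearson here because judge scores are ordinal (1–5) while metric values are continuous, so only the ranking is comparable.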
```
benchmark/
├── metrics/                  # Implementation of all evaluation metrics
├── utils/                    # Helper scripts and visualizers
├── run_benchmark.py          # Main execution entrypoint
└── analyze_correlation.py    # Statistical correlation tools
```
If you find this benchmark useful in your research, please consider citing:
```bibtex
@misc{li2026openvton,
  title={OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation},
  author={Jin Li and Tao Chen and Shuai Jiang and Weijie Wang and Jingwen Luo and Chenhui Wu},
  year={2026},
  eprint={2601.22725},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.22725}
}
```

This project is licensed under the MIT License.

