Compared with FLUX, RealGRPO generates images with fewer synthetic artifacts. Compared with DanceGRPO, it mitigates HPSv2-driven reward hacking, especially the tendency toward over-saturated outputs.
Unlike SRPO's fixed prompt templates, RealGRPO uses an LLM to adaptively extract positive and negative style cues for each input prompt. This preserves prompt intent across domains (e.g., artistic or anime styles) instead of collapsing toward a single photorealistic bias.
When training diffusion models with GRPO, directly maximizing a reward model (e.g., HPSv2) can cause reward hacking. The model may exploit shortcut artifacts (such as over-smoothing, over-exposure, and unnatural contrast) to increase reward scores without improving real visual quality.
Inspired by SRPO, we use contrastive positive/negative text guidance. Instead of relying on fixed, hand-crafted style prompts, RealGRPO introduces an LLM that analyzes each training prompt and dynamically generates matched `pos_style` and `neg_style` pairs.
This dynamic strategy preserves style consistency across prompt domains (e.g., photorealistic vs. anime) while discouraging reward-hacking artifacts. We integrate it into the DanceGRPO framework with the following reward:
This objective pulls generations toward desired styles and away from artifact-prone directions.
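The exact reward formulation lives in the training code; as a rough, illustrative sketch (the function name, inputs, and `style_weight` value below are hypothetical, not the repository's API), the combined objective can be thought of as a base preference score plus a contrastive style term:

```python
def realgrpo_reward(hps_score, pos_style_sim, neg_style_sim, style_weight=0.1):
    """Hypothetical sketch of a contrastive style reward.

    hps_score:      base human-preference score (e.g., from HPSv2)
    pos_style_sim:  image-text similarity to the LLM-generated positive styles
    neg_style_sim:  image-text similarity to the LLM-generated negative styles
    style_weight:   illustrative trade-off weight, not the repository's value
    """
    # Pull generations toward positive styles, push away from
    # artifact-prone negative styles.
    return hps_score + style_weight * (pos_style_sim - neg_style_sim)
```

Under this sketch, an image that matches the negative style cues (e.g., over-saturated when "Over-saturated" is a negative style) receives a lower reward even if its raw HPSv2 score is inflated.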
- Download FLUX.1-dev from Hugging Face to `./checkpoints/flux`.
- Download HPS-v2.1 (`HPS_v2.1_compressed.pt`) from Hugging Face to `./checkpoints/hps_ckpt`.
- Download CLIP ViT-H-14 (`open_clip_pytorch_model.bin`) from Hugging Face to `./checkpoints/hps_ckpt`.
- Download Qwen3-4B from Hugging Face to `./checkpoints/Qwen3-4B`.
```bash
git clone https://github.com/yangzhou24/RealGRPO.git
cd RealGRPO
conda create --name RealGRPO python=3.10
conda activate RealGRPO
bash env_setup.sh
```

Prepare the reward model:

```bash
mkdir third_party && cd third_party
git clone https://github.com/tgxs002/HPSv2.git
cd HPSv2 && pip install -e .
```

For the open-source image generation setting, we use prompts from HPDv2, provided in `./assets/prompts.txt`.
First, generate text embeddings (required):

```bash
# FLUX preprocessing with multiple GPUs
bash scripts/preprocess/preprocess_flux_rl_embeddings.sh
```

Then use Qwen3-4B to extract adaptive `pos_style` and `neg_style` prompts from the training text.
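The actual instruction given to Qwen3-4B is defined in `fastvideo/data_preprocess/preprocess_text_Qwen3.py`; the template below is only a hypothetical illustration of the kind of instruction an extraction step like this might use, together with the `|||`-separated output contract:

```python
def build_style_extraction_prompt(training_prompt: str) -> str:
    """Hypothetical instruction template for extracting style cues with an LLM.

    The real template lives in preprocess_text_Qwen3.py; this sketch only
    illustrates the expected '<pos styles>|||<neg styles>' output contract.
    """
    return (
        "Given the image-generation prompt below, list three positive style "
        "keywords matching its intended style, and three negative style "
        "keywords describing artifacts or mismatched styles to avoid.\n"
        "Answer exactly as: <pos1, pos2, pos3>|||<neg1, neg2, neg3>\n\n"
        f"Prompt: {training_prompt}"
    )
```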
The output is stored in `data/rl_embeddings/videos2caption_cfg.json`.
Each entry follows:

```
<prompt>|||<pos_style_1, pos_style_2, pos_style_3>|||<neg_style_1, neg_style_2, neg_style_3>
```

Examples:

```
A full body shot of many Asian girls in a river by Artgerm.|||Natural-lighting, Detailed, Real|||Anime, Flat, Painting
A cute anthropomorphic portrait of Stan Lee in a fantasy art style by various artists.|||Fantasy, Ethereal, Colorful|||Real, Photorealistic, 2D
```
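Entries in this format can be split back into their three fields with a small helper (a sketch; the repository's own loading code may parse them differently):

```python
def parse_style_entry(line: str):
    """Split a '<prompt>|||<pos styles>|||<neg styles>' line into its fields."""
    prompt, pos, neg = line.strip().split("|||")
    # The style fields are comma-separated keyword lists.
    return prompt, [s.strip() for s in pos.split(",")], [s.strip() for s in neg.split(",")]

entry = ("A full body shot of many Asian girls in a river by Artgerm."
         "|||Natural-lighting, Detailed, Real|||Anime, Flat, Painting")
prompt, pos, neg = parse_style_entry(entry)
# pos == ['Natural-lighting', 'Detailed', 'Real']
```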
```bash
torchrun --nproc_per_node=<NUM_GPUS> fastvideo/data_preprocess/preprocess_text_Qwen3.py \
    --input-json data/rl_embeddings/videos2caption.json \
    --output-dir data/rl_embeddings
```

Run GRPO fine-tuning to update the DiT backbone. Reference setup: 32× A800 GPUs for around 100 epochs.

```bash
bash scripts/finetune/finetune_flux_realgrpo.sh
```

To visualize results:

```bash
bash scripts/visualization/vis_flux.sh
```

Reward hacking remains a serious issue in image-generation reward models. A common failure mode is a preference for overexposed or grid-like artifacts that inflate reward scores without improving visual quality.
The example below comes from a DanceGRPO model trained with HPSv2. We then remove the grid-like artifact using Nano Banana and rescore both images with HPSv2. The artifact-removed image receives a lower score (0.385498 vs. 0.394043), indicating that HPSv2 can reward undesirable grid patterns. This highlights how critical reward-model design is for post-training in generative modeling.
| Original (grid artifact): 0.394043 | Artifact removed: 0.385498 |
|---|---|
| ![]() | ![]() |
This codebase builds on the open-source implementations of DanceGRPO and SRPO.
If you find this codebase useful for your research, please kindly cite:
```bibtex
@article{xue2025dancegrpo,
  title={DanceGRPO: Unleashing GRPO on Visual Generation},
  author={Xue, Zeyue and Wu, Jie and Gao, Yu and Kong, Fangyuan and Zhu, Lingting and Chen, Mengzhao and Liu, Zhiheng and Liu, Wei and Guo, Qiushan and Huang, Weilin and others},
  journal={arXiv preprint arXiv:2505.07818},
  year={2025}
}

@misc{shen2025directlyaligningdiffusiontrajectory,
  title={Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference},
  author={Xiangwei Shen and Zhimin Li and Zhantao Yang and Shiyi Zhang and Yingfang Zhang and Donghao Li and Chunyu Wang and Qinglin Lu and Yansong Tang},
  year={2025},
  eprint={2509.06942},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.06942}
}

@misc{RealGRPO,
  title={RealGRPO: A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment},
  author={Yang Zhou and Haoyu Guo},
  year={2026},
  note={https://github.com/yangzhou24/RealGRPO}
}
```