RealGRPO: A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment

🔍 Quick Visual Comparison

Input Prompt	FLUX	DanceGRPO (HPSv2)	RealGRPO (HPSv2)
A close-up portrait of a young woman with long hair smiling at the camera.
The image depicts a young female in ornate battle armor wearing a metallic helmet with long dark hair and symmetrical facial features.
An eightyearold, smiling, redhaired girl with glasses, dressed in a pink pirate costume, holding a stuffed toy dog. Full body.
Poor mom and child in war times
Celtic goddess with red hair and green warrior outfit, fantasy character concept, pure, benevolent, strong, portrait, line art, realistic, hypermaximalist, intricate details, epic composition, golden
A gopro snapshot of an anthropomorphic cat dressed as a firefighter putting out a building fire

Compared with FLUX, RealGRPO generates images with fewer synthetic artifacts. Compared with DanceGRPO, it mitigates HPSv2-driven reward hacking, especially the tendency toward over-saturated outputs.

Input Prompt	FLUX	SRPO (HPSv2)	RealGRPO (HPSv2)
A digital painting of an Aztec empress in sharp focus, portrayed as a fantasy portrait in concept art style.
Portrait of a woman touching her hair, artstation, digital art.
Hayao Miyazaki style, ghibli style, Perspective composition, а Volkswagen T5, seaside, Italian riviera, blue sky, a few white clouds, breeze, mountains, cozy, travel, sunny, best quality, 4k niji
Real life sasuke uchiha anime with sad and crying face set in 1980s Japan in cyberpunk world, detailed faces, lens flares, city backgrounds, neon colors, dramatic lighting, incredible details, saturat.
Grunge painting of an empty road with a distant forest.

Unlike SRPO's fixed prompt templates, RealGRPO uses an LLM to adaptively extract positive and negative style cues for each input prompt. This preserves prompt intent across domains (e.g., artistic or anime styles) instead of collapsing toward a single photorealistic bias.

🌟 Method

When training diffusion models with GRPO, directly maximizing a reward model (e.g., HPSv2) can cause reward hacking. The model may exploit shortcut artifacts (such as over-smoothing, over-exposure, and unnatural contrast) to increase reward scores without improving real visual quality.

Inspired by SRPO, we use contrastive positive/negative text guidance. Instead of using fixed, hand-crafted style prompts, RealGRPO introduces an LLM that analyzes each training prompt and dynamically generates matched pos_style and neg_style pairs.

This dynamic strategy preserves style consistency across prompt domains (e.g., photorealistic vs. anime) while discouraging reward-hacking artifacts. We integrate it into the DanceGRPO framework with the following reward:

$$\text{Reward} = (1 + \lambda)\cdot\text{Sim}(\text{Image}, \text{Text}_{pos}) - \text{Sim}(\text{Image}, \text{Text}_{neg})$$

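Here $\text{Sim}$ denotes an image-text similarity score, and $\lambda$ sets how much extra weight the positive term receives. This objective pulls generations toward desired styles and away from artifact-prone directions.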
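To make the reward concrete, here is a minimal Python sketch. It assumes Sim is a cosine similarity between an image embedding and a text embedding, which is how CLIP-style scorers are typically used. The function names, the toy 3-dimensional embeddings, and the value `lam=0.5` are illustrative assumptions and are not taken from the RealGRPO code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_reward(image_emb, pos_text_emb, neg_text_emb, lam=0.5):
    """Reward = (1 + lam) * Sim(image, pos) - Sim(image, neg).

    `lam` is a hypothetical default; the README does not state its value.
    """
    sim_pos = cosine_similarity(image_emb, pos_text_emb)
    sim_neg = cosine_similarity(image_emb, neg_text_emb)
    return (1.0 + lam) * sim_pos - sim_neg


# Toy embeddings, made up for illustration only.
image = [1.0, 0.0, 0.0]
pos_text = [1.0, 0.0, 0.0]  # aligned with the image
neg_text = [0.0, 1.0, 0.0]  # orthogonal to the image

reward = contrastive_reward(image, pos_text, neg_text, lam=0.5)
```

An image that matches the positive style text and is unrelated to the negative one gets a high reward. Swapping the two text embeddings makes the reward negative, which is the push away from artifact-prone styles.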

Checkpoint Setup

Download FLUX.1-dev from Hugging Face to ./checkpoints/flux.
Download HPS-v2.1 (HPS_v2.1_compressed.pt) from Hugging Face to ./checkpoints/hps_ckpt.
Download CLIP ViT-H-14 (open_clip_pytorch_model.bin) from Hugging Face to ./checkpoints/hps_ckpt.
Download Qwen3-4B from Hugging Face to ./checkpoints/Qwen3-4B.
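Before running the later steps, it can help to confirm the files landed where the list above says. The following is a small sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it only checks that the listed paths exist, not that the files are complete.

```python
from pathlib import Path

# Relative paths taken from the checkpoint list above.
EXPECTED_CHECKPOINTS = [
    "flux",
    "hps_ckpt/HPS_v2.1_compressed.pt",
    "hps_ckpt/open_clip_pytorch_model.bin",
    "Qwen3-4B",
]


def missing_checkpoints(root="./checkpoints"):
    """Return the expected checkpoint paths that do not exist under `root`."""
    root_path = Path(root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root_path / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```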

🛠️ Installation

git clone https://github.com/yangzhou24/RealGRPO.git
cd RealGRPO
conda create --name RealGRPO python=3.10
conda activate RealGRPO
bash env_setup.sh

Prepare the reward model:

mkdir third_party && cd third_party
git clone https://github.com/tgxs002/HPSv2.git
cd HPSv2 && pip install -e .

📖 Quick Start

1. Data Preparation (Get text embeddings & LLM Labeling)

For the open-source image generation setting, we use prompts from HPDv2, provided in ./assets/prompts.txt.

First, generate text embeddings (required):

# FLUX preprocessing with multiple GPUs
bash scripts/preprocess/preprocess_flux_rl_embeddings.sh

Then use Qwen3-4B to extract adaptive pos_style and neg_style prompts from the training text.
The output is stored in data/rl_embeddings/videos2caption_cfg.json.

Each entry follows: <prompt>|||<pos_style_1, pos_style_2, pos_style_3>|||<neg_style_1, neg_style_2, neg_style_3>

Examples:

A full body shot of many Asian girls in a river by Artgerm.|||Natural-lighting, Detailed, Real|||Anime, Flat, Painting
A cute anthropomorphic portrait of Stan Lee in a fantasy art style by various artists.|||Fantasy, Ethereal, Colorful|||Real, Photorealistic, 2D

torchrun --nproc_per_node=<NUM_GPUS> fastvideo/data_preprocess/preprocess_text_Qwen3.py \
  --input-json data/rl_embeddings/videos2caption.json \
  --output-dir data/rl_embeddings
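The `<prompt>|||<pos styles>|||<neg styles>` entry format shown above can be read back with a few lines of Python. This is a sketch under assumptions: `StyleEntry` and `parse_entry` are hypothetical names, and the README does not say how these strings are arranged inside the JSON file, so the parser handles a single entry string only.

```python
from dataclasses import dataclass


@dataclass
class StyleEntry:
    prompt: str
    pos_styles: list
    neg_styles: list


def _split_styles(field):
    """Split a comma-separated style field into trimmed, non-empty tags."""
    return [tag.strip() for tag in field.split(",") if tag.strip()]


def parse_entry(line):
    """Split '<prompt>|||<pos styles>|||<neg styles>' into its three fields."""
    parts = line.split("|||")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|||'-separated fields, got {len(parts)}")
    prompt, pos, neg = (part.strip() for part in parts)
    return StyleEntry(prompt, _split_styles(pos), _split_styles(neg))


entry = parse_entry(
    "A full body shot of many Asian girls in a river by Artgerm."
    "|||Natural-lighting, Detailed, Real|||Anime, Flat, Painting"
)
print(entry.pos_styles)  # ['Natural-lighting', 'Detailed', 'Real']
print(entry.neg_styles)  # ['Anime', 'Flat', 'Painting']
```

Only the style fields are split on commas, so prompts that contain commas stay intact.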

2. GRPO Fine-Tuning

Run GRPO fine-tuning to update the DiT backbone.

Reference setup: 32x A800 GPUs for around 100 epochs.

bash scripts/finetune/finetune_flux_realgrpo.sh
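For orientation, GRPO samples several images per prompt and normalizes their rewards within that group, so each sample is judged relative to its siblings. The sketch below shows that normalization step in isolation. It is a generic illustration and not the repository's training code: the function name, the reward values, and the use of the population standard deviation are assumptions (implementations differ on population vs. sample standard deviation).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 images sampled from the same prompt.
rewards = [0.30, 0.35, 0.25, 0.30]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean get positive advantages and are reinforced; samples below it get negative ones. This is why a reward model that favors artifacts can steer the whole group toward them.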

3. Inference

bash scripts/visualization/vis_flux.sh

Discussion

Reward hacking remains a serious issue in image-generation reward models. A common failure mode is preference for overexposed or grid-like artifacts that inflate reward scores without improving visual quality.

The example below comes from a DanceGRPO model trained with HPSv2. We then remove the grid-like artifact using Nano Banana and rescore both images with HPSv2. The artifact-removed image receives a lower score (0.385498 vs. 0.394043), indicating that HPSv2 can reward undesirable grid patterns. This highlights how critical reward-model design is for post-training in generative modeling.

Original (with grid artifact): 0.394043	Artifact removed: 0.385498

Acknowledgements

This codebase builds on the open-source implementations of DanceGRPO and SRPO.

Citation

If you find this codebase useful for your research, please cite:

@article{xue2025dancegrpo,
  title={DanceGRPO: Unleashing GRPO on Visual Generation},
  author={Xue, Zeyue and Wu, Jie and Gao, Yu and Kong, Fangyuan and Zhu, Lingting and Chen, Mengzhao and Liu, Zhiheng and Liu, Wei and Guo, Qiushan and Huang, Weilin and others},
  journal={arXiv preprint arXiv:2505.07818},
  year={2025}
}

@misc{shen2025directlyaligningdiffusiontrajectory,
      title={Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference},
      author={Xiangwei Shen and Zhimin Li and Zhantao Yang and Shiyi Zhang and Yingfang Zhang and Donghao Li and Chunyu Wang and Qinglin Lu and Yansong Tang},
      year={2025},
      eprint={2509.06942},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.06942},
}

@misc{RealGRPO,
    Author = {Yang Zhou and Haoyu Guo},
    Year = {2026},
    Note = {https://github.com/yangzhou24/RealGRPO},
    Title = {RealGRPO: A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment}
}
