wd1: Weighted Policy Optimization for Diffusion Language Models Reasoning

We introduce wd1, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood.

Environment Setup

To set up the environment, run:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

SFT

# First go to the SFT directory
cd SFT

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file ddp_config.yaml --main_process_port 29500 --num_processes 4 sft_train.py

wd1

Set the data directory used by the bash scripts to match your local path. You can either edit each script or export BASE_DATA before the run:

export BASE_DATA=/home/diffusion-rl/data

If BASE_DATA is not exported, the code will use the default data directory.
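
A fallback of this kind is commonly written with shell parameter expansion. The snippet below is a minimal sketch rather than the repository's actual code: the function name resolve_base_data and the default path ./data are assumptions made for illustration, and the real default in the run scripts may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the data directory a script would use.
# "./data" is an assumed default, not necessarily the one in the run scripts.
resolve_base_data() {
    # ${VAR:-default} expands to $VAR when it is set and non-empty,
    # otherwise to the given default.
    printf '%s\n' "${BASE_DATA:-./data}"
}

resolve_base_data
```

With this pattern, an exported BASE_DATA takes precedence, and leaving it unset selects the default without editing any script.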

RL only

To run RL directly, without SFT:

# Pattern
bash run/wll_NP_{datasetname}.sh
# Example
bash run/wll_NP_countdown.sh

RL on top of SFT

To run RL on top of an SFT checkpoint:

# Pattern
bash run/wll_SFT_NP_{datasetname}.sh
# Example
bash run/wll_SFT_NP_countdown.sh
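
The two naming patterns above can be captured in a small wrapper that maps a dataset name and a mode to the matching script path. This is only a sketch: the function name wd1_script and the mode labels rl and sft are hypothetical, and it assumes a script exists for the dataset you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a dataset name and a mode to a run script path,
# following the two naming patterns shown above.
wd1_script() {
    local dataset="$1"
    local mode="${2:-rl}"   # assumed labels: "rl" (RL only) or "sft" (RL on top of SFT)
    case "$mode" in
        rl)  printf 'run/wll_NP_%s.sh\n' "$dataset" ;;
        sft) printf 'run/wll_SFT_NP_%s.sh\n' "$dataset" ;;
        *)   echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

wd1_script countdown rl    # prints run/wll_NP_countdown.sh
wd1_script countdown sft   # prints run/wll_SFT_NP_countdown.sh
```

A run could then be launched with, for example, bash "$(wd1_script countdown sft)".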

Evaluation

The evaluation code is inside the eval directory.

Run it with bash eval/run_eval_all.sh.
Make sure the script points to the correct checkpoint.
The evaluation script only saves the generations; use the parser to calculate accuracy.
For example, baseline generations are in the eval_baselines directory. Use python parse_and_get_acc.py to print the accuracy.

Acknowledgement

The implementation is adapted from d1. We appreciate the clear repository!

xiaohangt/wd1