StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
The official implementation repository of the paper "StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models".
The table below shows the overall performance of StepORLM and baselines, measured by Pass@1 accuracy (%), on six OR benchmarks. StepORLM+GenPRM denotes using the GenPRM as a process verifier to enable inference-time scaling of StepORLM.
| Model | Params | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot LLMs | |||||||||
| GPT-4o | Closed | 61.2 | 70.3 | 57.7 | 73.6 | 42.9 | 38.1 | 48.4 | 56.0 |
| DeepSeek-V3 | 671B | 79.8 | 95.2 | 53.2 | 92.1 | 55.6 | 66.7 | 85.1 | 75.4 |
| Qwen3-32B | 32B | 77.5 | 92.3 | 46.9 | 93.8 | 50.0 | 61.9 | 85.1 | 72.5 |
| Qwen2.5-72B-Instruct | 72B | 78.9 | 95.8 | 44.1 | 88.2 | 50.0 | 57.1 | 81.1 | 70.7 |
| Fine-tuned LLMs | |||||||||
| ORLM | 8B | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| LLMOPT (origin) | 14B | 80.3* | 89.5* | 44.1* | 73.4* | 35.3* | 29.0* | 53.8* | 57.9* |
| LLMOPT (reproduce) | 14B | 49.3 | 36.3 | 25.2 | 43.3 | 16.7 | 40.5 | 39.5 | 35.8 |
| OptMATH (origin) | 32B | 95.9* | 89.9* | 54.1* | - | - | - | - | - |
| StepORLM | 8B | 96.7 | 97.6 | 77.5 | 97.2 | 50.0 | 52.4 | 81.9 | 79.0 |
| Agentic Methods | |||||||||
| OptiMUS-v0.3 | Closed | 76.2 | 78.0 | 46.8 | 88.8 | 46.8 | 45.2 | 87.6 | 67.1 |
| CoT | Closed | 62.2 | 49.5 | 42.3 | 74.7 | 39.2 | 40.5 | 43.6 | 50.3 |
| CoE | Closed | 66.7 | 94.4 | 50.6 | 87.4 | 57.1 | 31.2 | 71.2 | 65.5 |
| CAFA | Closed | 68.1 | 71.2 | 44.5 | 50.0 | 46.4 | 41.1 | 40.1 | 51.6 |
| StepORLM+GenPRM | 8B+8B | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
Notes:
- Scores cited from original publications are marked with (*); missing entries are denoted with (-).
- Best results are highlighted in bold and second-highest values are underlined.
- Abbreviations: CompOR = ComplexOR, IndOR = IndustryOR, Avg = Macro-Average
The table below compares the performance of StepORLM under different inference scaling strategies:
| Model | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|
| StepORLM as Policy Model | ||||||||
| StepORLM | 97.7 | 97.2 | 79.3 | 97.8 | 55.6 | 59.5 | 82.6 | 81.4 |
| + Major Vote | 97.2 | 97.6 | 81.1 | 96.6 | 61.1 | 61.9 | 89.3 | 83.5 |
| + Solver Exec | 97.7 | 98.4 | 81.1 | 96.1 | 61.1 | 66.7 | 90.3 | 84.5 |
| + Discriminative PRM | 97.2 | 97.2 | 81.1 | 97.2 | 55.6 | 59.5 | 87.8 | 82.2 |
| + GenPRM (initial) | 97.8 | 97.6 | 82.8 | 97.2 | 55.6 | 58.5 | 93.1 | 83.2 |
| + GenPRM (final) | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
| ORLM as Policy Model | ||||||||
| ORLM | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| + Major Vote | 78.7 | 88.4 | 50.5 | 78.7 | 44.4 | 47.6 | 73.0 | 65.9 |
| + Solver Exec | 82.2 | 88.6 | 63.1 | 79.8 | 44.4 | 52.4 | 78.9 | 69.9 |
| + Discriminative PRM | 75.1 | 91.7 | 63.1 | 82.0 | 50.0 | 54.8 | 74.7 | 70.2 |
| + GenPRM (initial) | 87.3 | 90.6 | 55.0 | 90.4 | 44.4 | 47.6 | 65.5 | 68.7 |
| + GenPRM (final) | 91.5 | 91.0 | 64.9 | 91.0 | 50.0 | 57.1 | 79.4 | 75.0 |
Notes:
- Best results are highlighted in bold and second-highest values are underlined.
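For reference, the selection rules behind the scaling strategies above can be sketched in a few lines. This is a minimal illustration, not the repository's API: `majority_vote` corresponds to the "+ Major Vote" rows (pick the most frequent final answer among N samples), and `best_of_n` corresponds to verifier-based reranking such as "+ GenPRM" (pick the candidate the verifier scores highest); the `score_fn` interface is a hypothetical stand-in for a GenPRM call.

```python
from collections import Counter

def majority_vote(answers):
    """'+ Major Vote': return the most frequent final answer
    among N sampled solutions (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    """Verifier-based reranking (e.g., '+ GenPRM'): return the
    candidate with the highest verifier score. `score_fn` stands in
    for an actual process-reward-model call."""
    return max(candidates, key=score_fn)
```

"+ Solver Exec" additionally filters candidates by whether the generated code executes successfully under the external solver before applying a selection rule.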
The co-evolutionary loop of StepORLM. At each iteration, the policy model πθ generates multiple trajectories. The feedback from both the external solver (outcome) and the GenPRM ρθ (process) is used to create training data that simultaneously refines the policy via W-DPO and improves the GenPRM via SFT, fostering reciprocal improvement.
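The loop described above can be sketched as follows. This is a simplified, hypothetical rendering: `policy_sample`, `solver_check`, and `genprm_judge` are stand-ins for the actual policy generation, solver execution, and GenPRM critique calls, and the weighting scheme shown is only an illustration of the idea that W-DPO pairs carry process-level weights.

```python
def co_evolution_round(policy_sample, solver_check, genprm_judge, problems, n=4):
    """One co-evolution iteration (hypothetical interfaces):
    sample n trajectories per problem, collect outcome feedback from the
    solver and process feedback from the GenPRM, and return (a) weighted
    preference pairs for W-DPO training of the policy and (b) verified
    trajectories as SFT data for the GenPRM."""
    dpo_pairs, prm_sft = [], []
    for p in problems:
        trajs = [policy_sample(p) for _ in range(n)]
        # Each trajectory gets an outcome label (solver) and a process score (GenPRM).
        scored = [(t, solver_check(p, t), genprm_judge(p, t)) for t in trajs]
        wins = [(t, s) for t, ok, s in scored if ok]
        losses = [(t, s) for t, ok, s in scored if not ok]
        # W-DPO data: pair each solver-verified trajectory with each failed one,
        # weighted here (illustratively) by the gap in process scores.
        for w, ws in wins:
            for l, ls in losses:
                dpo_pairs.append({"prompt": p, "chosen": w, "rejected": l,
                                  "weight": max(ws - ls, 0.0)})
        # GenPRM SFT data: solver-verified trajectories serve as supervision
        # for the generative process reward model.
        prm_sft.extend(t for t, ok, _ in scored if ok)
    return dpo_pairs, prm_sft
```

Each round thus produces training data for both models from the same batch of trajectories, which is what drives the reciprocal improvement between the policy and the GenPRM.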
Analysis of the self-evolving process, tracking Pass@1 accuracy at each training iteration. The relative improvement of each iteration over the previous one is shown on the corresponding bar.
We conduct an in-depth analysis of the self-evolving process by tracking performance at each training iteration, covering both the warm-up and co-evolving stages. Key observations include:
- Warm-up SFT provides foundational gains: The initial supervised fine-tuning (SFT) delivers a substantial foundational performance lift across all benchmarks (e.g., a 62.5% accuracy improvement on NLP4LP).
- Consistent iterative improvements: The subsequent self-evolution iterations consistently build upon the prior models, delivering incremental but crucial accuracy improvements. This confirms that the model progressively refines its capabilities through our self-evolving loop, rather than stagnating after initial fine-tuning.
- Non-monotonic progress on challenging datasets: The performance trend is notably non-monotonic on the IndustryOR dataset, where performance first declines before rising sharply at the third iteration. We attribute this to IndustryOR's small test set and challenging questions. Our case-level inspection reveals a progression: early iterations primarily rectify structural modeling errors flagged by the PRM, whereas later iterations concentrate on code-level issues (e.g., index-out-of-bounds errors), ultimately leading to accuracy improvements.