# StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models

## 🤗 Model Links

- StepORLM
- StepORLM-GenPRM

The official implementation repository of the paper "StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models".

## 🔥 Key Performance Metrics

### Overall Performance Comparison

The table below shows the overall performance of StepORLM and baselines, measured as Pass@1 accuracy (%) on six OR benchmarks. StepORLM+GenPRM denotes using the GenPRM as a process verifier to enable inference-time scaling of StepORLM.

| Model | Params | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| **Zero-shot LLMs** | | | | | | | | | |
| GPT-4o | Closed | 61.2 | 70.3 | 57.7 | 73.6 | 42.9 | 38.1 | 48.4 | 56.0 |
| DeepSeek-V3 | 671B | 79.8 | 95.2 | 53.2 | 92.1 | 55.6 | **66.7** | 85.1 | 75.4 |
| Qwen3-32B | 32B | 77.5 | 92.3 | 46.9 | 93.8 | 50.0 | 61.9 | 85.1 | 72.5 |
| Qwen2.5-72B-Instruct | 72B | 78.9 | 95.8 | 44.1 | 88.2 | 50.0 | 57.1 | 81.1 | 70.7 |
| **Fine-tuned LLMs** | | | | | | | | | |
| ORLM | 8B | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| LLMOPT (origin) | 14B | 80.3* | 89.5* | 44.1* | 73.4* | 35.3* | 29.0* | 53.8* | 57.9* |
| LLMOPT (reproduce) | 14B | 49.3 | 36.3 | 25.2 | 43.3 | 16.7 | 40.5 | 39.5 | 35.8 |
| OptMATH (origin) | 32B | 95.9* | 89.9* | 54.1* | - | - | - | - | - |
| StepORLM | 8B | 96.7 | 97.6 | 77.5 | 97.2 | 50.0 | 52.4 | 81.9 | 79.0 |
| **Agentic Methods** | | | | | | | | | |
| OptiMUS-v0.3 | Closed | 76.2 | 78.0 | 46.8 | 88.8 | 46.8 | 45.2 | 87.6 | 67.1 |
| CoT | Closed | 62.2 | 49.5 | 42.3 | 74.7 | 39.2 | 40.5 | 43.6 | 50.3 |
| CoE | Closed | 66.7 | 94.4 | 50.6 | 87.4 | 57.1 | 31.2 | 71.2 | 65.5 |
| CAFA | Closed | 68.1 | 71.2 | 44.5 | 50.0 | 46.4 | 41.1 | 40.1 | 51.6 |
| StepORLM+GenPRM | 8B+8B | **97.2** | **97.8** | **87.4** | **98.9** | **61.1** | 61.9 | **94.6** | **85.6** |

Notes:

- Scores cited from original publications are marked with (*); missing entries are denoted with (-).
- Best results per benchmark are shown in **bold**.
- Abbreviations: CompOR = ComplexOR, IndOR = IndustryOR, Avg. = macro-average.
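Pass@1 here denotes the fraction of problems solved correctly with a single generated solution per problem. A minimal sketch of how such a metric can be computed; the `solve` and `check_solution` callables are hypothetical stand-ins for model inference and solver-based answer checking:

```python
def pass_at_1(problems, solve, check_solution):
    """Percentage of problems whose single generated solution is correct.

    `solve` maps a problem to one candidate solution (e.g., greedy decoding);
    `check_solution` returns True if the candidate matches the reference
    optimum (e.g., by executing the generated solver code).
    """
    correct = sum(1 for p in problems if check_solution(p, solve(p)))
    return 100.0 * correct / len(problems)


# Toy usage: "problems" are numbers; a correct solution doubles them.
result = pass_at_1(
    [1, 2, 3, 4],
    solve=lambda p: p * 2 if p != 3 else 0,   # fails on one problem
    check_solution=lambda p, s: s == p * 2,
)
print(result)  # 75.0
```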

### Performance Under Different Inference Scaling Strategies

The table below compares the performance of StepORLM under different inference scaling strategies:

| Model | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|
| **StepORLM as Policy Model** | | | | | | | | |
| StepORLM | 97.7 | 97.2 | 79.3 | 97.8 | 55.6 | 59.5 | 82.6 | 81.4 |
| + Major Vote | 97.2 | 97.6 | 81.1 | 96.6 | **61.1** | 61.9 | 89.3 | 83.5 |
| + Solver Exec | 97.7 | **98.4** | 81.1 | 96.1 | **61.1** | **66.7** | 90.3 | 84.5 |
| + Discriminative PRM | 97.2 | 97.2 | 81.1 | 97.2 | 55.6 | 59.5 | 87.8 | 82.2 |
| + GenPRM (initial) | **97.8** | 97.6 | 82.8 | 97.2 | 55.6 | 58.5 | 93.1 | 83.2 |
| + GenPRM (final) | 97.2 | 97.8 | **87.4** | **98.9** | **61.1** | 61.9 | **94.6** | **85.6** |
| **ORLM as Policy Model** | | | | | | | | |
| ORLM | 73.8 | 90.4 | 59.5 | 76.4 | **50.0** | 42.9 | 61.8 | 65.0 |
| + Major Vote | 78.7 | 88.4 | 50.5 | 78.7 | 44.4 | 47.6 | 73.0 | 65.9 |
| + Solver Exec | 82.2 | 88.6 | 63.1 | 79.8 | 44.4 | 52.4 | 78.9 | 69.9 |
| + Discriminative PRM | 75.1 | **91.7** | 63.1 | 82.0 | **50.0** | 54.8 | 74.7 | 70.2 |
| + GenPRM (initial) | 87.3 | 90.6 | 55.0 | 90.4 | 44.4 | 47.6 | 65.5 | 68.7 |
| + GenPRM (final) | **91.5** | 91.0 | **64.9** | **91.0** | **50.0** | **57.1** | **79.4** | **75.0** |

Notes:

- Best results within each policy-model block are shown in **bold**.
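The strategies in the table differ only in how a single final answer is selected from N sampled trajectories. A schematic comparison of the three selection rules, using hypothetical candidate fields (`answer`, `executes_ok`) and a hypothetical `prm_score` callable standing in for the (Gen)PRM verifier:

```python
from collections import Counter

def majority_vote(candidates):
    """Pick the most frequent final answer among the sampled trajectories."""
    answers = [c["answer"] for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def solver_exec(candidates):
    """Keep only candidates whose generated code runs without error,
    then majority-vote over the survivors (fall back to all if none run)."""
    runnable = [c for c in candidates if c["executes_ok"]]
    return majority_vote(runnable or candidates)

def prm_rerank(candidates, prm_score):
    """Return the answer of the trajectory the verifier scores highest."""
    return max(candidates, key=prm_score)["answer"]

# Toy candidates: the majority answer is wrong, but it neither executes
# nor scores well step-wise, so execution filtering and PRM reranking recover.
cands = [
    {"answer": 42, "executes_ok": True,  "steps_ok": 0.9},
    {"answer": 41, "executes_ok": False, "steps_ok": 0.2},
    {"answer": 41, "executes_ok": False, "steps_ok": 0.3},
]
print(majority_vote(cands))                        # 41
print(solver_exec(cands))                          # 42
print(prm_rerank(cands, lambda c: c["steps_ok"]))  # 42
```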

## 🔄 Self-Evolving Framework

### Co-Evolutionary Loop

*(Figure: self-evolving training loop)*

The co-evolutionary loop of StepORLM. At each iteration, the policy model πθ generates multiple trajectories. The feedback from both the external solver (outcome) and the GenPRM ρθ (process) is used to create training data that simultaneously refines the policy via W-DPO and improves the GenPRM via SFT, fostering reciprocal improvement.
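One iteration of this loop can be sketched at a high level. Everything below is an illustrative skeleton, not the paper's implementation: `policy_generate`, `solver_check`, and `genprm_judge` are hypothetical stand-ins for sampling from πθ, the external solver check, and the GenPRM's step-level judgment, and the pairing/weighting heuristic is a simplification of W-DPO data construction:

```python
def co_evolve_step(policy_generate, solver_check, genprm_judge, problems, n_samples=4):
    """One schematic iteration: sample trajectories, collect outcome (solver)
    and process (GenPRM) feedback, and build the two training sets that
    refine the policy (W-DPO pairs) and the GenPRM (SFT data)."""
    dpo_pairs, prm_sft_data = [], []
    for prob in problems:
        scored = []
        for _ in range(n_samples):
            traj = policy_generate(prob)
            outcome = solver_check(prob, traj)   # did the final answer verify?
            process = genprm_judge(prob, traj)   # step-level score in [0, 1]
            scored.append((traj, outcome, process))
            prm_sft_data.append((prob, traj, outcome))  # solver labels supervise the GenPRM
        good = [s for s in scored if s[1]]
        bad = [s for s in scored if not s[1]]
        # W-DPO-style preference pairs, weighted by the process-score gap.
        dpo_pairs += [(g[0], b[0], g[2] - b[2]) for g in good for b in bad]
    return dpo_pairs, prm_sft_data


# Toy usage with stub components that alternate good/bad trajectories.
import itertools
gen = itertools.cycle(["x+1", "x-1"])
pairs, sft = co_evolve_step(
    policy_generate=lambda p: next(gen),
    solver_check=lambda p, t: t == "x+1",
    genprm_judge=lambda p, t: 0.9 if t == "x+1" else 0.2,
    problems=["maximize x"],
)
print(len(pairs), len(sft))  # 4 4
```

In the full loop, the returned pairs would train the policy via W-DPO and the outcome-labeled trajectories would fine-tune the GenPRM via SFT, then the next iteration repeats with both updated models.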

### Analysis of the Self-Evolving Process

*(Figure: self-evolving iteration analysis)*

Analysis of the self-evolving process, tracking performance (Pass@1 accuracy) at each training iteration. The relative improvement of each iteration over the previous one is shown on the corresponding bar.

We conduct an in-depth analysis of the self-evolving process by tracking performance at each training iteration, including the warm-up and co-evolving stages. Key observations include:

- **Warm-up SFT provides foundational gains:** The initial supervised fine-tuning (SFT) delivers a large foundational performance lift across all benchmarks (e.g., a 62.5% accuracy improvement on NLP4LP).

- **Consistent iterative improvements:** The subsequent self-evolution iterations consistently build on the prior models, delivering incremental but crucial accuracy gains. This confirms that the model progressively refines its capabilities through our self-evolving loop rather than stagnating after initial fine-tuning.

- **Non-monotonic progress on challenging datasets:** The performance trend is notably non-monotonic on the IndustryOR dataset, where performance first declines and then rises sharply at the third iteration. We attribute this to IndustryOR's small test set and challenging questions. Case-level inspection reveals a progression: early iterations primarily rectify structural modeling errors flagged by the PRM, whereas later iterations concentrate on code-level issues (e.g., index-out-of-bounds errors), ultimately yielding the accuracy improvement.

