StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
The official implementation repository of the paper "StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models".
The table below shows the overall performance of StepORLM and baselines, measured by Pass@1 accuracy (%), on six OR benchmarks. StepORLM+GenPRM denotes using the GenPRM as a process verifier to enable inference-time scaling of StepORLM.
| Model | Params | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot LLMs | |||||||||
| GPT-4o | Closed | 61.2 | 70.3 | 57.7 | 73.6 | 42.9 | 38.1 | 48.4 | 56.0 |
| DeepSeek-V3 | 671B | 79.8 | 95.2 | 53.2 | 92.1 | 55.6 | 66.7 | 85.1 | 75.4 |
| Qwen3-32B | 32B | 77.5 | 92.3 | 46.9 | 93.8 | 50.0 | 61.9 | 85.1 | 72.5 |
| Qwen2.5-72B-Instruct | 72B | 78.9 | 95.8 | 44.1 | 88.2 | 50.0 | 57.1 | 81.1 | 70.7 |
| Fine-tuned LLMs | |||||||||
| ORLM | 8B | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| LLMOPT (origin) | 14B | 80.3* | 89.5* | 44.1* | 73.4* | 35.3* | 29.0* | 53.8* | 57.9* |
| LLMOPT (reproduce) | 14B | 49.3 | 36.3 | 25.2 | 43.3 | 16.7 | 40.5 | 39.5 | 35.8 |
| OptMATH (origin) | 32B | 95.9* | 89.9* | 54.1* | - | - | - | - | - |
| StepORLM | 8B | 96.7 | 97.6 | 77.5 | 97.2 | 50.0 | 52.4 | 81.9 | 79.0 |
| Agentic Methods | |||||||||
| OptiMUS-v0.3 | Closed | 76.2 | 78.0 | 46.8 | 88.8 | 46.8 | 45.2 | 87.6 | 67.1 |
| CoT | Closed | 62.2 | 49.5 | 42.3 | 74.7 | 39.2 | 40.5 | 43.6 | 50.3 |
| CoE | Closed | 66.7 | 94.4 | 50.6 | 87.4 | 57.1 | 31.2 | 71.2 | 65.5 |
| CAFA | Closed | 68.1 | 71.2 | 44.5 | 50.0 | 46.4 | 41.1 | 40.1 | 51.6 |
| StepORLM+GenPRM | 8B+8B | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
Notes:
- Scores cited from original publications are marked with (*); missing entries are denoted with (-).
- Best results are highlighted in bold and second-highest values are underlined.
- Abbreviations: CompOR = ComplexOR, IndOR = IndustryOR, Avg = Macro-Average
The table below compares the performance of StepORLM under different inference scaling strategies:
| Model | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|
| StepORLM as Policy Model | ||||||||
| StepORLM | 97.7 | 97.2 | 79.3 | 97.8 | 55.6 | 59.5 | 82.6 | 81.4 |
| + Major Vote | 97.2 | 97.6 | 81.1 | 96.6 | 61.1 | 61.9 | 89.3 | 83.5 |
| + Solver Exec | 97.7 | 98.4 | 81.1 | 96.1 | 61.1 | 66.7 | 90.3 | 84.5 |
| + Discriminative PRM | 97.2 | 97.2 | 81.1 | 97.2 | 55.6 | 59.5 | 87.8 | 82.2 |
| + GenPRM (initial) | 97.8 | 97.6 | 82.8 | 97.2 | 55.6 | 58.5 | 93.1 | 83.2 |
| + GenPRM (final) | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
| ORLM as Policy Model | ||||||||
| ORLM | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| + Major Vote | 78.7 | 88.4 | 50.5 | 78.7 | 44.4 | 47.6 | 73.0 | 65.9 |
| + Solver Exec | 82.2 | 88.6 | 63.1 | 79.8 | 44.4 | 52.4 | 78.9 | 69.9 |
| + Discriminative PRM | 75.1 | 91.7 | 63.1 | 82.0 | 50.0 | 54.8 | 74.7 | 70.2 |
| + GenPRM (initial) | 87.3 | 90.6 | 55.0 | 90.4 | 44.4 | 47.6 | 65.5 | 68.7 |
| + GenPRM (final) | 91.5 | 91.0 | 64.9 | 91.0 | 50.0 | 57.1 | 79.4 | 75.0 |
Notes:
- Best results are highlighted in bold and second-highest values are underlined.
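For reference, the selection rules behind the scaling strategies above can be sketched in a few lines. This is a minimal illustration, not the repository's API: `majority_vote` corresponds to the "+ Major Vote" rows (pick the most frequent final answer among N samples), and `best_of_n` corresponds to verifier-based reranking such as "+ GenPRM" (pick the candidate the verifier scores highest); the `score_fn` interface is a hypothetical stand-in for a GenPRM call.

```python
from collections import Counter

def majority_vote(answers):
    """'+ Major Vote': return the most frequent final answer
    among N sampled solutions (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    """Verifier-based reranking (e.g., '+ GenPRM'): return the
    candidate with the highest verifier score. `score_fn` stands in
    for an actual process-reward-model call."""
    return max(candidates, key=score_fn)
```

"+ Solver Exec" additionally filters candidates by whether the generated code executes successfully under the external solver before applying a selection rule.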
The co-evolutionary loop of StepORLM. At each iteration, the policy model πθ generates multiple trajectories. The feedback from both the external solver (outcome) and the GenPRM ρθ (process) is used to create training data that simultaneously refines the policy via W-DPO and improves the GenPRM via SFT, fostering reciprocal improvement.
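The loop described above can be sketched as follows. This is a simplified, hypothetical rendering: `policy_sample`, `solver_check`, and `genprm_judge` are stand-ins for the actual policy generation, solver execution, and GenPRM critique calls, and the weighting scheme shown is only an illustration of the idea that W-DPO pairs carry process-level weights.

```python
def co_evolution_round(policy_sample, solver_check, genprm_judge, problems, n=4):
    """One co-evolution iteration (hypothetical interfaces):
    sample n trajectories per problem, collect outcome feedback from the
    solver and process feedback from the GenPRM, and return (a) weighted
    preference pairs for W-DPO training of the policy and (b) verified
    trajectories as SFT data for the GenPRM."""
    dpo_pairs, prm_sft = [], []
    for p in problems:
        trajs = [policy_sample(p) for _ in range(n)]
        # Each trajectory gets an outcome label (solver) and a process score (GenPRM).
        scored = [(t, solver_check(p, t), genprm_judge(p, t)) for t in trajs]
        wins = [(t, s) for t, ok, s in scored if ok]
        losses = [(t, s) for t, ok, s in scored if not ok]
        # W-DPO data: pair each solver-verified trajectory with each failed one,
        # weighted here (illustratively) by the gap in process scores.
        for w, ws in wins:
            for l, ls in losses:
                dpo_pairs.append({"prompt": p, "chosen": w, "rejected": l,
                                  "weight": max(ws - ls, 0.0)})
        # GenPRM SFT data: solver-verified trajectories serve as supervision
        # for the generative process reward model.
        prm_sft.extend(t for t, ok, _ in scored if ok)
    return dpo_pairs, prm_sft
```

Each round thus produces training data for both models from the same batch of trajectories, which is what drives the reciprocal improvement between the policy and the GenPRM.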
Analysis of the self-evolving process, tracking Pass@1 accuracy at each training iteration. The relative improvement of each iteration over the previous one is shown on the corresponding bar.
We conduct an in-depth analysis of the self-evolving process by tracking performance at each training iteration, covering both the warm-up and co-evolving stages. Key observations include:
- Warm-up SFT provides foundational gains: The initial supervised fine-tuning (SFT) delivers a substantial foundational performance lift across all benchmarks (e.g., a 62.5% accuracy improvement on NLP4LP).
- Consistent iterative improvements: The subsequent self-evolution iterations consistently build upon the prior models, delivering incremental but crucial accuracy improvements. This confirms that the model progressively refines its capabilities through our self-evolving loop, rather than stagnating after initial fine-tuning.
- Non-monotonic progress on challenging datasets: The performance trend is notably non-monotonic on the IndustryOR dataset, where performance first declines before rising sharply at the third iteration. We attribute this to IndustryOR's small test set and challenging questions. Our case-level inspection reveals a progression: early iterations primarily rectify structural modeling errors flagged by the PRM, whereas later iterations concentrate on code-level issues (e.g., index-out-of-bounds errors), ultimately leading to accuracy improvements.