Post-trained Qwen2.5-3B-Instruct with a GRPO pipeline that adds a Python execution tool and new reward shaping to improve mathematical reasoning. After one epoch, the model scores 0.52 on MATH-500, beating the larger Qwen2.5-7B baseline (~0.50).
Inject a Python interpreter into GRPO training so the policy can:
- Think: `<think>...</think>`
- Execute code: `<python>...</python>`
- See runtime feedback: `<output>...</output>`
- Produce final answer: `<answer>...</answer>`
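An illustrative rollout in this format (the problem and code here are made up; only the tag sequence matters):

```
<think>The sum 1 + 2 + ... + 100 is n(n+1)/2 = 5050; verify with code.</think>
<python>print(sum(range(1, 101)))</python>
<output>5050</output>
<answer>5050</answer>
```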
Rewards encourage correct answers and reliable tool usage.
Base model: Qwen2.5-3B-Instruct.
Rewards are deterministic (parsed from the tagged output):
- Format: Output contains required tag sequence.
- Accuracy: 1 if final answer matches ground truth.
- Tool Success: `log(T_success / T_total)`, which penalizes failed executions.
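A minimal sketch of the three signals, written as TRL-style reward functions over a batch of completion strings (the `answer` ground-truth column name and the error-detection heuristic are assumptions, not the repo's exact code):

```python
import math
import re

def format_reward(completions, **kwargs):
    # 1 if the required tag sequence is present (loose check), else 0.
    pattern = re.compile(r"<think>.*?</think>.*?<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    # 1 if the text inside <answer>...</answer> matches the ground truth.
    # `answer` is assumed to be the dataset's ground-truth column.
    rewards = []
    for completion, truth in zip(completions, answer):
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        ok = m is not None and m.group(1).strip() == str(truth).strip()
        rewards.append(1.0 if ok else 0.0)
    return rewards

def tool_success_reward(completions, **kwargs):
    # log(T_success / T_total): 0 when every execution succeeds,
    # increasingly negative as more executions fail.
    rewards = []
    for completion in completions:
        outputs = re.findall(r"<output>(.*?)</output>", completion, re.DOTALL)
        total = len(outputs)
        if total == 0:
            rewards.append(0.0)  # no tool calls: nothing to score
            continue
        # Heuristic: an output that starts like "SomeError: ..." is a failure.
        failures = sum(
            1 for o in outputs if re.match(r"\s*\w*(Error|Exception)\b", o)
        )
        successes = total - failures
        # Clamp to avoid log(0) when every execution failed.
        rewards.append(math.log(max(successes, 1e-6) / total))
    return rewards
```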
Using Hugging Face `GRPOTrainer` with custom hooks:
- Parse generated text for `<python>` blocks.
- Execute code in a sandbox; capture stdout/exceptions as `<output>`.
- Reinsert `<output>` into the model's trajectory.
- Compute rewards and update the policy (GRPO).
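A sketch of the execute-and-reinsert step, with an in-process `exec` standing in for the sandbox (function names are illustrative; a real setup should isolate execution in a subprocess or container with timeouts):

```python
import contextlib
import io
import re

PYTHON_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def run_sandboxed(code: str) -> str:
    # Capture stdout from exec(), or the exception text on failure.
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip()
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def inject_tool_feedback(completion: str) -> str:
    # Append an <output> block after each <python> block so the
    # trajectory records the runtime feedback the model "saw".
    def _with_output(match: re.Match) -> str:
        result = run_sandboxed(match.group(1))
        return f"{match.group(0)}\n<output>{result}</output>"
    return PYTHON_BLOCK.sub(_with_output, completion)
```

In the actual rollout, decoding pauses at each closing `</python>` tag, the `<output>` block is appended, and generation resumes so the model conditions on the runtime feedback.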
- Hardware: 1× A100 80GB
- Duration: 1 epoch (~3h)
- Data: Math reasoning problems; evaluation on MATH-500
- Standard HF optimizations (gradient checkpointing, etc.)
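Put together, the trainer wiring might look roughly like this (hyperparameter values are placeholders, and `train_dataset` is assumed to hold the math prompts with an `answer` column; TRL passes dataset columns to reward functions as kwargs):

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-tool",   # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=4,       # placeholder value
    gradient_checkpointing=True,
    num_generations=8,                   # GRPO group size: completions per prompt
    max_completion_length=1024,          # placeholder value
    learning_rate=1e-6,                  # placeholder value
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[format_reward, accuracy_reward, tool_success_reward],
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```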
| Model | Params | Tool-Augmented? | Epochs | MATH-500 Accuracy |
|---|---|---|---|---|
| Qwen2.5-7B (baseline) | 7B | No | – | ~0.50 |
| This Work (3B) | 3B | Yes | 1 | 0.52 |
The tool-augmented 3B model surpasses the larger, non-augmented 7B baseline after a single training epoch.
- More epochs
- Sub-1B experiments
- Release cleaned training/eval scripts