Do you accept agentic/scaffolded system submissions to the leaderboard, or only vanilla single-pass model evaluations?
Results on the validation set:
Model
- claude-sonnet-4-6: correctness 53.3%, sub-problem correctness 74%
Scaffolding
- claude-sonnet-4-6: correctness 66.7%, sub-problem correctness 82%
I can share more details in this thread. Thanks