Limit evaluation dataset issue #342

bklieger-groq · 2025-12-24T05:10:56Z

Summary

This PR fixes a bug in the gpqa_diamond evaluation where runs with --limit failed and every sample's correct answer was shuffled to the same option position.

What are you adding?

Bug fix (non-breaking change which fixes an issue)

Changes Made

Removed random.seed(0) from record_to_mcq_sample in gpqa_diamond.py.
Implemented a per-question random.Random instance, seeded by the question's hash, to ensure deterministic but varied shuffling of multiple-choice options.
Updated openbench version in uv.lock.

Testing

I have run the existing test suite (pytest)
I have added tests for my changes
I have tested with multiple model providers (if applicable)
I have run pre-commit hooks (pre-commit run --all-files)

Checklist

My code follows the project's style guidelines
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (if applicable)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context

The random.seed(0) call within record_to_mcq_sample in gpqa_diamond.py caused two main problems:

Incorrect Answer Distribution: It reset the global random state for every record, leading to all 198 samples having their correct answer at position "B". This rendered the benchmark invalid.
Interference with --limit: The global random state reset likely interfered with inspect-ai's sample selection mechanism when the --limit parameter was used, causing the evaluation to fail.
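
To illustrate the first problem, here is a minimal sketch of the failure mode, not the actual record_to_mcq_sample code. The function name, option strings, and option ordering are hypothetical. It shows only that reseeding the global generator before each shuffle produces the same permutation for every record, so the correct answer always lands in the same slot.

```python
import random


def target_with_global_reseed(correct, distractors):
    """Hypothetical stand-in for the buggy pattern: reseed, then shuffle."""
    random.seed(0)  # resets the global RNG on every call
    options = [correct] + list(distractors)
    random.shuffle(options)  # same seed + same length => same permutation
    return "ABCD"[options.index(correct)]


# 198 synthetic records, matching the sample count mentioned above.
positions = {
    target_with_global_reseed(f"right-{i}", [f"wrong-{i}-{j}" for j in range(3)])
    for i in range(198)
}

# Every record ends up with the same target letter.
print(len(positions))
```

Which letter that is depends on how the options are ordered before the shuffle, so the sketch makes no claim about reproducing "B" specifically.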

The fix ensures that each question has a unique but deterministic shuffle order for its options, resulting in a balanced target distribution (e.g., A=58, B=54, C=44, D=42) and proper functionality of the --limit parameter.
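
A sketch of the per-question approach, under stated assumptions: the PR does not say which hash is used, so this example derives the seed from a SHA-256 digest of the question text. That choice is an assumption made here because Python's built-in hash() for strings varies between processes unless PYTHONHASHSEED is fixed. The function name and sample data are hypothetical.

```python
import hashlib
import random
from collections import Counter


def shuffled_options(question, correct, distractors):
    """Shuffle options with an RNG private to this question."""
    # Assumption: a stable digest of the question text serves as the seed.
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    options = [correct] + list(distractors)
    rng.shuffle(options)  # leaves the global random state untouched
    return options, "ABCD"[options.index(correct)]


# Tally target letters over 198 synthetic questions.
counts = Counter(
    shuffled_options(f"question {i}", "right", ["w1", "w2", "w3"])[1]
    for i in range(198)
)
print(dict(counts))
```

Because the generator is a local random.Random instance, the shuffle is reproducible per question and does not touch the global state that other code may rely on.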

Slack Thread

Co-authored-by: bklieger <bklieger@groq.com>

cursor · 2025-12-24T05:10:57Z

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

Refactor: Improve GPQA Diamond eval and update uv.lock

a2a9038

Co-authored-by: bklieger <bklieger@groq.com>
