@bklieger-groq
Contributor

Summary

This PR fixes a bug in the gpqa_diamond evaluation: --limit failed, and shuffling placed every question's correct answer in the same position.

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)

Changes Made

  • Removed random.seed(0) from record_to_mcq_sample in gpqa_diamond.py.
  • Implemented a per-question random.Random instance, seeded by the question's hash, to ensure deterministic but varied shuffling of multiple-choice options.
  • Updated openbench version in uv.lock.
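
The per-question seeding described in the changes above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name shuffle_choices and the choice of SHA-256 as the hash are assumptions. A stable digest is used here because Python's built-in hash() on strings varies between interpreter runs unless PYTHONHASHSEED is fixed.

```python
import hashlib
import random


def shuffle_choices(question: str, choices: list[str]) -> list[str]:
    """Return a shuffled copy of choices, deterministic for a given question.

    Hypothetical helper: the name and the SHA-256 seeding are assumptions,
    not taken from gpqa_diamond.py.
    """
    # Derive a stable integer seed from the question text.
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # A private Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled
```

Because the generator is local to the call, the same question always yields the same order, while different questions are seeded differently and anything else that relies on the global random module is unaffected.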

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context

The random.seed(0) call within record_to_mcq_sample in gpqa_diamond.py caused two main problems:

  1. Incorrect Answer Distribution: It reset the global random state for every record, so every shuffle produced the same permutation and all 198 samples had their correct answer at position "B". This rendered the benchmark invalid.
  2. Interference with --limit: The global random state reset likely interfered with inspect-ai's sample selection mechanism when the --limit parameter was used, causing the evaluation to fail.
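
The first problem can be reproduced in isolation. The sketch below is illustrative only: the function name buggy_target_position and the sample records are hypothetical, and it assumes each record lists the correct answer first, followed by the incorrect ones. Reseeding the global generator before every shuffle makes each four-item shuffle apply the same permutation, so the correct answer always lands at the same index.

```python
import random


def buggy_target_position(choices: list[str], correct: str) -> int:
    """Mimic the old behavior: reseed the global RNG before each shuffle."""
    random.seed(0)  # resets the global random state on every call
    shuffled = list(choices)
    random.shuffle(shuffled)
    return shuffled.index(correct)


# Hypothetical records; the correct answer is assumed to come first.
records = [
    ["right-1", "wrong-1a", "wrong-1b", "wrong-1c"],
    ["right-2", "wrong-2a", "wrong-2b", "wrong-2c"],
    ["right-3", "wrong-3a", "wrong-3b", "wrong-3c"],
]

# Collect the distinct positions the correct answer ends up in.
positions = {buggy_target_position(r, r[0]) for r in records}
print(positions)  # a single index, shared by every record
```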

The fix ensures that each question has a unique but deterministic shuffle order for its options, resulting in a balanced target distribution (e.g., A=58, B=54, C=44, D=42) and proper functionality of the --limit parameter.
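
A distribution like the one quoted above can be checked with a small tally. This is a sketch under assumptions: the helper names target_distribution and is_degenerate are hypothetical, and the input is taken to be a list of target letters, one per sample.

```python
from collections import Counter


def target_distribution(targets: list[str]) -> dict[str, int]:
    """Count how often each option letter is the correct answer."""
    counts = Counter(targets)
    return {letter: counts.get(letter, 0) for letter in "ABCD"}


def is_degenerate(targets: list[str]) -> bool:
    """True when every sample shares the same target letter."""
    return len(set(targets)) == 1


# The pre-fix situation described in this PR: all 198 targets are "B".
before = ["B"] * 198
print(target_distribution(before), is_degenerate(before))
```

The counts reported for the fixed version (58 + 54 + 44 + 42) add up to the same 198 samples.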


Co-authored-by: bklieger <bklieger@groq.com>
@cursor

cursor bot commented Dec 24, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.