Limit evaluation dataset issue #342
Draft
Summary
This PR fixes a bug in the `gpqa_diamond` evaluation where `--limit` failed and all multiple-choice answers were incorrectly shuffled to the same position.
What are you adding?
Changes Made
- Remove `random.seed(0)` from `record_to_mcq_sample` in `gpqa_diamond.py`.
- Use a `random.Random` instance, seeded by the question's hash, to ensure deterministic but varied shuffling of multiple-choice options.
- Update the `openbench` version in `uv.lock`.
Testing
- `pytest`
- `pre-commit run --all-files`
Checklist
Related Issues
Closes #
Additional Context
The `random.seed(0)` call within `record_to_mcq_sample` in `gpqa_diamond.py` caused two main problems:
- Identical shuffling: reseeding the global generator with the same value before every shuffle produced the same permutation for every question, so the correct answer always landed in the same position.
- Failure with `--limit`: the global random state reset likely interfered with `inspect-ai`'s sample selection mechanism when the `--limit` parameter was used, causing the evaluation to fail.
The fix gives each question its own deterministic shuffle order for its options, resulting in a balanced target distribution (e.g., A=58, B=54, C=44, D=42) and a working `--limit` parameter.
Slack Thread
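For reviewers, here is a minimal sketch of the behaviour described in this PR. It is not the actual `gpqa_diamond.py` code: the helper names (`shuffle_with_global_seed`, `shuffle_per_question`, `target_letter`) and the sample records are hypothetical. It also assumes the seed comes from a SHA-256 digest of the question text, because Python's built-in `hash()` for strings is salted per process and would not be reproducible across runs; the PR itself only says "the question's hash".

```python
import hashlib
import random
from collections import Counter

LETTERS = "ABCD"


def shuffle_with_global_seed(options):
    """Old behaviour as described: reseed the global RNG before every shuffle."""
    random.seed(0)  # same seed each call, so the same permutation each call
    shuffled = list(options)
    random.shuffle(shuffled)
    return shuffled


def shuffle_per_question(question, options):
    """Sketch of the fix: a local RNG seeded from a stable hash of the question."""
    # Assumption: SHA-256 stands in for "the question's hash".
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = list(options)
    rng.shuffle(shuffled)  # leaves the global random state untouched
    return shuffled


def target_letter(shuffled, correct):
    """Letter of the position where the correct option ended up."""
    return LETTERS[shuffled.index(correct)]


# Hypothetical records: the correct answer is always listed first.
records = [
    (f"Question {i}?", ["correct", "wrong 1", "wrong 2", "wrong 3"])
    for i in range(200)
]

old_targets = Counter(
    target_letter(shuffle_with_global_seed(opts), opts[0]) for _, opts in records
)
new_targets = Counter(
    target_letter(shuffle_per_question(q, opts), opts[0]) for q, opts in records
)

print("global seed: ", dict(old_targets))  # one letter holds every target
print("per question:", dict(new_targets))  # targets spread across letters
```

Because the local `random.Random` instance never touches the module-level generator, any other code that relies on the global random state is unaffected by option shuffling, which is the property the `--limit` explanation above depends on.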