Fix GPQA and index extractive metric #829
Merged
clefourrier merged 7 commits into main on Jun 26, 2025
Contributor
Pull Request Overview
This PR refines the GPQA prompt by removing extra whitespace and extracting the instruction header, and tightens the GPQA metric to avoid false positives by disabling try_extract_without_anchor.
- Refactors `gpqa_instruct` to strip inputs and separate the instruction from the query template.
- Updates GPQA metrics to set `try_extract_without_anchor=False` for both gold and predicted extraction targets across sample-level metrics.
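The prompt-side change can be illustrated with a minimal sketch. This is not the code from `default_prompts.py`: `build_gpqa_prompt`, `Prompt`, and `QUERY_TEMPLATE` are hypothetical names made up for this example, and the template layout is an assumption. Only the instruction string is taken from this PR.

```python
from dataclasses import dataclass

# Instruction text as quoted in this PR.
INSTRUCTION = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering."
)

# Hypothetical template: the real layout in default_prompts.py may differ.
QUERY_TEMPLATE = "{instruction}\n\n{question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}"


@dataclass
class Prompt:
    """Hypothetical container standing in for the library's own prompt object."""

    query: str
    instruction: str
    choices: list
    gold_index: int


def build_gpqa_prompt(question: str, choices: list, gold_index: int) -> Prompt:
    """Strip stray whitespace and keep the instruction as its own field."""
    cleaned = [choice.strip() for choice in choices]
    query = QUERY_TEMPLATE.format(
        instruction=INSTRUCTION,
        question=question.strip(),
        A=cleaned[0],
        B=cleaned[1],
        C=cleaned[2],
        D=cleaned[3],
    )
    return Prompt(
        query=query,
        instruction=INSTRUCTION,
        choices=cleaned,
        gold_index=gold_index,
    )


prompt = build_gpqa_prompt("  What is 2 + 2?\n", [" 3 ", "4\n", " 5", "6 "], 1)
print(prompt.query)
```

Keeping the instruction in a separate field means it is defined once and can be inspected on its own, while stripping the question and choices keeps dataset whitespace out of the rendered query.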
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lighteval/tasks/default_prompts.py | Separated instruction from query, added .strip() on choices and question to remove extra whitespace. |
| src/lighteval/metrics/metrics.py | Changed IndicesExtractionConfig to disable try_extract_without_anchor and duplicated configs across metrics. |
Comments suppressed due to low confidence (2)
src/lighteval/tasks/default_prompts.py:902
- Add unit tests for `gpqa_instruct` to verify that the generated `query` and `instruction` fields correctly strip whitespace and match expected formatting when input contains surrounding spaces or newlines.
instruction = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering."
src/lighteval/metrics/metrics.py:877
- [nitpick] Add a brief comment explaining why `try_extract_without_anchor` is set to `False` here, so future maintainers understand the rationale behind making the metric stricter.
gold_extraction_target=[
Collaborator
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
clefourrier added a commit that referenced this pull request on Jun 26, 2025
* too many false positives with the current gpqa metric extraction, making it more strict
* fixing whitespace and instruction in prompt
* better to have a strict extraction for index extraction in general actually
* added comment
* fix tests, need to invert condition
NathanHB pushed a commit that referenced this pull request on Sep 19, 2025
* too many false positives with the current gpqa metric extraction, making it more strict
* fixing whitespace and instruction in prompt
* better to have a strict extraction for index extraction in general actually
* added comment
* fix tests, need to invert condition
Fixes the prompt template (removes extra whitespace, fixes the instruction) and makes the metric stricter, to avoid the false positives caused by `try_extract_without_anchor` (which tries to extract an index at all costs, even by looking for the raw string almost as is).
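To make the false-positive case concrete, here is a small self-contained sketch of anchored versus unanchored extraction. It does not reproduce lighteval's extraction code: `extract_choice`, the two regular expressions, and the `try_without_anchor` parameter are hypothetical stand-ins that only mimic the behaviour described above.

```python
import re
from typing import Optional

# Anchored pattern: a letter only counts when it follows "Answer:".
# The lookahead rejects words such as "Answer: Dogs".
ANCHORED = re.compile(r"answer\s*:\s*\(?([ABCD])\)?(?![A-Za-z])", re.IGNORECASE)

# Unanchored fallback: any standalone capital A-D anywhere in the text.
BARE = re.compile(r"\b([ABCD])\b")


def extract_choice(text: str, try_without_anchor: bool) -> Optional[str]:
    """Return the last anchored letter, optionally falling back to a bare one."""
    anchored = ANCHORED.findall(text)
    if anchored:
        return anchored[-1].upper()
    if try_without_anchor:
        bare = BARE.findall(text)
        if bare:
            return bare[-1]
    return None


# The model never commits to an answer, but the text starts with the article "A".
no_answer = "A quick check of the units does not settle it."
print(extract_choice(no_answer, try_without_anchor=True))   # A (false positive)
print(extract_choice(no_answer, try_without_anchor=False))  # None

# A well-formed response is extracted the same way in both modes.
well_formed = "The units only match for the third option.\nAnswer: C"
print(extract_choice(well_formed, try_without_anchor=False))  # C
```

With the fallback enabled, a response that never states an answer can still be scored against the gold index, which is the kind of false positive the stricter setting is meant to rule out.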