feat: supports evaluation of multiple-choice benchmarks #559
Merged
46 commits
405acc2 Adds multiple choice eval datasets. (xxman-google)
67aae53 Add a verify worker for multiple-choice problems. (xxman-google)
4134fcb add prompts for MMLU and GPQA. (xxman-google)
0ca559f modifies eval script to support multiple-choice questions. (xxman-google)
2163cbf add eval config files. (xxman-google)
d9dd544 add unit tests. (xxman-google)
11a1de5 add AIME 2024 dataset. (xxman-google)
4da0c43 add GPQA main version. (xxman-google)
5870e46 fix: remove reference_model_buffers in fsdp2 (#558) (yuki-97)
79690a1 fix: Add assertion if async is disabled when using pp with vllm (#565) (parthchadha)
940049f fix: remove visualization code (#566) (parthchadha)
745790c Allow uneven shards for multi-GPU inference in vllm backend (#494) (KiddoZhu)
9c59083 add GPQA main version. (xxman-google)
628ef2d updates doc. (xxman-google)
f431d48 feat: vllm Model diagnostic test checking long generation quality (#516) (vegaluisjose)
f6b948d feat: Log code in wandb (#175) (yfw)
4265fed fix: add dynamic_batching key to SFT OpenMathInstruct config (#570) (ashors1)
7c8367d feat: support async in non-colocated (#523) (yuki-97)
d0dca5b fix: correct mcore dtype + assertion on activation_func (#572) (terrykong)
e257d88 fix: move core ray port from 6379 -> 54258 to reduce port collision f… (terrykong)
c27ff44 fix: fix overlap param gather (#561) (ashors1)
16ac698 docs: fix some typos on nsys/model-quirk pages (#560) (terrykong)
9b79e1e feat: Add megatron to hf converter (#555) (ashors1)
4022bee docs: Add a note on supported backends (#553) (ashors1)
f03e596 feat: Support pass@k (#536) (peri044)
8f44492 fix: Megatron config fixes (#576) (SahilJain314)
39b8f25 update docs for the new eval. (xxman-google)
8f6ac97 docs: move training backends section (#580) (ashors1)
2975315 docs: Add a note on supported backends (#553) (ashors1)
26f8fb2 docs: move training backends section (#580) (ashors1)
1055f5e Update more docs for the new eval. (xxman-google)
788c628 Merge branch 'main' into xx/new_eval (yuki-97)
aaa3eeb fix lint errors. (xxman-google)
0d77a15 add missing copyright statements. (xxman-google)
17fe405 add missing copyright statements. (xxman-google)
cf828d6 docs: Add missing arguments to DeepScaler evaluation (#502) (butsugiri)
01c3840 fix: prevent divisible error by dropping last batch in loader (#583) (wedu-nvidia)
658437d feat: improve worker group args/kwargs (#539) (yuki-97)
2eb0301 fix: update gemma3 prefix (#585) (ashors1)
bc234a3 fix: Added copyright to functest (#584) (SahilJain314)
2d876de chore: Update github url after org transfer (#512) (chtruong814)
ddac07c feat: add OpenAI format dataset for SFT (#485) (AtsunoriFujita)
283074a fix: load HF model only on rank 0 (#544) (parthchadha)
e78af38 feat: support async in non-colocated (#523) (yuki-97)
4cd4568 feat: Add megatron to hf converter (#555) (ashors1)
c44efc0 Merge branch 'main' into xx/new_eval (xxman-google)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1,15 @@` |
| | | `# GPQA evaluation Configuration` |
| | | `defaults: "eval.yaml"` |
| | | |
| | | `generation:` |
| | | `  model_name: "Qwen/Qwen2.5-7B-Instruct"` |
| | | `  vllm_cfg:` |
| | | `    max_model_len: 3072` |
| | | |
| | | `data:` |
| | | `  prompt_file: "examples/prompts/gpqa.txt"` |
| | | `  dataset_name: "gpqa"` |
| | | |
| | | `env:` |
| | | `  math:` |
| | | `    verifier_type: "multichoice"` |
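
The config above sets only a few keys and takes the rest from `eval.yaml` through `defaults`. The loader that resolves `defaults` is not part of this diff, so the following is only a minimal sketch of one plausible behavior: a recursive merge in which the override wins key by key. The function name `merge_config` and every value in `base_cfg` are hypothetical placeholders, not the contents of the real `eval.yaml`.

```python
from copy import deepcopy
from typing import Any, Mapping


def merge_config(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict:
    """Return a new dict where `override` is layered on top of `base`.

    Nested mappings are merged recursively; any other value in `override`
    replaces the value from `base`. Neither input is modified.
    """
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for eval.yaml; the real file is not shown in this diff.
base_cfg = {
    "generation": {
        "model_name": "placeholder-model",
        "vllm_cfg": {"max_model_len": 2048},
    },
    "data": {"dataset_name": "placeholder"},
}

# The keys set by the GPQA config above.
gpqa_override = {
    "generation": {
        "model_name": "Qwen/Qwen2.5-7B-Instruct",
        "vllm_cfg": {"max_model_len": 3072},
    },
    "data": {"prompt_file": "examples/prompts/gpqa.txt", "dataset_name": "gpqa"},
    "env": {"math": {"verifier_type": "multichoice"}},
}

gpqa_cfg = merge_config(base_cfg, gpqa_override)
```

Under this assumption, keys that the GPQA file does not mention keep their base values, and new sections such as `env.math` are added to the merged result.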
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1,14 @@` |
| | | `# Evaluation Configuration from local files.` |
| | | `defaults: "eval.yaml"` |
| | | |
| | | `generation:` |
| | | `  model_name: "Qwen/Qwen2.5-7B-Instruct"` |
| | | |
| | | `data:` |
| | | `  prompt_file: "examples/prompts/cot.txt"` |
| | | `  dataset_name: "local"` |
| | | `  problem_key: "Question"` |
| | | `  solution_key: "Answer"` |
| | | `  split: "train"` |
| | | `  data_paths: "https://openaipublic.blob.core.windows.net/simple-evals/math_500_test.csv"` |
| | | `  file_format: "csv"` |
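
The `problem_key` and `solution_key` settings above name the CSV columns that hold the question and the reference answer. As a rough illustration of that mapping, the sketch below reads CSV text with the standard library and renames the two columns. The helper `load_local_rows`, the output field names `problem` and `expected_answer`, and the sample row are all assumptions made for this example; the dataset code in this PR may be structured differently.

```python
import csv
import io
from typing import Dict, List


def load_local_rows(csv_text: str, problem_key: str, solution_key: str) -> List[Dict[str, str]]:
    """Parse CSV text and keep only the question and answer columns.

    `problem_key` and `solution_key` are the column headers to read, mirroring
    the config keys of the same name. The output field names are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"problem": row[problem_key], "expected_answer": row[solution_key]}
        for row in reader
    ]


# Invented sample data using the column names from the config above.
sample_csv = "Question,Answer\nWhat is 1 + 1?,2\n"
rows = load_local_rows(sample_csv, problem_key="Question", solution_key="Answer")
```

Reading from an in-memory string keeps the example free of network access; a file downloaded from `data_paths` could be passed through the same function after being read as text.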
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1,9 @@` |
| | | `# Math evaluation Configuration` |
| | | `defaults: "eval.yaml"` |
| | | |
| | | `generation:` |
| | | `  model_name: "Qwen/Qwen2.5-7B-Instruct"` |
| | | |
| | | `data:` |
| | | `  prompt_file: "examples/prompts/cot.txt"` |
| | | `  dataset_name: "math"` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1 @@` |
| | | Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1 @@` |
| | | Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering. |
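
Both prompts ask the model to end with a line of the form `Answer: $LETTER`, which is what makes automatic grading possible. The verify worker added in this PR is not visible in the diff shown here, so the following is only a sketch of how such a check could work: find the last `Answer: X` in the response and compare it to the reference letter. The regular expression and the names `extract_choice` and `score_multichoice` are assumptions for illustration, not the PR's implementation.

```python
import re
from typing import Optional

# Matches "Answer: C", "answer:c", or "Answer: $C"; only the letters A-D are accepted.
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*\$?([A-D])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last answer letter found in `response`, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def score_multichoice(response: str, ground_truth: str) -> float:
    """Return 1.0 when the extracted letter equals the reference letter, else 0.0."""
    return 1.0 if extract_choice(response) == ground_truth.strip().upper() else 0.0
```

Taking the last match rather than the first follows the prompt's requirement that the answer appear on the final line, so a letter mentioned earlier in the reasoning does not override the final choice. A response with no parseable answer scores 0.0.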