Merged

46 commits
- 405acc2 Adds multiple choice eval datasets. (xxman-google, Jun 24, 2025)
- 67aae53 Add a verify worker for multiple-choice problems. (xxman-google, Jun 24, 2025)
- 4134fcb add prompts for MMLU and GPQA. (xxman-google, Jun 24, 2025)
- 0ca559f modifies eval script to support multiple-choice questions. (xxman-google, Jun 24, 2025)
- 2163cbf add eval config files. (xxman-google, Jun 24, 2025)
- d9dd544 add unit tests. (xxman-google, Jun 24, 2025)
- 11a1de5 add AIME 2024 dataset. (xxman-google, Jun 24, 2025)
- 4da0c43 add GPQA main version. (xxman-google, Jun 25, 2025)
- 5870e46 fix: remove reference_model_buffers in fsdp2 (#558) (yuki-97, Jun 26, 2025)
- 79690a1 fix: Add assertion if async is disabled when using pp with vllm (#565) (parthchadha, Jun 26, 2025)
- 940049f fix: remove visualization code (#566) (parthchadha, Jun 26, 2025)
- 745790c Allow uneven shards for multi-GPU inference in vllm backend (#494) (KiddoZhu, Jun 26, 2025)
- 9c59083 add GPQA main version. (xxman-google, Jun 25, 2025)
- 628ef2d updates doc. (xxman-google, Jun 27, 2025)
- f431d48 feat: vllm Model diagnostic test checking long generation quality (#516) (vegaluisjose, Jun 26, 2025)
- f6b948d feat: Log code in wandb (#175) (yfw, Jun 26, 2025)
- 4265fed fix: add dynamic_batching key to SFT OpenMathInstruct config (#570) (ashors1, Jun 27, 2025)
- 7c8367d feat: support async in non-colocated (#523) (yuki-97, Jun 27, 2025)
- d0dca5b fix: correct mcore dtype + assertion on activation_func (#572) (terrykong, Jun 27, 2025)
- e257d88 fix: move core ray port from 6379 -> 54258 to reduce port collision f… (terrykong, Jun 27, 2025)
- c27ff44 fix: fix overlap param gather (#561) (ashors1, Jun 27, 2025)
- 16ac698 docs: fix some typos on nsys/model-quirk pages (#560) (terrykong, Jun 27, 2025)
- 9b79e1e feat: Add megatron to hf converter (#555) (ashors1, Jun 27, 2025)
- 4022bee docs: Add a note on supported backends (#553) (ashors1, Jun 28, 2025)
- f03e596 feat: Support pass@k (#536) (peri044, Jun 28, 2025)
- 8f44492 fix: Megatron config fixes (#576) (SahilJain314, Jun 28, 2025)
- 39b8f25 update docs for the new eval. (xxman-google, Jun 30, 2025)
- 8f6ac97 docs: move training backends section (#580) (ashors1, Jun 30, 2025)
- 2975315 docs: Add a note on supported backends (#553) (ashors1, Jun 28, 2025)
- 26f8fb2 docs: move training backends section (#580) (ashors1, Jun 30, 2025)
- 1055f5e Update more docs for the new eval. (xxman-google, Jun 30, 2025)
- 788c628 Merge branch 'main' into xx/new_eval (yuki-97, Jun 30, 2025)
- aaa3eeb fix lint errors. (xxman-google, Jul 2, 2025)
- 0d77a15 add missing copyright statements. (xxman-google, Jul 2, 2025)
- 17fe405 add missing copyright statements. (xxman-google, Jul 2, 2025)
- cf828d6 docs: Add missing arguments to DeepScaler evaluation (#502) (butsugiri, Jun 30, 2025)
- 01c3840 fix: prevent divisible error by dropping last batch in loader (#583) (wedu-nvidia, Jun 30, 2025)
- 658437d feat: improve worker group args/kwargs (#539) (yuki-97, Jun 30, 2025)
- 2eb0301 fix: update gemma3 prefix (#585) (ashors1, Jun 30, 2025)
- bc234a3 fix: Added copyright to functest (#584) (SahilJain314, Jul 1, 2025)
- 2d876de chore: Update github url after org transfer (#512) (chtruong814, Jul 2, 2025)
- ddac07c feat: add OpenAI format dataset for SFT (#485) (AtsunoriFujita, Jul 2, 2025)
- 283074a fix: load HF model only on rank 0 (#544) (parthchadha, Jul 2, 2025)
- e78af38 feat: support async in non-colocated (#523) (yuki-97, Jun 27, 2025)
- 4cd4568 feat: Add megatron to hf converter (#555) (ashors1, Jun 27, 2025)
- c44efc0 Merge branch 'main' into xx/new_eval (xxman-google, Jul 2, 2025)
README.md (2 changes: 1 addition & 1 deletion)

@@ -377,7 +377,7 @@ uv run python examples/run_eval.py \
 ```
 > **Note:** Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.
 
-Refer to `examples/configs/eval.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the [Evaluation documentation](docs/guides/eval.md).
+Refer to `examples/configs/evals/eval.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the [Evaluation documentation](docs/guides/eval.md).
 
 ## Set Up Clusters
 
docs/guides/eval.md (25 changes: 20 additions & 5 deletions)

@@ -25,7 +25,7 @@ Once the conversion is complete, you can override the `generation.model_name` to
 ### Prepare the Evaluation Configuration
 **Override with Custom Settings**
 
-To run the evaluation, you can use the [default configuration file](../../examples/configs/eval.yaml). Alternatively, you can specify a custom one or override some settings via the command line.
+To run the evaluation, you can use the [default configuration file](../../examples/configs/evals/eval.yaml). Alternatively, you can specify a custom one or override some settings via the command line.
 
 The default configuration employs greedy sampling to evaluate Qwen2.5-Math-1.5B-Instruct on AIME-2024.
 
@@ -42,7 +42,7 @@ We will use the `run_eval.py` script to run an evaluation using a model directly
 Note that the evaluation script only supports the Hugging Face format model. If you haven't converted your DCP format model, you should go back to [Convert DCP to HF](#convert-dcp-to-hf-optional) and follow the guide to convert your model.
 
 ```sh
-# Run evaluation script with default config (examples/configs/eval.yaml)
+# Run evaluation script with default config (examples/configs/evals/eval.yaml)
 uv run python examples/run_eval.py
 
 # Run evaluation script with converted model
@@ -51,16 +51,22 @@ uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
 # Run evaluation script with custom config file
 uv run python examples/run_eval.py --config path/to/custom_config.yaml
 
+# Run evaluation script on one of the supported benchmarks (e.g., GPQA)
+uv run python examples/run_eval.py --config examples/configs/evals/gpqa_eval.yaml
+
+# Run evaluation script with a local dataset that is prefetched as a CSV file
+uv run python examples/run_eval.py --config examples/configs/evals/local_eval.yaml
+
 # Override specific config values via command line
 # Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
 # Pass@1 accuracy averaged over 16 samples for each problem
 uv run python examples/run_eval.py \
+    --config examples/configs/evals/math_eval.yaml \
     generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
    generation.temperature=0.6 \
    generation.top_p=0.95 \
-    generation.vllm_cfg.max_model_len=32768 \
-    data.dataset_name=HuggingFaceH4/MATH-500 \
-    data.dataset_key=test \
+    generation.vllm_cfg.max_model_len=32768 \
+    data.dataset_name="math500" \
    eval.num_tests_per_prompt=16 \
    cluster.gpus_per_node=8
 ```
@@ -80,3 +86,12 @@ metric='pass@1' num_tests_per_prompt=1
 score=0.1000 (3.0/30)
 ============================================================
 ```
+
+## List of currently supported benchmarks
+
+- [AIME-2024](../../nemo_rl/data/eval_datasets/aime2024.py)
+- [GPQA and GPQA-diamond](../../nemo_rl/data/eval_datasets/gpqa.py)
+- [MATH and MATH-500](../../nemo_rl/data/eval_datasets/math.py)
+- [MMLU](../../nemo_rl/data/eval_datasets/mmlu.py)
+- [MMLU-Pro](../../nemo_rl/data/eval_datasets/mmlu_pro.py)
+
docs/guides/grpo.md (2 changes: 1 addition & 1 deletion)

@@ -67,7 +67,7 @@ def my_data_processor(
 ) -> DatumSpec:
 ```
 
-We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py)
+We have an example of this as `math_data_processor` in [processors.py](../../nemo_rl/data/processors.py)
 
 #### Putting it all together
 
docs/guides/sft-openmathinstruct2.md (2 changes: 1 addition & 1 deletion)

@@ -38,7 +38,7 @@ To evaluate on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingF
 
 ```
 uv run examples/run_eval.py \
-    --config=examples/configs/eval.yaml \
+    --config=examples/configs/evals/eval.yaml \
     generation.model_name=results/sft_openmathinstruct2/step_1855/hf \
     tokenizer.name=meta-llama/Llama-3.1-8B-Instruct \
     data.dataset_name=HuggingFaceH4/MATH-500 \
@@ -40,10 +40,7 @@ data:
   max_input_seq_length: ${generation.vllm_cfg.max_model_len} # unused since we directly use prompts in evaluation
   prompt_file: null
   system_prompt_file: null
-  dataset_name: "HuggingFaceH4/aime_2024"
-  dataset_key: "train"
-  problem_key: "problem"
-  solution_key: "answer"
+  dataset_name: "aime2024"
 
 env:
   math:
examples/configs/evals/gpqa_eval.yaml (15 changes: 15 additions & 0 deletions)

@@ -0,0 +1,15 @@
+# GPQA evaluation configuration
+defaults: "eval.yaml"
+
+generation:
+  model_name: "Qwen/Qwen2.5-7B-Instruct"
+  vllm_cfg:
+    max_model_len: 3072
+
+data:
+  prompt_file: "examples/prompts/gpqa.txt"
+  dataset_name: "gpqa"
+
+env:
+  math:
+    verifier_type: "multichoice"
examples/configs/evals/local_eval.yaml (14 changes: 14 additions & 0 deletions)

@@ -0,0 +1,14 @@
+# Evaluation configuration from local files
+defaults: "eval.yaml"
+
+generation:
+  model_name: "Qwen/Qwen2.5-7B-Instruct"
+
+data:
+  prompt_file: "examples/prompts/cot.txt"
+  dataset_name: "local"
+  problem_key: "Question"
+  solution_key: "Answer"
+  split: "train"
+  data_paths: "https://openaipublic.blob.core.windows.net/simple-evals/math_500_test.csv"
+  file_format: "csv"
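The `dataset_name: "local"` path evidently reads `data_paths` in the given `file_format` and maps columns via `problem_key`/`solution_key`. A hypothetical invocation with a CSV on disk instead of the prefetched URL, using the same CLI override syntax shown elsewhere in this PR (the local path and column names are illustrative):

```sh
# Illustrative only: point data_paths at a local CSV with Question/Answer columns
uv run python examples/run_eval.py \
    --config examples/configs/evals/local_eval.yaml \
    data.data_paths=/path/to/my_eval.csv \
    data.problem_key=Question \
    data.solution_key=Answer
```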
examples/configs/evals/math_eval.yaml (9 changes: 9 additions & 0 deletions)

@@ -0,0 +1,9 @@
+# Math evaluation configuration
+defaults: "eval.yaml"
+
+generation:
+  model_name: "Qwen/Qwen2.5-7B-Instruct"
+
+data:
+  prompt_file: "examples/prompts/cot.txt"
+  dataset_name: "math"
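Note the `defaults: "eval.yaml"` key: each of these new configs appears to layer on top of the base eval config, overriding only the keys it lists (the merge order is an assumption from the file layout). Command-line values then override both, as in this sketch, where `"math500"` is the dataset name used in the docs diff above:

```sh
# Sketch: config inheritance plus a CLI override to select MATH-500 over MATH
uv run python examples/run_eval.py \
    --config examples/configs/evals/math_eval.yaml \
    data.dataset_name="math500"
```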
examples/prompts/gpqa.txt (1 change: 1 addition & 0 deletions)

@@ -0,0 +1 @@
+Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
examples/prompts/mmlu.txt (1 change: 1 addition & 0 deletions)

@@ -0,0 +1 @@
+Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
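Both prompts pin the model to a final `Answer: $LETTER` line, which is what makes the new `multichoice` verifier (the verify worker added in commit 67aae53, enabled via `verifier_type: "multichoice"` in gpqa_eval.yaml) tractable. A minimal sketch of what such a check could look like; this is illustrative only, not the PR's actual implementation:

```python
import re


def extract_choice(response: str) -> str | None:
    """Return the letter from the last 'Answer: X' occurrence, or None if absent."""
    matches = re.findall(r"Answer:\s*([ABCD])\b", response)
    return matches[-1] if matches else None


def verify_multichoice(response: str, ground_truth: str) -> float:
    """Reward 1.0 when the extracted letter matches the ground-truth letter."""
    choice = extract_choice(response)
    return 1.0 if choice == ground_truth.strip().upper() else 0.0


assert verify_multichoice("Step by step reasoning...\nAnswer: B", "B") == 1.0
assert verify_multichoice("No final answer line given.", "C") == 0.0
```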
examples/run_eval.py (53 changes: 20 additions & 33 deletions)

@@ -19,23 +19,22 @@
 
 sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
-from datasets import load_dataset
 from omegaconf import OmegaConf
-from transformers import AutoTokenizer
+from transformers import AutoTokenizer, PreTrainedTokenizerBase
 
-from examples.run_grpo_math import math_data_processor
 from nemo_rl.algorithms.utils import get_tokenizer
-from nemo_rl.data import MathDataConfig
 from nemo_rl.data.datasets import AllTaskProcessedDataset
-from nemo_rl.data.interfaces import TaskDataSpec
-from nemo_rl.data.llm_message_utils import remap_dataset_keys
+from nemo_rl.data.eval_datasets import load_eval_dataset
 from nemo_rl.distributed.ray_actor_environment_registry import (
     get_actor_python_env,
 )
 from nemo_rl.distributed.virtual_cluster import init_ray
 from nemo_rl.environments.math_environment import MathEnvironment
 from nemo_rl.evals.eval import MasterConfig, run_env_eval, setup
 from nemo_rl.models.generation import configure_generation_config
+from nemo_rl.utils.config import load_config
+
+TokenizerType = PreTrainedTokenizerBase
 
 
 def parse_args():
@@ -54,28 +53,14 @@ def parse_args():
     return args, overrides
 
 
-def setup_data(tokenizer: AutoTokenizer, data_config: MathDataConfig, env_configs):
-    print("\n▶ Setting up data...")
-    math_task_spec = TaskDataSpec(
-        task_name="math",
-        prompt_file=data_config["prompt_file"],
-        system_prompt_file=data_config["system_prompt_file"],
-    )
+def setup_data(tokenizer: AutoTokenizer, data_config, env_configs):
+    print("Setting up data...")
 
     # load dataset
-    base_dataset = load_dataset(data_config["dataset_name"])
-    if data_config["dataset_key"] is not None:
-        base_dataset = base_dataset[data_config["dataset_key"]]
-    # remap problem and solution keys
-    remapped_dataset = remap_dataset_keys(
-        base_dataset,
-        mapping_dict={
-            data_config["problem_key"]: "problem",
-            data_config["solution_key"]: "expected_answer",
-        },
-    )
+    base_dataset = load_eval_dataset(data_config)
+    rekeyed_ds = base_dataset.rekeyed_ds
 
-    math_env = MathEnvironment.options(
+    env = MathEnvironment.options(
         runtime_env={
             "py_executable": get_actor_python_env(
                 "nemo_rl.environments.math_environment.MathEnvironment"
@@ -84,14 +69,14 @@ def setup_data(tokenizer: AutoTokenizer, data_config: MathDataConfig, env_config
     ).remote(env_configs["math"])
 
     dataset = AllTaskProcessedDataset(
-        dataset=remapped_dataset,
+        dataset=rekeyed_ds,
         tokenizer=tokenizer,
-        default_task_data_spec=math_task_spec,
-        task_data_processors=math_data_processor,
+        default_task_data_spec=base_dataset.task_spec,
+        task_data_processors=base_dataset.processor,
         max_seq_length=data_config["max_input_seq_length"],
     )
 
-    return dataset, math_env, tokenizer
+    return dataset, env, tokenizer
 
 
 def main():
@@ -100,9 +85,11 @@ def main():
     args, overrides = parse_args()
 
     if not args.config:
-        args.config = os.path.join(os.path.dirname(__file__), "configs", "eval.yaml")
+        args.config = os.path.join(
+            os.path.dirname(__file__), "configs", "evals", "eval.yaml"
+        )
 
-    config = OmegaConf.load(args.config)
+    config = load_config(args.config)
     print(f"Loaded configuration from: {args.config}")
 
     if overrides:
@@ -129,7 +116,7 @@ def main():
     # Setup data
     (
         dataset,
-        math_env,
+        env,
         tokenizer,
     ) = setup_data(tokenizer, config["data"], config["env"])
 
@@ -144,7 +131,7 @@ def main():
     run_env_eval(
         vllm_generation,
         dataloader,
-        math_env,
+        env,
         master_config,
     )
 
examples/run_grpo_math.py (71 changes: 1 addition & 70 deletions)

@@ -16,9 +16,8 @@
 import os
 import pprint
 from collections import defaultdict
-from typing import Any, Optional, cast
+from typing import Any, Optional
 
-import torch
 from omegaconf import OmegaConf
 from transformers import PreTrainedTokenizerBase
 
@@ -116,74 +115,6 @@ def hf_data_processor(
     return output
 
 
-# Example of a generic math data processor
-# TaskDataProcessFnCallable
-def math_data_processor(
-    datum_dict: dict[str, Any],
-    task_data_spec: TaskDataSpec,
-    tokenizer: TokenizerType,
-    max_seq_length: int,
-    idx: int,
-) -> DatumSpec:
-    """Process a datum dictionary (directly loaded from dataset) into a DatumSpec for the Math Environment."""
-    problem = datum_dict["problem"]
-    solution = str(datum_dict["expected_answer"])
-    extra_env_info = {"ground_truth": solution}
-
-    message_log: LLMMessageLogType = []
-
-    # system prompt
-    if task_data_spec.system_prompt:
-        sys_prompt: dict[str, str | torch.Tensor] = {
-            "role": "system",
-            "content": task_data_spec.system_prompt,
-        }
-        sys = tokenizer.apply_chat_template(
-            [cast(dict[str, str], sys_prompt)],
-            tokenize=False,
-            add_generation_prompt=False,
-            add_special_tokens=False,
-        )
-        sys_prompt["token_ids"] = tokenizer(sys, return_tensors="pt")["input_ids"][0]
-        message_log.append(sys_prompt)
-
-    # user prompt
-    if task_data_spec.prompt:
-        problem = task_data_spec.prompt.format(problem)
-    user_message = {"role": "user", "content": problem}
-    message = tokenizer.apply_chat_template(
-        [user_message],
-        tokenize=False,
-        add_generation_prompt=True,
-        add_special_tokens=False,
-    )
-    user_message["token_ids"] = tokenizer(message, return_tensors="pt")["input_ids"][0]
-    user_message["content"] = message
-    message_log.append(user_message)
-
-    length = sum(len(m["token_ids"]) for m in message_log)
-
-    loss_multiplier = 1.0
-    if length > max_seq_length:
-        # make smaller and mask out
-        for indiv_message in message_log:
-            indiv_message["token_ids"] = indiv_message["token_ids"][
-                : min(4, max_seq_length // len(message_log))
-            ]
-        loss_multiplier = 0.0
-
-    output: DatumSpec = {
-        "message_log": message_log,
-        "length": length,
-        "extra_env_info": extra_env_info,
-        "loss_multiplier": loss_multiplier,
-        "idx": idx,
-    }
-    if "task_name" in datum_dict:
-        output["task_name"] = datum_dict["task_name"]
-    return output
-
-
 def setup_data(
     tokenizer: TokenizerType,
     data_config: DataConfig,