feat: Add DAPO dataset and Deepseek-v3 config #1281
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
📝 Walkthrough
Adds two GRPO Megatron config recipes, introduces the DAPO Math 17K dataset loader/formatter, adds a DAPO math verifier module with normalization and scoring, and updates math environment verification to optionally use the new verifier via a configuration flag.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Trainer
participant Env as MathEnvironment
participant Worker as HFVerifyWorker
participant DapoV as dapo_math_verifier
participant Legacy as verify_func
Trainer->>Env: step(..., config.use_dapo_math_verifier)
Env->>Worker: verify(pred_responses, ground_truths, return_extracted_answer, use_dapo_math_verifier)
alt use_dapo_math_verifier = true
Worker->>DapoV: compute_score(solution_str, ground_truth, strict_box_verify?)
DapoV-->>Worker: {score, acc, pred_answer}
else
Worker->>Legacy: verify_func(pred, ground_truth_parsable)
Legacy-->>Worker: score/extracted_answer
end
Worker-->>Env: scores, (optional) extracted answers
Env-->>Trainer: step result
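The branch in the diagram can be sketched in Python as follows. This is a minimal illustration, not the PR's code: `toy_dapo_score` and `toy_legacy_score` are hypothetical stand-ins for the real verifiers, and the `"Answer:"` marker, the 1.0 / -1.0 scores, and the `"pred"` key are assumptions made for the example.

```python
def toy_dapo_score(solution_str: str, ground_truth: str) -> dict:
    """Hypothetical DAPO-style scorer: compares the text after the last 'Answer:'."""
    pred = solution_str.rsplit("Answer:", 1)[-1].strip()
    correct = pred == ground_truth.strip()
    # Assumed reward convention for this sketch only: +1.0 correct, -1.0 wrong.
    return {"score": 1.0 if correct else -1.0, "acc": correct, "pred": pred}


def toy_legacy_score(pred: str, ground_truth: str) -> float:
    """Hypothetical legacy scorer: 1.0 if the ground truth appears in the response."""
    return 1.0 if ground_truth.strip() in pred else 0.0


def verify(
    pred_responses: list[str],
    ground_truths: list[str],
    use_dapo_math_verifier: bool,
) -> list[float]:
    """Score each response with the verifier selected by the flag."""
    scores = []
    for response, truth in zip(pred_responses, ground_truths):
        if use_dapo_math_verifier:
            scores.append(toy_dapo_score(response, truth)["score"])
        else:
            scores.append(toy_legacy_score(response, truth))
    return scores
```

Keeping the flag as an explicit argument means a single worker can serve both verifiers, and the choice stays visible at the call site.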
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/environments/math_environment.py (1)
97-123: Fix extracted-answer handling for DAPO verifier.
When return_extracted_answer is True, the DAPO branch assigns reward_dict["pred"] (a plain string) to extracted_answer, so the later assertions (len(extracted_answer) == 2) and indexing (extracted_prediction[0][0]) crash. Wrap the DAPO prediction in the same (extracted_gold, extracted_prediction) structure returned by math_metric before entering the shared post-processing.
Apply this diff:
-if use_dapo_math_verifier:
-    # This compute_score is from the DAPO Math Verifier from Verl
-    reward_dict = dapo_math_verify(response, ground_truth)
-    ret_score = reward_dict["score"]
-    extracted_answer = reward_dict["pred"]
+if use_dapo_math_verifier:
+    # This compute_score is from the DAPO Math Verifier from Verl
+    reward_dict = dapo_math_verify(response, ground_truth)
+    ret_score = reward_dict["score"]
+    extracted_pred = reward_dict["pred"]
+    if return_extracted_answer:
+        extracted_answer = (
+            [ground_truth],
+            [[extracted_pred]],
+        )
+    else:
+        extracted_answer = None
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
examples/configs/grpo_dapomath17k_1B_megatron.yaml (1 hunks)
examples/configs/grpo_dapomath17k_dsv3_megatron.yaml (1 hunks)
nemo_rl/data/datasets/response_datasets/__init__.py (3 hunks)
nemo_rl/data/datasets/response_datasets/dapo_math.py (1 hunks)
nemo_rl/environments/dapo_math_verifier.py (1 hunks)
nemo_rl/environments/math_environment.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/environments/dapo_math_verifier.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/dapo_math.py
nemo_rl/environments/math_environment.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/environments/dapo_math_verifier.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/dapo_math.py
nemo_rl/environments/math_environment.py
examples/configs/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml
Files:
examples/configs/grpo_dapomath17k_1B_megatron.yaml
examples/configs/grpo_dapomath17k_dsv3_megatron.yaml
🧬 Code graph analysis (2)
nemo_rl/environments/dapo_math_verifier.py (1)
nemo_rl/environments/math_environment.py (3)
verify (73-134)
verify (139-179)
verify (184-221)
nemo_rl/environments/math_environment.py (1)
nemo_rl/environments/dapo_math_verifier.py (1)
compute_score(248-282)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Coverage (e2e)
- GitHub Check: Coverage (doc-test)
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do?
This PR pulls in the DAPO dataset from #602 and also adds a config for running Deepseek-V3 on this dataset.
Closes #1142
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Sample run with Deepseek-V3 training on DAPO-Math-17k and validation on AIME-2024:
Summary by CodeRabbit
New Features
Configuration