From 6ceed31b026926690305f3da8a0c0603f4e4943c Mon Sep 17 00:00:00 2001
From: Andrew Schilling
Date: Fri, 25 Apr 2025 14:44:14 +0000
Subject: [PATCH 1/2] First pass at new build issues

Signed-off-by: Andrew Schilling
---
 .github/workflows/cicd-main.yml |  2 +-
 README.md                       | 32 ++++++++++++++++----------------
 docs/design-docs/gpu-logger.md  |  0
 docs/design-docs/index.md       | 12 ------------
 docs/guides/grpo.md             |  2 +-
 docs/guides/index.md            |  9 ---------
 6 files changed, 18 insertions(+), 39 deletions(-)
 delete mode 100644 docs/design-docs/gpu-logger.md
 delete mode 100644 docs/design-docs/index.md
 delete mode 100644 docs/guides/index.md

diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
index e3bee5e7f6..318f2e8d21 100644
--- a/.github/workflows/cicd-main.yml
+++ b/.github/workflows/cicd-main.yml
@@ -128,7 +128,7 @@ jobs:
         run: |
           pip install uv
           cd docs/
-          uv run --group docs sphinx-build . _build/html
+          uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html
 
   build-container:
     if: ${{ needs.pre-flight.outputs.test_level != 'none' }}
diff --git a/README.md b/README.md
index 3381fef9f7..f6cee030e4 100644
--- a/README.md
+++ b/README.md
@@ -3,18 +3,18 @@
 - [Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-reinforcer-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
   - [Features](#features)
-  - [Prerequisuites](#prerequisuites)
+  - [Prerequisites](#prerequisites)
   - [Quick start](#quick-start)
     - [GRPO](#grpo)
-      - [Single Node](#single-node)
-      - [Multi-node](#multi-node)
-      - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
+      - [Single Node](#grpo-single-node)
+      - [Multi-node](#grpo-multi-node)
+      - [GRPO Qwen2.5-32B](#grpo-qwen2-5-32b)
     - [SFT](#sft)
-      - [Single Node](#single-node-1)
-      - [Multi-node](#multi-node-1)
+      - [Single Node](#sft-single-node)
+      - [Multi-node](#sft-multi-node)
     - [DPO](#dpo)
-      - [Single Node](#single-node-2)
-      - [Multi-node](#multi-node-2)
+      - [Single Node](#dpo-single-node)
+      - [Multi-node](#dpo-multi-node)
   - [Cluster Start](#cluster-start)
 
 **Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
@@ -48,7 +48,7 @@ What you can expect:
 - 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
 - 🔜 **MoE Models** - Support DeepseekV3 and Llama4
 
-## Prerequisuites
+## Prerequisites
 
 ```sh
 # For faster setup and environment isolation, we use `uv`
 pip install uv
@@ -73,7 +73,7 @@ pip install uv
 
 We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.
 
-#### Single Node
+#### GRPO Single Node
 
 To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
@@ -101,7 +101,7 @@ uv run python examples/run_grpo_math.py \
   logger.num_val_samples_to_print=10 \
 ```
 
-#### Multi-node
+#### GRPO Multi-node
 
 ```sh
 # Run from the root of NeMo-Reinforcer repo
@@ -149,7 +149,7 @@ sbatch \
 
 We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
 
-#### Single Node
+#### SFT Single Node
 
 The default SFT experiment is configured to run on a single GPU. To launch the experiment,
@@ -171,7 +171,7 @@ uv run python examples/run_sft.py \
 
 Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.
 
-#### Multi-node
+#### SFT Multi-node
 
 ```sh
 # Run from the root of NeMo-Reinforcer repo
@@ -194,7 +194,7 @@ sbatch \
 
 We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.
 
-#### Single Node
+#### DPO Single Node
 
 The default DPO experiment is configured to run on a single GPU. To launch the experiment:
 
 ```sh
 # Run the DPO experiment
 uv run python examples/run_dpo.py
 ```
@@ -224,9 +224,9 @@ uv run python examples/run_dpo.py \
   logger.wandb.name="llama-dpo-sft"
 ```
 
-Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
+Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
 
-#### Multi-node
+#### DPO Multi-node
 
 For distributed DPO training across multiple nodes, modify the following script for your use case:
diff --git a/docs/design-docs/gpu-logger.md b/docs/design-docs/gpu-logger.md
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/docs/design-docs/index.md b/docs/design-docs/index.md
deleted file mode 100644
index e178a61002..0000000000
--- a/docs/design-docs/index.md
+++ /dev/null
@@ -1,12 +0,0 @@
-```{toctree}
-:caption: 📐 Design Docs
-:hidden:
-
-design-and-philosophy.md
-padding.md
-logger.md
-uv.md
-chat-datasets.md
-generation.md
-checkpointing.md
-```
\ No newline at end of file
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 716e609642..6a0a373b6c 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -151,7 +151,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro
 
 #### Importance Sampling Correction
 
-The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding_new_models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
 
 Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of loss function. Then,
diff --git a/docs/guides/index.md b/docs/guides/index.md
deleted file mode 100644
index 4276cc8d22..0000000000
--- a/docs/guides/index.md
+++ /dev/null
@@ -1,9 +0,0 @@
-```{toctree}
-:caption: 📚 Guides
-:hidden:
-
-adding-new-models.md
-sft.md
-grpo.md
-eval.md
-```
\ No newline at end of file

From 0cd7d972883d61f0559ca888cf277dee34282116 Mon Sep 17 00:00:00 2001
From: Andrew Schilling
Date: Fri, 25 Apr 2025 14:59:06 +0000
Subject: [PATCH 2/2] Adjusting myst_heading_anchors in conf.py

Signed-off-by: Andrew Schilling
---
 README.md    | 2 +-
 docs/conf.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f6cee030e4..99d4b775af 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
     - [GRPO](#grpo)
       - [Single Node](#grpo-single-node)
       - [Multi-node](#grpo-multi-node)
-      - [GRPO Qwen2.5-32B](#grpo-qwen2-5-32b)
+      - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
     - [SFT](#sft)
       - [Single Node](#sft-single-node)
       - [Multi-node](#sft-multi-node)
diff --git a/docs/conf.py b/docs/conf.py
index c9f61d4faf..7dd9941c3e 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -53,7 +53,7 @@
     "fieldlist",  # Enables field lists for metadata like :author: Name
     "tasklist",  # Adds support for GitHub-style task lists with [ ] and [x]
 ]
-myst_heading_anchors = 4  # Generates anchor links for headings up to level 4
+myst_heading_anchors = 5  # Generates anchor links for headings up to level 5
 
 # -- Options for Autodoc2 ---------------------------------------------------
 sys.path.insert(0, os.path.abspath(".."))
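A note on the anchor renames in this series: `myst_heading_anchors` tells MyST the maximum heading depth for which it auto-generates GitHub-style anchor slugs, and those slugs drop punctuation rather than hyphenating it, which is why PATCH 2/2 settles on `#grpo-qwen25-32b` instead of PATCH 1/2's `#grpo-qwen2-5-32b`. The sketch below is a rough approximation of that GitHub-style slugging for sanity-checking TOC links by hand; it is not MyST's actual implementation, and edge cases (Unicode, duplicate headings) are ignored.

```python
import re

def slugify(heading: str) -> str:
    """Approximate the GitHub-style heading-to-anchor slugging that MyST mimics."""
    slug = heading.strip().lower()        # lowercase the heading text
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation, e.g. the '.' in 'Qwen2.5'
    return re.sub(r" +", "-", slug)       # collapse spaces into hyphens

# The renamed README headings resolve to the anchors used in the TOC:
assert slugify("GRPO Single Node") == "grpo-single-node"
assert slugify("GRPO Qwen2.5-32B") == "grpo-qwen25-32b"  # the PATCH 2/2 anchor
```

With `--fail-on-warning` now passed to sphinx-build in CI, any TOC entry whose slug disagrees with this scheme surfaces as a broken cross-reference and fails the docs build rather than slipping through.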