From 6ceed31b026926690305f3da8a0c0603f4e4943c Mon Sep 17 00:00:00 2001
From: Andrew Schilling
Date: Fri, 25 Apr 2025 14:44:14 +0000
Subject: [PATCH 1/2] First pass at new build issues

Signed-off-by: Andrew Schilling
---
 .github/workflows/cicd-main.yml |  2 +-
 README.md                       | 32 ++++++++++++++++----------------
 docs/design-docs/gpu-logger.md  |  0
 docs/design-docs/index.md       | 12 ------------
 docs/guides/grpo.md             |  2 +-
 docs/guides/index.md            |  9 ---------
 6 files changed, 18 insertions(+), 39 deletions(-)
 delete mode 100644 docs/design-docs/gpu-logger.md
 delete mode 100644 docs/design-docs/index.md
 delete mode 100644 docs/guides/index.md

diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
index e3bee5e7f6..318f2e8d21 100644
--- a/.github/workflows/cicd-main.yml
+++ b/.github/workflows/cicd-main.yml
@@ -128,7 +128,7 @@ jobs:
         run: |
           pip install uv
           cd docs/
-          uv run --group docs sphinx-build . _build/html
+          uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html
 
   build-container:
     if: ${{ needs.pre-flight.outputs.test_level != 'none' }}
diff --git a/README.md b/README.md
index 3381fef9f7..f6cee030e4 100644
--- a/README.md
+++ b/README.md
@@ -3,18 +3,18 @@
 - [Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-reinforcer-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
   - [Features](#features)
-  - [Prerequisuites](#prerequisuites)
+  - [Prerequisites](#prerequisites)
   - [Quick start](#quick-start)
     - [GRPO](#grpo)
-      - [Single Node](#single-node)
-      - [Multi-node](#multi-node)
-      - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
+      - [Single Node](#grpo-single-node)
+      - [Multi-node](#grpo-multi-node)
+      - [GRPO Qwen2.5-32B](#grpo-qwen2-5-32b)
     - [SFT](#sft)
-      - [Single Node](#single-node-1)
-      - [Multi-node](#multi-node-1)
+      - [Single Node](#sft-single-node)
+      - [Multi-node](#sft-multi-node)
     - [DPO](#dpo)
-      - [Single Node](#single-node-2)
-      - [Multi-node](#multi-node-2)
+      - [Single Node](#dpo-single-node)
+      - [Multi-node](#dpo-multi-node)
   - [Cluster Start](#cluster-start)
 
 **Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
@@ -48,7 +48,7 @@ What you can expect:
 - 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
 - 🔜 **MoE Models** - Support DeepseekV3 and Llama4
 
-## Prerequisuites
+## Prerequisites
 
 ```sh
 # For faster setup and environment isolation, we use `uv`
 pip install uv
@@ -73,7 +73,7 @@ pip install uv
 
 We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.
 
-#### Single Node
+#### GRPO Single Node
 
 To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
@@ -101,7 +101,7 @@ uv run python examples/run_grpo_math.py \
   logger.num_val_samples_to_print=10 \
 ```
 
-#### Multi-node
+#### GRPO Multi-node
 
 ```sh
 # Run from the root of NeMo-Reinforcer repo
@@ -149,7 +149,7 @@ sbatch \
 
 We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
 
-#### Single Node
+#### SFT Single Node
 
 The default SFT experiment is configured to run on a single GPU. To launch the experiment,
@@ -171,7 +171,7 @@ uv run python examples/run_sft.py \
 
 Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.
 
-#### Multi-node
+#### SFT Multi-node
 
 ```sh
 # Run from the root of NeMo-Reinforcer repo
@@ -194,7 +194,7 @@ sbatch \
 
 We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.
 
-#### Single Node
+#### DPO Single Node
 
 The default DPO experiment is configured to run on a single GPU. To launch the experiment:
 
 ```sh
 # Run the DPO experiment
 uv run python examples/run_dpo.py
 ```
@@ -224,9 +224,9 @@ uv run python examples/run_dpo.py \
   logger.wandb.name="llama-dpo-sft"
 ```
 
-Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
+Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
 
-#### Multi-node
+#### DPO Multi-node
 
 For distributed DPO training across multiple nodes, modify the following script for your use case:
diff --git a/docs/design-docs/gpu-logger.md b/docs/design-docs/gpu-logger.md
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/docs/design-docs/index.md b/docs/design-docs/index.md
deleted file mode 100644
index e178a61002..0000000000
--- a/docs/design-docs/index.md
+++ /dev/null
@@ -1,12 +0,0 @@
-```{toctree}
-:caption: 📐 Design Docs
-:hidden:
-
-design-and-philosophy.md
-padding.md
-logger.md
-uv.md
-chat-datasets.md
-generation.md
-checkpointing.md
-```
\ No newline at end of file
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 716e609642..6a0a373b6c 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -151,7 +151,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro
 
 #### Importance Sampling Correction
 
-The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding_new_models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
 
 Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of loss function. Then,
diff --git a/docs/guides/index.md b/docs/guides/index.md
deleted file mode 100644
index 4276cc8d22..0000000000
--- a/docs/guides/index.md
+++ /dev/null
@@ -1,9 +0,0 @@
-```{toctree}
-:caption: 📚 Guides
-:hidden:
-
-adding-new-models.md
-sft.md
-grpo.md
-eval.md
-```
\ No newline at end of file

From 0cd7d972883d61f0559ca888cf277dee34282116 Mon Sep 17 00:00:00 2001
From: Andrew Schilling
Date: Fri, 25 Apr 2025 14:59:06 +0000
Subject: [PATCH 2/2] Adjusting myst_heading_anchors in conf.py

Signed-off-by: Andrew Schilling
---
 README.md    | 2 +-
 docs/conf.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f6cee030e4..99d4b775af 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
     - [GRPO](#grpo)
       - [Single Node](#grpo-single-node)
       - [Multi-node](#grpo-multi-node)
-      - [GRPO Qwen2.5-32B](#grpo-qwen2-5-32b)
+      - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
     - [SFT](#sft)
       - [Single Node](#sft-single-node)
       - [Multi-node](#sft-multi-node)
diff --git a/docs/conf.py b/docs/conf.py
index c9f61d4faf..7dd9941c3e 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -53,7 +53,7 @@
     "fieldlist",  # Enables field lists for metadata like :author: Name
     "tasklist",  # Adds support for GitHub-style task lists with [ ] and [x]
 ]
-myst_heading_anchors = 4  # Generates anchor links for headings up to level 4
+myst_heading_anchors = 5  # Generates anchor links for headings up to level 5
 
 # -- Options for Autodoc2 ---------------------------------------------------
 sys.path.insert(0, os.path.abspath(".."))
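A note on the anchor renames in this series: `myst_heading_anchors` tells MyST the maximum heading depth for which it auto-generates GitHub-style anchor slugs, and those slugs drop punctuation rather than hyphenating it, which is why PATCH 2/2 settles on `#grpo-qwen25-32b` instead of PATCH 1/2's `#grpo-qwen2-5-32b`. The sketch below is a rough approximation of that GitHub-style slugging for sanity-checking TOC links by hand; it is not MyST's actual implementation, and edge cases (Unicode, duplicate headings) are ignored.

```python
import re

def slugify(heading: str) -> str:
    """Approximate the GitHub-style heading-to-anchor slugging that MyST mimics."""
    slug = heading.strip().lower()        # lowercase the heading text
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation, e.g. the '.' in 'Qwen2.5'
    return re.sub(r" +", "-", slug)       # collapse spaces into hyphens

# The renamed README headings resolve to the anchors used in the TOC:
assert slugify("GRPO Single Node") == "grpo-single-node"
assert slugify("GRPO Qwen2.5-32B") == "grpo-qwen25-32b"  # the PATCH 2/2 anchor
```

With `--fail-on-warning` now passed to sphinx-build in CI, any TOC entry whose slug disagrees with this scheme surfaces as a broken cross-reference and fails the docs build rather than slipping through.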