From 60ba494adace12150cc2ada55f2f391a6219362c Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Thu, 1 May 2025 15:17:38 -0700
Subject: [PATCH 01/36] tech edit

Signed-off-by: Terry Kong
---
 README.md                                 | 113 +++++++++++++---------
 docs/adding-new-models.md                 |  40 ++++----
 docs/cluster.md                           |  27 +++---
 docs/design-docs/chat-datasets.md         |  12 ++-
 docs/design-docs/design-and-philosophy.md |  66 +++++++------
 docs/design-docs/generation.md            |  40 ++++----
 docs/design-docs/logger.md                |  40 ++++----
 docs/design-docs/padding.md               |  26 ++---
 docs/design-docs/uv.md                    |  34 +++----
 docs/docker.md                            |  10 +-
 docs/documentation.md                     |  14 +--
 docs/guides/eval.md                       |   9 +-
 docs/guides/grpo.md                       |  11 +--
 docs/guides/sft.md                        |  19 ++--
 docs/local-workstation.md                 |   6 +-
 docs/testing.md                           |  21 ++--
 16 files changed, 262 insertions(+), 226 deletions(-)

diff --git a/README.md b/README.md
index c857d84510..0952dac760 100644
--- a/README.md
+++ b/README.md
@@ -1,52 +1,55 @@
-# Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to >100B Parameters, scaling from 1 GPU to 100s
+# NeMo RL: A Scalable and Efficient Post-Training Library
 
-- [Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-rl-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
-  - [Features](#features)
+- [NeMo RL: A Scalable and Efficient Post-Training Library](#nemo-rl-a-scalable-and-efficient-post-training-library)
+  - [Features](#features)
   - [Prerequisites](#prerequisites)
-  - [Quick start](#quick-start)
   - [GRPO](#grpo)
-    - [Single Node](#grpo-single-node)
-    - [Multi-node](#grpo-multi-node)
+    - [GRPO Single Node](#grpo-single-node)
+    - [GRPO Multi-node](#grpo-multi-node)
     - [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
-  - [SFT](#sft)
-    - [Single Node](#sft-single-node)
-    - [Multi-node](#sft-multi-node)
+  - [Quickstart](#quickstart)
+  - [Supervised 
Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
+    - [Run Single Node SFT](#run-single-node-sft)
+    - [SFT Multi-node](#sft-multi-node)
   - [DPO](#dpo)
-    - [Single Node](#dpo-single-node)
-    - [Multi-node](#dpo-multi-node)
-  - [Cluster Start](#cluster-start)
+    - [DPO Single Node](#dpo-single-node)
+    - [DPO Multi-node](#dpo-multi-node)
+  - [Set Up Clusters](#set-up-clusters)
+  - [Citation](#citation)
+  - [Contributing](#contributing)
+  - [Licenses](#licenses)
 
-**Nemo-RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
+**NeMo RL** is a scalable and efficient post-training library that supports models from tiny to over 100 billion parameters and scales from a single GPU to thousands.
 
 What you can expect:
 
-- **Seamless integration with HuggingFace** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
-- **High-performance implementation with Megatron core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
+- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
+- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
 - **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations.
 - **Flexibility** with a modular design that allows easy integration and customization.
 - **Comprehensive documentation** that is both detailed and user-friendly, with practical examples. 
 ## Features
 
 ✅ _Available now_ | 🔜 _Coming in v0.3_
 
-- ✅ **Fast Generation** - vLLM backend for optimized inference
-- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama)
-- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure
+- ✅ **Fast Generation** - vLLM backend for optimized inference.
+- ✅ **Hugging Face Integration** - Works with 1-32B models (Qwen2.5, Llama).
+- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure.
 - ✅ **Environment Support** - Support for multi-environment training.
-- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization)
-- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
-- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters
-- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training
-- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
-- ✅ **Environment Isolation** - Dependency isolation between components
-
-- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron
-- 🔜 **Improved Native Performance** - Improve training time for Native Pytorch Models
-- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core
-- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
-- 🔜 **MoE Models** - Support DeepseekV3 and Llama4
+- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
+- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
+- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters.
+- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training. 
+- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
+- ✅ **Environment Isolation** - Dependency isolation between components.
+
+- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron.
+- 🔜 **Improved Native Performance** - Improved training time for native PyTorch models.
+- 🔜 **Megatron Policy** - Support for advanced training parallelism with Megatron Core.
+- 🔜 **Megatron Inference** - Megatron inference for day-0 support of new Megatron models.
+- 🔜 **MoE Models** - Support for DeepSeek-V3 and Llama 4.
 
 ## Prerequisites
 
@@ -72,9 +75,11 @@ pip install uv
 # Example: uv run python examples/run_grpo_math.py
 ```
 
-## Quick start
+**Important Notes:**
 
-**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+- Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
+- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.
+- **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. 
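A minimal shell sketch of the environment setup described in the notes above (the paths and key value are placeholders, not NeMo RL defaults; adjust them to your setup):

```shell
# Cache locations for Hugging Face models and datasets (example paths)
export HF_HOME="$HOME/.cache/huggingface"
export HF_DATASETS_CACHE="$HF_HOME/datasets"

# Only needed if you log experiments to Weights & Biases
export WANDB_API_KEY="<your-wandb-api-key>"

# Gated models (e.g., Llama) additionally require:
# huggingface-cli login
```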
 ### GRPO
 
@@ -111,7 +116,7 @@ uv run python examples/run_grpo_math.py \
 #### GRPO Multi-node
 
 ```sh
-# Run from the root of NeMo-RL repo
+# Run from the root of NeMo RL repo
 NUM_ACTOR_NODES=2
 
 # grpo_math_8b uses Llama-3.1-8B-Instruct model
@@ -131,7 +136,7 @@ sbatch \
 ##### GRPO Qwen2.5-32B
 
 ```sh
-# Run from the root of NeMo-RL repo
+# Run from the root of NeMo RL repo
 NUM_ACTOR_NODES=16
 
 # Download Qwen before the job starts to avoid spending time downloading during the training loop
@@ -158,21 +163,25 @@ Reference example for training to play a Sliding Puzzle Game:
 uv run python examples/run_grpo_sliding_puzzle.py
 ```
 
-### SFT
+## Quickstart
 
-We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
+Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`.
 
-#### SFT Single Node
+## Supervised Fine-Tuning (SFT)
 
-The default SFT experiment is configured to run on a single GPU. To launch the experiment,
+We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
+
+#### Run Single Node SFT
+
+The default SFT configuration is set to run on a single GPU. To start the experiment:
 
 ```sh
 uv run python examples/run_sft.py
 ```
 
-This trains `Llama3.2-1B` on one GPU using the SQUAD dataset.
+This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using 1 GPU.
 
-If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration. We also switch to an 8B Llama base model and increase the batch size:
+To use multiple GPUs on a single node, you can modify the cluster configuration. 
This adjustment also lets you increase the model size and batch size:
 
 ```sh
 uv run python examples/run_sft.py \
@@ -187,7 +196,7 @@ Refer to `examples/configs/sft.yaml` for a full list of parameters that can be o
 #### SFT Multi-node
 
 ```sh
-# Run from the root of NeMo-RL repo
+# Run from the root of NeMo RL repo
 NUM_ACTOR_NODES=2
 
 COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
@@ -244,7 +253,7 @@ Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters
 For distributed DPO training across multiple nodes, modify the following script for your use case:
 
 ```sh
-# Run from the root of NeMo-RL repo
+# Run from the root of NeMo RL repo
 ## number of nodes to use for your job
 NUM_ACTOR_NODES=2
@@ -262,19 +271,29 @@ sbatch \
 ray.sub
 ```
 
-## Cluster Start
+## Set Up Clusters
 
-Please visit [Cluster Start](docs/cluster.md) for how to get started on Slurm or Kubernetes.
+For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, refer to the [Cluster Start](docs/cluster.md) documentation.
 
 ## Citation
 
-If you use NeMo-RL in your research, please cite it using the following BibTeX entry:
+If you use NeMo RL in your research, please cite it using the following BibTeX entry:
 
 ```bibtex
 @misc{nemo-rl,
-title = {NeMo-RL: A Scalable and Efficient Post-Training Library},
+title = {NeMo RL: A Scalable and Efficient Post-Training Library},
 howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
 year = {2025},
 note = {GitHub repository},
 }
 ```
+
+## Contributing
+
+We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/nemo-rl/blob/main/CONTRIBUTING.md) for more information on how to get involved. 
+ +## Licenses + +NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/nemo-rl/blob/main/LICENSE). + +NeMo is licensed under the [NVIDIA AI PRODUCT AGREEMENT](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). By pulling and using the container, you accept the terms and conditions of this license. diff --git a/docs/adding-new-models.md b/docs/adding-new-models.md index 9afcb46cf9..d0265c3e4d 100644 --- a/docs/adding-new-models.md +++ b/docs/adding-new-models.md @@ -1,10 +1,10 @@ -# Adding New Models +# Add New Models -This guide outlines how to integrate and validate a new model within **NeMo-RL**. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. +This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. ## Importance of Log Probability Consistency in Training and Inference -In on-policy RL, we sample tokens (actions) from the latest version of the policy, meaning the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. +In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. 
As an example, we would see errors in naive KL estimation: @@ -24,33 +24,33 @@ where samples are drawn as $x \sim \pi_{\text{inference-framework}}$ As a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient. -## Understanding Discrepancies Between Backends +## Understand Discrepancies Between Backends When validating models across different backends, you may encounter discrepancies in log probabilities. These differences can stem from various sources with effects ranging from negligible to significant: - **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8). - - Training may use mixed precision while the inference backend may not - - High-precision training with FP8 inference may not be numerically stable for certain models - - Differences can occur at the layer level, with some layers in FP32 while others use lower precision + - Training may use mixed precision, while the inference backend may not. + - High-precision training with FP8 inference may not be numerically stable for certain models. + - Differences can occur at the layer level, with some layers in FP32, while others use lower precision. - **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented. 
- - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends - - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences - - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability + - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends. + - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences. + - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability. - **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation. - - In some cases, disabling the inference backend cache can resolve discrepancies + - In some cases, disabling the inference backend cache can resolve discrepancies. -- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations +- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations. -- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`) +- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`). - **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities. - - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels + - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels. 
-- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect - - If weights are reshaped or reordered incorrectly, generations tend to be very wrong - - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses +- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect. + - If weights are reshaped or reordered incorrectly, generations tend to be very wrong. + - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses. - **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1` @@ -69,7 +69,7 @@ When validating Hugging Face-based models, perform the following checks: Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches. - **Test parallelism** - Verify consistency with other parallelism settings. + Verify consistency with other parallelism settings. - **Variance** Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance. @@ -96,7 +96,7 @@ When validating Hugging Face-based models, perform the following checks: ### Additional Validation - **Compare Megatron outputs** - Ensure the Megatron forward pass aligns with HuggingFace and the generation log probabilities from inference backends like **vLLM**. + Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**. - **Parallel settings** Match the same parallelism configurations used for the HuggingFace-based tests. 
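The multiplicative probability error discussed above can be computed directly from per-token log probabilities. The helper below is an illustrative sketch (the function name and interface are ours, not NeMo RL's actual test code):

```python
import math

def multiplicative_prob_error(train_logprobs, infer_logprobs):
    """Mean per-token multiplicative probability error between two backends.

    Values near 1.0 indicate that the inference backend's probabilities
    closely match the training backend's on the sampled tokens.
    """
    assert len(train_logprobs) == len(infer_logprobs)
    # exp(|logp_train - logp_infer|) is the multiplicative error per token
    ratios = [math.exp(abs(t - i)) for t, i in zip(train_logprobs, infer_logprobs)]
    return sum(ratios) / len(ratios)

# Identical log probabilities give an error of exactly 1.0.
print(multiplicative_prob_error([-1.2, -0.3], [-1.2, -0.3]))  # 1.0
```

For the stricter guarantee mentioned earlier, run the metric twice, once with tokens sampled from each framework, and average the results.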
@@ -120,4 +120,4 @@ When validating your model, you should analyze the results across different conf
 
 ---
 
-By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-RL**'s requirements.
\ No newline at end of file
+By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL.
diff --git a/docs/cluster.md b/docs/cluster.md
index 260acaeb1e..57d351fd66 100644
--- a/docs/cluster.md
+++ b/docs/cluster.md
@@ -1,13 +1,10 @@
-# Cluster start
+# Set Up Clusters
 
-- [Cluster start](#cluster-start)
-  - [Slurm](#slurm)
-    - [Batched Job Submission](#batched-job-submission)
-    - [Interactive Launching](#interactive-launching)
-    - [Slurm UV\_CACHE\_DIR](#slurm-uv_cache_dir)
-  - [Kubernetes](#kubernetes)
+This guide explains how to set up NeMo RL clusters.
 
-## Slurm
+## Slurm (Batched and Interactive)
+
+The following sections explain how to use Slurm for batched job submission and for running jobs interactively.
 
 ### Batched Job Submission
@@ -35,7 +32,7 @@ Which will print the `SLURM_JOB_ID`:
 ```text
 Submitted batch job 1980204
 ```
-Make note of the the job submission number. Once the job begins you can track it's process in the driver logs which you can `tail`:
+Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`:
 ```sh
 tail -f 1980204-logs/ray-driver.log
 ```
 
 ### Interactive Launching
 
 :::{tip}
-A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the SLURM job queue. 
This means during debugging sessions, you can avoid submitting a new `sbatch` command each time and instead debug and re-submit your NeMo-RL job directly from the interactive session. +A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new `sbatch` command each time. Instead, you can debug and re-submit your NeMo RL job directly from the interactive session. ::: -To run interactively, launch the same command as the [Batched Job Submission](#batched-job-submission) except omit the `COMMAND` line: +To run interactively, launch the same command as [Batched Job Submission](#batched-job-submission), but omit the `COMMAND` line: ```sh # Run from the root of NeMo-RL repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) @@ -66,12 +63,12 @@ Which will print the `SLURM_JOB_ID`: ```text Submitted batch job 1980204 ``` -Once the ray cluster is up, a script should be created to attach to the ray head node, -which you can use launch experiments. +Once the Ray cluster is up, a script should be created to attach to the Ray head node, +which you can use to launch experiments. ```sh bash 1980204-attach.sh ``` -Now that you are on the head node, you can launch the command like so: +Now that you are on the head node, you can launch the command as follows: ```sh uv run ./examples/run_grpo_math.py ``` @@ -96,4 +93,4 @@ covered by warmed cache. 
## Kubernetes -TBD +TBD \ No newline at end of file diff --git a/docs/design-docs/chat-datasets.md b/docs/design-docs/chat-datasets.md index 43e2801fdc..7fe570b99a 100644 --- a/docs/design-docs/chat-datasets.md +++ b/docs/design-docs/chat-datasets.md @@ -1,8 +1,10 @@ # Data Format -## HuggingFace Chat Datasets +This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information. -HuggingFace chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. `messages` should be a list of dictionaries, each with a `role` and `content` key. `role` is typically one of `system`, `user`, and `assistant`. For example: +## Hugging Face Chat Datasets + +Hugging Face chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. The `messages` should be a list of dictionaries, each with a `role` and `content` key. The `role` typically has one of the following values: `system`, `user`, and `assistant`. For example: ```json { @@ -23,9 +25,9 @@ HuggingFace chat datasets are expected to have the following structure: Each exa } ``` -### Chat Templates +## Chat Templates -Formatting the data in this way allows us to take advantage of HuggingFace tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. +Formatting the data with chat templates allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. 
Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. By default, `apply_chat_template` attempts to apply the `chat_template` associated with the tokenizer. However, in some cases, users might want to specify their own chat template. Also, note that many tokenizers do not have associated `chat_template`s, in which case an explicit chat template is required. Users can specify an explicit chat template string using Jinja format and can pass that string to `apply_chat_template`. The following is an example using a simple template which prepends a role header to each turn: @@ -58,4 +60,4 @@ assert output == expected_output :hide: ``` -For more details on creating chat templates, refer to the [HuggingFace documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file +For more details on creating chat templates, refer to the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file diff --git a/docs/design-docs/design-and-philosophy.md b/docs/design-docs/design-and-philosophy.md index 00d6284b3b..258193f171 100644 --- a/docs/design-docs/design-and-philosophy.md +++ b/docs/design-docs/design-and-philosophy.md @@ -1,54 +1,54 @@ # Design and Philosophy -In this section, we will describe the problems this library aims to solve and motivate/dicuss the NeMo-RL APIs. + +This section introduces the NeMo RL APIs and addresses the challenges of online Reinforcement Learning (RL). Coordinating various software components, known as RL Actors, requires effective resource allocation, isolation, coordination, and communication. 
Our design philosophy focuses on creating modular abstractions for these tasks, ensuring scalability from one GPU to thousands, regardless of the RL Actor's implementation.
 
 ## Motivation
-Online RL requires coordinating a lot of different pieces of software/models
+
+Online RL demands the coordination of a wide range of software components and models, for example:
 - Policy Model/Training Framework
-- Fast inference Framework (vLLM, SGLANG, TRT-LLM)
+- Fast Inference Framework (vLLM, SGLang, TRT-LLM)
 - Reward Environments, Critics, etc.
 
 We refer to each of these pieces of software as an **RL Actor**.
 
-Fundamentally, we need to be able to do 4 things between these RL Actors:
-- Resource them (provide GPUs/CPUs)
-- Isolate them
-  - RL Actors may each set global variables or have conflicting dependencies, so they each need to live in an isolated process environment with configurable dependencies
-- Coordinate them (control)
-- Communicate between them (data)
+Fundamentally, managing these RL Actors requires four key capabilities:
+- Resource them (provide GPUs/CPUs).
+- Isolate them: RL Actors need isolated process environments with configurable dependencies to avoid global variable or dependency conflicts.
+- Coordinate them (control).
+- Communicate between them (data).
## Design We create composable and hackable abstractions for each layer of the tasks above -- Resourcing -> {py:class}`RayVirtualCluster ` -- Isolation -> {py:class}`RayWorkerGroup ` -- Coordination -> A Single-Process Controller using Ray -- Communication -> Data flows through one of the following: +- Resourcing: {py:class}`RayVirtualCluster ` +- Isolation: {py:class}`RayWorkerGroup ` +- Coordination: A Single-Process Controller using Ray +- Communication: Data flows through one of the following: - the single controller - a communication scheme set-up by the controller such as - NCCL Collectives - Multiprocess Queues -By creating a common interface for these 4 tasks, **RL algorithm code looks the same from 1 GPU to 1000 GPUs and does not care about the implementation of each RL Actor (Megatron, HF, Grad student with pen and paper)** +By creating a common interface for these four tasks, the RL algorithm code can scale seamlessly from 1 to 1000 GPUs and remain independent of the specific RL Actor (such as Megatron, Hugging Face, or abstract components like a grad student with pen and paper). ![actor-wg-worker-vc](../assets/actor-wg-worker-vc.png) ### {py:class}`RayVirtualCluster ` VirtualCluster provides a basic abstraction on top of Ray Placement Groups that allow you to section off a part of your compute resources for WorkerGroups to run on as though they had their own cluster. They support running just one WorkerGroup on each VirtualCluster, or *colocation*, where multiple WorkerGroups share resources (i.e running policy training(hf) and generation(vllm) on the same GPUs in-turn). -Minimally, it has has the following core API: ```python class RayVirtualCluster: """ Creates a virtual distributed cluster using Ray placement groups. 
This class simplifies distributed training setup by: - - Creating placement groups that represent logical compute nodes - - Allocating GPU and CPU resources for distributed workers - - Managing communication between distributed processes + - Creating placement groups that represent logical compute nodes. + - Allocating GPU and CPU resources for distributed workers. + - Managing communication between distributed processes. - - Bundle: A resource allocation unit (ex: 4 GPUs on a single node) - - Worker: A process that performs computation (model training/inference) - - Node: A physical or virtual machine containing multiple bundles + - Bundle: A resource allocation unit (ex: 4 GPUs on a single node). + - Worker: A process that performs computation (model training/inference). + - Node: A physical or virtual machine containing multiple bundles. """ def __init__(self, bundle_ct_per_node_list: List[int], {other args}): """ @@ -64,12 +64,12 @@ class RayVirtualCluster: This represents the "virtual cluster" - only nodes that are actually being used. Returns: - List of placement groups that have at least one bundle + List of placement groups that have at least one bundle. """ ``` ### {py:class}`RayWorkerGroup ` -All work is done by "Worker Processes"(Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by *RayWorkerGroup* +All work is done by "Worker Processes" (Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by the *RayWorkerGroup*. ```python class RayWorkerGroup: """ @@ -77,18 +77,20 @@ class RayWorkerGroup: This class creates and manages Ray actor instances that run on resources allocated by a RayVirtualCluster. It handles: - - Worker creation and placement on specific GPU resources - - Setting up distributed training environment variables (rank, world size, etc.) 
- - Executing methods across all workers in parallel - - Collecting and aggregating results - - Support for tied worker groups where multiple workers process the same data + - Worker creation and placement on specific GPU resources. + - Setting up distributed training environment variables (rank, world size, etc.). + - Executing methods across all workers in parallel. + - Collecting and aggregating results. + - Support for tied worker groups where multiple workers process the same data. """ ``` `RayWorkerGroup` provides functions like `run_all_workers_single_data` and `run_all_workers_multiple_data` to control and communicate to individual worker processes. -### Single-Controller & Execution Diagram -We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of GRPO as though we were working on 1 GPU +### Single-Controller and Execution Diagram + +We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of Group Relative Policy Optimization (GRPO) as though we were working on 1 GPU. 
+ ```python # data processing/transformations between each step omitted def grpo_train( @@ -106,7 +108,7 @@ def grpo_train( logprobs = policy.get_logprobs(generations) reference_logprobs = policy.get_reference_logprobs(generations) - training_data = calculate_grpo_trainnig_data(generations, logprobs, reference_logprobs, rewards) + training_data = calculate_grpo_training_data(generations, logprobs, reference_logprobs, rewards) policy.train(generations, logprobs, reference_logprobs, GRPOLossFn) ``` -For a real implementation of grpo (with valiation, checkpointing, memory movement, and the omitted data processing steps), see [grpo_train](../../nemo_rl/algorithms/grpo.py) +For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](../../nemo_rl/algorithms/grpo.py) diff --git a/docs/design-docs/generation.md b/docs/design-docs/generation.md index 72c2554d92..bb83457b91 100644 --- a/docs/design-docs/generation.md +++ b/docs/design-docs/generation.md @@ -1,6 +1,6 @@ -# Generation Module +# Token Generation -This doc explains the token generation interface and various backends for the NeMo-RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, HuggingFace, SGLang, TRT-LLM) to provide token generation capabilities while adhering to the same API. +This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API. 
## Generation Interface @@ -58,7 +58,7 @@ The core of the generation system is defined in `interfaces.py`, which establish pass ``` -A key thing to note about generation backends is that the generation backend takes in tokens and gives out tokens without dealing with the tokenizer. By ensuring that only tokens are communicated we eliminate the possibility of having different tokenizers (different versions/specs etc) for training and generation framework. +A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks. ## VLLM Backend @@ -66,29 +66,29 @@ The VLLM backend (`models/generation/vllm.py`) implements the {py:class}`Generat ### VllmGeneration Class -The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It: +The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It performs the following functions: -1. Sets up VLLM workers in a distributed environment using Ray -2. Manages the lifecycle of these workers (initialization, generation, shutdown) -3. Distributes inputs to workers and collects outputs -4. Handles weight updates and synchronization +1. Sets up VLLM workers in a distributed environment using Ray. +2. Manages the lifecycle of these workers (initialization, generation, shutdown). +3. Distributes inputs to workers and collects outputs. +4. Handles weight updates and synchronization. ### VllmGenerationWorker The {py:class}`VllmGenerationWorker ` is a Ray actor that: -1. Initializes and manages a VLLM model instance -2. Performs the actual generation on a GPU -3. Supports dynamic weight updates through IPC handles -4. Implements sleep/wake mechanisms for efficient resource utilization +1. 
Initializes and manages a VLLM model instance. +2. Performs the actual generation on a GPU. +3. Supports dynamic weight updates through IPC handles. +4. Implements sleep/wake mechanisms for efficient resource utilization. ### Custom VLLM Extensions The {py:class}`UpdatableVllmInternalWorker ` class in `vllm_backend.py` extends the VLLM worker with additional capabilities: -1. Reporting device IDs to allow mapping of workers to specific GPUs -2. Updating weights from IPC handles for efficient weight sharing -3. Checking if weights have been updated correctly +1. Reporting device IDs to allow mapping of workers to specific GPUs. +2. Updating weights from IPC handles for efficient weight sharing. +3. Checking if weights have been updated correctly. ## Usage Example @@ -133,13 +133,13 @@ output = generator.generate(input_data, greedy=False) generator.finish_generation() ``` -## Extending with New Backends +## Extend with New Backends To add a new generation backend: -1. Create a new class that implements {py:class}`GenerationInterface ` -2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation ` -3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures -4. Register your backend with the system (if needed) to make it accessible +1. Create a new class that implements {py:class}`GenerationInterface `. +2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation `. +3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures. +4. Register your backend with the system (if needed) to make it accessible. This modular design allows for easy extension with new backends while maintaining a consistent interface for the rest of the system. 
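To make the four steps above concrete, here is a minimal sketch of what a new backend might look like. The method names `generate`, `prepare_for_generation`, and `finish_generation` follow the interface described above, but the `EchoGeneration` backend itself — and the simplified standalone base class used here in place of the real `GenerationInterface` — are hypothetical stand-ins for illustration only.

```python
from abc import ABC, abstractmethod

# Simplified stand-in for the real GenerationInterface, for illustration only.
class GenerationInterface(ABC):
    @abstractmethod
    def prepare_for_generation(self): ...
    @abstractmethod
    def generate(self, data, greedy=False): ...
    @abstractmethod
    def finish_generation(self): ...

class EchoGeneration(GenerationInterface):
    """Toy backend that 'generates' by echoing the input tokens back.

    A real backend would distribute work to workers and run an actual
    model; this only demonstrates the shape of the interface.
    """

    def __init__(self):
        self.ready = False

    def prepare_for_generation(self):
        # A real backend would wake workers / load weights here.
        self.ready = True

    def generate(self, data, greedy=False):
        assert self.ready, "call prepare_for_generation() first"
        # Backends exchange tokens only -- no tokenizer is involved.
        return {"output_ids": [ids[:] for ids in data["input_ids"]]}

    def finish_generation(self):
        # A real backend would sleep workers / free GPU memory here.
        self.ready = False

backend = EchoGeneration()
backend.prepare_for_generation()
out = backend.generate({"input_ids": [[1, 2, 3]]})
backend.finish_generation()
print(out["output_ids"])  # [[1, 2, 3]]
```

Because the controller only sees the interface methods, a toy backend like this can stand in for VLLM during tests without changing the rest of the training loop.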
diff --git a/docs/design-docs/logger.md b/docs/design-docs/logger.md index 8578fe621e..3e861ecab5 100644 --- a/docs/design-docs/logger.md +++ b/docs/design-docs/logger.md @@ -1,8 +1,10 @@ # Logger -## Requirements: +The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB and Tensorboard. -* Tracking distributed metrics with specified reductions (mean, max, etc) +## Requirements + +* Tracking distributed metrics with specified reductions (mean, max, etc.) * Tracking distributed timing with (usually) 'max' reduction across ranks * Logging: * WandB @@ -29,7 +31,7 @@ class LoggerInterface(ABC): pass ``` -A {py:class}`Logger ` wrapper class will also implement {py:class}`LoggerInterface ` and will contain a list of loggers it delegates to when writing logs. This will be the main class the user uses in the training loop. Usage example: +A {py:class}`Logger ` wrapper class will also implement {py:class}`LoggerInterface ` and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example: ```python # Initialize logger with both wandb and tensorboard enabled @@ -57,7 +59,7 @@ logger.log_metrics({ ## Validation Pretty Logging -The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter: +The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter. ```python logger: @@ -68,9 +70,9 @@ logger: When `num_val_samples_to_print` is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for: -1. 
Quickly inspecting model generation quality during training -2. Comparing inputs and outputs side-by-side -3. Tracking validation sample performance over time +1. Quickly inspecting model generation quality during training. +2. Comparing inputs and outputs side-by-side. +3. Tracking validation sample performance over time. ### Example Output @@ -80,11 +82,11 @@ When enabled, the pretty logging will generate formatted text similar to: ## GPU Metric Logging -NeMo-RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo-RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or Weights & Biases. +NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or WandB. -This approach allows us to offer the same GPU metric tracking on all loggers (not just wandb) and simplifies the implementation greatly. +This approach allows us to offer the same GPU metric tracking on all loggers (not just Wandb) and simplifies the implementation greatly. -This feature is enabled with the `monitor_gpus` configuration parameter and the frequency of collection and flushing to the loggers is controlled by `gpu_collection_interval` and `gpu_flush_interval` (both in seconds), respectively: +This feature is enabled with the `monitor_gpus` configuration parameter. The frequency of data collection and flushing to the loggers is controlled by the `gpu_collection_interval` and `gpu_flush_interval` parameters, both specified in seconds. 
```python logger: @@ -97,12 +99,12 @@ logger: ``` :::{note} -While monitoring through the remote workers is possible, it requires some delicate implementation details to make sure: -* sending logs back to driver does not incur a large overhead -* metrics are easily interpretable since we may be double counting due to colocated workers -* workers gracefully flush their logs in the event of failure -* the logging is the same for tensorboard and wandb -* some workers which spawn other workers correctly report the total usage of the grandchild worker - -These reasons lead us to the simple implementation of collecting on the driver -::: +While it is feasible to monitor using remote workers, the implementation requires careful attention to details to ensure: +* Logs sent back to the driver do not introduce significant overhead. +* Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers. +* Workers can gracefully flush their logs in case of failure. +* Logging behaves consistently across TensorBoard and Wandb. +* Workers that spawn other workers accurately report the total resource usage of any grandchild workers. + +Due to these complexities, we opted for a simpler approach: collecting metrics directly on the driver. +::: \ No newline at end of file diff --git a/docs/design-docs/padding.md b/docs/design-docs/padding.md index 219e91573f..9c3278d651 100644 --- a/docs/design-docs/padding.md +++ b/docs/design-docs/padding.md @@ -15,9 +15,9 @@ NeMo RL uses **right padding** for all tensor operations, where padding tokens a ``` This approach: -1. **Naturally aligns with LLM processing**: Tokens are processed from left to right -2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors -3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value +1. **Naturally aligns with LLM processing**: Tokens are processed from left to right. +2. 
**Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors. +3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value. ## Right-Padded Generation Example @@ -35,9 +35,9 @@ Corresponding logprobs: |-- zeros for input --| |- gen logprobs -| |pad| ``` -## Verifying Right Padding +## Verify Right Padding -NeMo RL provides utilities to verify correct padding: +NeMo RL provides utilities to verify correct padding. For example: ```{testcode} import torch @@ -79,20 +79,20 @@ if not is_right_padded: ``` The {py:class}`verify_right_padding() ` function checks that: -1. All padding (zeros or padding token provided by the user) appears after valid tokens -2. The padding starts at the position specified by the length tensor +1. All padding (zeros or padding token provided by the user) appears after valid tokens. +2. The padding starts at the position specified by the length tensor. The function automatically detects whether you're passing input or output data: -- For input data: Requires `input_ids` and `input_lengths` fields -- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths` +- For input data: Requires `input_ids` and `input_lengths` fields. +- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths`. ## Best Practices -1. **Always Use Right Padding**: All components expect this format +1. **Always Use Right Padding**: All components expect this format. -2. **Track Length Tensors**: Include appropriate length tensors with your data +2. **Track Length Tensors**: Include appropriate length tensors with your data. -3. **Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt +3. **Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt. -4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations +4. 
**Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations. diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 12d8368501..2bdc33e432 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -1,36 +1,36 @@ -# uv in NeMo-RL +# uv in NeMo RL -Using `uv` for Dependency Management in NeMo-RL +We use the `uv` Python package installer for managing dependencies in NeMo RL. ## Overview -`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document outlines why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. +`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document explains why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. ## Why `uv`? ### Speed and Efficiency -- Written in Rust, making it significantly faster than traditional Python package managers -- Optimized caching mechanisms that reduce redundant downloads and installations -- Quick environment creation and switching, enabling rapid development cycles +- Written in Rust, making it significantly faster than traditional Python package managers. +- Optimized caching mechanisms that reduce redundant downloads and installations. +- Quick environment creation and switching, enabling rapid development cycles. 
### Isolated Environments -- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages -- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies -- Ensures consistent behavior across different machines and deployment environments +- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages. +- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies. +- Ensures consistent behavior across different machines and deployment environments. ### Dependency Management in Ray Clusters -- Enables management of heterogeneous Python environments across a Ray cluster -- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires -- Simplifies propagation of environments to worker nodes without manual setup on each node +- Enables management of heterogeneous Python environments across a Ray cluster. +- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires. +- Simplifies propagation of environments to worker nodes without manual setup on each node. ### Container-Free Flexibility -- Frees us from having to publish many containers for different dependency combinations -- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically -- Reduces infrastructure complexity and maintenance overhead +- Frees us from having to publish many containers for different dependency combinations. 
+- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically. +- Reduces infrastructure complexity and maintenance overhead. ## Implementation in NeMo RL @@ -61,7 +61,7 @@ If you need a different Python executable configuration, you can override the de ## How It Works -When a NeMo-RL job is started: +When a NeMo RL job is started: 1. The driver script creates several {py:class}`RayWorkerGroup `s. 2. Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder ` @@ -71,4 +71,4 @@ This approach allows a fast start-up and maintains dependency isolation. It also ## Conclusion -Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems while enabling heterogeneous environments that can be tailored to specific workloads. +Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems, while enabling heterogeneous environments that can be tailored to specific workloads. diff --git a/docs/docker.md b/docs/docker.md index fd42a5b404..96558f5e31 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -1,4 +1,6 @@ -# Building Docker Images +# Build Docker Images + +This guide provides two methods for building Docker images: the base image, ideal for specifying Python dependencies at runtime, and the hermetic image, which includes default dependencies for offline use. 
## Base Image @@ -9,18 +11,18 @@ cd docker/ docker buildx build --target base -t nemo_rl -f Dockerfile .. ``` -This is **our recommendation** as it is a small image and allows you to specify your python dependencies at runtime. +This is **our recommendation** as it is a small image and allows you to specify your Python dependencies at runtime. ## Hermetic Image -The docker image build without a target stage will include all of the default dependencies to get started. +The Docker image build without a target stage will include all of the default dependencies to get started. ```sh cd docker/ docker buildx build -t nemo_rl -f Dockerfile .. ``` -This image sets up the python environment for you, so you do not have to use `uv` if you don't need +This image sets up the Python environment for you, so you do not have to use `uv` if you don't need any other packages. This image is useful in situations where you may not have network connectivity to re-download packages. diff --git a/docs/documentation.md b/docs/documentation.md index df285cca68..58230a0592 100644 --- a/docs/documentation.md +++ b/docs/documentation.md @@ -7,9 +7,9 @@ - [Writing Tests in Python Docstrings](#writing-tests-in-python-docstrings) -## Building +## Build the Documentation -The following sections describe how to set up and build the NeMo-RL documentation. +The following sections describe how to set up and build the NeMo RL documentation. Switch to the documentation source folder and generate HTML output. @@ -23,9 +23,9 @@ uv run --group docs sphinx-build . _build/html ## Live Building -When writing documentation it can be helpful to serve the documentation and have it update live while you edit. +When writing documentation, it can be helpful to serve the documentation and have it update live while you edit. -To do so run: +To do so, run: ```sh cd docs/ @@ -35,16 +35,16 @@ uv run --group docs sphinx-autobuild . 
_build/html --port 12345 --host 0.0.0.0 Open a web browser and go to `http://${HOST_WHERE_SPHINX_COMMAND_RUN}:12345` to view the output. -## Running Tests in Python Docstrings +## Run Tests in Python Docstrings -We also run tests in our python docstrings. You can run them with: +We also run tests in our Python docstrings. You can run them with: ```sh cd docs/ uv run --group docs sphinx-build -b doctest . _build/doctest ``` -## Writing Tests in Python Docstrings +## Write Tests in Python Docstrings Any code in triple backtick blocks with the `{doctest}` directive will be tested. The format follows Python's doctest module syntax, where `>>>` indicates Python input and the following line shows the expected output. Here's an example: diff --git a/docs/guides/eval.md b/docs/guides/eval.md index f547e19ff8..fbf1fe5fbe 100644 --- a/docs/guides/eval.md +++ b/docs/guides/eval.md @@ -1,7 +1,11 @@ # Evaluation +This document explains how to use an evaluation script for assessing model capabilities. + ## Start Evaluation +To run the evaluation, you can use the default configuration file or specify a custom one. + ### Start Script **Evaluating Standard Models:** @@ -52,11 +56,12 @@ score=0.10 (3.0/30) ============================================================ ``` -## Configuration +## Example Configuration File -An example Evaluation configuration file can be found [here](../../examples/configs/eval.yaml). +You can find an example evaluation configuration file [here](../../examples/configs/eval.yaml). ### Prompt Template Configuration + Always remember to use the same `prompt_file` and `system_prompt_file` that were used during training. For open-source models, we recommend setting `prompt_file=null` and `system_prompt_file=null` to allow them to use their native chat templates. 
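To illustrate, the relevant settings for an open-source model using its native chat template might look like this (a hedged sketch — the exact key nesting in `eval.yaml` may differ):

```yaml
# Match whatever was used during training; for open-source models,
# null lets the model's native chat template apply.
prompt_file: null
system_prompt_file: null
```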
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md index 82526e0e66..6808682b6b 100644 --- a/docs/guides/grpo.md +++ b/docs/guides/grpo.md @@ -2,7 +2,7 @@ ## Quickstart: Launch a GRPO Run -If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). +To get started quickly, use the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py), which demonstrates how to train a model on math problems using GRPO. You can launch this script locally or via Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](../cluster.md). We recommend launching the job using `uv`: @@ -14,8 +14,6 @@ If not specified, `config` will default to [examples/configs/grpo.yaml](../../ex **Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. -## Now, for the details: - In this guide, we'll walk through how we handle * Data @@ -53,7 +51,8 @@ class DatumSpec(TypedDict): #### Data Processors -We name all distinct "environments your model wants to optimize against" "tasks". So you might define a "math" task or a "code" task. +We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code." 
+ For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py) ```python @@ -76,7 +75,7 @@ GRPO expects datasets to have the following form: {"task_name": "math", /* actual data */} ``` -Then, you can set data up as such: +Then, you can set the data up as follows: ```python base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"] @@ -96,7 +95,7 @@ dataset = AllTaskProcessedDataset( ) ``` -Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples. +Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples. ### Policy Model diff --git a/docs/guides/sft.md b/docs/guides/sft.md index ff2fd196d5..84a1918038 100644 --- a/docs/guides/sft.md +++ b/docs/guides/sft.md @@ -1,16 +1,20 @@ -# Supervised Fine-tuning in NeMo-RL +# Supervised Fine-Tuning in NeMo RL + +This document explains how to perform SFT within NeMo RL. It outlines key operations, including initiating SFT runs, managing experiment configurations using YAML, and integrating custom datasets that conform to the required structure and attributes. ## Launch an SFT Run -The script [examples/run_sft.py](../../examples/run_sft.py) can be used to launch an experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). +The script, [examples/run_sft.py](../../examples/run_sft.py), can be used to launch an experiment. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). Be sure to launch the job using `uv`. 
The command to launch an SFT job is as follows: + ```bash uv run examples/run_sft.py --config ``` + If not specified, `config` will default to [examples/configs/sft.yaml](../../examples/configs/sft.yaml). -## Configuration +## Example Configuration File NeMo-RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). @@ -21,15 +25,16 @@ uv run examples/run_sft.py \ cluster.gpus_per_node=1 \ logger.wandb.name="sft-dev-1-gpu" ``` + **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. ## Datasets -SFT datasets in NeMo-RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: +SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset. -SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_rl/data/hf_datasets/squad.py) has an example: +SFT datasets are expected to follow the Hugging Face chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. 
[data/hf_datasets/squad.py](../../nemo_rl/data/hf_datasets/squad.py) has an example:
 
 ```python
 def format_squad(data):
@@ -51,7 +56,7 @@ def format_squad(data):
     }
 }
 
-NeMo-RL SFT uses HuggingFace chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your yaml config (see [sft.yaml](../../examples/configs/sft.yaml) for an example):
+NeMo RL SFT uses Hugging Face chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your YAML config (see [sft.yaml](../../examples/configs/sft.yaml) for an example):
 1. Apply the tokenizer's default chat template. To use the tokenizer's default, either omit `tokenizer.chat_template` from the config altogether, or set `tokenizer.chat_template="default"`.
 2. Use a "passthrough" template which simply concatenates all messages. This is desirable if the chat template has been applied to your dataset as an offline preprocessing step. In this case, you should set `tokenizer.chat_template` to None as follows:
@@ -67,7 +72,7 @@ NeMo-RL SFT uses HuggingFace chat templates to format the individual examples. T
 ```
 
-By default, NeMo-RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
+By default, NeMo RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
 
 Adding a new dataset is a straightforward process. As long as your custom dataset has the `formatted_ds` and `task_spec` attributes described above, it can serve as a drop-in replacement for Squad and OpenAssistant.
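A sketch of what such a drop-in dataset class might look like follows. The class name, the example data, and the simple namespace used in place of a real `TaskDataSpec` are hypothetical; an actual implementation would construct a `TaskDataSpec` and load real data.

```python
from types import SimpleNamespace

class MyChatDataset:
    """Toy SFT dataset exposing the two attributes described above.

    `formatted_ds` holds `train`/`validation` splits in the Hugging Face
    chat ("messages") format; `task_spec` names the dataset. A
    SimpleNamespace stands in for the real TaskDataSpec for illustration.
    """

    def __init__(self):
        example = {
            "messages": [
                {"role": "user", "content": "What is the capital of France?"},
                {"role": "assistant", "content": "Paris."},
            ]
        }
        self.formatted_ds = {
            "train": [example],
            "validation": [example],
        }
        self.task_spec = SimpleNamespace(task_name="my_chat_dataset")

ds = MyChatDataset()
print(ds.task_spec.task_name)  # my_chat_dataset
print(ds.formatted_ds["train"][0]["messages"][1]["content"])  # Paris.
```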
\ No newline at end of file diff --git a/docs/local-workstation.md b/docs/local-workstation.md index 482b41c5ad..860ec0428a 100644 --- a/docs/local-workstation.md +++ b/docs/local-workstation.md @@ -1,6 +1,4 @@ -# Local Workstation - -## Launching Locally +# Run on Your Local Workstation When launching examples locally with `uv`, {py:class}`init_ray() ` will first attempt to connect to an existing cluster. If none is found, it will start a local one and connect to it using all available GPU and CPU resources on your node. @@ -17,7 +15,7 @@ In the logs, you will see that Ray has started a local cluster instance, along w INFO:nemo_rl.distributed.virtual_cluster:Started local cluster with: {'node:__internal_head__': 1.0, 'CPU': 24.0, 'object_store_memory': 80448493977.0, 'accelerator_type:RTX': 1.0, 'memory': 177713152615.0, 'GPU': 1.0, 'node:10.0.0.1': 1.0} ``` -To control the GPUs ray uses locally more granularly, please use `CUDA_VISIBLE_DEVICES`: +To have more precise control over the GPUs Ray uses locally, please use `CUDA_VISIBLE_DEVICES`: ```sh # Use the 0th and 3rd indexed GPU (for a total of 2 GPUs) diff --git a/docs/testing.md b/docs/testing.md index 672bdacc82..95d88b90aa 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -1,4 +1,6 @@ -# Testing NeMo-RL +# Test NeMo RL + +This guide outlines how to test NeMo RL using unit and functional tests, detailing steps for local or Docker-based execution, dependency setup, and metric tracking to ensure effective and reliable testing. ## Unit Tests @@ -12,16 +14,16 @@ uv run --group test bash tests/run_unit.sh ``` :::{note} -Tests can also be run on SLURM with `ray.sub`, but note that some tests will be skipped +Tests can also be run on Slurm with `ray.sub`, but note that some tests will be skipped due to no GPUs being located on the head node. To run the full suite of tests, please launch on a regular GPU allocation. 
::: -### Running Unit Tests in a Hermetic Environment +### Run Unit Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... bash tests/run_unit_in_docker.sh @@ -29,9 +31,10 @@ CONTAINER=... bash tests/run_unit_in_docker.sh The required `CONTAINER` can be built by following the instructions in the [docker documentation](docker.md). -### Tracking metrics in unit tests +### Track Metrics in Unit Tests Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API: + ```python # Track an arbitrary metric (must be json serializable) tracker.track(metric_name, metric_value) @@ -44,6 +47,7 @@ tracker.get_max_mem() Including the `tracker` fixture also tracks the elapsed time for the test implicitly. Here is an example test: + ```python def test_exponentiate(tracker): starting_mem = tracker.get_max_mem() @@ -58,6 +62,7 @@ def test_exponentiate(tracker): ``` Which would produce this file in `tests/unit/unit_results.json`: + ```json { "exit_status": 0, @@ -94,7 +99,7 @@ jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_hf_policy_g ``` ::: -## Functional tests +## Functional Tests :::{important} Functional tests may require multiple GPUs to run. See each script to understand the requirements. @@ -119,11 +124,11 @@ whether they pass or fail. Here is an example: └────────┴────────────────────────────────┴───────────────────┴─────────┘ ``` -### Running Functional Tests in a Hermetic Environment +### Run Functional Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... 
bash run_functional_in_docker.sh functional/sft.sh From d16225c34422781a52db22bcefe4c9df6f169cbc Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 10:43:04 -0700 Subject: [PATCH 02/36] add dummy spaces to let checkpointing be reviewable Signed-off-by: Terry Kong --- docs/design-docs/checkpointing.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md index 101f57a059..e5a6f512a3 100644 --- a/docs/design-docs/checkpointing.md +++ b/docs/design-docs/checkpointing.md @@ -1,10 +1,11 @@ -# Checkpointing with HuggingFace Models +# Checkpointing with HuggingFace Models -## Checkpoint Format +## Checkpoint Format NeMo-RL provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training. 
-A checkpoint converter is provided to convert a Torch distributed checkpoint checkpoint to HuggingFace format after training: +A checkpoint converter is provided to convert a Torch distributed checkpoint checkpoint to HuggingFace format after training: ```python uv run examples/convert_dcp_to_hf.py --config= --dcp-ckpt-path= --hf-ckpt-path= - ``` \ No newline at end of file + ``` + From b1c9b042920f617f8be1ca8fb9a10188bddc85e4 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:26:34 -0700 Subject: [PATCH 03/36] Update docs/design-docs/checkpointing.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/design-docs/checkpointing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md index e5a6f512a3..fc66b40eda 100644 --- a/docs/design-docs/checkpointing.md +++ b/docs/design-docs/checkpointing.md @@ -1,4 +1,4 @@ -# Checkpointing with HuggingFace Models +# Checkpointing with Hugging Face Models ## Checkpoint Format NeMo-RL provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training. 
From 789074e798ab4d805d086e51e07bdd4c636a4529 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:26:40 -0700 Subject: [PATCH 04/36] Update docs/design-docs/checkpointing.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/design-docs/checkpointing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md index fc66b40eda..f767cc239c 100644 --- a/docs/design-docs/checkpointing.md +++ b/docs/design-docs/checkpointing.md @@ -3,7 +3,7 @@ ## Checkpoint Format NeMo-RL provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training. 
-A checkpoint converter is provided to convert a Torch distributed checkpoint checkpoint to HuggingFace format after training:
+A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training:
 
 ```python
 uv run examples/convert_dcp_to_hf.py --config= --dcp-ckpt-path= --hf-ckpt-path=

From aa292ffff7c9f2ee09307568ac8a1c4997112661 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Fri, 2 May 2025 12:27:02 -0700
Subject: [PATCH 05/36] Update docs/design-docs/checkpointing.md

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong
Signed-off-by: Terry Kong
---
 docs/design-docs/checkpointing.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md
index f767cc239c..7d21e1b328 100644
--- a/docs/design-docs/checkpointing.md
+++ b/docs/design-docs/checkpointing.md
@@ -1,7 +1,7 @@
 # Checkpointing with Hugging Face Models
 
 ## Checkpoint Format
-NeMo-RL provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training.
+NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format. Torch distributed is used by default for efficiency, and Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights, ignoring the optimizer states. 
It is recommended to use Torch distributed format to save intermediate checkpoints and to save a Hugging Face checkpoint only at the end of training.
 
 A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training:
 
 ```python
 uv run examples/convert_dcp_to_hf.py --config= --dcp-ckpt-path= --hf-ckpt-path=

From 603b4831db15c347ea876406e083a5a3e95f3e00 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Fri, 2 May 2025 12:27:15 -0700
Subject: [PATCH 06/36] Update docs/adding-new-models.md

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong
Signed-off-by: Terry Kong
---
 docs/adding-new-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/adding-new-models.md b/docs/adding-new-models.md
index d0265c3e4d..711f84e6a1 100644
--- a/docs/adding-new-models.md
+++ b/docs/adding-new-models.md
@@ -66,7 +66,7 @@ When investigating discrepancies beyond the acceptable threshold, focus on these
 When validating Hugging Face-based models, perform the following checks:
 
 - **Compare log probabilities**
-  Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches.
+  Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches.
 
 - **Test parallelism**
   Verify consistency with other parallelism settings.

From c9f905dad0ee404a57bec47714ba111d955b6744 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Fri, 2 May 2025 12:27:23 -0700
Subject: [PATCH 07/36] Update docs/testing.md

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong
Signed-off-by: Terry Kong
---
 docs/testing.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/testing.md b/docs/testing.md
index 95d88b90aa..35825ab50d 100644
--- a/docs/testing.md
+++ b/docs/testing.md
@@ -29,7 +29,7 @@ in Docker with this script:
 
 CONTAINER=... 
bash tests/run_unit_in_docker.sh ``` -The required `CONTAINER` can be built by following the instructions in the [docker documentation](docker.md). +The required `CONTAINER` can be built by following the instructions in the [Docker documentation](docker.md). ### Track Metrics in Unit Tests From 7a0e446211d9a704496742937bf6675030dd26cf Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:33:03 -0700 Subject: [PATCH 08/36] tech edit Signed-off-by: Terry Kong --- docs/adding-new-models.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/adding-new-models.md b/docs/adding-new-models.md index 711f84e6a1..eefdfb5d9f 100644 --- a/docs/adding-new-models.md +++ b/docs/adding-new-models.md @@ -14,15 +14,15 @@ When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong $$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}(x) - \pi(x) \right)$$ -So, to verify correctness, we calculate +So, to verify correctness, we calculate: $$ \frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right\|\right) $$ -where samples are drawn as $x \sim \pi_{\text{inference-framework}}$ +as a measure of multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$. -As a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient. 
+Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
 
 ## Understand Discrepancies Between Backends
 
From d7649a70899544f752e174513559a6fa876bed0c Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Fri, 2 May 2025 12:35:29 -0700
Subject: [PATCH 09/36] done

Signed-off-by: Terry Kong
---
 docs/design-docs/chat-datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design-docs/chat-datasets.md b/docs/design-docs/chat-datasets.md
index 7fe570b99a..f4526701fd 100644
--- a/docs/design-docs/chat-datasets.md
+++ b/docs/design-docs/chat-datasets.md
@@ -1,4 +1,4 @@
-# Data Format
+# Chat Datasets Format
 
 This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information.

From 31a2d47e5af023393272b59cd6b8e5c9620e3aa8 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Fri, 2 May 2025 12:37:20 -0700
Subject: [PATCH 10/36] done

Signed-off-by: Terry Kong
---
 docs/cluster.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cluster.md b/docs/cluster.md
index 57d351fd66..dc08f348cf 100644
--- a/docs/cluster.md
+++ b/docs/cluster.md
@@ -78,7 +78,7 @@ uv run ./examples/run_grpo_math.py
 
 There several choices for `UV_CACHE_DIR` when using `ray.sub`:
 
 1. (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified the shell environment, and is mounted to head and worker nodes to serve as a persistent cache between runs.
-2. Use the warm uv cache from our docker images
+2. Use the warm uv cache from our Docker images:
 ```sh
 ... 
UV_CACHE_DIR=/home/ray/.cache/uv \ From 838ae4a2667a59350177896d2574a61f01434f03 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:38:20 -0700 Subject: [PATCH 11/36] period Signed-off-by: Terry Kong --- docs/design-docs/design-and-philosophy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/design-and-philosophy.md b/docs/design-docs/design-and-philosophy.md index 258193f171..eec3b399a7 100644 --- a/docs/design-docs/design-and-philosophy.md +++ b/docs/design-docs/design-and-philosophy.md @@ -111,4 +111,4 @@ def grpo_train( training_data = calculate_grpo_training_data(generations, logprobs, reference_logprobs, rewards) policy.train(generations, logprobs, reference_logprobs, GRPOLossFn) ``` -For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](../../nemo_rl/algorithms/grpo.py) +For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](../../nemo_rl/algorithms/grpo.py). From f766a97bde3fd6c99761c24c7d98bd018c196e83 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:40:31 -0700 Subject: [PATCH 12/36] Revert "done" This reverts commit 043e18688340be33a4d323dfc36b17132812fb68. Signed-off-by: Terry Kong --- docs/design-docs/chat-datasets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/chat-datasets.md b/docs/design-docs/chat-datasets.md index f4526701fd..7fe570b99a 100644 --- a/docs/design-docs/chat-datasets.md +++ b/docs/design-docs/chat-datasets.md @@ -1,4 +1,4 @@ -# Chat Datasets Format +# Data Format This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information. 
From bbd0d34c0999c0beb8a0b0dbc804cd5ca3563f4f Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:47:21 -0700 Subject: [PATCH 13/36] done Signed-off-by: Terry Kong --- docs/design-docs/padding.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/design-docs/padding.md b/docs/design-docs/padding.md index 9c3278d651..da5a6def74 100644 --- a/docs/design-docs/padding.md +++ b/docs/design-docs/padding.md @@ -1,7 +1,5 @@ # Padding in NeMo RL -## Overview - This document explains padding in NeMo RL and why consistent padding is critical for the framework. ## Padding Approach From bb73f36138cda974ed243a10617ed127e5f65417 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Fri, 2 May 2025 12:48:33 -0700 Subject: [PATCH 14/36] done Signed-off-by: Terry Kong --- docs/design-docs/uv.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 2bdc33e432..58870bcf7e 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -8,6 +8,8 @@ We use the `uv` Python package installer for managing dependencies in NeMo RL. ## Why `uv`? +`uv` brings the following key advantages to our Python development workflow: + ### Speed and Efficiency - Written in Rust, making it significantly faster than traditional Python package managers. From bd86cc31de49ca665e7c4ea2ca7856440b4bbc7b Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:19:17 -0700 Subject: [PATCH 15/36] done Signed-off-by: Terry Kong --- docs/design-docs/uv.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 58870bcf7e..3fcf41a09b 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -36,9 +36,11 @@ We use the `uv` Python package installer for managing dependencies in NeMo RL. 
## Implementation in NeMo RL +This section outlines how workers define their required executables, details the available predefined configurations (like BASE or VLLM), and explains how to customize these setups for specific needs, ensuring consistency across actors. + ### Worker Configuration -In our codebase, workers (classes decorated with `@ray.remote`, e.g., `HFPolicyWorker`) define a `DEFAULT_PY_EXECUTABLE` which specifies what dependencies the worker needs. This allows different parts of our application to have their own tailored environments. +In our codebase, workers (classes decorated with `@ray.remote`, e.g., `HFPolicyWorker`) define a `DEFAULT_PY_EXECUTABLE` that specifies what dependencies the worker needs. This allows different parts of our application to have their own tailored environments. ### Supported Python Executables From 6e7269b3d83fee0fae13da0e604690a429f2d799 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:19:58 -0700 Subject: [PATCH 16/36] period Signed-off-by: Terry Kong --- docs/design-docs/uv.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 3fcf41a09b..08d39cf390 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -68,7 +68,7 @@ If you need a different Python executable configuration, you can override the de When a NeMo RL job is started: 1. The driver script creates several {py:class}`RayWorkerGroup `s. -2. Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder ` +2. Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder `. 3. 
Before the worker class is instantiated by the `RayWorkerBuilder`, if (1) `DEFAULT_PY_EXECUTABLE` is defined on the worker class (decorated with `@ray.remote`) and (2) it starts with `uv`; a `venv` is created with all the dependencies it needs and the `runtime_env["py_executable"]` is replaced with the `venv`'s python interpreter. This approach allows a fast start-up and maintains dependency isolation. It also has the added benefit of having all the virtual environments local under `./venvs`. From 51d3399a7fcbb53c8623e0c2d54e1917f545e7e9 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:20:45 -0700 Subject: [PATCH 17/36] Update docs/guides/eval.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/guides/eval.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/eval.md b/docs/guides/eval.md index fbf1fe5fbe..12c4447b2e 100644 --- a/docs/guides/eval.md +++ b/docs/guides/eval.md @@ -8,7 +8,7 @@ To run the evaluation, you can use the default configuration file or specify a c ### Start Script -**Evaluating Standard Models:** +**Evaluate Standard Models:** To run evaluation using a model directly from Hugging Face Hub or a local path already in HF format, use the `run_eval.py` script. 
From 8c86c033ee6ac0db5fd6c8756218799dca2960c0 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:21:32 -0700 Subject: [PATCH 18/36] done Signed-off-by: Terry Kong --- docs/guides/eval.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/eval.md b/docs/guides/eval.md index 12c4447b2e..c175180ad0 100644 --- a/docs/guides/eval.md +++ b/docs/guides/eval.md @@ -23,7 +23,7 @@ uv run python examples/run_eval.py --config path/to/custom_config.yaml uv run python examples/run_eval.py generation.model_name="Qwen/Qwen2.5-Math-7B-Instruct" ``` -**Evaluating Models Trained with DCP Checkpoints (GRPO/SFT):** +**Evaluate Models Trained with DCP Checkpoints (GRPO/SFT):** If you have trained a model using GRPO or SFT and saved the checkpoint in the Pytorch DCP format, you first need to convert it to the Hugging Face format before running evaluation. From 00f94049b1c46561ee1f39dffb938c817c92c966 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:22:46 -0700 Subject: [PATCH 19/36] Update docs/guides/grpo.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/guides/grpo.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md index 6808682b6b..91f649497b 100644 --- a/docs/guides/grpo.md +++ b/docs/guides/grpo.md @@ -1,5 +1,7 @@ # An in-depth walkthrough of GRPO in NeMo-RL +# An in-depth Walkthrough of GRPO in NeMo RL +This guide details the Generative Reversal Policy Optimization (GRPO) implementation within NeMo RL. We'll walk through essential aspects including data handling, policy model training, fast generation, and the specifics of the GRPO loss function and its enhancements. ## Quickstart: Launch a GRPO Run To get started quickly, use the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py), which demonstrates how to train a model on math problems using GRPO. 
You can launch this script locally or via Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](../cluster.md).

From f4b50099b1b7a4d5c653ea28b1a8b815d0a25279 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:24:51 -0700
Subject: [PATCH 20/36] fix

Signed-off-by: Terry Kong
---
 docs/guides/grpo.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 91f649497b..03ab6d0570 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -1,7 +1,7 @@
-# An in-depth walkthrough of GRPO in NeMo-RL
 # An in-depth Walkthrough of GRPO in NeMo RL
-This guide details the Generative Reversal Policy Optimization (GRPO) implementation within NeMo RL. We'll walk through essential aspects including data handling, policy model training, fast generation, and the specifics of the GRPO loss function and its enhancements.
+This guide details the Group Relative Policy Optimization (GRPO) implementation within NeMo RL. We'll walk through essential aspects, including data handling, policy model training, fast generation, and the specifics of the GRPO loss function and its enhancements.
+
 ## Quickstart: Launch a GRPO Run
 To get started quickly, use the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py), which demonstrates how to train a model on math problems using GRPO.
 You can launch this script locally or via Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](../cluster.md).
@@ -12,7 +12,7 @@ We recommend launching the job using `uv`:
 uv run examples/run_grpo_math.py --config {overrides}
 ```
 
-If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml)
+If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml).
**Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. From 58ac1415b8fbb0bb8dae2097c222992ef8832c0a Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:25:44 -0700 Subject: [PATCH 21/36] Update docs/guides/grpo.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/guides/grpo.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md index 03ab6d0570..7ae87f4d15 100644 --- a/docs/guides/grpo.md +++ b/docs/guides/grpo.md @@ -16,7 +16,7 @@ If not specified, `config` will default to [examples/configs/grpo.yaml](../../ex **Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. -In this guide, we'll walk through how we handle +In this guide, we'll walk through how we handle: * Data * Model training From cccf25f9def7ebb56cfdba04f27df9613d35a9c2 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:27:17 -0700 Subject: [PATCH 22/36] done Signed-off-by: Terry Kong --- docs/guides/grpo.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md index 7ae87f4d15..5dd9ad9bef 100644 --- a/docs/guides/grpo.md +++ b/docs/guides/grpo.md @@ -55,7 +55,7 @@ class DatumSpec(TypedDict): We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code." -For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py) +For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py). 
```python def my_data_processor( @@ -67,9 +67,9 @@ def my_data_processor( ) -> DatumSpec: ``` -We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py) +We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py). -#### Putting it all together +#### Put It All Together GRPO expects datasets to have the following form: From 1ee6043c95e77ea57dfc935008a5d66c22836bae Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:28:36 -0700 Subject: [PATCH 23/36] Update docs/guides/sft.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong Signed-off-by: Terry Kong --- docs/guides/sft.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/sft.md b/docs/guides/sft.md index 84a1918038..9a84e969a5 100644 --- a/docs/guides/sft.md +++ b/docs/guides/sft.md @@ -56,7 +56,7 @@ def format_squad(data): } ``` -NeMo RL SFT uses HuggingFace chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your yaml config (see [sft.yaml](../../examples/configs/sft.yaml) for an example): +NeMo RL SFT uses Hugging Face chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your yaml config (see [sft.yaml](../../examples/configs/sft.yaml) for an example): 1. Apply the tokenizer's default chat template. To use the tokenizer's default, either omit `tokenizer.chat_template` from the config altogether, or set `tokenizer.chat_template="default"`. 2. Use a "passthrough" template which simply concatenates all messages. This is desirable if the chat template has been applied to your dataset as an offline preprocessing step. 
In this case, you should set `tokenizer.chat_template` to None as follows:

From 5703b2f52111ee480b1a2bee593f28d1872a0fe1 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:29:02 -0700
Subject: [PATCH 24/36] done

Signed-off-by: Terry Kong
---
 docs/design-docs/checkpointing.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md
index 7d21e1b328..34fd6093b6 100644
--- a/docs/design-docs/checkpointing.md
+++ b/docs/design-docs/checkpointing.md
@@ -1,6 +1,5 @@
 # Checkpointing with Hugging Face Models
 
-## Checkpoint Format
 NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format. Torch distributed is used by default for efficiency, and Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a Hugging Face checkpoint only at the end of training.
 
 A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training:
 
From 6bcf7355e2625f390cf62fe1fee4c501c7d297b3 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:29:55 -0700
Subject: [PATCH 25/36] done

Signed-off-by: Terry Kong
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 0952dac760..771c7bd318 100644
--- a/README.md
+++ b/README.md
@@ -53,13 +53,13 @@ What you can expect:
 
 ## Prerequisites
 
-Clone **NeMo RL**
+Clone **NeMo RL**.
 ```sh
 git clone git@github.com:NVIDIA/nemo-rl.git
 cd nemo-rl
 ```
 
-Install `uv`
+Install `uv`.
```sh
# For faster setup and environment isolation, we use `uv`
pip install uv

From 86eb41dcdd287717cb9eb6e760677858d69c83cf Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:30:41 -0700
Subject: [PATCH 26/36] done

Signed-off-by: Terry Kong
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 771c7bd318..dd2e43aff3 100644
--- a/README.md
+++ b/README.md
@@ -94,7 +94,7 @@ To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
 uv run python examples/run_grpo_math.py
 ```
 
-By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 gpus,
+By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:
-## Table of Contents +## Features ✅ _Available now_ | 🔜 _Coming in v0.3_ From e37928de8be52be03777ceaab8083b95bcc97d09 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:34:33 -0700 Subject: [PATCH 28/36] go Signed-off-by: Terry Kong --- docs/cluster.md | 4 ++-- docs/design-docs/uv.md | 4 ++-- docs/guides/dpo.md | 8 ++++---- docs/guides/sft.md | 2 +- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/cluster.md b/docs/cluster.md index dc08f348cf..dc9fca03b9 100644 --- a/docs/cluster.md +++ b/docs/cluster.md @@ -9,7 +9,7 @@ This guide explains how to initialize NeMo RL clusters. ### Batched Job Submission ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) COMMAND="uv run ./examples/run_grpo_math.py" \ @@ -45,7 +45,7 @@ A key advantage of running interactively on the head node is the ability to exec To run interactively, launch the same command as [Batched Job Submission](#batched-job-submission), but omit the `COMMAND` line: ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) CONTAINER=YOUR_CONTAINER \ diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 08d39cf390..f8f98b1482 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -50,10 +50,10 @@ We provide several predefined Python executable configurations in {py:class}`PY_ class PY_EXECUTABLES: SYSTEM = sys.executable - # Use NeMo-RL direct dependencies. + # Use NeMo RL direct dependencies. BASE = "uv run --locked" - # Use NeMo-RL direct dependencies and vllm. + # Use NeMo RL direct dependencies and vllm. 
VLLM = "uv run --locked --extra vllm" ``` diff --git a/docs/guides/dpo.md b/docs/guides/dpo.md index 6c6ed62833..fcea9f5005 100644 --- a/docs/guides/dpo.md +++ b/docs/guides/dpo.md @@ -1,4 +1,4 @@ -# Direct Preference Optimization in NeMo-RL +# Direct Preference Optimization in NeMo RL [Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the @@ -16,7 +16,7 @@ If not specified, `config` will default to [examples/configs/dpo.yaml](../../exa ## Configuration -NeMo-RL allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml). +NeMo RL allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml). To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example: @@ -32,7 +32,7 @@ uv run examples/run_dpo.py \ ## Datasets -Each class representing a NeMo-RL DPO dataset is expected to have the following attributes: +Each class representing a NeMo RL DPO dataset is expected to have the following attributes: 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset. 
@@ -158,7 +158,7 @@ First train example rejected response: 5 ## DPO-Specific Parameters -The DPO implementation in NeMo-RL supports several key parameters that can be adjusted: +The DPO implementation in NeMo RL supports several key parameters that can be adjusted: - `dpo.reference_policy_kl_penalty`: Controls the strength of the KL penalty term - `dpo.preference_loss_weight`: Weight for the preference loss diff --git a/docs/guides/sft.md b/docs/guides/sft.md index 9a84e969a5..946f8fc1a9 100644 --- a/docs/guides/sft.md +++ b/docs/guides/sft.md @@ -16,7 +16,7 @@ If not specified, `config` will default to [examples/configs/sft.yaml](../../exa ## Example Configuration File -NeMo-RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). +NeMo RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example: From a9cb407fcb880318d0a965ba2eb52717c5750f8e Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:39:45 -0700 Subject: [PATCH 29/36] fix Signed-off-by: Terry Kong --- docs/cluster.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cluster.md b/docs/cluster.md index dc9fca03b9..a957ae7d8d 100644 --- a/docs/cluster.md +++ b/docs/cluster.md @@ -1,6 +1,6 @@ # Set Up Clusters -This guide explains how to initialize NeMo RL clusters. +This guide explains how to run NeMo RL with ray on Slurm or Kubernetes. 
## Slurm (Batched and Interactive) From e438f73cb885aa4bc1f729cc07d931a7c4856efa Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:40:28 -0700 Subject: [PATCH 30/36] capitalize Signed-off-by: Terry Kong --- docs/cluster.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cluster.md b/docs/cluster.md index a957ae7d8d..cfac258c8d 100644 --- a/docs/cluster.md +++ b/docs/cluster.md @@ -1,6 +1,6 @@ # Set Up Clusters -This guide explains how to run NeMo RL with ray on Slurm or Kubernetes. +This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes. ## Slurm (Batched and Interactive) From 4603adc68d156e09f315893efc1b4c3bbe90df5c Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:47:30 -0700 Subject: [PATCH 31/36] revert Signed-off-by: Terry Kong --- docs/design-docs/chat-datasets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design-docs/chat-datasets.md b/docs/design-docs/chat-datasets.md index 7fe570b99a..fafd387109 100644 --- a/docs/design-docs/chat-datasets.md +++ b/docs/design-docs/chat-datasets.md @@ -27,7 +27,7 @@ Hugging Face chat datasets are expected to have the following structure: Each ex ## Chat Templates -Formatting the data with chat templates allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. +Formatting the data in this way allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. 
Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. By default, `apply_chat_template` attempts to apply the `chat_template` associated with the tokenizer. However, in some cases, users might want to specify their own chat template. Also, note that many tokenizers do not have associated `chat_template`s, in which case an explicit chat template is required. Users can specify an explicit chat template string using Jinja format and can pass that string to `apply_chat_template`. The following is an example using a simple template which prepends a role header to each turn: From e8021a0c2bc66f8d19986a7c1588a8bdb9715278 Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 11:48:02 -0700 Subject: [PATCH 32/36] fix Signed-off-by: Terry Kong --- docs/design-docs/checkpointing.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md index 34fd6093b6..f8f11b916f 100644 --- a/docs/design-docs/checkpointing.md +++ b/docs/design-docs/checkpointing.md @@ -2,9 +2,9 @@ NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format. Torch distributed is used by default for efficiency, and Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a Hugging Face checkpoint only at the end of training. 
-A checkpoint converter is provided to convert a Torch distributed checkpoint checkpoint to Hugging Face format after training:
+A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training:
 
- ```python
- uv run examples/convert_dcp_to_hf.py --config=<config> --dcp-ckpt-path=<dcp-checkpoint-path> --hf-ckpt-path=<hf-checkpoint-path>
- ```
+```sh
+uv run examples/convert_dcp_to_hf.py --config=<config> --dcp-ckpt-path=<dcp-checkpoint-path> --hf-ckpt-path=<hf-checkpoint-path>
+```

From 83ae37704e2a142d4dc5f59615d0c8fa9f06a388 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:52:02 -0700
Subject: [PATCH 33/36] go

Signed-off-by: Terry Kong
---
 docs/design-docs/generation.md | 2 +-
 docs/guides/grpo.md            | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/design-docs/generation.md b/docs/design-docs/generation.md
index bb83457b91..275625f371 100644
--- a/docs/design-docs/generation.md
+++ b/docs/design-docs/generation.md
@@ -1,4 +1,4 @@
-# Token Generation
+# Generation Interface
 
 This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API.
 
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 5dd9ad9bef..7ae87f4d15 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -55,7 +55,7 @@ class DatumSpec(TypedDict):
 
 We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code."
 
-For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py).
+For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py).
 
 ```python
 def my_data_processor(
@@ -67,9 +67,9 @@ def my_data_processor(
 ) -> DatumSpec:
 ```
 
-We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py).
+We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py).
 
-#### Put It All Together
+#### Putting It All Together
 
 GRPO expects datasets to have the following form:
 

From b95ac5b7ffbcf4663f0cf555db50bc6ba38d56ea Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 11:57:54 -0700
Subject: [PATCH 34/36] ok

Signed-off-by: Terry Kong
---
 docs/design-docs/logger.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design-docs/logger.md b/docs/design-docs/logger.md
index 3e861ecab5..d15ad5c1ba 100644
--- a/docs/design-docs/logger.md
+++ b/docs/design-docs/logger.md
@@ -106,5 +106,5 @@ While it is feasible to monitor using remote workers, the implementation require
 * Logging behaves consistently across TensorBoard and Wandb.
 * Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
 
-Due to these complexities, we opted for a simpler approach: collecting metrics directly on the driver.
+Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.
 :::
\ No newline at end of file

From 87aa887a37fa9a1c35572628a71ffe36fdd7eb46 Mon Sep 17 00:00:00 2001
From: Terry Kong
Date: Tue, 6 May 2025 12:01:20 -0700
Subject: [PATCH 35/36] fix

Signed-off-by: Terry Kong
---
 docs/guides/sft.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/sft.md b/docs/guides/sft.md
index 946f8fc1a9..0933b0f540 100644
--- a/docs/guides/sft.md
+++ b/docs/guides/sft.md
@@ -34,7 +34,7 @@ SFT datasets in NeMo RL are encapsulated using classes. 
Each SFT data class is e 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset. -SFT datasets are expected to follow the Hugging Face chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_reinforcer/data/hf_datasets/squad.py) has an example: +SFT datasets are expected to follow the Hugging Face chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_rl/data/hf_datasets/squad.py) has an example: ```python def format_squad(data): From d9edcb7211fe882c507b1b160c44c7abe0ba626f Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 6 May 2025 12:31:20 -0700 Subject: [PATCH 36/36] lint Signed-off-by: Terry Kong --- README.md | 8 ++++---- docs/documentation.md | 6 +++--- docs/guides/grpo.md | 2 +- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index d8eef3caf0..ee9efd8df6 100644 --- a/README.md +++ b/README.md @@ -10,8 +10,8 @@ - [GRPO Qwen2.5-32B](#grpo-qwen25-32b) - [Quickstart](#quickstart) - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft) - - [Run Single Node SFT](#run-single-node-sft) - - [SFT Multi-node](#sft-multi-node) + - [Run Single Node SFT](#run-single-node-sft) + - [SFT Multi-node](#sft-multi-node) - [DPO](#dpo) - [DPO Single Node](#dpo-single-node) - [DPO Multi-node](#dpo-multi-node) @@ -171,7 +171,7 @@ Before running any experiments, remember to set your `HF_HOME` environment varia We 
provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). -#### Run Single Node SFT +### Run Single Node SFT The default SFT configuration is set to run on a single GPU. To start the experiment: @@ -193,7 +193,7 @@ uv run python examples/run_sft.py \ Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden. -#### SFT Multi-node +### SFT Multi-node ```sh # Run from the root of NeMo RL repo diff --git a/docs/documentation.md b/docs/documentation.md index 58230a0592..07d4e6b432 100644 --- a/docs/documentation.md +++ b/docs/documentation.md @@ -1,10 +1,10 @@ # Documentation Development - [Documentation Development](#documentation-development) - - [Building](#building) + - [Build the Documentation](#build-the-documentation) - [Live Building](#live-building) - - [Running Tests in Python Docstrings](#running-tests-in-python-docstrings) - - [Writing Tests in Python Docstrings](#writing-tests-in-python-docstrings) + - [Run Tests in Python Docstrings](#run-tests-in-python-docstrings) + - [Write Tests in Python Docstrings](#write-tests-in-python-docstrings) ## Build the Documentation diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md index 7ae87f4d15..4c9fa93767 100644 --- a/docs/guides/grpo.md +++ b/docs/guides/grpo.md @@ -152,7 +152,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro #### Importance Sampling Correction -The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. 
As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understand-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
 
 Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of the loss function. Then,