117 changes: 68 additions & 49 deletions README.md
@@ -1,28 +1,31 @@
# Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to >100B Parameters, scaling from 1 GPU to 100s
# Nemo RL: A Scalable and Efficient Post-Training Library

<!-- markdown all in one -->
- [Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-rl-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
- [Nemo RL: A Scalable and Efficient Post-Training Library](#nemo-rl-a-scalable-and-efficient-post-training-library)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
- [Single Node](#grpo-single-node)
- [Multi-node](#grpo-multi-node)
- [GRPO Single Node](#grpo-single-node)
- [GRPO Multi-node](#grpo-multi-node)
- [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
- [SFT](#sft)
- [Single Node](#sft-single-node)
- [Multi-node](#sft-multi-node)
- [Quickstart](#quickstart)
- [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Run Single Node SFT](#run-single-node-sft)
- [SFT Multi-node](#sft-multi-node)
- [DPO](#dpo)
- [Single Node](#dpo-single-node)
- [Multi-node](#dpo-multi-node)
- [Cluster Start](#cluster-start)
- [DPO Single Node](#dpo-single-node)
- [DPO Multi-node](#dpo-multi-node)
- [Set Up Clusters](#set-up-clusters)
- [Citation](#citation)
- [Contributing](#contributing)
- [Licenses](#licenses)

**Nemo-RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
**Nemo RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.

What you can expect:

- **Seamless integration with HuggingFace** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- **High-performance implementation with Megatron core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
- **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations.
- **Flexibility** with a modular design that allows easy integration and customization.
- **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.
@@ -31,32 +34,32 @@ What you can expect:

✅ _Available now_ | 🔜 _Coming in v0.3_

- ✅ **Fast Generation** - vLLM backend for optimized inference
- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama)
- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure
- ✅ **Fast Generation** - vLLM backend for optimized inference.
- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama).
- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure.
- ✅ **Environment Support** - Support for multi-environment training.
- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization)
- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters
- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
- ✅ **Environment Isolation** - Dependency isolation between components

- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron
- 🔜 **Improved Native Performance** - Improve training time for Native Pytorch Models
- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4
- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters.
- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training.
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
- ✅ **Environment Isolation** - Dependency isolation between components.

- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron.
- 🔜 **Improved Native Performance** - Improve training time for native PyTorch models.
- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core.
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models.
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4.

## Prerequisites

Clone **NeMo RL**
Clone **NeMo RL**.
```sh
git clone git@github.com:NVIDIA/nemo-rl.git
cd nemo-rl
```

Install `uv`
Install `uv`.
```sh
# For faster setup and environment isolation, we use `uv`
pip install uv
@@ -72,9 +75,11 @@ pip install uv
# Example: uv run python examples/run_grpo_math.py
```

## Quick start
**Important Notes:**

**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
- Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.
- **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
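
For reference, a typical environment setup might look like the following (the paths and key value are placeholders for illustration, not defaults mandated by NeMo RL):

```sh
# Illustrative environment setup; adjust paths to your system
export HF_HOME="$HOME/.cache/huggingface"      # model/tokenizer cache
export HF_DATASETS_CACHE="$HF_HOME/datasets"   # datasets cache (optional)
export WANDB_API_KEY="<your-wandb-api-key>"    # only if using W&B logging
huggingface-cli login                          # required for gated models (e.g., Llama)
```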

### GRPO

@@ -89,7 +94,7 @@ To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 gpus,
By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs,

```sh
# Run the GRPO math example using a 1B parameter model using 8 GPUs
@@ -111,7 +116,7 @@ uv run python examples/run_grpo_math.py \
#### GRPO Multi-node

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8b uses Llama-3.1-8B-Instruct model
@@ -131,7 +136,7 @@ sbatch \
##### GRPO Qwen2.5-32B

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
@@ -158,21 +163,25 @@ Reference example for training to play a Sliding Puzzle Game:
uv run python examples/run_grpo_sliding_puzzle.py
```

### SFT
## Quickstart

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`.

#### SFT Single Node
## Supervised Fine-Tuning (SFT)

The default SFT experiment is configured to run on a single GPU. To launch the experiment,
We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

### Run Single Node SFT

The default SFT configuration is set to run on a single GPU. To start the experiment:

```sh
uv run python examples/run_sft.py
```

This trains `Llama3.2-1B` on one GPU using the SQUAD dataset.
This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using a single GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration. We also switch to an 8B Llama base model and increase the batch size:
To use multiple GPUs on a single node, you can modify the cluster configuration. This adjustment also allows you to increase the model size and batch size:

```sh
uv run python examples/run_sft.py \
@@ -184,10 +193,10 @@ uv run python examples/run_sft.py \

Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### SFT Multi-node
### SFT Multi-node

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
@@ -244,7 +253,7 @@ Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters
For distributed DPO training across multiple nodes, modify the following script for your use case:

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
## number of nodes to use for your job
NUM_ACTOR_NODES=2

@@ -262,19 +271,29 @@ sbatch \
ray.sub
```

## Cluster Start
## Set Up Clusters

Please visit [Cluster Start](docs/cluster.md) for how to get started on Slurm or Kubernetes.
For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation.

## Citation

If you use NeMo-RL in your research, please cite it using the following BibTeX entry:
If you use NeMo RL in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{nemo-rl,
title = {NeMo-RL: A Scalable and Efficient Post-Training Library},
title = {NeMo RL: A Scalable and Efficient Post-Training Library},
howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
year = {2025},
note = {GitHub repository},
}
```

## Contributing

We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/nemo-rl/blob/main/CONTRIBUTING.md) for more information on how to get involved.

## Licenses

NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/nemo-rl/blob/main/LICENSE).

NeMo is licensed under the [NVIDIA AI PRODUCT AGREEMENT](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). By pulling and using the container, you accept the terms and conditions of this license.
48 changes: 24 additions & 24 deletions docs/adding-new-models.md
@@ -1,10 +1,10 @@
# Adding New Models
# Add New Models

This guide outlines how to integrate and validate a new model within **NeMo-RL**. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines.
This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines.

## Importance of Log Probability Consistency in Training and Inference

In on-policy RL, we sample tokens (actions) from the latest version of the policy, meaning the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.
In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.

As an example, we would see errors in naive KL estimation:

@@ -14,43 +14,43 @@ When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong

$$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}(x) - \pi(x) \right)$$
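
The effect of sampling under a mismatched distribution can be illustrated with a toy Monte Carlo estimate of KL (the distributions below are made up purely for illustration; this is not NeMo RL code):

```python
import torch

torch.manual_seed(0)

# Toy categorical distributions over 4 tokens (made-up numbers).
pi       = torch.tensor([0.10, 0.20, 0.30, 0.40])  # current policy
pi_ref   = torch.tensor([0.25, 0.25, 0.25, 0.25])  # reference policy
pi_wrong = torch.tensor([0.40, 0.30, 0.20, 0.10])  # mismatched inference distribution

log_ratio = (pi / pi_ref).log()
true_kl = (pi * log_ratio).sum()  # exact KL(pi || pi_ref)

n = 200_000
x_right = torch.multinomial(pi, n, replacement=True)       # x ~ pi
x_wrong = torch.multinomial(pi_wrong, n, replacement=True)  # x ~ pi_wrong

est_right = log_ratio[x_right].mean()  # unbiased Monte Carlo estimate
est_wrong = log_ratio[x_wrong].mean()  # biased: sampled from the wrong policy
```

Even with many samples, the estimate drawn from `pi_wrong` lands far from the true KL, which is exactly the loss-estimation error described here.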

So, to verify correctness, we calculate
So, to verify correctness, we calculate:

$$
\frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right\|\right)
$$

where samples are drawn as $x \sim \pi_{\text{inference-framework}}$
as a measure of multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$.

As a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.

## Understanding Discrepancies Between Backends
## Understand Discrepancies Between Backends

When validating models across different backends, you may encounter discrepancies in log probabilities. These differences can stem from various sources with effects ranging from negligible to significant:

- **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8).
- Training may use mixed precision while the inference backend may not
- High-precision training with FP8 inference may not be numerically stable for certain models
- Differences can occur at the layer level, with some layers in FP32 while others use lower precision
- Training may use mixed precision, while the inference backend may not.
- High-precision training with FP8 inference may not be numerically stable for certain models.
- Differences can occur at the layer level, with some layers in FP32, while others use lower precision.

- **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented.
- Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends
- Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences
- Softmax in training frameworks may be calculated differently than in inference backends for numerical stability
- Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends.
- Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences.
- Softmax in training frameworks may be calculated differently than in inference backends for numerical stability.

- **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation.
- In some cases, disabling the inference backend cache can resolve discrepancies
- In some cases, disabling the inference backend cache can resolve discrepancies.

- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations
- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations.

- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`)
- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`).

- **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities.
- Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels
- Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels.

- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect
- If weights are reshaped or reordered incorrectly, generations tend to be very wrong
- In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses
- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect.
- If weights are reshaped or reordered incorrectly, generations tend to be very wrong.
- In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses.

- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1`.

@@ -66,10 +66,10 @@ When investigating discrepancies beyond the acceptable threshold, focus on these
When validating Hugging Face-based models, perform the following checks:

- **Compare log probabilities**
Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches.
Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches.

- **Test parallelism**
Verify consistency with other parallelism settings.
Verify consistency with other parallelism settings.

- **Variance**
Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance.
@@ -96,7 +96,7 @@ When validating Hugging Face-based models, perform the following checks:
### Additional Validation

- **Compare Megatron outputs**
Ensure the Megatron forward pass aligns with HuggingFace and the generation log probabilities from inference backends like **vLLM**.
Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**.

- **Parallel settings**
Match the same parallelism configurations used for the HuggingFace-based tests.
@@ -120,4 +120,4 @@ When validating your model, you should analyze the results across different conf

---

By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-RL**'s requirements.
By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL.