117 changes: 68 additions & 49 deletions README.md
@@ -1,28 +1,31 @@
# Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to >100B Parameters, scaling from 1 GPU to 100s
# Nemo RL: A Scalable and Efficient Post-Training Library

<!-- markdown all in one -->
- [Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-rl-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
- [Nemo RL: A Scalable and Efficient Post-Training Library](#nemo-rl-a-scalable-and-efficient-post-training-library)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
- [Single Node](#grpo-single-node)
- [Multi-node](#grpo-multi-node)
- [GRPO Single Node](#grpo-single-node)
- [GRPO Multi-node](#grpo-multi-node)
- [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
- [SFT](#sft)
- [Single Node](#sft-single-node)
- [Multi-node](#sft-multi-node)
- [Quickstart](#quickstart)
- [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Run Single Node SFT](#run-single-node-sft)
- [SFT Multi-node](#sft-multi-node)
- [DPO](#dpo)
- [Single Node](#dpo-single-node)
- [Multi-node](#dpo-multi-node)
- [Cluster Start](#cluster-start)
- [DPO Single Node](#dpo-single-node)
- [DPO Multi-node](#dpo-multi-node)
- [Set Up Clusters](#set-up-clusters)
- [Citation](#citation)
- [Contributing](#contributing)
- [Licenses](#licenses)

**Nemo-RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
**Nemo RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.

What you can expect:

- **Seamless integration with HuggingFace** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- **High-performance implementation with Megatron core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths.
- **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations.
- **Flexibility** with a modular design that allows easy integration and customization.
- **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.
@@ -31,32 +34,32 @@ What you can expect:

✅ _Available now_ | 🔜 _Coming in v0.3_

- ✅ **Fast Generation** - vLLM backend for optimized inference
- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama)
- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure
- ✅ **Fast Generation** - vLLM backend for optimized inference.
- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama).
- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure.
- ✅ **Environment Support** - Support for multi-environment training.
- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization)
- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters
- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
- ✅ **Environment Isolation** - Dependency isolation between components

- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron
- 🔜 **Improved Native Performance** - Improve training time for Native Pytorch Models
- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4
- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters.
- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training.
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
- ✅ **Environment Isolation** - Dependency isolation between components.

- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron.
- 🔜 **Improved Native Performance** - Improve training time for native PyTorch models.
- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core.
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models.
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4.

## Prerequisites

Clone **NeMo RL**
Clone **NeMo RL**.
```sh
git clone git@github.com:NVIDIA/nemo-rl.git
cd nemo-rl
```

Install `uv`
Install `uv`.
```sh
# For faster setup and environment isolation, we use `uv`
pip install uv
@@ -72,9 +75,11 @@ pip install uv
# Example: uv run python examples/run_grpo_math.py
```

## Quick start
**Important Notes:**

**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
- Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.
- **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
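
For reference, a typical environment setup might look like the following (the paths and key value are placeholders for illustration, not defaults mandated by NeMo RL):

```sh
# Illustrative environment setup; adjust paths to your system
export HF_HOME="$HOME/.cache/huggingface"      # model/tokenizer cache
export HF_DATASETS_CACHE="$HF_HOME/datasets"   # datasets cache (optional)
export WANDB_API_KEY="<your-wandb-api-key>"    # only if using W&B logging
huggingface-cli login                          # required for gated models (e.g., Llama)
```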

### GRPO

@@ -89,7 +94,7 @@ To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 gpus,
By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs,

```sh
# Run the GRPO math example using a 1B parameter model using 8 GPUs
@@ -111,7 +116,7 @@ uv run python examples/run_grpo_math.py \
#### GRPO Multi-node

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8b uses Llama-3.1-8B-Instruct model
@@ -131,7 +136,7 @@ sbatch \
##### GRPO Qwen2.5-32B

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
@@ -158,21 +163,25 @@ Reference example for training to play a Sliding Puzzle Game:
uv run python examples/run_grpo_sliding_puzzle.py
```

### SFT
## Quickstart

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`.

#### SFT Single Node
## Supervised Fine-Tuning (SFT)

The default SFT experiment is configured to run on a single GPU. To launch the experiment,
We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

### Run Single Node SFT

The default SFT configuration is set to run on a single GPU. To start the experiment:

```sh
uv run python examples/run_sft.py
```

This trains `Llama3.2-1B` on one GPU using the SQUAD dataset.
This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using a single GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration. We also switch to an 8B Llama base model and increase the batch size:
To use multiple GPUs on a single node, you can modify the cluster configuration. This adjustment also allows you to increase the model size and batch size:

```sh
uv run python examples/run_sft.py \
@@ -184,10 +193,10 @@ uv run python examples/run_sft.py \

Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### SFT Multi-node
### SFT Multi-node

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
@@ -244,7 +253,7 @@ Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters
For distributed DPO training across multiple nodes, modify the following script for your use case:

```sh
# Run from the root of NeMo-RL repo
# Run from the root of NeMo RL repo
## number of nodes to use for your job
NUM_ACTOR_NODES=2

@@ -262,19 +271,29 @@ sbatch \
ray.sub
```

## Cluster Start
## Set Up Clusters

Please visit [Cluster Start](docs/cluster.md) for how to get started on Slurm or Kubernetes.
For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation.

## Citation

If you use NeMo-RL in your research, please cite it using the following BibTeX entry:
If you use NeMo RL in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{nemo-rl,
title = {NeMo-RL: A Scalable and Efficient Post-Training Library},
title = {NeMo RL: A Scalable and Efficient Post-Training Library},
howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
year = {2025},
note = {GitHub repository},
}
```

## Contributing

We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/nemo-rl/blob/main/CONTRIBUTING.md) for more information on how to get involved.

## Licenses

NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/nemo-rl/blob/main/LICENSE).

NeMo is licensed under the [NVIDIA AI PRODUCT AGREEMENT](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). By pulling and using the container, you accept the terms and conditions of this license.
48 changes: 24 additions & 24 deletions docs/adding-new-models.md
@@ -1,10 +1,10 @@
# Adding New Models
# Add New Models

This guide outlines how to integrate and validate a new model within **NeMo-RL**. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines.
This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines.

## Importance of Log Probability Consistency in Training and Inference

In on-policy RL, we sample tokens (actions) from the latest version of the policy, meaning the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.
In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.

As an example, we would see errors in naive KL estimation:

@@ -14,43 +14,43 @@ When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong

$$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}(x) - \pi(x) \right)$$
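
The effect of sampling under a mismatched distribution can be illustrated with a toy Monte Carlo estimate of KL (the distributions below are made up purely for illustration; this is not NeMo RL code):

```python
import torch

torch.manual_seed(0)

# Toy categorical distributions over 4 tokens (made-up numbers).
pi       = torch.tensor([0.10, 0.20, 0.30, 0.40])  # current policy
pi_ref   = torch.tensor([0.25, 0.25, 0.25, 0.25])  # reference policy
pi_wrong = torch.tensor([0.40, 0.30, 0.20, 0.10])  # mismatched inference distribution

log_ratio = (pi / pi_ref).log()
true_kl = (pi * log_ratio).sum()  # exact KL(pi || pi_ref)

n = 200_000
x_right = torch.multinomial(pi, n, replacement=True)       # x ~ pi
x_wrong = torch.multinomial(pi_wrong, n, replacement=True)  # x ~ pi_wrong

est_right = log_ratio[x_right].mean()  # unbiased Monte Carlo estimate
est_wrong = log_ratio[x_wrong].mean()  # biased: sampled from the wrong policy
```

Even with many samples, the estimate drawn from `pi_wrong` lands far from the true KL, which is exactly the loss-estimation error described here.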

So, to verify correctness, we calculate
So, to verify correctness, we calculate:

$$
\frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right\|\right)
$$

where samples are drawn as $x \sim \pi_{\text{inference-framework}}$
as a measure of multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$.

As a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.

## Understanding Discrepancies Between Backends
## Understand Discrepancies Between Backends

When validating models across different backends, you may encounter discrepancies in log probabilities. These differences can stem from various sources with effects ranging from negligible to significant:

- **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8).
- Training may use mixed precision while the inference backend may not
- High-precision training with FP8 inference may not be numerically stable for certain models
- Differences can occur at the layer level, with some layers in FP32 while others use lower precision
- Training may use mixed precision, while the inference backend may not.
- High-precision training with FP8 inference may not be numerically stable for certain models.
- Differences can occur at the layer level, with some layers in FP32, while others use lower precision.

- **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented.
- Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends
- Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences
- Softmax in training frameworks may be calculated differently than in inference backends for numerical stability
- Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends.
- Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences.
- Softmax in training frameworks may be calculated differently than in inference backends for numerical stability.

- **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation.
- In some cases, disabling the inference backend cache can resolve discrepancies
- In some cases, disabling the inference backend cache can resolve discrepancies.

- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations
- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations.

- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`)
- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`).

- **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities.
- Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels
- Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels.

- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect
- If weights are reshaped or reordered incorrectly, generations tend to be very wrong
- In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses
- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect.
- If weights are reshaped or reordered incorrectly, generations tend to be very wrong.
- In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses.

- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1`.

@@ -66,10 +66,10 @@ When investigating discrepancies beyond the acceptable threshold, focus on these
When validating Hugging Face-based models, perform the following checks:

- **Compare log probabilities**
Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches.
Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches.

- **Test parallelism**
Verify consistency with other parallelism settings.
Verify consistency with other parallelism settings.

- **Variance**
Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance.
@@ -96,7 +96,7 @@ When validating Hugging Face-based models, perform the following checks:
### Additional Validation

- **Compare Megatron outputs**
Ensure the Megatron forward pass aligns with HuggingFace and the generation log probabilities from inference backends like **vLLM**.
Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**.

- **Parallel settings**
Match the same parallelism configurations used for the HuggingFace-based tests.
@@ -120,4 +120,4 @@ When validating your model, you should analyze the results across different conf

---

By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-RL**'s requirements.
By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL.