diff --git a/README.md b/README.md index c857d84510..ee9efd8df6 100644 --- a/README.md +++ b/README.md @@ -1,28 +1,31 @@ -# Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to >100B Parameters, scaling from 1 GPU to 100s +# NeMo RL: A Scalable and Efficient Post-Training Library -- [Nemo-RL: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-rl-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s) +- [NeMo RL: A Scalable and Efficient Post-Training Library](#nemo-rl-a-scalable-and-efficient-post-training-library) - [Features](#features) - [Prerequisites](#prerequisites) - - [Quick start](#quick-start) - [GRPO](#grpo) - - [Single Node](#grpo-single-node) - - [Multi-node](#grpo-multi-node) + - [GRPO Single Node](#grpo-single-node) + - [GRPO Multi-node](#grpo-multi-node) - [GRPO Qwen2.5-32B](#grpo-qwen25-32b) - - [SFT](#sft) - - [Single Node](#sft-single-node) - - [Multi-node](#sft-multi-node) + - [Quickstart](#quickstart) + - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft) + - [Run Single Node SFT](#run-single-node-sft) + - [SFT Multi-node](#sft-multi-node) - [DPO](#dpo) - - [Single Node](#dpo-single-node) - - [Multi-node](#dpo-multi-node) - - [Cluster Start](#cluster-start) + - [DPO Single Node](#dpo-single-node) + - [DPO Multi-node](#dpo-multi-node) + - [Set Up Clusters](#set-up-clusters) + - [Citation](#citation) + - [Contributing](#contributing) + - [Licenses](#licenses) -**Nemo-RL** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters. +**NeMo RL** is a scalable and efficient post-training library that scales from 1 GPU to thousands and supports models from tiny to over 100 billion parameters. 
What you can expect: -- **Seamless integration with HuggingFace** for ease of use, allowing users to leverage a wide range of pre-trained models and tools. -- **High-performance implementation with Megatron core**, supporting various parallelism techniques for large models (>100B) and large context lengths. +- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools. +- **High-performance implementation with Megatron Core**, supporting various parallelism techniques for large models (>100B) and large context lengths. - **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations. - **Flexibility** with a modular design that allows easy integration and customization. - **Comprehensive documentation** that is both detailed and user-friendly, with practical examples. @@ -31,32 +34,32 @@ What you can expect: ✅ _Available now_ | 🔜 _Coming in v0.3_ -- ✅ **Fast Generation** - vLLM backend for optimized inference -- ✅ **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama) -- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure +- ✅ **Fast Generation** - vLLM backend for optimized inference. +- ✅ **Hugging Face Integration** - Works with 1-32B models (Qwen2.5, Llama). +- ✅ **Distributed Training** - FSDP support and Ray-based infrastructure. - ✅ **Environment Support** - Support for multi-environment training. -- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization) -- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc. 
-- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters -- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training -- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state) -- ✅ **Environment Isolation** - Dependency isolation between components - -- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron -- 🔜 **Improved Native Performance** - Improve training time for Native Pytorch Models -- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core -- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models -- 🔜 **MoE Models** - Support DeepseekV3 and Llama4 +- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization). +- ✅ **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc. +- ✅ **Large Model Support** - Native PyTorch support for models up to 32B parameters. +- ✅ **Advanced Parallelism** - FSDP2, TP, and SP for efficient training. +- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state). +- ✅ **Environment Isolation** - Dependency isolation between components. + +- 🔜 **(Even) Larger Model Support** - Native PyTorch & Megatron. +- 🔜 **Improved Native Performance** - Improved training time for native PyTorch models. +- 🔜 **Megatron Policy** - Support advanced parallelism in training with Megatron Core. +- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support of new Megatron models. +- 🔜 **MoE Models** - Support DeepSeek-V3 and Llama 4. ## Prerequisites -Clone **NeMo RL** +Clone **NeMo RL**. ```sh git clone git@github.com:NVIDIA/nemo-rl.git cd nemo-rl ``` -Install `uv` +Install `uv`. 
```sh # For faster setup and environment isolation, we use `uv` pip install uv @@ -72,9 +75,11 @@ pip install uv # Example: uv run python examples/run_grpo_math.py ``` -## Quick start +**Important Notes:** -**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. +- Use `uv run` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions. +- Ensure you have the necessary CUDA drivers and a PyTorch installation compatible with your hardware. +- **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. ### GRPO @@ -89,7 +94,7 @@ To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`: uv run python examples/run_grpo_math.py ``` -By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 gpus, +By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. 
For example, to run on 8 GPUs, ```sh # Run the GRPO math example using a 1B parameter model using 8 GPUs uv run python examples/run_grpo_math.py \ @@ -111,7 +116,7 @@ uv run python examples/run_grpo_math.py \ #### GRPO Multi-node ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=2 # grpo_math_8b uses Llama-3.1-8B-Instruct model @@ -131,7 +136,7 @@ sbatch \ ##### GRPO Qwen2.5-32B ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=16 # Download Qwen before the job starts to avoid spending time downloading during the training loop @@ -158,21 +163,25 @@ Reference example for training to play a Sliding Puzzle Game: uv run python examples/run_grpo_sliding_puzzle.py ``` -### SFT +## Quickstart -We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). +Before running any experiments, remember to set your `HF_HOME` environment variable and your `WANDB_API_KEY` if you intend to use Weights & Biases for logging. For accessing Llama models, you might also need to log in using `huggingface-cli login`. -#### SFT Single Node +## Supervised Fine-Tuning (SFT) -The default SFT experiment is configured to run on a single GPU. To launch the experiment, +We provide an example SFT experiment using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). + +### Run Single Node SFT + +The default SFT configuration is set to run on a single GPU. To start the experiment: ```sh uv run python examples/run_sft.py ``` -This trains `Llama3.2-1B` on one GPU using the SQUAD dataset. +This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using 1 GPU. -If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration. We also switch to an 8B Llama base model and increase the batch size: +To use multiple GPUs on a single node, you can modify the cluster configuration. 
With more GPUs, you can also switch to a larger base model (for example, an 8B Llama) and increase the batch size: ```sh uv run python examples/run_sft.py \ @@ -184,10 +193,10 @@ uv run python examples/run_sft.py \ Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden. -#### SFT Multi-node +### SFT Multi-node ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=2 COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \ @@ -244,7 +253,7 @@ Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters For distributed DPO training across multiple nodes, modify the following script for your use case: ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo ## number of nodes to use for your job NUM_ACTOR_NODES=2 @@ -262,19 +271,29 @@ sbatch \ ray.sub ``` -## Cluster Start +## Set Up Clusters -Please visit [Cluster Start](docs/cluster.md) for how to get started on Slurm or Kubernetes. +For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation. ## Citation -If you use NeMo-RL in your research, please cite it using the following BibTeX entry: +If you use NeMo RL in your research, please cite it using the following BibTeX entry: ```bibtex @misc{nemo-rl, -title = {NeMo-RL: A Scalable and Efficient Post-Training Library}, +title = {NeMo RL: A Scalable and Efficient Post-Training Library}, howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}}, year = {2025}, note = {GitHub repository}, } ``` + +## Contributing + +We welcome contributions to NeMo RL! Please see our [Contributing Guidelines](https://github.com/NVIDIA/nemo-rl/blob/main/CONTRIBUTING.md) for more information on how to get involved. 
+ +## Licenses + +NVIDIA NeMo RL is licensed under the [Apache License 2.0](https://github.com/NVIDIA/nemo-rl/blob/main/LICENSE). + +NeMo is licensed under the [NVIDIA AI PRODUCT AGREEMENT](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). By pulling and using the container, you accept the terms and conditions of this license. diff --git a/docs/adding-new-models.md b/docs/adding-new-models.md index 9afcb46cf9..eefdfb5d9f 100644 --- a/docs/adding-new-models.md +++ b/docs/adding-new-models.md @@ -1,10 +1,10 @@ -# Adding New Models +# Add New Models -This guide outlines how to integrate and validate a new model within **NeMo-RL**. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. +This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready to be used in RL pipelines. ## Importance of Log Probability Consistency in Training and Inference -In on-policy RL, we sample tokens (actions) from the latest version of the policy, meaning the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. +In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation. 
As an example, we would see errors in naive KL estimation: @@ -14,43 +14,43 @@ When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong $$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}(x) - \pi(x) \right)$$ -So, to verify correctness, we calculate +So, to verify correctness, we calculate: $$ \frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right\|\right) $$ -where samples are drawn as $x \sim \pi_{\text{inference-framework}}$ +as a measure of multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$. -As a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient. +Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient. -## Understanding Discrepancies Between Backends +## Understand Discrepancies Between Backends When validating models across different backends, you may encounter discrepancies in log probabilities. 
These differences can stem from various sources with effects ranging from negligible to significant: - **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8). - - Training may use mixed precision while the inference backend may not - - High-precision training with FP8 inference may not be numerically stable for certain models - - Differences can occur at the layer level, with some layers in FP32 while others use lower precision + - Training may use mixed precision, while the inference backend may not. + - High-precision training with FP8 inference may not be numerically stable for certain models. + - Differences can occur at the layer level, with some layers in FP32, while others use lower precision. - **Implementation variations**: Subtle differences in how layer implementations like softmax, layer normalization, or attention mechanisms are implemented. - - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends - - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences - - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability + - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends. + - Inference backends may re-implement kernels (e.g., for SSM layers) leading to differences. + - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability. - **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation. - - In some cases, disabling the inference backend cache can resolve discrepancies + - In some cases, disabling the inference backend cache can resolve discrepancies. 
-- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations +- **Parallelism effects**: Parallelisms like Tensor parallelism may introduce small variations. -- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`) +- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`). - **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities. - - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels + - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels. -- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect - - If weights are reshaped or reordered incorrectly, generations tend to be very wrong - - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses +- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect. + - If weights are reshaped or reordered incorrectly, generations tend to be very wrong. + - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses. 
- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1` @@ -66,10 +66,10 @@ When investigating discrepancies beyond the acceptable threshold, focus on these When validating Hugging Face-based models, perform the following checks: - **Compare log probabilities** - Ensure the generation log probabilities from inference backends like **vLLM** match those computed by HuggingFace. This comparison helps diagnose potential mismatches. + Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches. - **Test parallelism** - Verify consistency with other parallelism settings. + Verify consistency with other parallelism settings. - **Variance** Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance. @@ -96,7 +96,7 @@ When validating Hugging Face-based models, perform the following checks: ### Additional Validation - **Compare Megatron outputs** - Ensure the Megatron forward pass aligns with HuggingFace and the generation log probabilities from inference backends like **vLLM**. + Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**. - **Parallel settings** Match the same parallelism configurations used for the HuggingFace-based tests. @@ -120,4 +120,4 @@ When validating your model, you should analyze the results across different conf --- -By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-RL**'s requirements. \ No newline at end of file +By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL. 
diff --git a/docs/cluster.md b/docs/cluster.md index 260acaeb1e..cfac258c8d 100644 --- a/docs/cluster.md +++ b/docs/cluster.md @@ -1,18 +1,15 @@ -# Cluster start -- [Cluster start](#cluster-start) - - [Slurm](#slurm) - - [Batched Job Submission](#batched-job-submission) - - [Interactive Launching](#interactive-launching) - - [Slurm UV\_CACHE\_DIR](#slurm-uv_cache_dir) - - [Kubernetes](#kubernetes) +# Set Up Clusters +This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes. -## Slurm +## Slurm (Batched and Interactive) + + The following sections explain how to use Slurm for batched job submission and for running jobs interactively. ### Batched Job Submission ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) COMMAND="uv run ./examples/run_grpo_math.py" \ @@ -35,7 +32,7 @@ Which will print the `SLURM_JOB_ID`: ```text Submitted batch job 1980204 ``` -Make note of the the job submission number. Once the job begins you can track it's process in the driver logs which you can `tail`: +Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`: ```sh tail -f 1980204-logs/ray-driver.log ``` @@ -43,12 +40,12 @@ tail -f 1980204-logs/ray-driver.log ### Interactive Launching :::{tip} -A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the SLURM job queue. +A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. 
This means that during debugging sessions, you can avoid submitting a new `sbatch` command each time. Instead, you can debug and re-submit your NeMo RL job directly from the interactive session. ::: -To run interactively, launch the same command as the [Batched Job Submission](#batched-job-submission) except omit the `COMMAND` line: +To run interactively, launch the same command as [Batched Job Submission](#batched-job-submission), but omit the `COMMAND` line: ```sh -# Run from the root of NeMo-RL repo +# Run from the root of NeMo RL repo NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0) CONTAINER=YOUR_CONTAINER \ @@ -66,12 +63,12 @@ Which will print the `SLURM_JOB_ID`: ```text Submitted batch job 1980204 ``` -Once the ray cluster is up, a script should be created to attach to the ray head node, -which you can use launch experiments. +Once the Ray cluster is up, a script is created that attaches to the Ray head node, +which you can use to launch experiments. ```sh bash 1980204-attach.sh ``` -Now that you are on the head node, you can launch the command like so: +Now that you are on the head node, you can launch the command as follows: ```sh uv run ./examples/run_grpo_math.py ``` @@ -81,7 +78,7 @@ uv run ./examples/run_grpo_math.py There are several choices for `UV_CACHE_DIR` when using `ray.sub`: 1. (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified in the shell environment, and is mounted to head and worker nodes to serve as a persistent cache between runs. -2. Use the warm uv cache from our docker images +2. Use the warm uv cache from our docker images: ```sh ... UV_CACHE_DIR=/home/ray/.cache/uv \ @@ -96,4 +93,4 @@ covered by warmed cache. 
## Kubernetes -TBD +TBD \ No newline at end of file diff --git a/docs/design-docs/chat-datasets.md b/docs/design-docs/chat-datasets.md index 43e2801fdc..fafd387109 100644 --- a/docs/design-docs/chat-datasets.md +++ b/docs/design-docs/chat-datasets.md @@ -1,8 +1,10 @@ # Data Format -## HuggingFace Chat Datasets +This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information. -HuggingFace chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. `messages` should be a list of dictionaries, each with a `role` and `content` key. `role` is typically one of `system`, `user`, and `assistant`. For example: +## Hugging Face Chat Datasets + +Hugging Face chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. The `messages` should be a list of dictionaries, each with a `role` and `content` key. The `role` typically has one of the following values: `system`, `user`, and `assistant`. For example: ```json { @@ -23,9 +25,9 @@ HuggingFace chat datasets are expected to have the following structure: Each exa } ``` -### Chat Templates +## Chat Templates -Formatting the data in this way allows us to take advantage of HuggingFace tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. +Formatting the data in this way allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. 
Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#applychattemplate) for details. By default, `apply_chat_template` attempts to apply the `chat_template` associated with the tokenizer. However, in some cases, users might want to specify their own chat template. Also, note that many tokenizers do not have associated `chat_template`s, in which case an explicit chat template is required. Users can specify an explicit chat template string using Jinja format and can pass that string to `apply_chat_template`. The following is an example using a simple template which prepends a role header to each turn: @@ -58,4 +60,4 @@ assert output == expected_output :hide: ``` -For more details on creating chat templates, refer to the [HuggingFace documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file +For more details on creating chat templates, refer to the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template). \ No newline at end of file diff --git a/docs/design-docs/checkpointing.md b/docs/design-docs/checkpointing.md index 101f57a059..f8f11b916f 100644 --- a/docs/design-docs/checkpointing.md +++ b/docs/design-docs/checkpointing.md @@ -1,10 +1,10 @@ -# Checkpointing with HuggingFace Models +# Checkpointing with Hugging Face Models -## Checkpoint Format -NeMo-RL provides two checkpoint formats for HuggingFace models: Torch distributed and HuggingFace format. Torch distributed is used by default for efficiency, and HuggingFace format is provided for compatibility with HuggingFace's `AutoModel.from_pretrained` API. Note that HuggingFace format checkpoints save only the model weights, ignoring the optimizer states. 
It is recommended to use Torch distributed format to save intermediate checkpoints and to save a HuggingFace checkpoint only at the end of training. +NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format. Torch distributed is used by default for efficiency, and Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights, ignoring the optimizer states. It is recommended to use Torch distributed format to save intermediate checkpoints and to save a Hugging Face checkpoint only at the end of training. -A checkpoint converter is provided to convert a Torch distributed checkpoint checkpoint to HuggingFace format after training: +A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training: + +```sh +uv run examples/convert_dcp_to_hf.py --config=<config_path> --dcp-ckpt-path=<dcp_checkpoint_path> --hf-ckpt-path=<hf_checkpoint_path> +``` - ```python - uv run examples/convert_dcp_to_hf.py --config= --dcp-ckpt-path= --hf-ckpt-path= - ``` \ No newline at end of file diff --git a/docs/design-docs/design-and-philosophy.md b/docs/design-docs/design-and-philosophy.md index 00d6284b3b..eec3b399a7 100644 --- a/docs/design-docs/design-and-philosophy.md +++ b/docs/design-docs/design-and-philosophy.md @@ -1,54 +1,54 @@ # Design and Philosophy -In this section, we will describe the problems this library aims to solve and motivate/dicuss the NeMo-RL APIs. + +This section introduces the NeMo RL APIs and addresses the challenges of online Reinforcement Learning (RL). Coordinating various software components, known as RL Actors, requires effective resource allocation, isolation, coordination, and communication. Our design philosophy focuses on creating modular abstractions for these tasks, ensuring scalability from one GPU to thousands, regardless of the RL Actor's implementation. 
## Motivation -Online RL requires coordinating a lot of different pieces of software/models + +Online RL demands the coordination of a wide range of software components and models, for example: - Policy Model/Training Framework -- Fast inference Framework (vLLM, SGLANG, TRT-LLM) +- Fast Inference Framework (vLLM, SGLang, TRT-LLM) - Reward Environments, Critics, etc. We refer to each of these pieces of software as an **RL Actor**. -Fundamentally, we need to be able to do 4 things between these RL Actors: -- Resource them (provide GPUs/CPUs) -- Isolate them - - RL Actors may each set global variables or have conflicting dependencies, so they each need to live in an isolated process environment with configurable dependencies -- Coordinate them (control) -- Communicate between them (data) +Fundamentally, managing these RL Actors requires four key capabilities: +- Resource them (provide GPUs/CPUs). +- Isolate them: RL Actors need isolated process environments with configurable dependencies to avoid global variable or dependency conflicts. +- Coordinate them (control). +- Communicate between them (data). 
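To make the isolation requirement concrete, here is a toy analogue using plain OS subprocesses rather than NeMo RL's actual Ray machinery (the `RL_BACKEND` variable name is made up for illustration): state set by one actor's process never leaks into another actor or the controller.

```python
import os
import subprocess
import sys

def run_isolated(backend: str) -> str:
    """Run an 'actor' in its own process; it sets process-global state
    (an environment variable here) and reports what it sees."""
    child = (
        "import os; os.environ['RL_BACKEND'] = %r; "
        "print(os.environ['RL_BACKEND'])" % backend
    )
    out = subprocess.run(
        [sys.executable, "-c", child], capture_output=True, text=True
    )
    return out.stdout.strip()

# Each actor sees only its own state; the controller process is untouched.
print(run_isolated("vllm"))        # the generation actor's view
print(run_isolated("megatron"))    # the training actor's view
print("RL_BACKEND" in os.environ)  # the controller never saw either value
```

NeMo RL achieves the same property (plus configurable per-actor dependencies) with Ray worker processes rather than raw subprocesses.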
## Design We create composable and hackable abstractions for each layer of the tasks above -- Resourcing -> {py:class}`RayVirtualCluster ` -- Isolation -> {py:class}`RayWorkerGroup ` -- Coordination -> A Single-Process Controller using Ray -- Communication -> Data flows through one of the following: +- Resourcing: {py:class}`RayVirtualCluster ` +- Isolation: {py:class}`RayWorkerGroup ` +- Coordination: A Single-Process Controller using Ray +- Communication: Data flows through one of the following: - the single controller - a communication scheme set-up by the controller such as - NCCL Collectives - Multiprocess Queues -By creating a common interface for these 4 tasks, **RL algorithm code looks the same from 1 GPU to 1000 GPUs and does not care about the implementation of each RL Actor (Megatron, HF, Grad student with pen and paper)** +By creating a common interface for these four tasks, the RL algorithm code can scale seamlessly from 1 to 1000 GPUs and remain independent of the specific RL Actor (such as Megatron, Hugging Face, or abstract components like a grad student with pen and paper). ![actor-wg-worker-vc](../assets/actor-wg-worker-vc.png) ### {py:class}`RayVirtualCluster ` VirtualCluster provides a basic abstraction on top of Ray Placement Groups that allow you to section off a part of your compute resources for WorkerGroups to run on as though they had their own cluster. They support running just one WorkerGroup on each VirtualCluster, or *colocation*, where multiple WorkerGroups share resources (i.e running policy training(hf) and generation(vllm) on the same GPUs in-turn). -Minimally, it has has the following core API: ```python class RayVirtualCluster: """ Creates a virtual distributed cluster using Ray placement groups. 
This class simplifies distributed training setup by: - - Creating placement groups that represent logical compute nodes - - Allocating GPU and CPU resources for distributed workers - - Managing communication between distributed processes + - Creating placement groups that represent logical compute nodes. + - Allocating GPU and CPU resources for distributed workers. + - Managing communication between distributed processes. - - Bundle: A resource allocation unit (ex: 4 GPUs on a single node) - - Worker: A process that performs computation (model training/inference) - - Node: A physical or virtual machine containing multiple bundles + - Bundle: A resource allocation unit (ex: 4 GPUs on a single node). + - Worker: A process that performs computation (model training/inference). + - Node: A physical or virtual machine containing multiple bundles. """ def __init__(self, bundle_ct_per_node_list: List[int], {other args}): """ @@ -64,12 +64,12 @@ class RayVirtualCluster: This represents the "virtual cluster" - only nodes that are actually being used. Returns: - List of placement groups that have at least one bundle + List of placement groups that have at least one bundle. """ ``` ### {py:class}`RayWorkerGroup ` -All work is done by "Worker Processes"(Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by *RayWorkerGroup* +All work is done by "Worker Processes" (Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by the *RayWorkerGroup*. ```python class RayWorkerGroup: """ @@ -77,18 +77,20 @@ class RayWorkerGroup: This class creates and manages Ray actor instances that run on resources allocated by a RayVirtualCluster. It handles: - - Worker creation and placement on specific GPU resources - - Setting up distributed training environment variables (rank, world size, etc.) 
- - Executing methods across all workers in parallel - - Collecting and aggregating results - - Support for tied worker groups where multiple workers process the same data + - Worker creation and placement on specific GPU resources. + - Setting up distributed training environment variables (rank, world size, etc.). + - Executing methods across all workers in parallel. + - Collecting and aggregating results. + - Support for tied worker groups where multiple workers process the same data. """ ``` `RayWorkerGroup` provides functions like `run_all_workers_single_data` and `run_all_workers_multiple_data` to control and communicate to individual worker processes. -### Single-Controller & Execution Diagram -We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of GRPO as though we were working on 1 GPU +### Single-Controller and Execution Diagram + +We control the RL Actors using a single-process head controller. Using the aforementioned abstractions, this allows us to represent the main loop of Group Relative Policy Optimization (GRPO) as though we were working on 1 GPU. 
+ ```python # data processing/transformations between each step omitted def grpo_train( @@ -106,7 +108,7 @@ def grpo_train( logprobs = policy.get_logprobs(generations) reference_logprobs = policy.get_reference_logprobs(generations) - training_data = calculate_grpo_trainnig_data(generations, logprobs, reference_logprobs, rewards) + training_data = calculate_grpo_training_data(generations, logprobs, reference_logprobs, rewards) policy.train(generations, logprobs, reference_logprobs, GRPOLossFn) ``` -For a real implementation of grpo (with valiation, checkpointing, memory movement, and the omitted data processing steps), see [grpo_train](../../nemo_rl/algorithms/grpo.py) +For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](../../nemo_rl/algorithms/grpo.py). diff --git a/docs/design-docs/generation.md b/docs/design-docs/generation.md index 72c2554d92..275625f371 100644 --- a/docs/design-docs/generation.md +++ b/docs/design-docs/generation.md @@ -1,6 +1,6 @@ -# Generation Module +# Generation Interface -This doc explains the token generation interface and various backends for the NeMo-RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, HuggingFace, SGLang, TRT-LLM) to provide token generation capabilities while adhering to the same API. +This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API. 
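As a rough illustration of the unified-interface idea, the sketch below hides two interchangeable backends behind one token-in/token-out method. The class names, signatures, and the `EchoBackend` here are invented for this sketch and are not NeMo RL's actual API.

```python
from abc import ABC, abstractmethod
from typing import List


class GenerationBackend(ABC):
    """One API that any backend (vLLM, Hugging Face, ...) could sit behind."""

    @abstractmethod
    def generate(self, input_ids: List[int], max_new_tokens: int) -> List[int]:
        """Consume token IDs and return token IDs."""


class EchoBackend(GenerationBackend):
    """Stand-in backend: 'generates' by repeating the last input token."""

    def generate(self, input_ids: List[int], max_new_tokens: int) -> List[int]:
        return input_ids + [input_ids[-1]] * max_new_tokens


# Caller code is identical no matter which backend is plugged in.
backend: GenerationBackend = EchoBackend()
print(backend.generate([101, 7, 9], max_new_tokens=2))  # -> [101, 7, 9, 9, 9]
```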
## Generation Interface @@ -58,7 +58,7 @@ The core of the generation system is defined in `interfaces.py`, which establish pass ``` -A key thing to note about generation backends is that the generation backend takes in tokens and gives out tokens without dealing with the tokenizer. By ensuring that only tokens are communicated we eliminate the possibility of having different tokenizers (different versions/specs etc) for training and generation framework. +A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks. ## VLLM Backend @@ -66,29 +66,29 @@ The VLLM backend (`models/generation/vllm.py`) implements the {py:class}`Generat ### VllmGeneration Class -The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It: +The {py:class}`VllmGeneration ` class is the main implementation of the {py:class}`GenerationInterface ` for VLLM. It performs the following functions: -1. Sets up VLLM workers in a distributed environment using Ray -2. Manages the lifecycle of these workers (initialization, generation, shutdown) -3. Distributes inputs to workers and collects outputs -4. Handles weight updates and synchronization +1. Sets up VLLM workers in a distributed environment using Ray. +2. Manages the lifecycle of these workers (initialization, generation, shutdown). +3. Distributes inputs to workers and collects outputs. +4. Handles weight updates and synchronization. ### VllmGenerationWorker The {py:class}`VllmGenerationWorker ` is a Ray actor that: -1. Initializes and manages a VLLM model instance -2. Performs the actual generation on a GPU -3. Supports dynamic weight updates through IPC handles -4. Implements sleep/wake mechanisms for efficient resource utilization +1. 
Initializes and manages a VLLM model instance. +2. Performs the actual generation on a GPU. +3. Supports dynamic weight updates through IPC handles. +4. Implements sleep/wake mechanisms for efficient resource utilization. ### Custom VLLM Extensions The {py:class}`UpdatableVllmInternalWorker ` class in `vllm_backend.py` extends the VLLM worker with additional capabilities: -1. Reporting device IDs to allow mapping of workers to specific GPUs -2. Updating weights from IPC handles for efficient weight sharing -3. Checking if weights have been updated correctly +1. Reporting device IDs to allow mapping of workers to specific GPUs. +2. Updating weights from IPC handles for efficient weight sharing. +3. Checking if weights have been updated correctly. ## Usage Example @@ -133,13 +133,13 @@ output = generator.generate(input_data, greedy=False) generator.finish_generation() ``` -## Extending with New Backends +## Extend with New Backends To add a new generation backend: -1. Create a new class that implements {py:class}`GenerationInterface ` -2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation ` -3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures -4. Register your backend with the system (if needed) to make it accessible +1. Create a new class that implements {py:class}`GenerationInterface `. +2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation `. +3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures. +4. Register your backend with the system (if needed) to make it accessible. This modular design allows for easy extension with new backends while maintaining a consistent interface for the rest of the system. 
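The steps above can be sketched with a toy backend. The method names (`generate`, `prepare_for_generation`, `finish_generation`) come from the interface described in this document, but the signatures, behavior, and registry below are simplified assumptions rather than the real NeMo RL API; step 3 (conforming to `GenerationConfig` and `GenerationDatumSpec`) is omitted for brevity.

```python
class GreedyStubBackend:
    """Step 1: a new class implementing the generation interface."""

    def __init__(self):
        self.ready = False

    # Step 2: the required lifecycle and generation methods.
    def prepare_for_generation(self):
        self.ready = True  # e.g. wake workers / load weights

    def generate(self, input_ids, greedy=True):
        assert self.ready, "call prepare_for_generation() first"
        # Stand-in for real decoding: append a fixed "EOS" token id (0).
        return {"output_ids": input_ids + [0]}

    def finish_generation(self):
        self.ready = False  # e.g. sleep workers / free memory


# Step 4: a toy registry making the backend accessible by name.
BACKENDS = {"greedy_stub": GreedyStubBackend}

gen = BACKENDS["greedy_stub"]()
gen.prepare_for_generation()
result = gen.generate([5, 6, 7])
gen.finish_generation()
print(result["output_ids"])  # -> [5, 6, 7, 0]
```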
diff --git a/docs/design-docs/logger.md b/docs/design-docs/logger.md
index 8578fe621e..d15ad5c1ba 100644
--- a/docs/design-docs/logger.md
+++ b/docs/design-docs/logger.md
@@ -1,8 +1,10 @@
 # Logger
 
-## Requirements:
+The logger is designed to track key training metrics (including distributed metrics with reductions and timing) and to provide integration with logging backends like WandB and TensorBoard.
 
-* Tracking distributed metrics with specified reductions (mean, max, etc)
+## Requirements
+
+* Tracking distributed metrics with specified reductions (mean, max, etc.)
 * Tracking distributed timing with (usually) 'max' reduction across ranks
 * Logging:
   * WandB
@@ -29,7 +31,7 @@ class LoggerInterface(ABC):
         pass
 ```
 
-A {py:class}`Logger ` wrapper class will also implement {py:class}`LoggerInterface ` and will contain a list of loggers it delegates to when writing logs. This will be the main class the user uses in the training loop. Usage example:
+A {py:class}`Logger ` wrapper class will also implement {py:class}`LoggerInterface ` and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:
 
 ```python
 # Initialize logger with both wandb and tensorboard enabled
@@ -57,7 +59,7 @@ logger.log_metrics({
 
 ## Validation Pretty Logging
 
-The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter:
+The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter.
 
 ```python
 logger:
@@ -68,9 +70,9 @@ logger:
 
 When `num_val_samples_to_print` is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for:
 
-1. 
Quickly inspecting model generation quality during training
-2. Comparing inputs and outputs side-by-side
-3. Tracking validation sample performance over time
+1. Quickly inspecting model generation quality during training.
+2. Comparing inputs and outputs side-by-side.
+3. Tracking validation sample performance over time.
 
 ### Example Output
 
@@ -80,11 +82,11 @@ When enabled, the pretty logging will generate formatted text similar to:
 
 ## GPU Metric Logging
 
-NeMo-RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo-RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or Weights & Biases.
+NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or WandB.
 
-This approach allows us to offer the same GPU metric tracking on all loggers (not just wandb) and simplifies the implementation greatly.
+This approach allows us to offer the same GPU metric tracking on all loggers (not just WandB) and simplifies the implementation greatly.
 
-This feature is enabled with the `monitor_gpus` configuration parameter and the frequency of collection and flushing to the loggers is controlled by `gpu_collection_interval` and `gpu_flush_interval` (both in seconds), respectively:
+This feature is enabled with the `monitor_gpus` configuration parameter. The frequency of data collection and flushing to the loggers is controlled by the `gpu_collection_interval` and `gpu_flush_interval` parameters, both specified in seconds. 
```python
logger:
@@ -97,12 +99,12 @@ logger:
 ```
 
 :::{note}
-While monitoring through the remote workers is possible, it requires some delicate implementation details to make sure:
-* sending logs back to driver does not incur a large overhead
-* metrics are easily interpretable since we may be double counting due to colocated workers
-* workers gracefully flush their logs in the event of failure
-* the logging is the same for tensorboard and wandb
-* some workers which spawn other workers correctly report the total usage of the grandchild worker
-
-These reasons lead us to the simple implementation of collecting on the driver
-:::
+While it is feasible to monitor using remote workers, the implementation requires careful attention to details to ensure:
+* Logs sent back to the driver do not introduce significant overhead.
+* Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
+* Workers can gracefully flush their logs in case of failure.
+* Logging behaves consistently across TensorBoard and WandB.
+* Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
+
+Due to these complexities, we opted for a simpler approach: collecting the metrics exposed by the Ray metrics server directly from the driver.
+:::
\ No newline at end of file
diff --git a/docs/design-docs/padding.md b/docs/design-docs/padding.md
index 219e91573f..da5a6def74 100644
--- a/docs/design-docs/padding.md
+++ b/docs/design-docs/padding.md
@@ -1,7 +1,5 @@
 # Padding in NeMo RL
 
-## Overview
-
 This document explains padding in NeMo RL and why consistent padding is critical for the framework.
 
 ## Padding Approach
@@ -15,9 +13,9 @@ NeMo RL uses **right padding** for all tensor operations, where padding tokens a
 ```
 
 This approach:
-1. **Naturally aligns with LLM processing**: Tokens are processed from left to right
-2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors
-3. 
**Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value +1. **Naturally aligns with LLM processing**: Tokens are processed from left to right. +2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors. +3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value. ## Right-Padded Generation Example @@ -35,9 +33,9 @@ Corresponding logprobs: |-- zeros for input --| |- gen logprobs -| |pad| ``` -## Verifying Right Padding +## Verify Right Padding -NeMo RL provides utilities to verify correct padding: +NeMo RL provides utilities to verify correct padding. For example: ```{testcode} import torch @@ -79,20 +77,20 @@ if not is_right_padded: ``` The {py:class}`verify_right_padding() ` function checks that: -1. All padding (zeros or padding token provided by the user) appears after valid tokens -2. The padding starts at the position specified by the length tensor +1. All padding (zeros or padding token provided by the user) appears after valid tokens. +2. The padding starts at the position specified by the length tensor. The function automatically detects whether you're passing input or output data: -- For input data: Requires `input_ids` and `input_lengths` fields -- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths` +- For input data: Requires `input_ids` and `input_lengths` fields. +- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths`. ## Best Practices -1. **Always Use Right Padding**: All components expect this format +1. **Always Use Right Padding**: All components expect this format. -2. **Track Length Tensors**: Include appropriate length tensors with your data +2. **Track Length Tensors**: Include appropriate length tensors with your data. -3. **Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt +3. 
**Verify Padding**: Use {py:class}`verify_right_padding() ` when in doubt. -4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations +4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations. diff --git a/docs/design-docs/uv.md b/docs/design-docs/uv.md index 12d8368501..f8f98b1482 100644 --- a/docs/design-docs/uv.md +++ b/docs/design-docs/uv.md @@ -1,42 +1,46 @@ -# uv in NeMo-RL +# uv in NeMo RL -Using `uv` for Dependency Management in NeMo-RL +We use the `uv` Python package installer for managing dependencies in NeMo RL. ## Overview -`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document outlines why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. +`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document explains why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. ## Why `uv`? +`uv` brings the following key advantages to our Python development workflow: + ### Speed and Efficiency -- Written in Rust, making it significantly faster than traditional Python package managers -- Optimized caching mechanisms that reduce redundant downloads and installations -- Quick environment creation and switching, enabling rapid development cycles +- Written in Rust, making it significantly faster than traditional Python package managers. +- Optimized caching mechanisms that reduce redundant downloads and installations. +- Quick environment creation and switching, enabling rapid development cycles. 
### Isolated Environments -- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages -- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies -- Ensures consistent behavior across different machines and deployment environments +- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages. +- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies. +- Ensures consistent behavior across different machines and deployment environments. ### Dependency Management in Ray Clusters -- Enables management of heterogeneous Python environments across a Ray cluster -- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires -- Simplifies propagation of environments to worker nodes without manual setup on each node +- Enables management of heterogeneous Python environments across a Ray cluster. +- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires. +- Simplifies propagation of environments to worker nodes without manual setup on each node. ### Container-Free Flexibility -- Frees us from having to publish many containers for different dependency combinations -- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically -- Reduces infrastructure complexity and maintenance overhead +- Frees us from having to publish many containers for different dependency combinations. 
+- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically. +- Reduces infrastructure complexity and maintenance overhead. ## Implementation in NeMo RL +This section outlines how workers define their required executables, details the available predefined configurations (like BASE or VLLM), and explains how to customize these setups for specific needs, ensuring consistency across actors. + ### Worker Configuration -In our codebase, workers (classes decorated with `@ray.remote`, e.g., `HFPolicyWorker`) define a `DEFAULT_PY_EXECUTABLE` which specifies what dependencies the worker needs. This allows different parts of our application to have their own tailored environments. +In our codebase, workers (classes decorated with `@ray.remote`, e.g., `HFPolicyWorker`) define a `DEFAULT_PY_EXECUTABLE` that specifies what dependencies the worker needs. This allows different parts of our application to have their own tailored environments. ### Supported Python Executables @@ -46,10 +50,10 @@ We provide several predefined Python executable configurations in {py:class}`PY_ class PY_EXECUTABLES: SYSTEM = sys.executable - # Use NeMo-RL direct dependencies. + # Use NeMo RL direct dependencies. BASE = "uv run --locked" - # Use NeMo-RL direct dependencies and vllm. + # Use NeMo RL direct dependencies and vllm. VLLM = "uv run --locked --extra vllm" ``` @@ -61,14 +65,14 @@ If you need a different Python executable configuration, you can override the de ## How It Works -When a NeMo-RL job is started: +When a NeMo RL job is started: 1. The driver script creates several {py:class}`RayWorkerGroup `s. -2. Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder ` +2. 
Each worker group will create their workers which are wrapped in a {py:class}`RayWorkerBuilder `. 3. Before the worker class is instantiated by the `RayWorkerBuilder`, if (1) `DEFAULT_PY_EXECUTABLE` is defined on the worker class (decorated with `@ray.remote`) and (2) it starts with `uv`; a `venv` is created with all the dependencies it needs and the `runtime_env["py_executable"]` is replaced with the `venv`'s python interpreter. This approach allows a fast start-up and maintains dependency isolation. It also has the added benefit of having all the virtual environments local under `./venvs`. ## Conclusion -Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems while enabling heterogeneous environments that can be tailored to specific workloads. +Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems, while enabling heterogeneous environments that can be tailored to specific workloads. diff --git a/docs/docker.md b/docs/docker.md index fd42a5b404..96558f5e31 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -1,4 +1,6 @@ -# Building Docker Images +# Build Docker Images + +This guide provides two methods for building Docker images: the base image, ideal for specifying Python dependencies at runtime, and the hermetic image, which includes default dependencies for offline use. ## Base Image @@ -9,18 +11,18 @@ cd docker/ docker buildx build --target base -t nemo_rl -f Dockerfile .. ``` -This is **our recommendation** as it is a small image and allows you to specify your python dependencies at runtime. 
+This is **our recommendation** as it is a small image and allows you to specify your Python dependencies at runtime. ## Hermetic Image -The docker image build without a target stage will include all of the default dependencies to get started. +The Docker image build without a target stage will include all of the default dependencies to get started. ```sh cd docker/ docker buildx build -t nemo_rl -f Dockerfile .. ``` -This image sets up the python environment for you, so you do not have to use `uv` if you don't need +This image sets up the Python environment for you, so you do not have to use `uv` if you don't need any other packages. This image is useful in situations where you may not have network connectivity to re-download packages. diff --git a/docs/documentation.md b/docs/documentation.md index df285cca68..07d4e6b432 100644 --- a/docs/documentation.md +++ b/docs/documentation.md @@ -1,15 +1,15 @@ # Documentation Development - [Documentation Development](#documentation-development) - - [Building](#building) + - [Build the Documentation](#build-the-documentation) - [Live Building](#live-building) - - [Running Tests in Python Docstrings](#running-tests-in-python-docstrings) - - [Writing Tests in Python Docstrings](#writing-tests-in-python-docstrings) + - [Run Tests in Python Docstrings](#run-tests-in-python-docstrings) + - [Write Tests in Python Docstrings](#write-tests-in-python-docstrings) -## Building +## Build the Documentation -The following sections describe how to set up and build the NeMo-RL documentation. +The following sections describe how to set up and build the NeMo RL documentation. Switch to the documentation source folder and generate HTML output. @@ -23,9 +23,9 @@ uv run --group docs sphinx-build . _build/html ## Live Building -When writing documentation it can be helpful to serve the documentation and have it update live while you edit. 
+When writing documentation, it can be helpful to serve the documentation and have it update live while you edit. -To do so run: +To do so, run: ```sh cd docs/ @@ -35,16 +35,16 @@ uv run --group docs sphinx-autobuild . _build/html --port 12345 --host 0.0.0.0 Open a web browser and go to `http://${HOST_WHERE_SPHINX_COMMAND_RUN}:12345` to view the output. -## Running Tests in Python Docstrings +## Run Tests in Python Docstrings -We also run tests in our python docstrings. You can run them with: +We also run tests in our Python docstrings. You can run them with: ```sh cd docs/ uv run --group docs sphinx-build -b doctest . _build/doctest ``` -## Writing Tests in Python Docstrings +## Write Tests in Python Docstrings Any code in triple backtick blocks with the `{doctest}` directive will be tested. The format follows Python's doctest module syntax, where `>>>` indicates Python input and the following line shows the expected output. Here's an example: diff --git a/docs/guides/dpo.md b/docs/guides/dpo.md index 6c6ed62833..fcea9f5005 100644 --- a/docs/guides/dpo.md +++ b/docs/guides/dpo.md @@ -1,4 +1,4 @@ -# Direct Preference Optimization in NeMo-RL +# Direct Preference Optimization in NeMo RL [Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the @@ -16,7 +16,7 @@ If not specified, `config` will default to [examples/configs/dpo.yaml](../../exa ## Configuration -NeMo-RL allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml). +NeMo RL allows users to configure DPO experiments using `yaml` config files. 
An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml). To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example: @@ -32,7 +32,7 @@ uv run examples/run_dpo.py \ ## Datasets -Each class representing a NeMo-RL DPO dataset is expected to have the following attributes: +Each class representing a NeMo RL DPO dataset is expected to have the following attributes: 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset. @@ -158,7 +158,7 @@ First train example rejected response: 5 ## DPO-Specific Parameters -The DPO implementation in NeMo-RL supports several key parameters that can be adjusted: +The DPO implementation in NeMo RL supports several key parameters that can be adjusted: - `dpo.reference_policy_kl_penalty`: Controls the strength of the KL penalty term - `dpo.preference_loss_weight`: Weight for the preference loss diff --git a/docs/guides/eval.md b/docs/guides/eval.md index f547e19ff8..c175180ad0 100644 --- a/docs/guides/eval.md +++ b/docs/guides/eval.md @@ -1,10 +1,14 @@ # Evaluation +This document explains how to use an evaluation script for assessing model capabilities. + ## Start Evaluation +To run the evaluation, you can use the default configuration file or specify a custom one. + ### Start Script -**Evaluating Standard Models:** +**Evaluate Standard Models:** To run evaluation using a model directly from Hugging Face Hub or a local path already in HF format, use the `run_eval.py` script. 
@@ -19,7 +23,7 @@ uv run python examples/run_eval.py --config path/to/custom_config.yaml
 uv run python examples/run_eval.py generation.model_name="Qwen/Qwen2.5-Math-7B-Instruct"
 ```
 
-**Evaluating Models Trained with DCP Checkpoints (GRPO/SFT):**
+**Evaluate Models Trained with DCP Checkpoints (GRPO/SFT):**
 
 If you have trained a model using GRPO or SFT and saved the checkpoint in the Pytorch DCP format, you first need to convert it to the Hugging Face format before running evaluation.
 
@@ -52,11 +56,12 @@ score=0.10 (3.0/30)
 ============================================================
 ```
 
-## Configuration
+## Example Configuration File
 
-An example Evaluation configuration file can be found [here](../../examples/configs/eval.yaml).
+You can find an example evaluation configuration file [here](../../examples/configs/eval.yaml).
 
 ### Prompt Template Configuration
+
 Always remember to use the same `prompt_file` and `system_prompt_file` that were used during training. For open-source models, we recommend setting `prompt_file=null` and `system_prompt_file=null` to allow them to use their native chat templates.
 
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 82526e0e66..4c9fa93767 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -1,8 +1,10 @@
-# An in-depth walkthrough of GRPO in NeMo-RL
+# An In-Depth Walkthrough of GRPO in NeMo RL
+
+This guide details the Group Relative Policy Optimization (GRPO) implementation within NeMo RL. We'll walk through essential aspects, including data handling, policy model training, fast generation, and the specifics of the GRPO loss function and its enhancements.
 
 ## Quickstart: Launch a GRPO Run
 
-If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems. This script can either be launched locally or via Slurm. 
For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). +To get started quickly, use the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py), which demonstrates how to train a model on math problems using GRPO. You can launch this script locally or via Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](../cluster.md). We recommend launching the job using `uv`: @@ -10,13 +12,11 @@ We recommend launching the job using `uv`: uv run examples/run_grpo_math.py --config {overrides} ``` -If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml) +If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml). **Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. -## Now, for the details: - -In this guide, we'll walk through how we handle +In this guide, we'll walk through how we handle: * Data * Model training @@ -53,7 +53,8 @@ class DatumSpec(TypedDict): #### Data Processors -We name all distinct "environments your model wants to optimize against" "tasks". So you might define a "math" task or a "code" task. +We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code." 
+ For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_rl/data/interfaces.py) ```python @@ -76,7 +77,7 @@ GRPO expects datasets to have the following form: {"task_name": "math", /* actual data */} ``` -Then, you can set data up as such: +Then, you can set the data up as follows: ```python base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"] @@ -96,7 +97,7 @@ dataset = AllTaskProcessedDataset( ) ``` -Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples. +Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples. ### Policy Model @@ -151,7 +152,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro #### Importance Sampling Correction -The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function. +The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. 
As noted in [Adding New Models](../adding-new-models.md#understand-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function. Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of loss function. Then, diff --git a/docs/guides/sft.md b/docs/guides/sft.md index ff2fd196d5..0933b0f540 100644 --- a/docs/guides/sft.md +++ b/docs/guides/sft.md @@ -1,18 +1,22 @@ -# Supervised Fine-tuning in NeMo-RL +# Supervised Fine-Tuning in NeMo RL + +This document explains how to perform SFT within NeMo RL. It outlines key operations, including initiating SFT runs, managing experiment configurations using YAML, and integrating custom datasets that conform to the required structure and attributes. ## Launch an SFT Run -The script [examples/run_sft.py](../../examples/run_sft.py) can be used to launch an experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). +The script [examples/run_sft.py](../../examples/run_sft.py) can be used to launch an experiment. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md). Be sure to launch the job using `uv`.
The command to launch an SFT job is as follows: + ```bash uv run examples/run_sft.py --config ``` + If not specified, `config` will default to [examples/configs/sft.yaml](../../examples/configs/sft.yaml). -## Configuration +## Example Configuration File -NeMo-RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). +NeMo RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](../../examples/configs/sft.yaml). To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example: @@ -21,15 +25,16 @@ uv run examples/run_sft.py \ cluster.gpus_per_node=1 \ logger.wandb.name="sft-dev-1-gpu" ``` + **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. ## Datasets -SFT datasets in NeMo-RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: +SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: 1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below. 2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset. -SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_rl/data/hf_datasets/squad.py) has an example: +SFT datasets are expected to follow the Hugging Face chat format. 
Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [data/hf_datasets/squad.py](../../nemo_rl/data/hf_datasets/squad.py) has an example: ```python def format_squad(data): @@ -51,7 +56,7 @@ def format_squad(data): } ``` -NeMo-RL SFT uses HuggingFace chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your yaml config (see [sft.yaml](../../examples/configs/sft.yaml) for an example): +NeMo RL SFT uses Hugging Face chat templates to format the individual examples. Three types of chat templates are supported, which can be configured via `tokenizer.chat_template` in your yaml config (see [sft.yaml](../../examples/configs/sft.yaml) for an example): 1. Apply the tokenizer's default chat template. To use the tokenizer's default, either omit `tokenizer.chat_template` from the config altogether, or set `tokenizer.chat_template="default"`. 2. Use a "passthrough" template which simply concatenates all messages. This is desirable if the chat template has been applied to your dataset as an offline preprocessing step. In this case, you should set `tokenizer.chat_template` to None as follows: @@ -67,7 +72,7 @@ NeMo-RL SFT uses HuggingFace chat templates to format the individual examples. T ``` -By default, NeMo-RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk. +By default, NeMo RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from Hugging Face and preprocessed on the fly, so there's no need to provide a path to any datasets on disk. Adding a new dataset is a straightforward process.
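A minimal custom dataset can be sketched as follows. The class and field names here are illustrative stand-ins, not the real NeMo RL interfaces (the actual `TaskDataSpec` lives in `nemo_rl/data/interfaces.py`):

```python
from dataclasses import dataclass

@dataclass
class TaskDataSpec:
    # Stand-in for nemo_rl's TaskDataSpec; only the dataset name is modeled.
    task_name: str

class MyChatDataset:
    """A custom SFT dataset exposing the two required attributes."""

    def __init__(self, train_records: list, val_records: list):
        # Records are assumed to already be in the Hugging Face chat format:
        # {"messages": [{"role": "...", "content": "..."}, ...]}
        self.formatted_ds = {"train": train_records, "validation": val_records}
        self.task_spec = TaskDataSpec(task_name="my_chat_dataset")
```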
As long as your custom dataset has the `formatted_ds` and `task_spec` attributes described above, it can serve as a drop-in replacement for Squad and OpenAssistant. \ No newline at end of file diff --git a/docs/local-workstation.md b/docs/local-workstation.md index 482b41c5ad..860ec0428a 100644 --- a/docs/local-workstation.md +++ b/docs/local-workstation.md @@ -1,6 +1,4 @@ -# Local Workstation - -## Launching Locally +# Run on Your Local Workstation When launching examples locally with `uv`, {py:class}`init_ray() ` will first attempt to connect to an existing cluster. If none is found, it will start a local one and connect to it using all available GPU and CPU resources on your node. @@ -17,7 +15,7 @@ In the logs, you will see that Ray has started a local cluster instance, along w INFO:nemo_rl.distributed.virtual_cluster:Started local cluster with: {'node:__internal_head__': 1.0, 'CPU': 24.0, 'object_store_memory': 80448493977.0, 'accelerator_type:RTX': 1.0, 'memory': 177713152615.0, 'GPU': 1.0, 'node:10.0.0.1': 1.0} ``` -To control the GPUs ray uses locally more granularly, please use `CUDA_VISIBLE_DEVICES`: +To have more precise control over the GPUs Ray uses locally, please use `CUDA_VISIBLE_DEVICES`: ```sh # Use the 0th and 3rd indexed GPU (for a total of 2 GPUs) diff --git a/docs/testing.md b/docs/testing.md index 672bdacc82..35825ab50d 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -1,4 +1,6 @@ -# Testing NeMo-RL +# Test NeMo RL + +This guide outlines how to test NeMo RL using unit and functional tests, detailing steps for local or Docker-based execution, dependency setup, and metric tracking to ensure effective and reliable testing. ## Unit Tests @@ -12,26 +14,27 @@ uv run --group test bash tests/run_unit.sh ``` :::{note} -Tests can also be run on SLURM with `ray.sub`, but note that some tests will be skipped +Tests can also be run on Slurm with `ray.sub`, but note that some tests will be skipped due to no GPUs being located on the head node. 
To run the full suite of tests, please launch on a regular GPU allocation. ::: -### Running Unit Tests in a Hermetic Environment +### Run Unit Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... bash tests/run_unit_in_docker.sh ``` -The required `CONTAINER` can be built by following the instructions in the [docker documentation](docker.md). +The required `CONTAINER` can be built by following the instructions in the [Docker documentation](docker.md). -### Tracking metrics in unit tests +### Track Metrics in Unit Tests Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API: + ```python # Track an arbitrary metric (must be json serializable) tracker.track(metric_name, metric_value) @@ -44,6 +47,7 @@ tracker.get_max_mem() Including the `tracker` fixture also tracks the elapsed time for the test implicitly. Here is an example test: + ```python def test_exponentiate(tracker): starting_mem = tracker.get_max_mem() @@ -58,6 +62,7 @@ def test_exponentiate(tracker): ``` Which would produce this file in `tests/unit/unit_results.json`: + ```json { "exit_status": 0, @@ -94,7 +99,7 @@ jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_hf_policy_g ``` ::: -## Functional tests +## Functional Tests :::{important} Functional tests may require multiple GPUs to run. See each script to understand the requirements. @@ -119,11 +124,11 @@ whether they pass or fail. 
Here is an example: └────────┴────────────────────────────────┴───────────────────┴─────────┘ ``` -### Running Functional Tests in a Hermetic Environment +### Run Functional Tests in a Hermetic Environment For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run -in docker with this script: +in Docker with this script: ```sh CONTAINER=... bash run_functional_in_docker.sh functional/sft.sh