33 changes: 33 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,33 @@
default:
  image: ${DOC_BUILD_IMAGE}
  tags:
    - os/linux
    - type/docker

stages:
  - build
  - deploy

.sphinx-build: &sphinx-build
  - cd docs
  - uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html

build-docs:
  stage: build
  script:
    - *sphinx-build
  artifacts:
    name: ${CI_PROJECT_NAME}-${CI_COMMIT_SHORT_SHA}
    paths:
      - docs/_build

pages:
  stage: deploy
  needs: ["build-docs"]
  script:
    - echo "Publishing HTML to GitLab Pages"
    # GitLab Pages serves the `public/` directory, which must exist and be
    # uploaded as an artifact.
    - mkdir -p public
    - cp -r docs/_build/html/* public/
  artifacts:
    paths:
      - public
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == $CI_DEFAULT_BRANCH
  environment: main
7 changes: 4 additions & 3 deletions README.md
@@ -25,7 +25,7 @@ What you can expect:

## Features

_✅ Available now | 🔜 Coming in v0.2_
✅ _Available now_ | 🔜 _Coming in v0.2_

- ✅ **Fast Generation** - vLLM backend for optimized inference
- ✅ **HuggingFace Integration** - Works with 1-8B models (Qwen1.5, Llama)
@@ -51,7 +51,7 @@ uv pip install -e .[vllm]
# Install NeMo-Reinforcer with dev/test dependencies
uv pip install -e '.[dev,test]'

# Use uv run to launch any runs.
# Use uv run to launch any runs.
# Note that it is recommended to not activate the venv and instead use `uv run` since
# it ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
@@ -85,13 +85,14 @@ uv run python examples/run_sft.py \
cluster.gpus_per_node=8
```

Refer to [sft.yaml](examples/configs/sft.yaml) for a full list of parameters that can be overridden.
Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### Multi-node

For distributed training across multiple nodes:

Before running any `uv run` command, set `UV_CACHE_DIR` to a directory that all workers can read.

```sh
export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
```
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -53,7 +53,7 @@
"fieldlist", # Enables field lists for metadata like :author: Name
"tasklist", # Adds support for GitHub-style task lists with [ ] and [x]
]
myst_heading_anchors = 3 # Generates anchor links for headings up to level 3
myst_heading_anchors = 4 # Generates anchor links for headings up to level 4

# -- Options for Autodoc2 ---------------------------------------------------
sys.path.insert(0, os.path.abspath(".."))
@@ -67,7 +67,7 @@
# render google style docstrings.
# Related Issue: https://github.com/sphinx-extensions2/sphinx-autodoc2/issues/33
autodoc2_docstring_parser_regexes = [
(r".*", "autodoc2_docstrings_parser"),
(r".*", "docs.autodoc2_docstrings_parser"),
]

# -- Options for HTML output -------------------------------------------------
4 changes: 2 additions & 2 deletions docs/design_docs/generation.md
@@ -15,7 +15,7 @@ The core of the generation system is defined in `interfaces.py`, which establish
backend: str # The backend to use (e.g., "vllm", "hf")
max_new_tokens: int # Maximum number of tokens to generate
temperature: float # Sampling temperature
top_p: float # Top-p sampling parameter
top_p: float # Top-p sampling parameter
top_k: int # Top-k sampling parameter
model_name: str # Name or path of the model
```
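As a rough illustration, a config matching the fields above can be assembled as a plain dict. The values below are hypothetical examples, not project defaults:

```python
# Illustrative only: a dict carrying the GenerationConfig fields listed above.
# All values are hypothetical examples, not project defaults.
generation_config = {
    "backend": "vllm",                 # or "hf"
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "model_name": "Qwen/Qwen1.5-1.8B",
}

# Sanity checks a backend might run before generation starts.
assert generation_config["backend"] in ("vllm", "hf")
assert 0.0 < generation_config["top_p"] <= 1.0
assert generation_config["max_new_tokens"] > 0
```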
@@ -138,7 +138,7 @@ generator.finish_generation()
To add a new generation backend:

1. Create a new class that implements {py:class}`GenerationInterface <nemo_reinforcer.models.generation.interfaces.GenerationInterface>`
2. Implement the required methods: {py:method}`generate <nemo_reinforcer.models.generation.interfaces.GenerationInterface.generate>`, {py:method}`prepare_for_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.prepare_for_generation>`, and {py:method}`finish_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.finish_generation>`
2. Implement the required methods: {py:meth}`generate <nemo_reinforcer.models.generation.interfaces.GenerationInterface.generate>`, {py:meth}`prepare_for_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.prepare_for_generation>`, and {py:meth}`finish_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.finish_generation>`
3. Ensure your implementation works with the standard {py:class}`GenerationConfig <nemo_reinforcer.models.generation.interfaces.GenerationConfig>` and {py:class}`GenerationDatumSpec <nemo_reinforcer.models.generation.interfaces.GenerationDatumSpec>` structures
4. Register your backend with the system (if needed) to make it accessible

Empty file removed docs/design_docs/gpu_logger.md
Empty file.
8 changes: 6 additions & 2 deletions docs/docker.md
@@ -1,16 +1,20 @@
# Building Docker Images

### Base Image
## Base Image

If you only need the base image with ray + uv, you can build it like so:

```sh
cd docker/
docker buildx build --target base -t reinforcer -f Dockerfile ..
```

This is **our recommendation**, as it is a small image and allows you to specify your Python dependencies at runtime.

### Hermetic Image
## Hermetic Image

Building the Docker image without a target stage includes all of the default dependencies to get you started.

```sh
cd docker/
docker buildx build -t reinforcer -f Dockerfile ..
29 changes: 22 additions & 7 deletions docs/guides/grpo.md
@@ -5,33 +5,40 @@
If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).

We recommend launching the job using `uv`:

```bash
uv run examples/run_grpo_math.py --config <PATH TO YAML CONFIG> {overrides}
```
If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo.yaml)

If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](../../examples/configs/grpo_math_1B.yaml).

**Reminder**: Don't forget to set your `HF_HOME` and `WANDB_API_KEY` (if needed). You'll also need to run `huggingface-cli login` for Llama models.

## Now, for the details:

In this guide, we'll walk through how we handle:

* Data
* Model training
* Fast generation
* Overall Resource Flow

### Data

We support training with multiple RL "Environments" at the same time.

An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. Environments run as Ray remote actors; see [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py) for an example.
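As a rough, stdlib-only sketch of the idea (the real interface in `interfaces.py` differs and runs as a Ray actor; the class and field names here are illustrative):

```python
# Hypothetical toy environment: scores the last assistant message against an
# expected answer carried in per-sample metadata.
class ToyMathEnvironment:
    def step(self, message_log, metadata):
        expected = metadata["expected_answer"]
        response = message_log[-1]["content"]
        reward = 1.0 if expected in response else 0.0
        # Return the (here unchanged) state plus a scalar reward for the step.
        return message_log, reward

env = ToyMathEnvironment()
log = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4."},
]
_, reward = env.step(log, {"expected_answer": "4"})
# reward == 1.0
```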

To support this, we need to know:

* What environments you have
* Which data should go to which environments
* How to prepare the data from your dataset into a form we can use

#### Common Data Format

We define a [DatumSpec](../../nemo_reinforcer/data/interfaces.py) that holds all relevant information for each training example:

```python
class DatumSpec(TypedDict):
    message_log: LLMMessageLogType
@@ -44,7 +44,7 @@ class DatumSpec(TypedDict):
```

#### Data Processors
We name all distinct "environments your model wants to optimize against" "tasks". So you might define a "math" task or a "code" task.

We call each distinct "environment your model wants to optimize against" a "task". So you might define a "math" task or a "code" task.
For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_reinforcer/data/interfaces.py).

```python
@@ -56,14 +56,19 @@ def my_data_processor(
idx: int,
) -> DatumSpec:
```

We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py)
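A minimal sketch of such a processor, following the signature above. The dataset field names (`problem`, `answer`) and the toy whitespace "tokenizer" are assumptions for illustration, not the project's actual API:

```python
# Hypothetical data processor sketch. It builds a DatumSpec-like dict using
# the keys from the snippet earlier; any extra field names are illustrative.
def my_data_processor(datum_dict, task_data_spec, tokenizer, max_seq_length, idx):
    prompt = datum_dict["problem"]              # dataset-specific field (assumed)
    token_ids = tokenizer(prompt)[:max_seq_length]
    return {
        "message_log": [
            {"role": "user", "content": prompt, "token_ids": token_ids}
        ],
        "length": len(token_ids),
        "extra_env_info": {"expected_answer": datum_dict.get("answer")},
        "idx": idx,
    }

# A toy whitespace "tokenizer" stands in for a real HF tokenizer here.
spec = my_data_processor(
    {"problem": "1 + 1 = ?", "answer": "2"},
    task_data_spec=None, tokenizer=str.split, max_seq_length=32, idx=0,
)
```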

#### Putting it all together:
#### Putting it all together

GRPO expects datasets to have the following form:

```json
{"task_name": "math", <actual data>}
{"task_name": "math", /* actual data */}
```

Then, you can set data up as such:

```python
base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"]
tokenizer = AutoTokenizer.from_pretrained(policy_config["model_name"])
@@ -81,15 +81,17 @@ dataset = AllTaskProcessedDataset(
    max_seq_length=data_config["max_input_seq_length"],
)
```
Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples.

Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples.
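In miniature, that dispatch works like the sketch below (processor names and return values are hypothetical, not the project's actual spec objects):

```python
# Hypothetical task-to-processor mapping and dispatch.
def math_processor(datum_dict, idx):
    return {"task_name": "math", "idx": idx}

def code_processor(datum_dict, idx):
    return {"task_name": "code", "idx": idx}

task_data_processors = {
    "math": math_processor,
    "code": code_processor,
}

# Each sample is routed to its task's processor by the "task_name" field.
sample = {"task_name": "math", "problem": "2+2"}
spec = task_data_processors[sample["task_name"]](sample, idx=0)
```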

### Policy Model
We define a [PolicyInterface]() that contains everything you need to train a Policy model.

We define a {py:class}`PolicyInterface <nemo_reinforcer.models.interfaces>` that contains everything you need to train a Policy model.

This Policy object holds a [RayWorkerGroup](../../nemo_reinforcer/distributed/worker_groups.py) of SPMD (1 proc/gpu) processes that run HF/MCore, all coordinated by this object so it appears to you like 1 GPU!

### Fast Generation

We support vLLM through the [VllmGeneration](../../nemo_reinforcer/models/generation/vllm.py) class right now.

The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
1 change: 1 addition & 0 deletions nemo_reinforcer/utils/logger.py
@@ -589,6 +589,7 @@ def flatten_dict(d: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:

Examples:
```{doctest}
>>> from nemo_reinforcer.utils.logger import flatten_dict
>>> flatten_dict({"a": 1, "b": {"c": 2}})
{'a': 1, 'b.c': 2}

Expand Down