33 changes: 33 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,33 @@
default:
  image: ${DOC_BUILD_IMAGE}
  tags:
    - os/linux
    - type/docker

stages:
  - build
  - deploy

.sphinx-build: &sphinx-build
  - cd docs
  - uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html

build-docs:
  stage: build
  script:
    - *sphinx-build
  artifacts:
    name: ${CI_PROJECT_NAME}-${CI_COMMIT_SHORT_SHA}
    paths:
      - docs/_build

pages:
  stage: deploy
  needs: ["build-docs"]
  script:
    - echo "Publishing HTML to GitLab Pages"
    # GitLab Pages serves the `public/` directory, which must exist and be
    # uploaded as an artifact.
    - mkdir -p public
    - cp -r docs/_build/html/* public/
  artifacts:
    paths:
      - public
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == $CI_DEFAULT_BRANCH
  environment: main
7 changes: 4 additions & 3 deletions README.md
@@ -25,7 +25,7 @@ What you can expect:

## Features

_✅ Available now | 🔜 Coming in v0.2_
✅ _Available now_ | 🔜 _Coming in v0.2_

- ✅ **Fast Generation** - vLLM backend for optimized inference
- ✅ **HuggingFace Integration** - Works with 1-8B models (Qwen1.5, Llama)
@@ -51,7 +51,7 @@ uv pip install -e .[vllm]
# Install NeMo-Reinforcer with dev/test dependencies
uv pip install -e '.[dev,test]'

# Use uv run to launch any runs.
# Use uv run to launch any runs.
# Note that it is recommended to not activate the venv and instead use `uv run` since
# it ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
@@ -85,13 +85,14 @@ uv run python examples/run_sft.py \
cluster.gpus_per_node=8
```

Refer to [sft.yaml](examples/configs/sft.yaml) for a full list of parameters that can be overridden.
Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### Multi-node

For distributed training across multiple nodes:

Before running any `uv run` command, set `UV_CACHE_DIR` to a directory that all workers can read.

```sh
export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
```
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -53,7 +53,7 @@
"fieldlist", # Enables field lists for metadata like :author: Name
"tasklist", # Adds support for GitHub-style task lists with [ ] and [x]
]
myst_heading_anchors = 3 # Generates anchor links for headings up to level 3
myst_heading_anchors = 4 # Generates anchor links for headings up to level 4

# -- Options for Autodoc2 ---------------------------------------------------
sys.path.insert(0, os.path.abspath(".."))
@@ -67,7 +67,7 @@
# render google style docstrings.
# Related Issue: https://github.com/sphinx-extensions2/sphinx-autodoc2/issues/33
autodoc2_docstring_parser_regexes = [
(r".*", "autodoc2_docstrings_parser"),
(r".*", "docs.autodoc2_docstrings_parser"),
]

# -- Options for HTML output -------------------------------------------------
4 changes: 2 additions & 2 deletions docs/design_docs/generation.md
@@ -15,7 +15,7 @@ The core of the generation system is defined in `interfaces.py`, which establish
backend: str # The backend to use (e.g., "vllm", "hf")
max_new_tokens: int # Maximum number of tokens to generate
temperature: float # Sampling temperature
top_p: float # Top-p sampling parameter
top_p: float # Top-p sampling parameter
top_k: int # Top-k sampling parameter
model_name: str # Name or path of the model
```
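As a rough illustration, a config matching the fields above can be assembled as a plain dict. The values below are hypothetical examples, not project defaults:

```python
# Illustrative only: a dict carrying the GenerationConfig fields listed above.
# All values are hypothetical examples, not project defaults.
generation_config = {
    "backend": "vllm",                 # or "hf"
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "model_name": "Qwen/Qwen1.5-1.8B",
}

# Sanity checks a backend might run before generation starts.
assert generation_config["backend"] in ("vllm", "hf")
assert 0.0 < generation_config["top_p"] <= 1.0
assert generation_config["max_new_tokens"] > 0
```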
@@ -138,7 +138,7 @@ generator.finish_generation()
To add a new generation backend:

1. Create a new class that implements {py:class}`GenerationInterface <nemo_reinforcer.models.generation.interfaces.GenerationInterface>`
2. Implement the required methods: {py:method}`generate <nemo_reinforcer.models.generation.interfaces.GenerationInterface.generate>`, {py:method}`prepare_for_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.prepare_for_generation>`, and {py:method}`finish_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.finish_generation>`
2. Implement the required methods: {py:meth}`generate <nemo_reinforcer.models.generation.interfaces.GenerationInterface.generate>`, {py:meth}`prepare_for_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.prepare_for_generation>`, and {py:meth}`finish_generation <nemo_reinforcer.models.generation.interfaces.GenerationInterface.finish_generation>`
3. Ensure your implementation works with the standard {py:class}`GenerationConfig <nemo_reinforcer.models.generation.interfaces.GenerationConfig>` and {py:class}`GenerationDatumSpec <nemo_reinforcer.models.generation.interfaces.GenerationDatumSpec>` structures
4. Register your backend with the system (if needed) to make it accessible

Empty file removed docs/design_docs/gpu_logger.md
Empty file.
8 changes: 6 additions & 2 deletions docs/docker.md
@@ -1,16 +1,20 @@
# Building Docker Images

### Base Image
## Base Image

If you only need the base image with ray + uv, you can build it like so:

```sh
cd docker/
docker buildx build --target base -t reinforcer -f Dockerfile ..
```

This is **our recommendation**, as it is a small image and allows you to specify your Python dependencies at runtime.

### Hermetic Image
## Hermetic Image

Building the Docker image without a target stage includes all of the default dependencies to get you started.

```sh
cd docker/
docker buildx build -t reinforcer -f Dockerfile ..
29 changes: 22 additions & 7 deletions docs/guides/grpo.md
@@ -5,33 +5,40 @@
If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).

We recommend launching the job using `uv`:

```bash
uv run examples/run_grpo_math.py --config <PATH TO YAML CONFIG> {overrides}
```
If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo.yaml)

If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](../../examples/configs/grpo_math_1B.yaml).

**Reminder**: Don't forget to set your `HF_HOME` and `WANDB_API_KEY` (if needed). You'll also need to run `huggingface-cli login` for Llama models.

## Now, for the details:

In this guide, we'll walk through how we handle:

* Data
* Model training
* Fast generation
* Overall Resource Flow

### Data

We support training with multiple RL "Environments" at the same time.

An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. Environments run as Ray remote actors; see [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py) for an example.
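As a rough, stdlib-only sketch of the idea (the real interface in `interfaces.py` differs and runs as a Ray actor; the class and field names here are illustrative):

```python
# Hypothetical toy environment: scores the last assistant message against an
# expected answer carried in per-sample metadata.
class ToyMathEnvironment:
    def step(self, message_log, metadata):
        expected = metadata["expected_answer"]
        response = message_log[-1]["content"]
        reward = 1.0 if expected in response else 0.0
        # Return the (here unchanged) state plus a scalar reward for the step.
        return message_log, reward

env = ToyMathEnvironment()
log = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4."},
]
_, reward = env.step(log, {"expected_answer": "4"})
# reward == 1.0
```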

To support this, we need to know:

* What environments you have
* Which data should go to which environments
* How to prepare the data from your dataset into a form we can use

#### Common Data Format

We define a [DatumSpec](../../nemo_reinforcer/data/interfaces.py) that holds all relevant information for each training example:

```python
class DatumSpec(TypedDict):
    message_log: LLMMessageLogType
@@ -44,7 +44,7 @@ class DatumSpec(TypedDict):
```

#### Data Processors
We name all distinct "environments your model wants to optimize against" "tasks". So you might define a "math" task or a "code" task.

We call each distinct "environment your model wants to optimize against" a "task". So you might define a "math" task or a "code" task.
For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_reinforcer/data/interfaces.py).

```python
@@ -56,14 +56,19 @@ def my_data_processor(
idx: int,
) -> DatumSpec:
```

We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py)
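A minimal sketch of such a processor, following the signature above. The dataset field names (`problem`, `answer`) and the toy whitespace "tokenizer" are assumptions for illustration, not the project's actual API:

```python
# Hypothetical data processor sketch. It builds a DatumSpec-like dict using
# the keys from the snippet earlier; any extra field names are illustrative.
def my_data_processor(datum_dict, task_data_spec, tokenizer, max_seq_length, idx):
    prompt = datum_dict["problem"]              # dataset-specific field (assumed)
    token_ids = tokenizer(prompt)[:max_seq_length]
    return {
        "message_log": [
            {"role": "user", "content": prompt, "token_ids": token_ids}
        ],
        "length": len(token_ids),
        "extra_env_info": {"expected_answer": datum_dict.get("answer")},
        "idx": idx,
    }

# A toy whitespace "tokenizer" stands in for a real HF tokenizer here.
spec = my_data_processor(
    {"problem": "1 + 1 = ?", "answer": "2"},
    task_data_spec=None, tokenizer=str.split, max_seq_length=32, idx=0,
)
```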

#### Putting it all together:
#### Putting it all together

GRPO expects datasets to have the following form:

```json
{"task_name": "math", <actual data>}
{"task_name": "math", /* actual data */}
```

Then, you can set data up as such:

```python
base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"]
tokenizer = AutoTokenizer.from_pretrained(policy_config["model_name"])
@@ -81,15 +81,17 @@ dataset = AllTaskProcessedDataset(
    max_seq_length=data_config["max_input_seq_length"],
)
```
Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples.

Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples.
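In miniature, that dispatch works like the sketch below (processor names and return values are hypothetical, not the project's actual spec objects):

```python
# Hypothetical task-to-processor mapping and dispatch.
def math_processor(datum_dict, idx):
    return {"task_name": "math", "idx": idx}

def code_processor(datum_dict, idx):
    return {"task_name": "code", "idx": idx}

task_data_processors = {
    "math": math_processor,
    "code": code_processor,
}

# Each sample is routed to its task's processor by the "task_name" field.
sample = {"task_name": "math", "problem": "2+2"}
spec = task_data_processors[sample["task_name"]](sample, idx=0)
```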

### Policy Model
We define a [PolicyInterface]() that contains everything you need to train a Policy model.

We define a {py:class}`PolicyInterface <nemo_reinforcer.models.interfaces>` that contains everything you need to train a Policy model.

This Policy object holds a [RayWorkerGroup](../../nemo_reinforcer/distributed/worker_groups.py) of SPMD (1 proc/gpu) processes that run HF/MCore, all coordinated by this object so it appears to you like 1 GPU!

### Fast Generation

We support vLLM through the [VllmGeneration](../../nemo_reinforcer/models/generation/vllm.py) class right now.

The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
1 change: 1 addition & 0 deletions nemo_reinforcer/utils/logger.py
@@ -589,6 +589,7 @@ def flatten_dict(d: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:

Examples:
```{doctest}
>>> from nemo_reinforcer.utils.logger import flatten_dict
>>> flatten_dict({"a": 1, "b": {"c": 2}})
{'a': 1, 'b.c': 2}

Expand Down