diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
new file mode 100644
index 0000000000..b29e5ab0fe
--- /dev/null
+++ b/.gitlab-ci.yml
@@ -0,0 +1,37 @@
+default:
+  image: ${DOC_BUILD_IMAGE}
+  tags:
+    - os/linux
+    - type/docker
+
+stages:
+  - build
+  - deploy
+
+.sphinx-build: &sphinx-build
+  - cd docs
+  - uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html
+
+build-docs:
+  stage: build
+  script:
+    - *sphinx-build
+  artifacts:
+    name: ${CI_PROJECT_NAME}-${CI_COMMIT_SHORT_SHA}
+    paths:
+      - docs/_build
+
+pages:
+  stage: deploy
+  needs: ["build-docs"]
+  script:
+    - echo "Publishing HTML to GitLab Pages"
+    - mkdir -p public
+    - cp -r docs/_build/html/* public/
+  artifacts:
+    paths:
+      - public
+  rules:
+    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
+    - if: $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == $CI_DEFAULT_BRANCH
+  environment: main
diff --git a/README.md b/README.md
index 044c9cd954..a172fcfd14 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ What you can expect:
 
 ## Features
 
-_✅ Available now | 🔜 Coming in v0.2_
+✅ _Available now_ | 🔜 _Coming in v0.2_
 
 - ✅ **Fast Generation** - vLLM backend for optimized inference
 - ✅ **HuggingFace Integration** - Works with 1-8B models (Qwen1.5, Llama)
@@ -51,7 +51,7 @@ uv pip install -e .[vllm]
 # Install NeMo-Reinforcer with dev/test dependencies
 uv pip install -e '.[dev,test]'
 
-# Use uv run to launch any runs. 
+# Use uv run to launch any runs.
 # Note that it is recommended to not activate the venv and instead use `uv run` since
 # it ensures consistent environment usage across different shells and sessions.
 # Example: uv run python examples/run_grpo_math.py
@@ -85,13 +85,14 @@ uv run python examples/run_sft.py \
     cluster.gpus_per_node=8
 ```
 
-Refer to [sft.yaml](examples/configs/sft.yaml) for a full list of parameters that can be overridden.
+Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.
 
 #### Multi-node
 
 For distributed training across multiple nodes:
 
 Set `UV_CACHE_DIR` to a directory that can be read from all workers before running any uv run command.
+
 ```sh
 export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
 ```
diff --git a/docs/conf.py b/docs/conf.py
index e800a2595d..c9f61d4faf 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -53,7 +53,7 @@
     "fieldlist",  # Enables field lists for metadata like :author: Name
     "tasklist",  # Adds support for GitHub-style task lists with [ ] and [x]
 ]
-myst_heading_anchors = 3  # Generates anchor links for headings up to level 3
+myst_heading_anchors = 4  # Generates anchor links for headings up to level 4
 
 # -- Options for Autodoc2 ---------------------------------------------------
 sys.path.insert(0, os.path.abspath(".."))
@@ -67,7 +67,7 @@
 # render google style docstrings.
 # Related Issue: https://github.com/sphinx-extensions2/sphinx-autodoc2/issues/33
 autodoc2_docstring_parser_regexes = [
-    (r".*", "autodoc2_docstrings_parser"),
+    (r".*", "docs.autodoc2_docstrings_parser"),
 ]
 
 # -- Options for HTML output -------------------------------------------------
diff --git a/docs/design_docs/generation.md b/docs/design_docs/generation.md
index 84f450c7cc..b519b2d249 100644
--- a/docs/design_docs/generation.md
+++ b/docs/design_docs/generation.md
@@ -15,7 +15,7 @@ The core of the generation system is defined in `interfaces.py`, which establish
     backend: str  # The backend to use (e.g., "vllm", "hf")
     max_new_tokens: int  # Maximum number of tokens to generate
     temperature: float  # Sampling temperature
-    top_p: float  # Top-p sampling parameter 
+    top_p: float  # Top-p sampling parameter
     top_k: int  # Top-k sampling parameter
     model_name: str  # Name or path of the model
 ```
@@ -138,7 +138,7 @@ generator.finish_generation()
 
 To add a new generation backend:
 
 1. Create a new class that implements {py:class}`GenerationInterface `
-2. Implement the required methods: {py:method}`generate `, {py:method}`prepare_for_generation `, and {py:method}`finish_generation `
+2. Implement the required methods: {py:meth}`generate `, {py:meth}`prepare_for_generation `, and {py:meth}`finish_generation `
 3. Ensure your implementation works with the standard {py:class}`GenerationConfig ` and {py:class}`GenerationDatumSpec ` structures
 4. Register your backend with the system (if needed) to make it accessible
diff --git a/docs/design_docs/gpu_logger.md b/docs/design_docs/gpu_logger.md
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/docs/docker.md b/docs/docker.md
index 37548ff282..5ea3581ea3 100644
--- a/docs/docker.md
+++ b/docs/docker.md
@@ -1,7 +1,9 @@
 # Building Docker Images
 
-### Base Image
+## Base Image
+
 If you only need the base image with ray + uv, you can build it like so:
+
 ```sh
 cd docker/
 docker buildx build --target base -t reinforcer -f Dockerfile ..
@@ -9,8 +11,10 @@ docker buildx build --target base -t reinforcer -f Dockerfile ..
 
 This is **our recommendation** as it is a small image and allows you to specify your python dependencies at runtime.
 
-### Hermetic Image
-The docker image build without a target stage will include all of the default dependencies to get started.
+## Hermetic Image
+
+The Docker image built without a target stage includes all of the default dependencies to get started.
+
 ```sh
 cd docker/
 docker buildx build -t reinforcer -f Dockerfile ..
diff --git a/docs/guides/grpo.md b/docs/guides/grpo.md
index 6ace84876d..b84cbf9f0c 100644
--- a/docs/guides/grpo.md
+++ b/docs/guides/grpo.md
@@ -5,33 +5,40 @@
 
 If you want to get running quickly, the script [examples/run_grpo_math.py](../../examples/run_grpo_math.py) has an example implementation of using GRPO to train a model on math problems.
 This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).
 
 We recommend launching the job using `uv`:
+
 ```bash
 uv run examples/run_grpo_math.py --config {overrides}
 ```
-If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo.yaml)
+
+If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](../../examples/configs/grpo_math_1B.yaml)
 
 **Reminder**: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
 
 ## Now, for the details:
 
-In this guide, we'll walk through we handle
+In this guide, we'll walk through how we handle:
+
 * Data
 * Model training
 * Fast generation
 * Overall Resource Flow
 
 ### Data
-We support training with multiple RL "Environments" at the same time. An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that accepts a state/action history and returns an update state and rewards for the step. They run as Ray Remote Actors. Example [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py).
+
+We support training with multiple RL "Environments" at the same time. An [Environment](../../nemo_reinforcer/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. Environments run as Ray Remote Actors; see [MathEnvironment](../../nemo_reinforcer/environments/math_environment.py) for an example.
 
 To support this, we need to know:
+
 * What environments you have
 * Which data should go to which environments
 * How to prepare the data from your dataset into a form we can use
 
 #### Common Data Format
+
 We define a [DatumSpec](../../nemo_reinforcer/data/interfaces.py) that holds all relevant information for each training example:
+
 ```python
 class DatumSpec(TypedDict):
     message_log: LLMMessageLogType
@@ -44,7 +51,8 @@ class DatumSpec(TypedDict):
 ```
 
 #### Data Processors
-
+We call each distinct "environment your model wants to optimize against" a "task". So you might define a "math" task or a "code" task.
+
 For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](../../nemo_reinforcer/data/interfaces.py)
 
 ```python
@@ -56,14 +64,19 @@ def my_data_processor(
     idx: int,
 ) -> DatumSpec:
 ```
+
 We have an example of this as `math_data_processor` in [run_grpo_math.py](../../examples/run_grpo_math.py)
 
-#### Putting it all together:
+#### Putting it all together
+
 GRPO expects datasets to have the following form:
+
 ```json
-{"task_name": "math", }
+{"task_name": "math", /* actual data */}
 ```
+
 Then, you can set data up as such:
+
 ```python
 base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"]
 tokenizer = AutoTokenizer.from_pretrained(policy_config["model_name"])
@@ -81,15 +94,17 @@ dataset = AllTaskProcessedDataset(
     max_seq_length=data_config["max_input_seq_length"],
 )
 ```
-Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples. 
+Notice that you provide a mapping of tasks to their processors so the dataset knows what to use when processing samples.
 
 ### Policy Model
-We define a [PolicyInterface]() that contains everything you need to train a Policy model.
+
+We define a {py:class}`PolicyInterface ` that contains everything you need to train a Policy model.
 This Policy object holds a [RayWorkerGroup](../../nemo_reinforcer/distributed/worker_groups.py) of SPMD (1 proc/gpu) processes that run HF/MCore, all coordinated by this object so it appears to you like 1 GPU!
 
 ### Fast Generation
+
 We support vLLM through the [VllmGeneration](../../nemo_reinforcer/models/generation/vllm.py) class right now.
 
-The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
\ No newline at end of file
+The function [grpo_train](../../nemo_reinforcer/algorithms/grpo.py) contains the core GRPO training loop.
diff --git a/nemo_reinforcer/utils/logger.py b/nemo_reinforcer/utils/logger.py
index bc0157d564..3564ae6f0f 100644
--- a/nemo_reinforcer/utils/logger.py
+++ b/nemo_reinforcer/utils/logger.py
@@ -589,6 +589,7 @@ def flatten_dict(d: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
 
     Examples:
     ```{doctest}
+    >>> from nemo_reinforcer.utils.logger import flatten_dict
     >>> flatten_dict({"a": 1, "b": {"c": 2}})
     {'a': 1, 'b.c': 2}
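The final hunk above adds an import line to the `flatten_dict` doctest. For reviewers reading this patch without the source checked out, the documented behavior can be sketched as follows — a minimal reimplementation for illustration only, not the actual `nemo_reinforcer.utils.logger` code:

```python
from typing import Any, Dict


def flatten_dict(d: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts by joining nested keys with `sep` (illustrative sketch)."""
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        if isinstance(value, dict):
            # Recurse into nested dicts and prefix child keys with the parent key.
            for child_key, child_value in flatten_dict(value, sep=sep).items():
                flat[f"{key}{sep}{child_key}"] = child_value
        else:
            flat[key] = value
    return flat


print(flatten_dict({"a": 1, "b": {"c": 2}}))  # matches the doctest: {'a': 1, 'b.c': 2}
```

Mirroring the doctest, `flatten_dict({"a": 1, "b": {"c": 2}})` yields `{'a': 1, 'b.c': 2}`; nested keys are joined with the separator at every level.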