Conversation
rybakov left a comment:
Should we also add a config, e.g. RL/examples/configs/grpo_math_8B_fp8_L3_F1_G_i.yaml?
For example, the config below could be a good candidate (optionally with num_last_layers_in_bf16: 0 and num_first_layers_in_bf16: 0):
```yaml
# GRPO Algorithm Configuration
defaults: "grpo_math_1B.yaml"

grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32

loss_fn:
  use_importance_sampling_correction: true

policy:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  tokenizer:
    name: ${policy.model_name} # specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 512
  train_micro_batch_size: 1
  generation_batch_size: 32 # only used when generating with the HF backend
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false

  dtensor_cfg:
    enabled: true

  dynamic_batching:
    train_mb_tokens: 4096
    logprob_mb_tokens: 8192

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 3.0e-7
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8

  scheduler:
    - name: "torch.optim.lr_scheduler.LinearLR"
      kwargs:
        start_factor: 0.1
        end_factor: 1.0
        # The scheduler iterates once per GRPO step and is decoupled from the
        # optimizer step (there may be >=1 optimizer steps per GRPO step)
        total_iters: 13
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: [13]

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      precision: 'fp8'
      use_deep_gemm: true
      num_last_layers_in_bf16: 3
      num_first_layers_in_bf16: 1
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6
      max_model_len: ${policy.max_total_sequence_length}

cluster:
  gpus_per_node: 8
  num_nodes: 1
```
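For context, a hedged sketch of how such a config might be launched. The entry-point script, config filename, and dotted-key override syntax below are assumptions for illustration, not taken from this PR:

```bash
# Hypothetical launch command; the script and config paths are assumptions.
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_8B_fp8.yaml

# The all-FP8 variant mentioned above (no bf16 fallback layers), assuming
# dotted-key command-line overrides are supported:
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_8B_fp8.yaml \
    policy.generation.vllm_cfg.num_last_layers_in_bf16=0 \
    policy.generation.vllm_cfg.num_first_layers_in_bf16=0
```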
SahilJain314 left a comment:
Not super necessary immediately, but I think it'd be nice to include convergence plots in the repo as proof.
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```
Before your PR is "Ready for review"
Pre checks:
Additional Information