fix: memory optimizations for Nemotron12B 12k seqlen DPO training (#926)
Conversation
Signed-off-by: Yubo Gao <yubog@nvidia.com>
terrykong
left a comment
Thanks for improving the performance, @ybgao-nvidia!
Wait ... where is it disabled by default?
wangshangsam
left a comment
Some small nits, but otherwise LGTM.
Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca> Signed-off-by: Yubo Gao <yubog@nvidia.com>
wangshangsam
left a comment
Thanks @ybgao-nvidia ! @pjin-nvidia @bxyu-nvidia FYI
Corresponding fix in Automodel: NVIDIA-NeMo/Automodel#391
…IDIA-NeMo#926) Signed-off-by: Yubo Gao <yubog@nvidia.com> Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
What does this PR do?
Memory optimizations
This PR applies memory optimizations that allow single-node (8xH100) training of the Nemotron 12B model with a sequence length of 12288.
We need the following optimizations to make the 12K context work:
The additional checkpointed layers provide a significant decrease in peak memory usage with minimal performance impact. However, enabling a smaller `max_split_size` in the allocator does increase the step time slightly. The collated performance results are below: (66.56), (73.24) [remaining values from the results table].
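For context, `max_split_size_mb` is one of several options passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable, whose value is a comma-separated list of `option:value` pairs. The parser below is a minimal illustrative sketch of that format, not PyTorch's actual implementation:

```python
def parse_alloc_conf(conf: str) -> dict:
    """Parse a PYTORCH_CUDA_ALLOC_CONF-style string into a dict.

    Example input: "max_split_size_mb:64,expandable_segments:True"
    """
    options = {}
    for item in conf.split(","):
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()
    return options

print(parse_alloc_conf("max_split_size_mb:64"))
# {'max_split_size_mb': '64'}
```

This is only meant to show the option string's shape; the real allocator additionally validates option names and value types.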
Removal of `configure_expandable_segments`
Furthermore, the current implementation of `configure_expandable_segments` does not actually perform its intended function. It reads `torch.cuda.get_device_properties(0).major`, which initializes `torch`, including the memory allocator; the subsequent assignment to the environment variable therefore does not affect the allocator. Instead, the `torch.cuda.memory._set_allocator_settings` function should be used. However, setting expandable segments has minimal effect on peak memory usage while causing a large performance overhead (from 20s to 80s per training iteration).
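The pitfall can be illustrated without CUDA: any setting that is snapshotted once at initialization ignores later changes to the environment. The sketch below uses a hypothetical `Allocator` class standing in for PyTorch's caching allocator; the names are illustrative only:

```python
import os

# Start from a clean environment so the demo is deterministic.
os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)

class Allocator:
    """Stand-in for the CUDA caching allocator: it reads its
    configuration from the environment exactly once, at init time."""
    def __init__(self):
        self.conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "default")

# Querying device properties initializes torch's allocator; here the
# analogous step is simply constructing the stand-in object.
allocator = Allocator()

# Assigning the env var *afterwards* does not reach the live allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
print(allocator.conf)  # still "default"
```

This is why the env var must be set before `torch` touches the GPU, or the setting must be applied through an in-process API (in real PyTorch, `torch.cuda.memory._set_allocator_settings`) rather than the environment.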
We have deleted the function and the related invocations and tests to keep the runtime behaviour consistent. Should the need arise to set expandable segments, the user should instead do so via `env_vars` in the recipe configuration.
Minor fixes for config schema
Some tweaks were made so that config validation passes:
Make the `tensorboard` field of the logger optional.
Issues
This PR resolves #848.
Usage
It is recommended to run DPO training with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64` to reduce allocator fragmentation.
Before your PR is "Ready for review"
Pre checks:
Additional Information